In my previous blogs, I have discussed how big data provides significant advantages for the public sector – from helping agencies fight crime to providing event-driven operational intelligence. This week, I would like to discuss some big data technologies that are seeing fast adoption in the marketplace.
Traditional data warehouses integrate and manage clean, structured relational data from diverse sources and mine it for actionable intelligence using business intelligence (BI) tools. Challenges arise, however, when agencies have to process unstructured data such as full-motion video, email, voice, social network feeds, output from sensor-enabled facilities, web content and biometrics. By far the technology getting the most market attention for analytics on unstructured big data at rest is Hadoop.
Hadoop's primary use is processing massive amounts of data in a scalable manner: it lets agencies organize and analyze large data sets while keeping the data on the storage cluster where it resides. Typical enterprise use cases include business analytics, extract/transform/load (ETL) and data warehousing, log analysis and web search engines. Most early adopters rely on the native Hadoop Distributed File System (HDFS) configured with direct-attached storage (DAS), using Hadoop itself as the scalable processing engine for their big data applications.
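To make the programming model concrete, the canonical "word count" job below sketches how Hadoop splits work into a map phase that runs in parallel against HDFS blocks and a reduce phase that aggregates the results. This is the standard introductory example from the Apache Hadoop documentation, not a production pattern; the input and output paths are hypothetical HDFS directories supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel across HDFS blocks, emitting a (word, 1)
  // pair for every token it sees.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: receives every count emitted for a given word and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The combiner is the detail worth noting: by pre-summing counts on each node before the shuffle, Hadoop moves far less data across the network, which is exactly the "process the data where it lives" principle described above.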
As with any “hot” technology, well-known vendors are getting into the act and partnering with other companies to fill gaps in their portfolios. In addition, numerous companies are adding management and support capabilities as add-ons to open source distributions. NetApp recently debuted its NetApp Open Solution for Hadoop (NOSH) rack, an integrated big data solution that brings together networking, compute and storage resources in the same box; built on NetApp’s E-Series storage and ONTAP software, it is optimized for Hadoop deployments and data-driven workloads. Oracle is also jumping on the Hadoop bandwagon: along with its big data appliance, Oracle bundles an open source distribution of Apache Hadoop, including HDFS and other components that help agencies analyze unfiltered data.
Amazon has the only established cloud-based Hadoop MapReduce implementation with its Elastic MapReduce service, which reads its input from and writes its results to Amazon S3. Other providers, such as Google, Oracle, IBM and Microsoft, also have cloud-linked big data initiatives, though these are still in their infancy or in prerelease stages. Microsoft, for example, has a Hadoop distribution, currently available as a community technology preview, that integrates with both Windows Server (on premises) and Windows Azure (in the cloud). Such an endorsement of an open source project like Hadoop is a sign of the increasing maturity of these implementations and of growing confidence among developers.
When it comes to big data, Hadoop is considered one of the best technologies for handling raw, unstructured data: it provides a reliable, fast and scalable storage and processing system that lets agencies run complex parallel queries across many machines with a great deal of flexibility. Agencies using Hadoop can also leverage Hive, which layers a basic SQL-like query interface on top of the cluster, and Pig, a high-level language for defining data transformations.
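As a sketch of what the Hive interface looks like in practice, the snippet below submits a SQL-style aggregation through Hive's JDBC driver, which compiles the query into MapReduce jobs on the cluster. The host, credentials and the incident_reports table are placeholders invented for illustration, and the example assumes a HiveServer2 endpoint with the Hive JDBC jar on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Register the HiveServer2 JDBC driver (requires the hive-jdbc jar).
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Placeholder endpoint; substitute your cluster's HiveServer2 host and port.
    String url = "jdbc:hive2://localhost:10000/default";

    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement()) {

      // "incident_reports" is a hypothetical table used only for illustration.
      // Hive translates this query into MapReduce jobs behind the scenes.
      ResultSet rs = stmt.executeQuery(
          "SELECT agency, COUNT(*) AS reports "
          + "FROM incident_reports GROUP BY agency");

      while (rs.next()) {
        System.out.println(rs.getString("agency") + "\t" + rs.getLong("reports"));
      }
    }
  }
}
```

The same aggregation could be written in Pig Latin in a few lines (a LOAD, a GROUP and a FOREACH ... GENERATE COUNT), which is often the more natural fit when the logic is a pipeline of transformations rather than an ad hoc query.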
Hadoop adoption is gaining traction in the public sector, with numerous clusters in production at the DoD, the Department of Energy and several law enforcement agencies. GSA’s Hadoop-based USAsearch.gov solution serves millions of citizens, powering search across more than 550 government websites faster than ever before. Are you ready to leverage the power of big data in your agency? The time has come to start thinking about big data and how it can help your agency support mission outcomes. Let me know your thoughts, and follow me on Twitter at @GTSI_Architect.