Big data has big implications for knowledge management
Big data has a number of other attributes besides scalability that are sure to augment its appeal. "Big data is self-healing," Dowlaty explains. "If a node fails, it will automatically activate a segment of data that has been backed up on another node." It can also handle search using a different paradigm from that used in most search engines. "Sometimes there is not time to index incoming text," says Dowlaty. "In that case, mining key words and providing word clouds or other forms of visualization allow real-time queries on the fly." The same techniques can be used to filter information to separate the "noise from the news." No doubt, big data is a technology to watch as a powerful enabler for knowledge management.
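To make the unindexed-search idea concrete, here is a minimal sketch of keyword mining: it counts word frequencies in incoming text as it arrives, which is the raw material for a word cloud. The function name `top_keywords` and the small stop-word list are assumptions made for illustration; they do not come from Dowlaty or any particular product.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "and", "as", "of", "to", "in", "is", "for", "on"}


def top_keywords(texts: Iterable[str], n: int = 5) -> List[Tuple[str, int]]:
    """Return the n most frequent non-stop words across the incoming texts."""
    counts: Counter = Counter()
    for text in texts:
        # Lowercase and split on anything that is not a letter or apostrophe.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


if __name__ == "__main__":
    feed = ["Oil prices rise", "Oil supply falls as prices rise"]
    for word, count in top_keywords(feed, 3):
        print(word, count)
```

No index is built at any point: each document is scanned once and only the running counts are kept, which is why this style of query can keep up with a live feed.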
Big data buzzwords
Apache Hadoop—an open source distributed computing platform available from Apache (apache.org), which includes the Hadoop Distributed File System (HDFS) and an implementation of MapReduce.
Apache HBase—open source distributed storage for big data (e.g., over a billion rows and over a million columns) on clusters of commodity hardware.
Apache Hive—a data warehouse solution from Apache that provides data extraction/transformation/load (ETL), access to files and query execution.
Cassandra—an open source distributed database originally developed at Facebook (facebook.com) and now an Apache project. DataStax offers a commercial distribution that integrates it with Hadoop to provide an analytic platform for big data.
Cloudera—a contributor to the Hadoop project that offers a commercial distribution bundling the source code and other features in one package.
Greenplum—a division of EMC that provides an analytic platform, a data computing appliance, a database and other products for big data analysis.
Hadoop Distributed File System (HDFS)—the file system used by Hadoop for distributed data storage.
Hortonworks—a commercialized open source platform based on Hadoop for storing, processing and analyzing big data.
MapReduce—technology developed by Google (google.com) and used by Hadoop for its parallel data processing. It is the core technology behind the big data engines. In the "map" step, input data is distributed to multiple nodes for computation, and in the "reduce" step, the results are collected to produce the answer to the initial question.
MongoDB—a high-performance, open source, NoSQL database written in C++.
NoSQL—a group of non-relational, distributed, scalable databases designed for Web-scale use, many of them open source. Over 100 such products are listed on nosql-database.org, including Apache HBase, Cassandra, Amazon SimpleDB and MongoDB.
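The MapReduce entry above describes a "map" step and a "reduce" step. The sketch below shows that pattern on a single machine with a word count, the customary first example. It illustrates the idea only and is not Hadoop's API; the function names (`map_step`, `shuffle`, `reduce_step`, `word_count`) are assumptions chosen for readability, and the "nodes" are simulated with an ordinary loop.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_step(chunk: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the emitted values by key so each key can be reduced on its own."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_step(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values collected for one key into a single total."""
    return key, sum(values)


def word_count(chunks: List[str]) -> Dict[str, int]:
    # On a real cluster each chunk would be mapped on a different node;
    # here the chunks are simply processed one after another.
    emitted = [pair for chunk in chunks for pair in map_step(chunk)]
    return dict(reduce_step(k, v) for k, v in shuffle(emitted).items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data is everywhere"]))
    # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Because each chunk is mapped independently and each key is reduced independently, both steps can be spread across as many machines as are available, which is where the scalability comes from.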
Big content
Content providers are also tuning in to big data. "We are having conversations with our clients about bringing our archives into their big data initiatives to combine structured and unstructured data," says David Chivers, VP of Factiva. A product of Dow Jones, Factiva publishes business news and information on a wide variety of topics ranging from the energy industry to logistics management.
With more than 1 billion articles from 35,000 sources available from Factiva's information repositories, clients can combine an information feed with internal structured data to help determine causal relationships between news events and financial outcomes or other metrics. "Our new delivery capabilities, called Total Access and introduced in 2011, include integration solutions designed to take in information as one set and present indicators on a dashboard," says Chivers. "One client is using it as an early notification system to identify risks and opportunities in sectors they are watching."