What is Big Data?
To begin to grapple with the concept of Big Data, you have to start thinking in zettabytes. A zettabyte is a unit of digital information equivalent in size to a billion terabytes; a terabyte, in turn, is the equivalent of a billion kilobytes. To amass a zettabyte of data, you'd need to compile approximately half of all the information transmitted by broadcast technology (television, radio, and even GPS) to all of humankind during an entire calendar year, according to a study by the University of Southern California. Or, according to the linguist Mark Liberman of the University of Pennsylvania, if you recorded all human speech that had ever been spoken, the audio file would require a modest 42 zettabytes of storage space. That, in short, is a lot of data.
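To put those unit relationships in perspective, here is a quick back-of-the-envelope check in Python of the conversions cited above, using the decimal (base-10) definitions of the units; it is a minimal sketch for illustration only.

# Back-of-the-envelope check of the storage units mentioned above,
# using decimal (base-10) definitions: 1 KB = 10^3 bytes,
# 1 TB = 10^12 bytes, 1 ZB = 10^21 bytes.
KILOBYTE = 10**3
TERABYTE = 10**12
ZETTABYTE = 10**21

# A zettabyte is a billion terabytes...
print(ZETTABYTE // TERABYTE)   # 1000000000
# ...and a terabyte is a billion kilobytes.
print(TERABYTE // KILOBYTE)    # 1000000000

# Liberman's estimate: all human speech ever spoken is roughly 42 ZB.
all_speech_bytes = 42 * ZETTABYTE
print(f"{all_speech_bytes:.2e} bytes")  # 4.20e+22 bytes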
But the sheer volume of data that our massive information technology infrastructure is capable of collecting presents a growing concern for all manner of organizations. The more data you collect, the more unwieldy it becomes, and the more challenging it may be to learn anything useful from it. Hence, Big Data, and the challenges of capturing, storing, processing, analyzing, and leveraging it, have become significant concerns among technology and information companies. There are plenty of strategies, tools, and experts out there, but making sense of it all can be as daunting as trying to comprehend the size of a zettabyte.
Big Data: The How and the Why
Big Data, according to one definition put forth by the IT research and consulting firm Gartner, is "a popular term used to acknowledge the exponential growth, availability and use of information in the data-rich landscape of tomorrow." This definition contains within it many of the keys to thinking about Big Data.
The first important element to understand about Big Data is its rate of growth. Calling the growth of data "exponential" is by no means hyperbole. In the early days of computing, data storage was typically measured in megabytes. As the ability of technological tools to generate, process, and store data has skyrocketed, the cost of doing so has plummeted. According to research done by NPR, a gigabyte's worth of data storage in 1980 cost $210,000; today, it costs 15 cents.
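As a rough, back-of-the-envelope illustration of that price collapse, the NPR figures quoted above work out as follows (a minimal Python sketch):

# Rough illustration of the fall in storage cost per gigabyte,
# using the NPR figures cited above.
cost_per_gb_1980 = 210_000.00   # dollars
cost_per_gb_today = 0.15        # dollars

drop_factor = cost_per_gb_1980 / cost_per_gb_today
print(f"Cost per gigabyte has fallen by a factor of {drop_factor:,.0f}")
# Cost per gigabyte has fallen by a factor of 1,400,000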
The availability of data is the other major factor that has contributed to the rise of massive data sets. The technological revolution of the 20th century moved an immensely broad spectrum of human behavior into the digital realm: shopping, healthcare, weather prediction, transportation, communication, and even our social lives. All of these activities, and many more besides, were once aspects of life that could only be measured, cataloged, and studied through a cumbersome paper trail, if there was any trail to follow at all. Today, every one of these activities is conducted largely, if not entirely, in digital form, making it vastly easier to record and analyze the data associated with them. Facebook, Groupon, our doctor's office, our cable company, and our cell phone: these are all rich sources of information about how we each live our lives, and how the world works as a whole. So there is a lot of data out there, and it's waiting for someone, anyone, with the right tools to analyze it. The Big Data challenge at hand is twofold: how to access it, and how to analyze it.
Finding a Standard Framework for Big Data Management
Content managers have long grappled with the challenge of managing Big Data once they've captured it, and that challenge has led developers to create entirely new distributed computing frameworks to meet today's data management needs.
One emerging industry standard for managing Big Data is Apache Hadoop, a project spearheaded by open source advocate Doug Cutting. Hadoop allows for the distributed processing of large data sets across clusters of computers using a simple programming model. According to The Apache Software Foundation, Hadoop, which was named after Cutting's son's toy elephant, "is designed to scale up from single servers to thousands of machines, each offering local computation and storage."
The Hadoop Wiki on the Apache website distills the framework's function as follows: "Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster."
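To make the Map/Reduce idea concrete, here is a minimal, single-machine sketch of the paradigm expressed as a word-count job in Python. It illustrates the programming model only and is not Hadoop's actual Java API; on a real cluster, the framework would distribute these fragments of work across nodes and read the input from HDFS.

from collections import defaultdict

# A minimal, single-machine sketch of the Map/Reduce paradigm described
# above: counting words across a set of documents. This illustrates the
# programming model only; it is not Hadoop's actual API. On a real cluster,
# many map and reduce tasks would run in parallel on different nodes.

def map_phase(document):
    # Map: break one fragment of input into (key, value) pairs.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(mapped_pairs):
    # Reduce: combine all values that share the same key.
    counts = defaultdict(int)
    for word, count in mapped_pairs:
        counts[word] += count
    return dict(counts)

if __name__ == "__main__":
    documents = [
        "big data keeps getting bigger",
        "big data needs big tools",
    ]
    # Each document is a small "fragment of work" that could be executed,
    # or re-executed, on any node in the cluster.
    mapped = (pair for doc in documents for pair in map_phase(doc))
    print(reduce_phase(mapped))
    # {'big': 3, 'data': 2, 'keeps': 1, 'getting': 1, 'bigger': 1, 'needs': 1, 'tools': 1}

The appeal of the model is that each fragment is independent, so the same simple map and reduce functions scale from one machine to thousands without changing the application logic.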
Gartner's search analysis showed that "Hadoop" was one of the most searched terms on the firm's website in 2011, reflecting its rise as an important computational framework for managing massive data sets. A survey by Ventana Research suggested that more than half of large-scale data users are considering integrating Hadoop-based solutions into their information environments.
You've Captured Big Data: Now What?
One of the richest and most ubiquitous sources of Big Data is the internet itself. Much of the information that is constantly being uploaded, streamed, and otherwise fed onto the web is rich in value to all sorts of organizations, particularly as that data relates to consumer and economic trends and behaviors. There is no shortage of companies touting solutions for searching, collecting, and analyzing Big Data from the web.
Connotate has been in the web analytics business since 2000, but it has lately sharpened its focus on technology for extracting efficient, meaningful results from Big Data. Shortly after naming Keith Cooper its new CEO in late February 2012, the company announced that it had acquired Fetch Technologies, a competitor in the realm of real-time internet data analysis.
The goal of the merger, according to Cooper, is nothing short of being able to extract "any data from any site any time," a Big (Data) ambition. The emphasis on real-time web analytics is a function of the realities of doing business: Information changes quickly, and decision making, in turn, must change with it. The right tool, says Cooper, can scan "every news website once an hour and determine whether there's new news, and bring back the precise data you're looking for," whether it's mortgage prices, changes to a company's executive staff, or anything else you'd care to keep tabs on that might affect your organization's decision making.
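As a purely illustrative sketch of the kind of scheduled monitoring Cooper describes, the following Python snippet fetches a page once an hour, detects whether its content has changed, and flags lines containing terms of interest. The URL, keywords, and interval are placeholder assumptions, and this is in no way Connotate's actual technology.

import hashlib
import time
import urllib.request

# Toy sketch of scheduled web monitoring: fetch a page on a schedule,
# detect whether its content has changed, and flag lines that mention
# terms of interest. URL, keywords, and interval are hypothetical.
URL = "https://example.com/news"           # hypothetical news page
KEYWORDS = ("mortgage", "executive")       # hypothetical terms to watch
CHECK_INTERVAL_SECONDS = 60 * 60           # "once an hour"

def fetch(url):
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")

def run():
    last_fingerprint = None
    while True:
        page = fetch(URL)
        fingerprint = hashlib.sha256(page.encode("utf-8")).hexdigest()
        if fingerprint != last_fingerprint:   # content changed: "new news"
            last_fingerprint = fingerprint
            for line in page.splitlines():
                if any(keyword in line.lower() for keyword in KEYWORDS):
                    print(line.strip())        # the data we care about
        time.sleep(CHECK_INTERVAL_SECONDS)

if __name__ == "__main__":
    run()

Production systems of the kind Cooper describes go far beyond this, of course, parsing page structure and extracting precise fields rather than matching keywords, but the basic loop of fetch, compare, and extract is the same.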
As is the case with much of information technology, the future of Big Data analysis seems to be headed into the cloud. Greg Merkle, the vice president and creative director of Dow Jones, pointed to the SaaS (software-as-a-service) model of solutions as an emerging trend in Big Data. "It's an agile way to get the results you're looking for," Merkle says, "and it's 100% in the cloud." Companies such as Quantivo Corp. are developing self-service SaaS Big Data analysis tools.
Big Data has emerged in recent years as a catch-all term for the challenges of organizing and analyzing digital information. But behind the hype, there's the reality that data really is getting ever bigger and ever more complex. The tools that organizations develop and employ to manage that data will need to continue to function on larger scales, at faster speeds, and with greater sophistication. Every organization stands to gain in one way or another from the proliferation of information that is created and collected every day.
This article was adapted from an article by Michael J. LoPresti that was published in Intranets, a newsletter published by Information Today, Inc., Medford, NJ. For more information, visit www.infotoday.com and click on "Newsletters."