BIG DATA interviews with the experts
Q Lamont: What does IDC expect from big data in terms of growth of the market?
A Vesset: We expect the market to reach $16.9 billion by 2015, up from $3.2 billion in 2010. Different segments will grow at different rates—we expect the annual growth in software to be about 40 percent, just under 30 percent for servers, and about 60 percent for storage. There are legitimate and appropriate uses right now across industries, as well as some growth due to hype and the fear of being left behind. But the demand for analytics in general is strong—the traditional data warehouse market grew 18 percent during 2011. Whether through new technologies such as Hadoop or mature technologies such as business intelligence solutions, companies want to use the data they collect in order to support business decisions.
Q Lamont: What is the fundamental difference between Hadoop and relational databases?
A Zedlewski: They operate in different ways and have different applications. Relational databases are structured around a well-defined schema. When data sets are constantly changing, analysis with a relational database is difficult because it was designed to optimize repeated queries. Hadoop breaks information up into blocks, needs no predefined schema and is designed for flexibility and experimentation. It is ideal for finding patterns in data and for dealing with unpredictable data sets.
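To make the contrast concrete, here is a minimal Python sketch of the two approaches, with sqlite3 standing in for a relational database; the table, field names and records are illustrative, not drawn from any real system.

```python
import json
import sqlite3

# Schema-on-write: a relational table accepts only rows that fit
# its predefined structure.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
db.execute("INSERT INTO events VALUES (?, ?)", (42, "login"))
# A record carrying an unexpected field has nowhere to go until the
# schema is migrated.

# Schema-on-read (the Hadoop style): store raw records as-is and
# impose structure only at query time.
raw_records = [
    '{"user_id": 42, "action": "login"}',
    '{"user_id": 7, "action": "click", "referrer": "email"}',  # new field, no migration
]
for line in raw_records:
    event = json.loads(line)
    # Interpret each record at read time; unknown fields are simply there.
    print(event["user_id"], event.get("referrer", "n/a"))
```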
Q Lamont: Is Hadoop easier to use than other technologies for this type of application?
A Zedlewski: That depends on what you are comparing. If you compare the complexity of building a petabyte-scale Hadoop system with building a petabyte-scale RDBMS, Hadoop is the much simpler technology. It's free to download, runs on any kind of hardware or cloud platform and doesn't require a lot of up-front data modeling. On the other hand, if you compare Hadoop to, say, downloading MySQL, then sure, Hadoop is more complex, but the two are not serving comparable functions. For users, it's still the case that there are more tools and interfaces in the traditional database world than there are for Hadoop, and there's a large industry of DBAs who are familiar with them. Also, if you are using Hadoop for advanced analytics, or "data science," you can become supply constrained, because there are only so many people who are both math savvy and big data savvy. Cloudera is working to improve Hadoop in both of these areas.
Q Lamont: Are there limitations to Hadoop versus relational databases in terms of speed?
A Vesset: Today, Hadoop is good for storing large data sets and for performing batch processing on that data. Typically, a small number of data scientists analyze the data held in Hadoop clusters. Sometimes, data pre-processed in Hadoop is migrated to relational analytic databases for further ad hoc analysis. Compared to traditional relational databases, though, Hadoop is not good at letting many people query at the same time and get instant responses. If that is your goal, a data warehouse based on a relational database is a better solution. There is a growing recognition that, at least for now, the two are complementary and that each has a role.
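The hand-off Vesset describes can be sketched in a few lines of Python. The file name "part-00000" follows Hadoop's output-naming convention, sqlite3 stands in for an analytic warehouse, and the records are made up for illustration.

```python
import sqlite3

# A batch job in the Hadoop layer typically reduces raw events to
# aggregates written out as delimited files. A plain file stands in
# for that job's output here.
with open("part-00000", "w") as f:
    f.write("2012-11-01\tlogin\t91042\n")
    f.write("2012-11-01\tpurchase\t1733\n")

# Load the pre-aggregated results into a relational database, where
# many analysts can run fast, concurrent ad hoc SQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE daily_counts (day TEXT, action TEXT, n INTEGER)")
with open("part-00000") as f:
    rows = [line.rstrip("\n").split("\t") for line in f]
db.executemany("INSERT INTO daily_counts VALUES (?, ?, ?)", rows)

# Interactive, repeated queries are the warehouse's strength.
for row in db.execute("SELECT action, n FROM daily_counts WHERE day = '2012-11-01'"):
    print(row)
```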
Q Lamont: How would you compare performance of Hadoop to that of other data processing systems?
A Zedlewski: Hadoop has great scale, performance and elasticity properties when it comes to processing. Processing data might mean aggregating log data, reconciling trades or calculating value at risk. It is not uncommon to see data processing times collapse from hours to minutes when these processes are implemented in Hadoop. That's the power of an architecture based on scale-out commodity hardware. On the other hand, users who are accustomed to BI tools that respond in seconds are a different story. BI tools are designed to answer a finite range of questions for which the queries and answers are already defined, so the response can be much faster. We recently released the public beta version of Cloudera Impala, which is the first real-time query engine for Hadoop. It does not yet support the full feature set of advanced SQL, but it is an effective tool for exploratory analysis.
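The log aggregation Zedlewski mentions is a classic MapReduce job. Below is a minimal Hadoop Streaming-style sketch in Python, run here as a single local script rather than as the separate mapper and reducer scripts a real cluster would shard out; the log format (whitespace-delimited, with the requested URL in the seventh field) is an assumption for illustration.

```python
#!/usr/bin/env python
# Count requests per URL, MapReduce style: the mapper emits a
# (key, 1) pair per log line and the reducer sums counts per key.
# On a cluster, Hadoop runs many copies of each stage in parallel,
# which is where the hours-to-minutes speedup comes from.
import sys
from itertools import groupby

def mapper(lines):
    for line in lines:
        fields = line.split()
        if len(fields) > 6:
            yield fields[6], 1  # key: requested URL (assumed field position)

def reducer(pairs):
    # Hadoop delivers mapper output sorted by key; sorted() simulates
    # that shuffle here so groupby sees each key's pairs together.
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(n for _, n in group)

if __name__ == "__main__":
    for url, count in reducer(mapper(sys.stdin)):
        print("%s\t%d" % (url, count))
```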
Q Lamont: How would you describe the flexibility of Hadoop compared to relational databases or data warehouses?
A Zedlewski: If you ask developers how long it will take them to add a field or dimension to a database to answer a question that does not fit the original schema, you'll typically hear it takes from a few weeks to a few months. You have to pull that new field or dimension off of archive storage, you have to append it to all your historical data, you have to update your data dictionary, you have to update your ETL jobs and you may also have to update parts of your batch reporting infrastructure. With a big data approach, you keep all the data at its original granularity and can layer new information in later, rebuilding views on the fly. Adding new variables and new dimensions is not a problem.
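Because the raw records are kept at full granularity, adding a dimension is just another pass over the data. A short Python sketch of the idea follows; the events and the "device" field are hypothetical examples of a dimension that was never in the original reporting schema.

```python
import json
from collections import Counter

# Raw events retained at full granularity. The "device" field was
# not part of the original reports; some older records lack it.
raw = [
    '{"user": 1, "action": "buy", "device": "phone"}',
    '{"user": 2, "action": "buy"}',
    '{"user": 3, "action": "buy", "device": "tablet"}',
]

# Reporting on the new dimension is one pass over the raw data; there
# is no archive restore, schema migration or ETL rewrite to wait on.
by_device = Counter(json.loads(line).get("device", "unknown") for line in raw)
print(by_device)  # Counter({'phone': 1, 'tablet': 1, 'unknown': 1})
```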
Q Lamont: Can you give an example of a specific current use case?
A Vesset: In Denmark, a wind turbine manufacturer was testing different sites to see if they were feasible locations for a turbine. The company used to spend a year gathering data about wind speeds and other environmental factors. Now, they get huge data sets from the National Weather Service and analyze the data themselves using big data solutions. As a result, they have cut a year off their decision process and have saved money because they no longer need to collect the data themselves. The results are as accurate as the ones they were getting from their previous method.
A Bhambhri: At the University of Ontario Institute of Technology, IBM's big data solutions were used in a project called Data Baby. Premature babies are routinely connected to sensors that collect all kinds of data providing insight into their medical condition. The volume of information is staggering: 1,000 pieces of data per second, which traditionally was aggregated into a handful of readings that doctors and nurses check during their rounds. However, some infants were coming down with infections despite positive health indicators. In this big data application, patterns in the streaming data were correlated with the subsequent development of infections. Based on these patterns, researchers were able to predict the potential onset of an infection 24 hours in advance, allowing preventive treatment. This work is now being applied at other hospitals to predict events such as brain hemorrhaging in stroke patients. When you can process information this quickly, a lot of options open up.