Big data: New options for implementation
“We typically evaluate networks of over 1,000 proteins in order to find the 20 or so that would be most likely to affect the disease,” says Jonny Wray, head of discovery informatics at e-Therapeutics. “We wanted to accommodate a growing staff and to remove the computational bottlenecks from our research process.” Wray had previously used a product called In-Memory Data Fabric from GridGain, and considered it a good fit for e-Therapeutics’ needs.
The GridGain In-Memory Data Fabric is an IT infrastructure layer that resides between applications and data sources. It handles both data access and processing, and because the data is held and processed in memory, throughput is high. GridGain also runs on commodity hardware, keeping costs low despite the high performance.
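To make the idea concrete, the minimal sketch below shows how an application might read and write through such an in-memory layer. It uses the Java API of Apache Ignite, the open-source form of the GridGain fabric described later in this section; the cache name, keys and values are illustrative and are not taken from e-Therapeutics’ platform.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class InMemoryLayerSketch {
    public static void main(String[] args) {
        // Start (or join) an Ignite node; in a cluster, each commodity
        // server runs one of these and the nodes discover one another.
        try (Ignite ignite = Ignition.start()) {
            // A distributed in-memory key-value cache that sits between
            // the application and the underlying data sources.
            IgniteCache<String, Double> scores = ignite.getOrCreateCache("proteinScores");

            // The application reads and writes through the cache; the data
            // stays in RAM across the cluster, so access is fast.
            scores.put("P53", 0.87);  // hypothetical protein identifier and score
            System.out.println("Score for P53: " + scores.get("P53"));
        }
    }
}
```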
GridGain connects the e-Therapeutics network pharmacology platform and the underlying databases used in its analyses, which include both NoSQL and relational databases. “We did not want to spend time creating the infrastructure for distributed computing,” Wray says, “so we chose a product that had solved that problem.” With GridGain in place, e-Therapeutics has increased its analytic speed by a factor of 20; analyses that previously took three weeks can now be completed within a day or two.
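The distributed-computing infrastructure Wray refers to can be sketched with the same API: Ignite’s compute grid fans work out across the cluster so the application does not have to implement its own job distribution. The per-protein scoring function below is a hypothetical placeholder, not e-Therapeutics’ actual network analysis.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.lang.IgniteClosure;

public class DistributedAnalysisSketch {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // Hypothetical list of protein identifiers to analyse.
            List<String> proteins = Arrays.asList("P1", "P2", "P3", "P4");

            // Distribute one task per protein across the cluster; Ignite
            // handles the scheduling, serialization and failover.
            Collection<Double> scores = ignite.compute().apply(
                (IgniteClosure<String, Double>) protein ->
                    // Placeholder for the real per-protein network analysis.
                    Math.abs(protein.hashCode() % 100) / 100.0,
                proteins
            );

            System.out.println("Per-protein scores: " + scores);
        }
    }
}
```

In a real deployment, tasks like these would operate on data already held in the cluster’s memory, which is what removes the computational bottlenecks Wray describes.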
Reliability, speed and cost
Speed was not the only factor Wray considered in selecting GridGain, however. “We wanted to increase the reliability of our system,” Wray explains. “Previously, we did not have redundancy as robust as we do now. With GridGain, we have a system that rarely goes down.” GridGain is transparent to the users, who continue to interact with the same platform but now obtain their results much more quickly.
The key feature of the GridGain In-Memory Data Fabric is its in-memory analytics. “Today’s computers can store terabytes of data in RAM,” says Abe Kleinfeld, CEO of GridGain. “RAM is inexpensive now, and accessing data in RAM is five thousand to a million times faster than accessing data on disk.” GridGain achieves that speed while still using commodity hardware, greatly reducing total costs.
In March, GridGain made its code available under an open source license, and the GridGain In-Memory Data Fabric was subsequently accepted by the Apache Software Foundation as an incubator project called Apache Ignite, which aims to provide a high-performance, integrated and distributed in-memory platform for large data sets. As part of the Apache community, the code is open to enhancements and extensions from other users and can be downloaded at no charge. GridGain also offers a commercial enterprise version of its data fabric that adds enterprise features and professional support services.
“Most large enterprises need to support large and growing customer bases, and to be capable of real-time analytics for big data,” Kleinfeld says. “In-memory processing allows for both: handling high volumes of data very quickly and supporting sophisticated analytics such as complex event processing.” In one online trading risk management application, GridGain’s software was used to process 1 billion transactions per second; according to GridGain, comparable results cannot be achieved with other analytical techniques.
Liberating data from silos and accelerating analysis are critical to making the most of the data supply chain. Ideally, the focus will not be on the enabling technology but on the desired outcomes. “When business units are working together toward the same goal, data can be transformed into information, and continuous improvement through big data-driven analytics projects is possible,” says Dell’Anno.