A Big Data Architecture for Search
Big data has the potential to transform the way we all do business. The gradual merging of enterprise search into the big data world is not a one-way street. For sure, the technologies, skills and expertise built up over two decades by folks in the enterprise search space are critical to the effective use of unstructured content in big data applications. Various recent acquisitions testify to this. As a company, we have resisted the temptation to simply slap on a big data label and follow fashion. Instead, we've asked questions such as, "How can we use big data technologies to improve search systems?"
Big Data 101
The notion of big data originates from transactional data sources, such as log files. During the 2000s, companies like Google and Amazon saw the hidden value in log files that could be mined through analysis. Google created a technique called MapReduce, and Doug Cutting, now with Cloudera, initiated an open source system called Hadoop. In essence, this enables big problems to be split into many smaller problems, processed on distributed hardware, and then merged back together again. Combine this with the elastic availability of cloud-based computing power, and we have a platform for tackling big problems that is available to anyone. So what are the big issues in the search space?
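To make that split-process-merge idea concrete before we get to those issues, here is a minimal, purely illustrative sketch in Python (not Hadoop itself, and not our production code): a corpus is split into chunks, each chunk is counted independently, and the partial results are merged back together.

```python
from collections import Counter
from multiprocessing import Pool

def map_chunk(lines):
    """Map step: count tokens in one chunk of the corpus."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def reduce_counts(partials):
    """Reduce step: merge the partial results back together."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    corpus = [
        "search systems need good content processing",
        "content processing improves search relevancy",
    ]
    # Split the big problem into smaller chunks, process them in parallel, merge.
    chunks = [corpus[i::2] for i in range(2)]
    with Pool(2) as pool:
        partials = pool.map(map_chunk, chunks)
    print(reduce_counts(partials).most_common(3))
```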
Search system users typically point to poor relevancy, or lack of access to important content sources as their pressing problems. IT professionals may be looking for improved agility and flexibility, to keep pace with change.
If you asked us, we'd point to the area of content processing: the analysis and preparation of content prior to indexing. This important area of search systems is much neglected. Problems such as poor results relevancy are often solved through improved content processing. Almost all of the cool user interface features that users love, such as search navigators, are based on metadata that has been captured, derived or generated through content processing.
In the near future, the most important improvements in search relevancy and utility will be driven by innovation in content processing. This means advanced statistical analysis, latent semantic indexing, link and citation mapping, and a wide range of other text processing techniques.
How Can Big Data Help?
Hadoop enables us to contemplate much more detailed analyses. Content processing involves token (word-level) analysis. An enterprise corpus of ten million documents contains billions of tokens, and calculations that involve cross-referencing such a large number of items pose a big data problem. We can also use big data methods to greatly improve agility in developing and maintaining search systems.
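As a hypothetical illustration of the kind of token-level calculation involved, the sketch below computes document frequency (how many documents each token appears in), a building block of standard relevancy measures; at ten million documents this cross-referencing becomes a distributed Hadoop job rather than an in-memory loop.

```python
from collections import Counter

def document_frequency(documents):
    """Count, for each token, the number of documents it appears in.
    Illustrative only: a real corpus would be processed as a map/reduce job."""
    df = Counter()
    for doc in documents:
        df.update(set(doc.lower().split()))
    return df

docs = [
    "patent citation analysis",
    "citation mapping for patents",
    "statistical text analysis",
]
print(document_frequency(docs))
```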
Getting content processing right involves a lot of iteration. Make changes in this area, and it is usually necessary to re-index. But in a typical enterprise search scenario, re-indexing 10 million documents will take weeks, because the content must be re-crawled, and the rate at which content can be sucked out of repositories is the limiting factor.
In our new architecture, we create a secure cache of the content, in Hadoop, and update this cache as documents are updated in the source repositories. This simple change typically reduces the time taken to re-index from weeks to hours, enabling the tempo of development to be transformed. Hadoop then enables us to exploit this additional agility, and run sophisticated text analyses quickly and efficiently, laying the foundations for highly relevant and effective search systems despite continuous growth in the volume of enterprise content.
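A minimal sketch of the caching idea follows; the class and method names are illustrative, not our product's API. Documents are written into a cache keyed by ID whenever a source repository changes, and re-indexing reads from that cache instead of re-crawling.

```python
import time

class ContentCache:
    """Illustrative in-memory stand-in for a secure content cache
    (in the real architecture this would live in Hadoop)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (content, last_updated)

    def upsert(self, doc_id, content):
        """Called whenever a document is added or updated in a source repository."""
        self._docs[doc_id] = (content, time.time())

    def all_documents(self):
        """Re-indexing iterates over the cache instead of re-crawling the sources."""
        return self._docs.items()

def reindex(cache, process):
    """Run the (possibly revised) content processing over the cached corpus."""
    return {doc_id: process(content) for doc_id, (content, _) in cache.all_documents()}

cache = ContentCache()
cache.upsert("doc-1", "Enterprise search and big data")
cache.upsert("doc-2", "Hadoop for content processing")
print(reindex(cache, lambda text: text.lower().split()))
```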
Here are a couple of examples. We have a customer with 90 million patents in their data set. Cross-referencing these for citation analysis, and for other sophisticated comparison purposes, involves a great deal of content processing. However, the result is a rich, relevant and substantially more productive search experience for users.
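To give a flavour of what that cross-referencing looks like, here is a simplified, hypothetical citation-analysis step: inverting a patent-to-citations mapping to find which patents cite each patent (the data is made up).

```python
from collections import defaultdict

def invert_citations(citations):
    """Given patent -> list of cited patents, build cited patent -> citing patents.
    Across 90 million patents, this inversion is a natural map/reduce job."""
    cited_by = defaultdict(set)
    for patent, cited in citations.items():
        for target in cited:
            cited_by[target].add(patent)
    return cited_by

sample = {
    "US-001": ["US-100", "US-200"],
    "US-002": ["US-100"],
}
print({k: sorted(v) for k, v in invert_citations(sample).items()})
```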
We are also working with a recruitment company to statistically match job vacancies against millions of CVs and compute a match percentage, taking into account a wide range of factors such as geographic proximity, salary expectations, skills and expertise. For professional recruiters, the productivity gains are transformational.
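A toy version of such a match score might weight each factor and combine per-factor scores into a percentage; the factors and weights below are purely illustrative, not the model we use.

```python
def match_percentage(scores, weights):
    """Combine per-factor scores (each between 0.0 and 1.0) into a weighted percentage.
    Factor names and weights are illustrative only."""
    total_weight = sum(weights.values())
    weighted = sum(scores.get(factor, 0.0) * weight for factor, weight in weights.items())
    return 100.0 * weighted / total_weight

weights = {"skills": 0.5, "geographic_proximity": 0.2, "salary_expectations": 0.3}
candidate = {"skills": 0.8, "geographic_proximity": 1.0, "salary_expectations": 0.6}
print(f"{match_percentage(candidate, weights):.0f}% match")
```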
Not only does this architecture provide a platform for building better search systems, it can also provide content processing services (such as the normalization, cleansing and enrichment of unstructured content) for business insight applications. This is because we've split the content processing tasks away from the core search engine, so that layer can serve any application. We call it structuring the unstructured.
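One way to picture such a layer is as a pipeline of normalization, cleansing and enrichment steps that emits structured records any downstream consumer, whether a search index or a business insight tool, can use. The step names below are illustrative.

```python
import re

def cleanse(doc):
    """Strip stray markup left over from the source repository."""
    doc["text"] = re.sub(r"<[^>]+>", "", doc["text"])
    return doc

def normalize(doc):
    """Collapse whitespace and normalize case."""
    doc["text"] = re.sub(r"\s+", " ", doc["text"]).strip().lower()
    return doc

def enrich(doc):
    """Add simple derived metadata; real enrichment might add entities or categories."""
    doc["token_count"] = len(doc["text"].split())
    return doc

def process(doc, steps=(cleanse, normalize, enrich)):
    """Run a document through the pipeline; the structured output can feed
    a search index, an analytics application, or anything else."""
    for step in steps:
        doc = step(doc)
    return doc

print(process({"id": "doc-1", "text": "  <b>Structuring</b> the   Unstructured  "}))
```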
We believe an application-independent content processing layer, running on Hadoop, will become a must-have infrastructure item in the near future.
We are already deploying this new architecture for customers using a range of leading search engines, including the GSA, Solr and SharePoint 2013.