Automating perception
By Tony McKinley
Advanced information retrieval software is an exciting technology in knowledge management. Its promises—super-human understanding of vast skies of information—and the exotic terms used to describe it—taxonomy, ontology, correlation mining—are so dazzling they might be blinding the market to what the software actually does. While advanced information retrieval is often referred to as categorization systems technology, I have coined a term for it: "automating perception."
In this article, three industry pioneers help us understand the quest to automate perception. They represent vendors in the three primary sectors of the space: traditional search, categorization/classification, and a combination of both. Convera, which first traded on NASDAQ in 1980 as Excalibur, is the traditional search vendor, one that has achieved durable, high-tech success with its information retrieval systems. ClearForest represents the new categorization/classification genre, while Stratify combines both sectors, and represents new evolutions in the never-ending search for faster, better, smarter. While Convera and ClearForest agree their technology could and should work together, Stratify sees itself more as a total solution.
Convera
Claude Vogel, founder of Semio and now chief scientist at Convera, begins, "After 9/11, the market globally evolved to information consolidation, to find connections between events. As the whole world became concerned with connecting the dots, our industry changed focus to tracking networks of events and people." The phrase "networks of events and people" describes a common goal of all the vendors: to establish patterns in data flows so massive they gray out into apparent randomness, yet are not random.
Vogel mentions another central term: “discovery search,” which he patented in 1996 and which is now a major concern. Vogel says the three companies in this article are aptly grouped “because all three of these converging applications need each other. The entire industry is moving from search to discovery.”
"Convera runs on thousands of servers, real-time, many languages, we grind massive amounts of data. If we add data mining on top of it, we grind terabytes," Vogel explains. He distinguishes between the terms categorization and classification as follows. Categorization "tags info for what it is" in the most commonly understood terms, while classification is done on the fly to match the user's needs.
The image emerges of a solid system in place to control the chaos of so much data, aka categorization, and the joy of being able to effortlessly shape the data to suit the mission, aka classification.
“Individual people want to deal with their own problems, to organize unique representations of the world,” says Vogel, “but without a common view of the world, you cannot predict what's going to happen when you use the information, so you need a sense of order. Our competitive advantage is we can grind terabytes and pull out the semantic signature of documents, the DNA of documents.”
The word "semantic" sounds complicated, but in terms of search, it simply means that search terms are expanded into their fuller expressions, as you would find in a dictionary, rather than just a thesaurus. Convera excels at light-speed classification of documents: "10K docs X 10K categories in a few seconds."
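To make the idea of expanding search terms concrete, here is a minimal sketch in Python. It is not Convera's method, which the article does not describe; the tiny `DEFINITIONS` table and the `expand_query` function are hypothetical, invented purely for illustration.

```python
# Hypothetical mini-dictionary: each term maps to words from its fuller definition.
DEFINITIONS = {
    "car": ["automobile", "motor", "vehicle"],
    "bank": ["financial", "institution", "lender"],
}


def expand_query(query: str) -> list[str]:
    """Return the original terms plus any definition words, without duplicates."""
    expanded: list[str] = []
    for term in query.lower().split():
        # Keep the term itself first, then add whatever the dictionary offers.
        for word in [term, *DEFINITIONS.get(term, [])]:
            if word not in expanded:
                expanded.append(word)
    return expanded


print(expand_query("car loan"))
```

A search engine would then match documents against the longer list, so a query for "car" could also surface documents that only say "automobile."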
Vogel offers Convera’s terabyte-grinding categorization powerhouse as the generator of a reliable, and therefore stable, taxonomy: common categories that everyone knows and trusts. On top of that, Vogel invites classification (née data mining) vendors to perform their magic on that refined material.
“For enterprise viability, you must be scalable to terabytes,” says Vogel. “Our goal is to integrate these multiple functions into one process. We want to attract third-party companies, like ClearForest, in data mining. Pattern detection finds the plot, the connections, allows us users to do correlation mining.”
ClearForest
Barak Pridor, CEO of ClearForest, says, “Categorization is only a piece. Categorization is a document-level operation. The inner document operations are where you identify and tag the gold.
“Search is good when you know what you are looking for; when you don’t know what you are looking for, you need discovery.” For example, he adds, a Reuters market wrap-up contains tens or hundreds of nuggets of information that are relevant, all of which would be lost in one big category if the categorization were done at the document level.
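The contrast Pridor draws can be sketched in a few lines of Python. A document-level categorizer would file a whole wrap-up under a single label; the sketch below instead tags individual sentences. It is not ClearForest's technology: the `LEXICON`, the company names, and the sample text are all invented for illustration, and simple substring matching stands in for real extraction.

```python
import re

# Hypothetical lexicon of items worth tagging inside a document.
LEXICON = {
    "company": ["Acme Corp", "Globex"],
    "event": ["merger", "earnings", "layoffs"],
}


def tag_sentences(text: str) -> list[dict]:
    """Record every lexicon hit, sentence by sentence."""
    facts = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for number, sentence in enumerate(sentences, start=1):
        for kind, terms in LEXICON.items():
            for term in terms:
                if term.lower() in sentence.lower():
                    facts.append({"sentence": number, "type": kind, "value": term})
    return facts


# Invented sample text standing in for a market wrap-up.
wrap_up = "Acme Corp announced a merger. Globex reported earnings and layoffs."
for fact in tag_sentences(wrap_up):
    print(fact)
```

Each tagged fact keeps its position in the text, so the individual nuggets stay retrievable instead of disappearing into one category.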
“All you can do with unstructured documents is you can read them,” Pridor says, “and all search does is help you find them.” He compares the difference between search and categorization/mining to the difference between relational database management systems (RDBMS) like Oracle and the business-critical applications that are built on top of them.
“Data mining users identified patterns in warehouses of data, and this was used for business intelligence, business analytics,” he says. “We are in the business of structuring unstructured information, to bring the kind of business intelligence access to unstructured data that has long been available with structured data.”
The technology could, for example, help an intelligence analyst with 10 million documents look for specific people who have been to specific places in the last few years.
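Once facts have been extracted from text into structured records, the analyst's question becomes an ordinary query. The following Python sketch assumes a hypothetical `Sighting` record and invented sample data; it illustrates the idea of querying structured tags and is not a description of any vendor's product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sighting:
    """One extracted fact: a person placed at a location in a given year (hypothetical schema)."""
    doc_id: str
    person: str
    place: str
    year: int


def people_at(sightings, places, since_year):
    """Map each person to the documents that place them at one of the given places since a year."""
    hits: dict[str, list[str]] = {}
    for s in sightings:
        if s.place in places and s.year >= since_year:
            hits.setdefault(s.person, []).append(s.doc_id)
    return hits


# Invented records standing in for tags extracted from a document collection.
sample = [
    Sighting("doc-1", "A. Smith", "Lisbon", 2001),
    Sighting("doc-2", "B. Jones", "Oslo", 1995),
    Sighting("doc-3", "A. Smith", "Oslo", 2002),
]
print(people_at(sample, {"Lisbon", "Oslo"}, since_year=2000))
```

The hard part, which this sketch skips entirely, is producing reliable records like these from millions of unstructured documents.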
Pridor offers another example: “Compugen is a bio-informatics company that needs to streamline the scientific discovery process. To do that, it needs to identify dead-ends fast, in any phase of its future investments. Mining the world’s largest online DB, MedLine, as dynamic source data, the company uses ClearForest in the global, up-to-the-minute literature to find patterns mentioning certain genes, tissues, organs, diseases, molecules of information that would otherwise be impossible to perceive.”
As much as we hear about ROI in the KM market, Pridor explains his company’s mission in other terms. “It’s not productivity savings we are going for,” he says. “It is to find insights that would never come to life otherwise. Our mission is to bridge the huge gap between information and action.”
Stratify
Ramana Venkata, co-founder and CTO of Stratify, starts out with a satellite view: “We create a new layer on top of unstructured data, the fastest growing data in the enterprise, and convert it to a structured format.”
Compared to information retrieval, Venkata says, “Just to think of search is narrow. Years ago people did comparative analyses of IBM DB2, Oracle, etc. Nobody does that anymore; they look at the ERP, CRM and all the apps built on top of the database technology.”
Venkata continues, “We act as the foundation platform for converting unstructured into structured data. In the enterprise, structured data is under the control of the CFO, but unstructured data is the responsibility of all the departments.” By bringing order to chaos, Venkata envisions new business process applications that incorporate intelligence from every process in the enterprise.
“Search is just one application,” Venkata continues, “and databases are the largest family of applications. We can insert data into any application. More importantly, we can perform analysis, extract reports, we can leverage the data in unstructured databases.”
He emphasizes, “The critical part is the tools to manage your taxonomy. We have the best taxonomy structure because we can build a brand new taxonomy, or take an existing taxonomy, or combine both methods. We can then apply these specific taxonomies.
“Our unique differentiator is that we offer automated tools to refine the taxonomy to maintain relevance. We ensure relevance in an ongoing fashion by providing human review at every step as we create our categories.”
Stratify uses many advanced IR algorithms to categorize documents, and users can dial up the specific algorithms that work best on specific topics. According to Venkata, “Knight Ridder originally went with another vendor, but they used only one algorithm to categorize incoming data. For example, individual sports could be distinguished easily, like football vs. baseball, but college vs. pro was too fine a distinction.” By offering such precision-tweaking tools, Stratify can draw ever-finer distinctions in the way information is converted from unstructured to something more like traditionally manageable data in an RDBMS.
Longtime watchers of the information retrieval industry are excited by this new generation. While KM-automation sounds a little silly (how can you automate thinking?), these vendors are capable of breakthrough applications in multidimensional data mining. Unstructured documents, things you can read rather than numbers you can sort and calculate, are the raw materials in this explosive segment of the KM industry. As Convera, ClearForest and Stratify show, this technology provides a quantum bang in connecting dots worldwide.
Tony McKinley is with Input Solutions (inputsolutions.com), e-mail tony@inputsolutions.com, 610-647-5570. He is the author of “From Paper to Web” (imagebiz.com).