Your Search Solution is as Good as Your Metadata
Modern enterprise search engines need to rely heavily on metadata to improve document ranking and to provide rich search navigation. The success of Internet search, and of Google in particular, has raised the expectations of users searching within their own enterprise, and those expectations often end in frustration. Many users expect enterprise search to work much better than the leading Internet search engines. After all, the Internet search companies search billions of Web pages, most of which contain short, unstructured, free-form text that is often poorly written. In contrast, the number of documents within an enterprise is typically much smaller, and enterprise documents are written with a lot of attention. Yet many users wonder why their enterprise search solution does not work as well as Internet search. So, what is happening?
The answer to this question lies in the availability of high-quality metadata.
On the Internet, the number of documents is very large (billions of pages). The scale of the Internet, and a topology that allows text to be linked to Web pages, indirectly produces metadata that has been manually created on an extremely large scale. For example, if someone somewhere on the Internet links the text “research for cancer treatment” to a page on the National Institutes of Health (NIH) site, that text can be taken as metadata about the NIH page. Such links are created manually and are therefore high-quality, user-created metadata. Google, for example, relies heavily on such metadata to rank search results.
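The idea of treating link text as metadata about the linked page can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the function names, the example URLs and the simple "count the matching link texts" score are all assumptions made for the sketch.

```python
from collections import defaultdict


def collect_anchor_metadata(links):
    """Group link texts by the page they point to.

    `links` is an iterable of (anchor_text, target_url) pairs.
    """
    metadata = defaultdict(list)
    for anchor_text, target_url in links:
        metadata[target_url].append(anchor_text.lower())
    return dict(metadata)


def anchor_score(query, target_url, metadata):
    """Count the link texts for a page that contain every query term.

    This is a toy relevance signal chosen for illustration only.
    """
    terms = query.lower().split()
    texts = metadata.get(target_url, [])
    return sum(all(term in text.split() for term in terms) for text in texts)


# Hypothetical links; the URLs are placeholders, not real pages.
links = [
    ("research for cancer treatment", "https://example.org/nih-page"),
    ("cancer treatment studies", "https://example.org/nih-page"),
    ("weather forecast", "https://example.org/weather"),
]

anchor_metadata = collect_anchor_metadata(links)
nih_score = anchor_score("cancer treatment", "https://example.org/nih-page", anchor_metadata)
weather_score = anchor_score("cancer treatment", "https://example.org/weather", anchor_metadata)
```

In this sketch the NIH page collects two descriptive phrases that its own authors never wrote, and that is the metadata an isolated enterprise document lacks.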
Within enterprises, although the number of documents is typically much smaller (from hundreds of thousands to millions of documents), enterprise documents do not naturally lend themselves to linking, and so document metadata is not created as a by-product. Typical enterprise documents are Microsoft Office documents, PDF documents, etc., and they sit in the document collection completely isolated from one another. Unlike on the Internet, there are no textual links between documents and no implicitly created metadata that a search engine can use.
In order to achieve high accuracy within an enterprise search solution, metadata has to be created explicitly, much as traditional publishing organizations do. Automatic metadata generation solutions are required for successful search deployment. Three kinds of metadata can be generated automatically: keywords, entities and topics. Keywords and entities are words and concepts (such as people, locations, dates, product types, relationships between entities, etc.) that are mentioned in the document and are relevant to the enterprise's business. Keywords and entities are extracted from documents by an automatic entity extraction system, and topics are associated with each document by an automatic categorization system. Each of these systems requires management tools that allow a small number of enterprise librarians to configure it to reflect the nature and content of the business.
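To make the division of labor concrete, here is a minimal Python sketch of the two steps: entity extraction driven by a vocabulary, and categorization driven by topic rules. It is a toy under stated assumptions and does not represent any vendor's software. The vocabulary, the topic rules and all function names are hypothetical, standing in for what an enterprise librarian would configure.

```python
import re

# Hypothetical vocabulary maintained by a librarian: phrase -> entity type.
ENTITY_VOCABULARY = {
    "avian flu": "disease",
    "world bank": "organization",
    "boston": "location",
}

# Hypothetical taxonomy: topic -> cue phrases that suggest the topic.
TOPIC_RULES = {
    "biological outbreaks": {"avian flu", "outbreak", "virus"},
    "finance": {"loan", "world bank", "interest rate"},
}


def _mentions(phrase, lowered_text):
    """Return True if the phrase occurs as whole words in the text."""
    return re.search(r"\b" + re.escape(phrase) + r"\b", lowered_text) is not None


def extract_entities(text):
    """Return the vocabulary phrases found in the text, with their types."""
    lowered = text.lower()
    return {
        phrase: entity_type
        for phrase, entity_type in ENTITY_VOCABULARY.items()
        if _mentions(phrase, lowered)
    }


def assign_topics(text, min_hits=1):
    """Return the topics whose cue phrases occur at least `min_hits` times."""
    lowered = text.lower()
    topics = []
    for topic, cues in TOPIC_RULES.items():
        hits = sum(1 for cue in cues if _mentions(cue, lowered))
        if hits >= min_hits:
            topics.append(topic)
    return sorted(topics)


def tag_document(text):
    """Produce the metadata record that a search engine could index."""
    return {"entities": extract_entities(text), "topics": assign_topics(text)}


doc = "An avian flu outbreak was reported near Boston last week."
tags = tag_document(doc)
```

The point of the design is that the code stays fixed while the two dictionaries change: when a new subject appears, a librarian adds phrases and topics, and every document can be tagged again automatically.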
There are several types of tools for creating and managing metadata: tools for maintaining vocabularies and relationships between words (such as Teragram Ontology Management Tool), tools for defining entities and concepts (such as Teragram Concepts Extraction Management Tool) and tools for defining and generating sets of topics (such as Teragram Taxonomy Management Tool). Associated with those metadata management tools is the automatic metadata generation software: automatic keyword extraction software, automatic concept extraction software and automatic categorization software.
Real Life Examples
The benefits of a metadata generation system are best described with two examples of organizations that chose automatic metadata generation solutions (automatic keyword extraction, entity extraction and categorization) to tag their documents in support of their search solution: the Homeland Security Digital Library (HSDL) and the World Bank. An electronic repository of scholarly works and articles written by the Department of Defense, the HSDL receives a constant influx of new content, and of entirely new topics. For example, when avian flu emerged, an entirely new set of metatags had to be created, and documents on biological outbreaks had to be updated with these new tags. Automatic metadata generation enabled the HSDL to automate this process.
Much like the HSDL, the World Bank required a better way to retrieve the millions of documents stored in its repositories. By automatically applying metatags to documents in several different file formats, the World Bank was able to create a system that worked across multiple languages for better search retrieval.
These two examples bring me back to my original point: tools for creating and maintaining metadata have now become the central piece of a successful enterprise search deployment.
Teragram Corporation is the maker of multilingual automatic metadata generation and metadata management solutions which have been adopted by the largest news, media, publishing and enterprise organizations. Founded in 1997 by innovators in the field of computational linguistics, Teragram offers the speed, accuracy and global language support that customers and partners demand to retrieve and organize growing volumes of digital information. Teragram serves customers across the publishing, pharmaceutical, telecommunications and financial industries, including ABCNews, AOL, Ariba, Ask Jeeves, Associated Press, CNN, Factiva, Ebsco Publishing, Ebay, FAST Search & Transfer, Forbes.com, InfoSpace, Naval Post Graduate School, NYTimes Digital, OneSource, Reed Business Information, Ricoh, Sony, Verity, WashingtonPost.com, WoltersKluwer, the World Bank and Yahoo!. For more information please contact 617-576-6800 or visit www.teragram.com/info.