Transforming information retrieval on the Web: a new direction
Information overload is a cliché in knowledge management circles, but the term is rarely explained. Because we cannot absorb more information than we are inherently able to take in, we are not so much drowning in data as zoning out before streams of surface facts. That is, digital media offer access to more data than we have time to synthesize and integrate into our lives.
Noted information designer Richard Saul Wurman popularized the idea of “information anxiety.” The term describes the ever-growing gap between what we understand and what we feel we should understand. For instance, we read facts in the newspaper without comprehending their significance, look at art images without perceiving their importance and listen to political speeches without hearing the speaker’s message.
The volume of data available on the Internet and the way data is delivered impede our ability to understand what we retrieve. Specifically, keyword searching often retrieves an overwhelming number of documents. Typically, a user types a few keywords into a query box, crosses his or her fingers and clicks the “Go” button. Algorithms churn, and screen after screen of confusing “hits” appear.
This situation has produced two trends in consumer and corporate portals: the increased use of taxonomies where portals previously relied solely on search engines, and the rise of vertical portals devoted to a specific topic or domain. Parallel trends exist in new research projects.
Specific user groups or communities of interest are joined together by hypertext links. The information produced by those social networks automatically clusters together into hierarchies. Innovative technologies seek to order the Web by automatically constructing taxonomies. In particular, studies show that Google-style page rankings can automatically chunk content by topic. By identifying social networks on the Web, researchers automatically sort information into taxonomies.
Other scientists are building software that transforms the Web into a knowledge-based system. New technology converts the Web into a series of expert systems, an aggregation of vertical portals that users can query.
This research examines how the Web-as-brain metaphor can redefine information retrieval on the Internet. In brief, the Web can simulate intelligence by reproducing the kinds of physical connections that exist in human and animal brains.
The Web is structured according to social networks
Some scientists argue for improved content management of the Web through Google-style link analysis. A number of research projects have explored the complex link topology of the Web. The Web is growing exponentially, and it is often experienced by users as a confusing tangle of disjointed information. But the Web does have an inherent, if inchoate, organization based on hypertext links. It is organized into hubs (pages that link to Web sites of interest) and authorities (Web sites of interest that are linked to by hubs). For example, a page entitled “My Favorite Links” would be a hub. Arts and Letters Daily, a vertical portal devoted to contemporary culture, is a hub that links readers to book reviews, editorials and essays of merit. In contrast, a reliable and important source of information, such as the SEC’s Edgar Database of Corporate Information, is an authority.
Hubs and authorities are interrelated: Hubs point to many authorities, and an authority is pointed to by many hubs. A densely connected set of hubs and authorities constitutes a Web community. Hypertext links reveal social networks. Studies reveal that the Web’s underlying structure of hubs and authorities is a direct consequence of how Web page authors link to topically related sites.
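The mutual reinforcement between hubs and authorities can be sketched in a few lines. The example below is a minimal illustration of the hubs-and-authorities iteration (commonly known as HITS), not the code used in any of the studies mentioned here. The page names and the tiny link graph are hypothetical.

```python
def normalize(scores):
    """Scale scores so their squares sum to 1 (keeps numbers from growing)."""
    norm = sum(v * v for v in scores.values()) ** 0.5
    return {page: (v / norm if norm else 0.0) for page, v in scores.items()}


def hits(links, iterations=50):
    """Score hubs and authorities for a small link graph.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {page: 0.0 for page in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        auth = normalize(auth)
        # A page's hub score is the sum of the authority scores it links to.
        hub = normalize(
            {page: sum(auth[dst] for dst in links.get(page, [])) for page in pages}
        )
    return hub, auth


# Hypothetical graph: two link pages pointing at three content pages.
LINKS = {
    "favorite_links": ["edgar_database", "book_review"],
    "culture_portal": ["edgar_database", "book_review", "essay"],
}

hub_scores, authority_scores = hits(LINKS)
for page in sorted(authority_scores, key=authority_scores.get, reverse=True):
    print(page, round(authority_scores[page], 3), round(hub_scores[page], 3))
```

In this toy graph the two link pages end up with the hub scores and the pages they both point to end up with the highest authority scores, which is the pattern the paragraph above describes.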
The fundamental assumption of Google-style page ranking systems is that a link to another page constitutes a “vote” for that page or Web site. Certainly not all links function as a form of endorsement. For instance, some links help users navigate a Web site (“Home Page”) and other links are advertisements (“Click here for over 700 styles of sunglasses”). However, the vast majority of links point to valuable resources. Links pull together hundreds of millions of pages into webs of knowledge. Users often explore the Internet by serendipitously clicking on links, so hypertext technology connects people who have never met into communities of interest.
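The “link as vote” idea can be made concrete with a simplified PageRank-style calculation. This is a sketch under stated assumptions, not Google’s production method: the damping value of 0.85, the handling of pages without outgoing links, and the page names are all illustrative choices.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Rank pages by treating each outgoing link as a share of a vote.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {dst for targets in links.values() for dst in targets})
    count = len(pages)
    rank = {page: 1.0 / count for page in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score regardless of its links.
        new_rank = {page: (1.0 - damping) / count for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # Assumption: a page with no links spreads its vote over all pages.
                share = damping * rank[page] / count
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: three pages all link to "archive", which links back to one.
LINKS = {
    "blog_a": ["archive"],
    "blog_b": ["archive"],
    "blog_c": ["archive"],
    "archive": ["blog_a"],
}

ranks = page_rank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

The page that collects the most votes ranks first, and the page it links to inherits part of that standing, so a vote from a well-regarded page counts for more than a vote from an obscure one.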
Identifying hubs and authorities reveals taxonomies
Dividing the Web into hubs and authorities automatically separates chunks of information according to subject. Researchers can sift through the Web and identify broad search topics. For example, searches for information on “gun control” will automatically result in two sets of results: one in favor of gun control and one opposed to it. The linked structure of the Web can be a key source of information about digital content.
In fact, topics spontaneously appear on the Web in a hierarchical (parent-child) relationship. Communities on the Web are ordered into a tree structure. That means there can be multiple subgroups for a single community. For example, searching for “Physics” may reveal a number of smaller Web communities: academic departments, professional societies and research on string theory. While most directories determine subjects a priori, link analysis technology allows topics to bubble up from the Web.
These conclusions about the underlying structure of the Web have been supported by subsequent research. A study published in Nature magazine revealed that the Web is divided into four parts. It’s structured like a bow tie with dangling ribbons.
The knot of the bow tie is the tightly interconnected core of the Web. It is the heart of the Web. The left side of the bow tie consists of pages that link into the core but can’t be reached from the core itself. These pages are often new or obscure Web sites. The right side of the bow tie consists of Web sites that are linked to from the core but don’t link back to the knot of the bow tie. These pages are usually commercial Web sites that contain internal links only. Finally, ribbons shoot off from the central structure of the Web. More than 90% of all Web pages are connected through hypertext links.
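The bow-tie description maps onto a simple reachability test. The following sketch assumes we already know one page that sits in the core (the hypothetical `seed` argument) and classifies every other page by whether it can reach that page, be reached from it, both, or neither. It lumps ribbons and disconnected pages into a single "other" group, which is a simplification, and the sample graph is invented for illustration.

```python
from collections import deque


def reachable(start, neighbors):
    """Return every page reachable from start by following neighbors."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in neighbors.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bow_tie(links, seed):
    """Split pages into core, in, out and other, relative to a core page."""
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    # Build the reversed graph so we can walk links backwards.
    reverse = {}
    for src, targets in links.items():
        for dst in targets:
            reverse.setdefault(dst, []).append(src)
    downstream = reachable(seed, links)    # pages the core can reach
    upstream = reachable(seed, reverse)    # pages that can reach the core
    core = downstream & upstream
    return {
        "core": core,
        "in": upstream - core,
        "out": downstream - core,
        "other": pages - downstream - upstream,
    }


# Hypothetical graph with one page of each kind.
LINKS = {
    "new_site": ["portal"],
    "portal": ["news"],
    "news": ["portal", "shop"],
    "shop": ["shop_cart"],
    "island": ["island_annex"],
}

parts = bow_tie(LINKS, seed="portal")
for name, members in parts.items():
    print(name, sorted(members))
```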
Structural analysis of digital information has entered the marketplace. Google, Quiver and other search services analyze hypertext links to determine relevant content, but the taxonomy research described above is not commercially available. Studies find that taxonomies compiled via page ranking technology can be competitive with handcrafted directories like Yahoo and Infoseek (now part of the Go Network). This is an exciting idea for ordering information on the Web, but the technology has not yet been translated into a commercially viable service or product.
Taxonomies contextualize search results
Taxonomies are an important part of any knowledge management solution. A taxonomy functions like a road map to information. The purpose of a directory is to facilitate associative learning and casual browsing. Directories illustrate the conceptual relationships (i.e., equivalence, associative and hierarchical connections) that exist between items:
- equivalence relationships: directories include synonyms (e.g., book, text and monograph are terms that mean the same thing).
- associative relationships: directories include cross-references (e.g., at Amazon.com, the sidebar that reads “Customers also bought”). Such cross-references are frequently used in an e-commerce context to cross-sell or upsell products to customers.
- hierarchical relationships: directories include whole/part or genus/species relationships (e.g., a taxonomy devoted to animals may have a top category for “Domestic Pets” that is subdivided into “Cats” and further divided by cat breeds such as “Siamese Cat” and “Manx Cat”).
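The three kinds of relationship can be represented with three small lookup tables. The sketch below is only one possible layout, assuming a directory small enough to hold in memory. The synonym and pet entries come from the examples above, while the "reading lamp" cross-reference and all function names are hypothetical.

```python
# Equivalence: each variant term points to the preferred term.
SYNONYMS = {"text": "book", "monograph": "book"}

# Associative: cross-references from one term to related terms (invented entry).
RELATED = {"book": ["reading lamp"]}

# Hierarchical: each category points to its parent category.
PARENT = {"Cats": "Domestic Pets", "Siamese Cat": "Cats"}


def preferred_term(term):
    """Resolve a synonym to the term the directory files things under."""
    return SYNONYMS.get(term, term)


def see_also(term):
    """List cross-references for a term, after resolving synonyms."""
    return RELATED.get(preferred_term(term), [])


def breadcrumb(term):
    """Walk up the hierarchy and return the path from the top category down."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return list(reversed(path))


print(preferred_term("monograph"))   # book
print(see_also("text"))              # ['reading lamp']
print(breadcrumb("Siamese Cat"))     # ['Domestic Pets', 'Cats', 'Siamese Cat']
```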
Searching for information is dynamic, interactive and iterative. It is not a binary enterprise. Users make a first attempt to learn something, refine their query, search for more information, refine their search and so on. Hierarchies and most other structures of organization do not foreclose the iterative process of information seeking, but help guide users through the back-and-forth procedure of looking for new information.
Automation is key
While taxonomies add background knowledge and structure to search results, human editors can’t keep up with the ever-increasing size of the Web. Momentum within the research community favors automating the classification process as much as possible. Manually indexing all the information contained on the Web is simply not feasible. As the amount and complexity of digital information increase, human labor does not scale.
The Web-as-brain metaphor: merging processing power and memory
The other important research trend concerns the Web-as-brain metaphor. Some scientists argue that the structure of the Web mirrors the organization of human and animal brains. The brain’s architecture, a highly connected network of neurons joined by synapses, is responsible for important functions such as perception and learning. The basic idea is that Web pages act as neurons and hypertext links act as synapses. Web pages exist in complex patterns and hypertext links direct the flow of information from one page to the next.
Instead of centralized search engines indexing Web content, the network could become a collection of intelligent vertical portals. For example, the Active Semantic Memory (ASM) is a software system that simulates intelligence by mimicking human and animal brains. Developed by mathematician Jacob Yadegar, the ASM refigures the Web into a linked series of vertical portals (e.g., science, art, medicine and technology). These topic-specific Web sites would contain taxonomies that copy the structure and function of neural networks.
As mentioned above, conventional taxonomies ease information access by providing a common language and facilitating casual browsing and associative learning. In contrast, the ASM offers a more complex type of information retrieval by:
- generating answers to factual questions (“Is the Uffizi Gallery located in Florence?”);
- performing comparative analysis (“What is the cheapest flight to Heathrow Airport leaving LAX in two weeks, the cheapest three star hotel in the Mayfair district of London and the closest railway station to central London?”); and
- making diagnoses (“What’s this rash on my arm?”).
The ASM taxonomies analyze concepts as clusters of attributes. Attributes, in turn, are composed of values that describe each attribute (e.g., “Apple” is the concept, “Taste” is an attribute of “Apple” and “Sour” is a value for the attribute “Taste”). Values are given a probability and ranked in query results. It is at the level of values that computation occurs within the taxonomy.
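The concept, attribute and value layering can be pictured as nested lookup tables. The sketch below is a toy illustration of that layering only. It does not reproduce how the ASM is actually built, and the probabilities, the extra “Sweet” and “Color” entries and the function name are invented for the example.

```python
# concept -> attribute -> value -> probability (all numbers are made up)
CONCEPTS = {
    "Apple": {
        "Taste": {"Sour": 0.3, "Sweet": 0.7},
        "Color": {"Red": 0.6, "Green": 0.4},
    },
}


def ranked_values(concept, attribute):
    """Return the values of one attribute, most probable first."""
    values = CONCEPTS.get(concept, {}).get(attribute, {})
    return sorted(values.items(), key=lambda item: item[1], reverse=True)


print(ranked_values("Apple", "Taste"))   # [('Sweet', 0.7), ('Sour', 0.3)]
print(ranked_values("Apple", "Weight"))  # []
```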
Furthermore, queries are also broken down into their constituent parts and handed off to various hierarchies within a specific vertical portal.
Taking the travel example mentioned above, a vertical portal devoted to Travel would hand off the question regarding airfare to the airfare hierarchy, the question regarding hotels to the hotel hierarchy and so on. The net result of a query is either a list of answers to proposed questions or links to relevant documents and Web sites.
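The hand-off step can be illustrated with a simple keyword router. This is a deliberately naive sketch: it assumes the compound question has already been split into its parts, and the keyword table, hierarchy names and function name are hypothetical rather than taken from the ASM.

```python
# Hypothetical mapping from a trigger keyword to a hierarchy in a Travel portal.
KEYWORD_TO_HIERARCHY = {
    "flight": "airfare",
    "hotel": "hotel",
    "railway": "rail",
}


def route(sub_questions):
    """Pair each sub-question with the hierarchy that should answer it."""
    routed = []
    for question in sub_questions:
        lowered = question.lower()
        # Pick the first hierarchy whose keyword appears in the question.
        target = next(
            (name for keyword, name in KEYWORD_TO_HIERARCHY.items() if keyword in lowered),
            "unrouted",
        )
        routed.append((target, question))
    return routed


PARTS = [
    "cheapest flight to Heathrow Airport leaving LAX in two weeks",
    "cheapest three star hotel in the Mayfair district of London",
    "closest railway station to central London",
]

for hierarchy, question in route(PARTS):
    print(hierarchy, "<-", question)
```

A real system would need far more than keyword matching to split and interpret a question, but the shape is the same: one incoming query, several specialized hierarchies, and a combined list of answers or links at the end.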
Instead of a monolithic, static collection of digital documents, the Web could be ordered into a knowledge-based system. Although part of a long-term study and not ready for commercial release, the ASM is an example of how the Web-as-brain metaphor is sparking creative research projects.
One future application for ASM: Mobile products require answers, not links
The ASM dovetails with the special information retrieval demands of wireless technology. Many commentators argue that mobile, wireless products will change the protocols of searching. On the Internet, searching generates links to Web sites. With mobile devices, users simply cannot browse sites in the traditional sense. With wireless technology, searching is about finding the right answer. Mobile devices, because of their small screens and hard-to-use keypads, limit the kind of content Web sites can send to users. Wireless products will take the user closer to the information needed to solve a problem.
Katherine C. Adams is an information architect and free-lance writer, e-mail kadams@mohomine.com.