Rebooting the information refinery
Search
Consider something as simple as search. The search application is approximately as old as computing itself. We started up the machines, and sure enough, before long we couldn’t find the information we were interested in. Text complicated a picture that had been designed for numbers. So computer scientists developed a way to give words a kind of reliable structure: the inverted index and the Boolean query language. Keyword search was born. It was an early, simplistic refinery for what soon became more than 80% of the world’s data.
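To make that early refinery concrete, here is a minimal sketch of an inverted index with a Boolean AND query. It is not any particular engine’s implementation; the sample documents and the boolean_and helper are illustrative assumptions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Return the document IDs containing every query term (Boolean AND)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "the missile launcher was visible in the photo",
    2: "the photo shows a sunset over the lake",
    3: "keyword search indexes every word in every document",
}
index = build_inverted_index(docs)
print(boolean_and(index, "photo", "sunset"))   # {2}
```

The refinery is literal here: every document is boiled down to a bag of words, and a query is just a set intersection, which is exactly why ambiguity in language breaks it.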
Unfortunately, keyword search was doomed to failure by the permanently ambiguous nature of its raw material: language. The simple refinery could not keep up with the vagaries of language. Adding to the challenge, the data that people wanted to search was changing from plain text to richer media, such as audio, images, streaming media like film and television, and, ultimately, social media.
Enter web search. The first web search engines put keyword search on huge machines, with predictably awful results. Some responses to the limitations of early web search stepped backward in time, away from computer science. Most pre-Google “search engines” used human editors to mask the woeful inaccuracy of the keyword refinery. The humans hand-curated popular information (such as headlines and items about health and disease, celebrities, and sports) so people could replace searching with browsing a kind of kiosk of breaking news and entertainment. Techies on these sites busied themselves writing business rules to help the remaining keyword searches disguise their problems with context, misspellings, and common ambiguities.
Google changed everything about search, including its name. It leveraged a crucial insight: users’ behavior reveals more about their intent than their words do. It built a new-generation information refinery. The new refinery used a searcher’s query words as a starting point, but it produced answers by drawing on behavioral and contextual analytics, including user location and search history, and by processing the “answer” through a set of parallel pipelines and filters. Within a decade, the refinery had become so sophisticated, and so acceptable to users, that Google owned the field, which now includes not just internet search itself but a galaxy of surrounding value points such as mobility, devices, shopping, advertising, travel data, and geolocation information. Google was able to accomplish this hegemony by shifting its business onto compute platforms that were designed for big data and driven by machine learning. Business-rules techies are still in evidence, but it is algorithm-led science, and the business model to support it, that show the power of a contemporary refinery.
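The mechanics are proprietary, but the basic idea of blending text relevance with behavioral and contextual signals can be sketched as a simple weighted scorer. The signal names, weights, and candidate results below are illustrative assumptions, not Google’s actual ranking function.

```python
def rank(results):
    """Rank candidates by blending keyword relevance with behavioral
    and contextual signals (weights are illustrative)."""
    def score(r):
        return (0.5 * r["keyword_relevance"]          # classic inverted-index match
                + 0.3 * r["click_history_affinity"]   # how often this user clicked similar pages
                + 0.2 * r["location_proximity"])      # how close the result is to the user
    return sorted(results, key=score, reverse=True)

candidates = [
    {"url": "example.com/pizza-recipes", "keyword_relevance": 0.9,
     "click_history_affinity": 0.1, "location_proximity": 0.0},
    {"url": "example.com/pizzeria-nearby", "keyword_relevance": 0.6,
     "click_history_affinity": 0.7, "location_proximity": 0.9},
]
print([r["url"] for r in rank(candidates)])
# The nearby pizzeria outranks the stronger keyword match once behavior and context count.
```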
Sentiment analysis
Another example of changing refinery characteristics lies in the area of sentiment. Sentiment has been important to many firms for a very long time: the feelings of customers, shareholders, employees, and other stakeholders are a central concern for executives running a business. But in the current era, where customer experience (CX) is the new gold standard, being able to measure and analyze sentiment is a core business challenge.
Sentiment analysis, like search, began with an information refinery built on text. And just as with search, the dictionaries of sentiment-charged terms on which early sentiment analysis depended proved too limited and fragile to capture the full range of human expression. When does “bad” mean bad, and when does it mean the coolest thing ever? And how long will that interpretation last in general circulation?
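A minimal sketch of that early, dictionary-based approach shows how easily it breaks when slang inverts a term’s polarity. The word list and scores here are illustrative assumptions, not a real production lexicon.

```python
# Illustrative sentiment lexicon; real systems used much larger, hand-built dictionaries.
LEXICON = {"great": 1, "love": 1, "awful": -1, "bad": -1, "broken": -1}

def lexicon_sentiment(text):
    """Score a text by summing the polarity of any sentiment-charged words it contains."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())

print(lexicon_sentiment("the battery is awful and the screen is broken"))  # -2, plausibly right
print(lexicon_sentiment("this track is so bad i love it"))                 #  0, the slang "bad" cancels the praise
```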
Today’s sentiment gurus are looking at a wider spectrum of clues. What if we can read facial expressions from a video chat? What if we can understand the tone of voice in an audio recording of a customer service call? What if we can analyze what the friends of our customer are saying about our product or competitors’ products? What if we can integrate a customer’s purchase history into the analysis, or her recent navigation on the web? The next-generation sentiment refinery will integrate all of those signals, newly made available by the internet and the spread of mobile devices and cameras, into a much wider analysis of customer affect.
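One way to picture that next-generation refinery is as a fusion step over separately derived signals. The channel names, score ranges, and weights below are purely illustrative assumptions, not a specific vendor’s method.

```python
def fuse_sentiment(signals, weights=None):
    """Combine per-channel sentiment scores (each in [-1, 1]) into one estimate of customer affect."""
    weights = weights or {channel: 1.0 for channel in signals}
    total = sum(weights[channel] for channel in signals)
    return sum(signals[channel] * weights[channel] for channel in signals) / total

signals = {
    "text": -0.2,              # lexicon/model score from a chat transcript
    "voice_tone": -0.7,        # rising pitch and volume on a service call
    "facial": -0.5,            # frown detected on a video chat
    "purchase_history": 0.3,   # loyal, frequent buyer
}
print(round(fuse_sentiment(signals), 2))  # a clearly negative affect score despite the loyal purchase history
```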
Image analysis
As a final example, consider the area of image analysis. Early machine learning-based attempts to extract features from images, such as reading spy-plane photos for missile launchers or training a machine to interpret an MRI or CAT scan, were highly inaccurate. They also depended on access to a very large library of training images to “supervise” the machine’s knowledge of the target of the investigation. Human image readers and expert human analysis represented the true state of the art. Image recognition in “general” areas of life was considered a technology not ready for prime time. Yahoo went so far as to acquire a company (Flickr) dedicated to crowdsourcing the labeling of images captured in amateur photography. Echoing the backward steps of early internet search, Yahoo researchers experimenting with machine learning-based image recognition saw that human labelers understood “sunset” pictures better than computers could, and expected that shortfall to last for years.
New refineries
But current developments in unsupervised machine learning, and particularly in the growing class of deep learning approaches, are bringing a new level of accuracy, and new interest, to applications across the image recognition and analysis spectrum, including healthcare and intelligence. Image analysis is another area where the information refinery is being redefined. After years of frustration among practitioners and computer scientists, image analytics is beginning to work.
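As a sense of how far off-the-shelf deep learning has come, a pretrained convolutional network can now label an ordinary photograph in a few lines. This sketch assumes TensorFlow/Keras is installed and that a local file named photo.jpg exists; it stands in for the kind of labeling task Yahoo once expected to need human curators for.

```python
import numpy as np
from tensorflow.keras.applications.mobilenet_v2 import (
    MobileNetV2, preprocess_input, decode_predictions)
from tensorflow.keras.preprocessing import image

# Deep network pretrained on ImageNet; no task-specific training library needed.
model = MobileNetV2(weights="imagenet")

img = image.load_img("photo.jpg", target_size=(224, 224))   # hypothetical local file
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)
# Top-3 (class_id, label, probability) guesses for the photo's content.
print(decode_predictions(preds, top=3)[0])
```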
The idea of the information refinery is alive and well in multiple fields, returning far more relevant and valuable results in just the last few months than many earlier attempts achieved over years. Machine translation, for example, has finally become mainstream. But the success of the new refineries should raise this question for anyone embarking on a cognitive computing project: What kind of information refinery will I need to accomplish my goals, and how will my refinery mix language, pictures, voices, and context? Just stating the question may open new directions for knowledge and business strategy.