Great Expectations for Text Analytics
Language is tricky. We expect people to understand our verbal and written communications. Even when you say something you think is perfectly clear, it may not be clear to the person with whom you’re talking. Take a business lunch you’ve scheduled for 1 o’clock in the afternoon with a colleague. You decide to add a third person to the restaurant reservation and instruct your assistant, who may be digital or a human being, to change the reservation to three. Suddenly you find that the restaurant is expecting two of you at 3 o’clock.
Then there are all the different words that can be used to describe a common object. Take that piece of furniture on which you’re sitting. Is it a couch or a sofa? Or maybe a davenport or a chesterfield? To get really nit-picky, you might be sitting on a love seat, a chaise longue, a sofa bed, or a divan. Think this is amusing? If you’re a furniture retailer, it’s probably not all that funny. You need to build in word equivalencies to accommodate the terms potential customers might use.
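To make the idea of word equivalencies concrete, here is a minimal sketch of how a retailer's search might fold customer vocabulary into one canonical term. The table entries and the function name normalize_query are my own illustrative assumptions, not anything drawn from a real product, and a production system would need to handle multi-word terms and punctuation as well.

```python
# Hypothetical equivalence table: each variant maps to one canonical term.
# The entries are illustrative, taken from the couch/sofa example above.
EQUIVALENTS = {
    "couch": "sofa",
    "davenport": "sofa",
    "chesterfield": "sofa",
    "divan": "sofa",
}


def normalize_query(query: str) -> str:
    """Lowercase the query and swap each known variant for its canonical term."""
    words = query.lower().split()
    return " ".join(EQUIVALENTS.get(word, word) for word in words)


if __name__ == "__main__":
    print(normalize_query("Leather Couch"))
```

With a table like this, a shopper who types "davenport" and one who types "sofa" end up searching the same inventory. The hard part is less the lookup than deciding which terms really are equivalent for your customers.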
Music lovers are familiar with the Tchaikovsky problem. I’ve counted at least 40 variant spellings for the Russian composer’s last name, although the Russian Pyotr and the English Peter don’t pose much of an issue. I’m told that Chinese search engine Baidu is working on its own Tchaikovsky problem. Apparently, if you say, “Play music by Tchaikovsky,” the Chinese voice recognition software hears “Play music and try cough ski.” This is probably an extreme example, but it’s the sort of thing that frustrates knowledge managers and those involved with text analytics.
High Expectations
When initiating a project, whether it involves text analytics or a different technology, we have high expectations for its success. No one starts off determined to fail. The promise offered by text analytics is immense. Done correctly, text analytics can reveal consumer preferences, trending topics and how they’ve changed over time, sentiment regarding products and services, and whether your user interface is effective. It can also support enterprise search improvements, competitive intelligence, risk mitigation, content enrichment, fraud detection, brand management, and a whole host of other useful and necessary applications. Text analytics can improve health outcomes and provide insights into the effect of globalization on culture.
Unstructured data is replete with ambiguity. It’s not just the couch/sofa example. As enterprises face an increasing deluge of information—some generated internally, some from external communications, and even more from social media—ascertaining what is actually valuable becomes very difficult. People don’t always write with correct grammar, particularly when they’re composing an informal email, a Facebook status update, or a Tweet. Even in more formal communications, acronyms unique to the enterprise may pop up and it’s imperative to understand what the writer is referring to.
The lack of structure in documents compounds the difficulties engendered by the incredible volume of incoming data. The idea of text analytics is to surface high-value information hidden in that deluge of data. Luckily, Filiberto Emanuele, Director of Technology at Expert System Enterprise, provides us with a five-item checklist in the accompanying white paper.
Text Analytics Checklist
Emanuele starts with considering content. Content is content is content, right? Well, no, not really. Content can be Word documents, PDFs, Excel spreadsheets, social media postings, possibly even PowerPoints. All need special consideration, in terms of both format and language. He mentions OCR technology for scanning PDFs. I’ve seen some truly atrocious OCR documents, where the OCR operator, for example, didn’t understand columns and simply scanned straight across the page. The resulting gobbledygook was close to unreadable.
He then moves on to functionality and makes the very good point that text analytics is not one process, but rather a series of operations. Depending on what you want the ultimate outcome to be, you’d tailor your approach to that need.
Quality is implicitly part of any successful project. You want results that are high quality. But you have to realize that total perfection is never going to be obtained. What metadata is most appropriate to your project? Should you use cow, cattle, or bovine? This can be a very sticky wicket—and it took me years to learn what a sticky wicket actually is. Encountering the phrase in British novels, I knew the colloquial meaning—that the person was in trouble—but until I met an actual cricketer, I didn’t know it refers to the ground being damp so that the ball doesn’t bounce very well.
If the objective of the project relates to fraud, national security, or financial services, quality becomes vitally important. That’s when a productivity multiplier can boost the accuracy of the analysis.
The concepts of precision and recall are embedded in information science. Did you get only what you wanted, with nothing irrelevant mixed in (precision)? Did you get absolutely everything on your topic (recall)? The two usually pull against each other, so given the situation, which one takes precedence? It’s important to understand the parameters of the project before jumping into a technological solution.
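For readers who want the arithmetic behind precision and recall, the following is a small sketch using the standard definitions: precision is the share of retrieved items that are relevant, and recall is the share of relevant items that were retrieved. The function name and the document IDs are hypothetical, chosen only for illustration.

```python
from typing import Set, Tuple


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one query.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    An empty denominator is reported as 0.0 rather than raising an error.
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical example: four documents returned, three truly relevant.
    retrieved = {"doc1", "doc2", "doc3", "doc4"}
    relevant = {"doc1", "doc2", "doc5"}
    print(precision_recall(retrieved, relevant))
```

In this made-up case, two of the four returned documents are relevant, so precision is one half, and two of the three relevant documents were found, so recall is two thirds. A fraud investigation might accept lower precision to push recall higher, while a consumer-facing search box might make the opposite choice.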
Last on Emanuele’s checklist is integration. Text analytics projects are rarely standalone; they’ll be integrated into a larger operation.
Finding hidden value in unstructured data need not be an insurmountable problem. Thinking through the project before beginning greatly improves the odds of success. What I took away from this white paper was the necessity of asking a lot of questions about each of the five items on the checklist. Don’t walk into a project blind. Have it all scoped out before jumping into your text analytics solution.