Text analytics: greater usability, less time to insight
Text analytics has had several distinct generations in the course of its evolution. From keyword-based analysis to clustering of concepts and sentiment analysis, text analytics has grown increasingly sophisticated. The next advance will be more influential, yet its mechanism will be much less visible. “The wave of the future is to make things easier for users,” says Matt LeGare, product manager for Attensity. “The early adopters have been willing to push through the technology and big data to process a large corpus of information, but for others, usability has to improve.”
The mechanism will be less visible because more of the computer’s work will be going on behind the scenes. “Right now, users typically have to provide a lot of manual input to fine-tune text analytics—for example, with categorization,” says LeGare, “but we are aiming for an automated categorization rate of 80 to 100 percent.” That goal can be achieved only if the text analytics solution becomes more intelligent, can make inferences, and provides or seeks out knowledge that is relevant to users.
Another change has been the shift away from keywords. “The point now is to get to the root cause, not just documents that are related to certain topics,” says LeGare. “We want to provide the right context so that users are getting the answers they want, reducing the time to insight.” Flexibility in general will also be a trend in text analytics, so that the core models are less rigid. “The use of statistical methods can lead to a rigid output from machine learning algorithms. In the future, the systems will become more malleable,” he adds.
Supporting knowledge workers
Consultants at Deloitte require a steady flow of information in their areas of expertise to stay current on the work being carried out by colleagues and more broadly in the disciplines in which they operate. Deloitte is a professional services firm that provides support to its clients in a wide range of management and financial activities, so the body of knowledge required is extensive. “We tried using traditional search with document repositories,” says Ben Johnson, senior manager at Deloitte, “but it did not meet our needs for a dynamic, interactive environment.”
Deloitte had been using Brainspace Discovery 5 for some applications in the company and thought that the new Brainspace Enterprise product might be a good match. Brainspace analyzes the text in each repository to identify concepts. Deloitte decided to run a few pilot projects to test whether it provided the results its consultants needed.
The company is using it on three repositories: documents generated by its consultants; premium subscription content that it purchases; and publicly available text documents, blogs and other content that can be captured and put into a third repository. Employees create collections that reflect their areas of interest by selecting or “liking” relevant documents or performing natural language searches. Brainspace then proactively pulls new information into each user’s collection, presented on a news-like dashboard.
The Brainspace interface allows users not only to view the concepts extracted and inferred by Brainspace, but also to easily interact with and refine their queries and collections. Users can also annotate documents, and those annotations become part of the collection, discoverable by other users who share their areas of interest. “Depending on how the user interacts with the document, the system will get trained on what related topics are also of interest,” Johnson explains. “In that respect, we are approaching a machine learning model.”
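The general idea behind that kind of proactive suggestion can be sketched with a simple content-based approach: score each candidate document by its similarity to the documents a user has already liked and surface the closest matches. The snippet below is only an illustration of that principle using off-the-shelf Python tooling; the sample documents are invented, and this is not Brainspace’s actual algorithm.

```python
# Minimal sketch of "proactively pulling" new documents into a collection:
# score each candidate by its similarity to documents the user already liked.
# This is a generic content-based approach, not Brainspace's actual method.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

liked = [
    "supply chain risk analysis for manufacturing clients",
    "logistics cost benchmarking in the automotive sector",
]
candidates = [
    "new study on automotive supply chain disruptions",
    "guide to preparing expense reports",
]

vectorizer = TfidfVectorizer(stop_words="english").fit(liked + candidates)
# The user's interest profile is the average of the documents already liked.
profile = np.asarray(vectorizer.transform(liked).mean(axis=0))
scores = cosine_similarity(profile, vectorizer.transform(candidates))[0]

# Candidates most similar to the user's collection are suggested first.
for doc, score in sorted(zip(candidates, scores), key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {doc}")
```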
Building a Brain
As the name suggests, Brainspace is aimed at augmenting or extending the user’s knowledge. It would not be the best solution for users who mainly want to seek out specific documents, such as a report written by a particular person in a particular year. “For traditional search, the user would need to run a properly structured query against a properly structured database,” Johnson says.
In exchange, no manual intervention is required with Brainspace; it learns dynamically from the documents themselves without the use of lexicons or ontologies. “The more traditional approach did not work for us because the repositories were too static and not scalable enough,” Johnson says. “Having run several pilot programs and gotten positive feedback from our employees, we plan to roll it out to a broader audience in the near future.”
The core technology in Brainspace’s products transforms unstructured text into a multidimensional space of related concepts. “A typical Brainspace is a hyperspace of somewhere between 275 and 350 dimensions,” says Dave Copps, CEO of Brainspace. “Our platform can ingest hundreds of millions of documents from virtually any source, and—unsupervised—transform the information into what we call a Brain. It is not just a repository of documents but a method of representing relationships among different documents according to the concepts they embody.”
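The notion of a multidimensional concept space can be illustrated with a common open-source analogue: weighting terms with TF-IDF and projecting documents into a reduced space with truncated SVD (latent semantic analysis). This is a sketch of the general technique, not Brainspace’s proprietary method; the sample documents are invented, and a real corpus would use a dimensionality closer to the range Copps describes rather than the tiny value used here.

```python
# Sketch of building a low-dimensional "concept space" from raw text using
# TF-IDF followed by truncated SVD. Documents that share concepts (not just
# keywords) end up close together in the resulting space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Quarterly audit findings for the retail banking client",
    "Risk assessment framework for retail banking engagements",
    "Employee onboarding checklist for the consulting practice",
]

# Step 1: turn each document into a sparse term-weight vector.
tfidf = TfidfVectorizer(stop_words="english")
term_matrix = tfidf.fit_transform(documents)

# Step 2: project the term vectors into a dense space of related concepts.
# A real corpus would use roughly 275 to 350 components, per the article.
svd = TruncatedSVD(n_components=2, random_state=0)
concept_space = svd.fit_transform(term_matrix)

# Pairwise similarity in concept space: the two banking documents score
# higher with each other than with the onboarding checklist.
print(cosine_similarity(concept_space))
```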
The machine learning component stems from the fact that Brainspace incorporates new information from the user into its subsequent analyses and brings in suggested materials for each collection. It does not use taxonomies or ontologies. “Companies have treasure troves of data,” Copps says, “but they don’t have the resources to spend weeks or months creating a taxonomy and adding metadata, so they can’t make use of it. We can build a Brain in 45 minutes, and put the collective knowledge of the company at the fingertips of everyone right away.”
Supervised versus unsupervised
Text analytics often has its greatest leverage when used in combination with quantitative analytics or business intelligence (BI) solutions. SAS Enterprise Miner and SAS Text Miner work in tandem to extract explanations for quantitative results from unstructured information. “If SAS Text Miner is used with Enterprise Miner, it becomes possible to do predictive modeling with a combination of structured and unstructured data,” says Sascha Schubert, director for analytics product marketing at SAS, “so a company could anticipate that a customer of a certain age making a certain type of negative comment would be at risk of canceling a service and therefore should get a particular offer in order to prevent that.”
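A minimal sketch of that idea in Python shows how structured fields and free-text comments can feed a single predictive model. The column names, sample records and churn labels are invented for illustration, and the scikit-learn pipeline stands in for what SAS Enterprise Miner and SAS Text Miner would do with their own tooling.

```python
# Hedged sketch of combining structured and unstructured data in one
# predictive model: customer age plus free-text comments used to estimate
# the risk that a customer cancels a service.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

data = pd.DataFrame({
    "age": [34, 62, 45, 29],
    "comment": [
        "great service, very happy",
        "billing error again, considering cancelling",
        "support never answers, this is useless",
        "fast delivery, no complaints",
    ],
    "churned": [0, 1, 1, 0],  # historical outcome used as the training label
})

# The structured column passes through as-is; the free text is vectorized.
features = ColumnTransformer([
    ("age", "passthrough", ["age"]),
    ("text", TfidfVectorizer(), "comment"),
])

model = Pipeline([("features", features), ("clf", LogisticRegression())])
model.fit(data[["age", "comment"]], data["churned"])

# Score a new customer: an older customer making a negative comment.
new_customer = pd.DataFrame({"age": [60],
                             "comment": ["thinking about cancelling my service"]})
print(model.predict_proba(new_customer)[:, 1])  # estimated churn risk
```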
The text analytics portion can be done in two ways. If there is a predefined taxonomy, rules can be created to categorize incoming documents. Alternatively, when the possible answers are not known and the significant topics have not yet been identified, an unsupervised machine learning algorithm can be applied to structure the text data.
“These two approaches use different types of algorithms,” Schubert explains. “Natural language processing can be used to find important terms in the text and how they are related and generate classification rules.” New documents can then be categorized against those rules, which should be kept transparent so that a human can modify them. That hybrid approach supports active learning, in which the user interacts dynamically with the algorithm to guide the system toward an improved solution, enabling interactive model building.
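The contrast between the two approaches can be made concrete with a short sketch: transparent, human-editable keyword rules for a predefined taxonomy on one side, and unsupervised clustering that lets the data suggest its own groupings on the other. The categories, rules and sample documents below are invented for illustration and do not reflect any particular vendor’s implementation.

```python
# Approach 1: rule-based categorization against a predefined taxonomy.
# Approach 2: unsupervised clustering when the topics are not known upfront.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "refund not received after return",
    "password reset link does not work",
    "charged twice on my invoice",
    "cannot log in to my account",
]

# Transparent, human-editable rules mapping keywords to categories.
rules = {"billing": ["refund", "charged", "invoice"],
         "access": ["password", "log in"]}

def categorize(doc: str) -> str:
    for category, keywords in rules.items():
        if any(keyword in doc for keyword in keywords):
            return category
    return "uncategorized"

print([categorize(doc) for doc in documents])

# Unsupervised alternative: let the data fall into its own groupings.
vectors = TfidfVectorizer().fit_transform(documents)
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors))
```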
“Machine learning originated in the computer science discipline and data mining in the statistical community, and these communities have developed different terminology,” Schubert says. “The statistical community would call unsupervised learning ‘cluster analysis,’ for example.” The growing importance of advanced analytics applications that combine structured and unstructured data is bringing those two communities closer together.
One recent development in machine learning is so-called deep learning, which uses more complex neural network architectures. “These networks have an architecture with inputs and outputs like every predictive model,” Schubert says, “but there are hidden layers that allow them to model more complex structures in the data. They are the ones that can provide very accurate image recognition and things like self-driving cars.” In the domain of text analytics, those networks show promise for the automated processing of less clean text sources—for example, the informal language used on social networks such as Twitter or on discussion boards.
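The “hidden layers” Schubert mentions can be illustrated with a very small neural network classifying informal, social-media-style text. Production deep learning systems are far larger and typically use word embeddings rather than simple term weights; the posts and labels below are invented for illustration.

```python
# Small neural network sketch: hidden layers between the input (term weights)
# and the output (predicted sentiment) let the model learn more complex
# structure in noisy, informal text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

posts = [
    "omg this app keeps crashing smh",
    "luv the new update, works gr8",
    "cant even open it anymore, trash",
    "super smooth now, nice job devs",
]
labels = ["negative", "positive", "negative", "positive"]

model = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0),
)
model.fit(posts, labels)
print(model.predict(["new version is trash, keeps crashing"]))
```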