TEXT ANALYTICS: Compelling products that pack a punch!
Text analytics software is designed to extract meaning from unstructured content, which accounts for about 85 percent of total content and includes reports, blogs, wikis and rich media such as videos. Text analytics can be used alone or in conjunction with structured data to help explain results found in analyses of quantitative data. Prime examples of the growing list of applications for text analytics are sentiment analysis, fraud prevention, e-discovery and analysis of medical data.
The volume, complexity and importance of medical information used in support of diagnosis and treatment of illness, as well as the dramatically rising costs of healthcare, drive initiatives to improve information use. One major contributor to the cost of healthcare is hospital readmission following an illness—an expense estimated at $17 billion per year. The Seton Healthcare Family took on the challenge of identifying those patients most likely to be readmitted after congestive heart failure to improve post-hospitalization outcomes, reduce the likelihood of readmissions and mitigate costs.
A great deal of medical information is in the form of unstructured data such as patient histories, notes taken by doctors during examinations and discharge notes. Generally, although the information is accessed by individual physicians, it is not analyzed broadly for the insights it can provide about outcomes or to support decision-making. Seton is now using IBM Content and Predictive Analytics to identify patients likely to be readmitted in order to target interventions to address the causes of readmissions.
Readmission predictors
As a first step, Seton suggested more than 100 possible factors that might be related to readmissions, and then analyzed patient data using IBM Content and Predictive Analytics to identify the best predictors. "One finding that emerged," says Craig Rhinehart, director of ECM strategy and market development at IBM, "was that the unstructured data provided better predictors of readmissions than structured data."
Some of the indicators were not found among the initial list of possible factors but were discovered through data mining. For example, living situation and drug abuse were strong predictors of readmissions, but their importance was revealed only after text analysis. In addition, a distended jugular vein was a predictor, but that condition appeared only in doctors' notes, not in lab test data or other quantitative information.
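To make the idea concrete, here is a minimal sketch of how indicators that exist only in free text might be flagged and compared against readmission outcomes. It is not IBM Content and Predictive Analytics or Seton's actual method. The phrase patterns, the function names and the toy records are all invented for illustration, and a production system would rely on clinical language processing rather than simple pattern matching.

```python
import re

# Hypothetical phrase patterns (assumptions, not taken from the Seton project).
INDICATORS = {
    "jugular_distension": re.compile(
        r"\b(distended jugular|jugular (venous )?distension|jvd)\b", re.I
    ),
    "lives_alone": re.compile(r"\blives alone\b", re.I),
    "drug_abuse": re.compile(r"\b(drug|substance) abuse\b", re.I),
}


def extract_flags(note):
    """Return the set of indicator names whose pattern appears in the note."""
    return {name for name, pattern in INDICATORS.items() if pattern.search(note)}


def _rate(outcomes):
    """Share of True values in a list, or None when the list is empty."""
    return sum(outcomes) / len(outcomes) if outcomes else None


def readmission_rates(records):
    """Compare readmission rates for patients with and without each flag.

    records: iterable of (note_text, was_readmitted) pairs.
    Returns {indicator: (rate_with_flag, rate_without_flag)}.
    """
    flagged = [(extract_flags(note), bool(readmitted)) for note, readmitted in records]
    rates = {}
    for name in INDICATORS:
        with_flag = [r for flags, r in flagged if name in flags]
        without_flag = [r for flags, r in flagged if name not in flags]
        rates[name] = (_rate(with_flag), _rate(without_flag))
    return rates


# Made-up toy records, for illustration only.
TOY_RECORDS = [
    ("Pt lives alone. JVD noted on exam.", True),
    ("History of substance abuse; distended jugular vein.", True),
    ("Lives with spouse, no edema.", False),
    ("Stable at discharge, family support at home.", False),
]
```

Calling `readmission_rates(TOY_RECORDS)` returns a pair of rates per indicator. A large gap between the two rates suggests a candidate predictor that would then need proper statistical validation on real data.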
Maximizing potential
IBM has a long history in text analytics, dating back to 1957 when the company published a paper on searching and classifying unstructured information, and offers a variety of products that support text analytics. The company also developed and in 2010 introduced Watson, perhaps best known for beating two "Jeopardy" champions in a contest held in February 2011. Watson is designed to understand natural language and produce answers to specific questions, and can be used in conjunction with text analytics to confirm hypotheses or seek alternative explanations for results. The Watson Healthcare Advisory Board was formed in March 2012 to offer guidance to IBM on ways in which Watson could be used in healthcare.
"The healthcare industry has a lot of potential to make greater use of its unstructured information," Rhinehart says. "This industry has not been an early adopter but can truly benefit from the developing technology. Now that more robust text analytics solutions are available, hospitals have the opportunity to improve outcomes and contain costs. By using this information, patients will receive better care, and resources will be focused on those who are most at risk."
Fraud detection using text analytics
In today's global economy, governments must balance competing priorities in international commerce, facilitating the flow of goods while ensuring security and compliance with import controls. When a shipment reaches the customs clearance section in the destination country, customs officials review the commodity's classification, text descriptions and related information, such as the country where the shipment originated and the declaration of value associated with the goods. After the review, officials decide whether the shipment is approved for import clearance or requires examination.
Shipment examinations are expensive and they slow the pace of commerce, so customs officials place a high priority on identifying those shipments most likely to pose a risk. Historical records show that certain combinations of information, such as commodity classification tariff heads, importer/supplier history and nonalignment with the declared shipment value, often indicate a risk of misclassification or misvaluation. Additional insights are obtained by mining the textual descriptions in the shipment declaration, which detail the exact commodity being imported (product quality, sizes, pack types, grades, brands and product specifications).
Calculating risk
KIE Square, a consulting firm that specializes in advanced analytics, has developed a BI application with a transaction risk framework built on the SAS Text Miner backbone. The application helps identify shipments likely to violate import regulations on the basis of the importer's declaration data and Transportation Entry & Manifest of Goods information.
Violations can occur across a wide spectrum. In some cases, the shipment does not meet regulations. For example, it may contain a product that is not permitted within a state's borders, which is a significant violation, or one for which quantities are restricted. Those issues are straightforward and can be detected through analysis of the structured information.
In other cases, however, the warning signs are more subtle; deliberate misclassification in order to obtain a reduced duty cost is harder to detect. The text analytics portion of KIE Square's application provides a much more granular analysis of the declaration than does the high-level structured data. Keywords in the bill of entry or shipping bill declaration, for example, can indicate that the classification is not correct. At that point, the customs officials can make the decision to inspect the shipment.
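As a rough illustration of the keyword idea, the sketch below checks a declaration's free-text description for words that conflict with the declared classification. It is not KIE Square's application and does not use SAS Text Miner. The heading names, the keyword lists and the function names are hypothetical placeholders, and in practice such rules would come from customs experts.

```python
import re

# Hypothetical rule table: for each declared heading, keywords that suggest the
# goods belong under a different heading. Names and keywords are invented.
CONFLICT_KEYWORDS = {
    "HEADING_A_UNFINISHED_STEEL": {"polished", "coated", "galvanized"},
    "HEADING_B_TOYS": {"lithium", "rechargeable", "gps"},
}


def conflicting_keywords(declared_heading, description):
    """Return the sorted keywords in the description that conflict with the heading."""
    tokens = set(re.findall(r"[a-z0-9]+", description.lower()))
    return sorted(tokens & CONFLICT_KEYWORDS.get(declared_heading, set()))


def needs_inspection(declared_heading, description):
    """Flag the shipment for review when at least one conflicting keyword is present."""
    return bool(conflicting_keywords(declared_heading, description))
```

For example, `conflicting_keywords("HEADING_A_UNFINISHED_STEEL", "Galvanized steel coil, polished finish")` returns the two conflicting words, which an official could use as the reason for ordering an inspection. A heading with no rules yields an empty list, so the sketch never flags a shipment without evidence.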
Because text mining of the historical textual data in declarations helps sub-classify the commodities through distinct keyword combinations, precise reference bands of prices can be developed for the commodities at a granular level. Those reference bands make it easy for customs officials to accept or reject the declared price of a commodity. A great deal of expertise goes into establishing the rules on which analyses in KIE Square's system are based so that the right keyword indicators of potential problems are included in the text analysis.
In addition, broader indicators from external sources are incorporated to identify potential sub-segments of the import universe where fraudulent activities may surge. The SAS analytic platform is flexible, and the company offers specific mining capabilities in the area of fraud that are customizable for various text mining needs.
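The reference-band idea can be sketched in a few lines. The example below groups historical declarations by the keyword combination found in their descriptions and then derives a price band for each group. It is an illustration under stated assumptions, not KIE Square's rules: the use of the first and third quartiles as band limits, the minimum sample size and all function names are choices made for this sketch.

```python
import re
from collections import defaultdict
from statistics import quantiles


def sub_class(description, keywords):
    """Build a sub-class key from the known keywords present in the description."""
    tokens = set(re.findall(r"[a-z0-9]+", description.lower()))
    return tuple(sorted(tokens & keywords))


def build_bands(history, keywords, min_samples=4):
    """Derive a (low, high) unit-price band per keyword combination.

    history: iterable of (description, unit_price) pairs.
    The band limits are the first and third quartiles (an assumption of this sketch).
    Sub-classes with fewer than min_samples prices get no band.
    """
    prices = defaultdict(list)
    for description, price in history:
        prices[sub_class(description, keywords)].append(price)
    bands = {}
    for key, values in prices.items():
        if len(values) >= min_samples:
            q1, _median, q3 = quantiles(values, n=4)
            bands[key] = (q1, q3)
    return bands


def price_in_band(description, price, bands, keywords):
    """Return True or False when a band exists, or None when no band is available."""
    band = bands.get(sub_class(description, keywords))
    if band is None:
        return None
    low, high = band
    return low <= price <= high
```

A declared price that falls outside the band for its sub-class is a signal for review rather than proof of misvaluation, and returning `None` for unseen keyword combinations keeps the sketch from judging commodities it has no history for.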