Understanding What Matters With Text Analytics and NLP
Machine learning models
One of the foremost NLP trends is training text analytics systems for specific domains with dynamic machine learning models. With this approach, organizations can replace training computers via explicit instruction with training them via examples.
“Deep learning has been really transformative in this area because now, if we teach the machine enough examples of textual variation, the machine learns to create a more robust understanding of the problem and isn’t so sensitive to those variations,” Wilde explained.
The trade-off between this approach and the taxonomic one is clear: Organizations can avoid the extensive time required to build taxonomies by simply using annotated training data. The objective is to “just throw statistics and machine learning at the problem so it will all automatically work,” Aasman said. Although reduced time-to-value is an advantage of the deep learning approach, there are issues to consider, including the following:
♦ Training data: Machine learning models require immense amounts of training data, which organizations might not have for their domains. Transfer learning solves this problem by enabling subject matter experts to upload a couple of hundred examples (instead of thousands), highlight them, and teach dynamic models “the representative entities, key-value pairs, and classes they’re trying to derive from these documents,” Wilde noted (see the fine-tuning sketch following this list).
♦ Controlled vocabularies: Transformers and techniques such as Bidirectional Encoder Representations from Transformers (BERT) reduce the training data quantities machine learning models require, broaden the array of relevant training data, and help enforce a controlled vocabulary that otherwise isn’t as well defined as a taxonomic one. Thus, organizations can take a phrase and “generate a similar phrase that means the same, but can be used in multiple reports in a controlled way,” Mishra said (see the vocabulary-matching sketch following this list). Additionally, it’s possible to simply purchase libraries of terms and definitions. “Many companies end up buying those things to be able to incorporate those capabilities,” Shankar added.
♦ Practical knowledge: Exclusively using machine learning models to train text analytics decreases the real-world understanding and applicability of text. “People that do machine learning don’t want to spend the effort to create a vocabulary or the pragmatics or the semantics,” Aasman noted. “Machine learning has a place in all of this, but it misses part of the whole future solution where we have systems that understand what people are talking about.”
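To make the transfer learning point above concrete, here is a minimal sketch of fine-tuning a pretrained model on a few hundred domain examples. It assumes the Hugging Face transformers and datasets libraries; the model name, labels, and sample texts are illustrative placeholders, not anything drawn from the vendors quoted here.

```python
# Hedged sketch: transfer learning by fine-tuning a pretrained model on a small,
# domain-specific label set. Assumes Hugging Face transformers and datasets.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import Dataset

# A couple of hundred annotated examples in practice; two are shown here.
texts = ["Invoice total is due within 30 days of receipt.",
         "Shipment was delayed at customs for inspection."]
labels = [0, 1]  # e.g., 0 = finance, 1 = logistics (hypothetical classes)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # pretrained weights carry the general language knowledge

ds = Dataset.from_dict({"text": texts, "label": labels})
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                    padding="max_length"), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=ds,
)
trainer.train()  # only the small domain sample is needed on top of the pretrained model
```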
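In the same spirit, the following sketch shows how a BERT-style model can keep variant phrasings within a controlled vocabulary by mapping an incoming phrase to its closest canonical term. It assumes the sentence-transformers library; the model name and vocabulary terms are hypothetical.

```python
# Hedged sketch: collapsing variant wordings onto a controlled vocabulary
# with BERT-style sentence embeddings. Assumes the sentence-transformers library.
from sentence_transformers import SentenceTransformer, util

controlled_vocabulary = ["engine failure", "late delivery", "billing dispute"]
incoming_phrase = "the motor stopped working during transit"

model = SentenceTransformer("all-MiniLM-L6-v2")
vocab_emb = model.encode(controlled_vocabulary, convert_to_tensor=True)
phrase_emb = model.encode(incoming_phrase, convert_to_tensor=True)

# Cosine similarity picks the canonical term closest in meaning, so different
# phrasings can be used "in multiple reports in a controlled way."
scores = util.cos_sim(phrase_emb, vocab_emb)[0]
print(controlled_vocabulary[int(scores.argmax())])  # -> "engine failure"
```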
Hybridization
Nonetheless, synthesizing machine learning models with taxonomies that support rules-based systems is effective for text analytics and other natural language technologies, such as natural language understanding, natural language generation (NLG), and conversational AI. Mishra cited the utility of a general language model accessible with certain NLG solutions that work out of the box and allow users to apply their own taxonomies and ontologies. “If they want to customize it for their jargon, they can.”
Such models are typically based on computational linguistics and therefore rely on a host of models, including, but not limited to, machine learning. With this approach, organizations still get rapid time to value, can start without creating exhaustive taxonomies, and don’t need vast amounts of example data.
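One way to picture such a hybrid is to layer taxonomy terms, expressed as rules, on top of a pretrained statistical pipeline. The sketch below assumes spaCy 3.x and its small English model; the jargon patterns are invented for illustration.

```python
# Hedged sketch of hybridization: rule-based patterns from a custom taxonomy
# combined with a pretrained statistical pipeline. Assumes spaCy 3.x and the
# en_core_web_sm model.
import spacy

nlp = spacy.load("en_core_web_sm")  # out-of-the-box general language pipeline

# Rules derived from the organization's own taxonomy or ontology (hypothetical terms).
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "COMPONENT", "pattern": "hydraulic actuator"},
    {"label": "INCIDENT", "pattern": "unplanned outage"},
])

doc = nlp("The hydraulic actuator failed during the unplanned outage in Berlin.")
for ent in doc.ents:
    # Custom taxonomy entities appear alongside the pretrained model's entities
    # (e.g., GPE for Berlin).
    print(ent.text, ent.label_)
```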
Parsing and tokenization
Regardless of which text analytics approach is selected, parsing is one of the first NLP steps to take once systems have been trained for domains. Parsing is useful for garnering a basic syntactical understanding of text that Aasman defined as “how you order the words. You can have NLP parsers that say this is a noun, noun phrases, this is a verb, object, preposition, and all of that. But again, you don’t have understanding; you’re just looking at the structure of the language.”
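As a brief illustration of that purely structural output, the snippet below, assuming spaCy and its small English model, labels each token’s part of speech and syntactic role and lists the noun phrases the parser finds, with no claim to understanding the text.

```python
# Minimal parsing sketch, assuming spaCy's en_core_web_sm model: the parser
# exposes structure (parts of speech, dependencies, noun phrases), not meaning.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The auditors reviewed the quarterly report before the board meeting.")

for token in doc:
    # word, its part of speech, its syntactic role, and the word it attaches to
    print(token.text, token.pos_, token.dep_, token.head.text)

print([chunk.text for chunk in doc.noun_chunks])  # noun phrases the parser identified
```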
Tagging is the dimension of parsing in which parts of speech are assigned to different words and phrases. Tokenization enables tagging by intelligently breaking text into relevant words and phrases. Tokenization also plays a critical role in what Shankar described as the capacity to “convert from the unstructured world to the structured world, and that’s then passed into the computer because then the computer can understand those structured constructs.”
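To illustrate that unstructured-to-structured step, here is a short sketch, again assuming spaCy, in which tokenization breaks a sentence into tokens and tagging attaches a part of speech to each, yielding records a downstream system can consume.

```python
# Hedged sketch: tokenization plus tagging turning free text into structured
# records. Assumes spaCy with the en_core_web_sm model; the sentence is invented.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The vendor shipped the replacement parts on Tuesday.")

# Each token becomes a structured record: surface form, part of speech, lemma.
records = [{"token": t.text, "pos": t.pos_, "lemma": t.lemma_} for t in doc]
print(records[:3])
```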