AI and the Building Blocks of Intelligent Content
At Data Conversion Laboratory, we say that AI and related technologies enable organizations to revisit high-value projects that were previously too expensive to be practical. Common attributes of data and content that could benefit from AI and related technologies include:
♦ Digitized without structure—Scanning alone, with “dirty” OCR, is a good first step to preserve documents, but the results are image-based PDFs, which are not easily searchable. Modern AI, specifically NLP, can extract intelligence from that previously digitized content. Organizations can achieve more than what was feasible before!
♦ Complex content or data—In the past, variable data (or content) types with special characters, math, chemical formulae, etc., were “digitized” as images. That means filtered search and data analysis could not be performed on or with this type of information. New levels of accuracy are now possible with computer vision.
♦ Security and automation—New AI techniques provide capabilities that were previously impossible in a manual or semi-automated process. Now, cost-effective solutions exist to deal with confidential or sensitive data.
The Wizard Behind the Curtain
Some people don’t really care how data or content is structured. But it’s important to understand if and how you are using intelligence and technology to structure data or content (or if your vendor is!). DCL uses onshore staff with top technology to create structure where it didn’t exist before. The technology has evolved over the years, thanks to our work with structured markup languages. We have built large training sets that enable us to use AI / NLP and generate accurate and rich content.
Let’s look at a simple example. The following paragraph references mechanical parts:
FIG. 6 illustrates a diagram of the signal adaptive pre-filter 1200 and motion detector 1300 section within the segmented temporal processor 1400.
NLP looks at words, the order of words, and neighboring words, and can discern exactly what counts as a “part”:
<para>FIG. 6 illustrates a diagram of the <part-name>signal adaptive pre-filter</part-name> <part-number>1200</part-number> and <part-name>motion detector</part-name> <part-number>1300</part-number> section within the <part-name>segmented temporal processor</part-name> <part-number>1400</part-number>.</para>
This is possible because DCL has developed extensive AI training sets over the decades. The combination of NLP, computer vision, and automation enables the computer to “read,” “understand,” and contextually structure complex technical text buried in free-form content.
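To make the target markup concrete, here is a minimal Python sketch that tags part names and part numbers in the example sentence. It is not DCL's NLP pipeline: it swaps in a simple rule-based heuristic (a regular expression) for a trained model, and it assumes that a part name is a run of words between a determiner or conjunction and a reference numeral of three or more digits. The function name `tag_parts` and the stop-word list are hypothetical, chosen only for illustration.

```python
import re
from xml.sax.saxutils import escape

# Assumption: these words mark the left edge of a candidate part name.
_STOP = r"(?:the|a|an|and|or|of|in|within)"

# A stop word, then one or more non-stop words (the part name),
# then a reference numeral of three or more digits (the part number).
_PART = re.compile(
    rf"\b({_STOP})\s+((?:(?!{_STOP}\b)[A-Za-z][A-Za-z-]*\s+)+)(\d{{3,}})\b",
    re.IGNORECASE,
)


def tag_parts(sentence: str) -> str:
    """Wrap heuristic part names and numbers in XML tags inside a <para>."""

    def wrap(match: re.Match) -> str:
        lead = match.group(1)
        name = match.group(2).strip()
        number = match.group(3)
        return (
            f"{lead} <part-name>{name}</part-name> "
            f"<part-number>{number}</part-number>"
        )

    # Escape &, <, > first so the result stays well-formed XML.
    return "<para>" + _PART.sub(wrap, escape(sentence)) + "</para>"


if __name__ == "__main__":
    text = (
        "FIG. 6 illustrates a diagram of the signal adaptive pre-filter 1200 "
        "and motion detector 1300 section within the segmented temporal "
        "processor 1400."
    )
    print(tag_parts(text))
```

On the example sentence, this heuristic picks out the same three name and number pairs as the marked-up paragraph. It would break on text where numerals are not reference numbers or where part names contain stop words, which is the kind of variability that trained NLP models are meant to handle.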
Real-World Application: The United States Patent and Trademark Office (USPTO)
The USPTO processes millions of trademark and patent applications. The information is dense, combining unstructured text, images, math, metadata, and more. Patent examiners required a system that would allow them to search the information in patent applications. They also required a process that takes unstructured, confidential information and structures it with zero human intervention.
Explore the details and complexities of this project by downloading the white paper, “Lights-Out Automation: Using AI to Create Structured Data From Static Documents.”