Text Analysis: The Next Step in Search
Finding Without Knowing What is Available
or What You’re Looking For
First, low-level character encoding differences can have a huge impact on the searchability of data. English text is often stored in plain ASCII, a Windows ("ANSI") code page or UTF-8, whereas text in other languages may use any of a variety of legacy code pages or a Unicode encoding such as UTF-16, all of which map characters to bytes differently. Before a particular language’s archive can be full-text indexed and processed, every file must be decoded with exactly the right character mapping. Because the correct mapping may change from file to file, and may also differ between electronic file formats, this exercise can be significant and labor intensive. If the mapping is wrong, words that contain language-specific characters such as ñ, Æ, ç or ß (and there are hundreds more) will not be recognized at all.
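As a minimal illustration of why the mapping step matters, the Python sketch below decodes the same bytes with the right and the wrong code page. The helper name decode_with_fallback and its list of candidate encodings are assumptions made for this example only; a real archive would need per-file and per-format detection.

```python
from typing import Sequence, Tuple


def decode_with_fallback(
    raw: bytes, candidates: Sequence[str] = ("utf-8", "cp1252")
) -> Tuple[str, str]:
    """Try each candidate encoding in order; return (text, encoding used)."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, but mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


# The same word, stored as UTF-8 but read with a Windows code page.
utf8_bytes = "señor".encode("utf-8")
mis_decoded = utf8_bytes.decode("cp1252")  # wrong mapping, yet no error is raised

print(mis_decoded)                # seÃ±or
print("señor" in mis_decoded)     # False: a search for the word finds nothing

# Bytes written with a legacy code page are not valid UTF-8, so the
# helper falls through to the second candidate.
print(decode_with_fallback("señor".encode("cp1252")))  # ('señor', 'cp1252')
```

UTF-8 is tried first because it is strict and rejects most text written in another encoding. Single-byte code pages accept almost any byte sequence, so decoding alone cannot tell them apart, which is one reason the mapping work can be labor intensive.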
Next, the language of each file needs to be recognized, and the files need to be tagged with the proper language identification. For electronic files whose text is derived from an optical character recognition (OCR) process, or for data that still needs to be OCRed, this step can be even more complicated, because recognition errors distort the very words that language identification relies on.
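One simple way to approach language tagging is to count common function words per language. The sketch below is only an illustration under that assumption: the tiny STOPWORDS lists and the guess_language function are hypothetical, and production systems typically use far larger statistical models.

```python
import re

# Hypothetical, deliberately tiny function-word lists for illustration.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein"},
    "nl": {"de", "het", "een", "en", "is", "niet", "van"},
    "es": {"el", "la", "los", "y", "es", "de", "que"},
}


def guess_language(text: str) -> str:
    """Return the language code whose function words occur most often.

    Returns "und" (undetermined) when no known function word is found.
    """
    # Keep runs of letters only; digits, punctuation and underscores are dropped.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "und"


print(guess_language("Der Vertrag ist nicht gültig und das ist ein Problem"))  # de
print(guess_language("The contract is not valid and that is the problem"))     # en
print(guess_language("12345"))                                                 # und
```

Because the approach depends on recognizing short, frequent words, text that has been corrupted by OCR errors or a wrong character mapping will score poorly for every language.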
Straightforward text-analysis applications use regular expressions, dictionaries (of entities) or simple statistics (often Bayesian or hidden Markov models) that all depend heavily on knowledge of the underlying language. For instance, many regular expressions encode US phone number or US postal address conventions, and these patterns will not work in other countries or in other languages. Regular expressions used by text analysis software also often presume that words starting with a capital letter are named entities, an assumption that fails for German, where all nouns are capitalized. Another example is that in languages such as German and Dutch, words can be concatenated into new compound words, which English text analysis tools rarely anticipate. There are many more linguistic structures that US-developed text analysis tools often cannot handle.
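The two regular-expression pitfalls described above can be sketched in a few lines of Python. The pattern US_PHONE, the helper functions and the sample sentences are assumptions chosen for illustration; they are not taken from any particular product.

```python
import re
from typing import List

# A typical US-centric pattern: 3 digits, 3 digits, 4 digits,
# with an optional parenthesized area code.
US_PHONE = re.compile(r"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}")


def find_us_phones(text: str) -> List[str]:
    """Return every substring that looks like a US phone number."""
    return US_PHONE.findall(text)


def capitalized_candidates(sentence: str) -> List[str]:
    """Naive entity heuristic: every capitalized word after the first one."""
    words = sentence.split()
    return [w.strip(".,;:") for w in words[1:] if w[:1].isupper()]


print(find_us_phones("Call (212) 555-0147 today"))
# ['(212) 555-0147']
print(find_us_phones("Bel +31 20 123 4567 of +49 30 901820"))
# []  (Dutch and German numbers are missed)

print(capitalized_candidates("Yesterday the lawyer met Alice in Boston."))
# ['Alice', 'Boston']
print(capitalized_candidates("Gestern traf der Anwalt Alice in einem Haus in Berlin."))
# ['Anwalt', 'Alice', 'Haus', 'Berlin']  (common nouns are mistaken for entities)
```

In the German sentence, "Anwalt" (lawyer) and "Haus" (house) are ordinary nouns, so the capitalization heuristic produces false positives that it would not produce in English.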
To recognize the start and end of named entities and to resolve anaphora and co-references, more advanced text analysis approaches tag the words in each sentence with their part of speech. These natural language processing techniques depend completely on lexicons and on morphological, statistical and grammatical knowledge of the underlying language. Without extensive knowledge of a particular language, such tools will not work reliably, if at all.
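The dependence on a language-specific lexicon can be shown with a toy example. The sketch below assumes a hand-made English LEXICON and a pure lookup tagger, both hypothetical; real part-of-speech taggers also use context, morphology and statistics. Consecutive proper nouns are grouped to find where an entity starts and ends.

```python
from typing import List, Tuple

# Hypothetical miniature English lexicon: word -> part-of-speech tag.
LEXICON = {
    "the": "DET", "in": "ADP", "met": "VERB", "lawyer": "NOUN",
    "john": "PROPN", "smith": "PROPN", "san": "PROPN", "francisco": "PROPN",
}


def tag(tokens: List[str]) -> List[Tuple[str, str]]:
    """Look each token up in the lexicon; unknown words get the tag 'X'."""
    return [(token, LEXICON.get(token.lower(), "X")) for token in tokens]


def entity_spans(tagged: List[Tuple[str, str]]) -> List[str]:
    """Group runs of consecutive proper nouns into candidate entities."""
    spans, current = [], []
    for token, pos in tagged:
        if pos == "PROPN":
            current.append(token)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:  # flush a run that ends the sentence
        spans.append(" ".join(current))
    return spans


english = "John Smith met the lawyer in San Francisco".split()
german = "Johann Schmidt traf den Anwalt in Hamburg".split()

print(entity_spans(tag(english)))  # ['John Smith', 'San Francisco']
print(entity_spans(tag(german)))   # []  (the English lexicon knows none of these names)
```

With the English lexicon, the German sentence yields no entities at all, because nearly every token falls outside the tagger's knowledge.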
Only a few text analysis and text-analytics solutions provide real coverage for languages other than English. Due to large investments by the US government, languages such as Arabic, Farsi, Urdu, Somali, Chinese and Russian are often well covered, but German, Spanish, French, Dutch and the Scandinavian languages are rarely fully supported. These limitations need to be taken into account when applying text analysis technology in international cases.
A Prosperous Future for Text Analysis
Even with some of the limitations and challenges profiled here, on balance the next few years will see the extensive application of data mining in two areas: e-discovery and compliance. Associated with these are the cognate areas of bankruptcy settlements, due-diligence processes and the handling of data rooms during a takeover or a merger.
The final application in this context will unfold as major legislative changes and stricter control systems come into force, which will undoubtedly happen in the short term: companies will have to carry out regular (real-time) internal preventative investigations, deeper audits and risk analyses. Text analysis technology will become an essential tool for processing and analyzing the enormous amount of information in a timely fashion.
Although changes in the legal and financial world are typically evolutions rather than revolutions, a significant role for text analysis in e-discovery and e-disclosure certainly exists. Data collections are simply becoming too large to be reviewed sequentially, so collections need to be pre-organized and pre-analyzed. With text analysis, reviews can be carried out more efficiently and deadlines can be met more easily.
The challenge will be to convince courts and auditors of the correctness of these new tools. Therefore, a hybrid approach is recommended in which computers make the initial selection and classification of documents and, based on established investigation directives, human reviewers and investigators perform quality control and evaluate the suggestions the software produces. In this way, computers can focus on recall, and users can focus on precision.
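Recall and precision have standard definitions: recall is the share of all relevant documents that were selected, and precision is the share of selected documents that are relevant. The Python sketch below shows the division of labor in the hybrid approach using made-up document IDs; the function name and the numbers are assumptions for illustration only.

```python
from typing import Set, Tuple


def precision_recall(selected: Set[int], relevant: Set[int]) -> Tuple[float, float]:
    """Return (precision, recall) for a selection of document IDs."""
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: documents 1-4 are the truly relevant ones.
relevant = {1, 2, 3, 4}

# Step 1: the software casts a wide net so that nothing relevant is missed.
machine_selection = {1, 2, 3, 4, 5, 6, 7, 8}
print(precision_recall(machine_selection, relevant))  # (0.5, 1.0)

# Step 2: human reviewers discard the false positives from that selection.
after_review = machine_selection - {5, 6, 7, 8}
print(precision_recall(after_review, relevant))       # (1.0, 1.0)
```

Reviewers only see the machine's selection, so any relevant document the software misses cannot be recovered during review. That is why the automated step is tuned for recall first.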
ZyLAB’s Universal Approach to Text Analytics, E-Discovery and Compliance
Since 1983, ZyLAB has worked alongside professionals in the auditing, legal and intelligence communities to develop the best tools for investigating and managing large sets of archived data. These award-winning technologies have been bundled into the ZyIMAGE Information Access Platform (IAP), an integrated e-discovery, document, content and records management solution that enables businesses, auditors and legal professionals to capture, investigate, structure and disclose information in an efficient and secure manner.
ZyLAB offers specific process functionality, relevancy modeling and flexible content analytics and visualization, all supported by ZyIMAGE’s robust search capabilities and an XML-based archiving framework that can be applied to a number of specific applications:
- Email archiving;
- E-discovery and e-disclosure;
- Corporate compliance and contract management;
- Case management and litigation support;
- Back-office records management for organizations facing legal risk, such as construction, outsourcing, customer service, medical or HRM environments;
- Federal and local government records management; and
- Historical files.
ZyIMAGE IAP is optimized for these applications due to a unique combination of search technology, security and business-focused content-management functionality. ZyLAB can quickly deploy even the most complex installations of specialist solutions and provide all the necessary training, documentation, support and maintenance.
ZyLAB also offers unique text-analytics technology that supports more than 200 languages and can be easily deployed to scale. The ZyIMAGE IAP enables organizations to bring e-discovery in house and stabilize comprehensive records management and knowledge management initiatives.