

Speaking in tongues: Foreign language KM Technologies

text mining activities. REX is built using a statistical processing engine that learns by experience using large training sets of foreign language documents. Basis has designed REX so that it comes ready to use with a variety of core languages. Out of the box, REX supports Arabic, Chinese, Dutch, English, French, German, Italian, Japanese and Spanish.

It is also very easy to extend the named entity extraction service to other foreign language models, and Basis is continually working to add new off-the-shelf language support for important languages of interest to its customers. Figure 3 (KMWorld, Vol 16 Issue 7, page 9) shows an example of how REX identifies each language in a document, categorizes each entity type by a color code, identifies the different scripts within the document and identifies the digital encoding schemes used within the document.

Advanced services

RLP offers a number of advanced services built on top of its basic services, which I would categorize as follows: language and document encoding identification, name identification (aka name normalization) and transliteration services. As with the basic RLP services, these services were developed by combining a variety of core linguistic capabilities.

Rosette Language Identifier (RLI)

The Rosette Language Identifier (RLI) is a critical part of multilingual text analytics. For example, consider multilingual search engines whose indexing processes crawl Web sites in multiple languages all over the world. A search engine that supports multilingual indexing must be able to ingest any text it finds and quickly and accurately determine what language the indexing engine must use to process the data.

RLI can reliably, efficiently and quickly recognize the language of a document, or multiple languages within a single document. At its core, RLI relies on the linguistic concept of an n-gram: a short run of n consecutive characters. Breaking a word into n-grams yields multiple overlapping parts that can be compared with the parts of other words.

For each language grouping, Basis builds a profile of the n-grams describing a particular language. Out of the box, RLI has 114 profiles that can recognize 43 discrete native languages and 33 native encodings, and it includes support for UTF-8 for every supported language.

The technology requires only plain text as an input and statistically ranks its findings with the most likely candidate first, followed by multiple matches in descending order. Figure 4 shows RLI detecting a language and then ranking its findings. As can be seen in Figure 4, the language profile that returns the largest number of n-gram matches from the input text is ranked the highest.
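To make the ranking idea concrete, here is a minimal sketch in Python of n-gram profile matching. It is not Basis' implementation. The three tiny training sentences, the choice of character trigrams and the function names (char_ngrams, build_profile, rank_languages) are all assumptions made for illustration; real profiles are built from large training corpora.

```python
def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text (padded with spaces)."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_profile(sample, n=3):
    """A toy profile: the set of n-grams seen in a training sample."""
    return set(char_ngrams(sample, n))


def rank_languages(text, profiles, n=3):
    """Rank languages by how many of the input's n-grams occur in each profile."""
    grams = char_ngrams(text, n)
    scores = {
        lang: sum(1 for g in grams if g in profile)
        for lang, profile in profiles.items()
    }
    # Most likely candidate first, then the rest in descending order.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical training samples, far too small for real use.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "french": "le renard brun rapide saute par dessus le chien paresseux et le chat",
    "german": "der schnelle braune fuchs springt auf den faulen hund und die katze",
}
PROFILES = {lang: build_profile(sample) for lang, sample in SAMPLES.items()}

ranking = rank_languages("the dog and the fox", PROFILES)
for lang, score in ranking:
    print(lang, score)
```

With these toy samples the English profile collects the most matches for the input phrase and is listed first. A production system would weight n-grams by frequency rather than simply counting set membership.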

The technology requires approximately 128 bytes of data for 100 percent detection and can identify languages using very limited information, such as the title of an RSS feed, the title of an HTML document or the subject of an e-mail message. With the latest version of the technology, Version 5, Basis characterizes RLI as having an accuracy of 91 percent. That is, in a test of 800,000 HTML titles with an average of 39 characters in each title, it misidentified less than 9 percent of the languages in the corpus of documents.

Rosette Name Matcher (RNM)

One of the most difficult problems in linguistics, particularly for non-native speakers of a given language, is recognizing a person or place name and all of its possible variations. Rosette Name Matcher (RNM) helps solve a variety of related problems whereby a person or place may have more than one name. A given name may have multiple spellings in different parts of the world. There may be no international standard for spelling a person’s name or a location name.

If you follow the news of the ongoing wars in Iraq and Afghanistan, you are likely to see Arabic person and place names that leave you confounded. Arabic and other Middle-Eastern languages often have complex naming conventions and rules that make it difficult for speakers of other languages, such as English, to recognize given entities or locations.

RNM solves the name permutation problem in multiple languages including Arabic, Chinese, Korean, Pashto, Persian and Urdu. It allows a user to use his or her language to look up names found in target foreign language documents. For example, for Arabic names in Arabic script in a database, an English-speaking user would be able to enter them using the English letter sounds that match the Arabic language sounds. That phonetic approach is easy for the user to understand and implement.

RNM analyzes the queries using fuzzy algorithms and can successfully align even a partial match. For example, I might look up the name Gaddafi (as in Colonel Qadhdhafi of Libya; I’m certain you remember who he is). If you think about it, there are many different ways I might spell his name. RNM will match my attempt to spell the name Gaddafi in English and give me the appropriate spelling in the native Arabic script. Basis maintains a lexical database containing the proper name spelling for most common entity names. Figure 5 (KMWorld, Vol 16 Issue 7, page 11) shows all the different ways in which I might attempt to spell Gaddafi phonetically in English and the actual resulting name given in Arabic as supplied by RNM.
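The general idea of matching many English spellings to one native-script entry can be sketched in a few lines of Python. This is an illustration only, not RNM's algorithm: the three-entry lexicon, the crude phonetic_key normalization rules and the 0.8 threshold are all assumptions, and the fuzzy comparison uses SequenceMatcher from Python's standard difflib module as a stand-in for whatever Basis uses.

```python
import re
from difflib import SequenceMatcher

# Hypothetical mini-lexicon: romanized name -> native-script spelling.
LEXICON = {
    "qadhdhafi": "القذافي",
    "baghdad": "بغداد",
    "husayn": "حسين",
}

# Assumed normalization rules that collapse common romanization variants.
# Digraphs are listed first so they are replaced before single letters.
RULES = (("kh", "k"), ("gh", "g"), ("dh", "d"), ("q", "k"), ("g", "k"), ("y", "i"))


def phonetic_key(name):
    """Reduce a romanized name to a rough sound-alike key."""
    key = re.sub(r"[^a-z]", "", name.lower())
    for old, new in RULES:
        key = key.replace(old, new)
    return re.sub(r"(.)\1+", r"\1", key)  # collapse doubled letters


def match_name(query, lexicon=LEXICON, threshold=0.8):
    """Return (native_spelling, score) for the best match, or (None, score)."""
    query_key = phonetic_key(query)
    best, best_score = None, 0.0
    for roman, native in lexicon.items():
        score = SequenceMatcher(None, query_key, phonetic_key(roman)).ratio()
        if score > best_score:
            best, best_score = native, score
    if best_score < threshold:
        return None, best_score
    return best, best_score


for attempt in ("Gaddafi", "Qaddafi", "Khadafy", "Gadhafi", "Gadaf"):
    print(attempt, "->", match_name(attempt)[0])
```

All five attempts, including the partial spelling "Gadaf," resolve to the same lexicon entry, while an unrelated query such as "Smith" falls below the threshold and returns no match.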

Transliteration Assistant

Putting it all together, RLP uses its various linguistics components to build applications that use transliterations written in one language to approximate the sounds of the actual foreign language word. In fact, transliteration means “a systematic way to convert characters in one alphabet or phonetic sounds into another alphabet.” Basis supplies a variety of interactive transliteration tools built on its various RLP components.
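As a rough illustration of that definition, the Python sketch below converts Arabic script to Latin letters one character at a time. The ten-letter table is a simplified assumption and does not follow any published transliteration standard or any Basis product; the function name transliterate is likewise hypothetical.

```python
# Simplified, hypothetical Arabic-to-Latin table (not a published standard).
ARABIC_TO_LATIN = {
    "ا": "a",
    "ب": "b",
    "د": "d",
    "غ": "gh",
    "ك": "k",
    "ل": "l",
    "م": "m",
    "ن": "n",
    "ر": "r",
    "س": "s",
}


def transliterate(text, table=ARABIC_TO_LATIN):
    """Map each character through the table; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


print(transliterate("بغداد"))  # the Arabic spelling of Baghdad
```

Because Arabic script normally omits short vowels, a letter-by-letter mapping of the word for Baghdad yields "bghdad" rather than the familiar English spelling. That gap is one reason practical transliteration relies on linguistic analysis and agreed standards rather than a simple lookup table.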

One of Basis’ most recent products is the Transliteration Assistant (XA), which is a plug-in for Microsoft Word, Excel and Access. The company also has a variety of other custom transliteration tools in multiple languages that provide standardized ways to spell person and geographical location names. Basis’ transliteration tools observe a variety of different transliteration standards depending on the language in question. For example, with regard to Arabic, there are four different transliteration standards. When U.S. government or intelligence community (IC) personnel perform transliteration services, they must follow congressionally mandated “IC standard”
