Speaking in tongues: Foreign language KM Technologies
transliteration rules. In addition to the IC standard, Basis supports the U.S. Board on Geographic Names (BGN), the Standard Arabic Technical Transliteration System (SATTS) and Basis’ own internal transliteration format.
The XA tools for MS-Word, Excel and Access are convenient, particularly for linguists who work in a standard Microsoft Office environment. The plug-ins simply add a “Transliteration” menu option to the standard MS-Word, Excel or Access application menu bar. The tool works the same way in all three Microsoft applications; however, I personally like the MS-Excel integration because I can use spreadsheet functions to build a spreadsheet of transliterated names in different formats.
In Figure 6 (KMWorld, Vol 16 Issue 7, page 11), I show a spreadsheet I built containing a list of Arabic names. The first column holds unvocalized Arabic names (written without the vowel marks). In the second column, I’ve had the XA add the Arabic vocalizations. In the third column, I’ve had the XA convert the Arabic text into transliterated English using the IC standard for transliteration. In the fourth column, I use the BGN standard for transliteration, which produces slightly different English results.
As can be seen in Figure 6, the “Translate” menu function on the Excel application menu bar is self-contained and provides a variety of tools and help features. I found the help feature both comprehensive and useful.
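To make the difference between vocalized and unvocalized Arabic concrete, here is a minimal Python sketch that goes in the easy direction only: it strips the vowel marks from a vocalized name. It is not how the XA works. Adding vocalization, as the XA does, requires real linguistic analysis, and the function name `strip_diacritics` is my own invention for illustration. The sketch assumes only that Arabic vowel signs are encoded as Unicode combining marks (category `Mn`), which the standard `unicodedata` module can detect.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks (Unicode category Mn) from a string.

    In Arabic this drops the short-vowel signs and the shadda
    (consonant-doubling mark), leaving the bare consonant skeleton.
    Hypothetical helper for illustration, not part of any Basis product.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# "Muhammad" fully vocalized: meem + damma, hah + fatha,
# meem + shadda + fatha, dal
vocalized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"
unvocalized = strip_diacritics(vocalized)

print(vocalized, "->", unvocalized)
print(len(vocalized), "code points ->", len(unvocalized), "code points")
```

The two strings name the same person, but only the second is what you typically find in newspapers and databases, which is why a tool that can restore the vowels is useful before transliterating.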
Other tools
Basis has used the different components of its RLP to build a variety of tools and utilities. For example, it has created a geospatial mapping tool called GeoScope Map Viewer, shown in Figure 7 (KMWorld, Vol 16 Issue 7, page 11). GeoScope uses a gazetteer of foreign place names and allows you to search a map using your native language equivalent of foreign language place names. In Figure 7, GeoScope loads a tourist map of Iraq and an Iraq gazetteer compiled from data from the old “Baathist” Iraqi Office of Tourism. All of the map place names are shown in Arabic script. In this example, I typed in a fuzzy search for Tikrit using the English search term “tekreet,” and GeoScope placed a crosshair right at the location it labeled Quada’ Tikrit.
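For readers who want a feel for what a fuzzy place-name search involves, here is a minimal Python sketch using the standard library’s `difflib` module. It is an illustration only and makes several assumptions: the short romanized gazetteer is made up, the helper name `fuzzy_lookup` is hypothetical, and simple string similarity stands in for whatever matching GeoScope actually performs against its Arabic-script gazetteer, which the product documentation would have to confirm.

```python
import difflib

# Hypothetical romanized gazetteer; a real one would hold thousands of
# entries and map each name to coordinates and its Arabic-script form.
GAZETTEER = ["Tikrit", "Baghdad", "Mosul", "Kirkuk", "Basra"]

def fuzzy_lookup(query, names=GAZETTEER, cutoff=0.5):
    """Return the gazetteer name most similar to the query, or None.

    Similarity is difflib's ratio (0.0 to 1.0); candidates scoring
    below `cutoff` are rejected.
    """
    by_lowercase = {name.lower(): name for name in names}
    hits = difflib.get_close_matches(
        query.lower(), list(by_lowercase), n=1, cutoff=cutoff
    )
    return by_lowercase[hits[0]] if hits else None

print(fuzzy_lookup("tekreet"))  # Tikrit
print(fuzzy_lookup("zzzz"))     # None
```

A misspelling such as “tekreet” still shares enough letters in the right order with “Tikrit” to clear the cutoff, while the other place names do not. Raising the cutoff makes the search stricter; lowering it returns looser matches.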
Another useful set of utilities is offered as the Arabic Desktop Suite, which contains a Knowledge Center that lets the user search for heads of state using English transliteration. In Figure 8, I do a search on the term Karzai, which returns information telling me that Hamid Karzai is the president of Afghanistan.
Basis has built a variety of other tools on top of its RLP technology in a number of languages. Those tools range from Arabic text editors that are part of the Arabic Desktop Suite to name matchers in Korean, other GeoScope Map Viewer mappings and a number of other utilities and applications. Keep in mind that RLP is a linguistics platform with a diverse set of tools, libraries, scripts and applications that can be used to build any type of linguistic support application or service you might imagine, or to add foreign language support to any application a developer might build.
Doing the hard work
Foreign language tools for KM are essential to building systems that can accurately and completely support text mining either on the Web or within the enterprise. Basis provides a comprehensive set of foreign language tools, starting with Unicode libraries for multiple foreign languages to support internationalization of a developer’s existing applications. RLP has a set of basic services, including base linguistic capabilities that use natural language processing techniques to provide highly accurate means of accessing parts of speech, performing indexing, entity extraction, stemming, normalization and other linguistic capabilities essential to text mining.
Basis builds on top of those basic services to deliver a range of advanced services, including language identification, name matching, translation and transliteration. RLP should be thought of as a toolkit or framework that combines those basic and advanced services to allow developers to add a range of foreign language capabilities to existing applications (beyond plain old Unicode internationalization). Moreover, developers can use the tools to build even more advanced linguistic applications, as exemplified by GeoScope Map Viewer, Transliteration Assistant for MS-Office and the Basis Arabic Desktop Suite.
As a software developer myself, I rest easy in the knowledge that I don’t have to develop my own Unicode extensions for the applications I write. I don’t know if you have ever looked at the Unicode standards for developing internationalization, but that stuff looks really complicated and difficult. I thank my lucky stars that I know about companies like Basis Technology that have done all the hard work and more for me.
A Linguist’s Dictionary
- Linguistics
The study of the nature, structure and variation of language, including phonetics, phonology, morphology, syntax, semantics, sociolinguistics and pragmatics.
- Morpheme
The smallest linguistic unit that has semantic meaning.
- Morphological Analysis
A technique developed by Fritz Zwicky (1966, 1969) for exploring all the possible solutions to a multidimensional, non-quantified problem complex. In linguistics, it refers to identification of a word stem from a full word form (see morpheme).
- Natural Language Processing
Natural language processing (NLP) is a convenient description for all attempts to use computers to process natural language.
- N-gram
An N-gram is a subsequence of n letters from a given string after removing all spaces. For example, the 3-grams that can be generated from "good morning" are "goo," "ood," "odm," "dmo," "mor" and so forth.
- POS or parts of speech
Identification of the semantic parts of sentences made up of nouns, verbs, adverbs, adjectives, etc. A POS tagger is a program that identifies and tags text based on different parts of speech.
- Stemming
In linguistics, this is a technique for identifying the main part of a word to which prefixes or suffixes are added.
- Unicode
Unicode provides a unique code number for every character, no matter what the platform, no matter what the program, no matter what the language. A standard managed by the Unicode Consortium.
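
The N-gram entry above is easy to try for yourself. The following Python sketch follows the dictionary definition literally: remove the spaces, then slide a window of n letters across what is left. The function name `char_ngrams` is my own, and this is only an illustration of the definition, not a claim about how RLP computes N-grams internally.

```python
def char_ngrams(text, n=3):
    """Return every run of n consecutive letters in text, spaces removed."""
    compact = "".join(text.split())  # drop all whitespace
    return [compact[i:i + n] for i in range(len(compact) - n + 1)]

print(char_ngrams("good morning"))
# ['goo', 'ood', 'odm', 'dmo', 'mor', 'orn', 'rni', 'nin', 'ing']
```

Because misspelled or variantly transliterated words still share many of their N-grams with the correct form, comparing N-gram sets is one common way to build language identifiers and fuzzy matchers.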