
Transliterator Class Reference

#include <translit.h>

Inheritance diagram for Transliterator:

Base classes: UObject, UMemory
Derived classes: AnyTransliterator, BreakTransliterator, CaseMapTransliterator, CompoundTransliterator, EscapeTransliterator, NameUnicodeTransliterator, NormalizationTransliterator, NullTransliterator, RemoveTransliterator, RuleBasedTransliterator, UnescapeTransliterator, UnicodeNameTransliterator


Detailed Description

Transliterator is an abstract class that transliterates text from one format to another. The most common kind of transliterator is a script, or alphabet, transliterator. For example, a Russian to Latin transliterator changes Russian text written in Cyrillic characters to phonetically equivalent Latin characters. It does not translate Russian to English! Transliteration, unlike translation, operates on characters, without reference to the meanings of words and sentences.

Although script conversion is its most common use, a transliterator can actually perform a more general class of tasks. In fact, Transliterator defines a very general API which specifies only that a segment of the input text is replaced by new text. The particulars of this conversion are determined entirely by subclasses of Transliterator.

Transliterators are stateless

Transliterator objects are stateless; they retain no information between calls to transliterate(). (However, this does not mean that threads may share transliterators without synchronizing them. Transliterators are not immutable, so they must be synchronized when shared between threads.) This might seem to limit the complexity of the transliteration operation. In practice, subclasses perform complex transliterations by delaying the replacement of text until it is known that no other replacements are possible. In other words, although the Transliterator objects are stateless, the source text itself embodies all the needed information, and delayed operation allows arbitrary complexity.

Batch transliteration

The simplest way to perform transliteration is all at once, on a string of existing text. This is referred to as batch transliteration. For example, given a string input and a transliterator t, the call

String result = t.transliterate(input);

will transliterate it and return the result. Other methods allow the client to specify a substring to be transliterated and to use Replaceable objects instead of strings, in order to preserve out-of-band information (such as text styles).
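As a rough illustration of the batch calls, the following self-contained C++ sketch models a transliterator with std::string standing in for Replaceable. The AmpersandToAnd type and its rule are hypothetical and not part of ICU. Only the shape of the two transliterate() overloads (whole text, and a start/limit range that returns the new limit) follows the member list of this class.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy stand-in for a Transliterator subclass: rewrites '&' as "and".
// Hypothetical illustration only; std::string plays the role of Replaceable.
struct AmpersandToAnd {
    // Transliterate text[start, limit) in place and return the new limit,
    // which moves because the replacement is longer than the original.
    std::size_t transliterate(std::string& text, std::size_t start,
                              std::size_t limit) const {
        for (std::size_t i = start; i < limit; ++i) {
            if (text[i] == '&') {
                text.replace(i, 1, "and");
                i += 2;      // skip the rest of the inserted text
                limit += 2;  // the range grew by two characters
            }
        }
        return limit;
    }

    // Batch form: transliterate the whole string in place.
    void transliterate(std::string& text) const {
        transliterate(text, 0, text.size());
    }
};
```

Calling the batch form on "a&b&c" leaves "aandbandc" in the string. Calling the range form on "x&y&z" with start 0 and limit 2 rewrites only the first ampersand and returns 4.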

Keyboard transliteration

Somewhat more involved is keyboard, or incremental, transliteration. This is the transliteration of text that is arriving from some source (typically the user's keyboard) one character at a time, or in some other piecemeal fashion.

In keyboard transliteration, a Replaceable buffer stores the text. As text is inserted, as much as possible is transliterated on the fly. This means a GUI that displays the contents of the buffer may show text being modified as each new character arrives.

Consider the simple RuleBasedTransliterator:

th>{theta}
t>{tau}

When the user types 't', nothing will happen, since the transliterator is waiting to see if the next character is 'h'. To remedy this, we introduce the notion of a cursor, marked by a '|' in the output string:

t>|{tau}
{tau}h>{theta}

Now when the user types 't', tau appears, and if the next character is 'h', the tau changes to a theta. This is accomplished by maintaining a cursor position (independent of the insertion point, and invisible in the GUI) across calls to transliterate(). Typically, the cursor will be coincident with the insertion point, but in a case like the one above, it will precede the insertion point.
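The cursor behaviour can be sketched without ICU. The code below is a hypothetical model, not the RuleBasedTransliterator implementation: it hard-codes the two cursor rules, uses std::string as the buffer, and treats the cursor as a byte offset into UTF-8 text. The names applyCursorRules and typeChar are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

const std::string kTau   = "\xCF\x84";  // UTF-8 encoding of Greek small tau
const std::string kTheta = "\xCE\xB8";  // UTF-8 encoding of Greek small theta

// Applies  t > |tau  and  tau h > theta  starting at `cursor` (a byte offset).
// Returns when the end of the text is reached, or when a tau at the end of
// the text might still turn into a theta.
void applyCursorRules(std::string& text, std::size_t& cursor) {
    while (cursor < text.size()) {
        if (text[cursor] == 't') {
            // t > |tau : replace, but leave the cursor in front of the tau.
            text.replace(cursor, 1, kTau);
        } else if (text.compare(cursor, kTau.size(), kTau) == 0) {
            const std::size_t next = cursor + kTau.size();
            if (next == text.size()) return;  // wait for more input
            if (text[next] == 'h') {
                // tau h > theta : the cursor moves past the result.
                text.replace(cursor, kTau.size() + 1, kTheta);
                cursor += kTheta.size();
            } else {
                cursor = next;  // the tau is final
            }
        } else {
            ++cursor;
        }
    }
}

// Simulates one keystroke: append the character, then transliterate.
void typeChar(std::string& text, std::size_t& cursor, char c) {
    text.push_back(c);
    applyCursorRules(text, cursor);
}
```

Typing 't' into an empty buffer shows a tau while the cursor stays at offset 0. A following 'h' turns the tau into a theta, and any other character leaves the tau in place and moves the cursor past it.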

Keyboard transliteration methods maintain a set of three indices that are updated with each call to transliterate(): the cursor, start, and limit. Since these indices are changed by the method, they are passed by reference in the index argument (a UTransPosition in the signatures listed under Public Member Functions). The START index marks the beginning of the substring that the transliterator will look at. It is advanced as text becomes committed (but it is not the committed index; that's the CURSOR). The CURSOR index, described above, marks the point at which the transliterator last stopped, either because it reached the end, or because it required more characters to disambiguate between possible inputs. The CURSOR can also be explicitly set by rules in a RuleBasedTransliterator. Any characters before the CURSOR index are frozen; future keyboard transliteration calls within this input sequence will not change them. New text is inserted at the LIMIT index, which marks the end of the substring that the transliterator looks at.

Because keyboard transliteration assumes that more characters are to arrive, it is conservative in its operation. It only transliterates when it can do so unambiguously. Otherwise it waits for more characters to arrive. When the client code knows that no more characters are forthcoming, perhaps because the user has performed some input termination operation, then it should call finishTransliteration() to complete any pending transliterations.
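The conservative behaviour and the final flush can be modelled in a few lines. This is a simplified sketch under stated assumptions: ToyPosition is a hypothetical two-field analogue of the index set (byte offsets into a UTF-8 std::string), the rules th > theta and t > tau are hard-coded, and the free functions only borrow the names handleTransliterate and finishTransliteration from this class.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

const std::string kThetaUtf8 = "\xCE\xB8";  // Greek small theta
const std::string kTauUtf8   = "\xCF\x84";  // Greek small tau

// Hypothetical, reduced analogue of the index set described above.
struct ToyPosition {
    std::size_t start;  // text before this offset is committed
    std::size_t limit;  // end of the text under consideration
};

// Applies  th > theta  and  t > tau  to text[pos.start, pos.limit).
// In incremental mode a trailing 't' is left alone, because the next
// keystroke might be 'h'.
void handleTransliterate(std::string& text, ToyPosition& pos, bool incremental) {
    while (pos.start < pos.limit) {
        if (text[pos.start] != 't') { ++pos.start; continue; }
        const bool atEnd = (pos.start + 1 == pos.limit);
        if (atEnd && incremental) return;  // ambiguous: wait for more input
        const bool isTh = !atEnd && text[pos.start + 1] == 'h';
        const std::string& out = isTh ? kThetaUtf8 : kTauUtf8;
        const std::size_t consumed = isTh ? 2 : 1;
        text.replace(pos.start, consumed, out);
        pos.limit = pos.limit - consumed + out.size();
        pos.start += out.size();
    }
}

// Keyboard step: insert one character at the limit, then transliterate
// whatever has become unambiguous.
void typeKey(std::string& text, ToyPosition& pos, char c) {
    text.insert(pos.limit, 1, c);
    ++pos.limit;
    handleTransliterate(text, pos, true);
}

// No more input is expected: resolve anything still pending.
void finishTransliteration(std::string& text, ToyPosition& pos) {
    handleTransliterate(text, pos, false);
}
```

After typing 'a' and 't' the buffer still reads "at". Typing 'h' commits a theta. A further 't' stays pending until finishTransliteration() turns it into a tau.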


Inverses

Pairs of transliterators may be inverses of one another. For example, if transliterator A transliterates characters by incrementing their Unicode value (so "abc" -> "def"), and transliterator B decrements character values, then A is an inverse of B and vice versa. If we compose A with B in a compound transliterator, the result is the identity transliterator, that is, a transliterator that does not change its input text.

The Transliterator method createInverse() (getInverse() in ICU4J) returns a transliterator's inverse, if one exists, or null otherwise. However, the result usually will not be a true mathematical inverse. This is because true inverse transliterators are difficult to formulate. For example, consider two transliterators: AB, which transliterates the character 'A' to 'B', and BA, which transliterates 'B' to 'A'. It might seem that these are exact inverses, since

"A" x AB -> "B"
"B" x BA -> "A"

where 'x' represents transliteration. However,

"ABCD" x AB -> "BBCD"
"BBCD" x BA -> "AACD"

so AB composed with BA is not the identity. Nonetheless, BA may be usefully considered to be AB's inverse, and it is on this basis that AB.getInverse() could legitimately return BA.
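Both cases can be checked with two small helper functions. They are hypothetical stand-ins written for this illustration, operate on plain ASCII std::string values, and use a shift of 3 only because that matches the "abc" -> "def" example.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Toy AB/BA-style transliterator: every `from` becomes `to`.
std::string replaceAll(std::string text, char from, char to) {
    std::replace(text.begin(), text.end(), from, to);
    return text;
}

// Toy A/B pair: shift each character code by `delta`.
std::string shiftCodes(std::string text, int delta) {
    for (char& c : text) {
        c = static_cast<char>(c + delta);
    }
    return text;
}
```

Here shiftCodes(shiftCodes("abc", 3), -3) gives back "abc", so the shifting pair is a true inverse. In contrast, replaceAll(replaceAll("ABCD", 'A', 'B'), 'B', 'A') yields "AACD", because the first step discards the distinction between the original 'A' and 'B'.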

IDs and display names

A transliterator is designated by a short identifier string or ID. IDs follow the format source-destination, where source describes the entity being replaced, and destination describes the entity replacing source. The entities may be the names of scripts, particular sequences of characters, or whatever else it is that the transliterator converts to or from. For example, a transliterator from Russian to Latin might be named "Russian-Latin". A transliterator from keyboard escape sequences to Latin-1 characters might be named "KeyboardEscape-Latin1". By convention, system entity names are in English, with the initial letters of words capitalized; user entity names may follow any format so long as they do not contain dashes.
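A minimal sketch of the source-destination convention, assuming an ID contains exactly one dash and nothing else of significance. The helper splitId is hypothetical. It is not an ICU function, and real ID parsing is handled elsewhere (see the TransliteratorIDParser friend class).

```cpp
#include <cassert>
#include <string>

// Splits "source-destination" at its single dash.
// Returns false if there is no dash, more than one dash, or an empty side.
bool splitId(const std::string& id, std::string& source, std::string& target) {
    const std::string::size_type dash = id.find('-');
    if (dash == std::string::npos || dash == 0 || dash + 1 == id.size()) {
        return false;
    }
    if (id.find('-', dash + 1) != std::string::npos) {
        return false;  // entity names must not contain dashes
    }
    source = id.substr(0, dash);
    target = id.substr(dash + 1);
    return true;
}
```

For "Russian-Latin" this yields the source "Russian" and the target "Latin". An ID such as "A-B-C" is rejected, reflecting the rule that entity names do not contain dashes.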

In addition to programmatic IDs, transliterator objects have display names for presentation in user interfaces, returned by getDisplayName.

Factory methods and registration

In general, client code should use the factory method createInstance to obtain an instance of a transliterator given its ID. Valid IDs may be enumerated using getAvailableIDs(). Since transliterators are mutable, multiple calls to createInstance with the same ID will return distinct objects.

In addition to the system transliterators registered at startup, user transliterators may be registered by calling registerInstance() at run time. A registered instance acts as a template; future calls to createInstance with the ID of the registered object return clones of that object. Thus any object passed to registerInstance() must implement clone() properly. To register a transliterator subclass without instantiating it (until it is needed), users may call registerFactory. In this case, the objects are instantiated by invoking the registered Factory function with the requested ID and the context token.
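The template-and-clone behaviour can be sketched with a small registry. Everything here (ToyTransliterator, ToyFactory, ToyRegistry, makeToy) is hypothetical and much simpler than the ICU registry. The sketch omits the context token and uses std::unique_ptr for ownership of the returned objects.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Minimal stand-in for the Transliterator interface (hypothetical).
class ToyTransliterator {
public:
    explicit ToyTransliterator(std::string id) : id_(std::move(id)) {}
    virtual ~ToyTransliterator() {}
    virtual ToyTransliterator* clone() const { return new ToyTransliterator(*this); }
    const std::string& getID() const { return id_; }
private:
    std::string id_;
};

// Factory callback; the real Factory typedef also takes a context token.
typedef ToyTransliterator* (*ToyFactory)(const std::string& id);

// Example factory used for lazy registration.
ToyTransliterator* makeToy(const std::string& id) {
    return new ToyTransliterator(id);
}

class ToyRegistry {
public:
    // Adopts the object and keeps it as a template for later clones.
    void registerInstance(ToyTransliterator* adoptedObj) {
        prototypes_[adoptedObj->getID()].reset(adoptedObj);
    }
    // Nothing is instantiated until createInstance() asks for this ID.
    void registerFactory(const std::string& id, ToyFactory factory) {
        factories_[id] = factory;
    }
    // Returns a fresh object per call, or an empty pointer for an unknown ID.
    std::unique_ptr<ToyTransliterator> createInstance(const std::string& id) const {
        auto p = prototypes_.find(id);
        if (p != prototypes_.end()) {
            return std::unique_ptr<ToyTransliterator>(p->second->clone());
        }
        auto f = factories_.find(id);
        if (f != factories_.end()) {
            return std::unique_ptr<ToyTransliterator>(f->second(id));
        }
        return nullptr;
    }
private:
    std::map<std::string, std::unique_ptr<ToyTransliterator>> prototypes_;
    std::map<std::string, ToyFactory> factories_;
};
```

Two createInstance() calls with the same ID return two distinct objects, which is why a registered instance needs a working clone().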


Subclassing

Subclasses must implement the abstract method handleTransliterate().

Subclasses should override the transliterate() method taking a Replaceable and the transliterate() method taking a String and StringBuffer if the performance of these methods can be improved over the performance obtained by the default implementations in this class.
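The division of labour between the required hook and the optional overrides might look as follows. This is a sketch with hypothetical classes (ToyBase, RemoveDigits) over std::string, not the ICU class hierarchy. The hook signature is simplified to a start/limit range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

class ToyBase {
public:
    virtual ~ToyBase() {}
    // Default batch entry point, written in terms of the subclass hook.
    virtual void transliterate(std::string& text) const {
        handleTransliterate(text, 0, text.size());
    }
protected:
    // Required hook: rewrite text[start, limit) in place.
    virtual void handleTransliterate(std::string& text, std::size_t start,
                                     std::size_t limit) const = 0;
};

static bool isAsciiDigit(char c) { return c >= '0' && c <= '9'; }

// Removes ASCII digits.
class RemoveDigits : public ToyBase {
public:
    // Optional override: the whole-string case in a single pass,
    // instead of one erase() per digit.
    void transliterate(std::string& text) const override {
        text.erase(std::remove_if(text.begin(), text.end(), isAsciiDigit),
                   text.end());
    }
protected:
    void handleTransliterate(std::string& text, std::size_t start,
                             std::size_t limit) const override {
        while (start < limit) {
            if (isAsciiDigit(text[start])) {
                text.erase(start, 1);
                --limit;  // the range shrank by one character
            } else {
                ++start;
            }
        }
    }
};
```

Both paths produce the same result; the override only changes how much work is done to get there.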

Author: Alan Liu
Stable: ICU 2.0

Definition at line 241 of file translit.h.

Public Types

typedef Transliterator *(U_EXPORT2 *Factory )(const UnicodeString &ID, Token context)

Public Member Functions

void adoptFilter (UnicodeFilter *adoptedFilter)
virtual Transliterator * clone () const
int32_t countElements () const
Transliterator * createInverse (UErrorCode &status) const
virtual void filteredTransliterate (Replaceable &text, UTransPosition &index, UBool incremental) const
virtual void finishTransliteration (Replaceable &text, UTransPosition &index) const
virtual UClassID getDynamicClassID (void) const =0
const Transliterator & getElement (int32_t index, UErrorCode &ec) const
const UnicodeFilter * getFilter (void) const
virtual const UnicodeString & getID (void) const
int32_t getMaximumContextLength (void) const
UnicodeSet & getSourceSet (UnicodeSet &result) const
virtual UnicodeSet & getTargetSet (UnicodeSet &result) const
virtual void handleGetSourceSet (UnicodeSet &result) const
UnicodeFilter * orphanFilter (void)
virtual UnicodeString & toRules (UnicodeString &result, UBool escapeUnprintable) const
virtual void transliterate (Replaceable &text, UTransPosition &index, UErrorCode &status) const
virtual void transliterate (Replaceable &text, UTransPosition &index, UChar32 insertion, UErrorCode &status) const
virtual void transliterate (Replaceable &text, UTransPosition &index, const UnicodeString &insertion, UErrorCode &status) const
virtual void transliterate (Replaceable &text) const
virtual int32_t transliterate (Replaceable &text, int32_t start, int32_t limit) const
virtual ~Transliterator ()

Static Public Member Functions

static int32_t U_EXPORT2 countAvailableIDs (void)
static int32_t U_EXPORT2 countAvailableSources (void)
static int32_t U_EXPORT2 countAvailableTargets (const UnicodeString &source)
static int32_t U_EXPORT2 countAvailableVariants (const UnicodeString &source, const UnicodeString &target)
static Transliterator *U_EXPORT2 createFromRules (const UnicodeString &ID, const UnicodeString &rules, UTransDirection dir, UParseError &parseError, UErrorCode &status)
static Transliterator *U_EXPORT2 createInstance (const UnicodeString &ID, UTransDirection dir, UErrorCode &status)
static Transliterator *U_EXPORT2 createInstance (const UnicodeString &ID, UTransDirection dir, UParseError &parseError, UErrorCode &status)
static const UnicodeString &U_EXPORT2 getAvailableID (int32_t index)
static StringEnumeration *U_EXPORT2 getAvailableIDs (UErrorCode &ec)
static UnicodeString &U_EXPORT2 getAvailableSource (int32_t index, UnicodeString &result)
static UnicodeString &U_EXPORT2 getAvailableTarget (int32_t index, const UnicodeString &source, UnicodeString &result)
static UnicodeString &U_EXPORT2 getAvailableVariant (int32_t index, const UnicodeString &source, const UnicodeString &target, UnicodeString &result)
static UnicodeString &U_EXPORT2 getDisplayName (const UnicodeString &ID, const Locale &inLocale, UnicodeString &result)
static UnicodeString &U_EXPORT2 getDisplayName (const UnicodeString &ID, UnicodeString &result)
static UClassID U_EXPORT2 getStaticClassID (void)
static Token integerToken (int32_t)
static void U_EXPORT2 operator delete (void *, void *) U_NO_THROW
static void U_EXPORT2 operator delete (void *p) U_NO_THROW
static void U_EXPORT2 operator delete[] (void *p) U_NO_THROW
static void *U_EXPORT2 operator new (size_t, void *ptr) U_NO_THROW
static void *U_EXPORT2 operator new (size_t size) U_NO_THROW
static void *U_EXPORT2 operator new[] (size_t size) U_NO_THROW
static Token pointerToken (void *)
static void U_EXPORT2 registerAlias (const UnicodeString &aliasID, const UnicodeString &realID)
static void U_EXPORT2 registerFactory (const UnicodeString &id, Factory factory, Token context)
static void U_EXPORT2 registerInstance (Transliterator *adoptedObj)
static void U_EXPORT2 unregister (const UnicodeString &ID)

Protected Member Functions

virtual void handleTransliterate (Replaceable &text, UTransPosition &pos, UBool incremental) const =0
Transliterator & operator= (const Transliterator &)
void setID (const UnicodeString &id)
void setMaximumContextLength (int32_t maxContextLength)
 Transliterator (const Transliterator &)
 Transliterator (const UnicodeString &ID, UnicodeFilter *adoptedFilter)

Static Protected Member Functions

static int32_t _countAvailableSources (void)
static int32_t _countAvailableTargets (const UnicodeString &source)
static int32_t _countAvailableVariants (const UnicodeString &source, const UnicodeString &target)
static UnicodeString & _getAvailableSource (int32_t index, UnicodeString &result)
static UnicodeString & _getAvailableTarget (int32_t index, const UnicodeString &source, UnicodeString &result)
static UnicodeString & _getAvailableVariant (int32_t index, const UnicodeString &source, const UnicodeString &target, UnicodeString &result)
static void _registerAlias (const UnicodeString &aliasID, const UnicodeString &realID)
static void _registerFactory (const UnicodeString &id, Factory factory, Token context)
static void _registerInstance (Transliterator *adoptedObj)
static void _registerSpecialInverse (const UnicodeString &target, const UnicodeString &inverseTarget, UBool bidirectional)
static Transliterator * createBasicInstance (const UnicodeString &id, const UnicodeString *canon)

Private Member Functions

void _transliterate (Replaceable &text, UTransPosition &index, const UnicodeString *insertion, UErrorCode &status) const
virtual void filteredTransliterate (Replaceable &text, UTransPosition &index, UBool incremental, UBool rollback) const

Static Private Member Functions

static UBool initializeRegistry (UErrorCode &status)

Private Attributes

UnicodeString ID
int32_t maximumContextLength


Friends

class TransliteratorAlias
class TransliteratorIDParser
class TransliteratorParser


Classes

union Token

The documentation for this class was generated from the following files:
