Normalizer Class Reference

#include <normlzr.h>

Inheritance: Normalizer inherits from UObject, which inherits from UMemory.

Detailed Description

The Normalizer class supports the standard normalization forms described in Unicode Standard Annex #15: Unicode Normalization Forms.

Note: This API has been replaced by the Normalizer2 class and is only available for backward compatibility. This class simply delegates to the Normalizer2 class. There is one exception: The new API does not provide a replacement for Normalizer::compare().

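The Normalizer class consists of two parts: static functions that normalize strings or test whether strings are normalized, and an iterator (a Normalizer object) that takes any kind of text and provides iteration over its normalized form.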

The Normalizer class is not suitable for subclassing.

For basic information about normalization forms and details about the C API, please see the documentation in unorm.h.

The iterator API, which consists of the Normalizer constructors and the non-static functions, uses a CharacterIterator as input. It is also possible to pass a string, which is then internally wrapped in a CharacterIterator. The input text is not normalized all at once, but incrementally where needed (providing efficient random access). This makes it possible to pass in a large text but spend only a small amount of time normalizing a small part of it. However, if the entire text is normalized, then the iterator will be slower than normalizing the entire text at once and iterating over the result. Another possible use of the Normalizer iterator is to report an index into the original text that is close to where the normalized characters come from.

Important: The iterator API was cleaned up significantly for ICU 2.0. The earlier implementation reported getIndex() inconsistently, and previous() could not be used after setIndex(), next(), first(), and current().

Normalizer can start normalizing from anywhere in the input text after a call to setIndexOnly(), first(), or last(). If none of these is called, the iterator starts at the beginning of the text.

At any time, next() returns the next normalized code point (UChar32), with post-increment semantics (like CharacterIterator::next32PostInc()). previous() returns the previous normalized code point (UChar32), with pre-decrement semantics (like CharacterIterator::previous32()).

current() returns the current code point (after setIndexOnly(), the one at the newly set index) without moving getIndex(). Note that if the text at the current position needs to be normalized, then current() will do that. (This is why current() is not const.) When only repositioning is needed, it is more efficient to call setIndexOnly(), which does not normalize.

getIndex() always refers to the position in the input text where the normalized code points are returned from. It does not always change with each returned code point. The code point that is returned from any of the functions corresponds to text at or after getIndex(), according to the function's iteration semantics (post-increment or pre-decrement).

next() returns a code point from at or after the getIndex() value that was in effect before the next() call. After the next() call, getIndex() might have moved to where the next code point will be returned from (by a subsequent next() or current() call). This is semantically equivalent to array access with array[index++] (post-increment semantics).

previous() returns a code point from at or after the getIndex() value that is in effect after the previous() call. This is semantically equivalent to array access with array[--index] (pre-decrement semantics).
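The two access patterns can be modeled without ICU. The following is a minimal sketch, assuming text that is already normalized, so that input and output indices coincide. The names Cursor and kDone are hypothetical and are not part of the ICU API; only the index arithmetic mirrors the description above.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Sentinel with the same value as Normalizer::DONE (0xffff).
constexpr char32_t kDone = 0xFFFF;

// Hypothetical cursor over text that is assumed to be normalized already.
// It models only the index semantics, not normalization itself.
class Cursor {
public:
    explicit Cursor(std::u32string text) : text_(std::move(text)) {}

    // Post-increment, like array[index++]: read at the index, then advance.
    char32_t next() {
        if (index_ >= text_.size()) return kDone;
        return text_[index_++];
    }

    // Pre-decrement, like array[--index]: step back, then read.
    char32_t previous() {
        if (index_ == 0) return kDone;
        return text_[--index_];
    }

    std::size_t getIndex() const { return index_; }

private:
    std::u32string text_;
    std::size_t index_ = 0;
};
```

With the text "abc", a call to next() returns 'a' and leaves the index at 1. An immediate previous() returns 'a' again and restores the index to 0, which is the behavior that the post-increment and pre-decrement pairing is meant to guarantee.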

Internally, the Normalizer iterator normalizes a small piece of text starting at the getIndex() and ending at a following "safe" index. The normalized result is stored in an internal string buffer, and the code points are iterated from there. With multiple iteration calls, this is repeated until the next piece of text needs to be normalized, and the getIndex() needs to be moved.

The following "safe" index, the internal buffer, and the secondary iteration index into that buffer are not exposed in the API. This also means that it is currently not practical to return to a particular, arbitrary position in the text, because one would need to know, and be able to set, not only the getIndex() but also the current index into the internal buffer. It is currently only possible to observe when getIndex() changes (with careful consideration of the iteration semantics), at which time the internal index will be 0. For example, if getIndex() is different after next() than before it, then the internal index is 0 and one can return to this getIndex() later with setIndexOnly().
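The buffering scheme described above can be illustrated with a simplified model. This is a sketch under stated assumptions, not the ICU implementation: ToyNormalizingIterator, isToyMark, toyDecompose, and kDone are hypothetical names, the "safe" boundary rule treats only U+0300..U+036F as combining marks, and the decomposition table has a single entry. The member names loosely follow the private attributes listed further down this page.

```cpp
#include <cstddef>
#include <string>
#include <utility>

constexpr char32_t kDone = 0xFFFF;  // same value as Normalizer::DONE

// Toy rule: treat U+0300..U+036F as combining marks that attach to the
// preceding code point. Real boundaries come from Unicode data.
inline bool isToyMark(char32_t c) { return c >= 0x0300 && c <= 0x036F; }

// Toy decomposition with a single mapping: U+00E9 -> U+0065 U+0301.
inline std::u32string toyDecompose(char32_t c) {
    if (c == 0x00E9) return std::u32string(U"e\u0301");
    return std::u32string(1, c);
}

class ToyNormalizingIterator {
public:
    explicit ToyNormalizingIterator(std::u32string text)
        : text_(std::move(text)) {}

    // Input index of the piece that the next code point will come from.
    std::size_t getIndex() const {
        return bufferPos_ < buffer_.size() ? currentIndex_ : nextIndex_;
    }

    // Reposition in the input without normalizing; drops the buffer.
    void setIndexOnly(std::size_t index) {
        currentIndex_ = nextIndex_ = index;
        buffer_.clear();
        bufferPos_ = 0;
    }

    // Post-increment over the normalized output.
    char32_t next() {
        if (bufferPos_ < buffer_.size() || nextNormalize()) {
            return buffer_[bufferPos_++];
        }
        return kDone;
    }

private:
    // Normalize one piece: from nextIndex_ up to the following toy boundary.
    bool nextNormalize() {
        buffer_.clear();
        bufferPos_ = 0;
        currentIndex_ = nextIndex_;
        if (nextIndex_ >= text_.size()) return false;
        std::size_t end = nextIndex_ + 1;
        while (end < text_.size() && isToyMark(text_[end])) ++end;
        for (std::size_t i = nextIndex_; i < end; ++i) {
            buffer_ += toyDecompose(text_[i]);
        }
        nextIndex_ = end;
        return true;
    }

    std::u32string text_;
    std::u32string buffer_;         // normalized piece
    std::size_t bufferPos_ = 0;     // secondary index into buffer_
    std::size_t currentIndex_ = 0;  // input index where buffer_ starts
    std::size_t nextIndex_ = 0;     // input index of the following boundary
};
```

For the two-code-point input U+00E9 followed by 'x', getIndex() in this model stays at 0 across the first next() (which returns 'e') and becomes 1 only after the second next() (which returns U+0301). That change signals that the buffer is exhausted, so 1 is a position that can be revisited later with setIndexOnly(1).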

Note: While the setIndex() and getIndex() refer to indices in the underlying Unicode input text, the next() and previous() methods iterate through characters in the normalized output. This means that there is not necessarily a one-to-one correspondence between characters returned by next() and previous() and the indices passed to and returned from setIndex() and getIndex(). It is for this reason that Normalizer does not implement the CharacterIterator interface.
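To see why returned characters and input indices do not line up one to one, consider a toy composition that records the input index each output code point comes from. This is an illustrative sketch only: Mapped and toyComposeWithIndices are hypothetical names, and the single rule (U+0065 U+0301 composes to U+00E9) stands in for the real Unicode composition data.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One normalized code point plus the input index where its source text starts.
using Mapped = std::pair<char32_t, std::size_t>;

// Toy composition with a single rule: U+0065 U+0301 -> U+00E9.
// Real composition uses the Unicode Character Database.
inline std::vector<Mapped> toyComposeWithIndices(const std::u32string &input) {
    std::vector<Mapped> out;
    std::size_t i = 0;
    while (i < input.size()) {
        if (input[i] == 0x0065 && i + 1 < input.size() &&
            input[i + 1] == 0x0301) {
            // Two input code points collapse into one output code point.
            out.emplace_back(char32_t(0x00E9), i);
            i += 2;
        } else {
            out.emplace_back(input[i], i);
            ++i;
        }
    }
    return out;
}
```

For the three-code-point input 'e', U+0301, 'x', the result holds two entries with source indices 0 and 2. The second returned character therefore corresponds to input index 2, not 1, which is the kind of mismatch that prevents a CharacterIterator-style contract.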

Author: Laura Werner, Mark Davis, Markus Scherer

Stable: ICU 2.0

Definition at line 130 of file normlzr.h.

Public Types

enum  { DONE = 0xffff }

Public Member Functions

Normalizer * clone (void) const
UChar32 current (void)
int32_t endIndex (void) const
UChar32 first (void)
virtual UClassID getDynamicClassID () const
int32_t getIndex (void) const
UBool getOption (int32_t option) const
void getText (UnicodeString &result)
UNormalizationMode getUMode (void) const
int32_t hashCode (void) const
UChar32 last (void)
UChar32 next (void)
 Normalizer (const Normalizer &copy)
 Normalizer (const CharacterIterator &iter, UNormalizationMode mode)
 Normalizer (const UChar *str, int32_t length, UNormalizationMode mode)
 Normalizer (const UnicodeString &str, UNormalizationMode mode)
UBool operator!= (const Normalizer &that) const
UBool operator== (const Normalizer &that) const
UChar32 previous (void)
void reset (void)
void setIndexOnly (int32_t index)
void setMode (UNormalizationMode newMode)
void setOption (int32_t option, UBool value)
void setText (const UChar *newText, int32_t length, UErrorCode &status)
void setText (const CharacterIterator &newText, UErrorCode &status)
void setText (const UnicodeString &newText, UErrorCode &status)
int32_t startIndex (void) const
virtual ~Normalizer ()

Static Public Member Functions

static int32_t compare (const UnicodeString &s1, const UnicodeString &s2, uint32_t options, UErrorCode &errorCode)
static void U_EXPORT2 compose (const UnicodeString &source, UBool compat, int32_t options, UnicodeString &result, UErrorCode &status)
static UnicodeString &U_EXPORT2 concatenate (UnicodeString &left, UnicodeString &right, UnicodeString &result, UNormalizationMode mode, int32_t options, UErrorCode &errorCode)
static void U_EXPORT2 decompose (const UnicodeString &source, UBool compat, int32_t options, UnicodeString &result, UErrorCode &status)
static UClassID U_EXPORT2 getStaticClassID ()
static UBool isNormalized (const UnicodeString &src, UNormalizationMode mode, int32_t options, UErrorCode &errorCode)
static UBool isNormalized (const UnicodeString &src, UNormalizationMode mode, UErrorCode &errorCode)
static void U_EXPORT2 normalize (const UnicodeString &source, UNormalizationMode mode, int32_t options, UnicodeString &result, UErrorCode &status)
static void U_EXPORT2 operator delete (void *, void *) U_NO_THROW
static void U_EXPORT2 operator delete (void *p) U_NO_THROW
static void U_EXPORT2 operator delete[] (void *p) U_NO_THROW
static void *U_EXPORT2 operator new (size_t, void *ptr) U_NO_THROW
static void *U_EXPORT2 operator new (size_t size) U_NO_THROW
static void *U_EXPORT2 operator new[] (size_t size) U_NO_THROW
static UNormalizationCheckResult quickCheck (const UnicodeString &source, UNormalizationMode mode, int32_t options, UErrorCode &status)
static UNormalizationCheckResult quickCheck (const UnicodeString &source, UNormalizationMode mode, UErrorCode &status)

Private Member Functions

void clearBuffer (void)
void init ()
UBool nextNormalize ()
Normalizer & operator= (const Normalizer &that)
UBool previousNormalize ()

Private Attributes

UnicodeString buffer
int32_t bufferPos
int32_t currentIndex
const Normalizer2 * fNorm2
int32_t fOptions
UNormalizationMode fUMode
int32_t nextIndex
