
utext.h File Reference


Detailed Description

C API: Abstract Unicode Text API.

The Text Access API lets text that is stored in alternative formats work with ICU services. ICU normally operates on text that is stored in UTF-16 format, in (UChar *) arrays for the C APIs or as type UnicodeString for the C++ APIs.

ICU Text Access allows other formats, such as UTF-8 or non-contiguous UTF-16 strings, to be placed in a UText wrapper and then passed to ICU services.

There are three general classes of usage for UText:

Application Level Use. This is the simplest usage: an application calls one of the utext_open() functions on its input text and passes the resulting UText to the desired ICU service.

Use in ICU Services. Services such as break iteration need to operate on input presented to them as a UText. Their implementations use the iteration and related UText functions to gain access to the actual text.

Text Providers. These are the UText implementations for the various text storage formats. An application or system with a unique text storage format can implement a set of UText provider functions for that format, which then allows ICU services to operate on that format.

Iterating over text

Here is sample code for a forward iteration over the contents of a UText:

    UChar32  c;
    UText    *ut = whatever();

    for (c=utext_next32From(ut, 0); c>=0; c=utext_next32(ut)) {
       // do whatever with the codepoint c here.
    }

And here is similar code to iterate in the reverse direction, from the end of the text towards the beginning.

    UChar32  c;
    UText    *ut = whatever();
    int64_t  textLength = utext_nativeLength(ut);   // native lengths are 64-bit
    for (c=utext_previous32From(ut, textLength); c>=0; c=utext_previous32(ut)) {
       // do whatever with the codepoint c here.
    }

Characters and Indexing

Indexing into text by UText functions is nearly always in terms of the native indexing of the underlying text storage. The storage format could be UTF-8 or UTF-32, for example. When coding to the UText access API, no assumptions can be made regarding the size of characters, or how far an index may move when iterating between characters.

All indices supplied to UText functions are pinned to the length of the text. An out-of-bounds index is not considered to be an error, but is adjusted to be in the range 0 <= index <= length of input text.

When an index position is returned from a UText function, it will be a native index to the underlying text. In the case of multi-unit characters, it will always refer to the first position of the character, never to the interior. This is essentially the same thing as saying that a returned index will always point to a boundary between characters.

When a native index is supplied to a UText function, all indices that refer to any part of a multi-unit character representation are considered to be equivalent. In the case of multi-unit characters, an incoming index will be logically normalized to refer to the start of the character.
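
The pinning and normalization rules can be modeled in a few lines of plain C for the UTF-8 case. This is a standalone sketch of the described behavior, not ICU code: the function names pin_index and normalize_utf8_index are hypothetical, and the input is assumed to be well-formed UTF-8.

```c
#include <stdint.h>

/* Pin an index into the range 0 <= index <= length. */
int64_t pin_index(int64_t index, int64_t length) {
    if (index < 0) {
        return 0;
    }
    if (index > length) {
        return length;
    }
    return index;
}

/* Model of index normalization for well-formed UTF-8: pin the index, then
   back up over continuation bytes (bit pattern 10xxxxxx) until the index
   refers to the first byte of the character. */
int64_t normalize_utf8_index(const char *s, int64_t length, int64_t index) {
    index = pin_index(index, length);
    while (index > 0 && index < length &&
           ((unsigned char)s[index] & 0xC0) == 0x80) {
        --index;
    }
    return index;
}
```

In this model, every index that falls inside a multi-byte character maps to the same starting position, which is the sense in which such indices are equivalent.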

It is possible to test whether a native index is on a code point boundary by doing a utext_setNativeIndex() followed by a utext_getNativeIndex(). If the index is returned unchanged, it was on a code point boundary. If an adjusted index is returned, the original index referred to the interior of a character.

Conventions for calling UText functions

Most UText access functions have as their first parameter a (UText *) pointer, which specifies the UText to be used. Unless otherwise noted, the pointer must refer to a valid, open UText. Attempting to use a closed UText or passing a NULL pointer is a programming error and will produce undefined results, such as a crash on a NULL pointer dereference.

The utext_open family of functions can either reuse an existing (closed) UText or heap-allocate a new UText. Here is sample code for creating a stack-allocated UText.

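    const char *s = whatever();  // A utf-8 string 
    UErrorCode status = U_ZERO_ERROR;
    UText    ut = UTEXT_INITIALIZER;
    utext_openUTF8(&ut, s, -1, &status);   // pass the address of the stack UText
    if (U_FAILURE(status)) {
        // error handling
    } else {
        // work with the UText
        utext_close(&ut);   // release provider resources; the struct itself is not freed
    }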

Any existing UText passed to an open function _must_ have been initialized, either by the UTEXT_INITIALIZER, or by having been originally heap-allocated by an open function. Passing NULL will cause the open function to heap-allocate and fully initialize a new UText.

Definition in file utext.h.

#include "unicode/utypes.h"


Classes

struct  UText
struct  UTextFuncs

Defines

#define UTEXT_GETNATIVEINDEX(ut)
#define UTEXT_INITIALIZER
#define UTEXT_NEXT32(ut)
#define UTEXT_PREVIOUS32(ut)
#define UTEXT_SETNATIVEINDEX(ut, ix)

Typedefs

typedef struct UText UText
typedef UBool U_CALLCONV UTextAccess (UText *ut, int64_t nativeIndex, UBool forward)
typedef UText *U_CALLCONV UTextClone (UText *dest, const UText *src, UBool deep, UErrorCode *status)
typedef void U_CALLCONV UTextClose (UText *ut)
typedef void U_CALLCONV UTextCopy (UText *ut, int64_t nativeStart, int64_t nativeLimit, int64_t nativeDest, UBool move, UErrorCode *status)
typedef int32_t U_CALLCONV UTextExtract (UText *ut, int64_t nativeStart, int64_t nativeLimit, UChar *dest, int32_t destCapacity, UErrorCode *status)
typedef struct UTextFuncs UTextFuncs
typedef int32_t U_CALLCONV UTextMapNativeIndexToUTF16 (const UText *ut, int64_t nativeIndex)
typedef int64_t U_CALLCONV UTextMapOffsetToNative (const UText *ut)
typedef int64_t U_CALLCONV UTextNativeLength (UText *ut)
typedef int32_t U_CALLCONV UTextReplace (UText *ut, int64_t nativeStart, int64_t nativeLimit, const UChar *replacementText, int32_t replacmentLength, UErrorCode *status)

Enumerations

enum  {
  UTEXT_PROVIDER_LENGTH_IS_EXPENSIVE = 1, UTEXT_PROVIDER_STABLE_CHUNKS = 2, UTEXT_PROVIDER_WRITABLE = 3, UTEXT_PROVIDER_HAS_META_DATA = 4,
  UTEXT_PROVIDER_OWNS_TEXT = 5
}
enum  { UTEXT_MAGIC = 0x345ad82c }

Functions

U_STABLE UChar32 U_EXPORT2 utext_char32At (UText *ut, int64_t nativeIndex)
U_STABLE UText *U_EXPORT2 utext_clone (UText *dest, const UText *src, UBool deep, UBool readOnly, UErrorCode *status)
U_STABLE UText *U_EXPORT2 utext_close (UText *ut)
U_STABLE void U_EXPORT2 utext_copy (UText *ut, int64_t nativeStart, int64_t nativeLimit, int64_t destIndex, UBool move, UErrorCode *status)
U_STABLE UChar32 U_EXPORT2 utext_current32 (UText *ut)
U_STABLE UBool U_EXPORT2 utext_equals (const UText *a, const UText *b)
U_STABLE int32_t U_EXPORT2 utext_extract (UText *ut, int64_t nativeStart, int64_t nativeLimit, UChar *dest, int32_t destCapacity, UErrorCode *status)
U_STABLE void U_EXPORT2 utext_freeze (UText *ut)
U_STABLE int64_t U_EXPORT2 utext_getNativeIndex (const UText *ut)
U_STABLE int64_t U_EXPORT2 utext_getPreviousNativeIndex (UText *ut)
U_STABLE UBool U_EXPORT2 utext_hasMetaData (const UText *ut)
U_STABLE UBool U_EXPORT2 utext_isLengthExpensive (const UText *ut)
U_STABLE UBool U_EXPORT2 utext_isWritable (const UText *ut)
U_STABLE UBool U_EXPORT2 utext_moveIndex32 (UText *ut, int32_t delta)
U_STABLE int64_t U_EXPORT2 utext_nativeLength (UText *ut)
U_STABLE UChar32 U_EXPORT2 utext_next32 (UText *ut)
U_STABLE UChar32 U_EXPORT2 utext_next32From (UText *ut, int64_t nativeIndex)
U_STABLE UText *U_EXPORT2 utext_openUChars (UText *ut, const UChar *s, int64_t length, UErrorCode *status)
U_STABLE UText *U_EXPORT2 utext_openUTF8 (UText *ut, const char *s, int64_t length, UErrorCode *status)
U_STABLE UChar32 U_EXPORT2 utext_previous32 (UText *ut)
U_STABLE UChar32 U_EXPORT2 utext_previous32From (UText *ut, int64_t nativeIndex)
U_STABLE int32_t U_EXPORT2 utext_replace (UText *ut, int64_t nativeStart, int64_t nativeLimit, const UChar *replacementText, int32_t replacementLength, UErrorCode *status)
U_STABLE void U_EXPORT2 utext_setNativeIndex (UText *ut, int64_t nativeIndex)
U_STABLE UText *U_EXPORT2 utext_setup (UText *ut, int32_t extraSpace, UErrorCode *status)


Generated by Doxygen 1.6.0