Argument values for whether span() and similar functions continue while the current character is contained vs. not contained in the set.

The functionality is straightforward for sets with only single code points, without strings (which is the common case):

  • USET_SPAN_CONTAINED and USET_SPAN_SIMPLE work the same.
  • span() and spanBack() partition any string the same way when alternating between span(USET_SPAN_NOT_CONTAINED) and span(either "contained" condition).
  • Using a complemented (inverted) set and the opposite span conditions yields the same results.
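
For sets with single code points only, the alternation described above behaves much like the C standard library's strspn() and strcspn() on byte strings. The following is a minimal sketch of that analogy, not ICU code: partition_by_set is a hypothetical helper, and it assumes plain ASCII text and a set given as a string of ASCII characters.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of ICU): records the contained/not-contained
 * boundaries of s in bounds[] and returns how many were written.
 * strcspn() plays the role of a "not contained" span,
 * strspn() the role of a "contained" span.
 */
size_t partition_by_set(const char *s, const char *set_chars,
                        size_t *bounds, size_t max_bounds)
{
    size_t n = 0;
    size_t pos = 0;
    size_t len = strlen(s);
    int contained = 0;  /* start with the "not contained" condition */

    if (max_bounds == 0) {
        return 0;
    }
    bounds[n++] = 0;
    while (pos < len && n < max_bounds) {
        size_t run = contained ? strspn(s + pos, set_chars)
                               : strcspn(s + pos, set_chars);
        pos += run;
        if (run > 0) {
            bounds[n++] = pos;  /* an empty span adds no new boundary */
        }
        contained = !contained;  /* alternate the span condition */
    }
    return n;
}
```

With the set "0123456789", the string "abc123de" is partitioned at offsets 0, 3, 6 and 8 in this model.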

When a set contains multi-code-point strings, these statements may not hold, depending on the strings in the set (for example, whether they overlap with each other) and on the string being processed. For a set with strings:

  • The complement of the set contains the opposite set of code points, but the same set of strings. Therefore, complementing both the set and the span conditions may yield different results.
  • When starting spans at different positions in a string (span(s, ...) vs. span(s+1, ...)) the ends of the spans may be different because a set string may start before the later position.
  • span(USET_SPAN_SIMPLE) may be shorter than span(USET_SPAN_CONTAINED) because it will not recursively try all possible paths. For example, with a set which contains the three strings "xy", "xya" and "ax", span("xyax", USET_SPAN_CONTAINED) will return 4 but span("xyax", USET_SPAN_SIMPLE) will return 3. span(USET_SPAN_SIMPLE) will never be longer than span(USET_SPAN_CONTAINED).
  • With either "contained" condition, span() and spanBack() may partition a string in different ways. For example, with a set which contains the two strings "ab" and "ba", and when processing the string "aba", span() will yield contained/not-contained boundaries of { 0, 2, 3 } while spanBack() will yield boundaries of { 0, 1, 3 }.
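
The difference between the two "contained" conditions can be reproduced with a small model. This is only a sketch under stated assumptions, not ICU's implementation: the set is a plain array of ASCII C strings (single characters are one-character strings), and model_span_simple, model_span_contained and matches_at are hypothetical names. The "contained" model tries every concatenation by tracking all reachable offsets; the "simple" model takes the longest element at each position and never backtracks.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if the non-empty element elem occurs in s at offset pos. */
static int matches_at(const char *s, size_t len, size_t pos, const char *elem)
{
    size_t elen = strlen(elem);
    return elen > 0 && elen <= len - pos && memcmp(s + pos, elem, elen) == 0;
}

/* Model of the "simple" condition: longest single element at each position. */
size_t model_span_simple(const char *s, const char *const *set, size_t set_size)
{
    size_t len = strlen(s);
    size_t pos = 0;

    while (pos < len) {
        size_t best = 0;
        for (size_t i = 0; i < set_size; i++) {
            size_t elen = strlen(set[i]);
            if (elen > best && matches_at(s, len, pos, set[i])) {
                best = elen;
            }
        }
        if (best == 0) {
            break;  /* no set element starts here: the span ends */
        }
        pos += best;
    }
    return pos;
}

/* Model of the "contained" condition: longest prefix that is some
 * concatenation of set elements. Returns 0 if allocation fails. */
size_t model_span_contained(const char *s, const char *const *set, size_t set_size)
{
    size_t len = strlen(s);
    size_t longest = 0;
    unsigned char *reachable = calloc(len + 1, 1);

    if (reachable == NULL) {
        return 0;
    }
    reachable[0] = 1;
    for (size_t pos = 0; pos < len; pos++) {
        if (!reachable[pos]) {
            continue;
        }
        for (size_t i = 0; i < set_size; i++) {
            if (matches_at(s, len, pos, set[i])) {
                reachable[pos + strlen(set[i])] = 1;
            }
        }
    }
    for (size_t pos = 0; pos <= len; pos++) {
        if (reachable[pos]) {
            longest = pos;
        }
    }
    free(reachable);
    return longest;
}
```

For the set { "xy", "xya", "ax" } and the string "xyax", this model gives 4 for the "contained" variant ("xy" followed by "ax") and 3 for the "simple" variant (it commits to "xya" and then finds no element at "x").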

Note: If it is important to get the same boundaries whether iterating forward or backward through a string, then either only span() should be used and the boundaries cached for backward operation, or an ICU BreakIterator could be used.
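
The forward/backward difference can be illustrated with a small longest-match model. It is a sketch under assumptions, not ICU code: the set is an array of ASCII C strings, and forward_boundaries, backward_boundaries, longest_from and longest_until are hypothetical helpers. Keeping the array produced by the forward pass is one way to cache the boundaries for later backward iteration.

```c
#include <stddef.h>
#include <string.h>

/* Length of the longest set element that starts at pos (0 if none). */
static size_t longest_from(const char *s, size_t len, size_t pos,
                           const char *const *set, size_t set_size)
{
    size_t best = 0;
    for (size_t i = 0; i < set_size; i++) {
        size_t elen = strlen(set[i]);
        if (elen > best && elen <= len - pos &&
            memcmp(s + pos, set[i], elen) == 0) {
            best = elen;
        }
    }
    return best;
}

/* Length of the longest set element that ends at pos (0 if none). */
static size_t longest_until(const char *s, size_t pos,
                            const char *const *set, size_t set_size)
{
    size_t best = 0;
    for (size_t i = 0; i < set_size; i++) {
        size_t elen = strlen(set[i]);
        if (elen > best && elen <= pos &&
            memcmp(s + pos - elen, set[i], elen) == 0) {
            best = elen;
        }
    }
    return best;
}

/* Boundaries found scanning forward, in ascending order; returns the count. */
size_t forward_boundaries(const char *s, const char *const *set, size_t set_size,
                          size_t *out, size_t max_out)
{
    size_t len = strlen(s), pos = 0, n = 0;

    if (max_out == 0) {
        return 0;
    }
    out[n++] = 0;
    while (pos < len && n < max_out) {
        int contained = longest_from(s, len, pos, set, set_size) > 0;
        while (pos < len) {
            size_t step = longest_from(s, len, pos, set, set_size);
            if ((step > 0) != contained) {
                break;  /* the condition flips here: span boundary */
            }
            pos += contained ? step : 1;
        }
        out[n++] = pos;
    }
    return n;
}

/* Boundaries found scanning backward, in descending order; returns the count. */
size_t backward_boundaries(const char *s, const char *const *set, size_t set_size,
                           size_t *out, size_t max_out)
{
    size_t pos = strlen(s), n = 0;

    if (max_out == 0) {
        return 0;
    }
    out[n++] = pos;
    while (pos > 0 && n < max_out) {
        int contained = longest_until(s, pos, set, set_size) > 0;
        while (pos > 0) {
            size_t step = longest_until(s, pos, set, set_size);
            if ((step > 0) != contained) {
                break;
            }
            pos -= contained ? step : 1;
        }
        out[n++] = pos;
    }
    return n;
}
```

For the set { "ab", "ba" } and the string "aba", the forward pass of this model records 0, 2, 3 and the backward pass records 3, 1, 0, which matches the boundaries described above.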

Note: Unpaired surrogates are treated like surrogate code points. Similarly, set strings match only on code point boundaries, never in the middle of a surrogate pair. Illegal UTF-8 sequences are treated like U+FFFD. When processing UTF-8 strings, malformed set strings (strings with unpaired surrogates which cannot be converted to UTF-8) are ignored.

Stable: ICU 3.8

Enumerator:
USET_SPAN_NOT_CONTAINED  Continue a span() while there is no set element at the current position. Stops before the first set element (character or string). (For code points only, this is like while contains(current)==FALSE).

When span() returns, the substring between where it started and the position it returned consists only of characters that are not in the set, and none of its strings overlap with the span.

Stable: ICU 3.8

USET_SPAN_CONTAINED  Continue a span() while there is a set element at the current position. (For characters only, this is like while contains(current)==TRUE).

When span() returns, the substring between where it started and the position it returned consists only of set elements (characters or strings) that are in the set.

If a set contains strings, then the span will be the longest substring matching any of the possible concatenations of set elements (characters or strings). (There must be a single, non-overlapping concatenation of characters or strings.) This is equivalent to a POSIX regular expression for (OR of each set element)*.
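
The regular-expression equivalence can be tried out with the POSIX regex API. This is a sketch that assumes a POSIX system providing <regex.h>; regex_span is a hypothetical helper, and the pattern is written by hand from the set elements (elements containing regex metacharacters would need escaping). It relies on the POSIX rule that the overall match is the longest one at the leftmost position.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns the end offset of the match of an anchored
 * pattern such as "^(xy|xya|ax)*" at the start of s, or -1 if the pattern
 * does not compile or does not match.
 */
long regex_span(const char *s, const char *pattern)
{
    regex_t re;
    regmatch_t m;
    long end = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0) {
        return -1;
    }
    if (regexec(&re, s, 1, &m, 0) == 0) {
        end = (long)m.rm_eo;  /* end of the overall (longest) match */
    }
    regfree(&re);
    return end;
}
```

On a conforming POSIX implementation, regex_span("xyax", "^(xy|xya|ax)*") should return 4, the same length the text gives for span("xyax", USET_SPAN_CONTAINED) with the set { "xy", "xya", "ax" }.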

Stable: ICU 3.8

USET_SPAN_SIMPLE  Continue a span() while there is a set element at the current position. (For characters only, this is like while contains(current)==TRUE).

When span() returns, the substring between where it started and the position it returned consists only of set elements (characters or strings) that are in the set.

If a set only contains single characters, then this is the same as USET_SPAN_CONTAINED.

If a set contains strings, then the span will be the longest substring with a match at each position with the longest single set element (character or string).

Use this span condition together with other longest-match algorithms, such as ICU converters (ucnv_getUnicodeSet()).

Stable: ICU 3.8

USET_SPAN_CONDITION_COUNT  One more than the last span condition.

Stable: ICU 3.8

Definition at line 156 of file uset.h.

typedef enum USetSpanCondition {
    /**
     * Continue a span() while there is no set element at the current position.
     * Stops before the first set element (character or string).
     * (For code points only, this is like while contains(current)==FALSE).
     *
     * When span() returns, the substring between where it started and the position
     * it returned consists only of characters that are not in the set,
     * and none of its strings overlap with the span.
     *
     * @stable ICU 3.8
     */
    USET_SPAN_NOT_CONTAINED = 0,
    /**
     * Continue a span() while there is a set element at the current position.
     * (For characters only, this is like while contains(current)==TRUE).
     *
     * When span() returns, the substring between where it started and the position
     * it returned consists only of set elements (characters or strings) that are in the set.
     *
     * If a set contains strings, then the span will be the longest substring
     * matching any of the possible concatenations of set elements (characters or strings).
     * (There must be a single, non-overlapping concatenation of characters or strings.)
     * This is equivalent to a POSIX regular expression for (OR of each set element)*.
     *
     * @stable ICU 3.8
     */
    USET_SPAN_CONTAINED = 1,
    /**
     * Continue a span() while there is a set element at the current position.
     * (For characters only, this is like while contains(current)==TRUE).
     *
     * When span() returns, the substring between where it started and the position
     * it returned consists only of set elements (characters or strings) that are in the set.
     *
     * If a set only contains single characters, then this is the same
     * as USET_SPAN_CONTAINED.
     *
     * If a set contains strings, then the span will be the longest substring
     * with a match at each position with the longest single set element (character or string).
     *
     * Use this span condition together with other longest-match algorithms,
     * such as ICU converters (ucnv_getUnicodeSet()).
     *
     * @stable ICU 3.8
     */
    USET_SPAN_SIMPLE = 2,
    /**
     * One more than the last span condition.
     * @stable ICU 3.8
     */
    USET_SPAN_CONDITION_COUNT
} USetSpanCondition;

