Logo Search packages:      
Sourcecode: icu version File versions  Download package

U_INTERNAL UBool U_EXPORT2 usearch_search ( UStringSearch strsrch,
int32_t  startIdx,
int32_t *  matchStart,
int32_t *  matchLimit,
UErrorCode status 

Simple forward search for the pattern, starting at a specified index, and using using a default set search options.

This is an experimental function, and is not an official part of the ICU API.

The collator options, such as UCOL_STRENGTH and UCOL_NORMALIZTION, are honored.

The UStringSearch options USEARCH_CANONICAL_MATCH, USEARCH_OVERLAP and any Break Iterator are ignored.

Matches obey the following constraints:

Characters at the start or end positions of a match that are ignorable for collation are not included as part of the match, unless they are part of a combining sequence, as described below.

A match will not include a partial combining sequence. Combining character sequences are considered to be inseperable units, and either match the pattern completely, or are considered to not match at all. Thus, for example, an A followed a combining accent mark will not be found when searching for a plain (unaccented) A. (unless the collation strength has been set to ignore all accents).

When beginning a search, the initial starting position, startIdx, is assumed to be an acceptable match boundary with respect to combining characters. A combining sequence that spans across the starting point will not supress a match beginning at startIdx.

Characters that expand to multiple collation elements (German sharp-S becoming 'ss', or the composed forms of accented characters, for example) also must match completely. Searching for a single 's' in a string containing only a sharp-s will find no match.

strsrch the UStringSearch struct, which references both the text to be searched and the pattern being sought.
startIdx The index into the text to begin the search.
matchStart An out parameter, the starting index of the matched text. This parameter may be NULL. A value of -1 will be returned if no match was found.
matchLimit Out parameter, the index of the first position following the matched text. The matchLimit will be at a suitable position for beginning a subsequent search in the input text. This parameter may be NULL. A value of -1 will be returned if no match was found.
status Report any errors. Note that no match found is not an error.
TRUE if a match was found, FALSE otherwise.

For internal use only.

Definition at line 3790 of file usearch.cpp.


    if (U_FAILURE(*status)) {
        return FALSE;

    // TODO:  reject search patterns beginning with a combining char.

    if (getenv("USEARCH_DEBUG") != NULL) {
        printf("Pattern CEs\n");
        for (int ii=0; ii<strsrch->pattern.CELength; ii++) {
            printf(" %8x", strsrch->pattern.CE[ii]);

    // Input parameter sanity check.
    //  TODO:  should input indicies clip to the text length
    //         in the same way that UText does.
    if(strsrch->pattern.CELength == 0         ||
       startIdx < 0                           ||
       startIdx > strsrch->search->textLength ||
       strsrch->pattern.CE == NULL) {
           *status = U_ILLEGAL_ARGUMENT_ERROR;
           return FALSE;

    if (strsrch->pattern.PCE == NULL) {
        initializePatternPCETable(strsrch, status);

    ucol_setOffset(strsrch->textIter, startIdx, status);
    CEBuffer ceb(strsrch, status);

    int32_t    targetIx = 0;
    const CEI *targetCEI = NULL;
    int32_t    patIx;
    UBool      found;

    int32_t  mStart = -1;
    int32_t  mLimit = -1;
    int32_t  minLimit;
    int32_t  maxLimit;

    // Outer loop moves over match starting positions in the
    //      target CE space.
    // Here we see the target as a sequence of collation elements, resulting from the following:
    // 1. Target characters were decomposed, and (if appropriate) other compressions and expansions are applied
    //    (for example, digraphs such as IJ may be broken into two characters).
    // 2. An int64_t CE weight is determined for each resulting unit (high 16 bits are primary strength, next
    //    16 bits are secondary, next 16 (the high 16 bits of the low 32-bit half) are tertiary. Any of these
    //    fields that are for strengths below that of the collator are set to 0. If this makes the int64_t
    //    CE weight 0 (as for a combining diacritic with secondary weight when the collator strentgh is primary),
    //    then the CE is deleted, so the following code sees only CEs that are relevant.
    // For each CE, the lowIndex and highIndex correspond to where this CE begins and ends in the original text.
    // If lowIndex==highIndex, either the CE resulted from an expansion/decomposition of one of the original text
    // characters, or the CE marks the limit of the target text (in which case the CE weight is UCOL_PROCESSED_NULLORDER).
    for(targetIx=0; ; targetIx++)
        found = TRUE;
        //  Inner loop checks for a match beginning at each
        //  position from the outer loop.
        int32_t targetIxOffset = 0;
        int64_t patCE = 0;
        for (patIx=0; patIx<strsrch->pattern.PCELength; patIx++) {
            patCE = strsrch->pattern.PCE[patIx];
            targetCEI = ceb.get(targetIx+patIx+targetIxOffset);
            //  Compare CE from target string with CE from the pattern.
            //    Note that the target CE will be UCOL_PROCESSED_NULLORDER if we reach the end of input,
            //    which will fail the compare, below.
            UCompareCEsResult ceMatch = compareCE64s(targetCEI->ce, patCE, strsrch->search->elementComparisonType);
            if ( ceMatch == U_CE_NO_MATCH ) {
                found = FALSE;
            } else if ( ceMatch > U_CE_NO_MATCH ) {
                if ( ceMatch == U_CE_SKIP_TARG ) {
                    // redo with same patCE, next targCE
                } else { // ceMatch == U_CE_SKIP_PATN
                    // redo with same targCE, next patCE
        targetIxOffset += strsrch->pattern.PCELength; // this is now the offset in target CE space to end of the match so far

        if (!found && ((targetCEI == NULL) || (targetCEI->ce != UCOL_PROCESSED_NULLORDER))) {
            // No match at this targetIx.  Try again at the next.

        if (!found) {
            // No match at all, we have run off the end of the target text.

        // We have found a match in CE space.
        // Now determine the bounds in string index space.
        //  There still is a chance of match failure if the CE range not correspond to
        //     an acceptable character range.
        const CEI *firstCEI = ceb.get(targetIx);
        const CEI *lastCEI  = ceb.get(targetIx + targetIxOffset - 1);

        mStart   = firstCEI->lowIndex;
        minLimit = lastCEI->lowIndex;

        // Look at the CE following the match.  If it is UCOL_NULLORDER the match
        //   extended to the end of input, and the match is good.

        // Look at the high and low indices of the CE following the match. If
        // they are the same it means one of two things:
        //    1. The match extended to the last CE from the target text, which is OK, or
        //    2. The last CE that was part of the match is in an expansion that extends
        //       to the first CE after the match. In this case, we reject the match.
        if (strsrch->search->elementComparisonType == 0) {
            const CEI *nextCEI  = ceb.get(targetIx + targetIxOffset);
            maxLimit = nextCEI->lowIndex;
            if (nextCEI->lowIndex == nextCEI->highIndex && nextCEI->ce != UCOL_PROCESSED_NULLORDER) {
                found = FALSE;
        } else {
            const CEI *nextCEI;
            for ( ; ; ++targetIxOffset ) {
                nextCEI = ceb.get(targetIx + targetIxOffset);
                maxLimit = nextCEI->lowIndex;
                        // If we are at the end of the target too, match succeeds
                if (  nextCEI->ce == UCOL_PROCESSED_NULLORDER ) {
                // As long as the next CE has primary weight of 0,
                // it is part of the last target element matched by the pattern;
                // make sure it can be part of a match with the last patCE
                if ( (((nextCEI->ce) >> 32) & 0xFFFF0000UL) == 0 ) {
                  UCompareCEsResult ceMatch = compareCE64s(nextCEI->ce, patCE, strsrch->search->elementComparisonType);
                  if ( ceMatch == U_CE_NO_MATCH || ceMatch == U_CE_SKIP_PATN ) {
                        found = FALSE;
                // If lowIndex == highIndex, this target CE is part of an expansion of the last matched
                // target element, but it has non-zero primary weight => match fails
                } else if ( nextCEI->lowIndex == nextCEI->highIndex ) {
                  found = false;
                // Else the target CE is not part of an expansion of the last matched element, match succeeds
                } else {

        // Check for the start of the match being within a combining sequence.
        //   This can happen if the pattern itself begins with a combining char, and
        //   the match found combining marks in the target text that were attached
        //    to something else.
        //   This type of match should be rejected for not completely consuming a
        //   combining sequence.
        if (isBreakBoundary(strsrch, mStart)) {
            found = FALSE;

        // Check for the start of the match being within an Collation Element Expansion,
        //   meaning that the first char of the match is only partially matched.
        //   With exapnsions, the first CE will report the index of the source
        //   character, and all subsequent (expansions) CEs will report the source index of the
        //    _following_ character.
        int32_t secondIx = firstCEI->highIndex;
        if (mStart == secondIx) {
            found = FALSE;

        //  Advance the match end position to the first acceptable match boundary.
        //    This advances the index over any combining charcters.
        mLimit = maxLimit;
        if (minLimit < maxLimit) {
            int32_t nba = nextBoundaryAfter(strsrch, minLimit);

            if (nba >= lastCEI->highIndex) {
                mLimit = nba;

    #ifdef USEARCH_DEBUG
        if (getenv("USEARCH_DEBUG") != NULL) {
            printf("minLimit, maxLimit, mLimit = %d, %d, %d\n", minLimit, maxLimit, mLimit);

        // If advancing to the end of a combining sequence in character indexing space
        //   advanced us beyond the end of the match in CE space, reject this match.
        if (mLimit > maxLimit) {
            found = FALSE;

        if (isBreakBoundary(strsrch, mLimit)) {
            found = FALSE;

        if (! checkIdentical(strsrch, mStart, mLimit)) {
            found = FALSE;

        if (found) {

    #ifdef USEARCH_DEBUG
    if (getenv("USEARCH_DEBUG") != NULL) {
        printf("Target CEs [%d .. %d]\n", ceb.firstIx, ceb.limitIx);
        int32_t  lastToPrint = ceb.limitIx+2;
        for (int ii=ceb.firstIx; ii<lastToPrint; ii++) {
            printf("%8x@%d ", ceb.get(ii)->ce, ceb.get(ii)->srcIndex);
        printf("\n%s\n", found? "match found" : "no match");

    // All Done.  Store back the match bounds to the caller.
    if (found==FALSE) {
        mLimit = -1;
        mStart = -1;

    if (matchStart != NULL) {
        *matchStart= mStart;

    if (matchLimit != NULL) {
        *matchLimit = mLimit;

    return found;

Generated by  Doxygen 1.6.0   Back to index