StringSearch Class Reference

#include <stsearch.h>

Public Member Functions

StringSearch * clone () const
int32_t first (UErrorCode &status)
int32_t following (int32_t position, UErrorCode &status)
USearchAttributeValue getAttribute (USearchAttribute attribute) const
const BreakIterator * getBreakIterator (void) const
RuleBasedCollator * getCollator () const
virtual UClassID getDynamicClassID () const
int32_t getMatchedLength (void) const
int32_t getMatchedStart (void) const
void getMatchedText (UnicodeString &result) const
virtual int32_t getOffset (void) const
const UnicodeString & getPattern () const
const UnicodeString & getText (void) const
int32_t last (UErrorCode &status)
int32_t next (UErrorCode &status)
UBool operator!= (const SearchIterator &that) const
StringSearch & operator= (const StringSearch &that)
virtual UBool operator== (const SearchIterator &that) const
int32_t preceding (int32_t position, UErrorCode &status)
int32_t previous (UErrorCode &status)
virtual void reset ()
virtual SearchIterator * safeClone (void) const
void setAttribute (USearchAttribute attribute, USearchAttributeValue value, UErrorCode &status)
void setBreakIterator (BreakIterator *breakiter, UErrorCode &status)
void setCollator (RuleBasedCollator *coll, UErrorCode &status)
virtual void setOffset (int32_t position, UErrorCode &status)
void setPattern (const UnicodeString &pattern, UErrorCode &status)
virtual void setText (const UnicodeString &text, UErrorCode &status)
virtual void setText (CharacterIterator &text, UErrorCode &status)
 StringSearch (const UnicodeString &pattern, const UnicodeString &text, const Locale &locale, BreakIterator *breakiter, UErrorCode &status)
 StringSearch (const UnicodeString &pattern, const UnicodeString &text, RuleBasedCollator *coll, BreakIterator *breakiter, UErrorCode &status)
 StringSearch (const UnicodeString &pattern, CharacterIterator &text, const Locale &locale, BreakIterator *breakiter, UErrorCode &status)
 StringSearch (const UnicodeString &pattern, CharacterIterator &text, RuleBasedCollator *coll, BreakIterator *breakiter, UErrorCode &status)
 StringSearch (const StringSearch &that)
virtual ~StringSearch (void)

Static Public Member Functions

static UClassID U_EXPORT2 getStaticClassID ()
static void U_EXPORT2 operator delete (void *p) U_NO_THROW
static void U_EXPORT2 operator delete (void *, void *) U_NO_THROW
static void U_EXPORT2 operator delete[] (void *p) U_NO_THROW
static void *U_EXPORT2 operator new (size_t size) U_NO_THROW
static void *U_EXPORT2 operator new (size_t, void *ptr) U_NO_THROW
static void *U_EXPORT2 operator new[] (size_t size) U_NO_THROW

Protected Member Functions

virtual int32_t handleNext (int32_t position, UErrorCode &status)
virtual int32_t handlePrev (int32_t position, UErrorCode &status)
virtual void setMatchLength (int32_t length)
void setMatchNotFound ()
virtual void setMatchStart (int32_t position)

Protected Attributes

BreakIterator * m_breakiterator_
USearch * m_search_
UnicodeString m_text_

Private Attributes

RuleBasedCollator m_collator_
UnicodeString m_pattern_
UStringSearch * m_strsrch_

Detailed Description

StringSearch is a SearchIterator that provides language-sensitive text searching based on the comparison rules defined in a RuleBasedCollator object. StringSearch handles language-specific peculiarities; for example, with the German collator, the characters ß and SS match when case is ignored. See the "ICU Collation Design Document" for more information.

The algorithm implemented is a modified form of Boyer-Moore search. For more information on the algorithm, see "Efficient Text Searching in Java", published in Java Report in February 1999.
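To give a feel for the skip-ahead idea behind Boyer-Moore style searching, here is a minimal sketch of the simpler Boyer-Moore-Horspool variant over raw bytes. It is an illustration under stated assumptions, not ICU's implementation: StringSearch compares collation elements rather than bytes, and the function name `horspoolFind` is hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Boyer-Moore-Horspool search over raw bytes (illustration only; ICU's
// StringSearch works on collation elements, not bytes).
// Returns the index of the first match at or after `from`, or
// std::string::npos if there is none.
std::size_t horspoolFind(const std::string &text,
                         const std::string &pattern,
                         std::size_t from = 0)
{
    const std::size_t m = pattern.size();
    const std::size_t n = text.size();
    assert(m > 0);  // an empty pattern is not meaningful in this sketch
    if (m > n || from > n - m) {
        return std::string::npos;
    }

    // Bad-character table: how far the window may shift when the text
    // byte aligned with the last pattern position is a given value.
    std::array<std::size_t, 256> shift;
    shift.fill(m);
    for (std::size_t i = 0; i + 1 < m; ++i) {
        shift[static_cast<unsigned char>(pattern[i])] = m - 1 - i;
    }

    std::size_t pos = from;
    while (pos <= n - m) {
        // Compare the window against the pattern from right to left.
        std::size_t j = m;
        while (j > 0 && text[pos + j - 1] == pattern[j - 1]) {
            --j;
        }
        if (j == 0) {
            return pos;  // every position matched
        }
        // Skip ahead based on the last byte of the current window.
        pos += shift[static_cast<unsigned char>(text[pos + m - 1])];
    }
    return std::string::npos;
}
```

The table lets the search skip several positions after a mismatch instead of advancing one position at a time, which is the property the modified algorithm in StringSearch relies on as well.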

There are two match options to select from.
Let S' be the substring of a text string S between the offsets start and end, <start, end>.
A pattern string P matches a text string S at the offsets <start, end> if

 
 option 1. Some canonical equivalent of P matches some canonical equivalent 
           of S'
 option 2. P matches S' and if P starts or ends with a combining mark, 
           there exists no non-ignorable combining mark before or after S'
           in S respectively. 
 

Option 2 is the default.

This search has APIs similar to those of other text iteration mechanisms, such as the break iterators in BreakIterator. Using these APIs, it is easy to scan through text looking for all occurrences of a given pattern. The search iterator allows a change of direction by calling reset followed by next or previous. A direction change can also occur without calling reset first, but that comes with some speed penalty. Matches found in the forward direction are the same as those found in the backward direction, in reverse order.

SearchIterator provides APIs to specify the starting position within the text string to be searched, e.g. setOffset, preceding and following. Since the starting position is set exactly as specified, note that the search may return incorrect results when started from any of the following positions:

  • The middle of a substring that requires normalization.
  • When the following match is to be found, the second character of a pair that must be swapped with the preceding character. Conversely, when the preceding match is to be found, the first character of a pair that must be swapped with the next character. For example, certain Thai and Lao characters require swapping.
  • When the following match is to be found, any position within a contracting sequence except the first. Conversely, when the preceding match is to be found, any position within a contracting sequence except the last.

A break iterator can be used if only matches at logical breaks are desired. With a break iterator, only matches that align exactly with the boundaries it reports are returned. For instance, the pattern "e" will not be found in the string "\u00e9" if a character break iterator is used.
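The boundary rule can be sketched without ICU as a filter over candidate matches: a match is kept only if both its start and its end fall on a boundary. In the sketch below, `wordBoundaries` is a hypothetical stand-in for a word BreakIterator that simply treats spaces as separators, and `findAtBoundaries` is likewise an illustrative name, not part of the ICU API.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a word BreakIterator: boundaries are the
// start of the text, the end of the text, and both sides of each space.
std::set<std::size_t> wordBoundaries(const std::string &text)
{
    std::set<std::size_t> bounds{0, text.size()};
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == ' ') {
            bounds.insert(i);
            bounds.insert(i + 1);
        }
    }
    return bounds;
}

// Returns the start offsets of all matches whose start and end both
// coincide with a boundary; other candidate matches are discarded.
std::vector<std::size_t> findAtBoundaries(const std::string &text,
                                          const std::string &pattern)
{
    std::vector<std::size_t> starts;
    if (pattern.empty()) {
        return starts;
    }
    const std::set<std::size_t> bounds = wordBoundaries(text);
    for (std::size_t pos = text.find(pattern);
         pos != std::string::npos;
         pos = text.find(pattern, pos + 1)) {
        const std::size_t end = pos + pattern.size();
        if (bounds.count(pos) != 0 && bounds.count(end) != 0) {
            starts.push_back(pos);
        }
    }
    return starts;
}
```

For the text "foxes and a fox" and the pattern "fox", the candidate at offset 0 is rejected because its end falls inside the word "foxes", while the candidate at offset 12 is kept.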

Options are provided to handle overlapping matches. For example, in English, overlapping matching produces the results 0 and 2 for the pattern "abab" in the text "ababab", whereas mutually exclusive matching produces only the result 0.
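The difference between the two modes comes down to where the search resumes after a match. The following sketch reproduces the "ababab" example with plain `std::string`; it is not ICU code, and `findAll` is a hypothetical helper. In StringSearch itself the mode is selected through setAttribute with the USEARCH_OVERLAP attribute.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Collects the start offsets of all matches of `pattern` in `text`.
// With overlap == true the search resumes one position after the start
// of the previous match; otherwise it resumes after the previous match
// ends, so matches are mutually exclusive.
std::vector<std::size_t> findAll(const std::string &text,
                                 const std::string &pattern,
                                 bool overlap)
{
    std::vector<std::size_t> starts;
    if (pattern.empty()) {
        return starts;
    }
    std::size_t pos = text.find(pattern);
    while (pos != std::string::npos) {
        starts.push_back(pos);
        const std::size_t resume = overlap ? pos + 1 : pos + pattern.size();
        pos = text.find(pattern, resume);
    }
    return starts;
}
```

Calling `findAll("ababab", "abab", true)` yields the offsets 0 and 2, and `findAll("ababab", "abab", false)` yields only 0, matching the results described above.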

Though collator attributes are taken into consideration while performing matches, there are no APIs here for setting and getting them. These attributes can be set by getting the collator from getCollator and using the APIs in coll.h. Finally, reset() has to be called to update StringSearch to the new collator attributes.

Restriction:
Currently there are no composite characters that consist of a character with combining class > 0 before a character with combining class == 0. However, if such a character exists in the future, StringSearch does not guarantee the results for option 1.

Consult the SearchIterator documentation for information on and examples of how to use instances of this class to implement text searching.


 UnicodeString target("The quick brown fox jumps over the lazy dog.");
 UnicodeString pattern("fox");
 UErrorCode    error = U_ZERO_ERROR;
 // A NULL break iterator means matches are not restricted to logical breaks.
 StringSearch iter(pattern, target, Locale::getUS(), NULL, error);
 for (int32_t pos = iter.first(error);
      U_SUCCESS(error) && pos != USEARCH_DONE;
      pos = iter.next(error))
 {
     printf("Found match at %d pos, length is %d\n", pos,
                                             iter.getMatchedLength());
 }
 

Note, StringSearch is not to be subclassed.

See also:
SearchIterator
RuleBasedCollator
Since:
ICU 2.0

Definition at line 138 of file stsearch.h.

