uspoof.h File Reference

Unicode Security and Spoofing Detection, C API.

#include "unicode/utypes.h"
#include "unicode/uset.h"
#include "unicode/parseerr.h"
#include "unicode/localpointer.h"

Typedefs

typedef struct USpoofChecker USpoofChecker
typedef enum USpoofChecks USpoofChecks

Enumerations

enum  USpoofChecks {
  USPOOF_SINGLE_SCRIPT_CONFUSABLE = 1,
  USPOOF_MIXED_SCRIPT_CONFUSABLE = 2,
  USPOOF_WHOLE_SCRIPT_CONFUSABLE = 4,
  USPOOF_ANY_CASE = 8,
  USPOOF_SINGLE_SCRIPT = 16,
  USPOOF_INVISIBLE = 32,
  USPOOF_CHAR_LIMIT = 64,
  USPOOF_ALL_CHECKS = 0x7f
}

Functions

U_STABLE int32_t U_EXPORT2 uspoof_areConfusable (const USpoofChecker *sc, const UChar *s1, int32_t length1, const UChar *s2, int32_t length2, UErrorCode *status)
U_STABLE int32_t U_EXPORT2 uspoof_areConfusableUTF8 (const USpoofChecker *sc, const char *s1, int32_t length1, const char *s2, int32_t length2, UErrorCode *status)
U_STABLE int32_t U_EXPORT2 uspoof_check (const USpoofChecker *sc, const UChar *text, int32_t length, int32_t *position, UErrorCode *status)
U_STABLE int32_t U_EXPORT2 uspoof_checkUTF8 (const USpoofChecker *sc, const char *text, int32_t length, int32_t *position, UErrorCode *status)
U_STABLE USpoofChecker *U_EXPORT2 uspoof_clone (const USpoofChecker *sc, UErrorCode *status)
U_STABLE void U_EXPORT2 uspoof_close (USpoofChecker *sc)
U_STABLE const USet *U_EXPORT2 uspoof_getAllowedChars (const USpoofChecker *sc, UErrorCode *status)
U_STABLE const char *U_EXPORT2 uspoof_getAllowedLocales (USpoofChecker *sc, UErrorCode *status)
U_STABLE int32_t U_EXPORT2 uspoof_getChecks (const USpoofChecker *sc, UErrorCode *status)
U_STABLE int32_t U_EXPORT2 uspoof_getSkeleton (const USpoofChecker *sc, uint32_t type, const UChar *s, int32_t length, UChar *dest, int32_t destCapacity, UErrorCode *status)
U_STABLE int32_t U_EXPORT2 uspoof_getSkeletonUTF8 (const USpoofChecker *sc, uint32_t type, const char *s, int32_t length, char *dest, int32_t destCapacity, UErrorCode *status)
U_STABLE USpoofChecker *U_EXPORT2 uspoof_open (UErrorCode *status)
U_STABLE USpoofChecker *U_EXPORT2 uspoof_openFromSerialized (const void *data, int32_t length, int32_t *pActualLength, UErrorCode *pErrorCode)
U_STABLE USpoofChecker *U_EXPORT2 uspoof_openFromSource (const char *confusables, int32_t confusablesLen, const char *confusablesWholeScript, int32_t confusablesWholeScriptLen, int32_t *errType, UParseError *pe, UErrorCode *status)
U_STABLE int32_t U_EXPORT2 uspoof_serialize (USpoofChecker *sc, void *data, int32_t capacity, UErrorCode *status)
U_STABLE void U_EXPORT2 uspoof_setAllowedChars (USpoofChecker *sc, const USet *chars, UErrorCode *status)
U_STABLE void U_EXPORT2 uspoof_setAllowedLocales (USpoofChecker *sc, const char *localesList, UErrorCode *status)
U_STABLE void U_EXPORT2 uspoof_setChecks (USpoofChecker *sc, int32_t checks, UErrorCode *status)

Detailed Description

Unicode Security and Spoofing Detection, C API.

These functions are intended to check strings, typically identifiers of some type, such as URLs, for the presence of characters that are likely to be visually confusing - for cases where the displayed form of an identifier may not be what it appears to be.

Unicode Technical Report #36, "Unicode Security Considerations" (http://unicode.org/reports/tr36), and Unicode Technical Standard #39, "Unicode Security Mechanisms" (http://unicode.org/reports/tr39), give more background on security and spoofing issues with Unicode identifiers. The tests and checks provided by this module implement the recommendations from those Unicode documents.

The tests available on identifiers fall into two general categories:

  1. Single identifier tests. Check whether an identifier is potentially confusable with any other string, or is suspicious for other reasons.
  2. Two identifier tests. Check whether two specific identifiers are confusable. This does not consider whether either of the strings is potentially confusable with any string other than the exact one specified.

The steps to perform confusability testing are:

  1. Open a USpoofChecker.
  2. Configure the USpoofChecker for the desired set of tests. The tests that will be performed are specified by a set of USpoofChecks flags.
  3. Perform the checks using the pre-configured USpoofChecker. The results indicate which (if any) of the selected tests have identified possible problems with the identifier. Results are reported as a set of USpoofChecks flags; this mirrors the form in which the set of tests to perform was originally specified to the USpoofChecker.

A USpoofChecker may be used repeatedly to perform checks on any number of identifiers.

Thread Safety: The test functions for checking a single identifier, or for testing whether two identifiers are possibly confusable, are thread safe. They may be called concurrently, from multiple threads, using the same USpoofChecker instance.

More generally, the standard ICU thread safety rules apply: functions that take a const USpoofChecker parameter are thread safe. Those that take a non-const USpoofChecker are not thread safe.

Descriptions of the available checks.

When testing whether pairs of identifiers are confusable, with the uspoof_areConfusable() family of functions, the relevant tests are:

  1. USPOOF_SINGLE_SCRIPT_CONFUSABLE: All of the characters from the two identifiers are from a single script, and the two identifiers are visually confusable.
  2. USPOOF_MIXED_SCRIPT_CONFUSABLE: At least one of the identifiers contains characters from more than one script, and the two identifiers are visually confusable.
  3. USPOOF_WHOLE_SCRIPT_CONFUSABLE: Each of the two identifiers is of a single script, but the two identifiers are from different scripts, and they are visually confusable.

The safest approach is to enable all three of these checks as a group.

USPOOF_ANY_CASE is a modifier for the above tests. If the identifiers being checked can be of mixed case and are used in a case-sensitive manner, this option should be specified.

If the identifiers being checked are used in a case-insensitive manner, and if they are displayed to users in lower-case form only, the USPOOF_ANY_CASE option should not be specified. In that configuration, confusability issues involving upper-case letters will not be reported.

When performing tests on a single identifier, with the uspoof_check() family of functions, the relevant tests are:

  1. USPOOF_MIXED_SCRIPT_CONFUSABLE: the identifier contains characters from multiple scripts, and there exists an identifier of a single script that is visually confusable.
  2. USPOOF_WHOLE_SCRIPT_CONFUSABLE: the identifier consists of characters from a single script, and there exists a visually confusable identifier. The visually confusable identifier also consists of characters from a single script, but not the same script as the identifier being checked.
  3. USPOOF_ANY_CASE: modifies the mixed script and whole script confusables tests. If specified, the checks will consider confusable characters of any case. If this flag is not set, the test is performed assuming case folded identifiers.
  4. USPOOF_SINGLE_SCRIPT: check that the identifier contains only characters from a single script. (Characters from the 'common' and 'inherited' scripts are ignored.) This is not a test for confusable identifiers.
  5. USPOOF_INVISIBLE: check an identifier for the presence of invisible characters, such as zero-width spaces, or character sequences that are likely not to display, such as multiple occurrences of the same non-spacing mark. This check does not test the input string as a whole for conformance to any particular syntax for identifiers.
  6. USPOOF_CHAR_LIMIT: check that an identifier contains only characters from a specified set of acceptable characters. See uspoof_setAllowedChars() and uspoof_setAllowedLocales().

Note on Scripts: Characters from the Unicode Scripts "Common" and "Inherited" are ignored when considering the script of an identifier. Common characters include digits and symbols that are normally used with text from more than one script.

Identifier Skeletons: A skeleton is a transformation of an identifier, such that all identifiers that are confusable with each other have the same skeleton. Using skeletons, it is possible to build a dictionary data structure for a set of identifiers, and then quickly test whether a new identifier is confusable with an identifier already in the set. The uspoof_getSkeleton() family of functions will produce the skeleton from an identifier.

Note that skeletons are not guaranteed to be stable between versions of Unicode or ICU, so applications should not rely on creating a permanent, or difficult to update, database of skeletons. Instabilities result from identifying new pairs or sequences of characters that are visually confusable, and thus must be mapped to the same skeleton character(s).

Definition in file uspoof.h.


Generated by Doxygen 1.6.0