Class ScriptCategory
java.lang.Object
org.apache.tika.langdetect.charsoup.ScriptCategory
Coarse Unicode script categories for language detection.
The full Character.UnicodeScript enum has ~160 values, far more
granularity than needed. This class maps scripts into a small set of
categories that matter for language detection:
- Scripts that cover multiple confusable languages (Latin, Cyrillic, Arabic)
- CJK scripts that need special n-gram treatment (Han, Hiragana, Katakana, Hangul)
- Major Indic and Southeast Asian scripts
- Everything else bucketed into OTHER
The category ID, a value in [0, COUNT), is used as a salt byte in feature hashing, ensuring that characters from different scripts never collide in the bucket space.
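The salting idea can be illustrated with a minimal sketch. This page does not say which hash function the library uses, so the example below assumes 32-bit FNV-1a purely for illustration. The class name SaltedHashSketch, the methods saltedHash and bucket, the salt values 0 and 1, and the bucket count 8192 are all hypothetical and not part of the documented API.

```java
// Hypothetical sketch: mixing a script category ID into an n-gram hash as a salt byte.
final class SaltedHashSketch {
    // Standard 32-bit FNV-1a parameters (an assumed hash, not the library's confirmed choice).
    private static final int FNV_OFFSET = 0x811C9DC5;
    private static final int FNV_PRIME = 0x01000193;

    /** FNV-1a over the salt byte, then the four bytes of each codepoint. */
    static int saltedHash(int categoryId, int[] codepoints) {
        int h = FNV_OFFSET;
        // The salt is mixed in first, so it influences every later step.
        h = (h ^ (categoryId & 0xFF)) * FNV_PRIME;
        for (int cp : codepoints) {
            for (int shift = 0; shift < 32; shift += 8) {
                h = (h ^ ((cp >>> shift) & 0xFF)) * FNV_PRIME;
            }
        }
        return h;
    }

    /** Reduce a hash to a non-negative bucket index. */
    static int bucket(int hash, int numBuckets) {
        return Math.floorMod(hash, numBuckets);
    }

    public static void main(String[] args) {
        int[] bigram = {'t', 'h'};
        int a = saltedHash(0, bigram); // salt 0: placeholder category ID
        int b = saltedHash(1, bigram); // salt 1: a different placeholder category ID
        System.out.println("hashes differ: " + (a != b));
        System.out.println("bucket for salt 0: " + bucket(a, 8192));
    }
}
```

In this sketch the same codepoint sequence hashed under two different salts always yields two different 32-bit values, because each FNV-1a step is a bijection on 32-bit integers. How the library partitions its bucket space is not shown on this page.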
-
Field Summary
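Fields
Modifier and Type    Field                  Description
static final int     ARABIC
static final int     ARMENIAN
static final int     BENGALI
static final int     BOPOMOFO
static final int     CANADIAN_ABORIGINAL
static final int     COUNT                  Number of distinct categories.
static final int     CYRILLIC
static final int     DEVANAGARI
static final int     ETHIOPIC
static final int     GEORGIAN
static final int     GREEK
static final int     HAN
static final int     HAN_COMPAT
static final int     HAN_EXT_A
static final int     HAN_EXT_B
static final int     HANGUL
static final int     HEBREW
static final int     HIRAGANA
static final int     KATAKANA
static final int     KHMER
static final int     LATIN
static final int     MYANMAR
static final int     OTHER
static final int     THAI
static final int     TIBETAN
-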
Method Summary
-
Field Details
-
LATIN
public static final int LATIN
-
CYRILLIC
public static final int CYRILLIC
-
ARABIC
public static final int ARABIC
-
HAN
public static final int HAN
-
HANGUL
public static final int HANGUL
-
HIRAGANA
public static final int HIRAGANA
-
KATAKANA
public static final int KATAKANA
-
DEVANAGARI
public static final int DEVANAGARI
-
THAI
public static final int THAI
-
GREEK
public static final int GREEK
-
HEBREW
public static final int HEBREW
-
BENGALI
public static final int BENGALI
-
GEORGIAN
public static final int GEORGIAN
-
ARMENIAN
public static final int ARMENIAN
-
ETHIOPIC
public static final int ETHIOPIC
-
OTHER
public static final int OTHER
-
CANADIAN_ABORIGINAL
public static final int CANADIAN_ABORIGINAL
-
MYANMAR
public static final int MYANMAR
-
TIBETAN
public static final int TIBETAN
-
KHMER
public static final int KHMER
-
HAN_EXT_A
public static final int HAN_EXT_A
-
HAN_EXT_B
public static final int HAN_EXT_B
-
HAN_COMPAT
public static final int HAN_COMPAT
-
BOPOMOFO
public static final int BOPOMOFO
-
COUNT
public static final int COUNT
Number of distinct categories.
-
-
Method Details
-
of
public static int of(int cp)
Map a codepoint to its coarse script category.
Uses a fast path for ASCII (Latin) before falling through to Character.UnicodeScript.of(int).
Parameters:
cp - a Unicode codepoint (should already be lowercased)
Returns:
category ID in [0, COUNT)
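The lookup described above can be sketched as follows. This is not the library's implementation: the class ScriptCategorySketch, the CAT_* constants and their numeric values are hypothetical stand-ins, and only four categories are modelled. The sketch also assumes that every codepoint below 0x80 takes the Latin fast path, including ASCII digits and punctuation.

```java
// Hypothetical sketch of a coarse script lookup with an ASCII fast path.
final class ScriptCategorySketch {
    // Placeholder IDs; the real constant values are not shown on this page.
    static final int CAT_LATIN = 0;
    static final int CAT_CYRILLIC = 1;
    static final int CAT_HAN = 2;
    static final int CAT_OTHER = 3;

    static int of(int cp) {
        // Fast path: treat all of ASCII as Latin without consulting Unicode data.
        if (cp >= 0 && cp < 0x80) {
            return CAT_LATIN;
        }
        // Slow path: ask the JDK for the script, then collapse it to a coarse category.
        switch (Character.UnicodeScript.of(cp)) {
            case LATIN:
                return CAT_LATIN;
            case CYRILLIC:
                return CAT_CYRILLIC;
            case HAN:
                return CAT_HAN;
            default:
                return CAT_OTHER; // every script this sketch does not model
        }
    }

    public static void main(String[] args) {
        // Lowercase each codepoint first, as the parameter description asks.
        "Tika Привет 中文".codePoints()
                .map(cp -> Character.toLowerCase(cp))
                .forEach(cp -> System.out.println(
                        new String(Character.toChars(cp)) + " -> " + of(cp)));
    }
}
```

Callers would typically lowercase each codepoint before the lookup, as main does here with Character.toLowerCase(int).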
-
name
Human-readable name of a category.
-