java.lang.Object

org.apache.tika.langdetect.charsoup.ScriptCategory

public final class ScriptCategory extends Object

Coarse Unicode script categories for language detection.

The full Character.UnicodeScript enum has ~160 values, far more granularity than needed. This class maps scripts into a small set of categories that matter for language detection:

- Scripts that cover multiple confusable languages (Latin, Cyrillic, Arabic)
- CJK scripts that need special n-gram treatment (Han, Hiragana, Katakana, Hangul)
- Major Indic and Southeast Asian scripts
- Everything else bucketed into OTHER

The category ID, a value in [0, COUNT), is used as a salt byte in feature hashing, so that characters from different scripts are kept apart in the bucket space.
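The salting idea can be sketched as follows. This is a minimal illustration, not the hashing code used by the library: the class name `SaltedHashSketch`, the choice of FNV-1a as the hash function, the bigram input, and the bucket count are all assumptions made for the example. The only thing taken from the description above is that the script category ID is mixed into the hash as a salt byte.

```java
/** Hypothetical sketch of salted feature hashing; not part of Tika. */
final class SaltedHashSketch {
    // Standard 32-bit FNV-1a parameters (an assumed choice of hash function).
    private static final int FNV_OFFSET = 0x811C9DC5;
    private static final int FNV_PRIME = 0x01000193;

    /** Mix the low 8 bits of b into the running FNV-1a state. */
    private static int mix(int h, int b) {
        return (h ^ (b & 0xFF)) * FNV_PRIME;
    }

    /** Hash a codepoint bigram, mixing the script category in first as a salt byte. */
    static int hash(int category, int cp1, int cp2) {
        int h = mix(FNV_OFFSET, category);
        for (int cp : new int[] {cp1, cp2}) {
            // Feed each codepoint one byte at a time, lowest byte first.
            for (int shift = 0; shift < 32; shift += 8) {
                h = mix(h, cp >>> shift);
            }
        }
        return h;
    }

    /** Reduce the salted hash to a bucket index in [0, numBuckets). */
    static int bucket(int category, int cp1, int cp2, int numBuckets) {
        return Math.floorMod(hash(category, cp1, cp2), numBuckets);
    }
}
```

In this sketch, the same bigram hashed under two different salts always yields two different 32-bit hashes, because each FNV-1a step is invertible. Once the hash is reduced to a bucket index, distinct inputs can still share a bucket, so the salt separates scripts in the hash input rather than reserving disjoint bucket ranges.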

Field Summary

| Modifier and Type | Field | Description |
| --- | --- | --- |
| static final int | ARABIC | |
| static final int | ARMENIAN | |
| static final int | BENGALI | |
| static final int | BOPOMOFO | |
| static final int | CANADIAN_ABORIGINAL | |
| static final int | COUNT | Number of distinct categories. |
| static final int | CYRILLIC | |
| static final int | DEVANAGARI | |
| static final int | ETHIOPIC | |
| static final int | GEORGIAN | |
| static final int | GREEK | |
| static final int | HAN | |
| static final int | HAN_COMPAT | |
| static final int | HAN_EXT_A | |
| static final int | HAN_EXT_B | |
| static final int | HANGUL | |
| static final int | HEBREW | |
| static final int | HIRAGANA | |
| static final int | KATAKANA | |
| static final int | KHMER | |
| static final int | LATIN | |
| static final int | MYANMAR | |
| static final int | OTHER | |
| static final int | THAI | |
| static final int | TIBETAN | |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| static String | name(int category) | Human-readable name of a category. |
| static int | of(int cp) | Map a codepoint to its coarse script category. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- LATIN
  
  public static final int LATIN
  See Also:
  
  Constant Field Values
- CYRILLIC
  
  public static final int CYRILLIC
  See Also:
  
  Constant Field Values
- ARABIC
  
  public static final int ARABIC
  See Also:
  
  Constant Field Values
- HAN
  
  public static final int HAN
  See Also:
  
  Constant Field Values
- HANGUL
  
  public static final int HANGUL
  See Also:
  
  Constant Field Values
- HIRAGANA
  
  public static final int HIRAGANA
  See Also:
  
  Constant Field Values
- KATAKANA
  
  public static final int KATAKANA
  See Also:
  
  Constant Field Values
- DEVANAGARI
  
  public static final int DEVANAGARI
  See Also:
  
  Constant Field Values
- THAI
  
  public static final int THAI
  See Also:
  
  Constant Field Values
- GREEK
  
  public static final int GREEK
  See Also:
  
  Constant Field Values
- HEBREW
  
  public static final int HEBREW
  See Also:
  
  Constant Field Values
- BENGALI
  
  public static final int BENGALI
  See Also:
  
  Constant Field Values
- GEORGIAN
  
  public static final int GEORGIAN
  See Also:
  
  Constant Field Values
- ARMENIAN
  
  public static final int ARMENIAN
  See Also:
  
  Constant Field Values
- ETHIOPIC
  
  public static final int ETHIOPIC
  See Also:
  
  Constant Field Values
- OTHER
  
  public static final int OTHER
  See Also:
  
  Constant Field Values
- CANADIAN_ABORIGINAL
  
  public static final int CANADIAN_ABORIGINAL
  See Also:
  
  Constant Field Values
- MYANMAR
  
  public static final int MYANMAR
  See Also:
  
  Constant Field Values
- TIBETAN
  
  public static final int TIBETAN
  See Also:
  
  Constant Field Values
- KHMER
  
  public static final int KHMER
  See Also:
  
  Constant Field Values
- HAN_EXT_A
  
  public static final int HAN_EXT_A
  See Also:
  
  Constant Field Values
- HAN_EXT_B
  
  public static final int HAN_EXT_B
  See Also:
  
  Constant Field Values
- HAN_COMPAT
  
  public static final int HAN_COMPAT
  See Also:
  
  Constant Field Values
- BOPOMOFO
  
  public static final int BOPOMOFO
  See Also:
  
  Constant Field Values
- COUNT
  
  public static final int COUNT
  
  Number of distinct categories.
  See Also:
  
  Constant Field Values
Method Details
- of
  
  public static int of(int cp)
  
  Map a codepoint to its coarse script category.
  Uses a fast path for ASCII (treated as Latin) before falling back to Character.UnicodeScript.of(int).
  
  Parameters:
  
  cp - a Unicode codepoint (should already be lowercased)
  
  Returns:
  
  category ID in [0, COUNT)
- name
  
  public static String name(int category)
  
  Human-readable name of a category.
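The lookup described for of(int) can be sketched with JDK classes only. This is a hedged, trimmed-down illustration and not the Tika implementation: the class `CoarseScriptSketch`, its numeric constant values, its reduced set of four categories, the `histogram` helper, and the choice to treat every codepoint below 0x80 as Latin are all assumptions made for the example. Only the overall shape (ASCII fast path, then `Character.UnicodeScript.of(int)`, with unmapped scripts falling into OTHER) comes from the description above.

```java
import java.util.Locale;

/** Hypothetical, trimmed-down stand-in for illustration; not the Tika class. */
final class CoarseScriptSketch {
    // Assumed IDs: the real constant values are not listed on this page.
    static final int LATIN = 0;
    static final int CYRILLIC = 1;
    static final int HAN = 2;
    static final int OTHER = 3;
    static final int COUNT = 4;

    /** Map a codepoint to a coarse category in [0, COUNT). */
    static int of(int cp) {
        // Fast path (assumption): treat all ASCII as Latin without a script lookup.
        if (cp >= 0 && cp < 0x80) {
            return LATIN;
        }
        // Slow path: ask the JDK for the full Unicode script, then coarsen it.
        // Character.UnicodeScript.of throws IllegalArgumentException for invalid codepoints.
        Character.UnicodeScript script = Character.UnicodeScript.of(cp);
        if (script == Character.UnicodeScript.LATIN) {
            return LATIN;
        }
        if (script == Character.UnicodeScript.CYRILLIC) {
            return CYRILLIC;
        }
        if (script == Character.UnicodeScript.HAN) {
            return HAN;
        }
        return OTHER;
    }

    /** Human-readable name of a category; unknown IDs are reported as OTHER. */
    static String name(int category) {
        switch (category) {
            case LATIN: return "LATIN";
            case CYRILLIC: return "CYRILLIC";
            case HAN: return "HAN";
            default: return "OTHER";
        }
    }

    /** Count codepoints per category, lowercasing first as the cp parameter expects. */
    static int[] histogram(String text) {
        int[] counts = new int[COUNT];
        text.toLowerCase(Locale.ROOT).codePoints().forEach(cp -> counts[of(cp)]++);
        return counts;
    }
}
```

Because every returned ID lies in [0, COUNT), an array of length COUNT is enough to hold per-category counts, which is how the hypothetical `histogram` helper uses it.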
