Class ScriptCategory

java.lang.Object
org.apache.tika.langdetect.charsoup.ScriptCategory

public final class ScriptCategory extends Object
Coarse Unicode script categories for language detection.

The full Character.UnicodeScript enum has roughly 160 values, far finer granularity than a language detector needs. This class instead maps scripts onto a small set of categories that matter for detection:

  • Scripts that cover multiple confusable languages (Latin, Cyrillic, Arabic)
  • CJK scripts that need special n-gram treatment (Han, Hiragana, Katakana, Hangul)
  • Major Indic and Southeast Asian scripts
  • Everything else bucketed into OTHER
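The bucketing described above can be sketched as a switch over Character.UnicodeScript. This is a minimal illustration, not the class's actual code: the class name CoarseScriptSketch, the nested Category enum, and the particular scripts listed are assumptions made for the example, and the real ScriptCategory constants and ID values may differ.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical sketch: collapses Character.UnicodeScript into a few coarse categories.
public final class CoarseScriptSketch {

    // Illustrative categories only; ordinal() stands in for the category ID.
    public enum Category {
        OTHER, LATIN, CYRILLIC, ARABIC, HAN, HIRAGANA, KATAKANA, HANGUL, DEVANAGARI, THAI
    }

    private CoarseScriptSketch() {
    }

    // Maps a code point to its coarse category; unlisted scripts fall into OTHER.
    public static Category of(int codePoint) {
        switch (UnicodeScript.of(codePoint)) {
            case LATIN:      return Category.LATIN;
            case CYRILLIC:   return Category.CYRILLIC;
            case ARABIC:     return Category.ARABIC;
            case HAN:        return Category.HAN;
            case HIRAGANA:   return Category.HIRAGANA;
            case KATAKANA:   return Category.KATAKANA;
            case HANGUL:     return Category.HANGUL;
            case DEVANAGARI: return Category.DEVANAGARI;
            case THAI:       return Category.THAI;
            default:         return Category.OTHER;
        }
    }
}
```

For instance, CoarseScriptSketch.of('a') yields LATIN, while a Greek letter such as U+03B1 is not listed in this sketch and so falls into OTHER.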

The category ID (0–15) is used as a salt byte in feature hashing, ensuring that characters from different scripts never collide in the bucket space.
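To illustrate the salting idea, the sketch below mixes a category ID into a hash before the n-gram's characters. It is an assumption-laden example, not the detector's implementation: the class SaltedHashSketch, its method names, the numBuckets parameter, and the choice of 32-bit FNV-1a as the hash function are all hypothetical.

```java
// Hypothetical sketch: category ID used as a leading salt byte in a feature hash.
public final class SaltedHashSketch {

    // Standard 32-bit FNV-1a constants (the hash choice itself is an assumption).
    private static final int FNV_OFFSET = 0x811C9DC5;
    private static final int FNV_PRIME = 0x01000193;

    private SaltedHashSketch() {
    }

    // Hashes the salt byte first, then each UTF-16 unit of the n-gram as two bytes.
    public static int hash(int categoryId, String ngram) {
        int h = FNV_OFFSET;
        h = (h ^ (categoryId & 0xFF)) * FNV_PRIME;
        for (int i = 0; i < ngram.length(); i++) {
            char c = ngram.charAt(i);
            h = (h ^ (c & 0xFF)) * FNV_PRIME;  // low byte
            h = (h ^ (c >>> 8)) * FNV_PRIME;   // high byte
        }
        return h;
    }

    // Reduces the hash to a non-negative bucket index in [0, numBuckets).
    public static int bucket(int categoryId, String ngram, int numBuckets) {
        return Math.floorMod(hash(categoryId, ngram), numBuckets);
    }
}
```

In this sketch, each mixing step is invertible on 32-bit integers, so two different salt values always produce different 32-bit hashes for the same n-gram. Whether the resulting bucket indices also stay separate depends on how the hash is reduced to the bucket space, which the description above does not spell out.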