Class ScriptAwareFeatureExtractor

java.lang.Object
org.apache.tika.langdetect.charsoup.ScriptAwareFeatureExtractor
All Implemented Interfaces:
FeatureExtractor

public class ScriptAwareFeatureExtractor extends Object implements FeatureExtractor
Production feature extractor for the CharSoup language detection model.

Hardcoded to the winning configuration established during the 2026-02 ablation study (flat-16k+tri+suf+pre, 220 languages):

  • Character bigrams with word-boundary sentinels (non-CJK)
  • Character trigrams including boundary trigrams
  • 3-char word suffixes
  • 3-char word prefixes
  • Whole-word unigrams (2–30 codepoints, non-CJK)
  • CJK/kana character unigrams
All features share a single flat hash space.
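
As an illustration of what a shared flat hash space means, the sketch below hashes sentinel-padded character bigrams and 3-char suffixes into one count array. It is not the CharSoup implementation: the class name FlatHashSketch, the '_' sentinel, the type tags, and the use of String.hashCode() are all assumptions made for this example, and it covers only two of the feature types listed above.

```java
final class FlatHashSketch {
    // Hypothetical word-boundary sentinel; the real extractor's choice is not documented here.
    static final char SENTINEL = '_';

    // Hypothetical type tags, so a bigram and a suffix with equal text can land in different buckets.
    static final int TAG_BIGRAM = 1;
    static final int TAG_SUFFIX = 2;

    /** Maps a (feature type, n-gram) pair to a bucket in the single shared array. */
    static int bucket(int tag, String gram, int numBuckets) {
        int h = 31 * gram.hashCode() + tag;
        return Math.floorMod(h, numBuckets); // floorMod keeps the index non-negative
    }

    /** Counts bigram and suffix emissions for whitespace-separated words (UTF-16 chars only). */
    static int[] extract(String text, int numBuckets) {
        int[] counts = new int[numBuckets];
        for (String word : text.split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            String padded = SENTINEL + word + SENTINEL;
            for (int i = 0; i + 2 <= padded.length(); i++) {
                counts[bucket(TAG_BIGRAM, padded.substring(i, i + 2), numBuckets)]++;
            }
            if (word.length() >= 3) {
                counts[bucket(TAG_SUFFIX, word.substring(word.length() - 3), numBuckets)]++;
            }
        }
        return counts;
    }
}
```

In this sketch, FlatHashSketch.extract("the cat", 16384) performs 10 emissions (4 bigrams and 1 suffix per word), and every feature type increments the same array.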

For the fully-parameterized version used during ablation experiments, see ResearchFeatureExtractor in the test module.

  • Field Details

    • FEATURE_FLAGS

      public static final int FEATURE_FLAGS
      Bitmask of CharSoupModel.FLAG_* constants that exactly describes the features this extractor emits. Used by CharSoupModel.getFeatureFlags() so that the model file always reflects the real inference-time feature set.
    • FEATURE_FLAGS_LEGACY

      public static final int FEATURE_FLAGS_LEGACY
      Flags used by models trained before script block features were added.
    • SCRIPT_BASIS

      public static final int SCRIPT_BASIS
    • SCRIPT_TRANS_BASIS

      public static final int SCRIPT_TRANS_BASIS
  • Constructor Details

    • ScriptAwareFeatureExtractor

      public ScriptAwareFeatureExtractor(int numBuckets)
    • ScriptAwareFeatureExtractor

      public ScriptAwareFeatureExtractor(int numBuckets, boolean useScriptBlocks)
  • Method Details

    • extract

      public int[] extract(String rawText)
      Description copied from interface: FeatureExtractor
      Full preprocessing + feature extraction pipeline.
      Specified by:
      extract in interface FeatureExtractor
      Parameters:
      rawText - raw input text (may be null)
      Returns:
      int array of size FeatureExtractor.getNumBuckets() with feature counts
    • extract

      public void extract(String rawText, int[] counts)
      Description copied from interface: FeatureExtractor
      Extract into caller-supplied buffer (zeroed first).
      Specified by:
      extract in interface FeatureExtractor
      Parameters:
      rawText - raw input text (may be null)
      counts - pre-allocated int array of size FeatureExtractor.getNumBuckets() (will be zeroed)
    • extractFromPreprocessed

      public int[] extractFromPreprocessed(String text)
      Description copied from interface: FeatureExtractor
      Extract from already-preprocessed text.
      Specified by:
      extractFromPreprocessed in interface FeatureExtractor
      Parameters:
      text - text already passed through CharSoupFeatureExtractor.preprocess(String)
      Returns:
      int array of size FeatureExtractor.getNumBuckets() with feature counts
    • extractFromPreprocessed

      public void extractFromPreprocessed(String text, int[] counts, boolean clear)
      Description copied from interface: FeatureExtractor
      Extract from already-preprocessed text into a caller-supplied buffer.
      Specified by:
      extractFromPreprocessed in interface FeatureExtractor
      Parameters:
      text - text already passed through CharSoupFeatureExtractor.preprocess(String)
      counts - pre-allocated int array of size FeatureExtractor.getNumBuckets()
      clear - if true, zero the array before extracting; if false, accumulate on top of existing counts
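
      The clear parameter allows one buffer to accumulate counts across several chunks of text. The sketch below shows that pattern against a hypothetical stand-in: the PreprocessedExtractor interface only mirrors the signatures documented here, and ToyExtractor (one emission per UTF-16 char) is invented for the example and does not reflect the real feature set.

```java
import java.util.Arrays;

final class AccumulateSketch {
    /** Hypothetical stand-in that mirrors the documented method signatures. */
    interface PreprocessedExtractor {
        int getNumBuckets();

        void extractFromPreprocessed(String text, int[] counts, boolean clear);
    }

    /** Toy implementation: one emission per UTF-16 char, bucketed by char value. */
    static final class ToyExtractor implements PreprocessedExtractor {
        private final int numBuckets;

        ToyExtractor(int numBuckets) {
            this.numBuckets = numBuckets;
        }

        @Override
        public int getNumBuckets() {
            return numBuckets;
        }

        @Override
        public void extractFromPreprocessed(String text, int[] counts, boolean clear) {
            if (clear) {
                Arrays.fill(counts, 0);
            }
            if (text == null) {
                return;
            }
            for (int i = 0; i < text.length(); i++) {
                counts[text.charAt(i) % numBuckets]++;
            }
        }
    }

    /** Clears the buffer on the first chunk only, then accumulates the remaining chunks. */
    static int[] extractChunks(PreprocessedExtractor fx, String... chunks) {
        int[] counts = new int[fx.getNumBuckets()];
        boolean clear = true;
        for (String chunk : chunks) {
            fx.extractFromPreprocessed(chunk, counts, clear);
            clear = false;
        }
        return counts;
    }
}
```

      Passing clear = true on a later call discards whatever the buffer held, which is the behavior to use when the same array is recycled for an unrelated document.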
    • extractAndCount

      public int extractAndCount(String rawText, int[] counts)
      Description copied from interface: FeatureExtractor
      Extract features into counts and return the total n-gram emission count.

      The count is the raw number of individual n-gram tokens processed before bucket hashing. It is a script-neutral measure of how much signal the input carries: whitespace-only input yields 0, and about 200 characters of typical Latin or CJK prose yields roughly 400. This makes it a suitable threshold variable for length-gated confusables: it is insensitive to padding spaces and punctuation-heavy input, and it naturally accounts for the higher feature density of CJK text relative to Latin text.

      The default implementation sums the feature vector after extraction. This is correct because every emission performs exactly one counts[bucket]++, so the sum equals the total emission count regardless of hash collisions.
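
      The collision argument can be checked with a small self-contained sketch. The bigram-only emitter and the names EmissionCountSketch, emit and sum are assumptions for illustration, not the library's code; the deliberately tiny two-bucket array forces collisions.

```java
import java.util.Arrays;

final class EmissionCountSketch {
    /** Zeroes the buffer, emits one count per character bigram, and returns the emission count. */
    static int emit(String text, int[] counts) {
        Arrays.fill(counts, 0);
        int emissions = 0;
        for (int i = 0; i + 2 <= text.length(); i++) {
            int bucket = Math.floorMod(text.substring(i, i + 2).hashCode(), counts.length);
            counts[bucket]++; // exactly one increment per emission
            emissions++;
        }
        return emissions;
    }

    /** Sums the feature vector, as the default extractAndCount is described as doing. */
    static int sum(int[] counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return total;
    }
}
```

      For the 18-character input "language detection" this emitter produces 17 bigrams, and sum(counts) is also 17 even though all of them share two buckets.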

      Specified by:
      extractAndCount in interface FeatureExtractor
      Parameters:
      rawText - raw input text (may be null)
      counts - pre-allocated int array of size FeatureExtractor.getNumBuckets() (will be zeroed)
      Returns:
      total n-gram emission count (≥ 0)
    • isCjkScript

      public static boolean isCjkScript(int script)
    • isCjkOrKana

      public static boolean isCjkOrKana(int cp)
    • getNumBuckets

      public int getNumBuckets()
      Specified by:
      getNumBuckets in interface FeatureExtractor
      Returns:
      number of hash buckets (feature vector size)
    • getFeatureFlags

      public int getFeatureFlags()
      Description copied from interface: FeatureExtractor
      Returns the bitmask of CharSoupModel FLAG_* constants that describes which feature types this extractor emits.

      This must match the featureFlags stored in any CharSoupModel used with this extractor. A mismatch means the model was trained with a different feature set, so its scores will be meaningless.
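
      A caller can guard against such a mismatch before scoring. The sketch below compares two bitmasks and fails fast; FlagCheckSketch and requireMatchingFlags are hypothetical names, and the check operates on plain ints rather than on the actual CharSoupModel API.

```java
final class FlagCheckSketch {
    /**
     * Throws if the extractor and model bitmasks differ.
     * Both arguments are plain ints here; how they are obtained is left to the caller.
     */
    static void requireMatchingFlags(int extractorFlags, int modelFlags) {
        if (extractorFlags != modelFlags) {
            int differing = extractorFlags ^ modelFlags; // bits set in exactly one of the masks
            throw new IllegalStateException(String.format(
                    "Feature flag mismatch: extractor=0x%x, model=0x%x, differing bits=0x%x",
                    extractorFlags, modelFlags, differing));
        }
    }
}
```

      Reporting the XOR of the two masks shows which feature types are enabled on only one side.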

      Specified by:
      getFeatureFlags in interface FeatureExtractor
      Returns:
      bitmask of active feature flags