Class CharSoupFeatureExtractor

java.lang.Object
org.apache.tika.langdetect.charsoup.CharSoupFeatureExtractor

public class CharSoupFeatureExtractor extends Object
Extracts character n-gram features from text using the hashing trick (FNV-1a).

WARNING — DO NOT CHANGE THIS CLASS WITHOUT RETRAINING THE MODEL. This class encodes the exact preprocessing and feature-extraction pipeline that was used when langdetect.bin was trained. Training and inference must be bit-for-bit identical: any change to isTransparent(int), preprocess(String), preprocessNoTruncate(String), extractBigrams(java.lang.String, int[]), or the FNV-1a hash will silently degrade accuracy because the model weights will no longer correspond to the features being computed. Changes here also affect tika-eval tokenization, which calls preprocessNoTruncate(String) and isTransparent(int) directly.

Pipeline

  1. Truncate input at MAX_TEXT_LENGTH chars
  2. Strip URLs and emails (TIKA-2777 bounded patterns)
  3. NFC normalize
  4. Iterate codepoints (surrogate-safe)
  5. Skip transparent characters (see isTransparent(int))
  6. Filter: Character.isLetter(int)
  7. Case fold: Character.toLowerCase(int)
  8. Emit bigrams (and optionally trigrams) with underscore _ sentinels at word boundaries
  9. Hash each n-gram via FNV-1a → bucket index
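
The later steps of this pipeline (3 through 9) can be sketched in a few lines of Java. This is a minimal illustration, not the class's actual implementation: the class name BigramHashSketch, the UTF-8 byte encoding fed to the hash, and the modulo bucket mapping are all assumptions, transparency is simplified to nonspacing marks only, and truncation and URL/email stripping are omitted. The FNV-1a constants are the standard 32-bit parameters.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

/** Illustrative sketch only; bucket indices will NOT match the trained model. */
final class BigramHashSketch {

    // Standard 32-bit FNV-1a offset basis and prime.
    private static final int FNV_OFFSET_BASIS = 0x811C9DC5;
    private static final int FNV_PRIME = 0x01000193;

    /** 32-bit FNV-1a: XOR each byte into the hash, then multiply by the prime. */
    static int fnv1a(byte[] data) {
        int h = FNV_OFFSET_BASIS;
        for (byte b : data) {
            h ^= (b & 0xFF);
            h *= FNV_PRIME;
        }
        return h;
    }

    /** Assumption: the two codepoints are hashed as UTF-8 bytes and reduced modulo numBuckets. */
    static int bucket(int first, int second, int numBuckets) {
        byte[] bytes = new String(new int[] {first, second}, 0, 2)
                .getBytes(StandardCharsets.UTF_8);
        return Math.floorMod(fnv1a(bytes), numBuckets);
    }

    static int[] bigramCounts(String text, int numBuckets) {
        int[] counts = new int[numBuckets];
        if (text == null) {
            return counts;
        }
        String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
        int prev = '_';          // sentinel: we start at a word boundary
        boolean inWord = false;
        for (int i = 0; i < nfc.length(); ) {
            int cp = nfc.codePointAt(i);       // surrogate-safe iteration
            i += Character.charCount(cp);
            if (Character.getType(cp) == Character.NON_SPACING_MARK) {
                continue;                      // simplified "transparent" check
            }
            if (Character.isLetter(cp)) {
                int lower = Character.toLowerCase(cp);
                counts[bucket(prev, lower, numBuckets)]++;
                prev = lower;
                inWord = true;
            } else if (inWord) {
                counts[bucket(prev, '_', numBuckets)]++;  // close the word
                prev = '_';
                inWord = false;
            }
        }
        if (inWord) {
            counts[bucket(prev, '_', numBuckets)]++;      // close the final word
        }
        return counts;
    }
}
```

Under these assumptions, the input "ab" contributes three bigrams (_a, ab, b_), and "AB" produces the same vector because of the case fold.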

Trigram mode

When includeTrigrams is enabled, both bigrams and trigrams are hashed into the same bucket vector. Trigrams are more discriminative than bigrams (e.g., "the" vs "th"+"he"), which improves accuracy on very short texts. The tradeoff is more hash collisions in smaller bucket vectors.
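
To make the bigram/trigram contrast concrete, the following sketch lists the n-grams of a single word padded with underscore sentinels. The NgramSketch class is hypothetical, it works on Java chars rather than codepoints (so it is only suitable for BMP text), and the exact way the real extractor applies sentinels to trigrams is an assumption here.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; assumes one underscore sentinel on each side of the word. */
final class NgramSketch {

    /** Returns all n-grams of "_" + word + "_" in left-to-right order. */
    static List<String> ngrams(String word, int n) {
        String padded = "_" + word + "_";
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            out.add(padded.substring(i, i + n));
        }
        return out;
    }
}
```

With this padding, ngrams("the", 2) yields _t, th, he, e_ and ngrams("the", 3) yields _th, the, he_. In trigram mode both sets would be hashed into the same vector, which is why the number of occupied buckets (and hence the collision rate) grows.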

Transparent character handling

Certain codepoints are treated as transparent — they are skipped entirely during n-gram extraction so that base letters on either side form a contiguous pair. This is critical for Arabic and Hebrew where diacritical marks (harakat, niqqud) are Unicode nonspacing marks (Mn) that would otherwise break words into isolated single-letter fragments, destroying the bigram signal.

See isTransparent(int) for the full list of skipped codepoints.
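
As a rough illustration of the categories documented under isTransparent(int), a check along these lines would skip nonspacing marks, Tatweel, ZWNJ, and ZWJ. This is a sketch written from the description on this page, not a copy of the real method; the TransparentSketch class and its stripTransparent helper are hypothetical.

```java
/** Illustrative sketch only; mirrors the documented categories, not the actual source. */
final class TransparentSketch {

    static boolean isTransparent(int cp) {
        if (cp < 0x0300) {
            return false;                 // fast guard: nothing below U+0300 qualifies
        }
        return cp == 0x0640               // Arabic Tatweel / Kashida
                || cp == 0x200C           // ZWNJ
                || cp == 0x200D           // ZWJ
                || Character.getType(cp) == Character.NON_SPACING_MARK; // Mn
    }

    /** Hypothetical helper: drops transparent codepoints so neighbors become adjacent. */
    static String stripTransparent(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints()
                .filter(cp -> !isTransparent(cp))
                .forEach(sb::appendCodePoint);
        return sb.toString();
    }
}
```

Applied to "كـتـب" (with Tatweel), stripTransparent returns "كتب", so both spellings would produce the same bigrams.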

  • Constructor Details

    • CharSoupFeatureExtractor

      public CharSoupFeatureExtractor(int numBuckets)
      Create an extractor with bigrams only.
      Parameters:
      numBuckets - number of hash buckets (feature vector size)
    • CharSoupFeatureExtractor

      public CharSoupFeatureExtractor(int numBuckets, boolean includeTrigrams)
      Create an extractor with configurable n-gram mode.
      Parameters:
      numBuckets - number of hash buckets (feature vector size)
      includeTrigrams - if true, both bigrams and trigrams are hashed into the bucket vector; if false, only bigrams are used (the default)
  • Method Details

    • extract

      public int[] extract(String rawText)
      Full preprocessing + feature extraction pipeline.
      Parameters:
      rawText - raw input text (may be null)
      Returns:
      int array of size numBuckets with n-gram counts (bigrams, plus trigrams when includeTrigrams is enabled)
    • extract

      public void extract(String rawText, int[] counts)
      Extract features into a caller-supplied buffer, avoiding allocation. The buffer is zeroed and then filled with n-gram counts (bigrams, plus trigrams when includeTrigrams is enabled).

      Use this in tight training loops to eliminate per-sample GC pressure from allocating 128KB int arrays millions of times.

      Parameters:
      rawText - raw input text (may be null)
      counts - pre-allocated int array of size numBuckets (will be zeroed)
    • extractFromPreprocessed

      public int[] extractFromPreprocessed(String preprocessedText)
      Extract features from already-preprocessed text (no NFC, no URL stripping, no truncation). Use this when the text has already been passed through preprocess(String) — for example, when loading preprocessed data from disk.
      Parameters:
      preprocessedText - text that has already been through preprocess(String)
      Returns:
      int array of size numBuckets with n-gram counts (bigrams, plus trigrams when includeTrigrams is enabled)
    • extractFromPreprocessed

      public void extractFromPreprocessed(String preprocessedText, int[] counts)
      Extract features from already-preprocessed text into a caller-supplied buffer. Combines the benefits of extractFromPreprocessed(String) (skip preprocessing) and extract(String, int[]) (no allocation).

      This is the fastest extraction path — use it in training loops where text has been preprocessed and written to disk ahead of time.

      Parameters:
      preprocessedText - text that has already been through preprocess(String)
      counts - pre-allocated int array of size numBuckets (will be zeroed)
    • extractFromPreprocessed

      public void extractFromPreprocessed(String preprocessedText, int[] counts, boolean clear)
      Extract features from already-preprocessed text into a caller-supplied buffer, optionally clearing it first.

      When clear is false, bigram counts are accumulated on top of whatever is already in the buffer. This is useful in training loops where features from multiple sources need to be combined into a single vector.

      Parameters:
      preprocessedText - text that has already been through preprocess(String)
      counts - pre-allocated int array of size numBuckets
      clear - if true, zero the array before extracting; if false, accumulate on top of existing counts
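
The clear flag follows a common buffer-reuse pattern, sketched below with a deliberately trivial stand-in for the feature extractor. The AccumulateSketch class and its extractInto method are hypothetical and only count letters per bucket; the point is the zero-or-accumulate behavior, not the features themselves.

```java
import java.util.Arrays;

/** Illustrative sketch only; the "features" here are a toy stand-in. */
final class AccumulateSketch {

    /**
     * Counts lowercased letters into counts[codepoint % counts.length].
     * When clear is true the buffer is zeroed first; otherwise new counts
     * are added on top of whatever the buffer already holds.
     */
    static void extractInto(String text, int[] counts, boolean clear) {
        if (clear) {
            Arrays.fill(counts, 0);
        }
        text.codePoints()
                .filter(Character::isLetter)
                .forEach(cp -> counts[Character.toLowerCase(cp) % counts.length]++);
    }
}
```

A caller combining two sources into one vector would pass clear = true for the first source and clear = false for the second, then reuse the same array for the next sample by starting again with clear = true.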
    • preprocess

      public static String preprocess(String rawText)
      Preprocessing: truncate, strip URLs/emails, NFC normalize.

      This method is also used by the general word tokenizer so that tika-eval shares the same normalization pipeline.

      Parameters:
      rawText - raw input
      Returns:
      cleaned, NFC-normalized text
    • preprocessNoTruncate

      public static String preprocessNoTruncate(String rawText)
      Preprocessing without the length truncation: strip URLs/emails and NFC-normalize. Used by tika-eval tokenization, which imposes its own maxTokens limit rather than a character limit.

      Important: preprocess(String) delegates to this method after truncating, so any change here affects both language detection and tika-eval tokenization. Also note that TikaEvalTokenizer calls isTransparent(int) directly; changes to that method affect tika-eval token boundaries as well.

      Parameters:
      rawText - raw input
      Returns:
      cleaned, NFC-normalized text
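
The relationship between the two preprocessing methods (truncate, then delegate) can be sketched as follows. Everything specific in this block is an assumption made for illustration: the PreprocessSketch class, the MAX_TEXT_LENGTH value, the two bounded regular expressions (stand-ins for the TIKA-2777 patterns, which are not reproduced on this page), the choice to replace matches with a space, and the handling of null input.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

/** Illustrative sketch only; patterns and limits are placeholders, not Tika's. */
final class PreprocessSketch {

    static final int MAX_TEXT_LENGTH = 1000; // hypothetical value

    // Bounded quantifiers keep matching cost predictable on pathological input.
    private static final Pattern URL = Pattern.compile("https?://\\S{1,2048}");
    private static final Pattern EMAIL = Pattern.compile("[\\w.+-]{1,64}@[\\w.-]{1,255}");

    /** Strip URLs and emails, then NFC-normalize. No length limit. */
    static String preprocessNoTruncate(String rawText) {
        if (rawText == null) {
            return ""; // assumption: null is treated as empty
        }
        String s = URL.matcher(rawText).replaceAll(" ");
        s = EMAIL.matcher(s).replaceAll(" ");
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Truncate first, then delegate to the shared no-truncate path. */
    static String preprocess(String rawText) {
        if (rawText == null) {
            return "";
        }
        String truncated = rawText.length() > MAX_TEXT_LENGTH
                ? rawText.substring(0, MAX_TEXT_LENGTH)
                : rawText;
        return preprocessNoTruncate(truncated);
    }
}
```

Because the truncating method funnels through the non-truncating one, a change to the shared path alters the output of both, which is the coupling the note above warns about.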
    • isTransparent

      public static boolean isTransparent(int cp)
      Determine whether a codepoint should be treated as transparent (skipped) during bigram extraction and word tokenization.

      Transparent codepoints are invisible to the bigram/tokenization logic: base letters on either side of a transparent run form a contiguous bigram or remain part of the same word token.

      The following categories are transparent:

      • Unicode nonspacing marks (Mn) — includes Arabic harakat (fatha U+064E, damma U+064F, kasra U+0650, shadda U+0651, sukun U+0652, tanwin U+064B–U+064D, superscript alef U+0670) and Hebrew niqqud (U+05B0–U+05BD, U+05BF, U+05C1–U+05C2, U+05C4–U+05C5, U+05C7). Without this, diacritics break Arabic/Hebrew words into isolated single-letter fragments because Character.isLetter(int) returns false for Mn codepoints. Stripping them preserves clean base-letter bigrams.
      • Arabic Tatweel / Kashida (U+0640) — a typographic stretching character that is classified as a letter but carries no linguistic information. "كتب" and "كـتـب" should produce identical bigrams.
      • ZWNJ (U+200C) — Zero Width Non-Joiner, used heavily in Persian/Farsi (e.g., "می‌خواهم") and in Arabic, Urdu, and Kurdish to control cursive joining. It is not a word boundary; bigrams should span across it.
      • ZWJ (U+200D) — Zero Width Joiner, forces cursive joining. Also not a word boundary.

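      A fast guard (cp < 0x0300) short-circuits the check for ASCII and Latin-script text: no transparent codepoint lies below U+0300, the start of the Combining Diacritical Marks block, so the common case costs only a single comparison.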

      Parameters:
      cp - a Unicode codepoint
      Returns:
      true if the codepoint should be skipped
    • getNumBuckets

      public int getNumBuckets()
    • isIncludeTrigrams

      public boolean isIncludeTrigrams()