org.apache.tika.langdetect.charsoup.CharSoupFeatureExtractor

public class CharSoupFeatureExtractor extends Object

Extracts character n-gram features from text using the hashing trick (FNV-1a).

WARNING — DO NOT CHANGE THIS CLASS WITHOUT RETRAINING THE MODEL. This class encodes the exact preprocessing and feature-extraction pipeline that was used when langdetect.bin was trained. Training and inference must be bit-for-bit identical: any change to isTransparent(int), preprocess(String), preprocessNoTruncate(String), extractBigrams(java.lang.String, int[]), or the FNV-1a hash will silently degrade accuracy because the model weights will no longer correspond to the features being computed. Changes here also affect tika-eval tokenization, which calls preprocessNoTruncate(String) and isTransparent(int) directly.

Pipeline

1. Truncate input at MAX_TEXT_LENGTH chars
2. Strip URLs and emails (TIKA-2777 bounded patterns)
3. NFC normalize
4. Iterate codepoints (surrogate-safe)
5. Skip transparent characters (see isTransparent(int))
6. Filter: Character.isLetter(int)
7. Case fold: Character.toLowerCase(int)
8. Emit bigrams (and optionally trigrams) with underscore _ sentinels at word boundaries
9. Hash each n-gram via FNV-1a → bucket index
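
The codepoint iteration, letter filter, case folding, sentinel bigrams, and hashing steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code that langdetect.bin was trained with: it assumes each bigram is hashed over its UTF-8 bytes with 32-bit FNV-1a and reduced to a bucket by modulo, and it omits truncation, URL/email stripping, NFC normalization, and transparent-character skipping. The names BigramHashSketch, fnv1a, bucket, and countBigrams are hypothetical, and the bucket indices it produces should not be expected to match the real extractor.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative sketch only; not the Tika implementation. */
final class BigramHashSketch {

    private static final int FNV_OFFSET_BASIS = 0x811c9dc5;
    private static final int FNV_PRIME = 0x01000193;

    /** 32-bit FNV-1a: XOR each byte into the hash, then multiply by the prime. */
    static int fnv1a(byte[] data) {
        int h = FNV_OFFSET_BASIS;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= FNV_PRIME;
        }
        return h;
    }

    /** Maps an n-gram to a bucket in [0, numBuckets). Hashing UTF-8 bytes is an assumption. */
    static int bucket(String ngram, int numBuckets) {
        return Math.floorMod(fnv1a(ngram.getBytes(StandardCharsets.UTF_8)), numBuckets);
    }

    private static String pair(int a, int b) {
        return new StringBuilder(4).appendCodePoint(a).appendCodePoint(b).toString();
    }

    /** Counts hashed bigrams over lowercased letters, with '_' marking word boundaries. */
    static int[] countBigrams(String text, int numBuckets) {
        int[] counts = new int[numBuckets];
        int prev = '_'; // sentinel: the text starts at a word boundary
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);      // surrogate-safe iteration
            i += Character.charCount(cp);
            int cur = Character.isLetter(cp) ? Character.toLowerCase(cp) : '_';
            if (prev == '_' && cur == '_') {
                continue;                      // collapse runs of non-letters
            }
            counts[bucket(pair(prev, cur), numBuckets)]++;
            prev = cur;
        }
        if (prev != '_') {
            counts[bucket(pair(prev, '_'), numBuckets)]++; // close the final word
        }
        return counts;
    }

    public static void main(String[] args) {
        // "Hi hi" yields _h, hi, i_ twice, so the counts sum to 6.
        int[] counts = countBigrams("Hi hi", 64);
        System.out.println(Arrays.stream(counts).sum());
    }
}
```

Because every n-gram lands in a fixed-size vector, distinct n-grams can share a bucket; the model only ever sees bucket counts, which is why the hash and the preprocessing must stay fixed once a model has been trained.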

Trigram mode

When includeTrigrams is enabled, both bigrams and trigrams are hashed into the same bucket vector. Trigrams are more discriminative than bigrams (e.g., "the" vs "th"+"he"), which improves accuracy on very short texts. The tradeoff is more hash collisions in smaller bucket vectors.
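
To make the "the" vs "th"+"he" comparison concrete, the sketch below lists the n-grams of a single word padded with one underscore sentinel on each side. The padding scheme for trigrams is an assumption made for illustration, and NgramSketch and ngrams are hypothetical names rather than part of this class.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; the real sentinel handling for trigrams may differ. */
final class NgramSketch {

    /** Returns every n-gram of length n over the word padded with '_' on both sides. */
    static List<String> ngrams(String word, int n) {
        int[] cps = ("_" + word + "_").codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= cps.length; i++) {
            out.add(new String(cps, i, n)); // String(int[] codePoints, offset, count)
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("the", 2)); // [_t, th, he, e_]
        System.out.println(ngrams("the", 3)); // [_th, the, he_]
    }
}
```

With trigrams enabled, this one word contributes seven features instead of four, which illustrates why a small bucket vector sees more collisions in this mode.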

Transparent character handling

Certain codepoints are treated as transparent — they are skipped entirely during n-gram extraction so that base letters on either side form a contiguous pair. This is critical for Arabic and Hebrew where diacritical marks (harakat, niqqud) are Unicode nonspacing marks (Mn) that would otherwise break words into isolated single-letter fragments, destroying the bigram signal.

See isTransparent(int) for the full list of skipped codepoints.
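
As a rough illustration of the categories documented under isTransparent(int), the check could look like the sketch below. It is written from the description on this page only, so TransparentSketch, isTransparentSketch, and stripTransparent are hypothetical names and the real method may differ in detail.

```java
/** Illustrative sketch only; written from the documented categories, not from the source. */
final class TransparentSketch {

    static boolean isTransparentSketch(int cp) {
        if (cp < 0x0300) {
            return false;                 // fast guard: nothing below U+0300 is transparent
        }
        if (cp == 0x0640 || cp == 0x200C || cp == 0x200D) {
            return true;                  // Tatweel, ZWNJ, ZWJ
        }
        return Character.getType(cp) == Character.NON_SPACING_MARK; // Unicode category Mn
    }

    /** Drops transparent codepoints so the surrounding base letters become adjacent. */
    static String stripTransparent(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints()
         .filter(cp -> !isTransparentSketch(cp))
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        String plain = "\u0643\u062A\u0628";                 // kaf, teh, beh
        String stretched = "\u0643\u0640\u062A\u0640\u0628"; // same word with Tatweel inserted
        System.out.println(stripTransparent(stretched).equals(plain)); // true
    }
}
```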

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| CharSoupFeatureExtractor(int numBuckets) | Create an extractor with bigrams only. |
| CharSoupFeatureExtractor(int numBuckets, boolean includeTrigrams) | Create an extractor with configurable n-gram mode. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| int[] | extract(String rawText) | Full preprocessing + feature extraction pipeline. |
| void | extract(String rawText, int[] counts) | Extract features into a caller-supplied buffer, avoiding allocation. |
| int[] | extractFromPreprocessed(String preprocessedText) | Extract features from already-preprocessed text (no NFC, no URL stripping, no truncation). |
| void | extractFromPreprocessed(String preprocessedText, int[] counts) | Extract features from already-preprocessed text into a caller-supplied buffer. |
| void | extractFromPreprocessed(String preprocessedText, int[] counts, boolean clear) | Extract features from already-preprocessed text into a caller-supplied buffer, optionally clearing it first. |
| int | getNumBuckets() | |
| boolean | isIncludeTrigrams() | |
| static boolean | isTransparent(int cp) | Determine whether a codepoint should be treated as transparent (skipped) during bigram extraction and word tokenization. |
| static String | preprocess(String rawText) | Preprocessing: truncate, strip URLs/emails, NFC normalize. |
| static String | preprocessNoTruncate(String rawText) | Preprocessing without the length truncation: strip URLs/emails and NFC-normalize. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- CharSoupFeatureExtractor
  
  public CharSoupFeatureExtractor(int numBuckets)
  
  Create an extractor with bigrams only.
  
  Parameters:
  
  numBuckets - number of hash buckets (feature vector size)
- CharSoupFeatureExtractor
  
  public CharSoupFeatureExtractor(int numBuckets, boolean includeTrigrams)
  
  Create an extractor with configurable n-gram mode.
  
  Parameters:
  
  numBuckets - number of hash buckets (feature vector size)
  
  includeTrigrams - if true, both bigrams and trigrams are hashed into the bucket vector; if false, only bigrams are used (the default)
Method Details
- extract
  
  public int[] extract(String rawText)
  
  Full preprocessing + feature extraction pipeline.
  
  Parameters:
  
  rawText - raw input text (may be null)
  
  Returns:
  
  int array of size numBuckets with bigram counts
- extract
  
  public void extract(String rawText, int[] counts)
  
  Extract features into a caller-supplied buffer, avoiding allocation. The buffer is zeroed and then filled with bigram counts.
  Use this in tight training loops to avoid the per-sample GC pressure of allocating a fresh int array (128KB at 32,768 buckets) millions of times.
  
  Parameters:
  
  rawText - raw input text (may be null)
  
  counts - pre-allocated int array of size numBuckets (will be zeroed)
- extractFromPreprocessed
  
  public int[] extractFromPreprocessed(String preprocessedText)
  
  Extract features from already-preprocessed text (no NFC, no URL stripping, no truncation). Use this when the text has already been passed through preprocess(String) — for example, when loading preprocessed data from disk.
  
  Parameters:
  
  preprocessedText - text that has already been through preprocess(String)
  
  Returns:
  
  int array of size numBuckets with bigram counts
- extractFromPreprocessed
  
  public void extractFromPreprocessed(String preprocessedText, int[] counts)
  
  Extract features from already-preprocessed text into a caller-supplied buffer. Combines the benefits of extractFromPreprocessed(String) (skip preprocessing) and extract(String, int[]) (no allocation).
  This is the fastest extraction path — use it in training loops where text has been preprocessed and written to disk ahead of time.
  
  Parameters:
  
  preprocessedText - text that has already been through preprocess(String)
  
  counts - pre-allocated int array of size numBuckets (will be zeroed)
- extractFromPreprocessed
  
  public void extractFromPreprocessed(String preprocessedText, int[] counts, boolean clear)
  
  Extract features from already-preprocessed text into a caller-supplied buffer, optionally clearing it first.
  When clear is false, bigram counts are accumulated on top of whatever is already in the buffer. This is useful in training loops where features from multiple sources need to be combined into a single vector.
  
  Parameters:
  
  preprocessedText - text that has already been through preprocess(String)
  
  counts - pre-allocated int array of size numBuckets
  
  clear - if true, zero the array before extracting; if false, accumulate on top of existing counts
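
The buffer contract described above (zero first when clear is true, accumulate when it is false) can be illustrated with a self-contained stand-in. The feature function here is deliberately trivial and is not the real bigram hashing; AccumulateSketch and extractInto are hypothetical names that only mirror the shape of the three-argument method.

```java
import java.util.Arrays;

/** Illustrative stand-in for the clear/accumulate buffer contract; not the real extractor. */
final class AccumulateSketch {

    /** Toy feature: one count per letter, bucketed by codepoint modulo the buffer length. */
    static void extractInto(String text, int[] counts, boolean clear) {
        if (clear) {
            Arrays.fill(counts, 0);       // start from an empty vector
        }
        text.codePoints()
            .filter(Character::isLetter)
            .forEach(cp -> counts[Math.floorMod(cp, counts.length)]++);
    }

    public static void main(String[] args) {
        int[] buf = new int[16];          // allocated once, reused for every sample
        extractInto("ab", buf, true);     // buffer now holds 2 counts
        extractInto("cd", buf, false);    // accumulated on top: 4 counts
        System.out.println(Arrays.stream(buf).sum()); // 4
        extractInto("e", buf, true);      // zeroed first: 1 count
        System.out.println(Arrays.stream(buf).sum()); // 1
    }
}
```

Reusing one buffer this way keeps the allocation outside the loop; the caller is responsible for passing an array whose length matches numBuckets.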
- preprocess
  
  public static String preprocess(String rawText)
  
  Preprocessing: truncate, strip URLs/emails, NFC normalize.
  This method is also used by the general word tokenizer so that tika-eval shares the same normalization pipeline.
  
  Parameters:
  
  rawText - raw input
  
  Returns:
  
  cleaned, NFC-normalized text
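
A minimal sketch of the truncate, strip, and normalize sequence, assuming simple length-bounded regular expressions and a placeholder length limit. The actual TIKA-2777 patterns and the real MAX_TEXT_LENGTH value are not given on this page, so the patterns, the limit, the replacement with a space, and the null handling below are all assumptions; PreprocessSketch and its members are hypothetical names.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

/** Illustrative sketch only; patterns and limit are assumptions, not Tika's values. */
final class PreprocessSketch {

    // Hypothetical limit standing in for MAX_TEXT_LENGTH.
    static final int MAX_TEXT_LENGTH_ASSUMED = 100_000;

    // Hypothetical bounded patterns: explicit upper bounds keep matching time predictable.
    private static final Pattern URL = Pattern.compile("https?://\\S{1,2048}");
    private static final Pattern EMAIL = Pattern.compile("[\\w.+-]{1,64}@[\\w.-]{1,255}");

    /** Strip URLs and emails, then NFC-normalize. */
    static String preprocessNoTruncateSketch(String raw) {
        if (raw == null) {
            return "";                    // assumption: null is treated as empty text
        }
        String s = URL.matcher(raw).replaceAll(" ");
        s = EMAIL.matcher(s).replaceAll(" ");
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Truncate first, then delegate to the no-truncate variant. */
    static String preprocessSketch(String raw) {
        if (raw == null) {
            return "";
        }
        // Note: a plain substring can split a surrogate pair at the cut point.
        String head = raw.length() > MAX_TEXT_LENGTH_ASSUMED
                ? raw.substring(0, MAX_TEXT_LENGTH_ASSUMED)
                : raw;
        return preprocessNoTruncateSketch(head);
    }

    public static void main(String[] args) {
        // 'e' followed by a combining acute accent composes to a single codepoint under NFC.
        System.out.println(preprocessSketch("cafe\u0301").equals("caf\u00e9")); // true
    }
}
```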
- preprocessNoTruncate
  
  public static String preprocessNoTruncate(String rawText)
  
  Preprocessing without the length truncation: strip URLs/emails and NFC-normalize. Used by tika-eval tokenization, which imposes its own maxTokens limit rather than a character limit.
  Important: preprocess(String) delegates to this method after truncating, so any change here affects both language detection and tika-eval tokenization. Also note that TikaEvalTokenizer calls isTransparent(int) directly; changes to that method affect tika-eval token boundaries as well.
  
  Parameters:
  
  rawText - raw input
  
  Returns:
  
  cleaned, NFC-normalized text
- isTransparent
  
  public static boolean isTransparent(int cp)
  Determine whether a codepoint should be treated as transparent (skipped) during bigram extraction and word tokenization.
  Transparent codepoints are invisible to the bigram/tokenization logic: base letters on either side of a transparent run form a contiguous bigram or remain part of the same word token.
  
  The following categories are transparent:
  
  - Unicode nonspacing marks (Mn): includes Arabic harakat (fatha U+064E, damma U+064F, kasra U+0650, shadda U+0651, sukun U+0652, tanwin U+064B–U+064D, superscript alef U+0670) and Hebrew niqqud (U+05B0–U+05BD, U+05BF, U+05C1–U+05C2, U+05C4–U+05C5, U+05C7). Without this, diacritics break Arabic/Hebrew words into isolated single-letter fragments because Character.isLetter(int) returns false for Mn codepoints. Stripping them preserves clean base-letter bigrams.
  - Arabic Tatweel / Kashida (U+0640): a typographic stretching character that is classified as a letter but carries no linguistic information. "كتب" and "كـتـب" should produce identical bigrams.
  - ZWNJ (U+200C): Zero Width Non-Joiner, used heavily in Persian/Farsi (e.g., "می‌خواهم") and in Arabic, Urdu, and Kurdish to control cursive joining. It is not a word boundary; bigrams should span across it.
  - ZWJ (U+200D): Zero Width Joiner, forces cursive joining. Also not a word boundary.
  
  A fast guard (cp < 0x0300) short-circuits the check for every codepoint below the Combining Diacritical Marks block (ASCII, Latin-1, and extended Latin), so the common case costs a single comparison.
  
  Parameters:
  
  cp - a Unicode codepoint
  
  Returns:
  
  true if the codepoint should be skipped
- getNumBuckets
  
  public int getNumBuckets()
- isIncludeTrigrams
  
  public boolean isIncludeTrigrams()
