Class CharSoupFeatureExtractor
WARNING — DO NOT CHANGE THIS CLASS WITHOUT RETRAINING THE MODEL.
This class encodes the exact preprocessing and feature-extraction pipeline
that was used when langdetect.bin was trained. Training and inference
must be bit-for-bit identical: any change to isTransparent(int),
preprocess(String), preprocessNoTruncate(String),
extractBigrams(java.lang.String, int[]), or the FNV-1a hash will silently degrade accuracy
because the model weights will no longer correspond to the features being
computed. Changes here also affect tika-eval tokenization, which calls
preprocessNoTruncate(String) and isTransparent(int)
directly.
Pipeline
- Truncate input at MAX_TEXT_LENGTH chars
- Strip URLs and emails (TIKA-2777 bounded patterns)
- NFC normalize
- Iterate codepoints (surrogate-safe)
- Skip transparent characters (see isTransparent(int))
- Filter: Character.isLetter(int)
- Case fold: Character.toLowerCase(int)
- Emit bigrams (and optionally trigrams) with underscore _ sentinels at word boundaries
- Hash each n-gram via FNV-1a → bucket index
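The core of the pipeline (letter filter, case fold, sentinel-delimited bigrams, FNV-1a bucketing) can be sketched roughly as follows. This is an illustration only, not the code that produced langdetect.bin: the class name BigramHashSketch, the byte order fed into FNV-1a, and the modulo bucket mapping are all assumptions, and truncation, URL/email stripping, and transparent-character skipping are left out for brevity.

```java
import java.text.Normalizer;

final class BigramHashSketch {
    private static final int FNV_OFFSET_BASIS = 0x811C9DC5; // 32-bit FNV offset basis
    private static final int FNV_PRIME = 0x01000193;        // 32-bit FNV prime
    private static final int SENTINEL = '_';

    // Mix one code point into an FNV-1a state, low byte first.
    // The byte order is an assumption made for this sketch.
    static int fnv1aMix(int hash, int codePoint) {
        for (int shift = 0; shift < 32; shift += 8) {
            hash ^= (codePoint >>> shift) & 0xFF; // FNV-1a: XOR first...
            hash *= FNV_PRIME;                    // ...then multiply
        }
        return hash;
    }

    // Map a bigram to a bucket index in [0, numBuckets).
    static int bucket(int first, int second, int numBuckets) {
        int hash = fnv1aMix(fnv1aMix(FNV_OFFSET_BASIS, first), second);
        return Math.floorMod(hash, numBuckets);
    }

    static int[] countBigrams(String text, int numBuckets) {
        int[] counts = new int[numBuckets];
        String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
        int prev = SENTINEL;
        boolean inWord = false;
        for (int i = 0; i < nfc.length(); ) {
            int cp = nfc.codePointAt(i);       // surrogate-safe iteration
            i += Character.charCount(cp);
            if (Character.isLetter(cp)) {
                int lower = Character.toLowerCase(cp);
                counts[bucket(prev, lower, numBuckets)]++;
                prev = lower;
                inWord = true;
            } else if (inWord) {
                // A non-letter ends the word: emit the closing sentinel bigram.
                counts[bucket(prev, SENTINEL, numBuckets)]++;
                prev = SENTINEL;
                inWord = false;
            }
        }
        if (inWord) {
            counts[bucket(prev, SENTINEL, numBuckets)]++;
        }
        return counts;
    }
}
```

Under these assumptions, the word "ab" contributes three bigrams (_a, ab, b_), and upper- and lowercase input produce the same vector.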
Trigram mode
When includeTrigrams is enabled, both bigrams and trigrams are hashed
into the same bucket vector. Trigrams are more discriminative than bigrams
(e.g., "the" vs "th"+"he"), which improves accuracy on very short texts.
The tradeoff is more hash collisions in smaller bucket vectors.
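To make the tradeoff concrete, the sketch below lists the n-grams a single word would contribute in each mode. The class NgramSketch and the way trigrams are padded with sentinels are assumptions for illustration; the real extractor hashes n-grams into buckets rather than materializing strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class NgramSketch {
    // Lists the sentinel-padded n-grams of one word (hypothetical helper).
    static List<String> ngrams(String word, boolean includeTrigrams) {
        String padded = "_" + word.toLowerCase(Locale.ROOT) + "_";
        int[] cps = padded.codePoints().toArray(); // work on code points, not chars
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < cps.length; i++) {
            out.add(new String(cps, i, 2));        // bigram starting at i
            if (includeTrigrams && i + 2 < cps.length) {
                out.add(new String(cps, i, 3));    // trigram starting at i
            }
        }
        return out;
    }
}
```

For "the", bigram-only mode yields four features (_t, th, he, e_), while trigram mode yields seven, including "the" itself. More features per word landing in the same number of buckets is where the extra collisions come from.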
Transparent character handling
Certain codepoints are treated as transparent — they are skipped entirely
during n-gram extraction so that base letters on either side form a contiguous pair.
This is critical for Arabic and Hebrew where diacritical marks (harakat, niqqud) are
Unicode nonspacing marks (Mn) that would otherwise break words into isolated
single-letter fragments, destroying the bigram signal.
See isTransparent(int) for the full list of skipped codepoints.
-
Constructor Summary
Constructors
- CharSoupFeatureExtractor(int numBuckets): Create an extractor with bigrams only.
- CharSoupFeatureExtractor(int numBuckets, boolean includeTrigrams): Create an extractor with configurable n-gram mode.
-
Method Summary
- int[] extract(String rawText): Full preprocessing + feature extraction pipeline.
- void extract(String rawText, int[] counts): Extract features into a caller-supplied buffer, avoiding allocation.
- int[] extractFromPreprocessed(String preprocessedText): Extract features from already-preprocessed text (no NFC, no URL stripping, no truncation).
- void extractFromPreprocessed(String preprocessedText, int[] counts): Extract features from already-preprocessed text into a caller-supplied buffer.
- void extractFromPreprocessed(String preprocessedText, int[] counts, boolean clear): Extract features from already-preprocessed text into a caller-supplied buffer, optionally clearing it first.
- int getNumBuckets()
- boolean isIncludeTrigrams()
- static boolean isTransparent(int cp): Determine whether a codepoint should be treated as transparent (skipped) during bigram extraction and word tokenization.
- static String preprocess(String rawText): Preprocessing: truncate, strip URLs/emails, NFC normalize.
- static String preprocessNoTruncate(String rawText): Preprocessing without the length truncation: strip URLs/emails and NFC-normalize.
-
Constructor Details
-
CharSoupFeatureExtractor
public CharSoupFeatureExtractor(int numBuckets)
Create an extractor with bigrams only.
Parameters:
numBuckets - number of hash buckets (feature vector size)
-
CharSoupFeatureExtractor
public CharSoupFeatureExtractor(int numBuckets, boolean includeTrigrams)
Create an extractor with configurable n-gram mode.
Parameters:
numBuckets - number of hash buckets (feature vector size)
includeTrigrams - if true, both bigrams and trigrams are hashed into the bucket vector; if false, only bigrams are used (the default)
-
-
Method Details
-
extract
Full preprocessing + feature extraction pipeline.
Parameters:
rawText - raw input text (may be null)
Returns:
int array of size numBuckets with bigram counts
-
extract
Extract features into a caller-supplied buffer, avoiding allocation. The buffer is zeroed and then filled with bigram counts.
Use this in tight training loops to eliminate per-sample GC pressure from allocating 128KB int arrays millions of times.
Parameters:
rawText - raw input text (may be null)
counts - pre-allocated int array of size numBuckets (will be zeroed)
-
extractFromPreprocessed
Extract features from already-preprocessed text (no NFC, no URL stripping, no truncation). Use this when the text has already been passed through preprocess(String), for example when loading preprocessed data from disk.
Parameters:
preprocessedText - text that has already been through preprocess(String)
Returns:
int array of size numBuckets with bigram counts
-
extractFromPreprocessed
Extract features from already-preprocessed text into a caller-supplied buffer. Combines the benefits of extractFromPreprocessed(String) (skip preprocessing) and extract(String, int[]) (no allocation).
This is the fastest extraction path: use it in training loops where text has been preprocessed and written to disk ahead of time.
Parameters:
preprocessedText - text that has already been through preprocess(String)
counts - pre-allocated int array of size numBuckets (will be zeroed)
-
extractFromPreprocessed
Extract features from already-preprocessed text into a caller-supplied buffer, optionally clearing it first.
When clear is false, bigram counts are accumulated on top of whatever is already in the buffer. This is useful in training loops where features from multiple sources need to be combined into a single vector.
Parameters:
preprocessedText - text that has already been through preprocess(String)
counts - pre-allocated int array of size numBuckets
clear - if true, zero the array before extracting; if false, accumulate on top of existing counts
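The effect of the clear flag can be illustrated with a small stand-in. AccumulateSketch below is hypothetical and deliberately simplified: it buckets single letters by code point instead of hashing n-grams, so only the zero-versus-accumulate behaviour carries over to the real method.

```java
import java.util.Arrays;

final class AccumulateSketch {
    // Hypothetical stand-in for the three-argument extraction method.
    static void extractInto(String text, int[] counts, boolean clear) {
        if (clear) {
            Arrays.fill(counts, 0); // start from an empty vector
        }
        if (text == null) {
            return;                 // null handling here is an assumption
        }
        text.codePoints()
            .filter(Character::isLetter)
            .forEach(cp -> counts[Math.floorMod(Character.toLowerCase(cp), counts.length)]++);
    }
}
```

A caller that wants one vector for several text fragments would pass clear = true for the first fragment and clear = false for the rest, reusing the same array throughout.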
-
preprocess
Preprocessing: truncate, strip URLs/emails, NFC normalize.
This method is also used by the general word tokenizer so that tika-eval shares the same normalization pipeline.
Parameters:
rawText - raw input
Returns:
cleaned, NFC-normalized text
-
preprocessNoTruncate
Preprocessing without the length truncation: strip URLs/emails and NFC-normalize. Used by tika-eval tokenization, which imposes its own maxTokens limit rather than a character limit.
Important: preprocess(String) delegates to this method after truncating, so any change here affects both language detection and tika-eval tokenization. Also note that TikaEvalTokenizer calls isTransparent(int) directly; changes to that method affect tika-eval token boundaries as well.
Parameters:
rawText - raw input
Returns:
cleaned, NFC-normalized text
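The delegation between the two preprocessing methods might look roughly like the sketch below. The regular expressions are illustrative placeholders, not the TIKA-2777 patterns; the maxLength parameter stands in for MAX_TEXT_LENGTH, whose value is not given here; and returning an empty string for null input is an assumption.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class PreprocessSketch {
    // Illustrative patterns only. Quantifiers are bounded so that a long
    // run of matching characters cannot cause excessive backtracking.
    private static final Pattern URL = Pattern.compile("https?://\\S{1,2048}");
    private static final Pattern EMAIL = Pattern.compile("[\\w.%+-]{1,64}@[\\w.-]{1,255}");

    static String preprocessNoTruncate(String rawText) {
        if (rawText == null) {
            return ""; // assumption: treat null as empty input
        }
        String noUrls = URL.matcher(rawText).replaceAll(" ");
        String noEmails = EMAIL.matcher(noUrls).replaceAll(" ");
        return Normalizer.normalize(noEmails, Normalizer.Form.NFC);
    }

    // Truncate first, then delegate, so both paths share one normalization.
    static String preprocess(String rawText, int maxLength) {
        if (rawText != null && rawText.length() > maxLength) {
            rawText = rawText.substring(0, maxLength);
        }
        return preprocessNoTruncate(rawText);
    }
}
```

Keeping the truncation in a thin wrapper means any change to URL stripping or normalization lands in exactly one place, which is also why such a change reaches both language detection and tika-eval.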
-
isTransparent
public static boolean isTransparent(int cp)
Determine whether a codepoint should be treated as transparent (skipped) during bigram extraction and word tokenization.
Transparent codepoints are invisible to the bigram/tokenization logic: base letters on either side of a transparent run form a contiguous bigram or remain part of the same word token.
The following categories are transparent:
- Unicode nonspacing marks (Mn): includes Arabic harakat (fatha U+064E, damma U+064F, kasra U+0650, shadda U+0651, sukun U+0652, tanwin U+064B–U+064D, superscript alef U+0670) and Hebrew niqqud (U+05B0–U+05BD, U+05BF, U+05C1–U+05C2, U+05C4–U+05C5, U+05C7). Without this, diacritics break Arabic/Hebrew words into isolated single-letter fragments because Character.isLetter(int) returns false for Mn codepoints. Stripping them preserves clean base-letter bigrams.
- Arabic Tatweel / Kashida (U+0640): a typographic stretching character that is classified as a letter but carries no linguistic information. "كتب" and "كـتـب" should produce identical bigrams.
- ZWNJ (U+200C): Zero Width Non-Joiner, used heavily in Persian/Farsi (e.g., "میخواهم") and in Arabic, Urdu, and Kurdish to control cursive joining. It is not a word boundary; bigrams should span across it.
- ZWJ (U+200D): Zero Width Joiner, forces cursive joining. Also not a word boundary.
A fast guard (cp < 0x0300) short-circuits the check for ASCII and Latin-script text below U+0300, so the common case costs a single comparison. Greek (U+0370 and up) lies above the guard and takes the full check.
Parameters:
cp - a Unicode codepoint
Returns:
true if the codepoint should be skipped
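A predicate matching the categories listed above could be sketched as follows. TransparentSketch and its visibleLetters helper are hypothetical names; the actual method may order or implement its checks differently.

```java
final class TransparentSketch {
    static boolean isTransparent(int cp) {
        if (cp < 0x0300) {
            return false; // fast guard: nothing below U+0300 is transparent
        }
        if (cp == 0x0640 || cp == 0x200C || cp == 0x200D) {
            return true;  // Tatweel, ZWNJ, ZWJ
        }
        // Unicode general category Mn (nonspacing mark)
        return Character.getType(cp) == Character.NON_SPACING_MARK;
    }

    // Hypothetical helper: the letter sequence the n-gram logic would see.
    static String visibleLetters(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints()
            .filter(cp -> !isTransparent(cp))  // skip before the letter test,
            .filter(Character::isLetter)       // since Tatweel counts as a letter
            .map(Character::toLowerCase)
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }
}
```

With this predicate, a word written with Tatweel stretching and the same word written plainly reduce to the same letter sequence, so they would produce identical bigrams.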
-
getNumBuckets
public int getNumBuckets() -
isIncludeTrigrams
public boolean isIncludeTrigrams()
-