Generative Language Model
The GenerativeLanguageModel is a character n-gram model that answers a
different question from the
discriminative language detector:
| Discriminative model | Generative model |
|---|---|
| Which language is this? | How language-like is this text for language L? |
The discriminative model chooses the best language among candidates. The generative model scores how well text matches a single language’s character statistics. Mojibake, encoding errors, and garbage text produce very low scores even when the discriminative model confidently picks a language.
Use Cases
Charset Detection Arbitration
When the CharSoupEncodingDetector must choose between candidate charsets
and the discriminative model is inconclusive, the generative model breaks
the tie. Each candidate charset’s decoded text is scored; the charset
producing the most "language-like" text wins.
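The arbitration loop can be sketched as follows. This is a minimal illustration of the idea, not the CharSoupEncodingDetector API: the class and method names are hypothetical, and the languageness scorer is abstracted as a function.

```java
import java.nio.charset.Charset;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Sketch of charset arbitration: decode the raw bytes with each candidate
// charset and keep the charset whose decoded text scores most
// "language-like". Names are illustrative, not Tika API.
public class CharsetArbitration {
    public static Charset arbitrate(byte[] raw, List<Charset> candidates,
                                    ToDoubleFunction<String> languageness) {
        Charset best = candidates.get(0);
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Charset cs : candidates) {
            String decoded = new String(raw, cs);
            double score = languageness.applyAsDouble(decoded);
            if (score > bestScore) {
                bestScore = score;
                best = cs;
            }
        }
        return best;
    }
}
```

A wrong candidate charset typically turns multi-byte sequences into mojibake (for example, UTF-8 `é` decoded as Latin-1 becomes `Ã©`), which drags the generative score down sharply.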
tika-eval Quality Column
The TikaEvalMetadataFilter adds a tika-eval:languageness metadata field
alongside the existing tika-eval:oov (out-of-vocabulary) ratio. The
languageness value is a length-adjusted z-score: values near zero indicate
normal text; values below -2 suggest possible encoding problems or garbled
content.
How It Works
Scoring
For a given (text, language) pair, the model computes the average
log-probability of the text’s character n-grams under that language:
score = mean( log P(ngram | language) ) for all n-grams in text
Higher scores (closer to zero) mean the text is more consistent with the language’s character statistics.
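The scoring formula can be sketched with a toy bigram table. The class, its fields, and the floor value for unseen n-grams are illustrative assumptions, not the Tika implementation:

```java
import java.util.Map;

// Sketch of the averaged log-probability score, assuming a hypothetical
// per-language table mapping character bigrams to log P(ngram | language).
// Names (ToyGlmScorer, logProbs, floorLogProb) are illustrative, not Tika API.
public class ToyGlmScorer {
    private final Map<String, Double> logProbs; // log P(ngram | language)
    private final double floorLogProb;          // smoothed fallback for unseen n-grams

    public ToyGlmScorer(Map<String, Double> logProbs, double floorLogProb) {
        this.logProbs = logProbs;
        this.floorLogProb = floorLogProb;
    }

    /** Mean log-probability of all character bigrams in the text. */
    public double score(String text) {
        if (text.length() < 2) {
            return floorLogProb;
        }
        double sum = 0.0;
        int n = 0;
        for (int i = 0; i + 2 <= text.length(); i++) {
            sum += logProbs.getOrDefault(text.substring(i, i + 2), floorLogProb);
            n++;
        }
        return sum / n; // higher (closer to zero) = more language-like
    }
}
```

Because the sum is divided by the n-gram count, the score is length-normalized: a long sentence and a short one in the same language land in the same range.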
Feature Types
| Script | Feature types | Bucket counts |
|---|---|---|
| CJK (Han, Hiragana, Katakana) | Character unigrams, character bigrams | 8,192 / 16,384 |
| Non-CJK | Character unigrams (streaming); positionally-salted bigrams and trigrams (BOW/MID/EOW/FULL_WORD); bidirectional word bigrams (short-anchor) | 4,096 / 8,192 / 16,384 / 8,192 |
| All | Script distribution (L1-normalized, 34 fine-grained script categories) | 34 |
Positional Salting
Non-CJK bigrams and trigrams use a salt byte (BOW, MID, EOW, or FULL_WORD) as the first byte of the FNV hash rather than sentinel characters. This encodes word-boundary information without polluting codepoint space — n-grams always contain N real characters.
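The salting idea can be sketched with a standard 32-bit FNV-1a hash. The constants below are the published FNV-1a parameters; the salt values, byte-mixing order, and bucket layout are illustrative assumptions and may differ from Tika's actual hashing.

```java
// Sketch of positional salting: a boundary-position byte is mixed into the
// hash before the n-gram's characters, so "th" at beginning-of-word and
// "th" mid-word land in (almost always) different buckets, without adding
// sentinel characters to the n-gram itself. Constants are 32-bit FNV-1a;
// salt values and mixing order are illustrative, not the Tika layout.
public class SaltedNgramHash {
    public static final byte BOW = 0, MID = 1, EOW = 2, FULL_WORD = 3;

    private static final int FNV_OFFSET = 0x811C9DC5;
    private static final int FNV_PRIME = 0x01000193;

    /** FNV-1a over the salt byte followed by the n-gram's code points. */
    public static int bucket(byte salt, String ngram, int numBuckets) {
        int h = FNV_OFFSET;
        h = (h ^ (salt & 0xFF)) * FNV_PRIME;
        for (int i = 0; i < ngram.length(); ) {
            int cp = ngram.codePointAt(i);
            i += Character.charCount(cp);
            // mix each code point byte by byte
            for (int shift = 0; shift < 32; shift += 8) {
                h = (h ^ ((cp >>> shift) & 0xFF)) * FNV_PRIME;
            }
        }
        return Math.floorMod(h, numBuckets);
    }
}
```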
Bidirectional Word Bigrams
Two word-bigram types capture function-word context:
- Forward: fired when the previous word is short (≤ 3 chars) — captures "the X", "de X", "в X".
- Backward: fired when the current word is short — captures "X the", "X de", "X в".
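The two firing rules can be sketched over a token stream. The three-character threshold comes from the text above; the string-valued features and names are illustrative (Tika hashes features into buckets rather than keeping strings):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the two word-bigram firing rules, using the short-anchor
// threshold of three characters stated above. Feature strings are
// illustrative placeholders for hashed bucket indices.
public class WordBigramRules {
    static final int SHORT = 3;

    /** Returns the forward/backward features fired for a word sequence. */
    public static List<String> fire(String[] words) {
        List<String> feats = new ArrayList<>();
        for (int i = 1; i < words.length; i++) {
            String prev = words[i - 1], curr = words[i];
            if (prev.length() <= SHORT) {
                feats.add("FWD:" + prev + "_" + curr);   // "the X" pattern
            }
            if (curr.length() <= SHORT) {
                feats.add("BWD:" + prev + "_" + curr);   // "X the" pattern
            }
        }
        return feats;
    }
}
```

For "the quick fox", the forward rule fires on ("the", "quick") and the backward rule on ("quick", "fox"), anchoring on the short words in either direction.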
Script Distribution
Per-language log-probabilities over 34 fine-grained script categories
(using GlmScriptCategory). Contributions are L1-normalized over the text:
one weighted contribution is emitted regardless of text length, preventing
script signal from swamping n-gram signal on long Indic or Cyrillic text.
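The L1 normalization can be sketched as a weighted sum whose weights add to one. The map-based signature and names are illustrative, not the Tika API:

```java
import java.util.Map;

// Sketch of the L1-normalized script contribution: per-script counts over
// the text are normalized to sum to 1, so the script signal carries the
// same total weight regardless of text length. Names are illustrative.
public class ScriptSignal {
    /** Weighted sum of per-script log-probs, with weights summing to 1. */
    public static double contribution(Map<String, Integer> scriptCounts,
                                      Map<String, Double> scriptLogProbs,
                                      double floorLogProb) {
        double total = scriptCounts.values().stream().mapToInt(Integer::intValue).sum();
        double score = 0.0;
        for (Map.Entry<String, Integer> e : scriptCounts.entrySet()) {
            double weight = e.getValue() / total;     // L1 normalization
            score += weight * scriptLogProbs.getOrDefault(e.getKey(), floorLogProb);
        }
        return score;
    }
}
```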
Log-probabilities are quantized to unsigned INT8 over the range [-18, 0] and stored in dense byte arrays. Add-k smoothing (k=0.01) prevents zero-probability n-grams.
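The quantization maps the [-18, 0] range onto 256 unsigned levels, giving a resolution of about 0.07 in log-probability. A minimal sketch, assuming round-to-nearest (the exact rounding in the Tika implementation may differ):

```java
// Sketch of INT8 quantization of log-probabilities over [-18, 0],
// as described above. Rounding behavior is an assumption.
public class LogProbQuantizer {
    static final double MIN_LP = -18.0;

    /** Clamp to [-18, 0] and map onto 0..255. */
    public static int quantize(double logProb) {
        double clamped = Math.max(MIN_LP, Math.min(0.0, logProb));
        return (int) Math.round((clamped - MIN_LP) / -MIN_LP * 255.0);
    }

    /** Recover the approximate log-probability from a quantized level. */
    public static double dequantize(int q) {
        return MIN_LP + (q / 255.0) * -MIN_LP;
    }
}
```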
Z-Scores
Raw scores vary across languages (CJK languages have higher raw scores due to fewer, more information-dense characters). To enable a universal threshold, each language stores calibration statistics (μ, σ) computed from the training corpus:
z = (score - μ) / σ
A z-score of 0 means "average for this language." A z-score of -3 means "3 standard deviations worse than average."
Length Adjustment
Score variance scales as approximately 1/√(text length): shorter text has
noisier scores. The zScoreLengthAdjusted method inflates σ for short
text to prevent spurious low z-scores on snippets:
σ_adjusted = σ × max(1, √(120 / text_length))
where 120 is the approximate character length of a typical training sentence. For text at or above 120 characters, the adjustment is a no-op.
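Combining the z-score and σ-inflation formulas above gives a small sketch (the class name is illustrative; the method name follows the text):

```java
// Sketch of the length-adjusted z-score: the z-score formula with sigma
// inflated by max(1, sqrt(120 / length)) for short text, per the formulas
// above. REF_LEN = 120 is the reference sentence length from the text.
public class ZScore {
    static final double REF_LEN = 120.0;

    public static double zScoreLengthAdjusted(double score, double mu,
                                              double sigma, int textLength) {
        double inflate = Math.max(1.0, Math.sqrt(REF_LEN / textLength));
        return (score - mu) / (sigma * inflate);
    }
}
```

For a 30-character snippet the inflation factor is √(120/30) = 2, so a score one raw standard deviation below the mean reports as z = -0.5 rather than -1, damping noise-driven alarms on short text.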
Empirical Noise Sensitivity (FLORES-200)
The table below shows mean z-scores under various noise types at different
text lengths, averaged over 204 languages on the FLORES-200 dev set
(v4 model, GlmNoiseSensitivityReport):
| Length | clean | reversed | wrong-lang | mojibake-latin1 | sep-rev | sep-spc+ |
|---|---|---|---|---|---|---|
| 20 | 0.03 | -1.29 | -9.28 | -4.71 | 1.33 | 0.78 |
| 50 | -0.04 | -2.29 | -14.84 | -6.37 | 2.26 | 1.36 |
| 100 | -0.10 | -3.36 | -21.07 | -6.67 | 3.26 | 1.98 |
| 200 | -0.12 | -3.74 | -23.12 | -6.43 | 3.61 | 2.03 |
Columns: sep-rev = clean − reversed (directionality sensitivity);
sep-spc+ = clean − space-inserted (space sensitivity).
Clean z-scores near zero confirm calibration is correct.
Model Format (GLM1 v4)
```
4 bytes  magic: 0x474C4D31 ("GLM1")
4 bytes  version: 4
4 bytes  numLangs
4 bytes  cjkUnigramBuckets (8,192)
4 bytes  cjkBigramBuckets (16,384)
4 bytes  noncjkUnigramBuckets (4,096)
4 bytes  noncjkBigramBuckets (8,192)
4 bytes  noncjkTrigramBuckets (16,384)
4 bytes  scriptCategories (34)
4 bytes  wordBigramBuckets (8,192)
For each language:
  2 bytes  langCode length (uint16)
  N bytes  langCode (UTF-8)
  1 byte   isCjk (0 or 1)
  4 bytes  scoreMean (float32, μ)
  4 bytes  scoreStdDev (float32, σ)
  B bytes  unigramTable
  B bytes  bigramTable
  B bytes  trigramTable (non-CJK only)
  B bytes  wordBigramTable (non-CJK only)
  B bytes  scriptTable
```
The model is stored at
org/apache/tika/langdetect/charsoup/langdetect-generative-v4-20260320.bin
on the classpath, alongside the discriminative models.
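Parsing the fixed header fields can be sketched with a DataInputStream, assuming big-endian byte order (the standard DataInputStream convention; the actual reader class and endianness in Tika are assumptions here):

```java
import java.io.DataInputStream;
import java.io.IOException;

// Sketch of reading the fixed GLM1 header listed above, assuming big-endian
// layout. Field order follows the format description; names are illustrative.
public class GlmHeader {
    public final int version;
    public final int numLangs;

    public GlmHeader(DataInputStream in) throws IOException {
        int magic = in.readInt();
        if (magic != 0x474C4D31) { // "GLM1"
            throw new IOException("Not a GLM1 model");
        }
        version = in.readInt();
        numLangs = in.readInt();
        // remaining header ints: six bucket counts and the script-category count
        for (int i = 0; i < 7; i++) {
            in.readInt();
        }
    }
}
```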
Training
The model is trained by TrainGenerativeLanguageModel (in test scope).
A discriminative model binary is required to determine the set of languages
to train:
```shell
./mvnw -pl tika-langdetect/tika-langdetect-charsoup exec:java \
    -Dexec.classpathScope=test -Dcheckstyle.skip=true \
    -Dexec.mainClass=org.apache.tika.langdetect.charsoup.tools.TrainGenerativeLanguageModel \
    "-Dexec.args=--corpus /path/to/wikipedia-dumps \
    --output langdetect-generative-v4-YYYYMMDD.bin \
    --disc-model langdetect-YYYYMMDD.bin"
```
Training performs two passes:
1. Count pass — accumulates n-gram, word-bigram, and script counts per language, then converts to log-probabilities with add-k smoothing (k=0.01).
2. Calibration pass — re-scores training sentences to compute per-language μ and σ (Welford’s online algorithm), stored for z-score computation at runtime.
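Welford's online algorithm, used in the calibration pass, accumulates mean and variance in a single streaming pass without storing the scores. A minimal sketch:

```java
// Sketch of Welford's online algorithm for streaming mean and standard
// deviation, as used in the calibration pass described above.
public class Welford {
    private long n;
    private double mean;
    private double m2; // sum of squared deviations from the running mean

    public void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    public double mean() { return mean; }

    /** Population standard deviation over the values seen so far. */
    public double stdDev() { return n > 0 ? Math.sqrt(m2 / n) : 0.0; }
}
```

The update is numerically stable even over millions of sentences, which is why it is preferred to the naive sum-of-squares formula.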
The corpus can be in Wikipedia dump format (corpusDir/{code}/sentences.txt)
or flat format (corpusDir/{code} with one sentence per line).
Use --max-per-lang N (default 500,000) to cap sentences per language.
Evaluation Tools
Several tools are provided for evaluating model quality (all in test scope
under org.apache.tika.langdetect.charsoup.tools):
- GlmNoiseSensitivityReport — measures z-scores under 12 noise types (random substitution, shuffle, reversal, wrong-language, space insert/remove, mojibake re-encoding) at four text lengths. Outputs a TSV for version comparison.
- GlmAdjudicateDiagnostic — per-language breakdown of z-score margin between correct and incorrect language hypotheses.
- ZScoreDistributionReport — z-score distribution at various text lengths and thresholds.