Generative Language Model

The GenerativeLanguageModel is a character n-gram model that answers a different question from the discriminative language detector:

  Discriminative model:  Which language is this?
  Generative model:      How language-like is this text for language L?

The discriminative model chooses the best language among candidates. The generative model scores how well text matches a single language’s character statistics. Mojibake, encoding errors, and garbage text produce very low scores even when the discriminative model confidently picks a language.

Use Cases

Charset Detection Arbitration

When the CharSoupEncodingDetector must choose between candidate charsets and the discriminative model is inconclusive, the generative model breaks the tie. Each candidate charset’s decoded text is scored; the charset producing the most "language-like" text wins.

tika-eval Quality Column

The TikaEvalMetadataFilter adds a tika-eval:languageness metadata field alongside the existing tika-eval:oov (out-of-vocabulary) ratio. The languageness value is a length-adjusted z-score: values near zero indicate normal text; values below -2 suggest possible encoding problems or garbled content.

Training Data Filtering

The CorpusFilterReport tool uses the generative model to scan training corpora and flag sentences that don’t match their language label — bot-generated stubs, mixed-language content, or templated text.

How It Works

Scoring

For a given (text, language) pair, the model computes the average log-probability of the text’s character n-grams under that language:

score = mean( log P(ngram | language) )   for all n-grams in text

Higher scores (closer to zero) mean the text is more consistent with the language’s character statistics.
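The scoring rule can be sketched in a few lines. This is illustrative Python, not the Tika implementation: `log_prob` stands in for the model's quantized lookup tables, and the fixed floor for unseen n-grams is an assumption chosen to match the low end of the quantization range.

```python
def score(text, log_prob, n=2):
    """Mean log-probability of the text's character n-grams."""
    floor = -18.0  # assumed fallback for unseen n-grams (the real model
                   # uses add-k smoothing); matches the quantization floor
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not ngrams:
        return floor
    return sum(log_prob.get(g, floor) for g in ngrams) / len(ngrams)

# A toy "model" that has seen a few English bigrams: familiar text
# scores near zero, unseen text falls toward the floor.
model = {"th": -1.0, "he": -1.2, "er": -1.5, "re": -1.4}
print(score("there", model))  # near -1.275 (mean of four seen bigrams)
print(score("xqzj", model))   # -18.0 (all bigrams unseen)
```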

Feature Types

Script: CJK (Han, Hiragana, Katakana)
  Feature types: character unigrams; character bigrams
  Bucket counts: 8,192 / 16,384

Script: Non-CJK
  Feature types: character unigrams (streaming); positionally-salted
    bigrams and trigrams (BOW/MID/EOW/FULL_WORD); bidirectional word
    bigrams (short-anchor)
  Bucket counts: 4,096 / 8,192 / 16,384 / 8,192

Script: All
  Feature types: script distribution (L1-normalized, 34 fine-grained
    script categories)
  Bucket counts: 34

Positional Salting

Non-CJK bigrams and trigrams use a salt byte (BOW, MID, EOW, or FULL_WORD) as the first byte of the FNV hash rather than sentinel characters. This encodes word-boundary information without polluting codepoint space — n-grams always contain N real characters.
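The salting idea can be sketched as follows. This is illustrative Python, not the Tika code: the 64-bit FNV-1a constants are the standard published ones, but the concrete salt byte values and default bucket count are placeholder assumptions. The documented point is only that the positional salt seeds the hash instead of appearing as a sentinel character inside the n-gram.

```python
# Placeholder salt byte values (assumption; the real constants are internal).
BOW, MID, EOW, FULL_WORD = 0x01, 0x02, 0x03, 0x04

FNV64_OFFSET = 0xcbf29ce484222325  # standard FNV-1a 64-bit offset basis
FNV64_PRIME = 0x100000001b3        # standard FNV-1a 64-bit prime
MASK64 = (1 << 64) - 1

def salted_bucket(salt, ngram, buckets=8192):
    """Hash an n-gram into a bucket, mixing the positional salt in as
    the first hashed byte, so "th" at a word start and "th" mid-word
    land in (almost always) different buckets."""
    h = (FNV64_OFFSET ^ salt) * FNV64_PRIME & MASK64
    for b in ngram.encode("utf-8"):
        h = (h ^ b) * FNV64_PRIME & MASK64
    return h % buckets

print(salted_bucket(BOW, "th"), salted_bucket(MID, "th"))
```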

Bidirectional Word Bigrams

Two word-bigram types capture function-word context:

  • Forward: fired when the previous word is short (≤ 3 chars) — captures "the X", "de X", "в X".

  • Backward: fired when the current word is short — captures "X the", "X de", "X в".
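The short-anchor rule above can be sketched as follows (illustrative Python; the 3-character threshold is from the text, while the `("fwd", …)` feature encoding is a placeholder, not the model's internal representation):

```python
def word_bigram_features(words, max_anchor_len=3):
    """Emit forward/backward word-bigram features under the
    short-anchor rule: fire forward when the previous word is short,
    backward when the current word is short."""
    feats = []
    for prev, cur in zip(words, words[1:]):
        if len(prev) <= max_anchor_len:    # short previous word
            feats.append(("fwd", prev + " " + cur))
        if len(cur) <= max_anchor_len:     # short current word
            feats.append(("bwd", prev + " " + cur))
    return feats

print(word_bigram_features(["the", "quick", "brown", "fox"]))
# -> [('fwd', 'the quick'), ('bwd', 'brown fox')]
```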

Script Distribution

Per-language log-probabilities over 34 fine-grained script categories (using GlmScriptCategory). Contributions are L1-normalized over the text: one weighted contribution is emitted regardless of text length, preventing script signal from swamping n-gram signal on long Indic or Cyrillic text.

Log-probabilities are quantized to unsigned INT8 over the range [-18, 0] and stored in dense byte arrays. Add-k smoothing (k=0.01) prevents zero-probability n-grams.
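The quantization can be sketched as a linear map of [-18, 0] onto 0..255. This is an assumption consistent with the stated range, not the actual encoder, which may clamp or round differently:

```python
LO, HI = -18.0, 0.0  # quantization range from the text above

def quantize(logp):
    """Clamp a log-probability to [LO, HI] and map it to 0..255."""
    clamped = min(HI, max(LO, logp))
    return round((clamped - LO) / (HI - LO) * 255)

def dequantize(q):
    return LO + q / 255 * (HI - LO)

# Worst-case round-trip error is half a quantization step:
# 18 / 255 / 2, roughly 0.035 in log-probability units.
print(quantize(0.0), quantize(-18.0), quantize(-9.0))
```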

Z-Scores

Raw scores vary across languages (CJK languages have higher raw scores due to fewer, more information-dense characters). To enable a universal threshold, each language stores calibration statistics (μ, σ) computed from the training corpus:

z = (score - μ) / σ

A z-score of 0 means "average for this language." A z-score of -3 means "3 standard deviations worse than average."

Length Adjustment

Score standard deviation scales approximately as 1/√(text length): shorter text has noisier scores. The zScoreLengthAdjusted method inflates σ for short text to prevent spurious low z-scores on snippets:

σ_adjusted = σ × max(1, √(120 / text_length))

where 120 is the approximate character length of a typical training sentence. For text at or above 120 characters, the adjustment is a no-op.
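Putting the z-score and length adjustment together (illustrative Python; a direct transcription of the two formulas above, not the Tika source):

```python
import math

def z_score_length_adjusted(score, mu, sigma, text_len, ref_len=120):
    """Z-score with sigma inflated for short text:
    sigma_adjusted = sigma * max(1, sqrt(ref_len / text_len))."""
    sigma_adj = sigma * max(1.0, math.sqrt(ref_len / text_len))
    return (score - mu) / sigma_adj

# A score 2 sigma below the mean on a 30-char snippet is reported as
# only 1 sigma below: sqrt(120 / 30) = 2 doubles sigma.
print(z_score_length_adjusted(-4.0, -2.0, 1.0, 30))   # -> -1.0
print(z_score_length_adjusted(-4.0, -2.0, 1.0, 120))  # -> -2.0
```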

Empirical Noise Sensitivity (FLORES-200)

The table below shows mean z-scores under various noise types at different text lengths, averaged over 204 languages on the FLORES-200 dev set (v4 model, GlmNoiseSensitivityReport):

Length   clean  reversed  wrong-lang  mojibake-latin1  sep-rev  sep-spc+
    20    0.03     -1.29       -9.28            -4.71     1.33      0.78
    50   -0.04     -2.29      -14.84            -6.37     2.26      1.36
   100   -0.10     -3.36      -21.07            -6.67     3.26      1.98
   200   -0.12     -3.74      -23.12            -6.43     3.61      2.03
Columns: sep-rev = clean − reversed (directionality sensitivity); sep-spc+ = clean − space-inserted (space sensitivity). Clean z-scores near zero confirm calibration is correct.

Model Format (GLM1 v4)

4 bytes   magic: 0x474C4D31 ("GLM1")
4 bytes   version: 4
4 bytes   numLangs
4 bytes   cjkUnigramBuckets      (8,192)
4 bytes   cjkBigramBuckets       (16,384)
4 bytes   noncjkUnigramBuckets   (4,096)
4 bytes   noncjkBigramBuckets    (8,192)
4 bytes   noncjkTrigramBuckets   (16,384)
4 bytes   scriptCategories       (34)
4 bytes   wordBigramBuckets      (8,192)

For each language:
  2 bytes   langCode length (uint16)
  N bytes   langCode (UTF-8)
  1 byte    isCjk (0 or 1)
  4 bytes   scoreMean (float32, μ)
  4 bytes   scoreStdDev (float32, σ)
  B bytes   unigramTable
  B bytes   bigramTable
  B bytes   trigramTable          (non-CJK only)
  B bytes   wordBigramTable       (non-CJK only)
  B bytes   scriptTable

The model is stored at org/apache/tika/langdetect/charsoup/langdetect-generative-v4-20260320.bin on the classpath, alongside the discriminative models.
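The fixed header lends itself to a straightforward struct parse. The sketch below is not the Tika reader: big-endian byte order is an assumption (consistent with Java's DataOutputStream), the field order follows the layout above, and the language count in the synthetic example is hypothetical.

```python
import io
import struct

FIELDS = ("magic", "version", "numLangs",
          "cjkUnigramBuckets", "cjkBigramBuckets",
          "noncjkUnigramBuckets", "noncjkBigramBuckets",
          "noncjkTrigramBuckets", "scriptCategories", "wordBigramBuckets")
FMT = ">10i"  # ten big-endian int32 fields, 40 bytes total

def read_header(stream):
    header = dict(zip(FIELDS, struct.unpack(FMT, stream.read(40))))
    if header["magic"] != 0x474C4D31:  # ASCII "GLM1"
        raise ValueError("not a GLM1 model")
    return header

# Round-trip a synthetic header using the documented bucket counts
# (204 languages is a hypothetical value for illustration).
buf = io.BytesIO(struct.pack(FMT, 0x474C4D31, 4, 204,
                             8192, 16384, 4096, 8192, 16384, 34, 8192))
print(read_header(buf))
```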

Training

The model is trained by TrainGenerativeLanguageModel (in test scope). A discriminative model binary is required to determine the set of languages to train:

./mvnw -pl tika-langdetect/tika-langdetect-charsoup exec:java \
  -Dexec.classpathScope=test -Dcheckstyle.skip=true \
  -Dexec.mainClass=org.apache.tika.langdetect.charsoup.tools.TrainGenerativeLanguageModel \
  "-Dexec.args=--corpus /path/to/wikipedia-dumps \
               --output langdetect-generative-v4-YYYYMMDD.bin \
               --disc-model langdetect-YYYYMMDD.bin"

Training performs two passes:

  1. Count pass — accumulates n-gram, word-bigram, and script counts per language, then converts to log-probabilities with add-k smoothing (k=0.01).

  2. Calibration pass — re-scores training sentences to compute per-language μ and σ (Welford’s online algorithm), stored for z-score computation at runtime.
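The calibration pass's running statistics can be sketched with Welford's algorithm (illustrative Python; population σ shown here, while the real tool may use the sample variant):

```python
import math

def welford(scores):
    """One-pass mean and standard deviation (Welford's online
    algorithm), as used to compute per-language mu and sigma without
    holding all sentence scores in memory."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in scores:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # note: uses the *updated* mean
    return mean, (math.sqrt(m2 / n) if n else 0.0)

mu, sigma = welford([-2.0, -1.0, -3.0, -2.0])
print(mu, sigma)  # -> -2.0 and sqrt(0.5)
```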

The corpus can be in Wikipedia dump format (corpusDir/{code}/sentences.txt) or flat format (corpusDir/{code} with one sentence per line). Use --max-per-lang N (default 500,000) to cap sentences per language.

Evaluation Tools

Several tools are provided for evaluating model quality (all in test scope under org.apache.tika.langdetect.charsoup.tools):

  • GlmNoiseSensitivityReport — measures z-scores under 12 noise types (random substitution, shuffle, reversal, wrong-language, space insert/remove, mojibake re-encoding) at four text lengths. Outputs a TSV for version comparison.

  • GlmAdjudicateDiagnostic — per-language breakdown of z-score margin between correct and incorrect language hypotheses.

  • ZScoreDistributionReport — z-score distribution at various text lengths and thresholds.