Building the CharSoup Language Detector

This page documents how the tika-langdetect-charsoup language detection model is trained, the decisions made along the way, and benchmark comparisons against the existing OpenNLP-based detector. For architecture and API details, see Language Detection.

Training Corpus

The primary training data comes from Wikipedia database dumps (dumps.wikimedia.org). Wikipedia is preferred over web-crawl corpora for quality: articles are human-authored, editorial standards filter boilerplate and spam, and the sentence distribution reflects genuine prose rather than SEO content or duplicated web templates.

Sentences are extracted from the Wikipedia XML dumps using the extract_wiki_sentences.py script, which strips markup, splits articles into sentences, and writes a lineNum<TAB>sentence file into each language directory:

~/datasets/wikipedia-dumps/
    eng/sentences.txt
    deu/sentences.txt
    ara/sentences.txt
    ...

For 17 languages with insufficient Wikipedia coverage, sentences are supplemented from MADLAD-400 (Kudugunta et al., 2023). These languages are written into a parallel sentences_madlad.txt file alongside the Wikipedia data, and PrepareCorpus reads all *.txt files from each language directory automatically:

~/datasets/wikipedia-dumps/
    mya/
        sentences.txt          (Wikipedia)
        sentences_madlad.txt   (MADLAD supplement)
    xho/
        sentences.txt
        sentences_madlad.txt
    ...

The MADLAD-supplemented languages are: mya, xho, nya, smo, sot, tet, orm, udm, tir, hil, ewe, tso, aka, tsn, ceb, mlg, che.

The extract_madlad_to_wiki.py script handles extraction from MADLAD’s document-per-line format (paragraph boundaries encoded as literal \n sequences), applies quality filters identical to those used by the main download pipeline, and caps output at 500,000 sentences per language.

Deduplication

Deduplication is performed at the Java training-pipeline stage using FNV-1a 64-bit hashing. Web-crawl data has lower duplication rates than Leipzig news corpora, so a single deduplication pass is sufficient.
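The hash itself is the standard FNV-1a 64-bit loop over UTF-8 bytes. The sketch below shows that algorithm and a seen-set dedup built on it; it is illustrative only and is not the pipeline's actual code.

import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Fnv1aDedupSketch {

    /** Standard FNV-1a 64-bit hash over the UTF-8 bytes of a sentence. */
    static long fnv1a64(String sentence) {
        long hash = 0xcbf29ce484222325L;            // FNV-1a 64-bit offset basis
        for (byte b : sentence.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xffL);                    // xor in the next byte
            hash *= 0x100000001b3L;                 // multiply by the FNV-1a 64-bit prime
        }
        return hash;
    }

    /** Keeps the first occurrence of each sentence and drops later duplicates. */
    static List<String> dedup(List<String> sentences) {
        Set<Long> seen = new HashSet<>();
        return sentences.stream()
                .filter(s -> seen.add(fnv1a64(s)))  // add() is false for an already-seen hash
                .toList();
    }

    public static void main(String[] args) {
        System.out.println(dedup(List.of("hello world", "hallo welt", "hello world")));
        // -> [hello world, hallo welt]
    }
}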

Language Code Merging

Several languages have multiple ISO 639-3 codes that refer to the same language or are indistinguishable by character features. These are merged during both download and training:

Merged From                    Merged To                 Note
azj (North Azerbaijani)        aze (Azerbaijani)         Code variant
cmn (Mandarin Chinese)         zho (Chinese)             Code variant
ekk (Standard Estonian)        est (Estonian)            Code variant
gug (Paraguayan Guaraní)       grn (Guaraní)             Code variant
lvs (Standard Latvian)         lav (Latvian)             Code variant
nor (Norwegian)                nob (Norwegian Bokmål)    Code variant
pes (Iranian Persian)          fas (Persian)             Code variant
plt (Plateau Malagasy)         mlg (Malagasy)            Code variant
quz (Cusco Quechua)            que (Quechua)             Code variant
swa (Swahili macrolanguage)    swh (Coastal Swahili)     Code variant
yid (Yiddish macrolanguage)    ydd (Eastern Yiddish)     Code variant
zsm (Standard Malay)           msa (Malay)               Code variant
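For concreteness, the merge pairs above amount to a map like the one below. This is an illustrative Java sketch of the idea, not the literal contents of the constants listed next.

import java.util.Map;

/** Maps variant ISO 639-3 codes onto the single code used for training. */
final class LangMergeMapSketch {

    static final Map<String, String> LANG_MERGE_MAP = Map.ofEntries(
            Map.entry("azj", "aze"),   // North Azerbaijani -> Azerbaijani
            Map.entry("cmn", "zho"),   // Mandarin Chinese -> Chinese
            Map.entry("ekk", "est"),   // Standard Estonian -> Estonian
            Map.entry("gug", "grn"),   // Paraguayan Guaraní -> Guaraní
            Map.entry("lvs", "lav"),   // Standard Latvian -> Latvian
            Map.entry("nor", "nob"),   // Norwegian -> Norwegian Bokmål
            Map.entry("pes", "fas"),   // Iranian Persian -> Persian
            Map.entry("plt", "mlg"),   // Plateau Malagasy -> Malagasy
            Map.entry("quz", "que"),   // Cusco Quechua -> Quechua
            Map.entry("swa", "swh"),   // Swahili macrolanguage -> Coastal Swahili
            Map.entry("yid", "ydd"),   // Yiddish macrolanguage -> Eastern Yiddish
            Map.entry("zsm", "msa"));  // Standard Malay -> Malay

    /** Resolves a raw corpus code to the code actually used for training. */
    static String resolve(String iso639_3) {
        return LANG_MERGE_MAP.getOrDefault(iso639_3, iso639_3);
    }

    public static void main(String[] args) {
        System.out.println(resolve("cmn"));  // zho
        System.out.println(resolve("eng"));  // eng (unchanged)
    }
}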

The same merge map is maintained in three places and must be kept in sync:

  • LANG_MERGE_MAP in download_madlad.py

  • LANG_MERGE_MAP in PrepareCorpus.java

  • LANG_MERGE_MAP in CommonTokenGenerator.java

Corpus Cleaning

Two data-quality issues were identified during development and addressed in CorpusReader.java:

Breton (bre) noise — approximately 5% of MADLAD Breton sentences were French blog posts, identifiable by lines containing three or more consecutive tildes (~), a Common Crawl redaction marker. A filter discards any sentence containing this pattern.

Dhivehi (div) mixed-script headlines — MADLAD Dhivehi documents consistently begin with a Latin-script headline followed by Thaana-script body text, separated by a literal \n escape sequence. The pipeline splits on this separator and treats each segment as a distinct sentence, preventing the Latin headline from polluting the Thaana training signal. This raised 20-character accuracy from 32.9% to 95.5%.
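Both fixes reduce to a few lines of string handling. The sketch below illustrates the two rules; the method names are invented here and do not mirror the actual CorpusReader.java API.

import java.util.Arrays;
import java.util.List;

final class CorpusCleaningSketch {

    /** Drops lines carrying the Common Crawl redaction marker: three or more consecutive tildes. */
    static boolean isTildeNoise(String sentence) {
        return sentence.contains("~~~");
    }

    /**
     * MADLAD Dhivehi documents start with a Latin-script headline followed by Thaana body text,
     * joined by a literal backslash-n escape sequence. Splitting on that separator keeps the
     * Latin headline out of the Thaana training signal.
     */
    static List<String> splitOnLiteralNewlineEscape(String document) {
        return Arrays.stream(document.split("\\\\n"))  // the two characters '\' and 'n', not a real newline
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .toList();
    }

    public static void main(String[] args) {
        System.out.println(isTildeNoise("texte en français ~~~ redacted"));                   // true
        System.out.println(splitOnLiteralNewlineEscape("Some Latin headline\\nThaana body")); // [Some Latin headline, Thaana body]
    }
}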

Filtering Low-Resource Languages

Languages with fewer than 10,000 sentences after deduplication are excluded. This threshold ensures enough data for the model to learn useful distributions even after the mislabel-filtering step.

Explicit Language Exclusions

Some languages meet the 10,000-sentence minimum but are explicitly excluded. Exclusion decisions are made after a full training run by evaluating per-language F1 on the held-out test set and inspecting confusion patterns. A language is excluded if it falls into one or more of these categories:

Accuracy interference with a closely related language. When a language’s written form is nearly identical to a more widely-used language at the character n-gram level, including both causes the model to split probability mass between them. This depresses accuracy for the more widely-used language by an amount that exceeds any benefit from detecting the variant. The exclusion is a deliberate choice to serve the larger user population; it does not reflect on the importance or validity of the excluded language.

Own accuracy below a useful threshold. If the model cannot achieve reliable accuracy for a language — because its character profile overlaps too heavily with other included languages — then returning a prediction for it does more harm than good. A confidently wrong prediction is worse than no prediction.

Training corpus not representative of natural language use. Some languages have MADLAD corpus entries dominated by boilerplate, Lorem Ipsum placeholder text, or other non-natural content. A model trained on such data learns the boilerplate rather than the language, producing unreliable results in real-world documents.

These exclusions are applied by removing the language’s pool file before Pass 2 training. If future corpus improvements make a language reliably distinguishable, it can be reintroduced by adding its pool file back and retraining.

Data Splitting Strategy

Split           Size                            Preprocessing
Test            10% per language (max 20,000)   Raw (no preprocessing)
Dev             10% per language (max 20,000)   Preprocessed (NFC, lowercase, URL/email stripped)
Training pool   Remainder                       Preprocessed, stored as per-language files

Each epoch draws a fresh sample from the pool: a binary search finds a flat cap C such that Σ min(n_i, C) ≈ 5,000,000, each language then contributes up to C sentences, and the combined sample is globally shuffled. High-resource languages are capped per epoch; low-resource languages contribute all their data every epoch.
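The cap search is a plain binary search over candidate values of C. A minimal sketch, assuming the per-language pool sizes are already known and using an exact "smallest cap that reaches the target" criterion in place of the ≈ above:

final class EpochCapSketch {

    /**
     * Finds the smallest flat cap C such that sum over languages of min(n_i, C)
     * reaches the per-epoch target (or the largest pool size if the target is unreachable).
     */
    static long findCap(long[] poolSizes, long target) {
        long lo = 1;
        long hi = java.util.Arrays.stream(poolSizes).max().orElse(1);
        while (lo < hi) {
            long mid = (lo + hi) / 2;
            long total = java.util.Arrays.stream(poolSizes).map(n -> Math.min(n, mid)).sum();
            if (total >= target) {
                hi = mid;          // cap is high enough, try a smaller one
            } else {
                lo = mid + 1;      // not enough sentences, raise the cap
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        // Three high-resource pools and two low-resource pools, with a small target for illustration.
        long[] pools = {500_000, 480_000, 450_000, 30_000, 12_000};
        long cap = findCap(pools, 1_000_000);
        System.out.println("cap = " + cap);  // high-resource languages are truncated to this cap each epoch
    }
}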

Training Pipeline

Pass 1: Initial Training

AdamW for 2 epochs followed by Hogwild! SGD for up to 3 more epochs, each with epoch-level resampling from the full training pool.

Mislabeled Sentence Filtering

The Pass 1 model predicts each sentence in the entire training pool. Sentences where the prediction does not match the label are removed — unless the prediction falls within the same confusable language group (e.g., a sentence labeled msa predicted as ind is kept).

This filtering is applied once to the full pool, producing a pool_filtered/ directory that is used for Pass 2.
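The keep/drop rule itself is small. A sketch of the group-aware decision, using the confusable groups described later on this page (the record of how the actual filter is wired into the pipeline is not shown):

import java.util.Map;
import java.util.Set;

final class MislabelFilterSketch {

    // A Pass 1 prediction inside the label's confusable group does not count as a mislabel.
    static final Map<String, Set<String>> CONFUSABLE_GROUPS = Map.of(
            "msa", Set.of("msa", "ind"),
            "ind", Set.of("msa", "ind"),
            "xho", Set.of("xho", "zul"),
            "zul", Set.of("xho", "zul"));

    /** Returns true if the sentence should stay in the filtered pool. */
    static boolean keep(String label, String pass1Prediction) {
        if (label.equals(pass1Prediction)) {
            return true;                                    // correctly predicted
        }
        return CONFUSABLE_GROUPS.getOrDefault(label, Set.of(label))
                .contains(pass1Prediction);                 // confusion inside the group is tolerated
    }

    public static void main(String[] args) {
        System.out.println(keep("msa", "ind"));  // true  -> kept
        System.out.println(keep("msa", "eng"));  // false -> removed as mislabeled
    }
}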

Pass 2: Retraining on Filtered Data

Same optimizer schedule and resampling strategy, but drawing from the filtered pool. This typically improves accuracy by 0.3–0.5 percentage points.

Final Steps

  1. INT8 quantization — convert float32 weights to int8 with per-class scales (see the sketch after this list)

  2. Evaluate — test the quantized model on the raw test set (full pipeline)

  3. Export — write the LDM1 binary model file
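Step 1 can be sketched as symmetric per-class quantization: each class's weight row gets one scale derived from its largest absolute weight. This is an illustration of the technique, not the actual LDM1 exporter, whose layout and rounding may differ.

final class Int8QuantSketch {

    /** Quantized weights for one output class: int8 values plus a single dequantization scale. */
    record QuantizedRow(byte[] weights, float scale) { }

    /** Quantizes one class's float32 weight row to int8 using a symmetric per-class scale. */
    static QuantizedRow quantizeRow(float[] row) {
        float maxAbs = 0f;
        for (float w : row) {
            maxAbs = Math.max(maxAbs, Math.abs(w));
        }
        float scale = maxAbs == 0f ? 1f : maxAbs / 127f;    // map [-maxAbs, maxAbs] onto [-127, 127]
        byte[] q = new byte[row.length];
        for (int i = 0; i < row.length; i++) {
            q[i] = (byte) Math.round(row[i] / scale);
        }
        return new QuantizedRow(q, scale);
    }

    /** Dequantized dot product: score = scale * sum(q_i * x_i). */
    static float score(QuantizedRow row, float[] features) {
        float acc = 0f;
        for (int i = 0; i < features.length; i++) {
            acc += row.weights()[i] * features[i];
        }
        return row.scale() * acc;
    }

    public static void main(String[] args) {
        float[] row = {0.8f, -0.2f, 0.05f};
        float[] x = {1f, 0.5f, 2f};
        QuantizedRow q = quantizeRow(row);
        System.out.printf("float32 score=%.4f  int8 score=%.4f%n",
                0.8f * 1f - 0.2f * 0.5f + 0.05f * 2f, score(q, x));
    }
}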

Confusable Language Groups

Confusable groups are defined only for language pairs where the trained model demonstrably confuses them at meaningful rates on held-out data. Groups are not added speculatively; each entry is backed by observed confusion in evaluation.

Groups are defined in tika-langdetect-charsoup-core/src/main/resources/…/confusables.txt and must contain only codes that are actual output classes of the trained model (i.e., present in the training corpus). Dead codes add no benefit.

Current groups:

  • msa / ind — Malay and Indonesian share vocabulary and script so heavily that in-distribution confusion exceeds 20% in both directions.

  • xho / zul — Xhosa and Zulu are both Nguni Bantu languages written in the same Latin-based orthography with very similar character n-gram profiles.

These groups are used in:

  1. Training — group-aware mislabel filtering (a sentence labeled msa predicted as ind is not removed as mislabeled)

  2. Inference — probability mass within a group is collapsed to the highest-scoring member before returning a result (sketched after this list)

  3. Evaluation — within-group predictions count as correct in the group accuracy metric
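A minimal sketch of the inference-time collapse from item 2, assuming per-language probabilities keyed by ISO 639-3 code; the group table mirrors confusables.txt, but the code is illustrative rather than the detector's actual implementation:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ConfusableCollapseSketch {

    // Mirrors the current confusables.txt groups.
    static final List<Set<String>> GROUPS = List.of(
            Set.of("msa", "ind"),
            Set.of("xho", "zul"));

    /**
     * Moves all probability mass inside each group onto the group's highest-scoring member,
     * so callers never see the mass split across near-identical languages.
     */
    static Map<String, Double> collapse(Map<String, Double> probs) {
        Map<String, Double> out = new HashMap<>(probs);
        for (Set<String> group : GROUPS) {
            String best = null;
            double sum = 0.0;
            for (String lang : group) {
                Double p = out.remove(lang);
                if (p == null) {
                    continue;                               // language not among the returned classes
                }
                sum += p;
                if (best == null || p > probs.getOrDefault(best, 0.0)) {
                    best = lang;
                }
            }
            if (best != null) {
                out.put(best, sum);                         // winner absorbs the group's mass
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> probs = Map.of("msa", 0.41, "ind", 0.38, "eng", 0.21);
        System.out.println(collapse(probs));  // msa absorbs ind's mass: {eng=0.21, msa≈0.79}
    }
}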

Common Token Lists (tika-eval)

The same MADLAD corpus is used to generate common token frequency lists for tika-eval. The CommonTokenGenerator in tika-eval-core reads sentences_madlad.txt files and applies:

  • TikaEvalTokenizer in COMMON_TOKENS mode (NFKD normalization, minimum length 3, no numbers, no HTML terms)

  • The same language merge map and FNV-1a deduplication as the training pipeline

The 500k sentences stored per language provide stable frequency estimates for the top-30,000 tokens with a minimum document frequency of 10.

java -cp tika-eval/tika-eval-core/target/test-classes:\
tika-eval/tika-eval-core/target/classes:\
tika-langdetect/tika-langdetect-charsoup/target/classes:\
tika-langdetect/tika-langdetect-charsoup-core/target/classes \
    org.apache.tika.eval.core.tokens.tools.CommonTokenGenerator \
    ~/datasets/madlad/data \
    /tmp/common_tokens_new \
    30000 10 \
    --model tika-langdetect/tika-langdetect-charsoup-core/src/main/resources/org/apache/tika/langdetect/charsoup/langdetect-20260320.bin

Arguments: <corpusDir> <outputDir> [topN] [minDocFreq] [--model <modelFile>]

The --model flag restricts output to the languages that are actual trained output classes of the given CharSoup model. Without it, all non-excluded languages in the corpus directory are processed — which may include languages that did not survive the mislabel-filtering step and are not in the final model.

CommonTokenGenerator looks for sentences_madlad.txt files inside each language subdirectory.

Current Build: 20260320

The model langdetect-20260320.bin is the current production model (trained 2026-03-20). It uses SaltedNgramFeatureExtractor with trigrams, 4-grams, script block features, L2 normalization, and short-word-anchored word bigrams.

20260320 Training Configuration

  • Corpus: Wikipedia dumps + MADLAD-400 supplements

  • Languages: 204 (includes Tibetan bod)

  • Pool cap: 500,000 sentences per language

  • Feature extractor: SaltedNgramFeatureExtractor — positional-salted character bigrams (BOW/EOW/FULL_WORD/MID), trigrams, 4-grams, CJK character unigrams, script block features (24 script categories, raw counts), short-word-anchored word bigrams (anchor = prev word ≤ 3 chars); a simplified hashing sketch follows this list

  • Hash buckets: 32,768

  • L2 normalization: enabled

  • Target epoch total: 5,000,000 sentences per epoch

  • Two-pass training: Pass 1 on full pool → mislabel filter → Pass 2

  • JVM: -Xmx8g

  • Model size: ~6.4 MB on disk, ~8.1 MB heap (INT8 quantized)
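To make the bucket and normalization settings concrete, the sketch below hashes character trigrams and 4-grams into a 32,768-slot vector and applies L2 normalization. It deliberately omits the positional salting, script blocks, CJK unigrams, and word bigrams that the real SaltedNgramFeatureExtractor adds, and the hash function here (String.hashCode) is only a stand-in.

final class HashedNgramSketch {

    static final int BUCKETS = 32_768;   // matches the 20260320 configuration

    /** Accumulates character trigrams and 4-grams into a hashed, L2-normalized feature vector. */
    static float[] extract(String text) {
        float[] vec = new float[BUCKETS];
        String padded = " " + text.toLowerCase() + " ";
        for (int n : new int[] {3, 4}) {
            for (int i = 0; i + n <= padded.length(); i++) {
                String ngram = padded.substring(i, i + n);
                int bucket = Math.floorMod(ngram.hashCode(), BUCKETS);  // colliding n-grams share a bucket
                vec[bucket] += 1f;
            }
        }
        // L2 normalization so long and short inputs produce comparable feature magnitudes.
        double norm = 0.0;
        for (float v : vec) {
            norm += v * v;
        }
        if (norm > 0) {
            float inv = (float) (1.0 / Math.sqrt(norm));
            for (int i = 0; i < vec.length; i++) {
                vec[i] *= inv;
            }
        }
        return vec;
    }

    public static void main(String[] args) {
        float[] features = extract("Der schnelle braune Fuchs");
        int nonZero = 0;
        for (float v : features) {
            if (v != 0f) {
                nonZero++;
            }
        }
        System.out.println("non-zero buckets: " + nonZero + " of " + BUCKETS);
    }
}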

20260320 Corpus Preparation

./mvnw clean compile test-compile \
  -pl tika-langdetect/tika-langdetect-charsoup-core,tika-langdetect/tika-langdetect-charsoup \
  -DskipTests -Dforbiddenapis.skip=true -Dcheckstyle.skip=true

./mvnw -pl tika-langdetect/tika-langdetect-charsoup exec:java \
  -Dexec.mainClass="org.apache.tika.langdetect.charsoup.tools.PrepareCorpus" \
  -Dexec.classpathScope=test \
  -Dexec.args="--corpus ~/datasets/wikipedia-dumps \
               --output-dir ~/datasets/wikipedia-model-20260320/preprocessed \
               --max-train 500000" \
  -DskipTests -Dforbiddenapis.skip=true -Dcheckstyle.skip=true

20260320 Training

./mvnw -pl tika-langdetect/tika-langdetect-charsoup exec:java \
  -Dexec.mainClass="org.apache.tika.langdetect.charsoup.tools.TrainLanguageModel" \
  -Dexec.classpathScope=test \
  -Dexec.args="--corpus ~/datasets/wikipedia-dumps \
               --prep-dir ~/datasets/wikipedia-model-20260320/preprocessed \
               --output ~/datasets/wikipedia-model-20260320/langdetect-20260320.bin \
               --buckets 32768 \
               --4grams --salted --l2-norm --word-bigrams" \
  -DskipTests -Dforbiddenapis.skip=true -Dcheckstyle.skip=true \
  -Dexec.jvmArgs="-Xmx8g"

Copy the trained model into the resources directory:

cp ~/datasets/wikipedia-model-20260320/langdetect-20260320.bin \
   tika-langdetect/tika-langdetect-charsoup-core/src/main/resources/\
org/apache/tika/langdetect/charsoup/langdetect-20260320.bin

Then update MODEL_RESOURCE in CharSoupLanguageDetector to point to the new file, and remove the old binary.
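Assuming CharSoupLanguageDetector follows Tika's standard org.apache.tika.language.detect.LanguageDetector contract (an assumption here, not something this page specifies; the package name is inferred from the resource path above), swapping in the new model should be transparent to callers:

// Package and class name assumed from the resource path on this page.
import org.apache.tika.langdetect.charsoup.CharSoupLanguageDetector;
import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.language.detect.LanguageResult;

public class DetectExample {
    public static void main(String[] args) throws Exception {
        // Assumes the detector exposes the standard Tika LanguageDetector API.
        LanguageDetector detector = new CharSoupLanguageDetector();
        detector.loadModels();    // loads the bundled langdetect-20260320.bin from the classpath
        LanguageResult result =
                detector.detect("Der schnelle braune Fuchs springt über den faulen Hund.");
        System.out.println(result.getLanguage() + " " + result.getRawScore());
    }
}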

20260320 Changes from previous builds

  • 32,768 buckets (up from 16,384) — doubles the feature space, reducing hash collisions at the cost of doubling model size. Net gain at all lengths.

  • 4-grams added (--4grams) — improves disambiguation of short text.

  • ScriptCategory COUNT reverted to 24 — the COUNT=36 expansion (adding 12 Indic script categories) was net negative. Indic languages were already 98–99%+ accurate from n-grams alone; the extra categories added bucket collision noise for Latin/Cyrillic languages without improving Indic accuracy.

  • Log-dampening of script features reverted — replacing raw character counts with log1p(count) hurt 108 languages while helping 64. Kept raw counts.

  • Word bigrams (--word-bigrams) — short-word-anchored bigrams (anchor = prev word ≤ 3 chars, e.g. "the X", "de X"). +0.32pp @20 with significant wins for minority/confusable languages.

  • Single model — the separate short-text model is retired. The unified 32K-bucket model outperforms it at all text lengths while covering all 204 languages.

20260320 Evaluation: FLORES-200 Dev Set

Results on the FLORES-200 dev set (204 test languages, 997 sentences each). All scores are macro-averaged F1. Raw eval output: flores-eval-20260320.txt.

Coverage-adjusted accuracy (each detector on its own supported languages)

Length       CharSoup mF1   OpenNLP mF1   Lingua mF1   Optimaize mF1
@20 chars    82.51%         74.87%        76.35%       84.87%
@50 chars    94.44%         86.09%        90.99%       94.44%
@100 chars   96.98%         90.25%        95.43%       96.51%
@200 chars   97.45%         91.11%        96.23%       96.75%
full text    97.46%         91.12%        96.25%       96.76%
Languages    204            ~105          75           63

Coverage-adjusted scores reflect each detector’s performance on the languages it actually supports. CharSoup covers 204 languages; Optimaize covers only 63, Lingua 75, OpenNLP ~105.

Head-to-head: CharSoup vs OpenNLP (105 shared languages)

Length       CharSoup mF1   OpenNLP mF1   CharSoup ms   OpenNLP ms
@20 chars    84.23%         76.69%        504           228
@50 chars    95.73%         87.27%        431           333
@100 chars   98.09%         91.05%        523           620
@200 chars   98.51%         91.77%        583           969
full text    98.51%         91.78%        604           959

Head-to-head: CharSoup vs Lingua (71 shared languages)

Length       CharSoup mF1   Lingua mF1   CharSoup ms   Lingua ms
@20 chars    85.28%         77.78%       303           2,960
@50 chars    96.37%         92.24%       296           5,992
@100 chars   98.51%         96.58%       359           9,756
@200 chars   98.82%         97.37%       437           12,615
full text    98.83%         97.38%       402           12,847

Head-to-head: CharSoup vs Optimaize (63 shared languages)

Length       CharSoup mF1   Optimaize mF1   CharSoup ms   Optimaize ms
@20 chars    86.52%         86.30%          215           76
@50 chars    96.92%         95.40%          250           367
@100 chars   98.84%         97.23%          320           360
@200 chars   99.13%         97.34%          346           374
full text    99.14%         97.34%          353           385

CharSoup matches or leads Optimaize across all lengths on the 63-language overlap, while covering 3× more languages total.

20260320 Resource Usage

Metric                 CharSoup       OpenNLP           Lingua (low-accuracy mode)   Optimaize
Languages supported    204            ~105              75                           63
Model heap             ~8.1 MB        ~79.2 MB          ~0.1 MB                      ~94.5 MB
Model file (disk)      6.4 MB         ~22 MB            n/a                          n/a
Throughput (@20)       ~139K sent/s   ~133K sent/s      ~10K sent/s                  ~248K sent/s
Throughput (full)      ~116K sent/s   n/a               n/a                          n/a
Runtime dependencies   None           OpenNLP + model   Lingua jar + Kotlin          Optimaize jar

All detectors evaluated with 12 threads on the FLORES-200 dev set (203,381 sentences). Throughput is wall-clock sentences per second.

Historical Build: v7 (Wikipedia + MADLAD)

v7 used a single general model (langdetect.bin) covering 203 languages, trained on Wikipedia dumps (primary) supplemented by MADLAD for under-resourced languages. It used 16,384 hash buckets and ScriptAwareFeatureExtractor, and weighed 3.2 MB on disk.

A separate short-text model was also trained for v7 but is retired as of 20260320; the unified 32K-bucket model supersedes it.

v7 General Model Training Configuration

  • Corpus: Wikipedia dumps (primary) + MADLAD supplement for 17 languages

  • Languages: 203

  • Hash buckets: 16,384

  • Feature extractor: ScriptAwareFeatureExtractor — character bigrams, trigrams, 3-char suffixes, 3-char prefixes, word unigrams, CJK character unigrams

  • Target epoch total: 5,000,000 sentences per epoch

  • Optimizer: AdamW (lr=0.001, mini-batch=64) × 2 epochs, then Hogwild! SGD (lr=0.01→0.001) × 6 max epochs with within-epoch early stopping

  • Two-pass training: Pass 1 on full pool → mislabel filter (removed 1.5%) → Pass 2

  • Model size: 3.2 MB (INT8 quantized)

  • JVM: -Xmx8g

References

For the academic references behind the techniques used in training and inference, see the References section of the main language detection page.