Building the CharSoup Language Detector
- Training Corpus
- Data Splitting Strategy
- Training Pipeline
- Confusable Language Groups
- Common Token Lists (tika-eval)
- Current Build: 20260320
- Historical Build: v7 (Wikipedia + MADLAD)
- References
This page documents how the tika-langdetect-charsoup language detection model
is trained, the decisions made along the way, and benchmark comparisons against
the existing OpenNLP-based detector. For architecture and API details, see
Language Detection.
Training Corpus
The primary training data comes from Wikipedia database dumps (dumps.wikimedia.org). Wikipedia is preferred over web-crawl corpora for quality: articles are human-authored, editorial standards filter boilerplate and spam, and the sentence distribution reflects genuine prose rather than SEO content or duplicated web templates.
Sentences are extracted from the Wikipedia XML dumps by the
extract_wiki_sentences.py script, which strips markup, splits the text into
sentences, and writes a lineNum<TAB>sentence file into each language directory:
~/datasets/wikipedia-dumps/
eng/sentences.txt
deu/sentences.txt
ara/sentences.txt
...
For 17 languages with insufficient Wikipedia coverage, sentences are
supplemented from
MADLAD-400
(Kudugunta et al., 2023). These languages are written into a parallel
sentences_madlad.txt file alongside the Wikipedia data, and PrepareCorpus
reads all *.txt files from each language directory automatically:
~/datasets/wikipedia-dumps/
mya/
sentences.txt (Wikipedia)
sentences_madlad.txt (MADLAD supplement)
xho/
sentences.txt
sentences_madlad.txt
...
The MADLAD-supplemented languages are:
mya, xho, nya, smo, sot, tet, orm, udm, tir, hil,
ewe, tso, aka, tsn, ceb, mlg, che.
The extract_madlad_to_wiki.py script handles extraction from MADLAD’s
document-per-line format (paragraph boundaries encoded as literal \n
sequences), applies quality filters identical to those used by the main
download pipeline, and caps output at 500,000 sentences per language.
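Assuming the document-per-line format described above, the extraction step might look like this sketch (the `extract` helper and the length filter are illustrative stand-ins, not the actual extract_madlad_to_wiki.py code):

```python
# Illustrative sketch of MADLAD extraction: one document per line,
# paragraph boundaries encoded as the literal two-character sequence "\n".
# The real script applies the download pipeline's quality filters; the
# length filter here is a placeholder.
MAX_SENTENCES = 500_000

def extract(doc_lines, keep=lambda s: len(s) >= 20):
    out = []
    for doc in doc_lines:
        # Split on the *literal* backslash-n sequence, not a real newline.
        for para in doc.split("\\n"):
            para = para.strip()
            if para and keep(para):
                out.append(para)
                if len(out) >= MAX_SENTENCES:
                    return out
    return out
```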
Deduplication
Deduplication is performed at the Java training-pipeline stage using FNV-1a 64-bit hashing. Web-crawl data has lower duplication rates than Leipzig news corpora, so a single deduplication pass is sufficient.
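For reference, a minimal Python sketch of FNV-1a 64-bit deduplication (the hash function itself is standard; the surrounding dedup loop is an illustration, not the Java pipeline's exact code):

```python
# FNV-1a 64-bit constants (standard values).
FNV_OFFSET = 0xcbf29ce484222325
FNV_PRIME = 0x100000001b3
MASK64 = (1 << 64) - 1

def fnv1a_64(text: str) -> int:
    """FNV-1a 64-bit hash over the UTF-8 bytes of `text`."""
    h = FNV_OFFSET
    for b in text.encode("utf-8"):
        h = ((h ^ b) * FNV_PRIME) & MASK64
    return h

def dedup(sentences):
    """Keep only the first occurrence of each sentence, keyed by hash."""
    seen = set()
    for s in sentences:
        h = fnv1a_64(s)
        if h not in seen:
            seen.add(h)
            yield s
```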
Language Code Merging
Several languages have multiple ISO 639-3 codes that refer to the same language or are indistinguishable by character features. These are merged during both download and training:
| Merged From | Merged To | Note |
|---|---|---|
| … | … | Code variant (12 merged pairs; the specific codes are listed in LANG_MERGE_MAP) |
The same merge map is maintained in three places and must be kept in sync:

- `download_madlad.py` — `LANG_MERGE_MAP`
- `PrepareCorpus.java` — `LANG_MERGE_MAP`
- `CommonTokenGenerator.java` — `LANG_MERGE_MAP`
Corpus Cleaning
Two data-quality issues were identified during development and addressed in
CorpusReader.java:
Breton (bre) noise — approximately 5% of MADLAD Breton sentences were
French blog posts, identifiable by lines containing three or more consecutive
tildes (~), a Common Crawl redaction marker. A filter discards any sentence
containing this pattern.
Dhivehi (div) mixed-script headlines — MADLAD Dhivehi documents
consistently begin with a Latin-script headline followed by Thaana-script body
text, separated by a literal \n escape sequence. The pipeline splits on this
separator and treats each segment as a distinct sentence, preventing the Latin
headline from polluting the Thaana training signal. This raised 20-character
accuracy from 32.9% to 95.5%.
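The two cleaning rules can be sketched as follows (hypothetical helper names; the real logic lives in CorpusReader.java):

```python
import re

# Common Crawl redaction marker: three or more consecutive tildes.
TILDE_RUN = re.compile(r"~{3,}")

def keep_breton(sentence: str) -> bool:
    """Discard MADLAD Breton lines carrying the ~~~ redaction marker."""
    return TILDE_RUN.search(sentence) is None

def split_dhivehi(line: str):
    """Split on the literal \\n escape sequence so the Latin-script
    headline and the Thaana-script body become separate sentences."""
    return [seg.strip() for seg in line.split("\\n") if seg.strip()]
```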
Filtering Low-Resource Languages
Languages with fewer than 10,000 sentences after deduplication are excluded. This threshold ensures enough data for the model to learn useful distributions even after the mislabel-filtering step.
Explicit Language Exclusions
Some languages meet the 10,000-sentence minimum but are explicitly excluded. Exclusion decisions are made after a full training run by evaluating per-language F1 on the held-out test set and inspecting confusion patterns. A language is excluded if it falls into one or more of these categories:
Accuracy interference with a closely related language. When a language’s written form is nearly identical to a more widely-used language at the character n-gram level, including both causes the model to split probability mass between them. This depresses accuracy for the more widely-used language by an amount that exceeds any benefit from detecting the variant. The exclusion is a deliberate choice to serve the larger user population; it does not reflect on the importance or validity of the excluded language.
Own accuracy below a useful threshold. If the model cannot achieve reliable accuracy for a language — because its character profile overlaps too heavily with other included languages — then returning a prediction for it does more harm than good. A confidently wrong prediction is worse than no prediction.
Training corpus not representative of natural language use. Some languages have MADLAD corpus entries dominated by boilerplate, Lorem Ipsum placeholder text, or other non-natural content. A model trained on such data learns the boilerplate rather than the language, producing unreliable results in real-world documents.
These exclusions are applied by removing the language’s pool file before Pass 2 training. If future corpus improvements make a language reliably distinguishable, it can be reintroduced by adding its pool file back and retraining.
Data Splitting Strategy
| Split | Size | Preprocessing |
|---|---|---|
| Test | 10% per language (max 20,000) | Raw (no preprocessing) |
| Dev | 10% per language (max 20,000) | Preprocessed (NFC, lowercase, URL/email stripped) |
| Training pool | Remainder | Preprocessed, stored as per-language files |
Each epoch draws a fresh sample from the pool: a binary search finds a flat cap C
such that Σ min(n_i, C) ≈ 5,000,000, then each language contributes up to C
sentences, globally shuffled. High-resource languages are thus capped per epoch;
low-resource languages contribute all their data every epoch.
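The cap search can be sketched as a simple binary search (illustrative Python; the actual Java implementation may differ in tie-breaking):

```python
# Find the largest flat cap C such that sum(min(n_i, C)) stays at or
# below the per-epoch sentence target.
def find_cap(counts, target=5_000_000):
    lo, hi = 0, max(counts)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if sum(min(n, mid) for n in counts) <= target:
            lo = mid  # mid still fits under the target; try higher
        else:
            hi = mid - 1
    return lo
```

With 20 languages of 500,000 sentences each, the cap comes out to 250,000 per language; a language with fewer sentences than the cap contributes everything it has.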
Training Pipeline
Pass 1: Initial Training
AdamW for 2 epochs followed by Hogwild! SGD for up to 3 more epochs, each with epoch-level resampling from the full training pool.
Mislabeled Sentence Filtering
The Pass 1 model predicts each sentence in the entire training pool. Sentences
where the prediction does not match the label are removed — unless the
prediction falls within the same confusable language group (e.g., a sentence
labeled msa predicted as ind is kept).
This filtering is applied once to the full pool, producing a pool_filtered/
directory that is used for Pass 2.
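A sketch of the group-aware filter, with `predict` standing in for the Pass 1 model (names are illustrative):

```python
# Confusable groups as described in the section below.
CONFUSABLE = [{"msa", "ind"}, {"xho", "zul"}]

def same_group(a: str, b: str) -> bool:
    return any(a in g and b in g for g in CONFUSABLE)

def filter_pool(pool, predict):
    """pool: iterable of (label, sentence). Keep a sentence if the Pass 1
    prediction matches its label or falls in the same confusable group."""
    for label, sent in pool:
        pred = predict(sent)
        if pred == label or same_group(pred, label):
            yield label, sent
```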
Confusable Language Groups
Confusable groups are defined only for language pairs where the trained model demonstrably confuses them at meaningful rates on held-out data. Groups are not added speculatively; each entry is backed by observed confusion in evaluation.
Groups are defined in
tika-langdetect-charsoup-core/src/main/resources/…/confusables.txt
and must only contain codes that are actual output classes of the trained model
(i.e., present in the training corpus). Dead codes add no benefit.
Current groups:
- `msa`/`ind` — Malay and Indonesian share vocabulary and script so heavily that in-distribution confusion exceeds 20% in both directions.
- `xho`/`zul` — Xhosa and Zulu are both Nguni Bantu languages written in the same Latin-based orthography with very similar character n-gram profiles.
These groups are used in:
- Training — group-aware mislabel filtering (a sentence labeled `msa` predicted as `ind` is not removed as mislabeled)
- Inference — probability mass within a group is collapsed to the highest-scoring member before returning a result
- Evaluation — within-group predictions count as correct in the group accuracy metric
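The inference-time collapse might be sketched as follows (assumed behavior based on the description above; the real code operates on the model's internal score vector):

```python
CONFUSABLE = [{"msa", "ind"}, {"xho", "zul"}]

def collapse(scores: dict) -> dict:
    """Sum each group's probability mass onto its highest-scoring member."""
    out = dict(scores)
    for group in CONFUSABLE:
        members = [lang for lang in group if lang in out]
        if len(members) > 1:
            best = max(members, key=out.get)
            total = sum(out[lang] for lang in members)
            for lang in members:
                del out[lang]
            out[best] = total
    return out
```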
Common Token Lists (tika-eval)
The same MADLAD corpus is used to generate common token frequency lists for
tika-eval. The CommonTokenGenerator in tika-eval-core reads
sentences_madlad.txt files and applies:
- `TikaEvalTokenizer` in `COMMON_TOKENS` mode (NFKD normalization, minimum length 3, no numbers, no HTML terms)
- The same language merge map and FNV-1a deduplication as the training pipeline
The 500k sentences stored per language provide stable frequency estimates for the top-30,000 tokens with a minimum document frequency of 10.
java -cp tika-eval/tika-eval-core/target/test-classes:\
tika-eval/tika-eval-core/target/classes:\
tika-langdetect/tika-langdetect-charsoup/target/classes:\
tika-langdetect/tika-langdetect-charsoup-core/target/classes \
org.apache.tika.eval.core.tokens.tools.CommonTokenGenerator \
~/datasets/madlad/data \
/tmp/common_tokens_new \
30000 10 \
--model tika-langdetect/tika-langdetect-charsoup-core/src/main/resources/org/apache/tika/langdetect/charsoup/langdetect-20260320.bin
Arguments: <corpusDir> <outputDir> [topN] [minDocFreq] [--model <modelFile>]
The --model flag restricts output to the languages that are actual trained
output classes of the given CharSoup model. Without it, all non-excluded
languages in the corpus directory are processed — which may include languages
that did not survive the mislabel-filtering step and are not in the final model.
CommonTokenGenerator looks for sentences_madlad.txt files inside
each language subdirectory.
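A rough Python stand-in for the COMMON_TOKENS filtering rules (the real implementation is `TikaEvalTokenizer` in tika-eval-core; the HTML-term filter is omitted here):

```python
import unicodedata

def common_token(tok: str):
    """Return the normalized token, or None if it fails the
    COMMON_TOKENS-style filters (NFKD, min length 3, no digits)."""
    tok = unicodedata.normalize("NFKD", tok).lower()
    if len(tok) < 3 or any(c.isdigit() for c in tok):
        return None
    return tok
```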
Current Build: 20260320
The model langdetect-20260320.bin is the current production model (trained
2026-03-20). It uses SaltedNgramFeatureExtractor with trigrams, 4-grams,
script block features, L2 normalization, and short-word-anchored word bigrams.
20260320 Training Configuration
- Corpus: Wikipedia dumps + MADLAD-400 supplements
- Languages: 204 (includes Tibetan `bod`)
- Pool cap: 500,000 sentences per language
- Feature extractor: `SaltedNgramFeatureExtractor` — positional-salted character bigrams (BOW/EOW/FULL_WORD/MID), trigrams, 4-grams, CJK character unigrams, script block features (24 script categories, raw counts), short-word-anchored word bigrams (anchor = prev word ≤ 3 chars)
- Hash buckets: 32,768
- L2 normalization: enabled
- Target epoch total: 5,000,000 sentences per epoch
- Two-pass training: Pass 1 on full pool → mislabel filter → Pass 2
- JVM: `-Xmx8g`
- Model size: ~6.4 MB on disk, ~8.1 MB heap (INT8 quantized)
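As a toy illustration of hashed character n-gram features (not the actual `SaltedNgramFeatureExtractor`: the boundary markers, salt scheme, and hash function are placeholders):

```python
# Hash each character n-gram into a fixed bucket space; word-boundary
# markers play the role of the positional salts described above.
BUCKETS = 32_768

def ngram_features(text: str, n: int = 3):
    feats = {}
    for word in text.lower().split():
        w = f"\x02{word}\x03"  # mark beginning and end of word
        for i in range(len(w) - n + 1):
            bucket = hash(w[i:i + n]) % BUCKETS
            feats[bucket] = feats.get(bucket, 0) + 1
    return feats
```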
20260320 Corpus Preparation
./mvnw clean compile test-compile \
-pl tika-langdetect/tika-langdetect-charsoup-core,tika-langdetect/tika-langdetect-charsoup \
-DskipTests -Dforbiddenapis.skip=true -Dcheckstyle.skip=true
./mvnw -pl tika-langdetect/tika-langdetect-charsoup exec:java \
-Dexec.mainClass="org.apache.tika.langdetect.charsoup.tools.PrepareCorpus" \
-Dexec.classpathScope=test \
-Dexec.args="--corpus ~/datasets/wikipedia-dumps \
--output-dir ~/datasets/wikipedia-model-20260320/preprocessed \
--max-train 500000" \
-DskipTests -Dforbiddenapis.skip=true -Dcheckstyle.skip=true
20260320 Training
./mvnw -pl tika-langdetect/tika-langdetect-charsoup exec:java \
-Dexec.mainClass="org.apache.tika.langdetect.charsoup.tools.TrainLanguageModel" \
-Dexec.classpathScope=test \
-Dexec.args="--corpus ~/datasets/wikipedia-dumps \
--prep-dir ~/datasets/wikipedia-model-20260320/preprocessed \
--output ~/datasets/wikipedia-model-20260320/langdetect-20260320.bin \
--buckets 32768 \
--4grams --salted --l2-norm --word-bigrams" \
-DskipTests -Dforbiddenapis.skip=true -Dcheckstyle.skip=true \
-Dexec.jvmArgs="-Xmx8g"
Copy the trained model into the resources directory:
cp ~/datasets/wikipedia-model-20260320/langdetect-20260320.bin \
tika-langdetect/tika-langdetect-charsoup-core/src/main/resources/\
org/apache/tika/langdetect/charsoup/langdetect-20260320.bin
Then update MODEL_RESOURCE in CharSoupLanguageDetector to point to the
new file, and remove the old binary.
20260320 Changes from previous builds
- 32,768 buckets (up from 16,384) — doubles the feature space, reducing hash collisions at the cost of doubling model size. Net gain at all lengths.
- 4-grams added (`--4grams`) — improves disambiguation of short text.
- ScriptCategory COUNT reverted to 24 — the COUNT=36 expansion (adding 12 Indic script categories) was net negative: Indic languages were already 98–99%+ accurate from n-grams alone, and the extra categories added bucket-collision noise for Latin/Cyrillic languages without improving Indic accuracy.
- Log-dampening of script features reverted — replacing raw character counts with `log1p(count)` hurt 108 languages while helping 64. Raw counts kept.
- Word bigrams (`--word-bigrams`) — short-word-anchored bigrams (anchor = prev word ≤ 3 chars, e.g. "the X", "de X"). +0.32pp @20, with significant wins for minority/confusable languages.
- Single model — the separate short-text model is retired; the unified 32K-bucket model outperforms it at all text lengths while covering all 204 languages.
20260320 Evaluation: FLORES-200 Dev Set
Results on the FLORES-200 dev set (204 test languages, 997 sentences each). All scores are macro-averaged F1. Raw eval output: flores-eval-20260320.txt.
Coverage-adjusted accuracy (each detector on its own supported languages)
| Length | CharSoup mF1 | OpenNLP mF1 | Lingua mF1 | Optimaize mF1 |
|---|---|---|---|---|
| @20 chars | 82.51% | 74.87% | 76.35% | 84.87% |
| @50 chars | 94.44% | 86.09% | 90.99% | 94.44% |
| @100 chars | 96.98% | 90.25% | 95.43% | 96.51% |
| @200 chars | 97.45% | 91.11% | 96.23% | 96.75% |
| full text | 97.46% | 91.12% | 96.25% | 96.76% |
| Languages | 204 | ~105 | 75 | 63 |
Coverage-adjusted scores reflect each detector's performance on the languages it actually supports. CharSoup covers 204 languages; Optimaize covers only 63, Lingua 75, OpenNLP ~105.
Head-to-head: CharSoup vs OpenNLP (105 shared languages)
| Length | CharSoup mF1 | OpenNLP mF1 | CharSoup ms | OpenNLP ms |
|---|---|---|---|---|
| @20 chars | 84.23% | 76.69% | 504 | 228 |
| @50 chars | 95.73% | 87.27% | 431 | 333 |
| @100 chars | 98.09% | 91.05% | 523 | 620 |
| @200 chars | 98.51% | 91.77% | 583 | 969 |
| full text | 98.51% | 91.78% | 604 | 959 |
Head-to-head: CharSoup vs Lingua (71 shared languages)
| Length | CharSoup mF1 | Lingua mF1 | CharSoup ms | Lingua ms |
|---|---|---|---|---|
| @20 chars | 85.28% | 77.78% | 303 | 2,960 |
| @50 chars | 96.37% | 92.24% | 296 | 5,992 |
| @100 chars | 98.51% | 96.58% | 359 | 9,756 |
| @200 chars | 98.82% | 97.37% | 437 | 12,615 |
| full text | 98.83% | 97.38% | 402 | 12,847 |
Head-to-head: CharSoup vs Optimaize (63 shared languages)
| Length | CharSoup mF1 | Optimaize mF1 | CharSoup ms | Optimaize ms |
|---|---|---|---|---|
| @20 chars | 86.52% | 86.30% | 215 | 76 |
| @50 chars | 96.92% | 95.40% | 250 | 367 |
| @100 chars | 98.84% | 97.23% | 320 | 360 |
| @200 chars | 99.13% | 97.34% | 346 | 374 |
| full text | 99.14% | 97.34% | 353 | 385 |
CharSoup matches or leads Optimaize across all lengths on the 63-language overlap, while covering 3× more languages total.
20260320 Resource Usage
| Metric | CharSoup | OpenNLP | Lingua (low-accuracy mode) | Optimaize |
|---|---|---|---|---|
| Languages supported | 204 | ~105 | 75 | 63 |
| Model heap | ~8.1 MB | ~79.2 MB | ~0.1 MB | ~94.5 MB |
| Model file (disk) | 6.4 MB | ~22 MB | — | — |
| Throughput (@20) | ~139K sent/s | ~133K sent/s | ~10K sent/s | ~248K sent/s |
| Throughput (full) | ~116K sent/s | — | — | — |
| Runtime dependencies | None | OpenNLP + model | Lingua jar + Kotlin | Optimaize jar |
All detectors evaluated with 12 threads on the FLORES-200 dev set (203,381 sentences). Throughput is wall-clock sentences per second.
Historical Build: v7 (Wikipedia + MADLAD)
v7 used a single general model (langdetect.bin): 203 languages, trained on
Wikipedia dumps (primary) supplemented by MADLAD for under-resourced languages.
It used 16,384 hash buckets and ScriptAwareFeatureExtractor, at 3.2 MB on disk.
A separate short-text model was also trained for v7 but is retired as of 20260320; the unified 32K-bucket model supersedes it.
v7 General Model Training Configuration
- Corpus: Wikipedia dumps (primary) + MADLAD supplement for 17 languages
- Languages: 203
- Hash buckets: 16,384
- Feature extractor: `ScriptAwareFeatureExtractor` — character bigrams, trigrams, 3-char suffixes, 3-char prefixes, word unigrams, CJK character unigrams
- Target epoch total: 5,000,000 sentences per epoch
- Optimizer: AdamW (lr=0.001, mini-batch=64) × 2 epochs, then Hogwild! SGD (lr=0.01→0.001) × 6 max epochs with within-epoch early stopping
- Two-pass training: Pass 1 on full pool → mislabel filter (removed 1.5%) → Pass 2
- Model size: 3.2 MB (INT8 quantized)
- JVM: `-Xmx8g`
References
For the academic references behind the techniques used in training and inference, see the References section of the main language detection page.