Charset Detection Pipeline
Tika 4.x introduces a new charset detection pipeline built from scratch. It combines structural byte-pattern rules, a purpose-trained ML classifier covering 35 charset classes, and language-signal arbitration into an integrated "collect all candidates, then arbitrate" architecture.
Pipeline overview
The default EncodingDetector chain registered via SPI runs five detectors in
order. All results are collected into an EncodingDetectorContext stored in
the ParseContext. When a MetaEncodingDetector (currently only
CharSoupEncodingDetector) is present in the chain,
CompositeEncodingDetector switches into collect-all-then-arbitrate mode:
every detector runs regardless of what the others returned, and the
MetaEncodingDetector makes the final call using the full picture.
| # | Detector | Module | Role |
|---|---|---|---|
| 1 | `BOMDetector` | | Definitive — reads the first 4 bytes and returns a DECLARATIVE result when a UTF-8, UTF-16, or UTF-32 BOM is present. Returns empty otherwise. |
| 2 | `MetadataCharsetDetector` | tika-core | Reads declarative charset hints (`Content-Type` charset parameter, `Content-Encoding`) from the `Metadata` object. |
| 3 | `MojibusterEncodingDetector` | tika-encoding-detector-mojibuster | Structural pre-filters (UTF-32, UTF-16, ISO-2022, UTF-8) + ML byte-ngram classifier (35 classes including UTF-16, CJK, EBCDIC, and single-byte encodings). Returns 1 candidate for long probes, up to 3 for short probes (≤ 50 bytes). |
| 4 | `StandardHtmlEncodingDetector` | | Scans HTML `<meta charset>` declarations (up to 65 536 bytes by default). |
| 5 | `CharSoupEncodingDetector` | | MetaEncodingDetector — arbitrates across all collected candidates using language-signal scoring. |
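The collect-all-then-arbitrate flow can be sketched in a few lines. The `Detector` and `MetaDetector` interfaces below are simplified stand-ins for illustration, not Tika's real `EncodingDetector` API, and the demo chain and "first candidate wins" arbiter are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class CollectThenArbitrate {

    /** A detector maps probe bytes to zero or more candidate charsets. */
    interface Detector {
        List<Charset> detect(byte[] probe);
    }

    /** The meta-detector sees every candidate and makes the final call. */
    interface MetaDetector {
        Charset arbitrate(byte[] probe, List<Charset> allCandidates);
    }

    static Charset detect(byte[] probe, List<Detector> chain, MetaDetector meta) {
        List<Charset> candidates = new ArrayList<>();
        // Collect-all mode: every detector runs regardless of earlier results,
        // and the meta-detector decides using the full picture.
        for (Detector d : chain) {
            candidates.addAll(d.detect(probe));
        }
        return meta.arbitrate(probe, candidates);
    }

    /** Tiny demo chain: a UTF-8 BOM detector plus a windows-1252 fallback,
        arbitrated by "first candidate wins" (real arbitration is language-based). */
    static String detectWithDemoChain(byte[] probe) {
        Detector bom = p -> (p.length >= 3 && (p[0] & 0xFF) == 0xEF
                && (p[1] & 0xFF) == 0xBB && (p[2] & 0xFF) == 0xBF)
                ? List.of(StandardCharsets.UTF_8) : List.of();
        Detector fallback = p -> List.of(Charset.forName("windows-1252"));
        MetaDetector firstWins = (p, cs) -> cs.get(0);
        return detect(probe, List.of(bom, fallback), firstWins).name();
    }
}
```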
EncodingResult and ResultType
Detectors return `List<EncodingResult>` instead of a single `Charset`.
Each `EncodingResult` carries:

- `charset` — the detected `java.nio.charset.Charset`
- `confidence` — a 0.0–1.0 float
- `label` — the detector's original internal label (e.g. `IBM424-ltr`, `UTF-16-BE`), which may be finer-grained than the Java charset name
- `resultType` — one of:
| ResultType | Meaning |
|---|---|
| `DECLARATIVE` | Explicit charset declaration: BOM, HTML `<meta charset>`, or a `Content-Type` charset parameter. |
| `STRUCTURAL` | Derived from byte-level structure (UTF-8 validity, EBCDIC space distribution). More reliable than statistics but less authoritative than an explicit declaration. |
| `STATISTICAL` | ML model output. Plausible but not certain; subject to arbitration by `CharSoupEncodingDetector`. |
Default encoding for pure ASCII
When no bytes ≥ 0x80 are present, `MojibusterEncodingDetector` returns
`windows-1252` (not US-ASCII or UTF-8). Rationale:

- US-ASCII is a strict subset of windows-1252, UTF-8, ISO-8859-1, and every other single-byte encoding — calling it US-ASCII would be the most specific correct answer, but it is also unactionable: a decoder for ASCII cannot safely decode a file that turns out to have high bytes beyond the probe depth.
- `windows-1252` is the HTML5 / WHATWG default for the Western web. It is the least surprising fallback for text that looks ASCII but may have typographic characters (smart quotes, em-dash) beyond the probe window.
- A CRLF heuristic further refines the default: CR+LF line endings are a strong Windows signal, reinforcing `windows-1252` over `UTF-8`.
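A minimal sketch of this fallback logic. The pure-ASCII and CRLF checks follow the description above; the confidence values in `fallbackConfidence` are invented for illustration and are not the detector's real numbers.

```java
public class AsciiFallback {
    /** True when no byte has the high bit set (pure 7-bit ASCII probe). */
    static boolean isPureAscii(byte[] probe) {
        for (byte b : probe) {
            if ((b & 0x80) != 0) {
                return false;
            }
        }
        return true;
    }

    /** True when the probe contains at least one CR+LF pair, a strong Windows signal. */
    static boolean hasCrlf(byte[] probe) {
        for (int i = 0; i + 1 < probe.length; i++) {
            if (probe[i] == '\r' && probe[i + 1] == '\n') {
                return true;
            }
        }
        return false;
    }

    /** Fallback for an all-ASCII probe: windows-1252, never US-ASCII or UTF-8. */
    static String asciiFallback(byte[] probe) {
        return isPureAscii(probe) ? "windows-1252" : null;
    }

    /** The CRLF heuristic reinforces the choice; 0.75 / 0.60 are
        hypothetical confidence values for the sketch only. */
    static double fallbackConfidence(byte[] probe) {
        return hasCrlf(probe) ? 0.75 : 0.60;
    }
}
```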
MojibusterEncodingDetector
A multinomial logistic regression classifier trained on byte n-gram features
extracted from the MADLAD-400 corpus and from Wikipedia dumps (including
Traditional Chinese and Cantonese), with 100 MB of byte-balanced training data
per charset.
The model has 35 charset classes — UTF-16 LE/BE, all EBCDIC variants, and all
major single-byte and multi-byte CJK encodings. UTF-32 is handled by the
structural WideUnicodeDetector (see below), which achieves 100% accuracy.
Feature extraction
Only bytes ≥ 0x80 contribute features, so HTML/XML markup (pure ASCII) is ignored without stripping.
- Unigrams — each high byte hashed individually; encodes the byte-frequency distributions that separate single-byte encodings.
- Bigrams — consecutive pairs `(b[i], b[i+1])` where `b[i] ≥ 0x80`; captures multi-byte character structure (Shift_JIS, EUC-*, Big5, GB18030).
- Stride-2 bigrams — pairs sampled at even positions `(b[2i], b[2i+1])`; gives the model structural visibility into UTF-16 null-column patterns.
Features are FNV-1a hashed into 16 384 buckets with int8 quantisation.
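A sketch of the hashed feature extraction. FNV-1a and the 16 384-bucket size are from the text; the exact anchoring rules and the trick of tagging each n-gram kind with a leading constant before hashing are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class NgramFeatures {
    static final int BUCKETS = 16_384;

    /** 32-bit FNV-1a over the given byte values, folded into a bucket index. */
    static int bucket(int... bytes) {
        int h = 0x811C9DC5;                 // FNV offset basis
        for (int b : bytes) {
            h ^= (b & 0xFF);
            h *= 0x01000193;                // FNV prime
        }
        return (h >>> 1) % BUCKETS;         // drop the sign bit before modulo
    }

    /** Unigram, bigram, and stride-2 bigram bucket indices for one probe.
        Only n-grams anchored on a high byte (>= 0x80) contribute, so
        pure-ASCII markup produces no features at all. */
    static List<Integer> extract(byte[] probe) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < probe.length; i++) {
            int b = probe[i] & 0xFF;
            if (b < 0x80) {
                continue;
            }
            out.add(bucket(b));                                   // unigram
            if (i + 1 < probe.length) {
                out.add(bucket(1, b, probe[i + 1] & 0xFF));       // bigram
            }
        }
        for (int i = 0; i + 1 < probe.length; i += 2) {           // stride-2 pairs
            int a = probe[i] & 0xFF;
            int c = probe[i + 1] & 0xFF;
            if (a >= 0x80 || c >= 0x80) {
                out.add(bucket(2, a, c));
            }
        }
        return out;
    }
}
```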
Structural rules (run before the model)
Structural rules run first and produce STRUCTURAL results with 1.0
confidence. When a structural rule fires, the model is not consulted.
| Rule | Trigger |
|---|---|
| UTF-32 validity | Every 4-byte group is decoded as a 32-bit integer in both BE and LE order and checked for Unicode validity (0x000000–0x10FFFF, excluding surrogates). Only 0.004% of the 32-bit space is valid, so non-UTF-32 data almost always produces an out-of-range value within the first 8 bytes. Inspired by ICU4J's `CharsetRecog_UTF_32`. |
| UTF-16 null-column | Latin/ASCII BMP content in UTF-16 produces alternating null bytes. One byte column (even or odd positions at stride-2) has a high null rate. No legacy encoding produces alternating nulls, so this is safe. |
| UTF-16 low-block-prefix | Scripts whose UTF-16 high byte is below 0x20 (Cyrillic 0x04, Arabic 0x06, Hebrew 0x05, Devanagari 0x09, Thai 0x0E, etc.): the constrained column has all non-null values below 0x20, the other column is more diverse. Safe because Big5/Shift-JIS/GBK lead bytes are always ≥ 0x81. |
| UTF-16 invalidity tracking | Even when no positive UTF-16 detection fires, the detector tracks whether the probe contains structurally invalid UTF-16 surrogate sequences (unpaired high/low surrogates). These invalidity flags are passed downstream to suppress UTF-16 model predictions for probes that cannot be valid UTF-16. |
| ISO-2022 escape sequences | ESC designation sequences → ISO-2022-JP / KR / CN |
| Pure ASCII | No bytes ≥ 0x80 → `windows-1252` (see above) |
| UTF-8 validity | Structurally valid UTF-8 multi-byte sequences → `UTF-8` |
| IBM424 check | EBCDIC space (0x40) dominance + Hebrew-range bytes → IBM424-ltr/rtl |
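The UTF-32 validity rule is simple enough to sketch in full. The codepoint-range and surrogate checks below follow the description above; the handling of a trailing partial group and the minimum-length requirement are assumptions.

```java
public class Utf32Check {
    /** True if every aligned 4-byte group decodes (in the given byte order) to a
        valid Unicode scalar value: 0x0–0x10FFFF, excluding the surrogate range.
        Non-UTF-32 data almost always fails within the first 8 bytes. */
    static boolean looksLikeUtf32(byte[] probe, boolean bigEndian) {
        if (probe.length < 4) {
            return false;
        }
        for (int i = 0; i + 3 < probe.length; i += 4) {
            int cp = bigEndian
                    ? ((probe[i] & 0xFF) << 24) | ((probe[i + 1] & 0xFF) << 16)
                      | ((probe[i + 2] & 0xFF) << 8) | (probe[i + 3] & 0xFF)
                    : ((probe[i + 3] & 0xFF) << 24) | ((probe[i + 2] & 0xFF) << 16)
                      | ((probe[i + 1] & 0xFF) << 8) | (probe[i] & 0xFF);
            // cp < 0 means the 32-bit value overflowed into the sign bit,
            // i.e. it was far above 0x10FFFF.
            if (cp < 0 || cp > 0x10FFFF) {
                return false;
            }
            if (cp >= 0xD800 && cp <= 0xDFFF) {   // unpaired surrogate range
                return false;
            }
        }
        return true;
    }
}
```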
CJK UTF-16 (CJK Unified block prefix 0x4E–0x9F, Hangul 0xAC–0xD7) is handled by the statistical model rather than structurally — stride-2 bigram features give the model direct visibility into UTF-16 code-unit structure, and the model learns to distinguish CJK UTF-16 from legacy CJK encodings whose lead bytes overlap with these block prefixes.
Candidate selection and top-N limiting
After the model scores all 35 classes, candidate selection works in two steps:
1. Logit-gap window (`selectByLogitGap`) — include all candidates whose logit is within `LOGIT_GAP` (5.0) of the top logit.
2. Short-probe floor — for probes shorter than 50 bytes, if the gap window returns fewer than `MIN_CANDIDATES` (3) results, `selectAtLeast(3)` extends the window to the top 3 candidates by raw logit rank.
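The two selection steps can be sketched as one function. How the gap window interacts with the long-probe top-1 cap, and the exact boundary at 50 bytes, are my reading of the text rather than a transcription of the real code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CandidateSelection {
    static final double LOGIT_GAP = 5.0;
    static final int MIN_CANDIDATES = 3;
    static final int SHORT_PROBE = 50;

    /** Returns indices into logits[], best first. logits[i] is the raw model
        logit for class i. */
    static List<Integer> select(double[] logits, int probeLength) {
        List<Integer> byRank = new ArrayList<>();
        for (int i = 0; i < logits.length; i++) {
            byRank.add(i);
        }
        byRank.sort(Comparator.comparingDouble((Integer i) -> -logits[i]));

        double top = logits[byRank.get(0)];
        List<Integer> selected = new ArrayList<>();
        for (int i : byRank) {
            if (top - logits[i] <= LOGIT_GAP) {     // logit-gap window
                selected.add(i);
            }
        }
        if (probeLength > SHORT_PROBE) {
            return List.of(byRank.get(0));          // long probe: top 1 only
        }
        if (selected.size() < MIN_CANDIDATES) {     // short-probe floor
            return new ArrayList<>(byRank.subList(0,
                    Math.min(MIN_CANDIDATES, byRank.size())));
        }
        return selected;
    }
}
```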
The number of candidates returned to CharSoup is probe-length-dependent:
- Short probes (≤ 50 bytes): return the top 3 candidates. Short byte sequences (e.g. ZIP entry filenames) are ambiguous enough that the model's top pick may be wrong, but the correct answer is usually in the top 3. Passing too many candidates to CharSoup is dangerous — see *Case study: why top-N limiting and the generative model matter* below.
- Long probes (> 50 bytes): return only the top 1 candidate. At longer probe lengths the model is confident enough that additional candidates are just noise.
These results are added to EncodingDetectorContext alongside results from
BOMDetector, MetadataCharsetDetector, and StandardHtmlEncodingDetector.
CharSoup arbitrates across all of them.
BOM bytes are stripped from the probe before feature extraction so that the BOM itself does not bias the byte-ngram features.
Post-model corrections
ISO-8859-X → Windows-12XX upgrade
C1 bytes (0x80–0x9F) are control characters in every ISO-8859-X standard but
printable in every Windows-12XX encoding. A single C1 byte is definitive proof
the content is not ISO-8859-X. upgradeIsoToWindows() replaces ISO-8859-X
results with their Windows-12XX equivalent, preserving the model’s confidence.
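The C1-byte test and the upgrade mapping can be sketched as follows. The ISO-to-Windows pairs in the map are standard regional correspondences listed here for illustration; the real `upgradeIsoToWindows()` mapping may differ in coverage.

```java
import java.util.Map;

public class IsoToWindowsUpgrade {
    // Illustrative ISO-8859-X -> windows-12XX pairs (same script repertoire).
    static final Map<String, String> UPGRADES = Map.of(
            "ISO-8859-1", "windows-1252",
            "ISO-8859-2", "windows-1250",
            "ISO-8859-5", "windows-1251",
            "ISO-8859-7", "windows-1253",
            "ISO-8859-8", "windows-1255",
            "ISO-8859-9", "windows-1254");

    /** True if the probe contains a C1 byte (0x80-0x9F): a control character in
        every ISO-8859-X, but printable in every windows-12XX encoding. */
    static boolean hasC1Byte(byte[] probe) {
        for (byte b : probe) {
            int v = b & 0xFF;
            if (v >= 0x80 && v <= 0x9F) {
                return true;
            }
        }
        return false;
    }

    /** Replace an ISO-8859-X result with its Windows equivalent when a C1 byte
        proves the content cannot be ISO-8859-X. Confidence is untouched. */
    static String maybeUpgrade(String charset, byte[] probe) {
        return hasC1Byte(probe) ? UPGRADES.getOrDefault(charset, charset) : charset;
    }
}
```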
GB18030 4-byte upgrade
GB18030-specific 4-byte sequences have digit trail bytes (0x30–0x39)
impossible in GBK/GB2312. A single matching 4-tuple upgrades a GBK/GB2312
model result to GB18030.
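A sketch of the 4-byte detection. The byte ranges (lead 0x81–0xFE, digit trail 0x30–0x39) are the standard GB18030 4-byte structure; the naive sliding scan below is an assumption — a real implementation would walk the probe with proper GBK/GB18030 segmentation.

```java
public class Gb18030Upgrade {
    /** True if the probe contains a GB18030-specific 4-byte sequence:
        0x81-0xFE, 0x30-0x39, 0x81-0xFE, 0x30-0x39. Digit trail bytes
        are impossible inside GBK/GB2312 multi-byte sequences. */
    static boolean hasGb18030FourByte(byte[] p) {
        for (int i = 0; i + 3 < p.length; i++) {
            int a = p[i] & 0xFF;
            int b = p[i + 1] & 0xFF;
            int c = p[i + 2] & 0xFF;
            int d = p[i + 3] & 0xFF;
            if (a >= 0x81 && a <= 0xFE && b >= 0x30 && b <= 0x39
                    && c >= 0x81 && c <= 0xFE && d >= 0x30 && d <= 0x39) {
                return true;
            }
        }
        return false;
    }

    /** Upgrade a GBK/GB2312 model result to GB18030 on a matching 4-tuple. */
    static String maybeUpgrade(String charset, byte[] probe) {
        boolean gbFamily = charset.equals("GBK") || charset.equals("GB2312");
        return (gbFamily && hasGb18030FourByte(probe)) ? "GB18030" : charset;
    }
}
```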
CJK grammar filter
After the model nominates CJK candidates, CjkEncodingRules validates each
against the encoding’s formal byte grammar (Shift_JIS, EUC-JP, EUC-KR, Big5,
GB18030). The filter is conservative:
- Score 0 — grammar rejects the candidate (the probe contains byte sequences that are structurally impossible in this encoding). The candidate is dropped from the result list.
- Score > 0 — grammar passes. The candidate keeps its model sigmoid confidence unchanged so all candidates remain on the same scale for `CharSoupEncodingDetector` to compare.
The grammar filter acts as a gatekeeper, not a scorer — candidates keep their
original model confidence so all candidates remain on the same sigmoid scale
for downstream arbitration by CharSoupEncodingDetector.
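`CjkEncodingRules` itself is not shown in this document, but the flavour of a per-encoding byte grammar can be sketched for Shift_JIS. The byte ranges below are the standard Shift_JIS grammar; tolerating a truncated pair at the probe edge is an assumption for the sketch.

```java
public class SjisGrammar {
    /** True if the probe parses under the Shift_JIS byte grammar:
        single bytes 0x00-0x7F or 0xA1-0xDF (halfwidth kana), and
        lead bytes 0x81-0x9F / 0xE0-0xFC followed by a trail byte
        0x40-0xFC excluding 0x7F. A reject (grammar score 0) means at
        least one sequence is structurally impossible in Shift_JIS. */
    static boolean parses(byte[] p) {
        int i = 0;
        while (i < p.length) {
            int b = p[i] & 0xFF;
            if (b <= 0x7F || (b >= 0xA1 && b <= 0xDF)) {   // single-byte range
                i++;
                continue;
            }
            if ((b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC)) {
                if (i + 1 >= p.length) {
                    return true;   // pair truncated at the probe edge: tolerate
                }
                int t = p[i + 1] & 0xFF;
                if (t < 0x40 || t > 0xFC || t == 0x7F) {
                    return false;  // invalid trail byte
                }
                i += 2;
                continue;
            }
            return false;          // 0x80, 0xA0, 0xFD-0xFF never start a character
        }
        return true;
    }
}
```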
EBCDIC variants as direct model labels
All IBM EBCDIC variants are direct labels in the 35-class model — there is no separate EBCDIC routing step or sub-model. The model handles:
- `IBM500` / `IBM1047` — Latin international EBCDIC (a confusable pair; they differ in 9 of 256 byte positions)
- `IBM424-ltr` / `IBM424-rtl` — EBCDIC Hebrew (ltr/rtl are the same code page; `checkIbm424` fires first for clear cases)
- `IBM420-ltr` / `IBM420-rtl` — EBCDIC Arabic (training data requires the `cp420` Python codec)
- `IBM850` / `IBM852` / `IBM855` / `IBM866` — DOS/OEM code pages (not true EBCDIC; byte layouts follow the ASCII/Latin convention)
IBM855 and IBM866 are DOS Cyrillic code pages, not EBCDIC. Their
byte layouts are entirely different from EBCDIC and they are classified directly
alongside windows-1251, KOI8-R, and x-mac-cyrillic.
CharSoupEncodingDetector — language-signal arbitration
CharSoupEncodingDetector is a MetaEncodingDetector. Its presence in the
chain switches CompositeEncodingDetector into collect-all mode. After all
other detectors run, CharSoup receives the full EncodingDetectorContext and
arbitrates.
Before any charset decoding, CharSoup strips leading BOM bytes from the raw probe. This ensures every candidate charset decodes the same content bytes, preventing the BOM itself from skewing language scores.
Arbitration rules (in priority order)
1. Unanimous — if all detectors agree (or only one result exists), return it directly without language scoring.
2. Language scoring — for each unique candidate charset, the BOM-stripped bytes are decoded using that charset and fed to `CharSoupLanguageDetector` (a character-bigram language model covering ~165 languages). Candidates whose decoded text exceeds a junk-character threshold (`MAX_JUNK_RATIO` = 0.10) are discarded before scoring. The charset whose decoded text produces the highest maximum logit across all languages wins, provided that logit is positive (sigmoid > 0.5).
3. DECLARATIVE preference — after language scoring, if the winner is not a DECLARATIVE result but a DECLARATIVE candidate exists, the DECLARATIVE result is preferred when both of the following hold:
   - Its decoded text has a junk ratio ≤ the language winner's (it decodes at least as cleanly).
   - Its decoded text has a positive language signal (max logit > 0).

   This handles the case where a valid BOM (e.g. `UTF-16BE`) is overridden by a wrong-endian decoding that happens to look like CJK text, which the language model scores more confidently than short Latin text. The junk guard prevents false positives from truly lying BOMs or wrong `<meta charset>` tags.
4. Inconclusive — if no candidate's logit is positive (all decodings are too ambiguous for the language model to distinguish), CharSoup falls back to the DECLARATIVE result if one exists and its decoding is at least as clean as the statistical winner's; otherwise it returns the first candidate from the highest-confidence statistical detector.
Case study: why top-N limiting and the generative model matter
Consider the GBK-encoded ZIP entry name 审计压缩包文件检索测试/ (23 bytes).
The byte-ngram model correctly ranks GB18030 at #1 with 0.99 confidence.
But without top-N limiting, the model also returns lower-ranked candidates
including windows-874 (Thai) at #8 with 0.15 confidence:
```
#1 GB18030      conf=0.99 → "审计压缩包文件检索测试/"     (correct Chinese)
#2 EUC-JP       conf=0.36 → "蕪柴儿抹淫猟周殊沫霞編/"    (Japanese gibberish)
#3 Big5-HKSCS   conf=0.31 → "机數揤坫婦恅璃潰坰聆彸/"    (CJK gibberish)
...
#8 windows-874  conf=0.15 → "ษ๓ผฦันห๕ฐ�ฮฤผ�ผ์ห๗ฒโสิ/"  (Thai-looking text)
```
The GBK bytes happen to land in Thai character ranges when decoded as
windows-874. CharSoup’s discriminative language model scores Thai text
confidently (it recognises Thai character patterns), while the Chinese text
decoded from GB18030 may score lower on a 23-byte probe. Without top-N
limiting, CharSoup overrides the model’s correct high-confidence #1 pick
with the wrong #8 pick — returning x-windows-874 instead of GB18030.
The fix has two parts:
1. Top-N limiting — on short probes (≤ 50 bytes), only the top 3 candidates are passed to CharSoup. `windows-874` at #8 never enters the candidate set. CharSoup arbitrates among {GB18030, EUC-JP, Big5-HKSCS} — all CJK encodings where the generative model can meaningfully compare text quality.
2. Generative language model — even within the top 3, the discriminative language model can be fooled by coincidental character patterns on short probes. The generative model provides a second check: it scores how language-like each decoded text is for its detected language, rather than just which language it looks like. Genuine Chinese text ("审计压缩包文件检索测试") scores well under the Chinese generative model; CJK gibberish from wrong-charset decoding scores poorly.
This example illustrates a fundamental limitation of discriminative language models for charset arbitration: their logits are not comparable across scripts. A discriminative model answers "which language is this?" — but when the same bytes decode as Chinese under one charset and Thai under another, the model produces two confident answers in two unrelated scripts. Comparing the Thai logit to the Chinese logit is meaningless; the model was never trained to rank "real Chinese" against "fake Thai" on an absolute scale. This cross-script incomparability is not a rare edge case — it will happen whenever wrong-charset decoding maps bytes into a different script’s character range, which is common for CJK encodings whose byte values overlap with Thai, Arabic, Cyrillic, and other single-byte encodings.
The generative model solves this by asking a different question: not "which language is this?" but "how language-like is this text for language L?" Genuine Chinese text ("审计压缩包文件检索测试") scores well under the Chinese generative model because its character n-gram statistics match real Chinese. Thai-looking gibberish from a wrong-charset decode scores poorly under the Thai generative model because — despite using Thai characters — it doesn’t follow Thai character co-occurrence patterns. This quality signal is comparable across scripts, making it safe to use for cross-script arbitration.
junkRatio
The junk ratio filters out clearly wrong-charset decodings before language scoring runs. It currently counts:

- U+FFFD replacement characters (wrong-charset multi-byte decode)
- U+FFFE (the "wrong-endian BOM" / Unicode noncharacter — produced when a UTF-16 BOM is decoded with the wrong byte order)
- C0/C1 control characters (0x00–0x08, 0x0E–0x1F, 0x80–0x9F), excluding TAB, LF, VT, FF, and CR, which appear in source code and structured documents
Ordinary whitespace is not junk. Non-ASCII non-alphabetic punctuation
(e.g. • U+2022, ¶ U+00B6) is also not junk — while these can indicate a
wrong-charset single-byte decoding of multi-byte lead bytes, they also appear
legitimately in windows-125x documents (bullet lists, legal symbols). The
language model already handles this discrimination correctly.
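The junk-ratio computation is small enough to sketch directly from the rules above; the gate against `MAX_JUNK_RATIO` matches the 0.10 threshold used in arbitration.

```java
public class JunkRatio {
    static final double MAX_JUNK_RATIO = 0.10;

    /** Fraction of characters that indicate a wrong-charset decode:
        U+FFFD, U+FFFE, and C0/C1 controls other than TAB/LF/VT/FF/CR
        (which fall in the 0x09-0x0D gap the ranges below leave open). */
    static double junkRatio(String decoded) {
        if (decoded.isEmpty()) {
            return 0.0;
        }
        int junk = 0;
        for (int i = 0; i < decoded.length(); i++) {
            char c = decoded.charAt(i);
            boolean control = (c <= 0x08)
                    || (c >= 0x0E && c <= 0x1F)
                    || (c >= 0x80 && c <= 0x9F);
            if (c == '\uFFFD' || c == '\uFFFE' || control) {
                junk++;
            }
        }
        return (double) junk / decoded.length();
    }

    /** Candidates above the threshold are discarded before language scoring. */
    static boolean passesJunkGate(String decoded) {
        return junkRatio(decoded) <= MAX_JUNK_RATIO;
    }
}
```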
Why "positive logit" is the only threshold
Earlier versions used two thresholds: a minimum absolute confidence
(sigmoid ≥ 0.88) and a minimum relative margin between best and runner-up.
Both were removed. Rationale:
-
If the language model gives a positive logit for any language in the decoded text, it has a real signal. The best-logit candidate is better than all the alternatives by definition — requiring an additional margin just delays the correct answer.
-
The margin threshold was sensitive to which other charsets happened to be in the candidate set. A third candidate with a strong (but wrong) language signal could narrow the margin below threshold and force a fallback to the model’s top statistical pick, which might be wrong.
MetadataCharsetDetector
A lightweight detector in tika-core that reads declarative charset hints from
the Metadata object before any byte analysis:
- `Content-Type` charset parameter (e.g. `text/html; charset=windows-1251`)
- `Content-Encoding` (used by `RFC822Parser` and similar MIME-aware parsers)
Applies WHATWG label normalization: ISO-8859-1 and US-ASCII are mapped to
windows-1252 because browsers (and the HTML5 spec) treat them as aliases for
windows-1252 in practice.
Returns a DECLARATIVE result, so CharSoupEncodingDetector will treat it with
preference over statistical candidates.
BOMDetector
Reads the first 4 bytes and detects:
| Byte sequence | Encoding |
|---|---|
| `EF BB BF` | UTF-8 |
| `FF FE 00 00` | UTF-32-LE |
| `00 00 FE FF` | UTF-32-BE |
| `FF FE` | UTF-16-LE |
| `FE FF` | UTF-16-BE |
Returns a DECLARATIVE result. StandardHtmlEncodingDetector skips BOM
detection by default (skipBOM=true) so that BOMDetector is the sole source
of BOM evidence. This separation allows CharSoupEncodingDetector to
arbitrate when a BOM and a <meta charset> tag disagree.
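A minimal BOM sniffer consistent with the table above. `sniffBom` is a hypothetical helper, and the labels it returns follow the table rather than Java's canonical charset names. Note the ordering constraint: the UTF-32 patterns must be tested before UTF-16, because `FF FE 00 00` begins with the UTF-16-LE BOM `FF FE`.

```java
public class BomSniffer {
    /** Returns the encoding label implied by a leading BOM, or null if none. */
    static String sniffBom(byte[] b) {
        if (len(b, 3) && u(b, 0) == 0xEF && u(b, 1) == 0xBB && u(b, 2) == 0xBF) {
            return "UTF-8";
        }
        // UTF-32 before UTF-16: FF FE 00 00 would otherwise match UTF-16-LE.
        if (len(b, 4) && u(b, 0) == 0xFF && u(b, 1) == 0xFE
                && u(b, 2) == 0x00 && u(b, 3) == 0x00) {
            return "UTF-32-LE";
        }
        if (len(b, 4) && u(b, 0) == 0x00 && u(b, 1) == 0x00
                && u(b, 2) == 0xFE && u(b, 3) == 0xFF) {
            return "UTF-32-BE";
        }
        if (len(b, 2) && u(b, 0) == 0xFF && u(b, 1) == 0xFE) {
            return "UTF-16-LE";
        }
        if (len(b, 2) && u(b, 0) == 0xFE && u(b, 1) == 0xFF) {
            return "UTF-16-BE";
        }
        return null;
    }

    private static boolean len(byte[] b, int n) { return b.length >= n; }
    private static int u(byte[] b, int i) { return b[i] & 0xFF; }
}
```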
Performance and accuracy
Numbers are from the held-out MADLAD-400 + Wikipedia devtest set (model v6): 1 469 647 samples across 41 charsets, including structural-only charsets (US-ASCII, ISO-2022-JP/KR/CN, UTF-32-BE/LE).
"All" = ML model + structural pre-filters + all post-processing rules — this
is the production configuration. UTF-32 is detected by a structural
pre-filter (validity-only codepoint check inspired by ICU4J’s
CharsetRecog_UTF_32); UTF-16 uses a combination of structural phases
(null-column, low-block-prefix) and the ML model for CJK cases. The ML model
has 35 classes (UTF-32 labels are excluded from training since they are handled
structurally).
Raw eval output: eval-v6-no-utf32.txt in this directory.
Overall accuracy
| Metric | Mojibuster (All) | ICU4J | juniversalchardet |
|---|---|---|---|
| Strict accuracy (full probe) | 95.0% | 45.5% | 33.5% |
| Soft accuracy (full probe) | 97.3% | 67.0% | 42.0% |
| Decode-match (full probe) | 99.4% | 54.8% | 41.2% |
| Alpha-match (full probe) | 99.8% | 55.3% | 43.5% |
| Latency (full probe) | ~13 µs | ~147 µs | ~16 µs |
Strict accuracy = exact charset name match.
Soft accuracy = exact or confusable-group match (e.g. predicting IBM500 for
an IBM1047 file counts as soft-correct since they share 247 of 256 byte
mappings).
Decode-match = the predicted charset decodes the probe bytes to the same
string as the true charset (the prediction is functionally correct even if the
label differs).
Alpha-match = same as decode-match but ignoring non-alphanumeric characters.
Accuracy by probe length
| Probe length | Strict% | Soft% | Top-3% | Decode% | Alpha% |
|---|---|---|---|---|---|
| 8 bytes | 59.1 | 62.6 | 70.2 | 83.2 | 83.4 |
| 32 bytes | 80.8 | 83.6 | 86.3 | 93.4 | 93.5 |
| 128 bytes | 91.4 | 93.8 | 94.2 | 97.4 | 97.5 |
| full | 95.0 | 97.3 | 97.3 | 99.4 | 99.8 |
Latency by probe length (µs/sample)
| Probe length | Mojibuster | ICU4J | juniversalchardet |
|---|---|---|---|
| 8 bytes | 6 | 5 | 2 |
| 32 bytes | 8 | 14 | 3 |
| 128 bytes | 8 | 40 | 6 |
| full | 13 | 147 | 16 |
Mojibuster latency scales sub-linearly with probe length (feature extraction is O(probe length) but the model forward pass is constant). ICU4J and juniversalchardet use different architectures with different scaling characteristics.
Per-charset highlights (full probe, All detector)
| Charset | Mojibuster | ICU4J | juniversalchardet |
|---|---|---|---|
| UTF-8 | 100.0% | 100.0% | 99.6% |
| UTF-32-BE / UTF-32-LE | 100.0% | 100.0% | 0% |
| UTF-16-BE | 98.8% | 68.6% | 0% |
| UTF-16-LE | 99.4% | 68.8% | 0% |
| Shift_JIS | 100.0% | 100.0% | 99.9% |
| EUC-JP | 99.8% | 99.9% | 99.6% |
| EUC-KR | 99.9% | 100.0% | 100.0% |
| GB18030 | 100.0% | 99.5% | 99.7% |
| Big5-HKSCS | 100.0% | 0% | 0% |
| windows-1252 | 99.7% | 46.5% | 0% |
| x-EUC-TW | 99.9% | 0% | 0% |
| IBM424 / IBM420 | 99.9–100% | 0–99.5% | 0% |
| IBM500 / IBM1047 | 89–99.7% (soft: 99.7–99.8%) | 75–76% | 0% |
Mojibuster covers 35 charset classes — including EBCDIC variants, DOS code pages (IBM850/852/855), x-EUC-TW, Big5-HKSCS, x-mac-cyrillic, and all windows-125x encodings. Each detector targets a different charset repertoire, so the numbers above reflect those differences as much as algorithmic accuracy.
Training data
The model is trained on 100 MB of byte-balanced data per charset (80/10/10 train/devtest/test split), sourced from:
- MADLAD-400 (clean split) — multilingual web text covering 188 languages for Unicode charsets, with targeted language subsets for legacy charsets.
- Wikipedia dumps — Cantonese (yue) and Traditional Chinese (zhwiki) for Big5-HKSCS and x-EUC-TW coverage.
Training data preparation applies charset-specific normalisation to maximise data volume for legacy charsets:
- windows-1256: maps Farsi Yeh → Arabic Yeh and Extended Arabic-Indic digits → Arabic-Indic digits, and strips bidi control characters.
- IBM424: strips Hebrew nikkud (vowel points).
- All legacy charsets: replaces typographic punctuation (curly quotes, em/en-dash, ellipsis) with ASCII equivalents only when the target charset cannot encode the original character. Charsets that can encode these characters (e.g. windows-1252) keep them as discriminative features.
An ambiguity gate drops SBCS samples that encode identically under any rival charset (e.g. pure-Latin text identical under windows-1250 and windows-1252), ensuring the model learns genuinely discriminative byte patterns.
Notes on specific charsets
IBM500 / IBM1047: these share 247 of 256 byte mappings — the 9 positions
that differ are mostly below 0x80 and invisible to high-byte features. For
normal Latin prose they are genuinely indistinguishable. Both are listed in
CharsetConfusables as a soft-confusable group; predicting either counts as a
soft hit, and either charset decodes the other’s content correctly for the vast
majority of text.
windows-1252: achieves 99.7% strict accuracy. The model learns discriminative features in the 0x80–0x9F range (smart quotes, em-dash, Euro sign) which are unique to windows-1252 vs ISO-8859-1. ISO-8859-1 and ISO-8859-15 are treated as confusable peers of windows-1252.
x-EUC-TW: training data comes exclusively from Traditional Chinese Wikipedia. The model achieves 99.9% accuracy.
Configuration
Using ICU4J or juniversalchardet instead
ICU4J and juniversalchardet remain available as optional modules. To use them instead of the default pipeline, configure Tika explicitly via JSON:
```json
{
  "encodingDetectors": [
    { "type": "icu4j-encoding-detector" },
    { "type": "universal-encoding-detector" }
  ]
}
```
Both detectors are available in their respective modules
(tika-encoding-detector-icu4j and tika-encoding-detector-universal).
Changing the HTML meta scan depth
StandardHtmlEncodingDetector reads up to 65 536 bytes by default when
scanning for <meta charset> tags. This can be tuned via tika-config.json
(see TIKA-2485):
```json
{
  "encodingDetectors": [
    { "type": "standard-html-encoding-detector",
      "params": { "markLimit": 131072 } }
  ]
}
```
Model training and evaluation
Data preparation scripts live in tika-ml/tika-ml-chardetect/scripts/ and the
Java tools are in org.apache.tika.ml.chardetect.tools.
Prerequisites
- MADLAD-400 clean data downloaded via `scripts/download_madlad.py` into `~/datasets/madlad/data/<lang>/sentences_madlad.txt`.
- Wikipedia data (optional but recommended for CJK) converted via `scripts/convert_wiki_to_madlad.py` into `~/datasets/madlad/data/<lang>/sentences_wikipedia.txt`.
- A `unicode_langs.txt` file in the MADLAD data directory listing languages for Unicode charset training (generated by `scripts/select_unicode_langs.py`).
Build, train, evaluate
```shell
# 1. Build training data (100 MB per charset, 80/10/10 split)
mvn -pl tika-ml/tika-ml-chardetect exec:java \
    -Dexec.mainClass="org.apache.tika.ml.chardetect.tools.BuildCharsetTrainingData" \
    -Dexec.args="--output-dir ~/datasets/madlad/charset-detect4" \
    -Dexec.classpathScope=compile -DskipTests \
    "-Dexec.vmArgs=-Xmx32g"

# 2. Build the tools jar (requires -Ptrain profile for the shaded jar)
mvn package -pl tika-ml/tika-ml-chardetect -Ptrain -DskipTests \
    -Dforbiddenapis.skip=true -Dcheckstyle.skip=true

# 3. Train (SGD, 3 epochs, 16384 hash buckets, int8 quantised)
#    --exclude UTF-32-BE,UTF-32-LE: UTF-32 is handled structurally
#    --no-tri: trigrams add no accuracy over unigrams+bigrams+stride-2
java -cp tika-ml/tika-ml-chardetect/target/tika-ml-chardetect-*-tools.jar \
    org.apache.tika.ml.chardetect.tools.TrainCharsetModel \
    --data ~/datasets/madlad/charset-detect4/train \
    --output ~/datasets/madlad/charset-detect4/chardetect-v6-no-utf32.bin \
    --exclude UTF-32-BE,UTF-32-LE \
    --no-tri --buckets 16384 --epochs 3 --lr 0.05

# 4. Copy model into production resources and rebuild
cp ~/datasets/madlad/charset-detect4/chardetect-v6-no-utf32.bin \
    tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/resources/\
org/apache/tika/ml/chardetect/chardetect-v6-no-utf32.bin
mvn install -pl tika-ml/tika-ml-core,tika-encoding-detectors/tika-encoding-detector-mojibuster \
    -DskipTests -Dforbiddenapis.skip=true -Dcheckstyle.skip=true
mvn package -pl tika-ml/tika-ml-chardetect -Ptrain -DskipTests \
    -Dforbiddenapis.skip=true -Dcheckstyle.skip=true

# 5. Evaluate on devtest (not test — reserve test for final sign-off)
java -cp tika-ml/tika-ml-chardetect/target/tika-ml-chardetect-*-tools.jar \
    org.apache.tika.ml.chardetect.tools.EvalCharsetDetectors \
    --data ~/datasets/madlad/charset-detect4/devtest \
    --lengths 8,32,128,full --confusion
```