Building the Junk Detector
This page documents the training pipeline, model format, evaluation methodology, and guidance for improving the junk detector model. For usage, see Text Quality Scoring (Junk Detection).
Overview
The junk detector is a per-script byte-bigram language model. For each
Unicode script (Latin, Cyrillic, Arabic, Han, etc.) it maintains a 256×256
table of log P(byte_b | byte_a) values — the probability of seeing byte b
immediately after byte a in clean UTF-8 text of that script.
The pipeline has three stages:
1. BuildJunkTrainingData — collect and split corpus per script group
2. TrainJunkModel — train bigram tables and calibrate z-scores
3. EvalJunkDetector — measure discrimination quality
All three tools are packaged as a fat JAR via the train Maven profile:
mvn -pl tika-ml/tika-ml-junkdetect package -Ptrain -DskipTests
The resulting JAR is tika-ml-junkdetect-*-train.jar.
Stage 1: Corpus collection (BuildJunkTrainingData)
This tool collects clean UTF-8 sentences from language-specific source files, groups them by Unicode script, allocates a byte budget proportional to per-script bigram entropy, and writes 80/10/10 train/dev/test splits.
Data format
Source data lives in one directory per language (ISO 639 code), each containing up to two files:
- sentences_wikipedia.txt: Line-numbered Wikipedia sentences in the form {lineNum}{TAB}{text}. One sentence per line.
- sentences_madlad.txt: Line-numbered MADLAD-400 documents in the form {lineNum}{TAB}{text}. Documents contain literal two-character \n escape sequences as sub-sentence separators. The tool splits on these before processing.
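As a rough illustration of the two line formats, the following Java sketch strips the line-number prefix and splits a MADLAD document on the literal backslash-n sequence. The class and method names are hypothetical and are not taken from BuildJunkTrainingData; trimming the pieces and dropping empty ones are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

final class SourceLineParser {
    /** Strips the leading "{lineNum}{TAB}" prefix; returns null if the line has no tab. */
    static String textOf(String line) {
        int tab = line.indexOf('\t');
        return tab < 0 ? null : line.substring(tab + 1);
    }

    /** Splits on the literal two-character sequence backslash + 'n' (not a real newline). */
    static List<String> splitMadlad(String text) {
        List<String> parts = new ArrayList<>();
        // The regex \\n (written "\\\\n" in Java source) matches a backslash followed by 'n'.
        for (String piece : text.split("\\\\n")) {
            String trimmed = piece.trim(); // assumption: surrounding whitespace is not significant
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }
}
```

Wikipedia lines would only need textOf; MADLAD lines would pass the result through splitMadlad as well.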
Script group detection
For each language directory the dominant Unicode script is detected by
sampling up to 2,000 lines and histogramming Character.UnicodeScript over
all codepoints. The COMMON, INHERITED, and UNKNOWN pseudo-scripts are
excluded. The plurality script (with a 1% minimum floor to suppress spurious
wins on mixed-script text) determines which group that language belongs to.
Languages that share the same dominant script are pooled together into one training group. No script groups are hardcoded — the set of groups is derived entirely from the data.
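The detection step can be sketched in Java roughly as follows. This is an illustration only: the class name is hypothetical, the caller is expected to pass at most 2,000 sampled lines, and the floor is interpreted here as a minimum share of all sampled codepoints, which is an assumption about how the 1% floor is applied.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

final class DominantScript {
    /**
     * Returns the plurality script of the sample, or null if no script reaches
     * minShare of all sampled codepoints (assumed reading of the 1% floor).
     */
    static UnicodeScript detect(List<String> sampleLines, double minShare) {
        Map<UnicodeScript, Long> histogram = new EnumMap<>(UnicodeScript.class);
        long totalCodepoints = 0;
        for (String line : sampleLines) {
            for (int i = 0; i < line.length(); ) {
                int cp = line.codePointAt(i);
                i += Character.charCount(cp);
                totalCodepoints++;
                UnicodeScript script = UnicodeScript.of(cp);
                if (script == UnicodeScript.COMMON
                        || script == UnicodeScript.INHERITED
                        || script == UnicodeScript.UNKNOWN) {
                    continue; // pseudo-scripts are counted in the total but can never win
                }
                histogram.merge(script, 1L, Long::sum);
            }
        }
        UnicodeScript best = null;
        long bestCount = 0;
        for (Map.Entry<UnicodeScript, Long> e : histogram.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        if (best == null || bestCount < minShare * totalCodepoints) {
            return null;
        }
        return best;
    }
}
```

Languages for which detect returns the same script would then be pooled into one group.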
Entropy-proportional byte budget
Not all scripts are equal: CJK text has thousands of distinct codepoints, each encoded as 3 bytes in UTF-8, producing high byte-bigram entropy (~10.4 bits), while Arabic text clusters in a narrow 0xD8–0xDB high-byte range (~7.2 bits). A naïve sentence-count budget would badly over-represent low-entropy scripts.
Instead the tool allocates a total byte budget (default 50 MB) across script groups in proportion to their empirical byte-bigram Shannon entropy, estimated from a 200 KB sample per group:
H(script) = -Σ p(a,b) · log₂ p(a,b) over all observed bigrams (a,b)
budget(script) = totalBudget × H(script) / Σ H(all scripts)
Within each script group the budget is distributed evenly across its member languages, ensuring no single language dominates the training data.
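The two formulas above can be sketched in Java as follows. The class and method names are hypothetical; the sketch assumes the entropy is the joint entropy over observed bigrams in a single byte sample and that budgets are rounded to the nearest byte.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class EntropyBudget {
    /** Shannon entropy, in bits, of the joint byte-bigram distribution of a sample. */
    static double bigramEntropyBits(byte[] sample) {
        if (sample.length < 2) {
            return 0.0;
        }
        long[] counts = new long[256 * 256];
        for (int i = 0; i + 1 < sample.length; i++) {
            counts[(sample[i] & 0xFF) * 256 + (sample[i + 1] & 0xFF)]++;
        }
        double n = sample.length - 1; // number of bigrams
        double h = 0.0;
        for (long c : counts) {
            if (c > 0) {
                double p = c / n;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    /** budget(script) = totalBudget * H(script) / sum of H over all scripts. */
    static Map<String, Long> allocate(Map<String, Double> entropyByScript, long totalBudget) {
        double sum = 0.0;
        for (double h : entropyByScript.values()) {
            sum += h;
        }
        Map<String, Long> budgets = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : entropyByScript.entrySet()) {
            budgets.put(e.getKey(), Math.round(totalBudget * e.getValue() / sum));
        }
        return budgets;
    }
}
```

With only the two example entropies quoted above (10.4 and 7.2 bits) and a 50 MB total, the higher-entropy script would receive roughly 59% of the budget. Each group's share would then be divided evenly among its member languages.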
Train/dev/test split
After collecting and shuffling sentences, the tool writes three gzipped files per script:
| File | Split | Purpose |
|---|---|---|
| {script}.train.gz | 80% | Bigram count accumulation in TrainJunkModel. |
| {script}.dev.gz | 10% | Calibration (mu/sigma estimation) in TrainJunkModel. |
| {script}.test.gz | 10% | Held out completely. Use only for final reported evaluation numbers. Never use to make model or threshold decisions. |
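A seeded shuffle followed by an 80/10/10 cut might look like the following Java sketch. The class name is hypothetical, and the exact rounding at the split boundaries is an assumption; the real tool also gzips the output, which is omitted here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class SplitSketch {
    /** Returns [train, dev, test] views over a seeded shuffle of the input. */
    static List<List<String>> split(List<String> sentences, long seed) {
        List<String> shuffled = new ArrayList<>(sentences);
        Collections.shuffle(shuffled, new Random(seed)); // same seed, same order
        int n = shuffled.size();
        int trainEnd = n * 8 / 10; // integer arithmetic avoids floating-point edge cases
        int devEnd = n * 9 / 10;
        return List.of(
                shuffled.subList(0, trainEnd),
                shuffled.subList(trainEnd, devEnd),
                shuffled.subList(devEnd, n));
    }
}
```

Because the three lists are disjoint slices of one shuffled list, no sentence instance can appear in more than one split.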
Running corpus collection
java -cp tika-ml-junkdetect-*-train.jar \
org.apache.tika.ml.junkdetect.tools.BuildJunkTrainingData \
--data-dir ~/datasets/madlad/data \
--output-dir ~/datasets/madlad/junkdetect \
--total-budget-bytes 50000000
Key options:
| Option | Default | Description |
|---|---|---|
| --data-dir | | Root directory containing per-language subdirectories. |
| --output-dir | | Where to write the per-script train/dev/test files. |
| --total-budget-bytes | 50 MB | Total UTF-8 byte budget across all scripts. Increase for production runs. |
| | | Minimum UTF-8 byte length for a sentence to be accepted. |
| | | Maximum fraction of codepoints that may be ASCII punctuation or digits. Filters out bullet lists, code snippets, and other non-prose content. |
| | | Random seed for reproducible shuffles. |
| | | Print script detection and entropy results without writing files. |
Stage 2: Training (TrainJunkModel)
For each script, this tool reads the .train.gz file, accumulates
byte-bigram counts, applies Laplace smoothing, computes log-probabilities,
then calibrates z-score statistics from the .dev.gz file.
Bigram table training
for each sentence in {script}.train.gz:
    utf8 = sentence.getBytes(UTF-8)
    for each consecutive pair (a, b) in utf8:
        counts[a * 256 + b]++
for each row a in 0..255:
    rowTotal = Σ (counts[a * 256 + b] + 1) for b in 0..255   // Laplace add-1
    for each b in 0..255:
        table[a * 256 + b] = log((counts[a * 256 + b] + 1) / rowTotal)
Laplace (add-1) smoothing is applied per row: every possible next byte is given a pseudocount of 1, preventing log(0) for unseen bigrams and providing a small but nonzero probability for novel byte sequences.
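A runnable Java rendering of the pseudocode above might look like this. It is a sketch, not the TrainJunkModel source: the class name is hypothetical, sentences are passed in memory rather than read from the gzipped file, and the natural logarithm is assumed because the base is not stated.

```java
import java.nio.charset.StandardCharsets;

final class BigramTrainer {
    /** Returns a 256x256 table of log P(b | a), indexed as a * 256 + b. */
    static float[] train(Iterable<String> sentences) {
        long[] counts = new long[256 * 256];
        for (String sentence : sentences) {
            byte[] utf8 = sentence.getBytes(StandardCharsets.UTF_8);
            for (int i = 0; i + 1 < utf8.length; i++) {
                // Mask with 0xFF because Java bytes are signed.
                counts[(utf8[i] & 0xFF) * 256 + (utf8[i + 1] & 0xFF)]++;
            }
        }
        float[] table = new float[256 * 256];
        for (int a = 0; a < 256; a++) {
            long rowTotal = 256; // one pseudocount per possible next byte (Laplace add-1)
            for (int b = 0; b < 256; b++) {
                rowTotal += counts[a * 256 + b];
            }
            for (int b = 0; b < 256; b++) {
                table[a * 256 + b] =
                        (float) Math.log((counts[a * 256 + b] + 1) / (double) rowTotal);
            }
        }
        return table;
    }
}
```

A row for a byte that never appears in training ends up uniform, with every entry equal to log(1/256).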
Calibration
For each sentence in {script}.dev.gz:
    meanLogProb = (Σ table[bigram]) / (bytes - 1)
The calibration statistics are the mean (μ) and standard deviation (σ) of
meanLogProb across all dev sentences. At inference:
zScore = (meanLogProb - μ) / σ
A z-score of 0 means "exactly as likely as average clean text for this script." Negative scores indicate text that is less likely than clean — i.e., garbled.
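The calibration and scoring arithmetic can be sketched in Java as follows. The class name is hypothetical, the table is assumed to use the a * 256 + b layout from the training step, and the population standard deviation is an assumption since the text does not say which estimator is used.

```java
final class Calibration {
    /** Mean bigram log-probability of a UTF-8 byte sequence; NaN if it has fewer than two bytes. */
    static double meanLogProb(float[] table, byte[] utf8) {
        if (utf8.length < 2) {
            return Double.NaN;
        }
        double sum = 0.0;
        for (int i = 0; i + 1 < utf8.length; i++) {
            sum += table[(utf8[i] & 0xFF) * 256 + (utf8[i + 1] & 0xFF)];
        }
        return sum / (utf8.length - 1);
    }

    /** Returns {mu, sigma} of the per-sentence values (population standard deviation, assumed). */
    static double[] muSigma(double[] values) {
        double mu = 0.0;
        for (double v : values) {
            mu += v;
        }
        mu /= values.length;
        double sumSquares = 0.0;
        for (double v : values) {
            sumSquares += (v - mu) * (v - mu);
        }
        return new double[] {mu, Math.sqrt(sumSquares / values.length)};
    }

    /** zScore = (meanLogProb - mu) / sigma. */
    static double zScore(double meanLogProb, double mu, double sigma) {
        return (meanLogProb - mu) / sigma;
    }
}
```

For example, a sentence whose mean log-probability sits two dev-set standard deviations below μ gets a z-score of −2.0.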
Running training
java -cp tika-ml-junkdetect-*-train.jar \
org.apache.tika.ml.junkdetect.tools.TrainJunkModel \
--data-dir ~/datasets/madlad/junkdetect \
--output ~/datasets/madlad/junkdetect/junkdetect.bin
After training, copy the model to the classpath resource location:
cp ~/datasets/madlad/junkdetect/junkdetect.bin \
tika-ml/tika-ml-junkdetect/src/main/resources/org/apache/tika/ml/junkdetect/junkdetect.bin
Stage 3: Evaluation (EvalJunkDetector)
The evaluator measures how well the model separates clean text from corrupted text across scripts, distortion types, and string lengths.
Distortion modes
| Mode | Description |
|---|---|
| | Random bytes (0x80–0xFF) are substituted at a given rate. |
| | Codepoints are reversed (Unicode-aware, preserving surrogate pairs). Produces valid UTF-8 but in nonsensical reading order. Most meaningful for RTL scripts (Arabic, Hebrew) where reversed text is a realistic failure mode; LTR script bigrams are nearly symmetric, so detection is harder. |
| | All bytes are randomly shuffled (Fisher-Yates). The most extreme corruption: it destroys all sequential structure. |
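The three distortions described above could be implemented along these lines. This Java sketch uses hypothetical class and method names and is not the EvalJunkDetector code; in particular, replacing each byte independently with probability rate is an assumed reading of the byte-substitution mode.

```java
import java.util.Random;

final class Distortions {
    /** Replaces each byte with a random byte in 0x80..0xFF with probability rate. */
    static byte[] injectHighBytes(byte[] in, double rate, Random rnd) {
        byte[] out = in.clone();
        for (int i = 0; i < out.length; i++) {
            if (rnd.nextDouble() < rate) {
                out[i] = (byte) (0x80 + rnd.nextInt(0x80));
            }
        }
        return out;
    }

    /** Reverses by codepoint, so surrogate pairs stay intact. */
    static String reverseCodepoints(String s) {
        int[] cps = s.codePoints().toArray();
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = cps.length - 1; i >= 0; i--) {
            sb.appendCodePoint(cps[i]);
        }
        return sb.toString();
    }

    /** Fisher-Yates shuffle over a copy of the bytes. */
    static byte[] shuffleBytes(byte[] in, Random rnd) {
        byte[] out = in.clone();
        for (int i = out.length - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1); // uniform in 0..i
            byte tmp = out[i];
            out[i] = out[j];
            out[j] = tmp;
        }
        return out;
    }
}
```

Passing a seeded Random keeps the corrupted samples reproducible between evaluation runs.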
Output files
- detail.tsv: One row per (script, distortion, param, length) cell, with columns: script, distortion, param, length, n_clean, n_corrupt, mean_clean_z, mean_corrupt_z, cohens_d, fpr, tpr.
- summary.tsv: Macro-averaged across scripts per (distortion, param, length). The macro_cohens_d column is the headline comparison metric.
Key metrics
- Cohen’s d (primary metric): Effect size separating clean from corrupted z-scores, d = (mean_clean_z - mean_corrupt_z) / pooled_std. Higher is better. A value of 1.0 means the two means are separated by one pooled standard deviation. Values above 2.0 indicate strong, reliable discrimination.
- True positive rate (TPR): Fraction of corrupted samples with z < threshold (−2.0 by default). Higher is better.
- False positive rate (FPR): Fraction of clean samples with z < threshold. Should stay near 2–5%. A well-calibrated model will have FPR ≈ 2.3%, since z < −2.0 cuts off about 2.3% of the left tail of a standard normal, assuming clean z-scores are roughly normally distributed.
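The three metrics can be computed from two arrays of z-scores as in the Java sketch below. The class name is hypothetical, and the pooled standard deviation uses the common sample-variance form weighted by degrees of freedom, which is an assumption since the text does not define pooled_std.

```java
final class EvalMetrics {
    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.length;
    }

    /** Sample variance (n - 1 denominator). */
    static double variance(double[] x) {
        double m = mean(x);
        double sumSquares = 0.0;
        for (double v : x) {
            sumSquares += (v - m) * (v - m);
        }
        return sumSquares / (x.length - 1);
    }

    /** d = (mean_clean_z - mean_corrupt_z) / pooled_std. */
    static double cohensD(double[] cleanZ, double[] corruptZ) {
        int n1 = cleanZ.length;
        int n2 = corruptZ.length;
        double pooledVar =
                ((n1 - 1) * variance(cleanZ) + (n2 - 1) * variance(corruptZ)) / (n1 + n2 - 2);
        return (mean(cleanZ) - mean(corruptZ)) / Math.sqrt(pooledVar);
    }

    /** Fraction of values below the threshold: TPR on corrupted input, FPR on clean input. */
    static double rateBelow(double[] z, double threshold) {
        int hits = 0;
        for (double v : z) {
            if (v < threshold) {
                hits++;
            }
        }
        return (double) hits / z.length;
    }
}
```

Calling rateBelow with the corrupted z-scores gives the TPR, and calling it with the clean z-scores gives the FPR, both at the same threshold.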
Running evaluation
# During development: use the dev split
java -cp tika-ml-junkdetect-*-train.jar \
org.apache.tika.ml.junkdetect.tools.EvalJunkDetector \
--data-dir ~/datasets/madlad/junkdetect \
--split dev \
--output-dir ~/datasets/madlad/junkdetect/eval
# Final reporting only: use the held-out test split
java -cp tika-ml-junkdetect-*-train.jar \
org.apache.tika.ml.junkdetect.tools.EvalJunkDetector \
--data-dir ~/datasets/madlad/junkdetect \
--split test \
--output-dir ~/datasets/madlad/junkdetect/eval-final
Use --split test only once, for final reporting. The test split
is completely held out and should never inform model or threshold decisions.
Tracking improvement
To compare two model versions:
1. Train model A, run EvalJunkDetector --split dev, and save summary.tsv as summary-A.tsv.
2. Retrain as model B, run the evaluation again, and save the result as summary-B.tsv.
3. Diff the macro_cohens_d column. A positive change is an improvement.
The # OVERALL line at the bottom of summary.tsv gives a single-number
summary of model quality.
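The comparison of two summary files can be automated with a small Java helper such as the sketch below. It assumes that summary.tsv starts with a header row naming the distortion, param, length, and macro_cohens_d columns and that comment lines start with #; the header row and the class name are assumptions, so check them against a real file first.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SummaryDiff {
    /** Maps "distortion/param/length" to macro_cohens_d, skipping blank and '#' lines. */
    static Map<String, Double> read(List<String> lines) {
        Map<String, Double> out = new LinkedHashMap<>();
        List<String> header = null;
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] f = line.split("\t", -1);
            if (header == null) {
                header = Arrays.asList(f); // assumption: first data line is a header row
                for (String col : new String[] {"distortion", "param", "length", "macro_cohens_d"}) {
                    if (!header.contains(col)) {
                        throw new IllegalArgumentException("missing column: " + col);
                    }
                }
                continue;
            }
            String key = f[header.indexOf("distortion")] + "/"
                    + f[header.indexOf("param")] + "/"
                    + f[header.indexOf("length")];
            out.put(key, Double.parseDouble(f[header.indexOf("macro_cohens_d")]));
        }
        return out;
    }

    /** Returns B minus A for every cell present in both summaries; positive means B improved. */
    static Map<String, Double> delta(Map<String, Double> a, Map<String, Double> b) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                out.put(e.getKey(), other - e.getValue());
            }
        }
        return out;
    }
}
```

In practice the line lists would come from java.nio.file.Files.readAllLines on summary-A.tsv and summary-B.tsv.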
Model binary format (JUNKDET1)
The model is stored as a gzipped binary file. The loader auto-detects the gzip
wrapper by inspecting the first two bytes (magic 0x1f 0x8b).
[8 bytes] magic "JUNKDET1" (ASCII)
[1 byte] version = 1
[4 bytes] num_scripts (int32 big-endian)
For each script (sorted by name):
[2 bytes] name length (uint16 big-endian)
[N bytes] script name (UTF-8)
[4 bytes] μ — mean of dev-set mean_bigram_logprob (float32 big-endian)
[4 bytes] σ — std deviation (float32 big-endian)
[65536×4 bytes] log-prob table (float32 big-endian, index = a*256+b)
The default classpath resource is
org/apache/tika/ml/junkdetect/junkdetect.bin.
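A reader for the layout above can be sketched with java.io.DataInputStream, which reads big-endian values. This is an illustration with hypothetical class names, not the loader shipped with the module; it assumes the gzip wrapper is optional and is detected from the two magic bytes.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;

final class JunkModelReader {
    static final class ScriptModel {
        final float mu;
        final float sigma;
        final float[] table; // 65536 log-probs, index = a * 256 + b

        ScriptModel(float mu, float sigma, float[] table) {
            this.mu = mu;
            this.sigma = sigma;
            this.table = table;
        }
    }

    static Map<String, ScriptModel> read(InputStream raw) throws IOException {
        BufferedInputStream buf = new BufferedInputStream(raw);
        buf.mark(2);
        int b0 = buf.read();
        int b1 = buf.read();
        buf.reset();
        // Unwrap gzip only when the stream starts with the gzip magic 0x1f 0x8b.
        InputStream in = (b0 == 0x1f && b1 == 0x8b) ? new GZIPInputStream(buf) : buf;
        DataInputStream data = new DataInputStream(new BufferedInputStream(in));

        byte[] magic = new byte[8];
        data.readFully(magic);
        if (!"JUNKDET1".equals(new String(magic, StandardCharsets.US_ASCII))) {
            throw new IOException("bad magic");
        }
        int version = data.readUnsignedByte();
        if (version != 1) {
            throw new IOException("unsupported version " + version);
        }
        int numScripts = data.readInt();
        if (numScripts < 0) {
            throw new IOException("negative script count");
        }
        Map<String, ScriptModel> models = new LinkedHashMap<>();
        for (int i = 0; i < numScripts; i++) {
            byte[] name = new byte[data.readUnsignedShort()];
            data.readFully(name);
            float mu = data.readFloat();
            float sigma = data.readFloat();
            float[] table = new float[256 * 256];
            for (int j = 0; j < table.length; j++) {
                table[j] = data.readFloat();
            }
            models.put(new String(name, StandardCharsets.UTF_8), new ScriptModel(mu, sigma, table));
        }
        return models;
    }
}
```

Each script contributes a fixed 262,144 bytes of table data before compression, plus its name and the two calibration floats.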
Known limitations and improvement paths
Baltic and closely related Latin scripts
The LATIN script pools ~322 languages from Latin, Basic Latin, and extended Latin alphabets. Baltic languages (Lithuanian, Latvian) use distinctive diacritics encoded differently in cp1257 vs. cp1252, but these bigrams are diluted by the large shared Latin vocabulary. The model correctly identifies the winner but with low delta (< 0.5), below the production confidence threshold of 1.0.
Possible improvements:
- Retrain with Baltic languages weighted more heavily within the LATIN group.
- Split LATIN into LATIN-WEST and LATIN-EAST sub-models, where LATIN-EAST receives its own dedicated bigram table trained primarily on Baltic, Slavic Latin (Polish, Czech, Slovak), and Romanian.
RTL script reversal
For Arabic and Hebrew, codepoint-reversal is a realistic failure mode (text stored in the wrong visual order). The model detects this with moderate Cohen’s d at lengths ≥ 50 characters. Shorter strings (15–30 characters) show weaker separation because there are too few bigrams to be statistically reliable.
Possible improvement: train a secondary short-text specialist model for RTL scripts using finer-grained features (trigrams or unigram frequency distributions).
Scaling up
The default 50 MB byte budget is a proof-of-concept setting. For production:
- Increase --total-budget-bytes to 500 MB or more.
- Larger budgets improve calibration quality (tighter σ, more accurate μ) and reduce variance on infrequent bigrams.
- The model binary grows only slightly, because each 256×256 table is the same size regardless of training set size. Only the quality of the estimates improves.
Smoke tests
Five smoke tests in JunkDetectorSmokeTest verify the bundled model.
All tests go through the TextQualityDetector interface, which returns
TextQualityScore or TextQualityComparison from tika-core.
| Test | What it checks |
|---|---|
| | Clean English |
| | Forward Arabic z-score > codepoint-reversed Arabic z-score. Reversal is done at codepoint (not byte) granularity, preserving valid Unicode. |
| | |
| | |
| | Clean Japanese z-score > byte-shuffled Japanese z-score. Shuffled bytes are decoded as ISO-8859-1 to produce a scoreable string. |
Codepoint reversal of LTR scripts (Russian, Latin) is not a useful smoke test: LTR byte-bigram distributions are nearly symmetric, so the model cannot reliably distinguish forward from reversed text. The Russian test uses codec comparison (cp1251 vs. cp1252) instead, which is the actual real-world failure mode for Cyrillic text.