Text Quality Scoring (Junk Detection)
The tika-ml-junkdetect module provides a language-agnostic scorer that
distinguishes clean natural-language text from garbled, corrupted, or
mis-decoded content — without needing to know the language in advance.
What it detects
- Mojibake: text decoded with the wrong character set (e.g., a Windows-1251 Russian document decoded as Windows-1252, producing Latin lookalike garbage)
- Byte-level corruption: random or partially overwritten byte sequences that produce structurally invalid UTF-8
- Reversed or shuffled text: valid characters in a nonsensical order, as can occur in bidirectional rendering failures or corrupted OCR streams
- OCR garbage: low-confidence OCR output full of symbol noise
It does not detect incorrect language (e.g., an English document mistakenly labeled as French) — use Language Detection for that.
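To make the mojibake case concrete, the following minimal sketch reproduces it with the JDK alone: a Cyrillic word is encoded as windows-1251 and then decoded as windows-1252. The class and method names (`MojibakeDemo`, `misdecode`) are hypothetical and are not part of Tika.

```java
import java.nio.charset.Charset;

/** Illustration only: shows how decoding with the wrong charset produces mojibake. */
final class MojibakeDemo {

    /** Encodes text with one charset, then decodes the same bytes with another. */
    static String misdecode(String text, String actualCharset, String assumedCharset) {
        byte[] raw = text.getBytes(Charset.forName(actualCharset));
        return new String(raw, Charset.forName(assumedCharset));
    }

    public static void main(String[] args) {
        // Russian "privet" ("hello") in Cyrillic, written with Unicode escapes
        String original = "\u041f\u0440\u0438\u0432\u0435\u0442";
        String garbled = misdecode(original, "windows-1251", "windows-1252");
        System.out.println(original + " -> " + garbled);
    }
}
```

The garbled result, "Ïðèâåò", consists entirely of valid Latin characters, which is why this kind of damage has to be caught statistically rather than by checking for invalid bytes.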
How it works
The scorer uses a per-script byte-bigram log-probability model trained on clean Wikipedia and MADLAD-400 text. For each input it:
- Identifies the dominant Unicode script (Latin, Cyrillic, Arabic, Han, etc.) by histogramming Character.UnicodeScript over all codepoints.
- Looks up the script's bigram table: a 256×256 matrix of log P(byte_b | byte_a) values trained on clean text for that script.
- Computes a mean log-probability across all consecutive byte pairs in the UTF-8 encoding of the input.
- Z-scores the result against calibration statistics (mean and standard deviation measured on a held-out set of clean text for the same script).
The z-score is the primary output: a score of 0 means "exactly as expected for clean text of this script"; a score of −3 means "three standard deviations worse than clean"; a score of −10 means "almost certainly garbled."
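The steps above can be sketched in plain Java as follows. This is an illustration of the described technique, not the module's actual implementation: the class name `BigramScoreSketch`, the choice to skip COMMON, INHERITED and UNKNOWN codepoints when picking the dominant script, the uniform placeholder table and the calibration numbers are all assumptions made for the example.

```java
import java.lang.Character.UnicodeScript;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.EnumMap;
import java.util.Map;

final class BigramScoreSketch {

    /** First step: most frequent script. Skipping COMMON/INHERITED/UNKNOWN is an assumption of this sketch. */
    static UnicodeScript dominantScript(String text) {
        Map<UnicodeScript, Integer> histogram = new EnumMap<>(UnicodeScript.class);
        text.codePoints().forEach(cp -> {
            UnicodeScript script = UnicodeScript.of(cp);
            if (script != UnicodeScript.COMMON && script != UnicodeScript.INHERITED
                    && script != UnicodeScript.UNKNOWN) {
                histogram.merge(script, 1, Integer::sum);
            }
        });
        UnicodeScript best = UnicodeScript.UNKNOWN;
        int bestCount = 0;
        for (Map.Entry<UnicodeScript, Integer> e : histogram.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** Second and third steps: mean of logProb[a][b] over consecutive bytes of the UTF-8 encoding. */
    static double meanLogProb(String text, double[][] logProb) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        if (utf8.length < 2) {
            return Double.NaN; // no bigrams to score
        }
        double sum = 0.0;
        for (int i = 0; i + 1 < utf8.length; i++) {
            sum += logProb[utf8[i] & 0xFF][utf8[i + 1] & 0xFF]; // mask to get unsigned byte values
        }
        return sum / (utf8.length - 1);
    }

    /** Fourth step: z-score against the calibration mean and standard deviation for the script. */
    static double zScore(double meanLogProb, double calibMean, double calibStd) {
        return (meanLogProb - calibMean) / calibStd;
    }

    public static void main(String[] args) {
        // Placeholder table: every bigram equally likely. A real model is trained on clean text.
        double[][] uniform = new double[256][256];
        for (double[] row : uniform) {
            Arrays.fill(row, Math.log(1.0 / 256));
        }
        String text = "The quick brown fox";
        double mean = meanLogProb(text, uniform);
        // Calibration values below are made up for illustration.
        System.out.println(dominantScript(text) + " z=" + zScore(mean, -5.0, 0.5));
    }
}
```

Because the score is a mean over bigrams rather than a sum, it does not grow with input length, which is what allows one set of calibration statistics per script to apply to texts of different sizes.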
Using the API
The public interface is TextQualityDetector in tika-core.
The implementation lives in tika-ml-junkdetect, which registers itself via
the Java ServiceLoader mechanism.
Add the dependency to your project:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-ml-junkdetect</artifactId>
    <version>${tika.version}</version>
</dependency>
Loading the detector
// Via ServiceLoader (java.util.ServiceLoader): picks up any registered TextQualityDetector implementation
TextQualityDetector detector = ServiceLoader.load(TextQualityDetector.class)
        .findFirst()
        .orElseThrow(() -> new IllegalStateException("No TextQualityDetector on classpath"));
// Or directly, when you know you want JunkDetector specifically
JunkDetector junkDetector = JunkDetector.loadFromClasspath();
JunkDetector is immutable and thread-safe after construction. Load it once
at application startup.
Scoring a string
TextQualityScore score = detector.score("The quick brown fox jumps over the lazy dog.");
System.out.println(score.getZScore()); // e.g. -0.74 — within normal range
System.out.println(score.getPClean()); // e.g. 0.32 — P(clean) via sigmoid
Interpreting the score
| Z-score range | Interpretation |
|---|---|
| > 0 | Better than average clean text: high-quality, well-formed natural language. |
| −1 to 0 | Within normal range for clean text. Most real documents fall here. |
| −2 to −1 | Mildly degraded. May indicate noisy OCR, code-heavy text, or unusual domain language. Not necessarily junk. |
| < −2 | Two or more standard deviations below clean. Worth investigating. A reasonable threshold for triggering re-OCR or re-encoding. |
| < −5 | Almost certainly garbled. Wrong charset decoding, byte-reversed content, or heavy corruption. |
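The bands in this table can be turned into a small lookup, sketched below. The enum names and the handling of exact boundary values (for example, whether −2.0 counts as "mildly degraded" or "worth investigating") are assumptions of this example; the table itself does not specify them.

```java
/** Illustration only: maps a z-score onto the interpretation bands from the table. */
final class ZScoreBands {

    enum Band { ABOVE_AVERAGE, NORMAL, MILDLY_DEGRADED, WORTH_INVESTIGATING, ALMOST_CERTAINLY_GARBLED }

    static Band classify(double z) {
        if (Double.isNaN(z)) {
            throw new IllegalArgumentException("z-score is not a number");
        }
        if (z > 0.0) {
            return Band.ABOVE_AVERAGE;
        }
        if (z >= -1.0) {
            return Band.NORMAL;
        }
        if (z >= -2.0) {
            return Band.MILDLY_DEGRADED;
        }
        if (z >= -5.0) {
            return Band.WORTH_INVESTIGATING;
        }
        return Band.ALMOST_CERTAINLY_GARBLED;
    }

    public static void main(String[] args) {
        for (double z : new double[] {0.3, -0.74, -1.5, -3.0, -10.0}) {
            System.out.println(z + " -> " + classify(z));
        }
    }
}
```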
The TextQualityScore also carries:
- getPClean(): sigmoid(z), a rough probability estimate in [0, 1] that the text is clean. Useful for ranking candidates; the absolute value is not calibrated as a true probability.
- getCiLow() / getCiHigh(): 95% confidence interval on the z-score. Narrow on long texts, wide on short ones. Use these when making threshold decisions on short strings.
- getDominantScript(): the Unicode script name used for scoring (e.g. "LATIN", "CYRILLIC", "ARABIC", "HAN"). If isUnknown() is true, the dominant script had no model and scoring was not possible.
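As a rough illustration of why getPClean() is better suited to ranking than to absolute judgments, the sketch below assumes it is the standard logistic function applied to the z-score, which is how the description above reads. The class name `PCleanSketch` is hypothetical.

```java
/** Illustration only: the standard logistic function applied to a z-score. */
final class PCleanSketch {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        // sigmoid is strictly increasing, so ranking by pClean equals ranking by z-score
        for (double z : new double[] {0.0, -0.74, -1.0, -3.0, -10.0}) {
            System.out.printf("z=%.2f  pClean=%.4f%n", z, sigmoid(z));
        }
    }
}
```

Under this assumption a z-score of 0 maps to exactly 0.5 and a z-score of −1 to about 0.27, so perfectly ordinary clean text sits well below 1.0. That is consistent with the example value of 0.32 for z = −0.74 shown earlier.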
Comparing two candidates
The compare() method is the primary use case for charset detection:
given the same raw bytes decoded two different ways, which decoding
looks more like natural language?
The caller is responsible for decoding the raw bytes; the detector just compares the resulting strings. Each candidate is given a human-readable label (typically the charset name) that is echoed back in the result.
byte[] rawBytes = ...; // bytes from an unknown-encoding file
String ascp1252 = new String(rawBytes, Charset.forName("cp1252"));
String ascp1251 = new String(rawBytes, Charset.forName("cp1251"));
TextQualityComparison result = detector.compare("cp1252", ascp1252, "cp1251", ascp1251);
System.out.println(result.winner()); // "A" or "B"
System.out.println(result.delta()); // z-score separation between the two
if ("B".equals(result.winner()) && result.delta() > 1.0) {
    // cp1251 is confidently the better decoding
}
The delta() is the absolute difference in z-scores between the two candidates.
As a rough guide:
| Delta | Confidence |
|---|---|
| < 0.5 | Very uncertain: both decodings look similar to the model. Fall back to other heuristics. |
| 0.5 – 1.0 | Weak signal: winner is likely correct but not assured. |
| 1.0 – 3.0 | Useful signal. Trust the winner for most production purposes. |
| > 3.0 | High confidence. One decoding is clearly more language-like. |
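The delta guidance can be expressed as a small decision helper that works on two z-scores directly. This is a sketch with hypothetical names (`DeltaDecision`, `confidence`, `pick`) and does not use or mirror the Tika API; the treatment of exact boundary values is also an assumption.

```java
import java.util.Optional;

/** Illustration only: picks between two decodings based on their z-scores. */
final class DeltaDecision {

    enum Confidence { VERY_UNCERTAIN, WEAK, USEFUL, HIGH }

    /** Maps an absolute z-score difference onto the confidence bands from the table. */
    static Confidence confidence(double delta) {
        if (delta < 0.5) {
            return Confidence.VERY_UNCERTAIN;
        }
        if (delta < 1.0) {
            return Confidence.WEAK;
        }
        if (delta <= 3.0) {
            return Confidence.USEFUL;
        }
        return Confidence.HIGH;
    }

    /** Returns the label with the higher z-score, or empty if the separation is below minDelta. */
    static Optional<String> pick(String labelA, double zA, String labelB, double zB, double minDelta) {
        double delta = Math.abs(zA - zB);
        if (delta < minDelta) {
            return Optional.empty(); // abstain and let the caller fall back to other heuristics
        }
        return Optional.of(zA >= zB ? labelA : labelB);
    }

    public static void main(String[] args) {
        // z-scores here are invented for the example
        System.out.println(pick("cp1252", -6.2, "cp1251", -0.4, 1.0)); // clear separation
        System.out.println(pick("cp1252", -0.5, "cp1257", -0.7, 1.0)); // too close to call
    }
}
```

Returning an empty Optional for small deltas keeps the "abstain" case explicit, so callers cannot accidentally treat a near-tie as a decision.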
Listing known scripts
detector.knownScripts(); // returns Set<String>
// e.g. [ARABIC, ARMENIAN, BENGALI, CYRILLIC, DEVANAGARI, GEORGIAN,
// GREEK, GUJARATI, GURMUKHI, HAN, HANGUL, HEBREW, HIRAGANA,
// KANNADA, KHMER, LAO, LATIN, MALAYALAM, MYANMAR, ORIYA,
// SINHALA, TAMIL, TELUGU, THAANA, THAI, TIBETAN, ...]
If the dominant script of an input is not in this set, score() returns a
TextQualityScore where isUnknown() is true and no z-score is available.
Thresholds and operating points
There is no universally correct threshold. The right cutoff depends on your content and tolerance for false positives (flagging good text as junk).
Starting points:
- Trigger re-OCR: z < −2.0 (catches ~95% of severe corruption while flagging ~2–5% of legitimate text on average, more for short strings).
- Charset tiebreaking: prefer the candidate with the higher z-score when delta() > 1.0; abstain if delta() < 0.5.
- Training data filtering: z < −1.5 to remove mojibake and bot-generated noise from NLP corpora.
For short text (under ~50 UTF-8 bytes), use getCiLow() rather than getZScore()
for threshold decisions, since the confidence interval widens substantially.
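One way to apply the short-text guidance is sketched below: the value compared against the cutoff switches from the z-score to the lower confidence bound when the input is under roughly 50 UTF-8 bytes. The class name, the `Decision` enum and the exact 50-byte constant are assumptions of this example; the z-score and lower bound are passed in as plain numbers rather than read from a TextQualityScore.

```java
import java.nio.charset.StandardCharsets;

/** Illustration only: threshold decision that uses the CI lower bound for short inputs. */
final class ThresholdSketch {

    enum Decision { FLAG, PASS }

    /** Inputs shorter than this many UTF-8 bytes are treated as "short" (assumed cutoff). */
    static final int SHORT_TEXT_BYTES = 50;

    static Decision decide(String text, double zScore, double ciLow, double threshold) {
        // Length is measured in UTF-8 bytes, not chars: 30 Cyrillic letters are 60 bytes.
        int utf8Length = text.getBytes(StandardCharsets.UTF_8).length;
        double decisive = utf8Length < SHORT_TEXT_BYTES ? ciLow : zScore;
        return decisive < threshold ? Decision.FLAG : Decision.PASS;
    }

    public static void main(String[] args) {
        // Same scores, different lengths: only the short input is judged on its lower bound.
        System.out.println(decide("x".repeat(30), -1.5, -2.6, -2.0));
        System.out.println(decide("x".repeat(80), -1.5, -2.6, -2.0));
    }
}
```

With this policy a short string is flagged whenever the lower end of its interval falls below the cutoff, so short inputs are flagged more readily than long ones with the same z-score. Whether that trade-off suits a given pipeline depends on how costly a false positive is there.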
Limitations
- Script coverage: only scripts with a trained model can be scored. Unknown scripts return isUnknown() = true.
- Short text: scoring is unreliable below ~15 UTF-8 bytes. The model needs at least a few bigrams to produce a stable estimate.
- Closely related charsets in the same script pool: the LATIN model is trained across hundreds of languages, which dilutes the signal for closely related Western European and Baltic encodings (e.g., cp1252 vs. cp1257 on Lithuanian text). The winner is usually correct, but delta may be small (< 0.5).
- Deliberately obfuscated text: content designed to look like natural language (e.g. by adversarial padding) is not detected.
Further reading
For training methodology, model format, evaluation harness, and guidance on improving the model, see Building the Junk Detector.