Text Quality Scoring (Junk Detection)

The tika-ml-junkdetect module provides a language-agnostic scorer that distinguishes clean natural-language text from garbled, corrupted, or mis-decoded content — without needing to know the language in advance.

What it detects

  • Mojibake — text decoded with the wrong character set (e.g., a Windows-1251 Russian document decoded as Windows-1252, producing Latin lookalike garbage)

  • Byte-level corruption — random or partially-overwritten byte sequences that produce structurally invalid UTF-8

  • Reversed or shuffled text — text that contains valid characters but in nonsensical order, as can occur in bidirectional rendering failures or corrupted OCR streams

  • OCR garbage — low-confidence OCR output full of symbol noise

It does not detect incorrect language (e.g., an English document mistakenly labeled as French) — use Language Detection for that.
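To make the mojibake case concrete, the sketch below reproduces it with only the JDK: the Russian word "Привет" is encoded as windows-1251 and the bytes are then decoded as windows-1252, which yields the Latin lookalike string "Ïðèâåò". The class and method names are illustrative and not part of Tika, and the example assumes both charsets are available in the running JVM.

```java
import java.nio.charset.Charset;

class MojibakeDemo {

    /** Encodes text with one charset, then decodes the same bytes with another. */
    static String misdecode(String text, String actualCharset, String assumedCharset) {
        byte[] raw = text.getBytes(Charset.forName(actualCharset));
        return new String(raw, Charset.forName(assumedCharset));
    }

    public static void main(String[] args) {
        // Six Cyrillic letters ("hello" in Russian), written as Unicode escapes
        // so the example does not depend on the source file encoding.
        String clean = "\u041F\u0440\u0438\u0432\u0435\u0442";
        String garbled = misdecode(clean, "windows-1251", "windows-1252");
        System.out.println(garbled);
    }
}
```

Every character in the garbled string is a valid Latin letter, which is why a language-agnostic statistical scorer is needed to flag it rather than a simple validity check.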

How it works

The scorer uses a per-script byte-bigram log-probability model trained on clean Wikipedia and MADLAD-400 text. For each input it:

  1. Identifies the dominant Unicode script (Latin, Cyrillic, Arabic, Han, etc.) by histogramming Character.UnicodeScript over all codepoints.

  2. Looks up the script’s bigram table — a 256×256 matrix of log P(byte_b | byte_a) values trained on clean text for that script.

  3. Computes a mean log-probability across all consecutive byte pairs in the UTF-8 encoding of the input.

  4. Z-scores the result against calibration statistics (mean and standard deviation measured on a held-out set of clean text for the same script).

The z-score is the primary output: a score of 0 means "exactly as expected for clean text of this script"; a score of −3 means "three standard deviations worse than clean"; a score of −10 means "almost certainly garbled."
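The four steps above can be sketched in plain Java as follows. This is a minimal illustration and not the module's implementation: the class name, the uniform placeholder table, and the calibration numbers are all made up, and skipping script-neutral codepoints (COMMON, INHERITED, UNKNOWN) when picking the dominant script is an assumption.

```java
import java.lang.Character.UnicodeScript;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.EnumMap;
import java.util.Map;

class BigramScoreSketch {

    /** Step 1: most frequent script; neutral codepoints are skipped (an assumption). */
    static UnicodeScript dominantScript(String text) {
        Map<UnicodeScript, Integer> histogram = new EnumMap<>(UnicodeScript.class);
        text.codePoints().forEach(cp -> {
            UnicodeScript script = UnicodeScript.of(cp);
            if (script != UnicodeScript.COMMON && script != UnicodeScript.INHERITED
                    && script != UnicodeScript.UNKNOWN) {
                histogram.merge(script, 1, Integer::sum);
            }
        });
        return histogram.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(UnicodeScript.UNKNOWN);
    }

    /** Steps 2-3: mean log P(b | a) over consecutive bytes of the UTF-8 encoding. */
    static double meanLogProb(String text, double[][] logProb) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        if (utf8.length < 2) {
            return Double.NaN; // no bigrams to score
        }
        double sum = 0.0;
        for (int i = 0; i + 1 < utf8.length; i++) {
            // Mask to 0..255 because Java bytes are signed.
            sum += logProb[utf8[i] & 0xFF][utf8[i + 1] & 0xFF];
        }
        return sum / (utf8.length - 1);
    }

    /** Step 4: standardize against calibration statistics for the script. */
    static double zScore(double meanLogProb, double calibMean, double calibStdDev) {
        return (meanLogProb - calibMean) / calibStdDev;
    }

    public static void main(String[] args) {
        // Placeholder table: every transition equally likely. A real table is trained.
        double[][] uniform = new double[256][256];
        for (double[] row : uniform) {
            Arrays.fill(row, Math.log(1.0 / 256));
        }
        String text = "The quick brown fox";
        System.out.println(dominantScript(text));
        // Calibration mean and standard deviation are invented for illustration.
        System.out.println(zScore(meanLogProb(text, uniform), -4.0, 0.5));
    }
}
```

Because the score is a mean over bigrams rather than a sum, inputs of different lengths are comparable, although short inputs give noisier estimates.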

Using the API

The public interface is TextQualityDetector in tika-core. The implementation lives in tika-ml-junkdetect, which registers itself via the Java ServiceLoader mechanism.

Add the dependency to your project:

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-ml-junkdetect</artifactId>
  <version>${tika.version}</version>
</dependency>

Loading the detector

// Via ServiceLoader — picks up any registered TextQualityDetector implementation
TextQualityDetector detector = ServiceLoader.load(TextQualityDetector.class)
        .findFirst()
        .orElseThrow(() -> new IllegalStateException("No TextQualityDetector on classpath"));

// Or directly, when you know you want JunkDetector specifically
JunkDetector detector = JunkDetector.loadFromClasspath();

JunkDetector is immutable and thread-safe after construction. Load it once at application startup.

Scoring a string

TextQualityScore score = detector.score("The quick brown fox jumps over the lazy dog.");
System.out.println(score.getZScore());   // e.g. -0.74 — within normal range
System.out.println(score.getPClean());   // e.g. 0.32  — P(clean) via sigmoid

Interpreting the score

Z-score range   Interpretation
> 0             Better than average clean text: high-quality, well-formed natural language.
−1 to 0         Within normal range for clean text. Most real documents fall here.
−2 to −1        Mildly degraded. May indicate noisy OCR, code-heavy text, or unusual domain language. Not necessarily junk.
< −2            Two or more standard deviations below clean. Worth investigating. A reasonable threshold for triggering re-OCR or re-encoding.
< −5            Almost certainly garbled. Wrong charset decoding, byte-reversed content, or heavy corruption.
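If it helps to act on these ranges in code, they can be folded into a small classifier. The sketch below is only one way to do it: the enum, its constant names, and the handling of values that fall exactly on a boundary are assumptions, not part of the Tika API.

```java
class ZScoreBands {

    /** Illustrative band names for the z-score ranges listed above. */
    enum Band { ABOVE_AVERAGE, NORMAL, MILDLY_DEGRADED, SUSPECT, GARBLED }

    /** Maps a z-score to a band; which side a boundary value falls on is an assumption. */
    static Band classify(double z) {
        if (Double.isNaN(z)) {
            throw new IllegalArgumentException("z-score is not a number");
        }
        if (z > 0.0) return Band.ABOVE_AVERAGE;
        if (z >= -1.0) return Band.NORMAL;
        if (z >= -2.0) return Band.MILDLY_DEGRADED;
        if (z >= -5.0) return Band.SUSPECT;   // below -2 but not below -5
        return Band.GARBLED;                  // below -5
    }

    public static void main(String[] args) {
        for (double z : new double[] {0.4, -0.74, -1.6, -3.0, -10.0}) {
            System.out.println(z + " -> " + classify(z));
        }
    }
}
```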

The TextQualityScore also carries:

  • getPClean(): sigmoid(z), a rough probability estimate in [0, 1] that the text is clean. Useful for ranking candidates; the absolute value is not calibrated as a true probability.

  • getCiLow() / getCiHigh(): the 95% confidence interval on the z-score. Narrow on long texts, wide on short ones. Use these when making threshold decisions on short strings.

  • getDominantScript(): the Unicode script name used for scoring (e.g. "LATIN", "CYRILLIC", "ARABIC", "HAN"). If isUnknown() is true, the dominant script had no model and scoring was not possible.
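The relationship between the z-score and getPClean() can be illustrated with the standard logistic function. The sketch assumes that no scaling or offset is applied to z before the sigmoid, which matches the example values shown earlier (z = −0.74 gives roughly 0.32) but is not confirmed beyond that; the class name is illustrative.

```java
class PCleanSketch {

    /** Standard logistic function: maps any real z to the open interval (0, 1). */
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        for (double z : new double[] {1.0, 0.0, -0.74, -2.0, -5.0}) {
            System.out.printf("z = %5.2f  sigmoid(z) = %.3f%n", z, sigmoid(z));
        }
    }
}
```

Note that z = 0, which means "exactly as expected for clean text", maps to 0.5. This is one reason to treat the value as a ranking signal rather than a calibrated probability.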

Comparing two candidates

Charset detection is the primary use case for compare(): given the same raw bytes decoded two different ways, which decoding looks more like natural language?

The caller is responsible for decoding the raw bytes; the detector just compares the resulting strings. Each candidate is given a human-readable label (typically the charset name) that is echoed back in the result.

byte[] rawBytes = ...; // bytes from an unknown-encoding file

String asCp1252 = new String(rawBytes, Charset.forName("cp1252"));
String asCp1251 = new String(rawBytes, Charset.forName("cp1251"));

TextQualityComparison result = detector.compare("cp1252", asCp1252, "cp1251", asCp1251);

System.out.println(result.winner());  // "A" or "B"
System.out.println(result.delta());   // z-score separation between the two

if (result.winner().equals("B") && result.delta() > 1.0) {
    // cp1251 is confidently the better decoding
}

delta() is the absolute difference in z-scores between the two candidates. As a rough guide:

Delta       Confidence
< 0.5       Very uncertain: both decodings look similar to the model. Fall back to other heuristics.
0.5 – 1.0   Weak signal: the winner is likely correct but not assured.
1.0 – 3.0   Useful signal. Trust the winner for most production purposes.
> 3.0       High confidence. One decoding is clearly more language-like.
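The delta guide can be expressed as a small helper that takes the two candidates' z-scores directly. This is a sketch under stated assumptions: the class, the enum, and the treatment of values exactly on a boundary are hypothetical, and the z-scores are passed as plain doubles so the example runs without Tika on the classpath.

```java
class DeltaConfidence {

    /** Illustrative labels for the delta ranges listed above. */
    enum Confidence { VERY_UNCERTAIN, WEAK, USEFUL, HIGH }

    /** Absolute difference between the two candidates' z-scores. */
    static double delta(double zScoreA, double zScoreB) {
        return Math.abs(zScoreA - zScoreB);
    }

    /** Maps a delta to a confidence label; boundary handling is an assumption. */
    static Confidence confidence(double delta) {
        if (delta < 0.5) return Confidence.VERY_UNCERTAIN;
        if (delta < 1.0) return Confidence.WEAK;
        if (delta <= 3.0) return Confidence.USEFUL;
        return Confidence.HIGH;
    }

    public static void main(String[] args) {
        // Hypothetical z-scores for two decodings of the same bytes.
        double zA = -4.0;
        double zB = -0.5;
        System.out.println(confidence(delta(zA, zB)));
    }
}
```

A caller would typically abstain or consult other heuristics when the result is VERY_UNCERTAIN, rather than pick the nominal winner.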

Listing known scripts

detector.knownScripts();  // returns Set<String>
// e.g. [ARABIC, ARMENIAN, BENGALI, CYRILLIC, DEVANAGARI, GEORGIAN,
//        GREEK, GUJARATI, GURMUKHI, HAN, HANGUL, HEBREW, HIRAGANA,
//        KANNADA, KHMER, LAO, LATIN, MALAYALAM, MYANMAR, ORIYA,
//        SINHALA, TAMIL, TELUGU, THAANA, THAI, TIBETAN, ...]

If the dominant script of an input is not in this set, score() returns a TextQualityScore where isUnknown() is true and no z-score is available.

Thresholds and operating points

There is no universally correct threshold. The right cutoff depends on your content and tolerance for false positives (flagging good text as junk).

Starting points:

  • Trigger re-OCR: z < −2.0 (catches ~95% of severe corruption while flagging ~2–5% of legitimate text on average, more for short strings).

  • Charset tiebreaking: prefer the candidate with the higher z-score when delta() > 1.0; abstain if delta() < 0.5.

  • Training data filtering: z < −1.5 to remove mojibake and bot-generated noise from NLP corpora.

For short text (under ~50 UTF-8 bytes), use getCiLow() rather than getZScore() for threshold decisions, since the confidence interval widens substantially.
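One way to encode these starting points is a small policy helper. The sketch below takes the z-score and the lower confidence bound as plain doubles (in practice they would come from getZScore() and getCiLow()) so that it runs without Tika; the class, constant, and method names are hypothetical, and the cutoffs are the starting points above, not tuned values.

```java
import java.nio.charset.StandardCharsets;

class ThresholdPolicy {

    static final double RE_OCR_THRESHOLD = -2.0;
    static final double TRAINING_FILTER_THRESHOLD = -1.5;
    static final int SHORT_TEXT_BYTES = 50;

    /** Picks the value to compare against a threshold: ciLow for short text, z otherwise. */
    static double decisionValue(String text, double zScore, double ciLow) {
        int utf8Length = text.getBytes(StandardCharsets.UTF_8).length;
        return utf8Length < SHORT_TEXT_BYTES ? ciLow : zScore;
    }

    /** True when the text falls below the re-OCR starting point. */
    static boolean shouldReOcr(String text, double zScore, double ciLow) {
        return decisionValue(text, zScore, ciLow) < RE_OCR_THRESHOLD;
    }

    /** True when the text falls below the training-data filtering starting point. */
    static boolean shouldDropFromCorpus(String text, double zScore, double ciLow) {
        return decisionValue(text, zScore, ciLow) < TRAINING_FILTER_THRESHOLD;
    }

    public static void main(String[] args) {
        // Hypothetical scores for a short string: z = -1.5, lower CI bound = -2.6.
        System.out.println(shouldReOcr("short sample", -1.5, -2.6));
    }
}
```

Length is measured in UTF-8 bytes rather than characters because the model scores byte bigrams, so a 20-character Cyrillic string already supplies about 40 bytes.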

Limitations

  • Script coverage: only scripts with a trained model can be scored. Unknown scripts return isUnknown() = true.

  • Short text: scoring is unreliable below ~15 UTF-8 bytes. The model needs at least a few bigrams to produce a stable estimate.

  • Closely related charsets in the same script pool: the LATIN model is trained across hundreds of languages, which dilutes the signal for closely related Western European and Baltic encodings (e.g., cp1252 vs. cp1257 on Lithuanian text). The winner is usually correct, but delta may be small (< 0.5).

  • Deliberately obfuscated text: content designed to look like natural language (e.g. by adversarial padding) is not detected.

Further reading

For training methodology, model format, evaluation harness, and guidance on improving the model, see Building the Junk Detector.