Class JunkDetector

java.lang.Object
org.apache.tika.ml.junkdetect.JunkDetector
All Implemented Interfaces:
TextQualityDetector

public final class JunkDetector extends Object implements TextQualityDetector
Language-agnostic text quality scorer. Discriminates clean UTF-8 text from mojibake, reversed text, wrong-codec decodings, and other corruption forms.

Scoring combines up to three features, depending on the model version:

  1. Byte-bigram log-probability — 256×256 table of log P(b|a) over consecutive byte pairs in the UTF-8 encoding.
  2. Unicode named-block transition log-probability (version 2+) — N×N table of log P(block_b | block_a) where block IDs are the named Character.UnicodeBlock values (BASIC_LATIN, ARABIC, CJK_UNIFIED_IDEOGRAPHS, etc.).
  3. Control-byte fraction (version 2+) — fraction of bytes in control ranges [0x01–0x08, 0x0B, 0x0C, 0x0E–0x1F, 0x7F].
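As a rough illustration of features 1 and 3, the sketch below computes a control-byte fraction over the listed ranges and a byte-bigram log-probability from a caller-supplied table. The class name `FeatureSketch`, the method names, and the choice to average the bigram log-probabilities over the number of pairs are assumptions made for illustration; the page does not say how the real detector normalizes the bigram score.

```java
/** Illustrative sketch only; not the Tika implementation. */
final class FeatureSketch {

    /** Fraction of bytes in the control ranges 0x01-0x08, 0x0B, 0x0C, 0x0E-0x1F, 0x7F. */
    static double controlByteFraction(byte[] utf8) {
        if (utf8.length == 0) {
            return 0.0; // assumption: empty input has no control bytes
        }
        int control = 0;
        for (byte raw : utf8) {
            int b = raw & 0xFF; // Java bytes are signed; read as unsigned
            boolean isControl = (b >= 0x01 && b <= 0x08) || b == 0x0B || b == 0x0C
                    || (b >= 0x0E && b <= 0x1F) || b == 0x7F;
            if (isControl) {
                control++;
            }
        }
        return (double) control / utf8.length;
    }

    /**
     * Mean of log P(b|a) over consecutive byte pairs, looked up in a 256x256 table.
     * Averaging over the pair count is an assumption of this sketch.
     */
    static double meanBigramLogProb(byte[] utf8, double[][] logProb) {
        if (utf8.length < 2) {
            return 0.0; // assumption: no pairs means no evidence either way
        }
        double sum = 0.0;
        for (int i = 1; i < utf8.length; i++) {
            sum += logProb[utf8[i - 1] & 0xFF][utf8[i] & 0xFF];
        }
        return sum / (utf8.length - 1);
    }

    public static void main(String[] args) {
        // Tab (0x09) is outside the listed control ranges; 0x01 is inside them.
        byte[] sample = {'o', 'k', 0x09, 0x01};
        System.out.println(controlByteFraction(sample));
    }
}
```

Note that tab, line feed, and carriage return (0x09, 0x0A, 0x0D) fall outside the listed ranges, so ordinary whitespace does not count toward the fraction.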

All features are calibrated (mu/sigma) on held-out dev text so their z-scores are on a common scale.

Features are combined by a per-script logistic regression classifier: w1*z1 + w2*z2 + w3*z3 + w4*z4 + bias, where weights are fit on clean vs. corrupted dev windows. The natural junk threshold is 0 (positive logit = clean); use a negative threshold for conservative detection (e.g., score < -1).
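The calibration and combination steps can be sketched as follows. This is a minimal sketch with hypothetical names (`LogitSketch`, `zScore`, `logit`, `isJunk`); it takes the weights as an array because the number of features depends on the model version, and it is not the detector's actual code.

```java
/** Illustrative sketch only; not the Tika implementation. */
final class LogitSketch {

    /** Standardizes a raw feature value using calibration statistics mu and sigma. */
    static double zScore(double raw, double mu, double sigma) {
        return (raw - mu) / sigma;
    }

    /** Computes w1*z1 + ... + wn*zn + bias; a positive result reads as clean. */
    static double logit(double[] weights, double[] z, double bias) {
        if (weights.length != z.length) {
            throw new IllegalArgumentException("weights and z-scores must have the same length");
        }
        double sum = bias;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * z[i];
        }
        return sum;
    }

    /** Flags text as junk when the logit falls below the chosen threshold. */
    static boolean isJunk(double logit, double threshold) {
        return logit < threshold;
    }

    public static void main(String[] args) {
        double z1 = zScore(-3.0, -2.0, 0.5);
        double score = logit(new double[] {1.0, 0.5}, new double[] {z1, 1.0}, 0.25);
        System.out.println(isJunk(score, 0.0));  // natural threshold
        System.out.println(isJunk(score, -1.0)); // conservative threshold
    }
}
```

Lowering the threshold from 0 to -1 trades recall for precision: fewer borderline strings are flagged, so fewer clean strings are misreported as junk.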

Instances are immutable and thread-safe after construction.

Typical usage:


 JunkDetector detector = JunkDetector.loadFromClasspath();
 TextQualityScore score = detector.score("some text");
 if (score.getZScore() < 0) { /* flag as junk */ }

 // Arbitrate between two charset decodings of the same bytes
 // (ascp1252 and ascp1251 are the strings decoded as cp1252 and cp1251)
 TextQualityComparison result = detector.compare("cp1252", ascp1252, "cp1251", ascp1251);
 String winner = result.winner();  // "A" or "B"
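
The usage snippet leaves open where the two candidate strings come from. A minimal sketch, using only the JDK, of decoding the same raw bytes under both charsets is shown here. The class name `CandidateDecodings` and the helper `decode` are hypothetical, and the sketch assumes the runtime provides the windows-1252 and windows-1251 charsets, which the Java specification does not guarantee.

```java
import java.nio.charset.Charset;

/** Illustrative sketch only; produces the two candidate strings for arbitration. */
final class CandidateDecodings {

    /** Decodes raw bytes with the named charset (assumed to be available in this JVM). */
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Six bytes that form a Russian word in windows-1251
        // and accented Latin letters in windows-1252.
        byte[] raw = {(byte) 0xCF, (byte) 0xF0, (byte) 0xE8, (byte) 0xE2, (byte) 0xE5, (byte) 0xF2};
        String ascp1252 = decode(raw, "windows-1252");
        String ascp1251 = decode(raw, "windows-1251");
        // These are the two strings the snippet above would hand to detector.compare(...).
        System.out.println(ascp1252);
        System.out.println(ascp1251);
    }
}
```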
 
  • Field Details

    • DEFAULT_MODEL_RESOURCE

      public static final String DEFAULT_MODEL_RESOURCE
      Classpath resource path for the bundled production model.
  • Method Details

    • loadFromClasspath

      public static JunkDetector loadFromClasspath() throws IOException
      Loads the bundled model from the classpath.
      Throws:
      IOException - if the model resource is missing or malformed
    • provider

      public static JunkDetector provider()
      ServiceLoader provider hook (Java 9+). Allows JunkDetector to be registered as a TextQualityDetector SPI implementation even though its construction goes through loadFromClasspath() rather than a public no-arg constructor.
      Throws:
      UncheckedIOException - if the bundled model cannot be loaded
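
The provider-hook pattern can be sketched with stand-in types. `QualityService` and `LoadedService` below are hypothetical and only mirror the shape described above: a class with no public no-arg constructor, a factory that throws a checked `IOException`, and a static `provider()` method that wraps that exception. In a modular deployment, a `provides ... with ...` clause in the module declaration is what lets `ServiceLoader` call such a method.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical service interface standing in for TextQualityDetector. */
interface QualityService {
    String name();
}

/**
 * Hypothetical implementation with no public no-arg constructor.
 * A real provider class would be declared public.
 */
final class LoadedService implements QualityService {
    private final String name;

    private LoadedService(String name) {
        this.name = name;
    }

    /** Factory that can fail with a checked exception, like a model load. */
    static LoadedService load(String resource) throws IOException {
        if (resource == null || resource.isEmpty()) {
            throw new IOException("model resource missing");
        }
        return new LoadedService(resource);
    }

    /** Static provider method: no parameters, returns a type assignable to the service. */
    public static LoadedService provider() {
        try {
            return load("bundled-model");
        } catch (IOException e) {
            // ServiceLoader cannot propagate checked exceptions from provider()
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public String name() {
        return name;
    }
}
```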
    • loadFromPath

      public static JunkDetector loadFromPath(Path path) throws IOException
      Loads a model from the given file path. The file may be gzipped or raw.
      Throws:
      IOException - if the file cannot be read or the model is malformed
    • load

      public static JunkDetector load(InputStream rawIs) throws IOException
      Loads a model from an InputStream. Gzip detection is automatic. Supports model versions 1 through 5.
      Throws:
      IOException - if the stream cannot be read or the model is malformed
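
One common way to make gzip detection automatic is to peek at the first two bytes for the gzip magic number (0x1F 0x8B) and then rewind. The sketch below shows that technique with JDK classes only. Whether `load(InputStream)` works this way is an assumption, and `GzipSniffSketch` and `maybeGunzip` are hypothetical names.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

/** Illustrative sketch only; not the Tika implementation. */
final class GzipSniffSketch {

    /** Returns a decompressing stream if the input starts with the gzip magic bytes. */
    static InputStream maybeGunzip(InputStream rawIs) throws IOException {
        BufferedInputStream in = new BufferedInputStream(rawIs);
        in.mark(2);            // remember the start so the peek can be undone
        int b0 = in.read();
        int b1 = in.read();
        in.reset();            // rewind to the first byte
        boolean gzipped = b0 == 0x1F && b1 == 0x8B;
        return gzipped ? new GZIPInputStream(in) : in;
    }
}
```

Wrapping the input in a `BufferedInputStream` is what makes `mark` and `reset` safe here, since an arbitrary `InputStream` may not support them.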
    • score

      public TextQualityScore score(String text)
      Scores the given string for text quality.

      The text is split into contiguous runs of the same Unicode script. Each run is scored against its own script model. Logits are combined as a byte-count-weighted average, so mixed-script text (e.g. half LATIN, half HAN) is scored fairly without arbitrarily picking one script. COMMON, INHERITED, and UNKNOWN codepoints (spaces, punctuation, digits) are attached to the preceding script run.

      Specified by:
      score in interface TextQualityDetector
      Parameters:
      text - the string to score; must not be null
      Returns:
      a TextQualityScore; TextQualityScore.isUnknown() returns true if the input is empty or the script is not covered by the model
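
The run splitting and byte-count weighting described for score(String) can be sketched with `Character.UnicodeScript`. This is a sketch under stated assumptions, not the detector's code: the names `ScriptRunSketch`, `Run`, `splitRuns`, and `weightedLogit` are hypothetical, neutral characters that appear before any scripted character are folded into the first run, text with no scripted characters yields no runs, and the per-run scorer is supplied by the caller.

```java
import java.lang.Character.UnicodeScript;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Illustrative sketch only; not the Tika implementation. */
final class ScriptRunSketch {

    /** One contiguous run of text assigned to a single script. */
    static final class Run {
        final UnicodeScript script;
        final String text;

        Run(UnicodeScript script, String text) {
            this.script = script;
            this.text = text;
        }
    }

    /** COMMON, INHERITED, and UNKNOWN code points never start a new run. */
    static boolean isNeutral(UnicodeScript s) {
        return s == UnicodeScript.COMMON || s == UnicodeScript.INHERITED
                || s == UnicodeScript.UNKNOWN;
    }

    /** Splits text into same-script runs, attaching neutral characters to the preceding run. */
    static List<Run> splitRuns(String text) {
        List<Run> runs = new ArrayList<>();
        UnicodeScript current = null;
        StringBuilder buf = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript s = UnicodeScript.of(cp);
            if (!isNeutral(s) && s != current) {
                if (current != null) {
                    runs.add(new Run(current, buf.toString()));
                    buf.setLength(0);
                }
                current = s; // leading neutral characters stay in buf (assumption)
            }
            buf.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (current != null) {
            runs.add(new Run(current, buf.toString()));
        }
        return runs;
    }

    /** Averages per-run logits weighted by each run's UTF-8 byte count; NaN if there are no bytes. */
    static double weightedLogit(List<Run> runs, ToDoubleFunction<Run> scorer) {
        double weighted = 0.0;
        long totalBytes = 0;
        for (Run r : runs) {
            int n = r.text.getBytes(StandardCharsets.UTF_8).length;
            weighted += n * scorer.applyAsDouble(r);
            totalBytes += n;
        }
        return totalBytes == 0 ? Double.NaN : weighted / totalBytes;
    }
}
```

Weighting by bytes rather than by characters means a Han run counts for more than a Latin run with the same number of characters, because each Han character occupies three bytes in UTF-8.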
    • compare

      public TextQualityComparison compare(String labelA, String candidateA, String labelB, String candidateB)
      Compares two candidate strings and returns which is higher-quality (cleaner text).

      A common use case is charset-decoding arbitration: given raw bytes decoded via two different charsets, pass each decoded string here with a human-readable label (e.g. the charset name) and the detector will pick the one that looks more like natural language.

      Each candidate is scored independently via score(String). The candidate with the higher score wins.

      An UNKNOWN score (script not in model) is treated as neutral (0) rather than -∞. This prevents a garbled-but-recognisable decoding from beating a correct decoding whose script happens to be unknown to the model — for example, a pure-katakana zip entry name decoded as Shift-JIS (UNKNOWN) vs. the same bytes decoded as UTF-8 (garbled LATIN, negative z-score).

      Specified by:
      compare in interface TextQualityDetector
      Parameters:
      labelA - human-readable label for candidate A (e.g. "cp1252")
      candidateA - first candidate string
      labelB - human-readable label for candidate B (e.g. "cp1251")
      candidateB - second candidate string
      Returns:
      a TextQualityComparison with the winning label and confidence delta
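
The neutral treatment of UNKNOWN scores in compare can be sketched as follows. The sketch represents an UNKNOWN score as an empty `OptionalDouble`. The names `CompareSketch`, `effectiveScore`, `winner`, and `delta`, the tie-break in favor of candidate A, and the use of an absolute difference as the delta are all assumptions made for illustration.

```java
import java.util.OptionalDouble;

/** Illustrative sketch only; not the Tika implementation. */
final class CompareSketch {

    /** Maps an UNKNOWN (empty) score to the neutral value 0 instead of negative infinity. */
    static double effectiveScore(OptionalDouble score) {
        return score.orElse(0.0);
    }

    /** Returns the label of the higher-scoring candidate; ties go to A (assumption). */
    static String winner(String labelA, OptionalDouble scoreA,
                         String labelB, OptionalDouble scoreB) {
        return effectiveScore(scoreA) >= effectiveScore(scoreB) ? labelA : labelB;
    }

    /** Absolute gap between the two effective scores, as a simple confidence measure. */
    static double delta(OptionalDouble scoreA, OptionalDouble scoreB) {
        return Math.abs(effectiveScore(scoreA) - effectiveScore(scoreB));
    }

    public static void main(String[] args) {
        // Shift-JIS decoding scores UNKNOWN; UTF-8 decoding is garbled with a negative score.
        OptionalDouble shiftJis = OptionalDouble.empty();
        OptionalDouble utf8 = OptionalDouble.of(-2.5);
        System.out.println(winner("Shift_JIS", shiftJis, "UTF-8", utf8));
    }
}
```

With UNKNOWN mapped to negative infinity instead, the garbled UTF-8 candidate in this example would win despite its negative score, which is the failure the neutral value avoids.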
    • knownScripts

      public Set<String> knownScripts()
      Returns the set of script names this model knows about.
    • getModelVersion

      public int getModelVersion()
      Returns the format version of the loaded model. load(InputStream) documents the supported range as versions 1 through 5.