org.apache.tika.ml.junkdetect.JunkDetector

All Implemented Interfaces: TextQualityDetector

public final class JunkDetector extends Object implements TextQualityDetector

Language-agnostic text quality scorer. Discriminates clean UTF-8 text from mojibake, reversed text, wrong-codec decodings, and other corruption forms.

Scoring combines up to three features, depending on the model version:

- Byte-bigram log-probability: a 256×256 table of log P(b|a) over consecutive byte pairs in the UTF-8 encoding.
- Unicode named-block transition log-probability (version 2+): an N×N table of log P(block_b | block_a), where block IDs are the named Character.UnicodeBlock values (BASIC_LATIN, ARABIC, CJK_UNIFIED_IDEOGRAPHS, etc.).
- Control-byte fraction (version 2+): the fraction of bytes in the control ranges [0x01–0x08, 0x0B, 0x0C, 0x0E–0x1F, 0x7F].
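The two byte-level features can be illustrated with a minimal sketch. The class and method names below (ByteFeatureSketch, controlFraction, meanBigramLogProb) are hypothetical and are not part of JunkDetector, and averaging the bigram log-probabilities over the pair count is an assumption made for the example; the real model may normalise differently.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: hypothetical helpers, not the Tika implementation. */
final class ByteFeatureSketch {

    /** True for byte values in 0x01-0x08, 0x0B, 0x0C, 0x0E-0x1F, or 0x7F. */
    static boolean isControl(int b) {
        return (b >= 0x01 && b <= 0x08) || b == 0x0B || b == 0x0C
                || (b >= 0x0E && b <= 0x1F) || b == 0x7F;
    }

    /** Fraction of the UTF-8 bytes of text that fall in the control ranges. */
    static double controlFraction(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        if (bytes.length == 0) {
            return 0.0;
        }
        int count = 0;
        for (byte b : bytes) {
            if (isControl(b & 0xFF)) {
                count++;
            }
        }
        return (double) count / bytes.length;
    }

    /**
     * Mean of logProb[a][b] over consecutive UTF-8 byte pairs (a, b).
     * logProb is assumed to be a 256x256 table of log P(b|a).
     */
    static double meanBigramLogProb(String text, double[][] logProb) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        if (bytes.length < 2) {
            return 0.0; // no pairs to score (assumed convention)
        }
        double sum = 0.0;
        for (int i = 1; i < bytes.length; i++) {
            sum += logProb[bytes[i - 1] & 0xFF][bytes[i] & 0xFF];
        }
        return sum / (bytes.length - 1);
    }
}
```

Note that tab (0x09), line feed (0x0A), and carriage return (0x0D) are outside the listed ranges, so ordinary whitespace does not count towards the control-byte fraction.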

All features are calibrated (mu/sigma) on held-out dev text so their z-scores are on a common scale.

Features are combined by a per-script logistic regression classifier whose logit is w1*z1 + w2*z2 + w3*z3 + w4*z4 + bias, with weights fit on clean vs. corrupted dev windows. The natural junk threshold is 0 (a positive logit means clean). For more conservative detection, flag text as junk only when the score falls below a negative threshold (e.g., score < -1).
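The calibration and combination steps can be sketched as follows. This is a minimal illustration under stated assumptions: LogitSketch and its methods are hypothetical names, and the mu, sigma, weight, and bias values would come from the loaded model rather than from the caller.

```java
/** Illustrative only: hypothetical helpers, not the Tika implementation. */
final class LogitSketch {

    /** Standardises a raw feature value using calibration mean and deviation. */
    static double zScore(double raw, double mu, double sigma) {
        return (raw - mu) / sigma;
    }

    /** Weighted sum of z-scores plus bias; a positive result means "clean". */
    static double logit(double[] z, double[] w, double bias) {
        if (z.length != w.length) {
            throw new IllegalArgumentException("one weight per feature is required");
        }
        double sum = bias;
        for (int i = 0; i < z.length; i++) {
            sum += w[i] * z[i];
        }
        return sum;
    }

    /** Flags junk when the logit falls below the chosen threshold. */
    static boolean isJunk(double logit, double threshold) {
        return logit < threshold;
    }
}
```

With a threshold of 0, any negative logit is flagged. With a threshold of -1, a mildly negative logit such as -0.5 is left alone, which is the conservative behaviour described above.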

Instances are immutable and thread-safe after construction.

Typical usage:


 JunkDetector detector = JunkDetector.loadFromClasspath();
 TextQualityScore score = detector.score("some text");
 if (score.getZScore() < 0) {
     // flag as junk
 }

 // Arbitrate between two charset decodings of the same bytes
 // (ascp1252 and ascp1251 are the strings produced by each decoding)
 TextQualityComparison result = detector.compare("cp1252", ascp1252, "cp1251", ascp1251);
 String winner = result.winner();  // "A" or "B"

Field Summary

Modifier and Type | Field | Description
static final String | DEFAULT_MODEL_RESOURCE | Classpath resource path for the bundled production model.

Method Summary

Modifier and Type | Method | Description
TextQualityComparison | compare(String labelA, String candidateA, String labelB, String candidateB) | Compares two candidate strings and returns which is higher-quality (cleaner text).
int | getModelVersion() | Returns the version of the loaded model (1, 2, or 3).
Set<String> | knownScripts() | Returns the set of script names this model knows about.
static JunkDetector | load(InputStream rawIs) | Loads a model from an InputStream.
static JunkDetector | loadFromClasspath() | Loads the bundled model from the classpath.
static JunkDetector | loadFromPath(Path path) | Loads a model from the given file path.
static JunkDetector | provider() | ServiceLoader provider hook (Java 9+).
TextQualityScore | score(String text) | Scores the given string for text quality.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- DEFAULT_MODEL_RESOURCE
  
  public static final String DEFAULT_MODEL_RESOURCE
  
  Classpath resource path for the bundled production model.
  See Also:
  
  Constant Field Values
Method Details
- loadFromClasspath
  
  public static JunkDetector loadFromClasspath() throws IOException
  
  Loads the bundled model from the classpath.
  
  Throws:
  
  IOException - if the model resource is missing or malformed
- provider
  
  public static JunkDetector provider()
  
  ServiceLoader provider hook (Java 9+). Allows JunkDetector to be registered as a TextQualityDetector SPI implementation even though its construction goes through loadFromClasspath() rather than a public no-arg constructor.
  
  Throws:
  
  UncheckedIOException - if the bundled model cannot be loaded
- loadFromPath
  
  public static JunkDetector loadFromPath(Path path) throws IOException
  
  Loads a model from the given file path. The file may be gzipped or raw.
  
  Throws:
  
  IOException - if the file cannot be read or the model is malformed
- load
  
  public static JunkDetector load(InputStream rawIs) throws IOException
  
  Loads a model from an InputStream. Gzip-detection is automatic. Supports model versions 1 through 5.
  
  Throws:
  
  IOException - if the stream cannot be read or the model is malformed
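Automatic gzip detection of the kind described for load(InputStream) is commonly done by peeking at the two-byte gzip magic number (0x1F 0x8B) and then rewinding the stream. The sketch below shows that general technique; GzipSniffSketch and maybeGunzip are hypothetical names, and whether JunkDetector detects gzip exactly this way is an assumption.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

/** Illustrative only: a hypothetical helper, not the Tika implementation. */
final class GzipSniffSketch {

    /**
     * Returns a stream of decompressed bytes if raw starts with the gzip
     * magic number, otherwise a buffered view of the unchanged bytes.
     */
    static InputStream maybeGunzip(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);           // remember the start so the peek can be undone
        int b0 = in.read();
        int b1 = in.read();
        in.reset();           // rewind: the caller sees the stream from byte 0
        boolean gzipped = b0 == 0x1F && b1 == 0x8B;
        return gzipped ? new GZIPInputStream(in) : in;
    }
}
```

Wrapping the input in a BufferedInputStream is what makes mark and reset safe here, since an arbitrary InputStream is not required to support them.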
- score
  
  public TextQualityScore score(String text)
  
  Scores the given string for text quality.
  The text is split into contiguous runs of the same Unicode script. Each run is scored against its own script model. Logits are combined as a byte-count-weighted average, so mixed-script text (e.g. half LATIN, half HAN) is scored fairly without arbitrarily picking one script. COMMON, INHERITED, and UNKNOWN codepoints (spaces, punctuation, digits) are attached to the preceding script run.
  
  Specified by:
  
  score in interface TextQualityDetector
  
  Parameters:
  
  text - the string to score; must not be null
  
  Returns:
  
  a TextQualityScore; check TextQualityScore.isUnknown() if the input is empty or the script is not covered by the model
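The run-splitting and byte-weighted averaging described for score(String) can be sketched with java.lang.Character.UnicodeScript. Everything named here (ScriptRunSketch, Run, split, weightedAverage) is hypothetical. Two choices are assumptions made for the example: neutral code points at the very start of the text are folded into the first real run, and text with no real script at all becomes a single COMMON run.

```java
import java.lang.Character.UnicodeScript;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Illustrative only: hypothetical helpers, not the Tika implementation. */
final class ScriptRunSketch {

    /** A contiguous stretch of text attributed to one script. */
    static final class Run {
        final UnicodeScript script;
        final String text;

        Run(UnicodeScript script, String text) {
            this.script = script;
            this.text = text;
        }
    }

    /** Splits text into same-script runs; neutral code points join the current run. */
    static List<Run> split(String text) {
        List<Run> runs = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        UnicodeScript current = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            UnicodeScript s = UnicodeScript.of(cp);
            boolean neutral = s == UnicodeScript.COMMON
                    || s == UnicodeScript.INHERITED
                    || s == UnicodeScript.UNKNOWN;
            if (!neutral && current != null && s != current) {
                // A different real script starts: close the previous run.
                runs.add(new Run(current, buf.toString()));
                buf.setLength(0);
            }
            if (!neutral) {
                current = s;
            }
            buf.appendCodePoint(cp);
        }
        if (buf.length() > 0) {
            runs.add(new Run(current == null ? UnicodeScript.COMMON : current, buf.toString()));
        }
        return runs;
    }

    /** Averages per-run scores, weighting each run by its UTF-8 byte count. */
    static double weightedAverage(List<Run> runs, ToDoubleFunction<Run> scorer) {
        double sum = 0.0;
        long totalBytes = 0;
        for (Run r : runs) {
            int n = r.text.getBytes(StandardCharsets.UTF_8).length;
            sum += n * scorer.applyAsDouble(r);
            totalBytes += n;
        }
        return totalBytes == 0 ? 0.0 : sum / totalBytes;
    }
}
```

Weighting by bytes rather than characters means a three-byte HAN character carries as much weight as three LATIN letters, which is why a short CJK run can balance a longer ASCII run.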
- compare
  
  public TextQualityComparison compare(String labelA, String candidateA, String labelB, String candidateB)
  
  Compares two candidate strings and returns which is higher-quality (cleaner text).
  A common use case is charset-decoding arbitration: given raw bytes decoded via two different charsets, pass each decoded string here with a human-readable label (e.g. the charset name) and the detector will pick the one that looks more like natural language.
  Each candidate is scored independently via score(String). The candidate with the higher score wins.
  An UNKNOWN score (script not in model) is treated as neutral (0) rather than -∞. This prevents a garbled-but-recognisable decoding from beating a correct decoding whose script happens to be unknown to the model — for example, a pure-katakana zip entry name decoded as Shift-JIS (UNKNOWN) vs. the same bytes decoded as UTF-8 (garbled LATIN, negative z-score).
  
  Specified by:
  
  compare in interface TextQualityDetector
  
  Parameters:
  
  labelA - human-readable label for candidate A (e.g. "cp1252")
  
  candidateA - first candidate string
  
  labelB - human-readable label for candidate B (e.g. "cp1251")
  
  candidateB - second candidate string
  
  Returns:
  
  a TextQualityComparison with the winning label and confidence delta
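The neutral treatment of UNKNOWN scores can be illustrated with a small sketch in which an empty OptionalDouble stands in for an UNKNOWN score. CompareSketch and its methods are hypothetical names. Returning the winning label and resolving ties in favour of the first candidate are assumptions made for the example, not documented behaviour of compare.

```java
import java.util.OptionalDouble;

/** Illustrative only: hypothetical helpers, not the Tika implementation. */
final class CompareSketch {

    /** An UNKNOWN (empty) score is treated as neutral 0 rather than -infinity. */
    static double effective(OptionalDouble score) {
        return score.orElse(0.0);
    }

    /** Returns the label of the candidate with the higher effective score. */
    static String winner(String labelA, OptionalDouble scoreA,
                         String labelB, OptionalDouble scoreB) {
        // Ties go to A here; this tie-break is an assumption of the sketch.
        return effective(scoreA) >= effective(scoreB) ? labelA : labelB;
    }

    /** Absolute gap between the two effective scores. */
    static double delta(OptionalDouble scoreA, OptionalDouble scoreB) {
        return Math.abs(effective(scoreA) - effective(scoreB));
    }
}
```

In the katakana example above, the Shift-JIS decoding has an UNKNOWN score (effective 0) and the UTF-8 decoding has a negative score, so the Shift-JIS decoding wins. Had UNKNOWN been treated as -∞, the garbled decoding would have won instead.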
- knownScripts
  
  public Set<String> knownScripts()
  
  Returns the set of script names this model knows about.
- getModelVersion
  
  public int getModelVersion()
  
  Returns the version of the loaded model (1, 2, or 3).
