Class JunkDetector
- All Implemented Interfaces:
TextQualityDetector
Scoring combines up to three features, depending on the model version:
- Byte-bigram log-probability: 256×256 table of log P(b|a) over consecutive byte pairs in the UTF-8 encoding.
- Unicode named-block transition log-probability (version 2+): N×N table of log P(block_b | block_a), where block IDs are the named Character.UnicodeBlock values (BASIC_LATIN, ARABIC, CJK_UNIFIED_IDEOGRAPHS, etc.).
- Control-byte fraction (version 2+): fraction of bytes in the control ranges [0x01–0x08, 0x0B, 0x0C, 0x0E–0x1F, 0x7F].
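The two byte-level features can be illustrated with a minimal sketch. The class and method names below are hypothetical, the table layout logProb[a][b] is an assumption, and averaging the bigram log-probabilities over the number of pairs is one plausible normalisation rather than the documented one.

```java
import java.nio.charset.StandardCharsets;

final class ByteFeatureSketch {

    /** True for bytes in 0x01-0x08, 0x0B, 0x0C, 0x0E-0x1F, or 0x7F. */
    static boolean isControlByte(int b) {
        return (b >= 0x01 && b <= 0x08) || b == 0x0B || b == 0x0C
                || (b >= 0x0E && b <= 0x1F) || b == 0x7F;
    }

    /** Fraction of UTF-8 bytes in the control ranges; 0 for empty input. */
    static double controlByteFraction(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        if (bytes.length == 0) {
            return 0.0;
        }
        int control = 0;
        for (byte raw : bytes) {
            if (isControlByte(raw & 0xFF)) { // mask to read the byte as unsigned
                control++;
            }
        }
        return (double) control / bytes.length;
    }

    /**
     * Mean of logProb[a][b] over consecutive UTF-8 byte pairs (a, b).
     * Assumption: the table is indexed [previous byte][next byte].
     */
    static double meanBigramLogProb(String text, double[][] logProb) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        if (bytes.length < 2) {
            return 0.0; // no pairs to score
        }
        double sum = 0.0;
        for (int i = 1; i < bytes.length; i++) {
            sum += logProb[bytes[i - 1] & 0xFF][bytes[i] & 0xFF];
        }
        return sum / (bytes.length - 1);
    }

    public static void main(String[] args) {
        // Tab and newline are outside the control ranges listed above.
        System.out.println(controlByteFraction("a\tb\n"));
        System.out.println(controlByteFraction("ab\u0001\u0002"));
    }
}
```

Note that tab (0x09), line feed (0x0A), and carriage return (0x0D) are deliberately outside the listed ranges, so ordinary whitespace does not count toward the fraction.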
All features are calibrated (mu/sigma) on held-out dev text so their z-scores are on a common scale.
Features are combined by a per-script logistic regression classifier:
w1*z1 + w2*z2 + w3*z3 + w4*z4 + bias, where weights are fit on
clean vs. corrupted dev windows. The natural junk threshold is 0 (positive
logit = clean); use a negative threshold for conservative detection
(e.g., score < -1).
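As a rough illustration of the calibration and combination steps, the sketch below standardises a raw feature value and forms the linear logit. All names are hypothetical and the numbers in main are made up for the example; real weights, mu, and sigma come from the loaded model.

```java
final class LogitSketch {

    /** Standardises a raw feature value with its calibration mean and standard deviation. */
    static double zScore(double raw, double mu, double sigma) {
        return (raw - mu) / sigma;
    }

    /** Computes w[0]*z[0] + ... + w[n-1]*z[n-1] + bias. */
    static double logit(double[] weights, double[] z, double bias) {
        if (weights.length != z.length) {
            throw new IllegalArgumentException("weights and z-scores must have the same length");
        }
        double sum = bias;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * z[i];
        }
        return sum;
    }

    /** Flags text as junk when its logit falls below the chosen threshold. */
    static boolean isJunk(double logit, double threshold) {
        return logit < threshold;
    }

    public static void main(String[] args) {
        double z1 = zScore(-3.0, -2.0, 0.5); // illustrative values only
        double score = logit(new double[] {1.0, 0.5}, new double[] {z1, 1.0}, 0.25);
        System.out.println(score + " junk at threshold 0: " + isJunk(score, 0.0));
    }
}
```

A threshold of 0 flags every negative logit, while a threshold such as -1 flags only text that scores well below the clean/corrupted boundary.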
Instances are immutable and thread-safe after construction.
Typical usage:
JunkDetector detector = JunkDetector.loadFromClasspath();
TextQualityScore score = detector.score("some text");
if (score.getZScore() < 0) { ... flag as junk ... }
// Arbitrate between two charset decodings
TextQualityComparison result = detector.compare("cp1252", ascp1252, "cp1251", ascp1251);
String winner = result.winner(); // "A" or "B"
-
Field Summary
Modifier and Type | Field | Description
static final String | DEFAULT_MODEL_RESOURCE | Classpath resource path for the bundled production model.
-
Method Summary
Modifier and Type | Method | Description
TextQualityComparison | compare(String labelA, String candidateA, String labelB, String candidateB) | Compares two candidate strings and returns which is higher-quality (cleaner text).
int | getModelVersion() | Returns the version of the loaded model (1, 2, or 3).
 | knownScripts() | Returns the set of script names this model knows about.
static JunkDetector | load(InputStream rawIs) | Loads a model from an InputStream.
static JunkDetector | loadFromClasspath() | Loads the bundled model from the classpath.
static JunkDetector | loadFromPath(Path path) | Loads a model from the given file path.
static JunkDetector | provider() | ServiceLoader provider hook (Java 9+).
TextQualityScore | score(String text) | Scores the given string for text quality.
-
Field Details
-
DEFAULT_MODEL_RESOURCE
Classpath resource path for the bundled production model.
- See Also:
-
-
Method Details
-
loadFromClasspath
Loads the bundled model from the classpath.
- Throws:
IOException - if the model resource is missing or malformed
-
provider
ServiceLoader provider hook (Java 9+). Allows JunkDetector to be registered as a TextQualityDetector SPI implementation even though its construction goes through loadFromClasspath() rather than a public no-arg constructor.
- Throws:
UncheckedIOException - if the bundled model cannot be loaded
-
loadFromPath
Loads a model from the given file path. The file may be gzipped or raw.
- Throws:
IOException
-
load
Loads a model from an InputStream. Gzip detection is automatic. Supports model versions 1 through 5.
- Throws:
IOException
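One common way to detect gzip automatically is to peek at the two-byte magic number (0x1F 0x8B) before deciding whether to wrap the stream. The sketch below shows that technique with hypothetical names; it is an assumption about how detection might work, not a description of what load actually does.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipSniffSketch {

    /** Returns a decompressing stream if the input starts with the gzip magic bytes, else the input itself. */
    static InputStream maybeGunzip(InputStream rawIs) throws IOException {
        BufferedInputStream in = new BufferedInputStream(rawIs);
        in.mark(2);          // remember the start so the peeked bytes can be replayed
        int b0 = in.read();
        int b1 = in.read();
        in.reset();
        boolean gzipped = b0 == 0x1F && b1 == 0x8B;
        return gzipped ? new GZIPInputStream(in) : in;
    }

    /** Helper for the demonstration: gzip-compresses a byte array in memory. */
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }
}
```

Because the peek uses mark and reset on a buffered wrapper, the caller's stream does not need to support mark itself.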
-
score
Scores the given string for text quality.
The text is split into contiguous runs of the same Unicode script. Each run is scored against its own script model. Logits are combined as a byte-count-weighted average, so mixed-script text (e.g. half LATIN, half HAN) is scored fairly without arbitrarily picking one script. COMMON, INHERITED, and UNKNOWN codepoints (spaces, punctuation, digits) are attached to the preceding script run.
- Specified by:
score in interface TextQualityDetector
- Parameters:
text - the string to score; must not be null
- Returns:
a TextQualityScore; check TextQualityScore.isUnknown() if the input is empty or the script is not covered by the model
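The run-splitting and byte-weighted averaging described for score can be sketched as follows. The class, the Run type, and the per-run scoring callback are hypothetical. Attaching leading neutral codepoints to the first script run, and labelling all-neutral text as COMMON, are assumptions the description above does not cover.

```java
import java.lang.Character.UnicodeScript;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

final class ScriptRunSketch {

    /** A contiguous piece of text attributed to a single script. */
    static final class Run {
        final UnicodeScript script;
        final String text;

        Run(UnicodeScript script, String text) {
            this.script = script;
            this.text = text;
        }
    }

    /** COMMON, INHERITED, and UNKNOWN codepoints never start a new run. */
    static boolean isNeutral(UnicodeScript s) {
        return s == UnicodeScript.COMMON || s == UnicodeScript.INHERITED
                || s == UnicodeScript.UNKNOWN;
    }

    /** Splits text into same-script runs; neutral codepoints stay with the preceding run. */
    static List<Run> split(String text) {
        List<Run> runs = new ArrayList<>();
        UnicodeScript current = null;
        StringBuilder buf = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript s = UnicodeScript.of(cp);
            if (!isNeutral(s)) {
                if (current != null && s != current) {
                    runs.add(new Run(current, buf.toString())); // script changed: close the run
                    buf.setLength(0);
                }
                current = s;
            }
            buf.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (buf.length() > 0) {
            runs.add(new Run(current == null ? UnicodeScript.COMMON : current, buf.toString()));
        }
        return runs;
    }

    /** Averages per-run logits, weighting each run by its UTF-8 byte count. */
    static double weightedAverage(List<Run> runs, ToDoubleFunction<Run> logitOf) {
        double sum = 0.0;
        long totalBytes = 0;
        for (Run r : runs) {
            int n = r.text.getBytes(StandardCharsets.UTF_8).length;
            sum += n * logitOf.applyAsDouble(r);
            totalBytes += n;
        }
        return totalBytes == 0 ? 0.0 : sum / totalBytes;
    }
}
```

For the input "abc, " followed by two Han characters, this yields a 5-byte LATIN run (the comma and space stay with it) and a 6-byte HAN run, so the HAN logit carries slightly more weight in the average.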
-
compare
public TextQualityComparison compare(String labelA, String candidateA, String labelB, String candidateB)
Compares two candidate strings and returns which is higher-quality (cleaner text).
A common use case is charset-decoding arbitration: given raw bytes decoded via two different charsets, pass each decoded string here with a human-readable label (e.g. the charset name) and the detector will pick the one that looks more like natural language.
Each candidate is scored independently via score(String). The candidate with the higher score wins.
An UNKNOWN score (script not in model) is treated as neutral (0) rather than -∞. This prevents a garbled-but-recognisable decoding from beating a correct decoding whose script happens to be unknown to the model, for example a pure-katakana zip entry name decoded as Shift-JIS (UNKNOWN) vs. the same bytes decoded as UTF-8 (garbled LATIN, negative z-score).
- Specified by:
compare in interface TextQualityDetector
- Parameters:
labelA - human-readable label for candidate A (e.g. "cp1252")
candidateA - first candidate string
labelB - human-readable label for candidate B (e.g. "cp1251")
candidateB - second candidate string
- Returns:
a TextQualityComparison with the winning label and confidence delta
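The neutral-UNKNOWN rule can be illustrated with a small sketch. It models an unknown score as an empty OptionalDouble, which is a stand-in chosen for this example and not the library's representation. Tie-breaking in favour of candidate A is also an assumption.

```java
import java.util.OptionalDouble;

final class CompareSketch {

    /** Maps an unknown (empty) score to the neutral value 0. */
    static double effective(OptionalDouble score) {
        return score.orElse(0.0);
    }

    /** Returns "A" or "B" for the higher effective score; ties go to A in this sketch. */
    static String winner(OptionalDouble scoreA, OptionalDouble scoreB) {
        return effective(scoreA) >= effective(scoreB) ? "A" : "B";
    }

    /** Absolute difference between the two effective scores. */
    static double delta(OptionalDouble scoreA, OptionalDouble scoreB) {
        return Math.abs(effective(scoreA) - effective(scoreB));
    }

    public static void main(String[] args) {
        // Candidate A: script unknown to the model. Candidate B: garbled text with a negative score.
        OptionalDouble a = OptionalDouble.empty();
        OptionalDouble b = OptionalDouble.of(-2.5); // illustrative value only
        System.out.println(winner(a, b) + " by " + delta(a, b));
    }
}
```

With -∞ in place of 0, candidate B would win here despite being garbled, which is the failure the neutral treatment avoids.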
-
knownScripts
Returns the set of script names this model knows about.
-
getModelVersion
public int getModelVersion()
Returns the version of the loaded model (1, 2, or 3).
-