Class TrainJunkModel
Trains the junk-detection model from the training data produced by BuildJunkTrainingData.
For each script group (identified by a {script}.train.gz file),
four features are trained and then combined by a per-script logistic
regression classifier:
- Byte-bigram log-probability: 256×256 table of log P(b|a) over consecutive byte pairs in the UTF-8 encoding.
- Unicode named-block transition log-probability: N×N table of
log P(block_b | block_a), where block ID is determined by
Character.UnicodeBlock.of(int), which yields one of the ~327 named Unicode blocks plus one extra bucket for unassigned codepoints.
- Control-byte fraction: fraction of bytes in control-character
ranges ([0x01–0x08, 0x0B, 0x0C, 0x0E–0x1F, 0x7F]). Stored as
−fraction so the z-score convention matches the other features (higher = cleaner).
- Script-transition log-probability: global table of log P(script_b | script_a)
over raw Character.UnicodeScript values (excluding COMMON, INHERITED, UNKNOWN), pooled across all training scripts (z4).
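As a rough illustration of the first and third features, the sketch below computes a negated control-byte fraction and an average byte-bigram log-probability over a UTF-8 window. The class name JunkFeatureSketch, the row-major table layout (a * 256 + b), the averaging over pairs, and the handling of very short inputs are all assumptions made for this example, not details confirmed by TrainJunkModel.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; names and conventions are illustrative only. */
public class JunkFeatureSketch {

    /** True for bytes in 0x01-0x08, 0x0B, 0x0C, 0x0E-0x1F, or 0x7F. */
    static boolean isControlByte(int b) {
        return (b >= 0x01 && b <= 0x08) || b == 0x0B || b == 0x0C
                || (b >= 0x0E && b <= 0x1F) || b == 0x7F;
    }

    /** Negated fraction of control bytes, so that higher means cleaner. */
    static double negControlFraction(byte[] utf8) {
        if (utf8.length == 0) {
            return 0.0; // assumption: an empty window counts as clean
        }
        int count = 0;
        for (byte raw : utf8) {
            if (isControlByte(raw & 0xFF)) {
                count++;
            }
        }
        return -((double) count / utf8.length);
    }

    /**
     * Average of log P(b|a) over consecutive byte pairs.
     * Assumes a 256*256 table indexed as a * 256 + b.
     */
    static double meanBigramLogProb(byte[] utf8, float[] table) {
        if (utf8.length < 2) {
            return 0.0; // assumption: no pairs means no evidence
        }
        double sum = 0.0;
        for (int i = 1; i < utf8.length; i++) {
            int a = utf8[i - 1] & 0xFF;
            int b = utf8[i] & 0xFF;
            sum += table[a * 256 + b];
        }
        return sum / (utf8.length - 1);
    }

    public static void main(String[] args) {
        byte[] dirty = "a\u0001b\u0002".getBytes(StandardCharsets.UTF_8);
        System.out.println(negControlFraction(dirty));
    }
}
```

Tab (0x09), line feed (0x0A), and carriage return (0x0D) fall outside the listed ranges, so ordinary whitespace does not count against a window.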
All four features are calibrated (mu/sigma) on the dev split so their z-scores are on a common scale. A per-script binary logistic regression classifier is then fit on (z1, z2, z3, z4) using clean dev windows and corrupted versions (inject@5%, char-shuffle) as training examples. The learned weights replace the fixed equal-weight average, allowing the model to automatically downweight noisy features (e.g. high-variance block transitions for MYANMAR) and upweight informative ones (e.g. control-byte fraction for inject@0.01).
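The calibration step can be sketched as follows: estimate mu and sigma from raw feature values on the dev split, then map any raw value to a z-score. The class name ZScoreCalibration, the use of the population standard deviation, and the zero-sigma guard are assumptions for this example; the actual trainer may differ.

```java
/** Hypothetical calibration helper; not the trainer's actual code. */
public class ZScoreCalibration {

    /** Returns {mu, sigma} for the given values (population standard deviation). */
    static double[] fit(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("need at least one value");
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double mu = sum / values.length;
        double sq = 0.0;
        for (double v : values) {
            sq += (v - mu) * (v - mu);
        }
        double sigma = Math.sqrt(sq / values.length);
        return new double[] {mu, sigma};
    }

    /** Maps a raw feature value onto the calibrated scale. */
    static double zScore(double x, double mu, double sigma) {
        if (sigma == 0.0) {
            return 0.0; // assumption: a constant feature carries no signal
        }
        return (x - mu) / sigma;
    }

    public static void main(String[] args) {
        double[] devValues = {1.0, 2.0, 3.0};
        double[] stats = fit(devValues);
        System.out.println(zScore(3.0, stats[0], stats[1]));
    }
}
```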
At inference, the final score is the linear combination
w1*z1 + w2*z2 + w3*z3 + w4*z4 + bias; positive values indicate clean text.
The natural threshold is 0 (probability 0.5); use a negative threshold for
more conservative junk detection.
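A minimal sketch of the inference-time combination described above, assuming the four z-scores and learned weights are already available as arrays. The class and method names (LinearJunkScore, score, probability, isClean) are hypothetical; only the formula itself comes from the description.

```java
/** Hypothetical scoring helper illustrating the linear combination. */
public class LinearJunkScore {

    /** Computes w1*z1 + w2*z2 + w3*z3 + w4*z4 + bias for arrays of equal length. */
    static double score(double[] w, double[] z, double bias) {
        if (w.length != z.length) {
            throw new IllegalArgumentException("weights and z-scores differ in length");
        }
        double s = bias;
        for (int i = 0; i < w.length; i++) {
            s += w[i] * z[i];
        }
        return s;
    }

    /** Logistic function: a score of 0 corresponds to probability 0.5. */
    static double probability(double score) {
        return 1.0 / (1.0 + Math.exp(-score));
    }

    /** Text is treated as clean when the score exceeds the threshold. */
    static boolean isClean(double score, double threshold) {
        return score > threshold;
    }

    public static void main(String[] args) {
        double[] w = {0.5, 0.25, 1.0, 0.25};
        double[] z = {1.0, -2.0, 0.0, 2.0};
        double s = score(w, z, 0.1);
        System.out.println(s + " -> " + probability(s));
    }
}
```

With a threshold of 0, windows the classifier rates below probability 0.5 are flagged as junk; lowering the threshold below 0 flags only windows the classifier is more certain about.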
Output format: JUNKDET1 gzipped binary, version 5.
Version 1–4 files can still be loaded by JunkDetector on the JVM they were trained on.
[8 bytes] magic "JUNKDET1" (ASCII)
[1 byte] version = 5
[4 bytes] num_scripts (big-endian int)
[2 bytes] block_N — number of distinct named Unicode blocks + 1 (unassigned)
// Block names section (version 5+): block_N-1 entries for JVM-independence
for i in [0, block_N-1):
[2 bytes] name length (big-endian ushort)
[name bytes] Unicode block name (Character.UnicodeBlock.toString())
// Global script-transition section (version 4+)
[1 byte] num_script_buckets
for each bucket:
[2 bytes] name length (big-endian ushort)
[name bytes] bucket name (UnicodeScript.name() or "OTHER")
[num_script_buckets² × 4 bytes] script-transition log-prob table
[4 bytes] mu4 (float32 big-endian)
[4 bytes] sigma4 (float32 big-endian)
// Per-script data (same as v3 but num_features = 4)
for each script (sorted by name):
[2 bytes] name length (big-endian ushort)
[name bytes] script name (UTF-8)
// Feature 1 — byte bigrams
[4 bytes] mu1 (float32 big-endian)
[4 bytes] sigma1 (float32 big-endian)
[65536×4 bytes] byte-bigram log-prob table (256×256)
// Feature 2 — block transitions
[4 bytes] mu2 (float32 big-endian)
[4 bytes] sigma2 (float32 big-endian)
[block_N²×4 bytes] block-transition log-prob table
// Feature 3 — control-byte fraction
[4 bytes] mu3 (float32 big-endian)
[4 bytes] sigma3 (float32 big-endian)
// Linear classifier weights
[1 byte] num_features (= 4 for v4+)
[4 bytes] w1 (float32 big-endian)
[4 bytes] w2 (float32 big-endian)
[4 bytes] w3 (float32 big-endian)
[4 bytes] w4 (float32 big-endian)
[4 bytes] bias (float32 big-endian)
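To make the layout above concrete, the sketch below reads only the fixed header fields (magic, version, num_scripts, block_N) from a gzipped stream. The class JunkModelHeader and its methods are hypothetical and are not part of JunkDetector. Treating block_N as a big-endian unsigned short is an assumption, since the layout gives only its width. The writeSample method exists purely to produce test input.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical reader for the first fields of a JUNKDET1 file. */
public class JunkModelHeader {
    final int version;
    final int numScripts;
    final int blockN;

    JunkModelHeader(int version, int numScripts, int blockN) {
        this.version = version;
        this.numScripts = numScripts;
        this.blockN = blockN;
    }

    /** Reads magic, version, num_scripts, and block_N from a gzipped stream. */
    static JunkModelHeader read(InputStream gzipped) {
        try {
            // DataInputStream reads multi-byte values in big-endian order.
            DataInputStream in = new DataInputStream(new GZIPInputStream(gzipped));
            byte[] magic = new byte[8];
            in.readFully(magic);
            if (!"JUNKDET1".equals(new String(magic, StandardCharsets.US_ASCII))) {
                throw new IOException("bad magic");
            }
            int version = in.readUnsignedByte();
            int numScripts = in.readInt();
            int blockN = in.readUnsignedShort(); // assumption: unsigned
            return new JunkModelHeader(version, numScripts, blockN);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes a header-only sample file into memory, for demonstration. */
    static byte[] writeSample(int version, int numScripts, int blockN) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bytes))) {
            out.write("JUNKDET1".getBytes(StandardCharsets.US_ASCII));
            out.writeByte(version);
            out.writeInt(numScripts);
            out.writeShort(blockN);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] sample = writeSample(5, 3, 328);
        JunkModelHeader h = read(new ByteArrayInputStream(sample));
        System.out.println(h.version + " " + h.numScripts + " " + h.blockN);
    }
}
```

A full reader would continue with the block names section and the global script-transition section before reaching the per-script data.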
Constructor Summary
Constructors
Method Summary
Constructor Details
TrainJunkModel
public TrainJunkModel()
Method Details
main
Throws:
IOException