Class TrainJunkModel

java.lang.Object
org.apache.tika.ml.junkdetect.tools.TrainJunkModel

public class TrainJunkModel extends Object
Trains the junk detector model from per-script corpus files produced by BuildJunkTrainingData.

For each script group (identified by a {script}.train.gz file), four features are trained and then combined by a per-script logistic regression classifier:

  1. Byte-bigram log-probability: 256×256 table of log P(b|a) over consecutive byte pairs in the UTF-8 encoding.
  2. Unicode named-block transition log-probability: N×N table of log P(block_b | block_a), where block ID is determined by Character.UnicodeBlock.of(int) — one of the ~327 named Unicode blocks plus one extra bucket for unassigned codepoints.
  3. Control-byte fraction: fraction of bytes in the control-character ranges 0x01–0x08, 0x0B, 0x0C, 0x0E–0x1F, and 0x7F (NUL, tab, LF, and CR are not counted). Stored as −fraction so the z-score convention matches the other features (higher = cleaner).
  4. Script-transition log-probability: a single global table of log P(script_b | script_a) over raw Character.UnicodeScript values (excluding COMMON, INHERITED, UNKNOWN), pooled across all training scripts. Its z-score is z4.
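
As a rough illustration of features 1, 3, and 4, the sketch below computes them for a single text window. It is not the TrainJunkModel implementation: the class FeatureSketch and all of its method names are hypothetical, and the add-one smoothing and the averaging over byte pairs are assumptions that the description above does not specify.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

final class FeatureSketch {

    /** True for byte values in the control ranges listed for feature 3. */
    static boolean isControlByte(int b) {
        return (b >= 0x01 && b <= 0x08) || b == 0x0B || b == 0x0C
                || (b >= 0x0E && b <= 0x1F) || b == 0x7F;
    }

    /** Feature 3: negated control-byte fraction, so that higher means cleaner. */
    static double negControlFraction(byte[] utf8) {
        if (utf8.length == 0) {
            return 0.0; // assumption: an empty window counts as clean
        }
        int control = 0;
        for (byte raw : utf8) {
            if (isControlByte(raw & 0xFF)) {
                control++;
            }
        }
        return -((double) control / utf8.length);
    }

    /** Feature 1 training: 256x256 table of log P(b|a); add-one smoothing is an assumption. */
    static float[] trainBigramTable(byte[] corpus) {
        long[] counts = new long[256 * 256];
        long[] rowTotals = new long[256];
        for (int i = 0; i + 1 < corpus.length; i++) {
            int a = corpus[i] & 0xFF;
            int b = corpus[i + 1] & 0xFF;
            counts[a * 256 + b]++;
            rowTotals[a]++;
        }
        float[] logProb = new float[256 * 256];
        for (int a = 0; a < 256; a++) {
            for (int b = 0; b < 256; b++) {
                double p = (counts[a * 256 + b] + 1.0) / (rowTotals[a] + 256.0);
                logProb[a * 256 + b] = (float) Math.log(p);
            }
        }
        return logProb;
    }

    /** Feature 1 scoring: mean log P(b|a) over consecutive byte pairs (averaging is an assumption). */
    static double meanBigramLogProb(float[] table, byte[] utf8) {
        if (utf8.length < 2) {
            return 0.0;
        }
        double sum = 0.0;
        for (int i = 0; i + 1 < utf8.length; i++) {
            sum += table[(utf8[i] & 0xFF) * 256 + (utf8[i + 1] & 0xFF)];
        }
        return sum / (utf8.length - 1);
    }

    /** Feature 4 input: per-codepoint scripts, skipping COMMON, INHERITED and UNKNOWN. */
    static List<UnicodeScript> scriptSequence(String text) {
        List<UnicodeScript> out = new ArrayList<>();
        text.codePoints().forEach(cp -> {
            UnicodeScript s = UnicodeScript.of(cp);
            if (s != UnicodeScript.COMMON && s != UnicodeScript.INHERITED
                    && s != UnicodeScript.UNKNOWN) {
                out.add(s);
            }
        });
        return out;
    }

    public static void main(String[] args) {
        byte[] window = {'a', 'b', 0x01, 0x02};
        System.out.println(negControlFraction(window));
        System.out.println(scriptSequence("ab, \u0433\u0434"));
    }
}
```

Feature 2 would follow the same counting pattern as the bigram table, with Character.UnicodeBlock.of(int) supplying the row and column indices instead of byte values.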

All four features are calibrated (mu/sigma) on the dev split so their z-scores are on a common scale. A per-script binary logistic regression classifier is then fit on (z1, z2, z3, z4) using clean dev windows and corrupted versions (inject@5%, char-shuffle) as training examples. The learned weights replace the fixed equal-weight average, allowing the model to automatically downweight noisy features (e.g. high-variance block transitions for MYANMAR) and upweight informative ones (e.g. control-byte fraction for inject@0.01).
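
The calibration and classifier fit can be pictured with the minimal sketch below. It assumes a population standard deviation, a small floor on sigma, and plain batch gradient descent; none of these choices is confirmed by the description above, and CalibrationSketch with its methods is a hypothetical name, not part of Tika.

```java
final class CalibrationSketch {

    /** Returns {mu, sigma} of raw feature values measured on clean dev windows. */
    static double[] muSigma(double[] raw) {
        double sum = 0.0;
        for (double v : raw) {
            sum += v;
        }
        double mu = sum / raw.length;
        double sq = 0.0;
        for (double v : raw) {
            sq += (v - mu) * (v - mu);
        }
        double sigma = Math.sqrt(sq / raw.length);       // population std (assumption)
        return new double[] {mu, Math.max(sigma, 1e-6)}; // floor avoids division by zero (assumption)
    }

    /** Maps a raw feature value onto the common z-score scale. */
    static double zScore(double raw, double mu, double sigma) {
        return (raw - mu) / sigma;
    }

    /**
     * Fits weights on z-score vectors; label 1 = clean window, 0 = corrupted window.
     * Returns {w1, ..., wd, bias}. Batch gradient descent is an assumption.
     */
    static double[] fitLogistic(double[][] z, int[] labels, int epochs, double lr) {
        int d = z[0].length;
        double[] w = new double[d + 1]; // last slot holds the bias
        for (int e = 0; e < epochs; e++) {
            double[] grad = new double[d + 1];
            for (int i = 0; i < z.length; i++) {
                double p = 1.0 / (1.0 + Math.exp(-decision(w, z[i])));
                double err = p - labels[i];
                for (int j = 0; j < d; j++) {
                    grad[j] += err * z[i][j];
                }
                grad[d] += err;
            }
            for (int j = 0; j <= d; j++) {
                w[j] -= lr * grad[j] / z.length;
            }
        }
        return w;
    }

    /** Linear decision value w . z + bias for one window. */
    static double decision(double[] w, double[] z) {
        double s = w[z.length];
        for (int j = 0; j < z.length; j++) {
            s += w[j] * z[j];
        }
        return s;
    }

    public static void main(String[] args) {
        double[][] z = {{1, 1, 1, 1}, {2, 1, 0.5, 1}, {-1, -2, -1, -1}, {-2, -1, -3, -1}};
        int[] labels = {1, 1, 0, 0};
        double[] w = fitLogistic(z, labels, 500, 0.1);
        System.out.println(java.util.Arrays.toString(w));
    }
}
```

In this toy data the clean windows have positive z-scores and the corrupted ones negative, so the fitted weights come out positive; a feature that did not separate the two classes would receive a weight near zero.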

At inference, the final score is the linear combination w1*z1 + w2*z2 + w3*z3 + w4*z4 + bias; positive values indicate clean text. The natural threshold is 0 (probability 0.5). Lowering the threshold below 0 makes junk detection more conservative: text is flagged as junk only when its score falls below the threshold.
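
The scoring rule can be written out as the short sketch below. ScoreSketch and its method names are hypothetical and do not reflect the JunkDetector API; only the formula and the meaning of the threshold come from the description above.

```java
final class ScoreSketch {

    /** Linear combination w1*z1 + w2*z2 + w3*z3 + w4*z4 + bias. */
    static double score(double[] z, double[] w, double bias) {
        double s = bias;
        for (int i = 0; i < z.length; i++) {
            s += w[i] * z[i];
        }
        return s;
    }

    /** Logistic link: a score of 0 corresponds to probability 0.5. */
    static double probabilityClean(double score) {
        return 1.0 / (1.0 + Math.exp(-score));
    }

    /** Flags junk when the score falls below the chosen threshold. */
    static boolean isJunk(double score, double threshold) {
        return score < threshold;
    }

    public static void main(String[] args) {
        double[] z = {1.0, 0.0, -1.0, 2.0};  // illustrative z-scores
        double[] w = {0.5, 0.5, 0.5, 0.5};   // illustrative weights
        double s = score(z, w, 0.25);
        System.out.println(s + " " + probabilityClean(s) + " " + isJunk(s, 0.0));
    }
}
```

With a threshold of -1.0, a window scoring -0.5 is kept as clean, whereas the natural threshold of 0 would flag it.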

Output format: JUNKDET1 gzipped binary, version 5. Version 1–4 files, which predate the block-names section, can still be loaded by JunkDetector on the JVM they were trained on.

   [8 bytes]  magic "JUNKDET1" (ASCII)
   [1 byte]   version = 5
   [4 bytes]  num_scripts (big-endian int)
   [2 bytes]  block_N — number of distinct named Unicode blocks + 1 (unassigned)
   // Block names section (version 5+): block_N-1 entries for JVM-independence
   for i in [0, block_N-1):
     [2 bytes]     name length (big-endian ushort)
     [name bytes]  Unicode block name (Character.UnicodeBlock.toString())
   // Global script-transition section (version 4+)
   [1 byte]   num_script_buckets
   for each bucket:
     [2 bytes]     name length (big-endian ushort)
     [name bytes]  bucket name (UnicodeScript.name() or "OTHER")
   [num_script_buckets² × 4 bytes]  script-transition log-prob table
   [4 bytes]  mu4   (float32 big-endian)
   [4 bytes]  sigma4 (float32 big-endian)
   // Per-script data (same as v3 but num_features = 4)
   for each script (sorted by name):
     [2 bytes]       name length (big-endian ushort)
     [name bytes]    script name (UTF-8)
     // Feature 1 — byte bigrams
     [4 bytes]       mu1   (float32 big-endian)
     [4 bytes]       sigma1 (float32 big-endian)
     [65536×4 bytes] byte-bigram log-prob table (256×256)
     // Feature 2 — block transitions
     [4 bytes]       mu2   (float32 big-endian)
     [4 bytes]       sigma2 (float32 big-endian)
     [block_N²×4 bytes] block-transition log-prob table
     // Feature 3 — control-byte fraction
     [4 bytes]       mu3   (float32 big-endian)
     [4 bytes]       sigma3 (float32 big-endian)
     // Linear classifier weights
     [1 byte]        num_features (= 4 for v4 and later)
     [4 bytes]       w1   (float32 big-endian)
     [4 bytes]       w2   (float32 big-endian)
     [4 bytes]       w3   (float32 big-endian)
     [4 bytes]       w4   (float32 big-endian)
     [4 bytes]       bias (float32 big-endian)
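
To make the layout concrete, the sketch below reads only the fixed fields at the start of the stream (magic, version, num_scripts, block_N). It is not the JunkDetector loader: HeaderSketch and its methods are hypothetical, the byte order of block_N is assumed to be big-endian like the other fields, and sampleHeader writes a header-only stream for the demo rather than a complete model file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class HeaderSketch {

    /** Reads {version, num_scripts, block_N} from a gzipped JUNKDET1 stream. */
    static int[] readHeader(InputStream gzipped) throws IOException {
        DataInputStream in = new DataInputStream(new GZIPInputStream(gzipped));
        byte[] magic = new byte[8];
        in.readFully(magic);
        if (!"JUNKDET1".equals(new String(magic, StandardCharsets.US_ASCII))) {
            throw new IOException("not a JUNKDET1 stream");
        }
        int version = in.readUnsignedByte();
        int numScripts = in.readInt();        // DataInputStream reads big-endian
        int blockN = in.readUnsignedShort();  // byte order assumed big-endian
        return new int[] {version, numScripts, blockN};
    }

    /** Builds a header-only sample stream for the demo; not a complete model file. */
    static byte[] sampleHeader(int version, int numScripts, int blockN) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bytes))) {
            out.write("JUNKDET1".getBytes(StandardCharsets.US_ASCII));
            out.writeByte(version);
            out.writeInt(numScripts);
            out.writeShort(blockN);
        }
        return bytes.toByteArray();
    }

    /** Writes a sample header and reads it back, wrapping checked exceptions. */
    static int[] roundTrip(int version, int numScripts, int blockN) {
        try {
            byte[] data = sampleHeader(version, numScripts, blockN);
            return readHeader(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] header = roundTrip(5, 3, 328); // arbitrary sample values
        System.out.println(header[0] + " " + header[1] + " " + header[2]);
    }
}
```

A full reader would continue with the block names, the global script-transition section, and the per-script records in the order listed above.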
 
  • Constructor Details

    • TrainJunkModel

      public TrainJunkModel()
  • Method Details