Class BuildJunkTrainingData

java.lang.Object
org.apache.tika.ml.junkdetect.tools.BuildJunkTrainingData

public class BuildJunkTrainingData extends Object
Builds per-script positive training data for the junk detector from MADLAD-400 and Wikipedia sentence files.

Script groups are derived entirely from the data. For each language directory, the dominant Unicode script is detected by building a histogram of Character.UnicodeScript values over a sample of sentences; the COMMON, INHERITED, and UNKNOWN pseudo-scripts are excluded from the count. Languages that share the same dominant script are pooled into one group. No script groups are hardcoded.
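
The detection step described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the class name DominantScriptSketch and the method dominantScript are hypothetical, and how the sample is drawn is left to the caller.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of BuildJunkTrainingData's documented API.
final class DominantScriptSketch {

    /** Returns the most frequent script in the sample, or null if nothing was counted. */
    static UnicodeScript dominantScript(List<String> sample) {
        Map<UnicodeScript, Integer> histogram = new EnumMap<>(UnicodeScript.class);
        for (String sentence : sample) {
            sentence.codePoints().forEach(cp -> {
                UnicodeScript script = UnicodeScript.of(cp);
                // Digits, punctuation, and combining marks carry no script signal.
                if (script != UnicodeScript.COMMON
                        && script != UnicodeScript.INHERITED
                        && script != UnicodeScript.UNKNOWN) {
                    histogram.merge(script, 1, Integer::sum);
                }
            });
        }
        UnicodeScript best = null;
        int bestCount = 0;
        for (Map.Entry<UnicodeScript, Integer> e : histogram.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

Languages would then be pooled by grouping language directories on the returned value.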

The total byte budget is distributed across script groups in proportion to each group's empirical byte-bigram entropy, measured from a small sample. Scripts with high entropy (e.g. CJK, which has thousands of distinct code points, each encoded as 3 bytes in UTF-8) receive a proportionally larger allocation than low-entropy scripts (e.g. Arabic, whose UTF-8 lead bytes cluster in the narrow range 0xD8-0xDB). The aim is for every script's bigram table to be estimated with comparable statistical quality regardless of character-set size.
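
A rough sketch of entropy-proportional allocation follows. It assumes "byte-bigram entropy" means the Shannon entropy, in bits, of the distribution of adjacent byte pairs in the UTF-8 encoding; the tool may use a different estimator. The names EntropyBudgetSketch, bigramEntropy, and allocate are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of BuildJunkTrainingData's documented API.
final class EntropyBudgetSketch {

    /** Shannon entropy (bits) of adjacent byte pairs in the UTF-8 encoding of the sample. */
    static double bigramEntropy(String sample) {
        byte[] bytes = sample.getBytes(StandardCharsets.UTF_8);
        Map<Integer, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < bytes.length; i++) {
            // Pack the two unsigned bytes into one 16-bit key.
            int key = ((bytes[i] & 0xFF) << 8) | (bytes[i + 1] & 0xFF);
            counts.merge(key, 1, Integer::sum);
        }
        int total = bytes.length - 1; // number of bigrams; <= 0 for very short input
        double entropy = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / total;
            entropy -= p * (Math.log(p) / Math.log(2));
        }
        return entropy;
    }

    /** Splits the total budget across scripts in proportion to their entropy (rounded down). */
    static Map<String, Long> allocate(Map<String, Double> entropyByScript, long totalBudgetBytes) {
        double sum = 0.0;
        for (double h : entropyByScript.values()) {
            sum += h;
        }
        Map<String, Long> budgets = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : entropyByScript.entrySet()) {
            long share = sum == 0.0 ? 0L : (long) (totalBudgetBytes * e.getValue() / sum);
            budgets.put(e.getKey(), share);
        }
        return budgets;
    }
}
```

Under this sketch, a script whose measured entropy is three times that of another receives three times the byte budget.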

Within each script group, the byte budget is distributed evenly across the member languages so that no single language dominates the group.
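
As an illustration of the even split, the sketch below divides a group budget by the number of member languages and takes sentences from one language until its share is used up. It assumes a sentence costs its UTF-8 length plus one newline byte, which is a guess at how bytes are counted; LanguageShareSketch and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of BuildJunkTrainingData's documented API.
final class LanguageShareSketch {

    /** Even share of a script group's budget per member language (remainder dropped). */
    static long perLanguageBudget(long groupBudgetBytes, int languageCount) {
        if (languageCount <= 0) {
            throw new IllegalArgumentException("languageCount must be positive");
        }
        return groupBudgetBytes / languageCount;
    }

    /** Takes sentences in order until the next one would exceed the byte budget. */
    static List<String> takeWithinBudget(List<String> sentences, long budgetBytes) {
        List<String> taken = new ArrayList<>();
        long used = 0;
        for (String sentence : sentences) {
            // Assumed cost model: UTF-8 bytes plus one byte for the line terminator.
            long size = sentence.getBytes(StandardCharsets.UTF_8).length + 1L;
            if (used + size > budgetBytes) {
                break;
            }
            taken.add(sentence);
            used += size;
        }
        return taken;
    }
}
```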

Input format (sentences_madlad.txt and sentences_wikipedia.txt): lineNum TAB text, UTF-8. MADLAD records are full scraped documents in which literal \n escape sequences (a backslash followed by n, not a newline character) act as sub-sentence separators; Wikipedia records are individual sentences. Both are split and cleaned into sentence-level strings.
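
Parsing one input line might look like the sketch below. It handles both sources the same way: the text after the first tab is split on the two-character sequence backslash-n, and empty pieces are dropped. The actual cleaning rules are not documented here, so trimming whitespace is only an assumption; SentenceLineSketch and sentences are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of BuildJunkTrainingData's documented API.
final class SentenceLineSketch {

    // Matches a literal backslash followed by 'n' (two characters), not a newline.
    private static final Pattern ESCAPED_NEWLINE = Pattern.compile(Pattern.quote("\\n"));

    /** Splits one "lineNum TAB text" record into trimmed, non-empty sentences. */
    static List<String> sentences(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return List.of(); // malformed record: no TAB separator
        }
        String text = line.substring(tab + 1);
        List<String> result = new ArrayList<>();
        for (String part : ESCAPED_NEWLINE.split(text)) {
            String sentence = part.strip();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }
}
```

A Wikipedia record contains no escaped separators, so it comes back as a single-element list.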

Output:

   output-dir/
     {script}.train.gz   — 80% split, one NFC-normalised sentence per line
     {script}.dev.gz     — 10% split, used for calibration (mu/sigma)
     {script}.test.gz    — 10% split, held out for final evaluation only
     manifest.tsv        — per-script stats: entropy, budget, bytes written, languages
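
The normalisation and the 80/10/10 split listed above could be sketched as below. NFC normalisation uses the standard java.text.Normalizer. How the tool assigns sentences to splits is not stated, so the hash-based bucketing here is purely an assumption chosen because it is deterministic; the resulting proportions are only approximate. SplitSketch and its methods are hypothetical names.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of BuildJunkTrainingData's documented API.
final class SplitSketch {

    /** Normalises a sentence to Unicode Normalization Form C. */
    static String nfc(String sentence) {
        return Normalizer.normalize(sentence, Normalizer.Form.NFC);
    }

    /** Assigns a sentence to "train", "dev", or "test" (assumed scheme: hash bucket 0-9). */
    static String splitFor(String sentence) {
        int bucket = Math.floorMod(nfc(sentence).hashCode(), 10);
        if (bucket < 8) {
            return "train"; // buckets 0-7, roughly 80%
        }
        return bucket == 8 ? "dev" : "test"; // roughly 10% each
    }
}
```

Because the bucket is computed from the normalised text, composed and decomposed forms of the same sentence land in the same split.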
 

Usage:

   java BuildJunkTrainingData \
     --data-dir   ~/datasets/madlad/data \
     --output-dir ~/datasets/madlad/junkdetect \
     [--total-budget-bytes 50000000]
 
  • Constructor Details

    • BuildJunkTrainingData

      public BuildJunkTrainingData()
  • Method Details