org.apache.tika.ml.junkdetect.tools.BuildJunkTrainingData

public class BuildJunkTrainingData extends Object

Builds per-script positive training data for the junk detector from MADLAD-400 and Wikipedia sentence files.

Script groups are derived entirely from the data: for each language directory the dominant Unicode script is detected by histogramming Character.UnicodeScript over a sample of sentences (COMMON, INHERITED, and UNKNOWN pseudo-scripts excluded). Languages that share the same dominant script are pooled. No script groups are hardcoded.
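
The detection step described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the class name DominantScriptSketch and the method dominantScript are hypothetical, and how the tool samples sentences is not shown here.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class DominantScriptSketch {

    /** Returns the most frequent real script in the sample, or null if none occurs. */
    static UnicodeScript dominantScript(List<String> sampleSentences) {
        Map<UnicodeScript, Long> histogram = new EnumMap<>(UnicodeScript.class);
        for (String sentence : sampleSentences) {
            sentence.codePoints().forEach(cp -> {
                UnicodeScript script = UnicodeScript.of(cp);
                // Pseudo-scripts (digits, punctuation, combining marks, unassigned)
                // say nothing about the language's writing system, so skip them.
                if (script != UnicodeScript.COMMON
                        && script != UnicodeScript.INHERITED
                        && script != UnicodeScript.UNKNOWN) {
                    histogram.merge(script, 1L, Long::sum);
                }
            });
        }
        UnicodeScript best = null;
        long bestCount = 0;
        for (Map.Entry<UnicodeScript, Long> entry : histogram.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```

Languages whose samples return the same script would then be pooled into one group keyed by that script.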

The total byte budget is distributed across script groups in proportion to each group's empirical byte-bigram entropy, measured on a small sample. High-entropy scripts (e.g. CJK, with thousands of distinct code points that each take 3 bytes in UTF-8) receive a proportionally larger allocation than low-entropy scripts (e.g. Arabic, whose UTF-8 lead bytes cluster in the narrow range 0xD8-0xDB). The aim is for every script's bigram table to be estimated with comparable statistical quality regardless of character-set size.

Within each script group the byte budget is distributed evenly across its member languages, ensuring diversity (no single language dominates).
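
The two allocation steps above (entropy-proportional across scripts, then even within a script) can be sketched as follows. All names here are hypothetical. The sketch assumes the entropy is the joint Shannon entropy of adjacent byte pairs in bits; the tool may define it differently (for example as a conditional entropy).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class EntropyBudgetSketch {

    /** Joint Shannon entropy (bits) of adjacent byte pairs in the sample. */
    static double bigramEntropyBits(byte[] sample) {
        int[] counts = new int[65536];
        int total = 0;
        for (int i = 0; i + 1 < sample.length; i++) {
            int bigram = ((sample[i] & 0xFF) << 8) | (sample[i + 1] & 0xFF);
            counts[bigram]++;
            total++;
        }
        double entropy = 0.0;
        for (int count : counts) {
            if (count > 0) {
                double p = (double) count / total;
                entropy -= p * (Math.log(p) / Math.log(2));
            }
        }
        return entropy;
    }

    /** Splits the total budget across scripts in proportion to their entropy. */
    static Map<String, Long> allocate(Map<String, Double> entropyByScript, long totalBudgetBytes) {
        double sum = 0.0;
        for (double h : entropyByScript.values()) {
            sum += h;
        }
        Map<String, Long> budgets = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : entropyByScript.entrySet()) {
            long share = sum > 0 ? (long) (totalBudgetBytes * entry.getValue() / sum) : 0L;
            budgets.put(entry.getKey(), share);
        }
        return budgets;
    }

    /** Even split of one script's budget across its member languages. */
    static long perLanguageBudget(long scriptBudgetBytes, int languageCount) {
        if (languageCount <= 0) {
            throw new IllegalArgumentException("languageCount must be positive");
        }
        return scriptBudgetBytes / languageCount;
    }
}
```

For example, with a total budget of 1000 bytes and measured entropies of 1.0 and 3.0 bits, the two scripts would receive 250 and 750 bytes; a script with 750 bytes and three member languages would give each language 250 bytes.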

Input format (sentences_madlad.txt and sentences_wikipedia.txt): one UTF-8 record per line, in the form lineNum TAB text. MADLAD records are full scraped documents and contain literal \n escape sequences as sub-sentence separators; Wikipedia records are individual sentences. Both are split and cleaned into sentence-level strings.
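
A minimal sketch of parsing one such record is shown below. The names RecordSplitSketch and sentencesFromRecord are hypothetical, and the cleaning shown (whitespace stripping and dropping empty fragments) is an assumption; the tool's actual cleaning rules are not documented here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class RecordSplitSketch {

    /** Splits one "lineNum TAB text" record into sentence-level strings. */
    static List<String> sentencesFromRecord(String line) {
        List<String> sentences = new ArrayList<>();
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return sentences; // malformed record: no TAB separator
        }
        String text = line.substring(tab + 1);
        // The regex \\n matches the literal two-character sequence
        // backslash + 'n', not a real newline character.
        for (String part : text.split("\\\\n")) {
            String cleaned = part.strip();
            if (!cleaned.isEmpty()) {
                sentences.add(cleaned);
            }
        }
        return sentences;
    }
}
```

A Wikipedia record contains no escape sequences, so the same method returns it as a single sentence.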

Output:

   output-dir/
     {script}.train.gz   — 80% split, one NFC-normalised sentence per line
     {script}.dev.gz     — 10% split, used for calibration (mu/sigma)
     {script}.test.gz    — 10% split, held out for final evaluation only
     manifest.tsv        — per-script stats: entropy, budget, bytes written, languages
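
Writing the three gzipped splits for one script could look like the sketch below. The names are hypothetical, and the assignment rule (sentence index modulo 10) is an assumption chosen only to reproduce the 80/10/10 proportions; the tool may shuffle or hash instead.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.Normalizer;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, for illustration only.
final class SplitWriterSketch {

    /** Assumed rule: indices 0-7 of every 10 go to train, 8 to dev, 9 to test. */
    static String splitFor(long index) {
        long bucket = index % 10;
        if (bucket < 8) {
            return "train";
        }
        return bucket == 8 ? "dev" : "test";
    }

    /** Writes {script}.train.gz, {script}.dev.gz and {script}.test.gz into outputDir. */
    static void writeSplits(Path outputDir, String script, List<String> sentences)
            throws IOException {
        Files.createDirectories(outputDir);
        try (BufferedWriter train = open(outputDir, script, "train");
             BufferedWriter dev = open(outputDir, script, "dev");
             BufferedWriter test = open(outputDir, script, "test")) {
            long index = 0;
            for (String sentence : sentences) {
                // One NFC-normalised sentence per line, as in the layout above.
                String nfc = Normalizer.normalize(sentence, Normalizer.Form.NFC);
                String split = splitFor(index++);
                BufferedWriter writer =
                        split.equals("train") ? train : split.equals("dev") ? dev : test;
                writer.write(nfc);
                writer.write('\n');
            }
        }
    }

    private static BufferedWriter open(Path dir, String script, String split)
            throws IOException {
        Path file = dir.resolve(script + "." + split + ".gz");
        return new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(file)), StandardCharsets.UTF_8));
    }
}
```

A deterministic rule like this keeps the splits reproducible across runs, at the cost of depending on input order.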

Usage:

   java BuildJunkTrainingData \
     --data-dir   ~/datasets/madlad/data \
     --output-dir ~/datasets/madlad/junkdetect \
     [--total-budget-bytes 50000000]

Constructor Summary

   Constructor                Description
   BuildJunkTrainingData()

Method Summary

   Modifier and Type   Method                Description
   static void         main(String[] args)

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- BuildJunkTrainingData
  
  public BuildJunkTrainingData()
Method Details
- main
  
  public static void main(String[] args) throws IOException
  
  Throws:
  
  IOException
