Class BuildJunkTrainingData
Script groups are derived entirely from the data: for each language directory
the dominant Unicode script is detected by histogramming Character.UnicodeScript
over a sample of sentences (COMMON, INHERITED, and UNKNOWN pseudo-scripts excluded).
Languages that share the same dominant script are pooled. No script groups are
hardcoded.
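The detection step described above can be sketched as follows. This is a minimal illustration, not the class's actual code: the class name `DominantScriptSketch` and the method `dominantScript` are hypothetical, and the sample size and tie-breaking rule are assumptions.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class DominantScriptSketch {

    /** Returns the most frequent real script in the sample, or UNKNOWN if none is found. */
    static UnicodeScript dominantScript(List<String> sentences) {
        Map<UnicodeScript, Long> histogram = new EnumMap<>(UnicodeScript.class);
        for (String sentence : sentences) {
            sentence.codePoints().forEach(cp -> {
                UnicodeScript script = UnicodeScript.of(cp);
                // Skip the pseudo-scripts (punctuation, digits, combining marks, unassigned).
                if (script != UnicodeScript.COMMON
                        && script != UnicodeScript.INHERITED
                        && script != UnicodeScript.UNKNOWN) {
                    histogram.merge(script, 1L, Long::sum);
                }
            });
        }
        UnicodeScript best = UnicodeScript.UNKNOWN;
        long bestCount = 0;
        for (Map.Entry<UnicodeScript, Long> entry : histogram.entrySet()) {
            // Assumption: ties keep the first script in enum order.
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```

Languages whose samples return the same `UnicodeScript` value would then be pooled into one group, for example by using the returned value as a map key.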
The total byte budget is distributed across script groups in proportion to each group's empirical byte-bigram entropy, measured from a small sample. High-entropy scripts (e.g. CJK, which has thousands of distinct code points that encode to 3 bytes in UTF-8) receive a proportionally larger allocation than low-entropy scripts (e.g. Arabic, whose UTF-8 lead bytes cluster in the narrow 0xD8-0xDB range). The aim is for every script's bigram table to be estimated with comparable statistical quality regardless of character-set size.
Within each script group the byte budget is distributed evenly across its member languages, so that no single language dominates the group.
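The two allocation steps above can be sketched as follows. All names here (`EntropyBudgetSketch`, `bigramEntropyBits`, `allocate`, `perLanguageBudget`) are hypothetical. The sketch assumes "byte-bigram entropy" means the Shannon entropy, in bits, of the distribution of adjacent UTF-8 byte pairs; the class may use a different definition (for example a conditional entropy), and its rounding rules are not documented here.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class EntropyBudgetSketch {

    /** Shannon entropy (bits) of adjacent UTF-8 byte pairs within each sample string. */
    static double bigramEntropyBits(List<String> sample) {
        Map<Integer, Long> counts = new HashMap<>();
        long total = 0;
        for (String s : sample) {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            for (int i = 0; i + 1 < bytes.length; i++) {
                // Pack the two unsigned bytes into one int key.
                int key = ((bytes[i] & 0xFF) << 8) | (bytes[i + 1] & 0xFF);
                counts.merge(key, 1L, Long::sum);
                total++;
            }
        }
        if (total == 0) {
            return 0.0;
        }
        double entropy = 0.0;
        for (long count : counts.values()) {
            double p = (double) count / total;
            entropy -= p * (Math.log(p) / Math.log(2));
        }
        return entropy;
    }

    /** Splits the total budget across script groups in proportion to entropy (rounded down). */
    static Map<String, Long> allocate(Map<String, Double> entropyByScript, long totalBudgetBytes) {
        double sum = 0.0;
        for (double h : entropyByScript.values()) {
            sum += h;
        }
        Map<String, Long> budgets = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : entropyByScript.entrySet()) {
            long share = sum == 0.0 ? 0L : (long) Math.floor(totalBudgetBytes * e.getValue() / sum);
            budgets.put(e.getKey(), share);
        }
        return budgets;
    }

    /** Even split of one group's budget across its member languages. */
    static long perLanguageBudget(long groupBudgetBytes, int languageCount) {
        return languageCount <= 0 ? 0L : groupBudgetBytes / languageCount;
    }
}
```

With this scheme a group whose measured entropy is three times that of another receives three times the bytes, and a group of four languages gives each language a quarter of the group's share.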
Input format (sentences_madlad.txt and sentences_wikipedia.txt):
lineNum TAB text, UTF-8. MADLAD records contain literal \n escape
sequences as sub-sentence separators (full scraped documents); Wikipedia records
are individual sentences. Both are split/cleaned to sentence-level strings.
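A minimal sketch of reading one record in this format is shown below. The names `RecordParsingSketch` and `sentencesFromLine` are hypothetical, and the cleaning shown (trimming and dropping empty pieces) is an assumption; the class's actual cleaning rules are not described here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class RecordParsingSketch {

    // Matches the two-character escape sequence: a backslash followed by 'n'.
    private static final Pattern ESCAPED_NEWLINE = Pattern.compile(Pattern.quote("\\n"));

    /**
     * Parses one "lineNum TAB text" record.
     * Pass splitEscapedNewlines = true for MADLAD records, false for Wikipedia records.
     */
    static List<String> sentencesFromLine(String line, boolean splitEscapedNewlines) {
        List<String> sentences = new ArrayList<>();
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return sentences; // Assumption: records without a TAB are skipped.
        }
        String text = line.substring(tab + 1);
        String[] parts = splitEscapedNewlines
                ? ESCAPED_NEWLINE.split(text)
                : new String[] {text};
        for (String part : parts) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                sentences.add(trimmed);
            }
        }
        return sentences;
    }
}
```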
Output:
output-dir/
  {script}.train.gz   80% split, one NFC-normalised sentence per line
  {script}.dev.gz     10% split, used for calibration (mu/sigma)
  {script}.test.gz    10% split, held out for final evaluation only
  manifest.tsv        per-script stats: entropy, budget, bytes written, languages
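One way the three split files could be produced is sketched below. The names `SplitWriterSketch`, `splitFor`, `nfc`, and `writeSplits` are hypothetical. How the class actually assigns sentences to splits is not stated; the index-modulo-10 rule here is an assumption chosen only because it yields an 80/10/10 ratio deterministically.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.Normalizer;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, for illustration only.
final class SplitWriterSketch {

    /** Assumed rule: of every 10 sentences, 8 go to train, 1 to dev, 1 to test. */
    static String splitFor(int index) {
        int bucket = index % 10;
        if (bucket < 8) {
            return "train";
        }
        return bucket == 8 ? "dev" : "test";
    }

    /** Normalises a sentence to Unicode NFC. */
    static String nfc(String sentence) {
        return Normalizer.normalize(sentence, Normalizer.Form.NFC);
    }

    private static BufferedWriter open(Path dir, String script, String split) throws IOException {
        Path file = dir.resolve(script + "." + split + ".gz");
        return new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(file)), StandardCharsets.UTF_8));
    }

    /** Writes {script}.train.gz, {script}.dev.gz and {script}.test.gz, one sentence per line. */
    static void writeSplits(Path outputDir, String script, List<String> sentences)
            throws IOException {
        Files.createDirectories(outputDir);
        try (BufferedWriter train = open(outputDir, script, "train");
             BufferedWriter dev = open(outputDir, script, "dev");
             BufferedWriter test = open(outputDir, script, "test")) {
            for (int i = 0; i < sentences.size(); i++) {
                String split = splitFor(i);
                BufferedWriter writer =
                        split.equals("train") ? train : split.equals("dev") ? dev : test;
                writer.write(nfc(sentences.get(i)));
                writer.write('\n');
            }
        }
    }
}
```

A shuffled or hash-based assignment would serve equally well; the only property this sketch relies on is that the dev and test files never share sentences with the train file.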
Usage:
java BuildJunkTrainingData \
    --data-dir ~/datasets/madlad/data \
    --output-dir ~/datasets/madlad/junkdetect \
    [--total-budget-bytes 50000000]
Constructor Summary
Constructors
Method Summary
Constructor Details
BuildJunkTrainingData
public BuildJunkTrainingData()
Method Details
main
Throws: IOException