Class BuildCharsetTrainingData

java.lang.Object
org.apache.tika.ml.chardetect.tools.BuildCharsetTrainingData

public class BuildCharsetTrainingData extends Object
Generates charset-detection training, devtest, and test data from MADLAD-400 and Cantonese Wikipedia sentence files. Produces gzipped files of [uint16-BE length][raw bytes] records, one file per charset per split.
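
The record layout can be illustrated with a small round trip. This is a minimal sketch, assuming only the [uint16-BE length][raw bytes] framing described above; the class name CharsetRecordSketch and its methods are hypothetical and are not part of this tool's API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper (not part of Tika): writes and reads gzipped
// [uint16-BE length][raw bytes] records.
final class CharsetRecordSketch {

    // Writes each sample as a 2-byte big-endian length followed by its bytes.
    static byte[] writeRecords(List<byte[]> samples) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(buffer))) {
            for (byte[] sample : samples) {
                if (sample.length > 0xFFFF) {
                    throw new IllegalArgumentException("sample does not fit a uint16 length");
                }
                out.writeShort(sample.length); // DataOutputStream writes big-endian
                out.write(sample);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads records until end of stream. A stream truncated inside a length
    // prefix is treated as end of input in this sketch.
    static List<byte[]> readRecords(InputStream gzipped) {
        List<byte[]> records = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new GZIPInputStream(gzipped))) {
            while (true) {
                int length;
                try {
                    length = in.readUnsignedShort(); // uint16, big-endian
                } catch (EOFException endOfStream) {
                    break;
                }
                byte[] record = new byte[length];
                in.readFully(record);
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] file = writeRecords(List.of(
                "Grüße".getBytes(StandardCharsets.ISO_8859_1),
                "hello".getBytes(StandardCharsets.US_ASCII)));
        for (byte[] record : readRecords(new ByteArrayInputStream(file))) {
            System.out.println(record.length + " bytes");
        }
    }
}
```

A uint16 prefix caps each record at 65,535 bytes, which is why the writer in this sketch rejects longer samples rather than truncating them silently.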

This is the single authoritative data-generation tool; it replaces the Python build_charset_training.py script entirely. Java is used because it supports charsets unavailable in CPython's standard codec library — IBM1047 (EBCDIC Open Systems Latin-1), x-EUC-TW (Traditional Chinese Unix), IBM420 (EBCDIC Arabic), and IBM424 (EBCDIC Hebrew) — and because eliminating the Python/ebcdic/fastText dependency chain simplifies the build.
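
Whether a given Java runtime actually provides these charsets can be checked before a run. The following is a minimal sketch using the standard java.nio.charset.Charset API; the class name CharsetAvailabilitySketch is hypothetical, and availability depends on the runtime (for example, on whether extended charsets are included in the image).

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical probe (not part of Tika): reports which of the charsets
// named above are available in the running JVM.
final class CharsetAvailabilitySketch {

    static final String[] NAMES = {"IBM1047", "x-EUC-TW", "IBM420", "IBM424"};

    // Maps each charset name to true if the JVM can resolve it.
    static Map<String, Boolean> probe() {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String name : NAMES) {
            result.put(name, Charset.isSupported(name));
        }
        return result;
    }

    public static void main(String[] args) {
        probe().forEach((name, available) ->
                System.out.println(name + " -> " + (available ? "available" : "missing")));
    }
}
```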

Charset design decisions match the former Python generator:

  • Windows superset policy: a windows-12XX charset is trained instead of its ISO-8859-X equivalent wherever a superset exists. ISO-8859-3 is retained because Maltese has no Windows equivalent.
  • Superset-only: Big5-HKSCS (not plain Big5), GB18030 (not GBK/GB2312), Shift_JIS via Java's CP932 superset.
  • Structural-only charsets (US-ASCII, ISO-2022-*): devtest and test files are generated for evaluation, but the train split is skipped because these charsets produce no high bytes (0x80-0xFF) and therefore provide no ML features.
  • Unicode charsets (UTF-8/16/32): applied to every language so the model sees diverse scripts in wide encodings.
  • IBM1047: EBCDIC Open Systems Latin-1, used on z/OS Unix System Services. Trained on the same Western European languages as IBM500. Distinguished from IBM500 primarily by the byte position of '!' (0x5A in IBM1047 vs 0x4F in IBM500) and the line-terminator byte.
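
Two of the decisions above can be checked against the JDK's own encoders: that structural-only charsets emit no high bytes, and that '!' lands on different bytes in IBM1047 and IBM500. This is a minimal sketch, not code from this tool; the class CharsetDesignChecks and its methods are hypothetical, and the EBCDIC and ISO-2022 charsets are assumed to be present only if Charset.isSupported says so.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical checks (not part of Tika) for two of the design notes above.
final class CharsetDesignChecks {

    // Counts bytes with the top bit set (0x80-0xFF).
    static int countHighBytes(byte[] data) {
        int count = 0;
        for (byte b : data) {
            if ((b & 0x80) != 0) {
                count++;
            }
        }
        return count;
    }

    // Returns the byte the named charset uses for '!', or -1 if the charset
    // is unavailable in this JVM or does not encode '!' as a single byte.
    static int bangByte(String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return -1;
        }
        byte[] encoded = "!".getBytes(Charset.forName(charsetName));
        return encoded.length == 1 ? (encoded[0] & 0xFF) : -1;
    }

    public static void main(String[] args) {
        // US-ASCII output never has the top bit set.
        byte[] ascii = "plain text".getBytes(StandardCharsets.US_ASCII);
        System.out.println("US-ASCII high bytes: " + countHighBytes(ascii));

        // The documentation above gives 0x5A for IBM1047 and 0x4F for IBM500.
        System.out.printf("IBM1047 '!' = 0x%02X%n", bangByte("IBM1047"));
        System.out.printf("IBM500  '!' = 0x%02X%n", bangByte("IBM500"));
    }
}
```

If either EBCDIC charset is missing from the runtime, bangByte returns -1 instead of throwing, so the check degrades gracefully on minimal JVM images.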

Usage:

   java BuildCharsetTrainingData \
     --madlad-dir  ~/datasets/madlad/data \
     --output-dir  ~/datasets/madlad/charset-detect4
 
  • Constructor Details

    • BuildCharsetTrainingData

      public BuildCharsetTrainingData()
  • Method Details