Class BuildCharsetTrainingData
java.lang.Object
org.apache.tika.ml.chardetect.tools.BuildCharsetTrainingData
Generates charset-detection training, devtest, and test data from MADLAD-400
and Cantonese Wikipedia sentence files. Produces gzipped files of
[uint16-BE length][raw bytes] records, one file per charset per split.
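The record layout described above can be sketched as a small round trip. This is an illustration only, not the tool's actual code: the class name RecordFormatSketch and its two helper methods are hypothetical, and the sketch works on an in-memory buffer rather than on per-charset files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class; the names are assumptions made for illustration.
public class RecordFormatSketch {

    // Writes each sample as [uint16 big-endian length][raw bytes] into a gzip stream.
    static byte[] writeRecords(List<byte[]> samples) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(buffer))) {
            for (byte[] sample : samples) {
                if (sample.length > 0xFFFF) {
                    throw new IllegalArgumentException("sample does not fit a uint16 length");
                }
                out.writeShort(sample.length); // low 16 bits, high byte first
                out.write(sample);
            }
        }
        return buffer.toByteArray();
    }

    // Reads records back until the gzip stream is exhausted.
    // A stream truncated inside a length prefix is treated as end of data here.
    static List<byte[]> readRecords(byte[] gzipped) throws IOException {
        List<byte[]> samples = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new ByteArrayInputStream(gzipped)))) {
            while (true) {
                int length;
                try {
                    length = in.readUnsignedShort();
                } catch (EOFException endOfStream) {
                    break;
                }
                byte[] sample = new byte[length];
                in.readFully(sample);
                samples.add(sample);
            }
        }
        return samples;
    }

    public static void main(String[] args) throws IOException {
        List<byte[]> samples = List.of(
                "first sentence".getBytes(StandardCharsets.UTF_8),
                "second sentence".getBytes(StandardCharsets.UTF_16BE));
        List<byte[]> decoded = readRecords(writeRecords(samples));
        System.out.println("records read back: " + decoded.size());
    }
}
```

A uint16 prefix caps each record at 65535 bytes, so a writer following this layout has to reject or split longer samples; the sketch rejects them.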
This is the single authoritative data-generation tool; it replaces the
Python build_charset_training.py script entirely. Java is used
because it supports charsets unavailable in CPython's standard codec
library — IBM1047 (EBCDIC Open Systems Latin-1), x-EUC-TW (Traditional
Chinese Unix), IBM420 (EBCDIC Arabic), and IBM424 (EBCDIC Hebrew) — and
because eliminating the Python/ebcdic/fastText dependency chain simplifies
the build.
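Whether a given JVM actually provides these charsets can be checked before running the tool. The sketch below is an assumption-laden illustration (the class name CharsetAvailabilitySketch is hypothetical): it only probes the four names listed above with Charset.isSupported and makes no claim about which runtimes ship them.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical probe; not part of the documented tool.
public class CharsetAvailabilitySketch {

    // The four charsets named in the class description.
    static final String[] NAMES = {"IBM1047", "x-EUC-TW", "IBM420", "IBM424"};

    // Maps each charset name to whether the running JVM supports it.
    static Map<String, Boolean> probe() {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String name : NAMES) {
            result.put(name, Charset.isSupported(name));
        }
        return result;
    }

    public static void main(String[] args) {
        probe().forEach((name, supported) ->
                System.out.println(name + " supported: " + supported));
    }
}
```

On a trimmed runtime image that omits the extended charsets, some entries may report false, in which case the corresponding output files could not be generated.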
Charset design decisions match the former Python generator:
- Windows superset policy: windows-12XX trained instead of the ISO-8859-X equivalent wherever a superset exists. ISO-8859-3 is retained (Maltese — no Windows equivalent).
- Superset-only: Big5-HKSCS (not plain Big5), GB18030 (not GBK/GB2312), Shift_JIS via Java's CP932 superset.
- Structural-only charsets (US-ASCII, ISO-2022-*): devtest/test files are generated for evaluation, but train is skipped because these charsets produce zero high bytes and provide no ML features.
- Unicode charsets (UTF-8/16/32): applied to every language so the model sees diverse scripts in wide encodings.
- IBM1047: EBCDIC Open Systems Latin-1, used on z/OS Unix System Services. Trained on the same Western European languages as IBM500. Distinguished from IBM500 primarily by the byte position of '!' (0x5A in IBM1047 vs 0x4F in IBM500) and the line-terminator byte.
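Two of the points in the list above can be checked directly: the byte that '!' encodes to in IBM1047 versus IBM500, and the absence of high bytes in a structural-only charset. The following is a minimal sketch under stated assumptions: the class and method names are hypothetical, and the EBCDIC lookups return -1 when the running JVM does not provide the charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not taken from the tool's source.
public class CharsetDesignSketch {

    // Returns the single byte that '!' encodes to, or -1 if the charset is
    // unavailable on this JVM or '!' does not encode to exactly one byte.
    static int bangByte(String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return -1;
        }
        byte[] encoded = "!".getBytes(Charset.forName(charsetName));
        return encoded.length == 1 ? (encoded[0] & 0xFF) : -1;
    }

    // Counts bytes with the high bit set (values 0x80 and above).
    static int highByteCount(byte[] data) {
        int count = 0;
        for (byte b : data) {
            if ((b & 0x80) != 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Per the description above, these should differ when both charsets exist.
        System.out.printf("IBM1047 '!' -> 0x%02X%n", bangByte("IBM1047"));
        System.out.printf("IBM500  '!' -> 0x%02X%n", bangByte("IBM500"));

        // US-ASCII output never has the high bit set.
        byte[] ascii = "plain text".getBytes(StandardCharsets.US_ASCII);
        System.out.println("US-ASCII high bytes: " + highByteCount(ascii));
    }
}
```

A -1 result means only that the check could not be performed on the current runtime, not that the charsets agree.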
Usage:
java BuildCharsetTrainingData \
--madlad-dir ~/datasets/madlad/data \
--output-dir ~/datasets/madlad/charset-detect4
Constructor Details
BuildCharsetTrainingData
public BuildCharsetTrainingData()
Method Details
main
Throws:
IOException