Class TrainNaiveBayesBigram
For each charset, counts all stride-1 byte bigrams across training
samples, keeps the top-K most frequent (default 2000), and applies
Laplace add-α smoothing for out-of-vocabulary bigrams. Output is a
binary model file consumed by NaiveBayesBigramEncodingDetector.
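The counting, top-K selection, and add-α smoothing described above can be sketched as follows. This is a minimal illustration and not the trainer's actual code: the class name BigramSmoothingSketch, the packing of a bigram into an int key, and the single reserved out-of-vocabulary slot in the denominator are all assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of per-class bigram counting with add-alpha smoothing. */
public class BigramSmoothingSketch {

    /** Counts stride-1 byte bigrams; key = (first byte << 8) | second byte, both unsigned. */
    static Map<Integer, Long> countBigrams(List<byte[]> samples) {
        Map<Integer, Long> counts = new HashMap<>();
        for (byte[] s : samples) {
            for (int i = 0; i + 1 < s.length; i++) {
                int key = ((s[i] & 0xFF) << 8) | (s[i + 1] & 0xFF);
                counts.merge(key, 1L, Long::sum);
            }
        }
        return counts;
    }

    /** Returns the k most frequent bigram keys; ties are broken by key for determinism. */
    static List<Integer> topK(Map<Integer, Long> counts, int k) {
        List<Integer> keys = new ArrayList<>(counts.keySet());
        keys.sort((a, b) -> {
            int byCount = Long.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : Integer.compare(a, b);
        });
        return new ArrayList<>(keys.subList(0, Math.min(k, keys.size())));
    }

    /**
     * Add-alpha log-probability. The denominator uses vocabSize + 1 so that one
     * slot is reserved for all out-of-vocabulary bigrams (an assumption here).
     * Pass count = 0 to get the out-of-vocabulary value.
     */
    static double logP(long count, long total, int vocabSize, double alpha) {
        return Math.log((count + alpha) / (total + alpha * (vocabSize + 1)));
    }
}
```

With this layout, a bigram outside the kept vocabulary never receives zero probability, so a single unseen byte pair cannot rule out a class on its own.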
Standard training-data layout (per
BuildCharsetTrainingData): one <charset>.bin.gz per
class, each containing variable-length [uint16 len][bytes] samples.
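A reader for the [uint16 len][bytes] sample layout might look like the sketch below. The class name SampleReaderSketch is hypothetical, and the byte order of the length prefix is not stated here; big-endian is assumed because that is what DataInputStream.readUnsignedShort reads.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

/** Hypothetical reader for one gzip-compressed per-class sample file. */
public class SampleReaderSketch {

    /** Reads up to maxSamples length-prefixed samples from a gzip stream. */
    static List<byte[]> readSamples(InputStream gz, int maxSamples) throws IOException {
        List<byte[]> out = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new GZIPInputStream(gz))) {
            while (out.size() < maxSamples) {
                int len;
                try {
                    len = in.readUnsignedShort(); // big-endian uint16 (assumption)
                } catch (EOFException endOfFile) {
                    break; // no further length prefix: end of samples
                }
                byte[] sample = new byte[len];
                in.readFully(sample); // throws EOFException if the sample is truncated
                out.add(sample);
            }
        }
        return out;
    }
}
```

The maxSamples argument mirrors the role of --max-samples-per-class: reading stops once the cap is reached, even if the file holds more samples.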
Usage:
java TrainNaiveBayesBigram \
--data /path/to/chardet-training \
--output nb-bigram.bin \
[--top-bigrams 2000] \
[--alpha 1.0] \
[--max-samples-per-class 50000] \
[--classes cs1,cs2,...] # optional class filter
Default class set (35 classes, the v6 shipped model): listed
in V6_SHIPPED_CLASSES. Override with --classes.
Field Details
-
MAGIC
public static final int MAGIC
Binary magic for the saved model: "NBB3".
v3 = v2 with int8 quantization applied at save time. Per-class
logP values quantize via scale[c] = maxAbs(class c logP column) / 127; the global IDF table quantizes via idfScale = maxAbs(idf) / 127. The saved file is ~2-3× smaller than v2 at the same coverage; the in-memory footprint is 4× smaller because the detector no longer needs to materialize a float array at load.
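The maxAbs / 127 scale rule can be illustrated with a small sketch. Only the scale formula comes from the description above; the class name Int8QuantSketch, the round-to-nearest step, and the clamp to [-127, 127] are assumptions about how such a quantizer is typically written.

```java
/** Hypothetical symmetric int8 quantizer using scale = maxAbs / 127. */
public class Int8QuantSketch {

    /** Computes the scale for one column of values (for example one class's logP column). */
    static float scaleFor(float[] values) {
        float maxAbs = 0f;
        for (float v : values) {
            maxAbs = Math.max(maxAbs, Math.abs(v));
        }
        return maxAbs / 127f;
    }

    /** Quantizes to int8 by rounding value / scale and clamping to [-127, 127]. */
    static byte[] quantize(float[] values, float scale) {
        byte[] q = new byte[values.length];
        if (scale == 0f) {
            return q; // all-zero column: every quantized value stays 0
        }
        for (int i = 0; i < values.length; i++) {
            int rounded = Math.round(values[i] / scale);
            q[i] = (byte) Math.max(-127, Math.min(127, rounded));
        }
        return q;
    }

    /** Recovers an approximate float from one quantized value. */
    static float dequantize(byte q, float scale) {
        return q * scale;
    }
}
```

Because each stored value is one byte instead of a four-byte float, a consumer that scores directly on the int8 values and multiplies by the scale afterwards would account for the 4× in-memory reduction mentioned above.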
-
VERSION
public static final int VERSION
-
-
Constructor Details
-
TrainNaiveBayesBigram
public TrainNaiveBayesBigram()
-
-
Method Details
-
main
- Throws:
IOException
-