Class TrainNaiveBayesBigram

java.lang.Object
org.apache.tika.ml.chardetect.tools.TrainNaiveBayesBigram

public class TrainNaiveBayesBigram extends Object
Naive-Bayes byte-bigram charset classifier trainer.

For each charset, the trainer counts all stride-1 (overlapping) byte bigrams across the training samples, keeps the K most frequent (--top-bigrams, default 2000), and applies Laplace add-α smoothing so that out-of-vocabulary bigrams still receive a nonzero probability. The output is a binary model file consumed by NaiveBayesBigramEncodingDetector.
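
The counting, top-K, and smoothing steps described above can be illustrated with a minimal sketch. It is not the trainer's actual code: the class name BigramSketch, the method names, the key packing (first byte in the high 8 bits), the tie-breaking rule, and the choice of 65536 as the smoothing vocabulary size are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of per-class bigram statistics; not the real trainer code. */
class BigramSketch {

    /** Assumed smoothing vocabulary: every possible byte pair. */
    static final int VOCAB = 65536;

    /** Counts stride-1 (overlapping) byte bigrams; key = (first << 8) | second. */
    static int[] countBigrams(List<byte[]> samples) {
        int[] counts = new int[VOCAB];
        for (byte[] s : samples) {
            for (int i = 0; i + 1 < s.length; i++) {
                int key = ((s[i] & 0xFF) << 8) | (s[i + 1] & 0xFF);
                counts[key]++;
            }
        }
        return counts;
    }

    /** Returns up to k bigram keys by descending count; ties go to the smaller key. */
    static int[] topK(int[] counts, int k) {
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0) {
                keys.add(i);
            }
        }
        keys.sort((a, b) -> counts[a] != counts[b]
                ? Integer.compare(counts[b], counts[a])
                : Integer.compare(a, b));
        int n = Math.min(k, keys.size());
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[i] = keys.get(i);
        }
        return out;
    }

    /**
     * Add-alpha smoothed log-probability. A bigram outside the kept set is
     * treated as count 0, so it gets log(alpha / (total + alpha * VOCAB)).
     */
    static double logProb(int count, long total, double alpha) {
        return Math.log((count + alpha) / (total + alpha * VOCAB));
    }
}
```

With α > 0 an unseen bigram never contributes negative infinity to a class score; it only contributes the floor value returned by logProb(0, total, alpha).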

Standard training-data layout (per BuildCharsetTrainingData): one <charset>.bin.gz per class, each containing variable-length [uint16 len][bytes] samples.
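
A reader for this record layout might look like the following sketch. The class and method names are hypothetical, and the byte order of the uint16 length prefix is not stated here; the sketch assumes big-endian, which is what DataInputStream.readUnsignedShort reads.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

/** Hypothetical reader for [uint16 len][bytes] records in a gzip stream. */
class SampleReaderSketch {

    /** Reads records until end of stream or until maxSamples have been read. */
    static List<byte[]> readSamples(InputStream gz, int maxSamples) throws IOException {
        List<byte[]> samples = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new GZIPInputStream(gz))) {
            while (samples.size() < maxSamples) {
                int len;
                try {
                    // Assumption: the length prefix is big-endian.
                    len = in.readUnsignedShort();
                } catch (EOFException eof) {
                    break; // no further length prefix: end of data
                }
                byte[] buf = new byte[len];
                in.readFully(buf); // throws EOFException on a truncated record
                samples.add(buf);
            }
        }
        return samples;
    }
}
```

The maxSamples parameter mirrors the role of --max-samples-per-class in the usage below; how the real trainer chooses which samples to keep is not specified here.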

Usage:

   java TrainNaiveBayesBigram \
     --data /path/to/chardet-training \
     --output nb-bigram.bin \
     [--top-bigrams 2000] \
     [--alpha 1.0] \
     [--max-samples-per-class 50000] \
     [--classes cs1,cs2,...]    # optional class filter
 

Default class set: the 35 classes of the shipped v6 model, listed in V6_SHIPPED_CLASSES. Override with --classes.

  • Field Details

    • MAGIC

      public static final int MAGIC
      Binary magic for the saved model — "NBB3".

      Format v3 is v2 with int8 quantization applied at save time. Per-class logP values are quantized with scale[c] = maxAbs(class c logP column) / 127, and the global IDF table is quantized with idfScale = maxAbs(idf) / 127. The saved file is roughly 2-3× smaller than v2 at the same coverage, and the in-memory footprint is 4× smaller because the detector no longer needs to materialize a float array at load.
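
The scale formula above corresponds to symmetric int8 quantization, which can be sketched as follows. The class and method names are hypothetical, and the rounding mode, the clamp to [-127, 127], and the fallback scale for an all-zero column are assumptions rather than documented behavior of the trainer.

```java
/** Hypothetical sketch of symmetric int8 quantization with scale = maxAbs / 127. */
class QuantizeSketch {

    /** Computes maxAbs(values) / 127; falls back to 1 for an all-zero column (assumption). */
    static float scaleFor(float[] values) {
        float maxAbs = 0f;
        for (float v : values) {
            maxAbs = Math.max(maxAbs, Math.abs(v));
        }
        return maxAbs == 0f ? 1f : maxAbs / 127f;
    }

    /** Maps each value to the nearest int8 step, clamped to [-127, 127]. */
    static byte[] quantize(float[] values, float scale) {
        byte[] q = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            int r = Math.round(values[i] / scale);
            q[i] = (byte) Math.max(-127, Math.min(127, r));
        }
        return q;
    }

    /** Recovers an approximate float from a quantized byte. */
    static float dequantize(byte q, float scale) {
        return q * scale;
    }
}
```

Each stored value takes one byte instead of the four bytes of a float, which matches the 4× in-memory figure quoted above, at the cost of a rounding error of at most about half a scale step per value.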

    • VERSION

      public static final int VERSION
  • Constructor Details

    • TrainNaiveBayesBigram

      public TrainNaiveBayesBigram()
  • Method Details