org.apache.tika.ml.chardetect.tools.TrainNaiveBayesBigram

public class TrainNaiveBayesBigram extends Object

Naive-Bayes byte-bigram charset classifier trainer.

For each charset, counts all stride-1 byte bigrams across training samples, keeps the top-K most frequent (default 2000), and applies Laplace add-α smoothing for out-of-vocabulary bigrams. Output is a binary model file consumed by NaiveBayesBigramEncodingDetector.
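The counting, top-K selection and smoothing steps described above can be sketched as follows. This is a minimal illustration, not the trainer's actual code: the class and method names (`BigramCountSketch`, `countBigrams`, `topK`, `logProb`) are hypothetical, and the normalization (kept bigrams plus a single out-of-vocabulary bucket) is an assumption chosen so that the smoothed probabilities sum to 1.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BigramCountSketch {

    /** Counts every overlapping (stride-1) byte bigram, keyed by (b0 << 8) | b1. */
    static Map<Integer, Long> countBigrams(List<byte[]> samples) {
        Map<Integer, Long> counts = new HashMap<>();
        for (byte[] s : samples) {
            for (int i = 0; i + 1 < s.length; i++) {
                int key = ((s[i] & 0xFF) << 8) | (s[i + 1] & 0xFF);
                counts.merge(key, 1L, Long::sum);
            }
        }
        return counts;
    }

    /** Returns up to k bigram keys, most frequent first; ties are broken by ascending key. */
    static List<Integer> topK(Map<Integer, Long> counts, int k) {
        List<Integer> keys = new ArrayList<>(counts.keySet());
        keys.sort((a, b) -> {
            int byCount = Long.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : Integer.compare(a, b);
        });
        return new ArrayList<>(keys.subList(0, Math.min(k, keys.size())));
    }

    /**
     * Add-alpha smoothed log-probability. Assumption: the vocabulary is the kept
     * bigrams plus one out-of-vocabulary bucket, and total is the summed count
     * of the kept bigrams. Pass count = 0 for an out-of-vocabulary bigram.
     */
    static double logProb(long count, long total, double alpha, int keptBigrams) {
        return Math.log((count + alpha) / (total + alpha * (keptBigrams + 1)));
    }

    public static void main(String[] args) {
        List<byte[]> samples = List.of(
                "abab".getBytes(StandardCharsets.US_ASCII),
                "abc".getBytes(StandardCharsets.US_ASCII));
        Map<Integer, Long> counts = countBigrams(samples);
        List<Integer> kept = topK(counts, 2);
        long total = 0;
        for (int key : kept) {
            total += counts.get(key);
        }
        for (int key : kept) {
            System.out.printf("0x%04X logP=%.4f%n", key,
                    logProb(counts.get(key), total, 1.0, kept.size()));
        }
        System.out.printf("OOV    logP=%.4f%n", logProb(0, total, 1.0, kept.size()));
    }
}
```

In this sketch `k` plays the role of `--top-bigrams` and `alpha` the role of `--alpha`; how the real trainer normalizes the smoothed counts may differ.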

Standard training-data layout (per BuildCharsetTrainingData): one <charset>.bin.gz per class, each containing variable-length [uint16 len][bytes] samples.
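A reader for the layout above might look like the following sketch. The byte order of the `uint16` length prefix is not stated here, so big-endian (what `DataInputStream.readUnsignedShort` reads) is an assumption; `SampleReaderSketch`, `readSamples` and the `maxSamples` parameter are hypothetical names used only for illustration.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

class SampleReaderSketch {

    /**
     * Reads up to maxSamples length-prefixed samples from a gzip-compressed stream.
     * Each record is [uint16 len][len bytes]; the length is assumed big-endian.
     */
    static List<byte[]> readSamples(InputStream gzipped, int maxSamples) throws IOException {
        List<byte[]> samples = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new GZIPInputStream(gzipped))) {
            while (samples.size() < maxSamples) {
                int len;
                try {
                    len = in.readUnsignedShort();
                } catch (EOFException eof) {
                    break; // no further length prefix: treat as end of data
                }
                byte[] sample = new byte[len];
                in.readFully(sample); // throws EOFException if the record is truncated
                samples.add(sample);
            }
        }
        return samples;
    }
}
```

A caller would typically open one `<charset>.bin.gz` file per class with `Files.newInputStream` and pass a cap comparable to `--max-samples-per-class`.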

Usage:

   java TrainNaiveBayesBigram \
     --data /path/to/chardet-training \
     --output nb-bigram.bin \
     [--top-bigrams 2000] \
     [--alpha 1.0] \
     [--max-samples-per-class 50000] \
     [--classes cs1,cs2,...]    # optional class filter

Default class set: the 35 charsets of the shipped v6 model, listed in V6_SHIPPED_CLASSES. Override it with --classes.

Field Summary

Fields

| Modifier and Type | Field | Description |
| --- | --- | --- |
| static final int | MAGIC | Binary magic for the saved model: "NBB3". |
| static final int | VERSION | |

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| TrainNaiveBayesBigram() | |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| static void | main(String[] args) | |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- MAGIC
  
  public static final int MAGIC
  
  Binary magic for the saved model — "NBB3".
  Version 3 is version 2 with int8 quantization applied at save time. Per-class logP values are quantized with scale[c] = maxAbs(class c logP column) / 127; the global IDF table is quantized with idfScale = maxAbs(idf) / 127. The saved file is roughly 2-3× smaller than v2 at the same coverage, and the in-memory footprint is 4× smaller because the detector no longer needs to materialize a float array at load.
  See Also:
  
  Constant Field Values
- VERSION
  
  public static final int VERSION
  See Also:
  
  Constant Field Values
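The int8 quantization described for MAGIC (scale = maxAbs / 127) can be illustrated with the sketch below. It is not the trainer's code: `Int8QuantSketch` and its methods are hypothetical, and round-to-nearest with clamping to [-127, 127] is an assumption, as is the fallback scale of 1 for an all-zero column.

```java
class Int8QuantSketch {

    /** scale = maxAbs(values) / 127; falls back to 1 when every value is zero (assumption). */
    static float scaleFor(float[] values) {
        float maxAbs = 0f;
        for (float v : values) {
            maxAbs = Math.max(maxAbs, Math.abs(v));
        }
        return maxAbs == 0f ? 1f : maxAbs / 127f;
    }

    /** Quantizes each value to the nearest multiple of scale, clamped to [-127, 127]. */
    static byte[] quantize(float[] values, float scale) {
        byte[] q = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            int rounded = Math.round(values[i] / scale);
            q[i] = (byte) Math.max(-127, Math.min(127, rounded));
        }
        return q;
    }

    /** Recovers an approximate float from a stored int8 value. */
    static float dequantize(byte q, float scale) {
        return q * scale;
    }

    public static void main(String[] args) {
        float[] logP = {-12.7f, -2.54f, 0f}; // one class's logP column (made-up values)
        float scale = scaleFor(logP);
        byte[] q = quantize(logP, scale);
        for (int i = 0; i < q.length; i++) {
            System.out.printf("%.3f -> %d -> %.3f%n", logP[i], q[i], dequantize(q[i], scale));
        }
    }
}
```

With round-to-nearest, the reconstruction error per value is at most about half a quantization step (scale / 2). A detector that scores directly on the stored bytes and multiplies by the scale afterwards would not need a float copy of the table, which is consistent with the footprint claim above.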
Constructor Details
- TrainNaiveBayesBigram
  
  public TrainNaiveBayesBigram()
Method Details
- main
  
  public static void main(String[] args) throws IOException
  
  Throws:
  
  IOException
