Class CharSoupModel

java.lang.Object
org.apache.tika.langdetect.charsoup.CharSoupModel

public class CharSoupModel extends Object
INT8-quantized multinomial logistic regression model for language detection.

Binary format (big-endian, magic "LDM1"):

   v1 layout:
   Offset  Field
   0       4B magic: 0x4C444D31
   4       4B version: 1
   8       4B numBuckets (B)
   12      4B numClasses (C)
   16+     Labels: C entries of [2B length + UTF-8 bytes]
           Scales: C × 4B float (per-class dequantization)
           Biases: C × 4B float (per-class bias term)
           Weights: B × C bytes (bucket-major, INT8 signed)

   v2 layout (adds feature flags after numClasses):
   Offset  Field
   0       4B magic: 0x4C444D31
   4       4B version: 2
   8       4B numBuckets (B)
   12      4B numClasses (C)
   16      4B featureFlags (bitmask of FLAG_* constants)
   20+     Labels, Scales, Biases, Weights (same as v1)
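
The fixed-size header described above can be parsed with a few big-endian integer reads. The following is a minimal sketch, not the library's loader (that is load(InputStream)); the class name LdmHeaderSketch and its members are hypothetical, and it stops before the variable-length label table.

```java
import java.nio.ByteBuffer;
import java.util.OptionalInt;

// Hypothetical helper for illustration only; not part of the Tika API.
final class LdmHeaderSketch {
    static final int MAGIC = 0x4C444D31; // the bytes 'L' 'D' 'M' '1'

    final int version;
    final int numBuckets;
    final int numClasses;
    final OptionalInt featureFlags; // present only when version == 2

    private LdmHeaderSketch(int version, int numBuckets, int numClasses,
                            OptionalInt featureFlags) {
        this.version = version;
        this.numBuckets = numBuckets;
        this.numClasses = numClasses;
        this.featureFlags = featureFlags;
    }

    /** Parses the fixed-size header found at the start of {@code bytes}. */
    static LdmHeaderSketch parse(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes); // ByteBuffer is big-endian by default
        int magic = buf.getInt();
        if (magic != MAGIC) {
            throw new IllegalArgumentException("Bad magic: 0x" + Integer.toHexString(magic));
        }
        int version = buf.getInt();
        if (version != 1 && version != 2) {
            throw new IllegalArgumentException("Unsupported version: " + version);
        }
        int numBuckets = buf.getInt();
        int numClasses = buf.getInt();
        // Only the v2 layout carries a featureFlags word after numClasses.
        OptionalInt flags = version == 2
                ? OptionalInt.of(buf.getInt())
                : OptionalInt.empty();
        return new LdmHeaderSketch(version, numBuckets, numClasses, flags);
    }

    /** Byte offset at which the label table starts: 16 for v1, 20 for v2. */
    int labelsOffset() {
        return featureFlags.isPresent() ? 20 : 16;
    }
}
```

After the header, a reader would continue with the C length-prefixed labels, then the scales, biases and weights, in the order listed in the layout.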
 

Weights are stored in bucket-major order: weights[bucket * numClasses + class]. This layout suits the sparse dot product in predict(int[]): each non-zero bucket reads a contiguous run of numClasses bytes, which is friendly to SIMD and cache prefetching.
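
To make the access pattern concrete, here is a minimal sketch of a sparse dot product over a flat bucket-major weight array. The class SparseLogitsSketch is hypothetical, and the way it combines scales and biases (logit = bias + scale × accumulated INT8 sum) is an assumption for illustration; this page does not state the exact formula used by the real implementation.

```java
// Hypothetical illustration; not the Tika implementation.
final class SparseLogitsSketch {

    /**
     * @param weights  flat bucket-major INT8 weights, length numBuckets * numClasses
     * @param scales   per-class dequantization scale
     * @param biases   per-class bias term
     * @param features per-bucket feature counts, length numBuckets
     */
    static float[] logits(byte[] weights, float[] scales, float[] biases, int[] features) {
        int numClasses = biases.length;
        int[] acc = new int[numClasses];
        for (int bucket = 0; bucket < features.length; bucket++) {
            int count = features[bucket];
            if (count == 0) {
                continue; // sparse: zero buckets contribute nothing
            }
            // All classes for one bucket sit next to each other in memory.
            int base = bucket * numClasses;
            for (int c = 0; c < numClasses; c++) {
                acc[c] += count * weights[base + c]; // Java bytes are signed
            }
        }
        float[] out = new float[numClasses];
        for (int c = 0; c < numClasses; c++) {
            // Assumed dequantization: one float multiply per class, after the integer sum.
            out[c] = biases[c] + scales[c] * acc[c];
        }
        return out;
    }
}
```

The inner loop walks numClasses consecutive bytes per non-zero bucket, which is the contiguous run the paragraph above refers to.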

Feature extraction always uses ScriptAwareFeatureExtractor, which produces character bigrams (with sentinels for non-CJK), whole-word unigrams, CJK character unigrams, and CJK space bridging.

  • Field Summary

    Fields
    Modifier and Type   Field                Description
    static final int    FLAG_4GRAMS          Feature flag: enable character 4-grams.
    static final int    FLAG_5GRAMS          Feature flag: enable character 5-grams.
    static final int    FLAG_CHAR_UNIGRAMS   Feature flag: enable non-CJK character unigrams.
    static final int    FLAG_L2_NORM         Feature flag: L2-normalize the feature vector before prediction.
    static final int    FLAG_PREFIX          Feature flag: enable 3-char word prefixes.
    static final int    FLAG_SCRIPT_BLOCKS   Feature flag: enable script-block presence + transition features.
    static final int    FLAG_SKIP_BIGRAMS    Feature flag: enable skip bigrams.
    static final int    FLAG_SUFFIX4         Feature flag: enable 4-char word suffixes.
    static final int    FLAG_SUFFIXES        Feature flag: enable 3-char word suffixes.
    static final int    FLAG_TRIGRAMS        Feature flag: enable character trigrams.
    static final int    FLAG_WORD_BIGRAMS    Feature flag: short-word-anchored word bigrams (hash pairs where anchor is 1–3 chars).
    static final int    FLAG_WORD_LENGTH     Feature flag: non-CJK word length features (exact length, capped).
    static final int    FLAG_WORD_UNIGRAMS   Feature flag: enable whole-word unigrams.
    static final int    V1_DEFAULT_FLAGS     Default flags for v1 models (word unigrams only).
  • Constructor Summary

    Constructors
    Constructor
    Description
    CharSoupModel(int numBuckets, int numClasses, String[] labels, float[] scales, float[] biases, byte[][] weights)
    Construct from class-major byte[][] weights with default feature configuration (word unigrams only — backward compatible with v1).
    CharSoupModel(int numBuckets, int numClasses, String[] labels, float[] scales, float[] biases, byte[][] weights, int featureFlags)
    Construct from class-major byte[][] weights with explicit feature flags.
  • Method Summary

    Modifier and Type      Method                                   Description
    FeatureExtractor       createExtractor()                        Create the production FeatureExtractor for this model by dispatching on the featureFlags embedded in the binary.
    static float           entropy(float[] probs)                   Shannon entropy (in bits) of a probability distribution.
    float[]                getBiases()
    int                    getFeatureFlags()
    String                 getLabel(int classIndex)
    String[]               getLabels()
    int                    getNumBuckets()
    int                    getNumClasses()
    float[]                getScales()
    byte[][]               getWeights()                             Return weights in class-major [class][bucket] layout.
    static CharSoupModel   load(InputStream is)                     Load a model from an input stream.
    static CharSoupModel   loadFromClasspath(String resourcePath)   Load a model from the classpath.
    float[]                predict(int[] features)                  Compute softmax probabilities for the given feature vector.
    float[]                predictLogits(int[] features)            Compute raw logits (pre-softmax scores) for the given feature vector.
    void                   save(OutputStream os)                    Write the model in LDM2 binary format (includes feature flags).
    static float[]         softmax(float[] logits)                  In-place softmax with numerical stability.
    CharSoupModel          withFeatureFlags(int newFlags)           Returns a new model with the same weights but a different feature-flags bitmask.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • FLAG_TRIGRAMS

      public static final int FLAG_TRIGRAMS
      Feature flag: enable character trigrams.
    • FLAG_SKIP_BIGRAMS

      public static final int FLAG_SKIP_BIGRAMS
      Feature flag: enable skip bigrams.
    • FLAG_SUFFIXES

      public static final int FLAG_SUFFIXES
      Feature flag: enable 3-char word suffixes.
    • FLAG_SUFFIX4

      public static final int FLAG_SUFFIX4
      Feature flag: enable 4-char word suffixes.
    • FLAG_PREFIX

      public static final int FLAG_PREFIX
      Feature flag: enable 3-char word prefixes.
    • FLAG_WORD_UNIGRAMS

      public static final int FLAG_WORD_UNIGRAMS
      Feature flag: enable whole-word unigrams.
    • FLAG_CHAR_UNIGRAMS

      public static final int FLAG_CHAR_UNIGRAMS
      Feature flag: enable non-CJK character unigrams.
    • FLAG_4GRAMS

      public static final int FLAG_4GRAMS
      Feature flag: enable character 4-grams.
    • FLAG_5GRAMS

      public static final int FLAG_5GRAMS
      Feature flag: enable character 5-grams.
    • FLAG_SCRIPT_BLOCKS

      public static final int FLAG_SCRIPT_BLOCKS
      Feature flag: enable script-block presence + transition features.
    • FLAG_L2_NORM

      public static final int FLAG_L2_NORM
      Feature flag: L2-normalize the feature vector before prediction.
    • FLAG_WORD_BIGRAMS

      public static final int FLAG_WORD_BIGRAMS
      Feature flag: short-word-anchored word bigrams (hash pairs where anchor is 1–3 chars).
    • FLAG_WORD_LENGTH

      public static final int FLAG_WORD_LENGTH
      Feature flag: non-CJK word length features (exact length, capped).
    • V1_DEFAULT_FLAGS

      public static final int V1_DEFAULT_FLAGS
      Default flags for v1 models (word unigrams only).
  • Constructor Details

    • CharSoupModel

      public CharSoupModel(int numBuckets, int numClasses, String[] labels, float[] scales, float[] biases, byte[][] weights)
      Construct from class-major byte[][] weights with default feature configuration (word unigrams only — backward compatible with v1).
    • CharSoupModel

      public CharSoupModel(int numBuckets, int numClasses, String[] labels, float[] scales, float[] biases, byte[][] weights, int featureFlags)
      Construct from class-major byte[][] weights with explicit feature flags.
      Parameters:
      featureFlags - bitmask of FLAG_* constants
  • Method Details

    • loadFromClasspath

      public static CharSoupModel loadFromClasspath(String resourcePath) throws IOException
      Load a model from the classpath.
      Throws:
      IOException
    • load

      public static CharSoupModel load(InputStream is) throws IOException
      Load a model from an input stream. Supports both v1 (LDM1) and v2 (LDM2) formats.
      Throws:
      IOException
    • save

      public void save(OutputStream os) throws IOException
      Write the model in LDM2 binary format (includes feature flags).
      Throws:
      IOException
    • predict

      public float[] predict(int[] features)
      Compute softmax probabilities for the given feature vector. Uses a sparse inner loop — only non-zero buckets are visited.
      Parameters:
      features - int array of size numBuckets
      Returns:
      float array of size numClasses (softmax probabilities, sum ≈ 1.0)
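
A common next step is to pick the most probable class from the returned array. The sketch below shows a plain argmax; the class TopLabelSketch is hypothetical and works on any probability array, so it does not depend on a loaded model.

```java
// Hypothetical helper for illustration; not part of the Tika API.
final class TopLabelSketch {

    /** Returns the index of the largest value; ties resolve to the lowest index. */
    static int argmax(float[] probs) {
        if (probs.length == 0) {
            throw new IllegalArgumentException("probs must not be empty");
        }
        int best = 0;
        for (int i = 1; i < probs.length; i++) {
            if (probs[i] > probs[best]) {
                best = i;
            }
        }
        return best;
    }
}
```

With a model instance, the winning index can then be mapped to a label through getLabel(int classIndex).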
    • predictLogits

      public float[] predictLogits(int[] features)
      Compute raw logits (pre-softmax scores) for the given feature vector. Higher logits indicate stronger match. Unlike predict(int[]), this preserves the full dynamic range of the model's output, which is useful when comparing confidence across different input texts.
      Parameters:
      features - int array of size numBuckets
      Returns:
      float array of size numClasses (raw logits, not normalized)
    • softmax

      public static float[] softmax(float[] logits)
      In-place softmax with numerical stability.
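
For readers unfamiliar with the technique, a numerically stable softmax usually subtracts the maximum logit before exponentiating, so the largest exponent is exp(0) = 1 and overflow is avoided. The sketch below is one standard way to do this in place; the class SoftmaxSketch is hypothetical and may differ in detail from the library's method.

```java
// Hypothetical illustration of a max-subtraction softmax; not the Tika source.
final class SoftmaxSketch {

    /** Overwrites {@code logits} with softmax probabilities and returns the same array. */
    static float[] softmaxInPlace(float[] logits) {
        if (logits.length == 0) {
            return logits;
        }
        float max = logits[0];
        for (float v : logits) {
            max = Math.max(max, v);
        }
        float sum = 0f;
        for (int i = 0; i < logits.length; i++) {
            // Shifting by max leaves the result unchanged but keeps exp() in range.
            logits[i] = (float) Math.exp(logits[i] - max);
            sum += logits[i];
        }
        for (int i = 0; i < logits.length; i++) {
            logits[i] /= sum;
        }
        return logits;
    }
}
```

Because the input is overwritten, callers that still need the raw logits should copy the array first.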
    • entropy

      public static float entropy(float[] probs)
      Shannon entropy (in bits) of a probability distribution.
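
Shannon entropy in bits is H = -Σ p·log2(p), with terms where p = 0 treated as zero. The sketch below computes it directly; the class EntropySketch is hypothetical and is offered only to make the definition concrete.

```java
// Hypothetical illustration of the definition; not the Tika source.
final class EntropySketch {

    /** Shannon entropy in bits; zero-probability entries contribute nothing. */
    static float entropyBits(float[] probs) {
        double h = 0.0;
        for (float p : probs) {
            if (p > 0f) {
                h -= p * (Math.log(p) / Math.log(2.0)); // log base 2
            }
        }
        return (float) h;
    }
}
```

A uniform distribution over n classes gives log2(n) bits, the maximum for that n, while a distribution concentrated on one class gives 0.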
    • getNumBuckets

      public int getNumBuckets()
    • getNumClasses

      public int getNumClasses()
    • getLabels

      public String[] getLabels()
    • getLabel

      public String getLabel(int classIndex)
    • getScales

      public float[] getScales()
    • getBiases

      public float[] getBiases()
    • getWeights

      public byte[][] getWeights()
      Return weights in class-major [class][bucket] layout. Creates a new array on each call.
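
The relationship between the class-major byte[][] view and the bucket-major order weights[bucket * numClasses + class] is a simple transpose. The sketch below shows both directions; the class WeightLayoutSketch is hypothetical, and holding the bucket-major weights in a single flat byte[] is an assumption about the internal representation.

```java
// Hypothetical illustration of the two layouts; not the Tika source.
final class WeightLayoutSketch {

    /** Transposes class-major [class][bucket] weights into a flat bucket-major array. */
    static byte[] toBucketMajor(byte[][] classMajor) {
        int numClasses = classMajor.length;
        int numBuckets = numClasses == 0 ? 0 : classMajor[0].length;
        byte[] flat = new byte[numBuckets * numClasses];
        for (int c = 0; c < numClasses; c++) {
            for (int b = 0; b < numBuckets; b++) {
                flat[b * numClasses + c] = classMajor[c][b];
            }
        }
        return flat;
    }

    /** Rebuilds a fresh class-major [class][bucket] array from the flat layout. */
    static byte[][] toClassMajor(byte[] flat, int numBuckets, int numClasses) {
        byte[][] classMajor = new byte[numClasses][numBuckets];
        for (int b = 0; b < numBuckets; b++) {
            for (int c = 0; c < numClasses; c++) {
                classMajor[c][b] = flat[b * numClasses + c];
            }
        }
        return classMajor;
    }
}
```

Since the class-major view is rebuilt on demand, callers that need it repeatedly may prefer to cache the result.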
    • createExtractor

      public FeatureExtractor createExtractor()
      Create the production FeatureExtractor for this model by dispatching on the featureFlags embedded in the binary.

      Supported flag sets:

      Throws:
      IllegalStateException - if the flags do not match any known production extractor. Experimental configs should use ResearchFeatureExtractor in the test module.
    • getFeatureFlags

      public int getFeatureFlags()
    • withFeatureFlags

      public CharSoupModel withFeatureFlags(int newFlags)
      Returns a new model with the same weights but a different feature-flags bitmask. Useful for correcting flags on models saved before this field was properly set.
      Parameters:
      newFlags - bitmask of FLAG_* constants
      Returns:
      copy of this model with updated feature flags