Class LinearModel

java.lang.Object
org.apache.tika.ml.LinearModel

public class LinearModel extends Object
INT8-quantized multinomial logistic regression model for classification.

Binary format (big-endian, magic "LDM1"):

   Offset  Field
   0       4B magic: 0x4C444D31
   4       4B version: 1 or 2
   8       4B numBuckets (B)
   12      4B numClasses (C)
   16+     Labels: C entries of [2B length + UTF-8 bytes]
           Scales: C × 4B float (per-class dequantization)
           Biases: C × 4B float (per-class bias term)
           (V2 only)
           1B hasCalibration flag
           If hasCalibration: ClassMean: C × 4B float, ClassStd: C × 4B float
           Weights: B × C bytes (bucket-major, INT8 signed)
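
As an illustration of the layout above, the fixed header and one label entry could be read roughly as follows. This is a minimal sketch, not the library's loader: the class LdmHeaderSketch and its methods are hypothetical, and treating the 2-byte label length as unsigned is an assumption the table does not state.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class LdmHeaderSketch {
    static final int MAGIC = 0x4C444D31; // ASCII "LDM1"

    /** Reads the four fixed 4-byte fields; returns {version, numBuckets, numClasses}. */
    static int[] readHeader(DataInputStream in) throws IOException {
        int magic = in.readInt(); // DataInputStream reads big-endian
        if (magic != MAGIC) {
            throw new IOException("Bad magic: 0x" + Integer.toHexString(magic));
        }
        int version = in.readInt();
        if (version != 1 && version != 2) {
            throw new IOException("Unsupported version: " + version);
        }
        int numBuckets = in.readInt();
        int numClasses = in.readInt();
        return new int[] {version, numBuckets, numClasses};
    }

    /** Reads one label entry: 2-byte length (assumed unsigned) + UTF-8 bytes. */
    static String readLabel(DataInputStream in) throws IOException {
        int len = in.readUnsignedShort();
        byte[] buf = new byte[len];
        in.readFully(buf);
        return new String(buf, StandardCharsets.UTF_8);
    }

    /** Builds a tiny fake prefix: V2 header, 1024 buckets, 3 classes, first label "en". */
    static byte[] sampleBytes() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(MAGIC);
        out.writeInt(2);
        out.writeInt(1024);
        out.writeInt(3);
        byte[] label = "en".getBytes(StandardCharsets.UTF_8);
        out.writeShort(label.length);
        out.write(label);
        out.flush();
        return bos.toByteArray();
    }
}
```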
 

Weights are stored in bucket-major order: weights[bucket * numClasses + class]. This layout suits the sparse dot product in predict(int[]): each non-zero bucket reads a contiguous run of numClasses bytes, which is friendly to SIMD and cache prefetching.
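
To make the access pattern concrete, here is a minimal sketch of a sparse dot product over a bucket-major INT8 array. The class BucketMajorSketch is hypothetical. The dequantization formula used (bias[c] + scale[c] * integer accumulator) is an assumption based on the field descriptions, and the per-bucket clipping that predictLogits(int[]) is documented to apply is left out.

```java
// Hypothetical helper, for illustration only.
final class BucketMajorSketch {
    /**
     * Sparse dot product over flat bucket-major weights.
     * ASSUMPTION: logit[c] = biases[c] + scales[c] * sum_b(features[b] * w[b][c]).
     */
    static float[] logits(int[] features, byte[] weights, float[] scales, float[] biases) {
        int numClasses = scales.length;
        int[] acc = new int[numClasses];
        for (int b = 0; b < features.length; b++) {
            int f = features[b];
            if (f == 0) {
                continue; // zero buckets contribute nothing, so skip them
            }
            int row = b * numClasses; // start of this bucket's contiguous run
            for (int c = 0; c < numClasses; c++) {
                acc[c] += f * weights[row + c];
            }
        }
        float[] out = new float[numClasses];
        for (int c = 0; c < numClasses; c++) {
            out[c] = biases[c] + scales[c] * acc[c];
        }
        return out;
    }

    public static void main(String[] args) {
        // 3 buckets x 2 classes; only bucket 1 is non-zero.
        byte[] w = {1, 2, 3, -4, 5, 6};
        float[] result = logits(new int[] {0, 2, 0}, w,
                new float[] {0.5f, 1.0f}, new float[] {0.1f, 0.0f});
        System.out.println(java.util.Arrays.toString(result));
    }
}
```

The inner loop touches weights[row] through weights[row + numClasses - 1], so each active bucket costs one sequential read.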

Calibration (V2): optional per-class mean/std of training-set logits. When present, predictCalibratedLogits(int[]) standardizes raw logits so cross-specialist pooling can compare "unusually confident" signals on equal footing. V1 files are still readable; calibration is absent and predictCalibratedLogits(int[]) falls back to raw logits.

  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final int
     
    static final int
    Latest version we emit.
    static final int
     
    static final int
     
  • Constructor Summary

    Constructors
    Constructor
    Description
    LinearModel(int numBuckets, int numClasses, String[] labels, float[] scales, float[] biases, byte[][] weights)
    Construct without calibration (V1-compatible).
    LinearModel(int numBuckets, int numClasses, String[] labels, float[] scales, float[] biases, byte[][] weights, float[] classMean, float[] classStd)
    Construct with optional calibration.
  • Method Summary

    Modifier and Type
    Method
    Description
    static float
    entropy(float[] probs)
    Shannon entropy (in bits) of a probability distribution.
    float[]
    getBiases()
    float[]
    getClassMean()
    float[]
    getClassStd()
    String
    getLabel(int classIndex)
    String[]
    getLabels()
    int
    getNumBuckets()
    int
    getNumClasses()
    float[]
    getScales()
    byte[][]
    getWeights()
    Return weights in class-major [class][bucket] layout.
    boolean
    hasCalibration()
    true if this model carries per-class calibration statistics.
    static LinearModel
    load(InputStream is)
    Load a model from an input stream.
    static LinearModel
    loadFromClasspath(String resourcePath)
    Load a model from the classpath.
    static LinearModel
    loadFromPath(Path path)
    Load a model from a file on disk.
    float[]
    predict(int[] features)
    Compute softmax probabilities for the given feature vector.
    float[]
    predictCalibratedLogits(int[] features)
    Compute calibrated logits: (raw - classMean[c]) / classStd[c] for each class, if the model carries calibration statistics, else raw logits (no-op).
    float[]
    predictLogits(int[] features)
    Compute raw logits for the given feature vector (before softmax).
    float[]
    predictLogitsDense(float[] features)
    Compute logits for a dense float feature vector.
    void
    save(OutputStream os)
    Write the model in LDM binary format.
    static float[]
    softmax(float[] logits)
    In-place softmax with numerical stability.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

  • Constructor Details

    • LinearModel

      public LinearModel(int numBuckets, int numClasses, String[] labels, float[] scales, float[] biases, byte[][] weights)
      Construct without calibration (V1-compatible). Transposes class-major weights to bucket-major flat layout internally.
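
The transposition mentioned above can be sketched as below. TransposeSketch and its method names are hypothetical; the index formula bucket * numClasses + class comes from the class description, and the reverse direction mirrors what getWeights() is documented to return.

```java
// Hypothetical helper, for illustration only.
final class TransposeSketch {
    /** Class-major [class][bucket] to flat bucket-major [bucket * numClasses + class]. */
    static byte[] toBucketMajor(byte[][] classMajor, int numBuckets, int numClasses) {
        byte[] flat = new byte[numBuckets * numClasses];
        for (int c = 0; c < numClasses; c++) {
            for (int b = 0; b < numBuckets; b++) {
                flat[b * numClasses + c] = classMajor[c][b];
            }
        }
        return flat;
    }

    /** Flat bucket-major back to a freshly allocated class-major array. */
    static byte[][] toClassMajor(byte[] flat, int numBuckets, int numClasses) {
        byte[][] out = new byte[numClasses][numBuckets];
        for (int b = 0; b < numBuckets; b++) {
            for (int c = 0; c < numClasses; c++) {
                out[c][b] = flat[b * numClasses + c];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[][] classMajor = {{1, 2, 3}, {4, 5, 6}}; // 2 classes x 3 buckets
        byte[] flat = toBucketMajor(classMajor, 3, 2);
        System.out.println(java.util.Arrays.toString(flat));
    }
}
```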
    • LinearModel

      public LinearModel(int numBuckets, int numClasses, String[] labels, float[] scales, float[] biases, byte[][] weights, float[] classMean, float[] classStd)
      Construct with optional calibration. Pass classMean and classStd (each of length numClasses) to enable z-score calibration in predictCalibratedLogits(int[]); pass null for both to skip. Any classStd[c] == 0 is rewritten to 1.0f to avoid divide-by-zero.
  • Method Details

    • loadFromClasspath

      public static LinearModel loadFromClasspath(String resourcePath) throws IOException
      Load a model from the classpath. Transparently handles both plain LDM1 binaries and gzip-compressed LDM1 binaries (detected by magic bytes).
      Throws:
      IOException
    • loadFromPath

      public static LinearModel loadFromPath(Path path) throws IOException
      Load a model from a file on disk. Transparently handles both plain and gzip-compressed LDM1 files.
      Throws:
      IOException
    • load

      public static LinearModel load(InputStream is) throws IOException
      Load a model from an input stream. Transparently handles both plain LDM1 binaries and gzip-compressed ones: if the first two bytes are the gzip magic 0x1F 0x8B the stream is wrapped in a GZIPInputStream before reading.
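
One way to implement this kind of detection is to peek at the first two bytes and push them back before deciding whether to wrap the stream. The sketch below uses only java.io and java.util.zip; GzipSniffSketch and maybeGunzip are hypothetical names, and the library's actual mechanism may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, for illustration only.
final class GzipSniffSketch {
    /** Wraps the stream in a GZIPInputStream if it starts with 0x1F 0x8B. */
    static InputStream maybeGunzip(InputStream is) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(is, 2);
        byte[] head = new byte[2];
        int n = 0;
        while (n < 2) { // read() may return fewer bytes than requested
            int r = pb.read(head, n, 2 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        if (n > 0) {
            pb.unread(head, 0, n); // put the peeked bytes back
        }
        boolean gz = n == 2 && (head[0] & 0xFF) == 0x1F && (head[1] & 0xFF) == 0x8B;
        return gz ? new GZIPInputStream(pb) : pb;
    }

    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int r;
        while ((r = in.read(buf)) != -1) {
            bos.write(buf, 0, r);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = {0x4C, 0x44, 0x4D, 0x31}; // "LDM1"
        byte[] plain = readAll(maybeGunzip(new ByteArrayInputStream(payload)));
        byte[] unzipped = readAll(maybeGunzip(new ByteArrayInputStream(gzip(payload))));
        System.out.println(java.util.Arrays.equals(plain, unzipped));
    }
}
```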
      Throws:
      IOException
    • save

      public void save(OutputStream os) throws IOException
      Write the model in LDM binary format. Emits V2; the calibration block is included only when this model has calibration statistics.
      Throws:
      IOException
    • predictLogits

      public float[] predictLogits(int[] features)
      Compute raw logits for the given feature vector (before softmax). Uses a sparse inner loop — only non-zero buckets are visited.
      Parameters:
      features - int array of size numBuckets
      Returns:
      float array of size numClasses (raw, unnormalized logits)
    • predictLogitsDense

      public float[] predictLogitsDense(float[] features)
      Compute logits for a dense float feature vector. predictLogits(int[]) assumes sparse integer counts and applies per-bucket clipping to suppress single-feature dominance in hashed representations. This method instead performs a plain dot product, which is appropriate for adjudicator / meta-model feature vectors where each slot is already a calibrated quantity (specialist logit, z-score, one-hot flag, etc.).
      Parameters:
      features - float array of length numBuckets
      Returns:
      float array of length numClasses (raw logits)
    • predict

      public float[] predict(int[] features)
      Compute softmax probabilities for the given feature vector.
      Parameters:
      features - int array of size numBuckets
      Returns:
      float array of size numClasses (softmax probabilities, sum ≈ 1.0)
    • predictCalibratedLogits

      public float[] predictCalibratedLogits(int[] features)
      Compute calibrated logits: (raw - classMean[c]) / classStd[c] for each class, if the model carries calibration statistics, else raw logits (no-op). Calibrated logits are comparable across specialists with different natural logit scales — they express "how many standard deviations above this class's training-set mean" rather than raw weight arithmetic.
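
The z-score step can be sketched in isolation as follows. CalibrationSketch is a hypothetical stand-in, not the class's internals. The zero-std guard mirrors the constructor's documented rewrite of classStd[c] == 0 to 1.0f, and returning a copy of the raw logits when statistics are absent is an assumption about how the no-op is realized.

```java
// Hypothetical helper, for illustration only.
final class CalibrationSketch {
    /** Copies classStd, replacing exact zeros with 1.0f to avoid divide-by-zero. */
    static float[] sanitizeStd(float[] classStd) {
        float[] out = classStd.clone();
        for (int c = 0; c < out.length; c++) {
            if (out[c] == 0f) {
                out[c] = 1.0f;
            }
        }
        return out;
    }

    /** z-scores each raw logit; falls back to a copy of raw when stats are missing. */
    static float[] calibrate(float[] raw, float[] classMean, float[] classStd) {
        if (classMean == null || classStd == null) {
            return raw.clone(); // ASSUMPTION: no-op fallback returns raw values
        }
        float[] out = new float[raw.length];
        for (int c = 0; c < raw.length; c++) {
            out[c] = (raw[c] - classMean[c]) / classStd[c];
        }
        return out;
    }

    public static void main(String[] args) {
        float[] std = sanitizeStd(new float[] {0.5f, 0f});
        float[] z = calibrate(new float[] {2f, -1f}, new float[] {1f, -1f}, std);
        System.out.println(java.util.Arrays.toString(z));
    }
}
```

A calibrated value of 2.0 then reads as "two standard deviations above this class's training-set mean", regardless of the specialist's natural logit scale.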
    • hasCalibration

      public boolean hasCalibration()
      true if this model carries per-class calibration statistics.
    • getClassMean

      public float[] getClassMean()
    • getClassStd

      public float[] getClassStd()
    • softmax

      public static float[] softmax(float[] logits)
      In-place softmax with numerical stability.
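
A common way to get numerical stability is to subtract the maximum logit before exponentiating, so that no exponent exceeds zero. The sketch below uses that technique on a hypothetical SoftmaxSketch class; whether LinearModel.softmax uses exactly this approach is not stated in the documentation.

```java
// Hypothetical helper, for illustration only.
final class SoftmaxSketch {
    /** Overwrites logits with softmax probabilities and returns the same array. */
    static float[] softmax(float[] logits) {
        float max = Float.NEGATIVE_INFINITY;
        for (float v : logits) {
            max = Math.max(max, v);
        }
        float sum = 0f;
        for (int i = 0; i < logits.length; i++) {
            // Shifting by max keeps every exponent <= 0, so exp cannot overflow.
            logits[i] = (float) Math.exp(logits[i] - max);
            sum += logits[i];
        }
        for (int i = 0; i < logits.length; i++) {
            logits[i] /= sum;
        }
        return logits;
    }

    public static void main(String[] args) {
        float[] p = softmax(new float[] {1000f, 1000f});
        System.out.println(java.util.Arrays.toString(p));
    }
}
```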
    • entropy

      public static float entropy(float[] probs)
      Shannon entropy (in bits) of a probability distribution.
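
For reference, Shannon entropy in bits is H(p) = -Σ p_i · log2(p_i), with zero-probability terms contributing nothing. A minimal sketch on a hypothetical EntropySketch class follows; skipping non-positive entries is an assumption about edge-case handling.

```java
// Hypothetical helper, for illustration only.
final class EntropySketch {
    private static final double LN2 = Math.log(2.0);

    /** Shannon entropy in bits; entries <= 0 are skipped (0 * log 0 treated as 0). */
    static float entropy(float[] probs) {
        double h = 0.0;
        for (float p : probs) {
            if (p > 0f) {
                h -= p * (Math.log(p) / LN2);
            }
        }
        return (float) h;
    }

    public static void main(String[] args) {
        System.out.println(entropy(new float[] {0.5f, 0.5f}));
        System.out.println(entropy(new float[] {0.25f, 0.25f, 0.25f, 0.25f}));
    }
}
```

A uniform distribution over numClasses outcomes gives the maximum, log2(numClasses) bits, which makes the value a convenient uncertainty score for a prediction.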
    • getNumBuckets

      public int getNumBuckets()
    • getNumClasses

      public int getNumClasses()
    • getLabels

      public String[] getLabels()
    • getLabel

      public String getLabel(int classIndex)
    • getScales

      public float[] getScales()
    • getBiases

      public float[] getBiases()
    • getWeights

      public byte[][] getWeights()
      Return weights in class-major [class][bucket] layout. Creates a new array each call.