Package org.apache.tika.ml
Class LinearModel
java.lang.Object
org.apache.tika.ml.LinearModel
INT8-quantized multinomial logistic regression model for classification.
Binary format (big-endian, magic "LDM1"):
Offset  Field
0       4B magic: 0x4C444D31
4       4B version: 1 or 2
8       4B numBuckets (B)
12      4B numClasses (C)
16+     Labels: C entries of [2B length + UTF-8 bytes]
        Scales: C × 4B float (per-class dequantization)
        Biases: C × 4B float (per-class bias term)
        (V2 only)
          1B hasCalibration flag
          If hasCalibration: ClassMean: C × 4B float, ClassStd: C × 4B float
        Weights: B × C bytes (bucket-major, INT8 signed)
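As a rough illustration of the layout above, the following sketch parses the fixed header and the label table from a big-endian buffer. It is not the library's loader: the class `LdmHeaderSketch`, its `Header` holder, the `sample()` helper, and the labels "en" and "fr" are all hypothetical, and treating the 2-byte label length as unsigned is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class LdmHeaderSketch {
    static final int MAGIC = 0x4C444D31; // ASCII "LDM1"

    /** Illustrative holder for the parsed fields; not a Tika type. */
    static final class Header {
        int version;
        int numBuckets;
        int numClasses;
        String[] labels;
    }

    /** Parses the fixed header and the label table from a big-endian buffer. */
    static Header parse(ByteBuffer buf) {
        // ByteBuffer defaults to big-endian, matching the documented byte order.
        if (buf.getInt() != MAGIC) {
            throw new IllegalArgumentException("bad magic, not an LDM1 stream");
        }
        Header h = new Header();
        h.version = buf.getInt();
        if (h.version != 1 && h.version != 2) {
            throw new IllegalArgumentException("unsupported version " + h.version);
        }
        h.numBuckets = buf.getInt();
        h.numClasses = buf.getInt();
        h.labels = new String[h.numClasses];
        for (int c = 0; c < h.numClasses; c++) {
            int len = buf.getShort() & 0xFFFF; // assumption: the 2-byte length is unsigned
            byte[] utf8 = new byte[len];
            buf.get(utf8);
            h.labels[c] = new String(utf8, StandardCharsets.UTF_8);
        }
        return h;
    }

    /** Builds a tiny V2 header with two made-up labels for demonstration. */
    static ByteBuffer sample() {
        ByteBuffer buf = ByteBuffer.allocate(64);
        buf.putInt(MAGIC).putInt(2).putInt(8).putInt(2);
        for (String label : new String[] {"en", "fr"}) {
            byte[] utf8 = label.getBytes(StandardCharsets.UTF_8);
            buf.putShort((short) utf8.length).put(utf8);
        }
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        Header h = parse(sample());
        System.out.println("version=" + h.version + " buckets=" + h.numBuckets
                + " labels=" + String.join(",", h.labels));
    }
}
```

The scales, biases, optional calibration block, and weights would follow the labels in the order listed in the table; they are left out here to keep the sketch short.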
Weights are stored in bucket-major order:
weights[bucket * numClasses + class]. This layout
is optimal for the sparse dot-product in predict(int[])
— each non-zero bucket reads a contiguous run of
numClasses bytes, ideal for SIMD and cache
prefetching.
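The indexing rule can be illustrated with a small standalone sketch. The class `BucketMajorSketch` and its helpers are hypothetical names; only the formula `bucket * numClasses + class` comes from the description above.

```java
import java.util.Arrays;

final class BucketMajorSketch {
    /** Flat index of the weight for (bucket, class) in bucket-major order. */
    static int index(int bucket, int cls, int numClasses) {
        return bucket * numClasses + cls;
    }

    /** Copies the contiguous run of numClasses weights that belongs to one bucket. */
    static byte[] bucketRow(byte[] weights, int bucket, int numClasses) {
        int start = bucket * numClasses;
        return Arrays.copyOfRange(weights, start, start + numClasses);
    }

    public static void main(String[] args) {
        int numClasses = 3;
        // 2 buckets x 3 classes: bucket 0 holds {1,2,3}, bucket 1 holds {4,5,6}.
        byte[] weights = {1, 2, 3, 4, 5, 6};
        System.out.println(weights[index(1, 2, numClasses)]);                   // 6
        System.out.println(Arrays.toString(bucketRow(weights, 1, numClasses))); // [4, 5, 6]
    }
}
```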
Calibration (V2): optional per-class mean/std of training-set logits.
When present, predictCalibratedLogits(int[]) standardizes raw logits
so cross-specialist pooling can compare "unusually confident" signals on
equal footing. V1 files are still readable; calibration is absent and
predictCalibratedLogits(int[]) falls back to raw logits.
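A minimal sketch of the z-score step, assuming the raw logits are already available. `CalibrationSketch` and `calibrate` are hypothetical names. The zero-std guard mirrors what the constructor documentation says happens at construction time, and returning a copy in the fallback case is an assumption about behavior this page does not specify.

```java
final class CalibrationSketch {
    /**
     * Standardizes raw logits per class as (raw - mean) / std.
     * Returns a copy of the raw logits when no calibration statistics are given.
     */
    static float[] calibrate(float[] raw, float[] classMean, float[] classStd) {
        float[] out = raw.clone();
        if (classMean == null || classStd == null) {
            return out; // no calibration: fall back to raw logits
        }
        for (int c = 0; c < out.length; c++) {
            // Guard against divide-by-zero by treating a zero std as 1.0f.
            float std = classStd[c] == 0f ? 1.0f : classStd[c];
            out[c] = (raw[c] - classMean[c]) / std;
        }
        return out;
    }

    public static void main(String[] args) {
        float[] raw = {4f, 1f};
        float[] z = calibrate(raw, new float[] {2f, 1f}, new float[] {2f, 0f});
        System.out.println(z[0] + " " + z[1]); // 1.0 0.0
    }
}
```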
-
Field Summary
Fields
Modifier and Type   Field       Description
static final int    MAGIC
static final int    VERSION     Latest version we emit.
static final int    VERSION_V1
static final int    VERSION_V2
-
Constructor Summary
Constructors
Constructor
    Description
LinearModel(int numBuckets, int numClasses, String[] labels, float[] scales, float[] biases, byte[][] weights)
    Construct without calibration (V1-compatible).
LinearModel(int numBuckets, int numClasses, String[] labels, float[] scales, float[] biases, byte[][] weights, float[] classMean, float[] classStd)
    Construct with optional calibration.
-
Method Summary
Modifier and Type   Method                                   Description
static float        entropy(float[] probs)                   Shannon entropy (in bits) of a probability distribution.
float[]             getBiases()
float[]             getClassMean()
float[]             getClassStd()
String              getLabel(int classIndex)
String[]            getLabels()
int                 getNumBuckets()
int                 getNumClasses()
float[]             getScales()
byte[][]            getWeights()                             Return weights in class-major [class][bucket] layout.
boolean             hasCalibration()                         true if this model carries per-class calibration statistics.
static LinearModel  load(InputStream is)                     Load a model from an input stream.
static LinearModel  loadFromClasspath(String resourcePath)   Load a model from the classpath.
static LinearModel  loadFromPath(Path path)                  Load a model from a file on disk.
float[]             predict(int[] features)                  Compute softmax probabilities for the given feature vector.
float[]             predictCalibratedLogits(int[] features)  Compute calibrated logits: (raw - classMean[c]) / classStd[c] for each class, if the model carries calibration statistics, else raw logits (no-op).
float[]             predictLogits(int[] features)            Compute raw logits for the given feature vector (before softmax).
float[]             predictLogitsDense(float[] features)     Compute logits for a dense float feature vector.
void                save(OutputStream os)                    Write the model in LDM binary format.
static float[]      softmax(float[] logits)                  In-place softmax with numerical stability.
-
Field Details
-
MAGIC
public static final int MAGIC
-
VERSION_V1
public static final int VERSION_V1
-
VERSION_V2
public static final int VERSION_V2
-
VERSION
public static final int VERSION
Latest version we emit.
-
-
Constructor Details
-
LinearModel
public LinearModel(int numBuckets, int numClasses, String[] labels, float[] scales, float[] biases, byte[][] weights)
Construct without calibration (V1-compatible). Transposes class-major weights to bucket-major flat layout internally.
-
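The transposition mentioned for this constructor, and its inverse as described for getWeights(), might look roughly like the sketch below. `TransposeSketch` and both method names are hypothetical; the actual internal code is not shown on this page.

```java
import java.util.Arrays;

final class TransposeSketch {
    /** Converts class-major [class][bucket] weights to a flat bucket-major array. */
    static byte[] toBucketMajor(byte[][] classMajor, int numBuckets, int numClasses) {
        byte[] flat = new byte[numBuckets * numClasses];
        for (int c = 0; c < numClasses; c++) {
            for (int b = 0; b < numBuckets; b++) {
                flat[b * numClasses + c] = classMajor[c][b];
            }
        }
        return flat;
    }

    /** Rebuilds a fresh class-major [class][bucket] array from the flat layout. */
    static byte[][] toClassMajor(byte[] flat, int numBuckets, int numClasses) {
        byte[][] out = new byte[numClasses][numBuckets];
        for (int b = 0; b < numBuckets; b++) {
            for (int c = 0; c < numClasses; c++) {
                out[c][b] = flat[b * numClasses + c];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // 2 classes x 3 buckets in class-major form.
        byte[][] classMajor = {{1, 2, 3}, {4, 5, 6}};
        byte[] flat = toBucketMajor(classMajor, 3, 2);
        System.out.println(Arrays.toString(flat)); // [1, 4, 2, 5, 3, 6]
    }
}
```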
LinearModel
public LinearModel(int numBuckets, int numClasses, String[] labels, float[] scales, float[] biases, byte[][] weights, float[] classMean, float[] classStd)
Construct with optional calibration. Pass classMean and classStd (each of length numClasses) to enable z-score calibration in predictCalibratedLogits(int[]); pass null for both to skip. Any classStd[c] == 0 is rewritten to 1.0f to avoid divide-by-zero.
-
-
Method Details
-
loadFromClasspath
public static LinearModel loadFromClasspath(String resourcePath) throws IOException
Load a model from the classpath. Transparently handles both plain LDM1 binaries and gzip-compressed LDM1 binaries (detected by magic bytes).
Throws:
IOException
-
loadFromPath
public static LinearModel loadFromPath(Path path) throws IOException
Load a model from a file on disk. Transparently handles both plain and gzip-compressed LDM1 files.
Throws:
IOException
-
load
public static LinearModel load(InputStream is) throws IOException
Load a model from an input stream. Transparently handles both plain LDM1 binaries and gzip-compressed ones: if the first two bytes are the gzip magic 0x1F 0x8B, the stream is wrapped in a GZIPInputStream before reading.
Throws:
IOException
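One way to implement the gzip detection described for load is to peek at the first two bytes and push them back. The sketch below uses `PushbackInputStream` for that; this is an assumption about technique, not a statement of how the library does it, and `GzipSniffSketch` with its helpers is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipSniffSketch {
    /** Wraps the stream in a GZIPInputStream if it starts with 0x1F 0x8B. */
    static InputStream maybeGunzip(InputStream is) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(is, 2);
        byte[] head = new byte[2];
        int n = 0;
        while (n < 2) {
            int r = pb.read(head, n, 2 - n);
            if (r < 0) {
                break; // stream shorter than two bytes
            }
            n += r;
        }
        if (n > 0) {
            pb.unread(head, 0, n); // put the peeked bytes back
        }
        boolean gzip = n == 2 && (head[0] & 0xFF) == 0x1F && (head[1] & 0xFF) == 0x8B;
        return gzip ? new GZIPInputStream(pb) : pb;
    }

    /** Reads everything from the (possibly gzip-compressed) bytes. */
    static byte[] decode(byte[] bytes) {
        try (InputStream in = maybeGunzip(new ByteArrayInputStream(bytes))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int r;
            while ((r = in.read(chunk)) != -1) {
                out.write(chunk, 0, r);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Gzip-compresses a byte array, used here only to build test input. */
    static byte[] gzip(byte[] plain) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] plain = {0x4C, 0x44, 0x4D, 0x31, 7};
        System.out.println(decode(plain).length);       // 5
        System.out.println(decode(gzip(plain)).length); // 5
    }
}
```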
-
save
public void save(OutputStream os) throws IOException
Write the model in LDM binary format. Emits V2 (with or without calibration block depending on whether this model has calibration).
Throws:
IOException
-
predictLogits
public float[] predictLogits(int[] features)
Compute raw logits for the given feature vector (before softmax). Uses a sparse inner loop: only non-zero buckets are visited.
Parameters:
features - int array of size numBuckets
Returns:
float array of size numClasses (raw, unnormalized logits)
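A sparse inner loop over bucket-major INT8 weights could be sketched as follows. The dequantization rule used here, `bias[c] + scale[c] * sum`, is an assumption, since this page does not spell out the formula. The per-bucket clipping that the predictLogitsDense description attributes to this method is omitted because its rule is not given. `SparseLogitsSketch` is a hypothetical name.

```java
final class SparseLogitsSketch {
    /**
     * Computes logits from sparse integer counts and bucket-major INT8 weights.
     * Assumed formula: logit[c] = biases[c] + scales[c] * sum_b(features[b] * w[b][c]).
     */
    static float[] logits(int[] features, byte[] weights, float[] scales, float[] biases) {
        int numClasses = biases.length;
        int[] acc = new int[numClasses];
        for (int b = 0; b < features.length; b++) {
            int count = features[b];
            if (count == 0) {
                continue; // sparse: skip empty buckets entirely
            }
            int row = b * numClasses; // start of this bucket's contiguous run
            for (int c = 0; c < numClasses; c++) {
                acc[c] += count * weights[row + c];
            }
        }
        float[] out = new float[numClasses];
        for (int c = 0; c < numClasses; c++) {
            out[c] = biases[c] + scales[c] * acc[c];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] features = {0, 2, 0, 1};                  // 4 buckets, two non-zero
        byte[] weights = {10, -10, 3, -1, 5, 5, -2, 4}; // 4 buckets x 2 classes
        float[] out = logits(features, weights, new float[] {0.5f, 0.25f}, new float[] {1f, 0f});
        System.out.println(out[0] + " " + out[1]); // 3.0 0.5
    }
}
```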
-
predictLogitsDense
public float[] predictLogitsDense(float[] features)
Compute logits for a dense float feature vector. Unlike predictLogits(int[]), which assumes sparse integer counts and applies per-bucket clipping to suppress single-feature dominance in hashed representations, this method just performs a plain dot product. That is appropriate for adjudicator / meta-model feature vectors where each slot is already a calibrated quantity (specialist logit, z-score, one-hot flag, etc.).
Parameters:
features - float array of length numBuckets
Returns:
float array of length numClasses (raw logits)
-
predict
public float[] predict(int[] features)
Compute softmax probabilities for the given feature vector.
Parameters:
features - int array of size numBuckets
Returns:
float array of size numClasses (softmax probabilities, sum ≈ 1.0)
-
predictCalibratedLogits
public float[] predictCalibratedLogits(int[] features)
Compute calibrated logits: (raw - classMean[c]) / classStd[c] for each class, if the model carries calibration statistics, else raw logits (no-op). Calibrated logits are comparable across specialists with different natural logit scales: they express "how many standard deviations above this class's training-set mean" rather than raw weight arithmetic.
-
hasCalibration
public boolean hasCalibration()
true if this model carries per-class calibration statistics.
-
getClassMean
public float[] getClassMean() -
getClassStd
public float[] getClassStd() -
softmax
public static float[] softmax(float[] logits)
In-place softmax with numerical stability.
-
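"Numerical stability" in a softmax usually means subtracting the maximum logit before exponentiating, so large inputs do not overflow. The sketch below shows that standard technique applied in place; `SoftmaxSketch` is a hypothetical name, and whether the library uses exactly this form is an assumption.

```java
final class SoftmaxSketch {
    /** Overwrites logits with softmax probabilities and returns the same array. */
    static float[] softmax(float[] logits) {
        float max = Float.NEGATIVE_INFINITY;
        for (float v : logits) {
            max = Math.max(max, v);
        }
        float sum = 0f;
        for (int i = 0; i < logits.length; i++) {
            // Shifting by the max keeps every exponent <= 0, so exp cannot overflow.
            logits[i] = (float) Math.exp(logits[i] - max);
            sum += logits[i];
        }
        for (int i = 0; i < logits.length; i++) {
            logits[i] /= sum;
        }
        return logits;
    }

    public static void main(String[] args) {
        // Without the max shift, exp(1000) would overflow to infinity.
        float[] p = softmax(new float[] {1000f, 1000f});
        System.out.println(p[0] + " " + p[1]); // 0.5 0.5
    }
}
```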
entropy
public static float entropy(float[] probs)
Shannon entropy (in bits) of a probability distribution.
-
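Shannon entropy in bits is H = -Σ p·log2(p). A minimal sketch, assuming zero-probability entries contribute nothing (the usual 0·log 0 = 0 convention, which this page does not state); `EntropySketch` is a hypothetical name.

```java
final class EntropySketch {
    /** Shannon entropy in bits; entries with p <= 0 are skipped. */
    static float entropy(float[] probs) {
        double h = 0.0;
        for (float p : probs) {
            if (p > 0f) {
                h -= p * (Math.log(p) / Math.log(2.0)); // log base 2 gives bits
            }
        }
        return (float) h;
    }

    public static void main(String[] args) {
        // A uniform distribution over 4 classes carries about 2 bits.
        System.out.println(entropy(new float[] {0.25f, 0.25f, 0.25f, 0.25f}));
        // A one-hot distribution carries no uncertainty.
        System.out.println(entropy(new float[] {1f, 0f, 0f}));
    }
}
```

Low entropy indicates a peaked, confident prediction; high entropy indicates probability mass spread across many classes.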
getNumBuckets
public int getNumBuckets() -
getNumClasses
public int getNumClasses() -
getLabels
public String[] getLabels()
-
getLabel
public String getLabel(int classIndex)
-
getScales
public float[] getScales() -
getBiases
public float[] getBiases() -
getWeights
public byte[][] getWeights()
Return weights in class-major [class][bucket] layout. Creates a new array each call.
-