Class CharSoupModel
java.lang.Object
org.apache.tika.langdetect.charsoup.CharSoupModel
INT8-quantized multinomial logistic regression model for
language detection.
Binary format (big-endian, magic "LDM1"):
v1 layout:
Offset  Field
0       4B magic: 0x4C444D31
4       4B version: 1
8       4B numBuckets (B)
12      4B numClasses (C)
16+     Labels: C entries of [2B length + UTF-8 bytes]
        Scales: C × 4B float (per-class dequantization)
        Biases: C × 4B float (per-class bias term)
        Weights: B × C bytes (bucket-major, INT8 signed)
v2 layout (adds feature flags after numClasses):
Offset  Field
0       4B magic: 0x4C444D31
4       4B version: 2
8       4B numBuckets (B)
12      4B numClasses (C)
16      4B featureFlags (bitmask of FLAG_* constants)
20+     Labels, Scales, Biases, Weights (same as v1)
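As an illustration of the two layouts above, here is a minimal sketch of reading only the fixed-size header. The class LdmHeaderSketch and its fields are hypothetical and are not part of Tika; the sketch assumes nothing beyond the offsets listed in the tables and relies on ByteBuffer's default big-endian byte order.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper (not part of Tika): parses only the fixed-size header. */
final class LdmHeaderSketch {
    static final int MAGIC = 0x4C444D31; // ASCII "LDM1"

    final int version;
    final int numBuckets;
    final int numClasses;
    final boolean hasFeatureFlags; // true only for the v2 layout
    final int featureFlags;        // 0 when the header carries no flags field
    final int labelsOffset;        // 16 for v1, 20 for v2

    private LdmHeaderSketch(int version, int numBuckets, int numClasses,
                            boolean hasFeatureFlags, int featureFlags, int labelsOffset) {
        this.version = version;
        this.numBuckets = numBuckets;
        this.numClasses = numClasses;
        this.hasFeatureFlags = hasFeatureFlags;
        this.featureFlags = featureFlags;
        this.labelsOffset = labelsOffset;
    }

    static LdmHeaderSketch parse(byte[] bytes) {
        // A freshly wrapped ByteBuffer is big-endian, matching the format.
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int magic = buf.getInt();
        if (magic != MAGIC) {
            throw new IllegalArgumentException("Bad magic: 0x" + Integer.toHexString(magic));
        }
        int version = buf.getInt();
        if (version != 1 && version != 2) {
            throw new IllegalArgumentException("Unsupported version: " + version);
        }
        int numBuckets = buf.getInt();
        int numClasses = buf.getInt();
        boolean v2 = version == 2;
        int flags = v2 ? buf.getInt() : 0; // v2 inserts featureFlags at offset 16
        return new LdmHeaderSketch(version, numBuckets, numClasses, v2, flags, buf.position());
    }
}
```

In this sketch, labelsOffset is where the variable-length label entries would begin; reading labels, scales, biases and weights is left out.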
Weights are stored in bucket-major order:
weights[bucket * numClasses + class]. This layout
is optimal for the sparse dot-product in predict(int[])
— each non-zero bucket reads a contiguous run of
numClasses bytes, ideal for SIMD and cache
prefetching.
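The access pattern described above can be sketched as follows. This is not the library's implementation: the class name is hypothetical, and the dequantization formula (logit = bias + scale × integer dot product) is an assumption based on the "per-class dequantization" note in the format description.

```java
/** Hypothetical sketch (not Tika code) of a sparse dot product over bucket-major INT8 weights. */
final class BucketMajorLogitsSketch {
    private BucketMajorLogitsSketch() {}

    static float[] logits(int[] features, byte[] weights, float[] scales, float[] biases) {
        int numClasses = biases.length;
        int[] acc = new int[numClasses];
        for (int bucket = 0; bucket < features.length; bucket++) {
            int count = features[bucket];
            if (count == 0) {
                continue; // sparse: empty buckets contribute nothing
            }
            // All class weights for this bucket sit in one contiguous run.
            int base = bucket * numClasses;
            for (int c = 0; c < numClasses; c++) {
                acc[c] += count * weights[base + c];
            }
        }
        float[] out = new float[numClasses];
        for (int c = 0; c < numClasses; c++) {
            // Assumed dequantization: one scale and one bias per class.
            out[c] = biases[c] + scales[c] * acc[c];
        }
        return out;
    }
}
```

With a class-major layout the inner loop would instead stride by numBuckets between reads, which is the cost the bucket-major order avoids.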
Feature extraction always uses
ScriptAwareFeatureExtractor, which produces
character bigrams (with sentinels for non-CJK), whole-word
unigrams, CJK character unigrams, and CJK space bridging.
-
Field Summary
Fields
Modifier and Type  Field               Description
static final int   FLAG_4GRAMS         Feature flag: enable character 4-grams.
static final int   FLAG_5GRAMS         Feature flag: enable character 5-grams.
static final int   FLAG_CHAR_UNIGRAMS  Feature flag: enable non-CJK character unigrams.
static final int   FLAG_L2_NORM        Feature flag: L2-normalize the feature vector before prediction.
static final int   FLAG_PREFIX         Feature flag: enable 3-char word prefixes.
static final int   FLAG_SCRIPT_BLOCKS  Feature flag: enable script-block presence + transition features.
static final int   FLAG_SKIP_BIGRAMS   Feature flag: enable skip bigrams.
static final int   FLAG_SUFFIX4        Feature flag: enable 4-char word suffixes.
static final int   FLAG_SUFFIXES       Feature flag: enable 3-char word suffixes.
static final int   FLAG_TRIGRAMS       Feature flag: enable character trigrams.
static final int   FLAG_WORD_BIGRAMS   Feature flag: short-word-anchored word bigrams (hash pairs where anchor is 1–3 chars).
static final int   FLAG_WORD_LENGTH    Feature flag: non-CJK word length features (exact length, capped).
static final int   FLAG_WORD_UNIGRAMS  Feature flag: enable whole-word unigrams.
static final int   V1_DEFAULT_FLAGS    Default flags for v1 models (word unigrams only).
-
Constructor Summary
Constructors
Constructor / Description
CharSoupModel(int numBuckets, int numClasses, String[] labels, float[] scales, float[] biases, byte[][] weights)
    Construct from class-major byte[][] weights with default feature configuration (word unigrams only; backward compatible with v1).
CharSoupModel(int numBuckets, int numClasses, String[] labels, float[] scales, float[] biases, byte[][] weights, int featureFlags)
    Construct from class-major byte[][] weights with explicit feature flags.
-
Method Summary
Modifier and Type     Method                                  Description
FeatureExtractor      createExtractor()                       Create the production FeatureExtractor for this model by dispatching on the featureFlags embedded in the binary.
static float          entropy(float[] probs)                  Shannon entropy (in bits) of a probability distribution.
float[]               getBiases()
int                   getFeatureFlags()
String                getLabel(int classIndex)
String[]              getLabels()
int                   getNumBuckets()
int                   getNumClasses()
float[]               getScales()
byte[][]              getWeights()                            Return weights in class-major [class][bucket] layout.
static CharSoupModel  load(InputStream is)                    Load a model from an input stream.
static CharSoupModel  loadFromClasspath(String resourcePath)  Load a model from the classpath.
float[]               predict(int[] features)                 Compute softmax probabilities for the given feature vector.
float[]               predictLogits(int[] features)           Compute raw logits (pre-softmax scores) for the given feature vector.
void                  save(OutputStream os)                   Write the model in LDM2 binary format (includes feature flags).
static float[]        softmax(float[] logits)                 In-place softmax with numerical stability.
CharSoupModel         withFeatureFlags(int newFlags)          Returns a new model with the same weights but a different feature-flags bitmask.
-
Field Details
-
FLAG_TRIGRAMS
public static final int FLAG_TRIGRAMS
Feature flag: enable character trigrams.
-
FLAG_SKIP_BIGRAMS
public static final int FLAG_SKIP_BIGRAMS
Feature flag: enable skip bigrams.
-
FLAG_SUFFIXES
public static final int FLAG_SUFFIXES
Feature flag: enable 3-char word suffixes.
-
FLAG_SUFFIX4
public static final int FLAG_SUFFIX4
Feature flag: enable 4-char word suffixes.
-
FLAG_PREFIX
public static final int FLAG_PREFIX
Feature flag: enable 3-char word prefixes.
-
FLAG_WORD_UNIGRAMS
public static final int FLAG_WORD_UNIGRAMS
Feature flag: enable whole-word unigrams.
-
FLAG_CHAR_UNIGRAMS
public static final int FLAG_CHAR_UNIGRAMS
Feature flag: enable non-CJK character unigrams.
-
FLAG_4GRAMS
public static final int FLAG_4GRAMS
Feature flag: enable character 4-grams.
-
FLAG_5GRAMS
public static final int FLAG_5GRAMS
Feature flag: enable character 5-grams.
-
FLAG_SCRIPT_BLOCKS
public static final int FLAG_SCRIPT_BLOCKS
Feature flag: enable script-block presence + transition features.
-
FLAG_L2_NORM
public static final int FLAG_L2_NORM
Feature flag: L2-normalize the feature vector before prediction.
-
FLAG_WORD_BIGRAMS
public static final int FLAG_WORD_BIGRAMS
Feature flag: enable short-word-anchored word bigrams (hash pairs where the anchor word is 1–3 chars).
-
FLAG_WORD_LENGTH
public static final int FLAG_WORD_LENGTH
Feature flag: enable non-CJK word length features (exact length, capped).
-
V1_DEFAULT_FLAGS
public static final int V1_DEFAULT_FLAGS
Default flags for v1 models (word unigrams only).
-
-
Constructor Details
-
CharSoupModel
public CharSoupModel(int numBuckets, int numClasses, String[] labels, float[] scales, float[] biases, byte[][] weights)
Construct from class-major byte[][] weights with default feature configuration (word unigrams only; backward compatible with v1).
-
CharSoupModel
public CharSoupModel(int numBuckets, int numClasses, String[] labels, float[] scales, float[] biases, byte[][] weights, int featureFlags)
Construct from class-major byte[][] weights with explicit feature flags.
Parameters:
featureFlags - bitmask of FLAG_* constants
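Since featureFlags is a bitmask, flags can be combined, tested and cleared with ordinary bitwise operators. The sketch below uses made-up constants with hypothetical bit positions; the actual FLAG_* values are not given on this page, so in real code the constants from CharSoupModel would be used instead.

```java
/** Hypothetical sketch: the constants below are NOT the real FLAG_* values. */
final class FeatureFlagsSketch {
    private FeatureFlagsSketch() {}

    // Made-up bit positions, for illustration only.
    static final int EXAMPLE_FLAG_TRIGRAMS = 1 << 0;
    static final int EXAMPLE_FLAG_WORD_UNIGRAMS = 1 << 1;
    static final int EXAMPLE_FLAG_L2_NORM = 1 << 2;

    /** True if every bit of {@code flag} is set in {@code featureFlags}. */
    static boolean isSet(int featureFlags, int flag) {
        return (featureFlags & flag) == flag;
    }

    /** Returns the bitmask with {@code flag} switched on. */
    static int with(int featureFlags, int flag) {
        return featureFlags | flag;
    }

    /** Returns the bitmask with {@code flag} switched off. */
    static int without(int featureFlags, int flag) {
        return featureFlags & ~flag;
    }
}
```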
-
-
Method Details
-
loadFromClasspath
Load a model from the classpath.
Throws:
IOException
-
load
Load a model from an input stream. Supports both v1 (LDM1) and v2 (LDM2) formats.
Throws:
IOException
-
save
Write the model in LDM2 binary format (includes feature flags).
Throws:
IOException
-
predict
public float[] predict(int[] features)
Compute softmax probabilities for the given feature vector. Uses a sparse inner loop: only non-zero buckets are visited.
Parameters:
features - int array of size numBuckets
Returns:
float array of size numClasses (softmax probabilities, sum ≈ 1.0)
-
predictLogits
public float[] predictLogits(int[] features)
Compute raw logits (pre-softmax scores) for the given feature vector. Higher logits indicate a stronger match. Unlike predict(int[]), this preserves the full dynamic range of the model's output, which is useful when comparing confidence across different input texts.
Parameters:
features - int array of size numBuckets
Returns:
float array of size numClasses (raw logits, not normalized)
-
softmax
public static float[] softmax(float[] logits)
In-place softmax with numerical stability.
-
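For readers unfamiliar with what "in-place softmax with numerical stability" usually means, here is a minimal sketch of the common max-subtraction approach. It is an assumption that softmax(float[]) works this way; the class and method names are hypothetical.

```java
/** Hypothetical sketch (not Tika code) of a numerically stable in-place softmax. */
final class SoftmaxSketch {
    private SoftmaxSketch() {}

    static float[] softmaxInPlace(float[] logits) {
        if (logits.length == 0) {
            return logits;
        }
        // Subtracting the maximum keeps every exponent <= 0, so exp() cannot overflow.
        float max = logits[0];
        for (float v : logits) {
            max = Math.max(max, v);
        }
        float sum = 0f;
        for (int i = 0; i < logits.length; i++) {
            logits[i] = (float) Math.exp(logits[i] - max);
            sum += logits[i];
        }
        for (int i = 0; i < logits.length; i++) {
            logits[i] /= sum;
        }
        return logits; // same array instance, now holding probabilities
    }
}
```

Without the subtraction, a logit of 1000 would make exp() return infinity and the division would yield NaN.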
entropy
public static float entropy(float[] probs)
Shannon entropy (in bits) of a probability distribution.
-
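Shannon entropy in bits is H = -Σ p·log2(p), with zero-probability terms contributing nothing. The sketch below shows that formula; it is not the library's code, and the class name is hypothetical.

```java
/** Hypothetical sketch (not Tika code) of Shannon entropy in bits. */
final class EntropySketch {
    private EntropySketch() {}

    static float entropyBits(float[] probs) {
        double h = 0.0;
        for (float p : probs) {
            if (p > 0f) {
                // log2(p) = ln(p) / ln(2); the p == 0 terms are skipped (limit is 0).
                h -= p * (Math.log(p) / Math.log(2.0));
            }
        }
        return (float) h;
    }
}
```

A uniform distribution over four classes gives about 2 bits, while a distribution that puts all mass on one class gives 0, so low entropy corresponds to a confident prediction.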
getNumBuckets
public int getNumBuckets()
-
getNumClasses
public int getNumClasses()
-
getLabels
-
getLabel
-
getScales
public float[] getScales()
-
getBiases
public float[] getBiases()
-
getWeights
public byte[][] getWeights()
Return weights in class-major [class][bucket] layout. Creates a new array each call.
-
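The relationship between the class-major byte[][] returned by getWeights() and the bucket-major order used in the binary format is a transposition. The sketch below assumes the bucket-major weights are held in one flat byte[] indexed as weights[bucket * numClasses + class]; the class and method names are hypothetical.

```java
/** Hypothetical sketch (not Tika code): converts between the two weight layouts. */
final class WeightLayoutSketch {
    private WeightLayoutSketch() {}

    /** Flat bucket-major array to class-major [class][bucket]. */
    static byte[][] toClassMajor(byte[] bucketMajor, int numBuckets, int numClasses) {
        if (bucketMajor.length != numBuckets * numClasses) {
            throw new IllegalArgumentException("expected " + (numBuckets * numClasses)
                    + " weights, got " + bucketMajor.length);
        }
        byte[][] classMajor = new byte[numClasses][numBuckets];
        for (int b = 0; b < numBuckets; b++) {
            int base = b * numClasses;
            for (int c = 0; c < numClasses; c++) {
                classMajor[c][b] = bucketMajor[base + c];
            }
        }
        return classMajor;
    }

    /** Class-major [class][bucket] to a flat bucket-major array. */
    static byte[] toBucketMajor(byte[][] classMajor, int numBuckets) {
        int numClasses = classMajor.length;
        byte[] flat = new byte[numBuckets * numClasses];
        for (int c = 0; c < numClasses; c++) {
            for (int b = 0; b < numBuckets; b++) {
                flat[b * numClasses + c] = classMajor[c][b];
            }
        }
        return flat;
    }
}
```

Because a conversion like this allocates a full copy, a caller that needs the weights repeatedly would probably want to keep the returned array rather than call getWeights() in a loop.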
createExtractor
Create the production FeatureExtractor for this model by dispatching on the featureFlags embedded in the binary.
Supported flag sets:
ScriptAwareFeatureExtractor.FEATURE_FLAGS: general model
ShortTextFeatureExtractor.FEATURE_FLAGS: short-text model
Throws:
IllegalStateException - if the flags do not match any known production extractor. Experimental configs should use ResearchFeatureExtractor in the test module.
-
getFeatureFlags
public int getFeatureFlags() -
withFeatureFlags
Returns a new model with the same weights but a different feature-flags bitmask. Useful for correcting flags on models saved before this field was properly set.
Parameters:
newFlags - bitmask of FLAG_* constants
Returns:
copy of this model with updated feature flags
-