Class SaltedNgramFeatureExtractor
- All Implemented Interfaces:
FeatureExtractor
Design principles
- Single FNV basis constant for all features. A one-byte salt prefix distinguishes feature types; n-gram order is differentiated by the number of codepoints fed into the hash chain.
- N-grams always contain N real characters — no sentinel padding.
- Word position is encoded via salt bytes (BOW, EOW, FULL_WORD, MID).
- No script salting on n-grams — different scripts use different codepoint ranges, so hashes naturally separate.
- Short complete words (1–4 chars) get a FULL_WORD salt on their matching n-gram order, replacing the separate word-unigram feature.
- Script block features (presence counts + transition counts) provide explicit script signal for the linear classifier.
- CJK/kana character unigrams use a dedicated salt (no word boundaries in CJK).
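The salt-prefix idea in the principles above can be sketched as follows. This is a minimal illustration, not the extractor's actual code: it assumes the 32-bit FNV-1a variant, feeds each codepoint as three bytes, and uses made-up salt values. The class name `SaltedHashSketch` and every constant in it are hypothetical.

```java
final class SaltedHashSketch {
    // Standard 32-bit FNV-1a offset basis and prime (assumed variant).
    static final int FNV_OFFSET_BASIS = 0x811C9DC5;
    static final int FNV_PRIME = 0x01000193;

    // Hypothetical salt bytes; the real values are not given on this page.
    static final byte SALT_BOW = 0x01;
    static final byte SALT_EOW = 0x02;
    static final byte SALT_FULL_WORD = 0x03;
    static final byte SALT_MID = 0x04;

    // One FNV-1a step: XOR in the low 8 bits, then multiply by the prime.
    static int mix(int h, int b) {
        return (h ^ (b & 0xFF)) * FNV_PRIME;
    }

    // Hash n codepoints starting at 'start', prefixed by a one-byte salt.
    // The n-gram order is implied by how many codepoints enter the chain.
    static int hash(byte salt, int[] codepoints, int start, int n) {
        int h = mix(FNV_OFFSET_BASIS, salt);
        for (int i = start; i < start + n; i++) {
            int cp = codepoints[i];
            h = mix(h, cp);          // low byte
            h = mix(h, cp >>> 8);    // middle byte
            h = mix(h, cp >>> 16);   // high byte (codepoints fit in 21 bits)
        }
        return h;
    }

    // Map a signed 32-bit hash to a non-negative bucket index.
    static int bucket(int hash, int numBuckets) {
        return Math.floorMod(hash, numBuckets);
    }
}
```

Because every step of the chain is a bijection on 32-bit values, two different salts applied to the same codepoints always give different hashes in this sketch. They can still land in the same bucket after the modulo.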
Feature types
- Character bigrams — all contiguous pairs within a word, plus BOW/EOW/FULL_WORD variants.
- Character trigrams — all contiguous triples, with position salt.
- Character 4-grams — all contiguous quads, with position salt.
- CJK/kana unigrams — individual ideographic/kana codepoints.
- Script blocks — per-script letter counts and transition counts.
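As a rough illustration of how position labels might attach to n-grams, the sketch below enumerates the contiguous n-grams of one word and tags each with exactly one of BOW, MID, EOW, or FULL_WORD. This is one plausible reading of the list above; the real extractor may emit additional variants (for example, both a plain and a position-salted bigram). The class `NgramPositionSketch` and its string output format are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class NgramPositionSketch {
    enum Position { BOW, MID, EOW, FULL_WORD }

    // Enumerate all contiguous n-grams of 'word' (by codepoint, no padding)
    // and label each with its position inside the word.
    static List<String> ngrams(String word, int n) {
        int[] cps = word.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int start = 0; start + n <= cps.length; start++) {
            boolean atStart = start == 0;
            boolean atEnd = start + n == cps.length;
            Position pos;
            if (atStart && atEnd) {
                pos = Position.FULL_WORD;   // the n-gram is the whole word
            } else if (atStart) {
                pos = Position.BOW;         // beginning of word
            } else if (atEnd) {
                pos = Position.EOW;         // end of word
            } else {
                pos = Position.MID;
            }
            out.add(pos + ":" + new String(cps, start, n));
        }
        return out;
    }
}
```

For example, `ngrams("word", 2)` yields `BOW:wo`, `MID:or`, `EOW:rd`, while `ngrams("the", 3)` yields the single entry `FULL_WORD:the`. A word shorter than n produces nothing, consistent with the no-padding principle.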
Field Summary
Fields
- static final int FEATURE_FLAGS
- static final int FEATURE_FLAGS_WITH_WORD_BIGRAMS
- static final int FEATURE_FLAGS_V11
Constructor Summary
Constructors
- SaltedNgramFeatureExtractor(int numBuckets)
- SaltedNgramFeatureExtractor(int numBuckets, boolean useWordBigrams)
- SaltedNgramFeatureExtractor(int numBuckets, boolean useWordBigrams, boolean useWordLength)
Method Summary
Methods
- int[] extract(String rawText): Full preprocessing + feature extraction pipeline.
- void extract(String rawText, int[] counts): Extract into caller-supplied buffer (zeroed first).
- int extractAndCount(String rawText, int[] counts): Extract features into counts and return the total n-gram emission count.
- int[] extractFromPreprocessed(String text): Extract from already-preprocessed text.
- void extractFromPreprocessed(String text, int[] counts, boolean clear): Extract from already-preprocessed text into a caller-supplied buffer.
- int getFeatureFlags(): Returns the bitmask of CharSoupModel FLAG_* constants that describes which feature types this extractor emits.
- int getNumBuckets()
Field Details
FEATURE_FLAGS
public static final int FEATURE_FLAGS
FEATURE_FLAGS_WITH_WORD_BIGRAMS
public static final int FEATURE_FLAGS_WITH_WORD_BIGRAMS
FEATURE_FLAGS_V11
public static final int FEATURE_FLAGS_V11
Constructor Details
SaltedNgramFeatureExtractor
public SaltedNgramFeatureExtractor(int numBuckets)
SaltedNgramFeatureExtractor
public SaltedNgramFeatureExtractor(int numBuckets, boolean useWordBigrams)
SaltedNgramFeatureExtractor
public SaltedNgramFeatureExtractor(int numBuckets, boolean useWordBigrams, boolean useWordLength)
Method Details
extract
public int[] extract(String rawText)
Description copied from interface: FeatureExtractor
Full preprocessing + feature extraction pipeline.
- Specified by:
extract in interface FeatureExtractor
- Parameters:
rawText - raw input text (may be null)
- Returns:
int array of size FeatureExtractor.getNumBuckets() with feature counts
extract
public void extract(String rawText, int[] counts)
Description copied from interface: FeatureExtractor
Extract into caller-supplied buffer (zeroed first).
- Specified by:
extract in interface FeatureExtractor
- Parameters:
rawText - raw input text (may be null)
counts - pre-allocated int array of size FeatureExtractor.getNumBuckets() (will be zeroed)
extractFromPreprocessed
public int[] extractFromPreprocessed(String text)
Description copied from interface: FeatureExtractor
Extract from already-preprocessed text.
- Specified by:
extractFromPreprocessed in interface FeatureExtractor
- Parameters:
text - text already passed through CharSoupFeatureExtractor.preprocess(String)
- Returns:
int array of size FeatureExtractor.getNumBuckets() with feature counts
extractFromPreprocessed
public void extractFromPreprocessed(String text, int[] counts, boolean clear)
Description copied from interface: FeatureExtractor
Extract from already-preprocessed text into a caller-supplied buffer.
- Specified by:
extractFromPreprocessed in interface FeatureExtractor
- Parameters:
text - text already passed through CharSoupFeatureExtractor.preprocess(String)
counts - pre-allocated int array of size FeatureExtractor.getNumBuckets()
clear - if true, zero the array before extracting; if false, accumulate on top of existing counts
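The effect of the clear parameter can be illustrated with a toy stand-in. The sketch below does not call the real extractor: `AccumulateSketch.extractInto` is a hypothetical helper that emits one count per non-whitespace codepoint, just to show the zero-versus-accumulate contract described above.

```java
import java.util.Arrays;

final class AccumulateSketch {
    // Toy stand-in for extractFromPreprocessed(text, counts, clear):
    // one emission per non-whitespace codepoint, bucketed by codepoint value.
    static void extractInto(String text, int[] counts, boolean clear) {
        if (clear) {
            Arrays.fill(counts, 0);   // start from an empty feature vector
        }
        text.codePoints()
            .filter(cp -> !Character.isWhitespace(cp))
            .forEach(cp -> counts[Math.floorMod(cp, counts.length)]++);
    }

    static int sum(int[] counts) {
        return Arrays.stream(counts).sum();
    }
}
```

Calling `extractInto("ab", counts, true)` and then `extractInto("ab", counts, false)` leaves four emissions in `counts`; a third call with `clear` set to `true` resets the total to two. Passing `false` is the natural choice when one feature vector should cover several chunks of the same document.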
extractAndCount
public int extractAndCount(String rawText, int[] counts)
Description copied from interface: FeatureExtractor
Extract features into counts and return the total n-gram emission count.
The count is the raw number of individual n-gram tokens processed before bucket hashing. It is a script-neutral measure of how much signal the input carries: whitespace-only input yields 0; ~200 chars of typical Latin or CJK prose yields roughly 400. This is the right threshold variable for length-gated confusables because it is insensitive to padding spaces or punctuation-heavy inputs, and it naturally accounts for the higher feature density of CJK text vs. Latin text.
The default implementation sums the feature vector after extraction, which is correct because every emission does counts[bucket]++; the sum therefore equals the total emission count regardless of hash collisions.
- Specified by:
extractAndCount in interface FeatureExtractor
- Parameters:
rawText - raw input text (may be null)
counts - pre-allocated int array of size FeatureExtractor.getNumBuckets() (will be zeroed)
- Returns:
total n-gram emission count (≥ 0)
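The claim that the vector sum equals the emission count even under collisions can be checked with a small stand-alone sketch. `EmissionCountSketch` is a hypothetical helper that takes pre-computed hash values rather than text; it is not part of the library.

```java
import java.util.Arrays;

final class EmissionCountSketch {
    // Zero the buffer, increment one bucket per hash, and return the
    // sum of the buffer, which equals the number of emissions.
    static int emitAndCount(int[] hashes, int[] counts) {
        Arrays.fill(counts, 0);
        for (int h : hashes) {
            counts[Math.floorMod(h, counts.length)]++;   // one emission
        }
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return total;
    }
}
```

With hashes `{1, 5, 9, 13, 2}` and four buckets, the first four values all collide in bucket 1, yet the returned total is still 5: a collision moves a count to a shared bucket but never drops it.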
getNumBuckets
public int getNumBuckets()
- Specified by:
getNumBuckets in interface FeatureExtractor
- Returns:
number of hash buckets (feature vector size)
getFeatureFlags
public int getFeatureFlags()
Description copied from interface: FeatureExtractor
Returns the bitmask of CharSoupModel FLAG_* constants that describes which feature types this extractor emits.
This must match the featureFlags stored in any CharSoupModel used with this extractor. A mismatch means the model was trained with a different feature set and will produce garbage scores.
- Specified by:
getFeatureFlags in interface FeatureExtractor
- Returns:
bitmask of active feature flags
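A caller might guard against the mismatch described above with a check along these lines. This is a sketch only: the `DEMO_FLAG_*` constants and their bit positions are invented for illustration and do not correspond to the real CharSoupModel FLAG_* values, which this page does not list.

```java
final class FeatureFlagCheckSketch {
    // Hypothetical bit assignments, for illustration only.
    static final int DEMO_FLAG_BIGRAMS = 1 << 0;
    static final int DEMO_FLAG_TRIGRAMS = 1 << 1;
    static final int DEMO_FLAG_WORD_BIGRAMS = 1 << 2;

    // Fail fast if the extractor and the model disagree on the feature set.
    static void requireSameFeatureSet(int extractorFlags, int modelFlags) {
        if (extractorFlags != modelFlags) {
            int differing = extractorFlags ^ modelFlags;   // bits set on one side only
            throw new IllegalStateException(
                "Feature flag mismatch: extractor=0x" + Integer.toHexString(extractorFlags)
                + ", model=0x" + Integer.toHexString(modelFlags)
                + ", differing bits=0x" + Integer.toHexString(differing));
        }
    }
}
```

In real use the two arguments would come from `getFeatureFlags()` on the extractor and from the flags stored with the model. Throwing before scoring is preferable to silently producing meaningless scores.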