Interface FeatureExtractor
All Known Implementing Classes:
SaltedNgramFeatureExtractor, ScriptAwareFeatureExtractor, ShortTextFeatureExtractor
CharSoupFeatureExtractor.preprocess(String)) but may differ in how
they extract and hash features from the preprocessed text.
Method Summary
Modifier and Type | Method | Description
int[] | extract(String rawText) | Full preprocessing + feature extraction pipeline.
void | extract(String rawText, int[] counts) | Extract into caller-supplied buffer (zeroed first).
default int | extractAndCount(String rawText, int[] counts) | Extract features into counts and return the total n-gram emission count.
int[] | extractFromPreprocessed(String preprocessedText) | Extract from already-preprocessed text.
void | extractFromPreprocessed(String preprocessedText, int[] counts, boolean clear) | Extract from already-preprocessed text into a caller-supplied buffer.
int | getFeatureFlags() | Returns the bitmask of CharSoupModel FLAG_* constants that describes which feature types this extractor emits.
int | getNumBuckets() |
-
Method Details
-
extract
int[] extract(String rawText)
Full preprocessing + feature extraction pipeline.
Parameters:
rawText - raw input text (may be null)
Returns:
int array of size getNumBuckets() with feature counts
-
extract
void extract(String rawText, int[] counts)
Extract into caller-supplied buffer (zeroed first).
Parameters:
rawText - raw input text (may be null)
counts - pre-allocated int array of size getNumBuckets() (will be zeroed)
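The buffer overload lets a caller allocate the counts array once and reuse it across many inputs. The sketch below illustrates that pattern under stated assumptions: BufferExtractor is a hypothetical stand-in that declares only the two methods used here, and ToyBigramExtractor is a toy hashing scheme, not one of the real implementing classes.

```java
// Hypothetical stand-in that declares only the two members used below.
interface BufferExtractor {
    int getNumBuckets();
    void extract(String rawText, int[] counts);
}

// Toy implementation for illustration only: hashes adjacent character pairs.
// The real extractors preprocess and hash differently.
final class ToyBigramExtractor implements BufferExtractor {
    private final int numBuckets;

    ToyBigramExtractor(int numBuckets) {
        this.numBuckets = numBuckets;
    }

    @Override
    public int getNumBuckets() {
        return numBuckets;
    }

    @Override
    public void extract(String rawText, int[] counts) {
        java.util.Arrays.fill(counts, 0); // buffer is zeroed first, as documented
        if (rawText == null) {
            return; // in this toy, null input leaves an all-zero vector
        }
        for (int i = 0; i + 1 < rawText.length(); i++) {
            int hash = rawText.charAt(i) * 31 + rawText.charAt(i + 1);
            counts[Math.floorMod(hash, numBuckets)]++;
        }
    }
}

final class BufferReuseDemo {
    // Returns the sum of the feature vector for each document, reusing one buffer.
    static int[] totalsPerDocument(BufferExtractor extractor, String[] docs) {
        int[] counts = new int[extractor.getNumBuckets()]; // allocated once
        int[] totals = new int[docs.length];
        for (int d = 0; d < docs.length; d++) {
            extractor.extract(docs[d], counts); // previous contents are overwritten
            int sum = 0;
            for (int c : counts) {
                sum += c;
            }
            totals[d] = sum;
        }
        return totals;
    }
}
```

Because the buffer is zeroed on every call, each total reflects only the current document. A caller that needs to keep a result must copy the array before the next call.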
-
extractFromPreprocessed
int[] extractFromPreprocessed(String preprocessedText)
Extract from already-preprocessed text.
Parameters:
preprocessedText - text already passed through CharSoupFeatureExtractor.preprocess(String)
Returns:
int array of size getNumBuckets() with feature counts
-
extractFromPreprocessed
void extractFromPreprocessed(String preprocessedText, int[] counts, boolean clear)
Extract from already-preprocessed text into a caller-supplied buffer.
Parameters:
preprocessedText - text already passed through CharSoupFeatureExtractor.preprocess(String)
counts - pre-allocated int array of size getNumBuckets()
clear - if true, zero the array before extracting; if false, accumulate on top of existing counts
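The clear flag makes it possible to build one feature vector from several pieces of text. The following is a minimal sketch of that accumulation, assuming a hypothetical stand-in interface (PreprocessedExtractor) and a toy per-character extractor (ToyUnigramExtractor); neither is part of the documented API.

```java
// Hypothetical stand-in that declares only the members used below.
interface PreprocessedExtractor {
    int getNumBuckets();
    void extractFromPreprocessed(String preprocessedText, int[] counts, boolean clear);
}

// Toy implementation for illustration only: one feature per character.
final class ToyUnigramExtractor implements PreprocessedExtractor {
    private final int numBuckets;

    ToyUnigramExtractor(int numBuckets) {
        this.numBuckets = numBuckets;
    }

    @Override
    public int getNumBuckets() {
        return numBuckets;
    }

    @Override
    public void extractFromPreprocessed(String preprocessedText, int[] counts, boolean clear) {
        if (clear) {
            java.util.Arrays.fill(counts, 0); // start from an empty vector
        }
        for (int i = 0; i < preprocessedText.length(); i++) {
            counts[preprocessedText.charAt(i) % numBuckets]++; // accumulate
        }
    }
}

final class ChunkAccumulationDemo {
    // Builds one vector from several chunks: clear on the first, accumulate after.
    static int[] accumulate(PreprocessedExtractor extractor, String[] chunks) {
        int[] counts = new int[extractor.getNumBuckets()];
        for (int i = 0; i < chunks.length; i++) {
            boolean clear = (i == 0); // redundant for a fresh array, needed for a reused one
            extractor.extractFromPreprocessed(chunks[i], counts, clear);
        }
        return counts;
    }
}
```

In this toy, per-character features make the chunked result identical to a single pass over the concatenated text. An extractor that emits multi-character n-grams may differ at chunk boundaries.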
-
extractAndCount
default int extractAndCount(String rawText, int[] counts)
Extract features into counts and return the total n-gram emission count.
The count is the raw number of individual n-gram tokens processed before bucket hashing. It is a script-neutral measure of how much signal the input carries: whitespace-only input yields 0; ~200 chars of typical Latin or CJK prose yields roughly 400. This is the right threshold variable for length-gated confusables because it is insensitive to padding spaces or punctuation-heavy inputs, and it naturally accounts for the higher feature density of CJK text vs. Latin text.
The default implementation sums the feature vector after extraction, which is correct because every emission does counts[bucket]++; the sum therefore equals the total emission count regardless of hash collisions.
Parameters:
rawText - raw input text (may be null)
counts - pre-allocated int array of size getNumBuckets() (will be zeroed)
Returns:
total n-gram emission count (≥ 0)
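The claim that the sum of the vector equals the emission count even under collisions can be checked with a small sketch. Everything here is an assumption made for illustration: CountingExtractor is a hypothetical stand-in whose default method follows the description above, ToyPairExtractor is a toy that skips pairs containing whitespace, and hasEnoughSignal is a hypothetical length gate.

```java
// Hypothetical stand-in; the default method follows the documented behaviour.
interface CountingExtractor {
    int getNumBuckets();
    void extract(String rawText, int[] counts);

    default int extractAndCount(String rawText, int[] counts) {
        extract(rawText, counts);
        int total = 0;
        for (int c : counts) {
            total += c; // each emission added exactly 1 to some bucket
        }
        return total;
    }
}

// Toy implementation for illustration only: emits one feature per adjacent
// character pair, skipping any pair that contains whitespace.
final class ToyPairExtractor implements CountingExtractor {
    private final int numBuckets;

    ToyPairExtractor(int numBuckets) {
        this.numBuckets = numBuckets;
    }

    @Override
    public int getNumBuckets() {
        return numBuckets;
    }

    @Override
    public void extract(String rawText, int[] counts) {
        java.util.Arrays.fill(counts, 0);
        if (rawText == null) {
            return;
        }
        for (int i = 0; i + 1 < rawText.length(); i++) {
            char a = rawText.charAt(i);
            char b = rawText.charAt(i + 1);
            if (Character.isWhitespace(a) || Character.isWhitespace(b)) {
                continue; // no emission for padding
            }
            counts[Math.floorMod(a * 31 + b, numBuckets)]++;
        }
    }
}

final class EmissionCountDemo {
    static int emissions(String text, int numBuckets) {
        CountingExtractor extractor = new ToyPairExtractor(numBuckets);
        return extractor.extractAndCount(text, new int[numBuckets]);
    }

    // Hypothetical length gate: true when the input carries enough n-gram signal.
    static boolean hasEnoughSignal(String text, int minEmissions) {
        return emissions(text, 64) >= minEmissions;
    }
}
```

With only two buckets, the three pairs of "abcd" are forced to collide, yet emissions("abcd", 2) and emissions("abcd", 1024) return the same total, because collisions move counts between buckets without changing their sum.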
-
getNumBuckets
int getNumBuckets()
Returns:
number of hash buckets (feature vector size)
-
getFeatureFlags
int getFeatureFlags()
Returns the bitmask of CharSoupModel FLAG_* constants that describes which feature types this extractor emits.
This must match the featureFlags stored in any CharSoupModel used with this extractor. A mismatch means the model was trained with a different feature set and will produce garbage scores.
Returns:
bitmask of active feature flags
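Since a mismatch silently yields meaningless scores, a caller may want to fail fast when pairing an extractor with a model. The sketch below shows one way to do that, assuming the two bitmasks have already been read as plain ints. The class, the method, and the two FLAG_EXAMPLE_* values are hypothetical and do not correspond to the real FLAG_* constants.

```java
final class FeatureFlagCheck {
    // Hypothetical flag values for illustration only; the real FLAG_* constants
    // are defined by CharSoupModel and are not reproduced here.
    static final int FLAG_EXAMPLE_A = 1;
    static final int FLAG_EXAMPLE_B = 1 << 1;

    // Throws if the extractor's bitmask differs from the one stored with the model.
    static void requireMatchingFlags(int extractorFlags, int modelFlags) {
        if (extractorFlags != modelFlags) {
            int differing = extractorFlags ^ modelFlags; // bits set in exactly one mask
            throw new IllegalStateException(
                    "Feature flag mismatch: extractor=0b" + Integer.toBinaryString(extractorFlags)
                            + ", model=0b" + Integer.toBinaryString(modelFlags)
                            + ", differing bits=0b" + Integer.toBinaryString(differing));
        }
    }
}
```

A call such as requireMatchingFlags(extractor.getFeatureFlags(), modelFlags) at load time would surface the problem before any text is scored; how the model exposes its stored featureFlags is not shown in this page, so modelFlags is left as a plain parameter.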
-