All Known Implementing Classes: SaltedNgramFeatureExtractor, ScriptAwareFeatureExtractor, ShortTextFeatureExtractor

public interface FeatureExtractor

Common interface for feature extractors used by the bigram language detector. Implementations must share the same preprocessing pipeline (CharSoupFeatureExtractor.preprocess(String)) but may differ in how they extract and hash features from the preprocessed text.
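
To make the contract concrete, here is a minimal sketch of a toy implementation. Everything in it is an assumption made for illustration: `Extractor` is a local stand-in that mirrors only part of the documented interface, `preprocess` is a hypothetical placeholder for CharSoupFeatureExtractor.preprocess(String), and the bigram hash is arbitrary. It is not the library's code.

```java
import java.util.Arrays;
import java.util.Locale;

final class FeatureExtractorSketch {

    /** Local stand-in mirroring part of the documented interface; not the library type. */
    interface Extractor {
        int getNumBuckets();

        void extractFromPreprocessed(String preprocessedText, int[] counts, boolean clear);

        /** Follows the documented default: zero the buffer, extract, then sum the vector. */
        default int extractAndCount(String rawText, int[] counts) {
            extractFromPreprocessed(preprocess(rawText), counts, true);
            return Arrays.stream(counts).sum();
        }
    }

    /** Hypothetical placeholder for the shared preprocessing step (null-safe). */
    static String preprocess(String rawText) {
        if (rawText == null) {
            return "";
        }
        return rawText.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ");
    }

    /** Toy extractor: hashes each pair of adjacent non-space chars into one bucket. */
    static final class ToyBigramExtractor implements Extractor {
        private final int numBuckets;

        ToyBigramExtractor(int numBuckets) {
            this.numBuckets = numBuckets;
        }

        @Override
        public int getNumBuckets() {
            return numBuckets;
        }

        @Override
        public void extractFromPreprocessed(String text, int[] counts, boolean clear) {
            if (clear) {
                Arrays.fill(counts, 0);
            }
            for (int i = 0; i + 1 < text.length(); i++) {
                char a = text.charAt(i);
                char b = text.charAt(i + 1);
                if (a == ' ' || b == ' ') {
                    continue; // skip pairs that span a word boundary
                }
                // Arbitrary illustrative hash; each emission increments exactly one bucket.
                counts[Math.floorMod(31 * a + b, numBuckets)]++;
            }
        }
    }

    public static void main(String[] args) {
        Extractor wide = new ToyBigramExtractor(1024);
        Extractor narrow = new ToyBigramExtractor(4); // forces hash collisions

        int[] wideCounts = new int[wide.getNumBuckets()];
        int[] narrowCounts = new int[narrow.getNumBuckets()];

        // The emission count is the same whether or not buckets collide.
        System.out.println(wide.extractAndCount("Hello   World", wideCounts));     // prints 8
        System.out.println(narrow.extractAndCount("Hello   World", narrowCounts)); // prints 8

        // clear = false accumulates on top of the existing counts.
        wide.extractFromPreprocessed("ab", wideCounts, true);
        wide.extractFromPreprocessed("cd", wideCounts, false);
        System.out.println(Arrays.stream(wideCounts).sum()); // prints 2
    }
}
```

Both toy extractors share one preprocessing step and differ only in bucket count, which is the kind of variation the interface allows. The 4-bucket instance shows that the summed vector still equals the number of emissions when buckets collide.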

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| int[] | extract(String rawText) | Full preprocessing + feature extraction pipeline. |
| void | extract(String rawText, int[] counts) | Extract into caller-supplied buffer (zeroed first). |
| default int | extractAndCount(String rawText, int[] counts) | Extract features into counts and return the total n-gram emission count. |
| int[] | extractFromPreprocessed(String preprocessedText) | Extract from already-preprocessed text. |
| void | extractFromPreprocessed(String preprocessedText, int[] counts, boolean clear) | Extract from already-preprocessed text into a caller-supplied buffer. |
| int | getFeatureFlags() | Returns the bitmask of CharSoupModel FLAG_* constants that describes which feature types this extractor emits. |
| int | getNumBuckets() | |

Method Details
- extract
  
  int[] extract(String rawText)
  
  Full preprocessing + feature extraction pipeline.
  
  Parameters:
  
  rawText - raw input text (may be null)
  
  Returns:
  
  int array of size getNumBuckets() with feature counts
- extract
  
  void extract(String rawText, int[] counts)
  
  Extract into caller-supplied buffer (zeroed first).
  
  Parameters:
  
  rawText - raw input text (may be null)
  
  counts - pre-allocated int array of size getNumBuckets() (will be zeroed)
- extractFromPreprocessed
  
  int[] extractFromPreprocessed(String preprocessedText)
  
  Extract from already-preprocessed text.
  
  Parameters:
  
  preprocessedText - text already passed through CharSoupFeatureExtractor.preprocess(String)
  
  Returns:
  
  int array of size getNumBuckets() with feature counts
- extractFromPreprocessed
  
  void extractFromPreprocessed(String preprocessedText, int[] counts, boolean clear)
  
  Extract from already-preprocessed text into a caller-supplied buffer.
  
  Parameters:
  
  preprocessedText - text already passed through CharSoupFeatureExtractor.preprocess(String)
  
  counts - pre-allocated int array of size getNumBuckets()
  
  clear - if true, zero the array before extracting; if false, accumulate on top of existing counts
- extractAndCount
  
  default int extractAndCount(String rawText, int[] counts)
  
  Extract features into counts and return the total n-gram emission count.
  The count is the raw number of individual n-gram tokens emitted, before bucket hashing. It is a script-neutral measure of how much signal the input carries: whitespace-only input yields 0, while ~200 chars of typical Latin or CJK prose yields roughly 400. This makes it a suitable threshold variable for length-gated confusables: it is insensitive to padding spaces and punctuation-heavy inputs, and it naturally accounts for the higher feature density of CJK text relative to Latin text.
  The default implementation sums the feature vector after extraction. This is correct because every emission increments exactly one bucket (counts[bucket]++), so the sum equals the total emission count regardless of hash collisions.
  
  Parameters:
  
  rawText - raw input text (may be null)
  
  counts - pre-allocated int array of size getNumBuckets() (will be zeroed)
  
  Returns:
  
  total n-gram emission count (≥ 0)
- getNumBuckets
  
  int getNumBuckets()
  
  Returns:
  
  number of hash buckets (feature vector size)
- getFeatureFlags
  
  int getFeatureFlags()
  
  Returns the bitmask of CharSoupModel FLAG_* constants that describes which feature types this extractor emits.
  This must match the featureFlags stored in any CharSoupModel used with this extractor. A mismatch means the model was trained with a different feature set and will produce garbage scores.
  
  Returns:
  
  bitmask of active feature flags
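
Because a flag mismatch produces garbage scores rather than an error, a caller may want to fail fast when pairing an extractor with a model. The sketch below shows one way to do that, assuming the two bitmasks have already been read from getFeatureFlags() and from the model's stored featureFlags. The flag constants and the helper names are hypothetical and used only for illustration; the real FLAG_* values are defined by CharSoupModel.

```java
final class FeatureFlagCheckSketch {

    // Hypothetical flag bits for illustration only; not the real CharSoupModel constants.
    static final int HYPOTHETICAL_FLAG_A = 1 << 0;
    static final int HYPOTHETICAL_FLAG_B = 1 << 1;
    static final int HYPOTHETICAL_FLAG_C = 1 << 2;

    /** Returns true when both bitmasks describe exactly the same feature set. */
    static boolean flagsMatch(int extractorFlags, int modelFlags) {
        return extractorFlags == modelFlags;
    }

    /** Throws if the extractor and the model disagree on any feature flag. */
    static void requireMatchingFlags(int extractorFlags, int modelFlags) {
        if (!flagsMatch(extractorFlags, modelFlags)) {
            int differingBits = extractorFlags ^ modelFlags; // bits set in only one mask
            throw new IllegalStateException(String.format(
                    "Feature flag mismatch: extractor=0x%x, model=0x%x, differing bits=0x%x",
                    extractorFlags, modelFlags, differingBits));
        }
    }

    public static void main(String[] args) {
        int extractorFlags = HYPOTHETICAL_FLAG_A | HYPOTHETICAL_FLAG_B;
        int modelFlags = HYPOTHETICAL_FLAG_A | HYPOTHETICAL_FLAG_C;

        requireMatchingFlags(extractorFlags, extractorFlags); // identical masks pass

        try {
            requireMatchingFlags(extractorFlags, modelFlags);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The check uses strict equality rather than a subset test because the documentation requires the flags to match; the XOR in the error message only helps pinpoint which bits differ.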
