Interface FeatureExtractor

All Known Implementing Classes:
SaltedNgramFeatureExtractor, ScriptAwareFeatureExtractor, ShortTextFeatureExtractor

public interface FeatureExtractor
Common interface for feature extractors used by the bigram language detector. Implementations must share the same preprocessing pipeline (CharSoupFeatureExtractor.preprocess(String)) but may differ in how they extract and hash features from the preprocessed text.
  • Method Summary

    Modifier and Type
    Method
    Description
    int[]
    extract(String rawText)
    Full preprocessing + feature extraction pipeline.
    void
    extract(String rawText, int[] counts)
    Extract into caller-supplied buffer (zeroed first).
    default int
    extractAndCount(String rawText, int[] counts)
    Extract features into counts and return the total n-gram emission count.
    int[]
    extractFromPreprocessed(String preprocessedText)
    Extract from already-preprocessed text.
    void
    extractFromPreprocessed(String preprocessedText, int[] counts, boolean clear)
    Extract from already-preprocessed text into a caller-supplied buffer.
    int
    getFeatureFlags()
    Returns the bitmask of CharSoupModel FLAG_* constants that describes which feature types this extractor emits.
    int
    getNumBuckets()
    Returns the number of hash buckets (feature vector size).
  • Method Details

    • extract

      int[] extract(String rawText)
      Runs the full pipeline: preprocessing followed by feature extraction.
      Parameters:
      rawText - raw input text (may be null)
      Returns:
      int array of size getNumBuckets() with feature counts
    • extract

      void extract(String rawText, int[] counts)
      Extracts features from raw text into a caller-supplied buffer, which is zeroed first.
      Parameters:
      rawText - raw input text (may be null)
      counts - pre-allocated int array of size getNumBuckets() (will be zeroed)
    • extractFromPreprocessed

      int[] extractFromPreprocessed(String preprocessedText)
      Extract from already-preprocessed text.
      Parameters:
      preprocessedText - text already passed through CharSoupFeatureExtractor.preprocess(String)
      Returns:
      int array of size getNumBuckets() with feature counts
    • extractFromPreprocessed

      void extractFromPreprocessed(String preprocessedText, int[] counts, boolean clear)
      Extract from already-preprocessed text into a caller-supplied buffer.
      Parameters:
      preprocessedText - text already passed through CharSoupFeatureExtractor.preprocess(String)
      counts - pre-allocated int array of size getNumBuckets()
      clear - if true, zero the array before extracting; if false, accumulate on top of existing counts
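      The clear flag makes it possible to build one feature vector from several pieces of text. The following is a minimal sketch of that pattern, not the library's code: ToyChunkExtractor is a hypothetical stand-in that mirrors only the documented signature and uses a toy bigram hash, and ChunkAccumulationDemo.accumulate is an illustrative helper name.

```java
import java.util.Arrays;

// Hypothetical stand-in that mirrors the documented three-argument signature.
// The bigram hashing is a toy scheme, not the real extractor's algorithm.
final class ToyChunkExtractor {
    private final int numBuckets;

    ToyChunkExtractor(int numBuckets) {
        this.numBuckets = numBuckets;
    }

    int getNumBuckets() {
        return numBuckets;
    }

    void extractFromPreprocessed(String preprocessedText, int[] counts, boolean clear) {
        if (clear) {
            Arrays.fill(counts, 0); // start from a clean vector
        }
        for (int i = 0; i + 1 < preprocessedText.length(); i++) {
            int h = preprocessedText.charAt(i) * 31 + preprocessedText.charAt(i + 1);
            counts[Math.floorMod(h, counts.length)]++; // one emission per bigram
        }
    }
}

final class ChunkAccumulationDemo {
    // Clears the buffer on the first chunk only, then accumulates the rest.
    static int[] accumulate(ToyChunkExtractor extractor, String... chunks) {
        int[] counts = new int[extractor.getNumBuckets()];
        boolean clear = true;
        for (String chunk : chunks) {
            extractor.extractFromPreprocessed(chunk, counts, clear);
            clear = false;
        }
        return counts;
    }
}
```

      In this sketch, n-grams that would span a chunk boundary are not emitted, so accumulating chunks is not necessarily identical to extracting from the concatenated text.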
    • extractAndCount

      default int extractAndCount(String rawText, int[] counts)
      Extract features into counts and return the total n-gram emission count.

      The count is the raw number of individual n-gram tokens processed before bucket hashing. It is a script-neutral measure of how much signal the input carries: whitespace-only input yields 0, while about 200 characters of typical Latin or CJK prose yield roughly 400 emissions. This makes it a suitable threshold variable for length-gated confusables: it is insensitive to padding spaces and punctuation-heavy inputs, and it naturally accounts for the higher feature density of CJK text compared with Latin text.

      The default implementation sums the feature vector after extraction, which is correct because every emission does counts[bucket]++; the sum therefore equals the total emission count regardless of hash collisions.

      Parameters:
      rawText - raw input text (may be null)
      counts - pre-allocated int array of size getNumBuckets() (will be zeroed)
      Returns:
      total n-gram emission count (≥ 0)
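      The reasoning behind the default implementation can be sketched as follows. This is an illustration under stated assumptions, not the library's source: CountingExtractor is a hypothetical interface that mirrors two of the documented methods, and ToyBigramExtractor is a toy implementation that skips preprocessing and emits one count per adjacent character pair.

```java
import java.util.Arrays;

// Hypothetical interface mirroring two documented methods of FeatureExtractor.
interface CountingExtractor {
    void extract(String rawText, int[] counts);

    // Sketch of the documented default: sum the feature vector after extraction.
    // Each emission increments exactly one bucket, so the sum equals the number
    // of emissions even when several n-grams collide in the same bucket.
    default int extractAndCount(String rawText, int[] counts) {
        extract(rawText, counts);
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return total;
    }
}

// Toy extractor (assumption): no preprocessing, one emission per character bigram.
final class ToyBigramExtractor implements CountingExtractor {
    @Override
    public void extract(String rawText, int[] counts) {
        Arrays.fill(counts, 0); // buffer is zeroed first, as documented
        if (rawText == null) {
            return; // null input yields an all-zero vector
        }
        for (int i = 0; i + 1 < rawText.length(); i++) {
            int h = rawText.charAt(i) * 31 + rawText.charAt(i + 1);
            counts[Math.floorMod(h, counts.length)]++;
        }
    }
}
```

      With a single bucket every bigram collides, yet the returned total is still the number of bigrams, which is the property the default implementation relies on.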
    • getNumBuckets

      int getNumBuckets()
      Returns:
      number of hash buckets (feature vector size)
    • getFeatureFlags

      int getFeatureFlags()
      Returns the bitmask of CharSoupModel FLAG_* constants that describes which feature types this extractor emits.

      This must match the featureFlags stored in any CharSoupModel used with this extractor. A mismatch means the model was trained with a different feature set, so its scores will be meaningless.

      Returns:
      bitmask of active feature flags
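      Because a mismatch fails silently, a caller may want to check the two bitmasks before scoring. The sketch below is one hedged way to do that. FeatureFlagGuard and requireMatchingFlags are hypothetical names, and since this page does not show how CharSoupModel exposes its featureFlags, both values are passed in as plain ints.

```java
// Hypothetical helper, not part of the library: fails fast on a flag mismatch.
final class FeatureFlagGuard {
    private FeatureFlagGuard() {
    }

    // extractorFlags: the value returned by getFeatureFlags() on the extractor.
    // modelFlags: the featureFlags stored in the model (how to read it is an
    // assumption left to the caller).
    static void requireMatchingFlags(int extractorFlags, int modelFlags) {
        if (extractorFlags != modelFlags) {
            throw new IllegalStateException(String.format(
                    "Feature flag mismatch: extractor=0x%X, model=0x%X",
                    extractorFlags, modelFlags));
        }
    }
}
```

      Calling the guard once when an extractor and a model are paired turns an otherwise silent scoring problem into an immediate, descriptive error.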