org.apache.tika.langdetect.charsoup.SaltedNgramFeatureExtractor

All Implemented Interfaces: FeatureExtractor

public class SaltedNgramFeatureExtractor extends Object implements FeatureExtractor

Feature extractor that encodes word position with a salt byte (BOW, EOW, FULL_WORD, MID) instead of placing sentinel characters in the n-grams.

Design principles

- A single FNV basis constant is used for all features. A one-byte salt prefix distinguishes feature types; n-gram order is differentiated by the number of codepoints fed into the hash chain.
- N-grams always contain N real characters, with no sentinel padding.
- Word position is encoded via salt bytes (BOW, EOW, FULL_WORD, MID).
- N-grams are not salted by script: different scripts use different codepoint ranges, so their hashes separate naturally.
- Short complete words (1–4 chars) get a FULL_WORD salt on their matching n-gram order, replacing the separate word-unigram feature.
- Script block features (presence counts and transition counts) provide an explicit script signal for the linear classifier.
- CJK/kana character unigrams use a dedicated salt, because CJK text has no word boundaries.
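The hashing scheme described above can be sketched as follows. This is a minimal illustration, not the class's actual code: the salt byte values, the choice to feed each codepoint as four little-endian bytes, and the names `SaltedHashSketch`, `hash`, and `bucket` are all assumptions. Only the 32-bit FNV-1a offset basis and prime are standard constants.

```java
final class SaltedHashSketch {
    // Standard 32-bit FNV-1a parameters.
    static final int FNV_OFFSET_BASIS = 0x811C9DC5;
    static final int FNV_PRIME = 0x01000193;

    // Hypothetical salt values; the real constants are not given on this page.
    static final byte SALT_MID = 0;
    static final byte SALT_BOW = 1;
    static final byte SALT_EOW = 2;
    static final byte SALT_FULL_WORD = 3;

    // One FNV-1a step: xor in a byte, then multiply by the prime.
    static int mixByte(int h, int b) {
        return (h ^ (b & 0xFF)) * FNV_PRIME;
    }

    // Assumption: a codepoint is fed as four bytes, least significant first.
    static int mixCodepoint(int h, int cp) {
        h = mixByte(h, cp);
        h = mixByte(h, cp >>> 8);
        h = mixByte(h, cp >>> 16);
        h = mixByte(h, cp >>> 24);
        return h;
    }

    // Hash = basis, then the salt byte, then exactly n real codepoints.
    static int hash(byte salt, int[] cps, int from, int n) {
        int h = mixByte(FNV_OFFSET_BASIS, salt);
        for (int i = from; i < from + n; i++) {
            h = mixCodepoint(h, cps[i]);
        }
        return h;
    }

    // Fold the 32-bit hash into a non-negative bucket index.
    static int bucket(byte salt, int[] cps, int from, int n, int numBuckets) {
        return Math.floorMod(hash(salt, cps, from, n), numBuckets);
    }
}
```

Each FNV-1a step is invertible for a fixed input byte, so in this sketch the same n-gram hashed under two different salts always yields two different 32-bit values; the two can still share a bucket after the modulo.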

Feature types

- Character bigrams: all contiguous pairs within a word, plus BOW/EOW/FULL_WORD variants.
- Character trigrams: all contiguous triples, with position salt.
- Character 4-grams: all contiguous quads, with position salt.
- CJK/kana unigrams: individual ideographic/kana codepoints.
- Script blocks: per-script letter counts and transition counts.
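To make the position labels concrete, the sketch below enumerates the n-grams of one word and tags each with a position. It rests on an assumption: every n-gram receives exactly one label (FULL_WORD when the word is exactly N codepoints long, otherwise BOW, MID, or EOW by offset). The wording "plus BOW/EOW/FULL_WORD variants" could also mean that boundary n-grams are emitted in addition to an unsalted copy, which this page does not settle. The names `PositionalNgramSketch`, `Position`, and `ngrams` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class PositionalNgramSketch {
    enum Position { BOW, MID, EOW, FULL_WORD }

    // Returns "LABEL:ngram" strings for every contiguous n-gram in the word.
    static List<String> ngrams(String word, int n) {
        int[] cps = word.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= cps.length; i++) {
            Position p;
            if (cps.length == n) {
                p = Position.FULL_WORD;   // the n-gram is the whole word
            } else if (i == 0) {
                p = Position.BOW;         // starts at the beginning of the word
            } else if (i + n == cps.length) {
                p = Position.EOW;         // ends at the end of the word
            } else {
                p = Position.MID;         // strictly inside the word
            }
            // No sentinel characters: the n-gram holds n real codepoints.
            out.add(p + ":" + new String(cps, i, n));
        }
        return out;
    }
}
```

Under these assumptions, `ngrams("hello", 2)` produces `BOW:he`, `MID:el`, `MID:ll`, `EOW:lo`, and `ngrams("the", 3)` produces the single entry `FULL_WORD:the`.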

Field Summary

| Modifier and Type | Field |
| --- | --- |
| static final int | FEATURE_FLAGS |
| static final int | FEATURE_FLAGS_V11 |
| static final int | FEATURE_FLAGS_WITH_WORD_BIGRAMS |

Constructor Summary

- SaltedNgramFeatureExtractor(int numBuckets)
- SaltedNgramFeatureExtractor(int numBuckets, boolean useWordBigrams)
- SaltedNgramFeatureExtractor(int numBuckets, boolean useWordBigrams, boolean useWordLength)

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| int[] | extract(String rawText) | Full preprocessing + feature extraction pipeline. |
| void | extract(String rawText, int[] counts) | Extract into caller-supplied buffer (zeroed first). |
| int | extractAndCount(String rawText, int[] counts) | Extract features into counts and return the total n-gram emission count. |
| int[] | extractFromPreprocessed(String text) | Extract from already-preprocessed text. |
| void | extractFromPreprocessed(String text, int[] counts, boolean clear) | Extract from already-preprocessed text into a caller-supplied buffer. |
| int | getFeatureFlags() | Returns the bitmask of CharSoupModel FLAG_* constants that describes which feature types this extractor emits. |
| int | getNumBuckets() | |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- FEATURE_FLAGS
  
  public static final int FEATURE_FLAGS
  See Also:
  
  Constant Field Values
- FEATURE_FLAGS_WITH_WORD_BIGRAMS
  
  public static final int FEATURE_FLAGS_WITH_WORD_BIGRAMS
  See Also:
  
  Constant Field Values
- FEATURE_FLAGS_V11
  
  public static final int FEATURE_FLAGS_V11
  See Also:
  
  Constant Field Values

Constructor Details
- SaltedNgramFeatureExtractor
  
  public SaltedNgramFeatureExtractor(int numBuckets)
- SaltedNgramFeatureExtractor
  
  public SaltedNgramFeatureExtractor(int numBuckets, boolean useWordBigrams)
- SaltedNgramFeatureExtractor
  
  public SaltedNgramFeatureExtractor(int numBuckets, boolean useWordBigrams, boolean useWordLength)

Method Details
- extract
  
  public int[] extract(String rawText)
  
  Description copied from interface: FeatureExtractor
  
  Full preprocessing + feature extraction pipeline.
  
  Specified by:
  
  extract in interface FeatureExtractor
  
  Parameters:
  
  rawText - raw input text (may be null)
  
  Returns:
  
  int array of size FeatureExtractor.getNumBuckets() with feature counts
- extract
  
  public void extract(String rawText, int[] counts)
  
  Description copied from interface: FeatureExtractor
  
  Extract into caller-supplied buffer (zeroed first).
  
  Specified by:
  
  extract in interface FeatureExtractor
  
  Parameters:
  
  rawText - raw input text (may be null)
  
  counts - pre-allocated int array of size FeatureExtractor.getNumBuckets() (will be zeroed)
- extractFromPreprocessed
  
  public int[] extractFromPreprocessed(String text)
  
  Description copied from interface: FeatureExtractor
  
  Extract from already-preprocessed text.
  
  Specified by:
  
  extractFromPreprocessed in interface FeatureExtractor
  
  Parameters:
  
  text - text already passed through CharSoupFeatureExtractor.preprocess(String)
  
  Returns:
  
  int array of size FeatureExtractor.getNumBuckets() with feature counts
- extractFromPreprocessed
  
  public void extractFromPreprocessed(String text, int[] counts, boolean clear)
  
  Description copied from interface: FeatureExtractor
  
  Extract from already-preprocessed text into a caller-supplied buffer.
  
  Specified by:
  
  extractFromPreprocessed in interface FeatureExtractor
  
  Parameters:
  
  text - text already passed through CharSoupFeatureExtractor.preprocess(String)
  
  counts - pre-allocated int array of size FeatureExtractor.getNumBuckets()
  
  clear - if true, zero the array before extracting; if false, accumulate on top of existing counts
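  The effect of the clear flag can be illustrated with a toy stand-in. This is a sketch under stated assumptions, not this class's implementation: `AccumulateSketch` and `extractInto` are hypothetical names, and the "feature" is simply one count per letter codepoint, chosen only to show zeroing versus accumulation.

  ```java
  import java.util.Arrays;

  final class AccumulateSketch {
      // Toy extractor: one emission per letter codepoint, bucketed by modulo.
      static void extractInto(String text, int[] counts, boolean clear) {
          if (clear) {
              Arrays.fill(counts, 0);   // start from an empty feature vector
          }
          text.codePoints()
              .filter(Character::isLetter)
              .forEach(cp -> counts[Math.floorMod(cp, counts.length)]++);
      }

      static int sum(int[] counts) {
          int total = 0;
          for (int c : counts) {
              total += c;
          }
          return total;
      }
  }
  ```

  Calling `extractInto` with clear set to false on successive chunks builds one combined vector for the whole input, which is the use case the parameter description suggests; calling it with clear set to true reuses the buffer for an unrelated input.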
- extractAndCount
  
  public int extractAndCount(String rawText, int[] counts)
  
  Description copied from interface: FeatureExtractor
  
  Extract features into counts and return the total n-gram emission count.
  The count is the raw number of individual n-gram tokens processed before bucket hashing. It is a script-neutral measure of how much signal the input carries: whitespace-only input yields 0; ~200 chars of typical Latin or CJK prose yields roughly 400. This is the right threshold variable for length-gated confusables because it is insensitive to padding spaces or punctuation-heavy inputs, and it naturally accounts for the higher feature density of CJK text vs. Latin text.
  The default implementation sums the feature vector after extraction, which is correct because every emission does counts[bucket]++; the sum therefore equals the total emission count regardless of hash collisions.
  
  Specified by:
  
  extractAndCount in interface FeatureExtractor
  
  Parameters:
  
  rawText - raw input text (may be null)
  
  counts - pre-allocated int array of size FeatureExtractor.getNumBuckets() (will be zeroed)
  
  Returns:
  
  total n-gram emission count (≥ 0)
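  The argument that summing the vector recovers the emission count even under hash collisions can be checked with a small stand-in. The sketch below is illustrative only: `EmissionCountSketch` is a hypothetical name, the toy emits only letter bigrams, and its hash is arbitrary. The point is that each emission increments exactly one bucket, so collisions move counts between buckets without changing the sum.

  ```java
  import java.util.Arrays;

  final class EmissionCountSketch {
      // Toy version: emits one feature per adjacent pair of letters.
      static int extractAndCount(String text, int[] counts) {
          Arrays.fill(counts, 0);
          int[] cps = text.codePoints().toArray();
          for (int i = 0; i + 1 < cps.length; i++) {
              if (Character.isLetter(cps[i]) && Character.isLetter(cps[i + 1])) {
                  int h = 31 * cps[i] + cps[i + 1];          // arbitrary toy hash
                  counts[Math.floorMod(h, counts.length)]++; // one increment per emission
              }
          }
          // Sum of the vector equals the number of increments performed above.
          int total = 0;
          for (int c : counts) {
              total += c;
          }
          return total;
      }
  }
  ```

  With a single bucket every emission collides, yet `extractAndCount("ab cd", new int[1])` still returns 2, and whitespace-only input returns 0.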
- getNumBuckets
  
  public int getNumBuckets()
  
  Specified by:
  
  getNumBuckets in interface FeatureExtractor
  
  Returns:
  
  number of hash buckets (feature vector size)
- getFeatureFlags
  
  public int getFeatureFlags()
  
  Description copied from interface: FeatureExtractor
  
  Returns the bitmask of CharSoupModel FLAG_* constants that describes which feature types this extractor emits.
  This must match the featureFlags stored in any CharSoupModel used with this extractor. A mismatch means the model was trained with a different feature set and will produce garbage scores.
  
  Specified by:
  
  getFeatureFlags in interface FeatureExtractor
  
  Returns:
  
  bitmask of active feature flags
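  Because a flag mismatch silently produces meaningless scores, a caller may want to fail fast when pairing a model with an extractor. The guard below is a hedged sketch: `FeatureFlagCheckSketch`, `requireMatchingFlags`, and the two example bit values are hypothetical and do not correspond to the real CharSoupModel FLAG_* constants. In real code the two arguments would come from the model's stored featureFlags and from getFeatureFlags().

  ```java
  final class FeatureFlagCheckSketch {
      // Hypothetical bit assignments, for illustration only.
      static final int EXAMPLE_FLAG_NGRAMS = 1 << 0;
      static final int EXAMPLE_FLAG_WORD_BIGRAMS = 1 << 1;

      // Throws if the model was trained with a different feature set.
      static void requireMatchingFlags(int modelFlags, int extractorFlags) {
          if (modelFlags != extractorFlags) {
              throw new IllegalStateException(String.format(
                      "feature flag mismatch: model=0x%X extractor=0x%X",
                      modelFlags, extractorFlags));
          }
      }
  }
  ```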
