java.lang.Object

org.apache.tika.eval.core.tokens.TikaEvalTokenizer

public class TikaEvalTokenizer extends Object

Tokenizer for tika-eval text analysis. Provides two modes:

- TikaEvalTokenizer.Mode.STANDARD: for general token counting. Emits all alphabetic, ideographic, and numeric tokens with no minimum length and no skip list. Used by AnalyzerManager for NUM_TOKENS / NUM_UNIQUE_TOKENS.
- TikaEvalTokenizer.Mode.COMMON_TOKENS: for building and querying common-token frequency lists. Alphabetic only (no numbers), minimum 3 characters, common HTML markup terms excluded. Used by CommonTokenCountManager and the common token generator.
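
As a rough illustration of how the two modes differ, the filter below restates the rules above in plain Java. `ModeFilterSketch`, its local `Mode` enum, and the three-entry `SKIP` set are hypothetical stand-ins, not the tika-eval implementation. The actual skip list is not given here, and the sketch does not model how the 3-character minimum interacts with CJK bigrams.

```java
import java.util.Set;

final class ModeFilterSketch {

    // Local stand-in for TikaEvalTokenizer.Mode (assumption, for illustration only).
    enum Mode { STANDARD, COMMON_TOKENS }

    // Placeholder skip list; the real set of HTML markup terms is not documented here.
    static final Set<String> SKIP = Set.of("div", "span", "href");

    /** Returns true if a candidate token would survive the given mode's filter. */
    public static boolean keep(String token, Mode mode) {
        if (token.isEmpty()) {
            return false;
        }
        if (mode == Mode.STANDARD) {
            // Letters (including ideographs) and digits, no length limit, no skip list.
            return token.codePoints().allMatch(Character::isLetterOrDigit);
        }
        // COMMON_TOKENS: letters only, at least 3 code points, not on the skip list.
        return token.codePointCount(0, token.length()) >= 3
                && token.codePoints().allMatch(Character::isLetter)
                && !SKIP.contains(token);
    }
}
```

Under these assumptions, `"42"` and `"of"` pass in `STANDARD` but are rejected in `COMMON_TOKENS`, which is why the two modes can report different token counts for the same text.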

Both modes share the same preprocessing pipeline:

- URL/email stripping and truncation via CharSoupFeatureExtractor.preprocess(String)
- NFKD normalization for accent-insensitive matching (combining marks are dropped by CharSoupFeatureExtractor.isTransparent(int))
- Case folding via Character.toLowerCase(int)
- CJK character bigrams (no unigrams)

This class is intentionally separate from WordTokenizer to avoid parameterization in the language-detection hot path.
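
The normalization, case-folding, and bigram steps of the shared pipeline can be approximated with the JDK alone. The sketch below is not the tika-eval code: `PipelineSketch`, `normalize`, and `cjkBigrams` are hypothetical names, combining marks are detected with `Character.getType` rather than `CharSoupFeatureExtractor.isTransparent(int)`, and URL/email stripping is omitted. It also assumes a run containing a single CJK character yields no token, which the description above does not state explicitly.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

final class PipelineSketch {

    /** NFKD-decompose, drop combining marks, and lowercase each code point. */
    public static String normalize(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFKD);
        StringBuilder sb = new StringBuilder(decomposed.length());
        decomposed.codePoints().forEach(cp -> {
            int type = Character.getType(cp);
            boolean combining = type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK;
            if (!combining) {
                sb.appendCodePoint(Character.toLowerCase(cp));
            }
        });
        return sb.toString();
    }

    /** Overlapping two-character tokens over a run of CJK code points. */
    public static List<String> cjkBigrams(String run) {
        int[] cps = run.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < cps.length; i++) {
            out.add(new String(cps, i, 2)); // code points i and i + 1
        }
        return out;
    }
}
```

With this sketch, `normalize("Café")` yields `cafe`, and a three-character CJK run produces two overlapping bigrams.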

Nested Class Summary

Nested Classes

| Modifier and Type | Class | Description |
| --- | --- | --- |
| static enum | TikaEvalTokenizer.Mode | Tokenization mode. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| static List<String> | tokenize(String rawText) | Tokenize in TikaEvalTokenizer.Mode.COMMON_TOKENS mode and return tokens as a list. |
| static void | tokenize(String rawText, Consumer<String> consumer) | Tokenize in TikaEvalTokenizer.Mode.COMMON_TOKENS mode, streaming tokens to a consumer. |
| static List<String> | tokenize(String rawText, TikaEvalTokenizer.Mode mode) | Tokenize in the specified mode and return tokens as a list. |
| static void | tokenize(String rawText, TikaEvalTokenizer.Mode mode, int maxTokens, Consumer<String> consumer) | Tokenize in the specified mode, streaming at most maxTokens tokens to a consumer. |
| static void | tokenize(String rawText, TikaEvalTokenizer.Mode mode, Consumer<String> consumer) | Tokenize in the specified mode, streaming tokens to a consumer. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Method Details
- tokenize
  
  public static List<String> tokenize(String rawText)
  
  Tokenize in TikaEvalTokenizer.Mode.COMMON_TOKENS mode and return tokens as a list.
  
  Parameters:
  
  rawText - raw input text
  
  Returns:
  
  filtered token list
- tokenize
  
  public static List<String> tokenize(String rawText, TikaEvalTokenizer.Mode mode)
  
  Tokenize in the specified mode and return tokens as a list.
  
  Parameters:
  
  rawText - raw input text
  
  mode - tokenization mode
  
  Returns:
  
  token list
- tokenize
  
  public static void tokenize(String rawText, Consumer<String> consumer)
  
  Tokenize in TikaEvalTokenizer.Mode.COMMON_TOKENS mode, streaming tokens to a consumer.
  
  Parameters:
  
  rawText - raw input text
  
  consumer - receives each token
- tokenize
  
  public static void tokenize(String rawText, TikaEvalTokenizer.Mode mode, Consumer<String> consumer)
  
  Tokenize in the specified mode, streaming tokens to a consumer.
  
  Parameters:
  
  rawText - raw input text
  
  mode - tokenization mode
  
  consumer - receives each token
- tokenize
  
  public static void tokenize(String rawText, TikaEvalTokenizer.Mode mode, int maxTokens, Consumer<String> consumer)
  
  Tokenize in the specified mode, streaming at most maxTokens tokens to a consumer. Iteration stops as soon as the limit is reached, so the remainder of the string is never scanned.
  
  Parameters:
  
  rawText - raw input text
  
  mode - tokenization mode
  
  maxTokens - maximum number of tokens to emit; use Integer.MAX_VALUE for no limit
  
  consumer - receives each token
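
The early-stop behavior of the maxTokens overload can be pictured with a small stand-alone scanner. This is only a sketch under assumptions: `BoundedStreamSketch.stream` is a hypothetical helper that splits on non-letter, non-digit characters (char-based, so BMP only) and does none of the preprocessing described above. It returns the index at which scanning stopped, to make the skipped remainder visible.

```java
import java.util.function.Consumer;

final class BoundedStreamSketch {

    /**
     * Emits at most maxTokens tokens to the consumer and returns the
     * index in text where scanning stopped.
     */
    public static int stream(String text, int maxTokens, Consumer<String> consumer) {
        int emitted = 0;
        int i = 0;
        int n = text.length();
        while (i < n && emitted < maxTokens) {
            // Skip separator characters.
            while (i < n && !Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume one token.
            while (i < n && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                consumer.accept(text.substring(start, i));
                emitted++;
            }
        }
        return i; // everything from this index onward was never examined
    }
}
```

Calling `stream("alpha beta gamma delta", 2, tokens::add)` on an empty list leaves `alpha` and `beta` in it and stops at index 10. Passing `Integer.MAX_VALUE` scans the whole string.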
