Class TikaEvalTokenizer

java.lang.Object
org.apache.tika.eval.core.tokens.TikaEvalTokenizer

public class TikaEvalTokenizer extends Object
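Tokenizer for tika-eval text analysis. Provides two modes, selected through TikaEvalTokenizer.Mode; the overloads that take no mode argument use TikaEvalTokenizer.Mode.COMMON_TOKENS.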

Both modes share the same preprocessing pipeline:

  1. URL/email stripping and truncation via CharSoupFeatureExtractor.preprocess(String)
  2. NFKD normalization for accent-insensitive matching (combining marks are dropped by CharSoupFeatureExtractor.isTransparent(int))
  3. Case folding via Character.toLowerCase(int)
  4. CJK character bigrams (no unigrams)
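
Steps 2 through 4 can be illustrated with JDK classes alone. The sketch below is not the tika-eval implementation: the class PreprocessSketch and its methods fold and cjkBigrams are hypothetical names, and the behavior of CharSoupFeatureExtractor.isTransparent(int) is approximated by dropping code points in the Unicode mark categories. How the real tokenizer treats a single isolated CJK character is not stated here; this sketch emits nothing for it.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for illustration; not the tika-eval implementation. */
final class PreprocessSketch {

    /** Steps 2-3: NFKD-decompose, drop combining marks, lower-case per code point. */
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFKD);
        StringBuilder sb = new StringBuilder(decomposed.length());
        decomposed.codePoints().forEach(cp -> {
            int type = Character.getType(cp);
            // Assumption: "transparent" is approximated as the Unicode mark categories.
            boolean combining = type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK;
            if (!combining) {
                sb.appendCodePoint(Character.toLowerCase(cp));
            }
        });
        return sb.toString();
    }

    /** Step 4: overlapping bigrams over a run of CJK characters, no unigrams. */
    static List<String> cjkBigrams(String run) {
        int[] cps = run.codePoints().toArray();
        List<String> out = new ArrayList<>();
        // A run of length n yields n - 1 bigrams; a single character yields none.
        for (int i = 0; i + 1 < cps.length; i++) {
            out.add(new String(cps, i, 2));
        }
        return out;
    }

    public static void main(String[] args) {
        // "Caf\u00e9 \u00c9COLE" is "Café ÉCOLE" written with Unicode escapes.
        System.out.println(fold("Caf\u00e9 \u00c9COLE")); // prints "cafe ecole"
        // Three CJK characters produce two overlapping bigrams.
        System.out.println(cjkBigrams("\u6771\u4eac\u90fd").size()); // prints 2
    }
}
```

Folding by code point rather than by char keeps supplementary characters intact, which matters once CJK extension blocks are involved.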

This class is intentionally separate from WordTokenizer to avoid parameterization in the language-detection hot path.

  • Method Details

    • tokenize

      public static List<String> tokenize(String rawText)
      Tokenize in TikaEvalTokenizer.Mode.COMMON_TOKENS mode and return tokens as a list.
      Parameters:
      rawText - raw input text
      Returns:
      filtered token list
    • tokenize

      public static List<String> tokenize(String rawText, TikaEvalTokenizer.Mode mode)
      Tokenize in the specified mode and return tokens as a list.
      Parameters:
      rawText - raw input text
      mode - tokenization mode
      Returns:
      token list
    • tokenize

      public static void tokenize(String rawText, Consumer<String> consumer)
      Tokenize in TikaEvalTokenizer.Mode.COMMON_TOKENS mode, streaming tokens to a consumer.
      Parameters:
      rawText - raw input text
      consumer - receives each token
    • tokenize

      public static void tokenize(String rawText, TikaEvalTokenizer.Mode mode, Consumer<String> consumer)
      Tokenize in the specified mode, streaming tokens to a consumer.
      Parameters:
      rawText - raw input text
      mode - tokenization mode
      consumer - receives each token
    • tokenize

      public static void tokenize(String rawText, TikaEvalTokenizer.Mode mode, int maxTokens, Consumer<String> consumer)
      Tokenize in the specified mode, streaming at most maxTokens tokens to a consumer. Iteration stops as soon as the limit is reached, so the remainder of the string is never scanned.
      Parameters:
      rawText - raw input text
      mode - tokenization mode
      maxTokens - maximum number of tokens to emit; use Integer.MAX_VALUE for no limit
      consumer - receives each token
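
The relationship between the streaming overloads, the maxTokens cutoff, and the list-returning overloads can be sketched as follows. This is a minimal illustration under stated assumptions, not the tika-eval code: StreamingSketch is a hypothetical class, it splits on whitespace only, and it skips the mode parameter and the whole preprocessing pipeline. Whether the real list-returning overloads delegate to the streaming ones is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for illustration; not the tika-eval implementation. */
final class StreamingSketch {

    /** Streams whitespace-separated tokens and stops scanning at maxTokens. */
    static void tokenize(String text, int maxTokens, Consumer<String> consumer) {
        int emitted = 0;
        int i = 0;
        int n = text.length();
        // The limit is checked before each scan, so the tail is never touched.
        while (i < n && emitted < maxTokens) {
            while (i < n && Character.isWhitespace(text.charAt(i))) {
                i++; // skip separators
            }
            int start = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) {
                i++; // consume one token
            }
            if (i > start) {
                consumer.accept(text.substring(start, i));
                emitted++;
            }
        }
    }

    /** List-returning form expressed through the streaming form. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        tokenize(text, Integer.MAX_VALUE, tokens::add);
        return tokens;
    }

    public static void main(String[] args) {
        List<String> firstTwo = new ArrayList<>();
        tokenize("alpha beta gamma delta", 2, firstTwo::add);
        System.out.println(firstTwo); // prints [alpha, beta]
        System.out.println(tokenize("alpha beta gamma delta").size()); // prints 4
    }
}
```

A consumer lets callers aggregate tokens (for example into a frequency map) without materializing an intermediate list, while Integer.MAX_VALUE as the limit reduces the capped form to the uncapped one.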