Class AnalyzerManager

java.lang.Object
org.apache.tika.eval.core.tokens.AnalyzerManager

public class AnalyzerManager extends Object
Manages tokenization for tika-eval. Uses TikaEvalTokenizer in STANDARD mode, which emits alphabetic, ideographic, and numeric tokens with NFKD normalization, case folding, and CJK bigrams. No minimum-length filter or skip list is applied; those are used only in COMMON_TOKENS mode.
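To make the normalization steps concrete, the following is a minimal sketch of NFKD normalization followed by lowercasing, using only the JDK. The class NormalizationSketch and its normalize method are hypothetical names for illustration and are not part of tika-eval; lowercasing with Locale.ROOT is used here as an approximation of case folding, and TikaEvalTokenizer may implement these steps differently.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, not part of tika-eval: illustrates NFKD
// normalization followed by locale-independent lowercasing.
class NormalizationSketch {

    // Decompose compatibility characters (NFKD), then lowercase.
    // Note: NFKD keeps combining marks; it does not strip accents.
    static String normalize(String token) {
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFKD);
        return decomposed.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // The "fi" ligature (U+FB01) decomposes to the two letters "fi".
        System.out.println(normalize("\uFB01le"));               // prints "file"
        // Full-width Latin letters map to their ASCII counterparts.
        System.out.println(normalize("\uFF34\uFF49\uFF4B\uFF41")); // prints "tika"
    }
}
```

Normalizing before counting means that visually equivalent forms of a token, such as full-width and ASCII spellings, end up under the same key.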
  • Method Details

    • newInstance

      public static AnalyzerManager newInstance(int maxTokens)
      Creates an AnalyzerManager with the given maximum token limit.
    • tokenize

      public TokenCounts tokenize(String text)
      Tokenize the given text and return a TokenCounts object.
      Parameters:
      text - input text
      Returns:
      token counts
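As a rough picture of what a token-count result holds, here is a minimal sketch that tallies tokens into a plain map. The class TokenCountSketch is a hypothetical stand-in: it is not the TokenCounts class, and its simple split on non-letter, non-digit characters is an assumption made for illustration, not the behavior of TikaEvalTokenizer (it produces no CJK bigrams, for example).

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a token-count result, not the tika-eval TokenCounts class.
class TokenCountSketch {

    // Lowercase the text, split on runs of characters that are neither
    // letters nor digits, and count how often each token occurs.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {          // split may yield a leading empty string
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {the=2, cat=2, saw=1, other=1}
        System.out.println(count("The cat saw the other cat."));
    }
}
```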
    • tokenize

      public void tokenize(String text, Consumer<String> consumer)
      Tokenizes the text and streams each token to a consumer, respecting the maxTokens limit.
      Parameters:
      text - input text
      consumer - receives each token string
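The consumer-based pattern can be sketched as follows. CappedTokenizerSketch is a hypothetical class written for illustration, not the AnalyzerManager implementation: it splits on whitespace and assumes that respecting the limit means no more than maxTokens tokens are passed to the consumer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration of streaming tokens to a consumer under a cap.
// This is not the tika-eval implementation.
class CappedTokenizerSketch {

    private final int maxTokens;

    CappedTokenizerSketch(int maxTokens) {
        this.maxTokens = maxTokens;
    }

    // Pass whitespace-separated tokens to the consumer, stopping
    // after maxTokens tokens have been emitted (an assumed policy).
    void tokenize(String text, Consumer<String> consumer) {
        int emitted = 0;
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) {
                continue;                    // leading whitespace yields an empty string
            }
            if (emitted >= maxTokens) {
                break;
            }
            consumer.accept(token);
            emitted++;
        }
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        new CappedTokenizerSketch(2).tokenize("alpha beta gamma delta", seen::add);
        System.out.println(seen);            // prints [alpha, beta]
    }
}
```

Streaming to a consumer avoids building an intermediate collection, so the caller decides whether to count, filter, or store each token.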
    • getMaxTokens

      public int getMaxTokens()
      Get the max token limit.