Class WordTokenizer

java.lang.Object
org.apache.tika.langdetect.charsoup.WordTokenizer

public class WordTokenizer extends Object
General-purpose word tokenizer that shares its preprocessing pipeline with CharSoupFeatureExtractor: NFC normalization, URL/email stripping, and case folding via Character.toLowerCase(int).
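
To make the normalization and case-folding steps concrete, here is a minimal sketch using only the JDK. The class name PreprocessSketch and the method normalizeAndFold are hypothetical and are not part of Tika; URL/email stripping and truncation are left out because their exact rules are not described here.

```java
import java.text.Normalizer;

// Hypothetical illustration, not the Tika implementation.
final class PreprocessSketch {

    // NFC-normalize the input, then lower-case it one code point at a time
    // with Character.toLowerCase(int), so supplementary characters are
    // handled as whole code points rather than as surrogate halves.
    static String normalizeAndFold(String raw) {
        String nfc = Normalizer.normalize(raw, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(nfc.length());
        nfc.codePoints()
           .map(Character::toLowerCase)
           .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

Under these assumptions, an input spelled with "e" followed by a combining acute accent comes out as the single precomposed lower-case character, so both spellings of "Café" produce the same token text.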

This tokenizer is designed to replace Lucene's analyzer pipeline in tika-eval. It handles both alphabetic and ideographic scripts:

  • Alphabetic scripts: accumulates letters into words, emits on word boundary (non-letter codepoint)
  • Ideographic characters: emits character bigrams (pairs of adjacent ideographic characters), equivalent to Lucene's CJKBigramFilter

Mixed runs (e.g., alphabetic followed by ideographic) are split at the script boundary: the alphabetic word is emitted at the boundary, then ideographic bigrams begin.

  • Method Details

    • tokenize

      public static List<String> tokenize(String rawText)
      Tokenize the given raw text with full preprocessing (truncate, strip URLs/emails, NFC normalize, case fold) and return tokens as a list. Only alphabetic and ideographic tokens are emitted (no numbers).
      Parameters:
      rawText - raw input text
      Returns:
      list of token strings (words for alphabetic, bigrams for ideographic)
    • tokenize

      public static void tokenize(String rawText, Consumer<String> consumer)
      Tokenize with full preprocessing, streaming tokens to a consumer. Only alphabetic and ideographic tokens are emitted (no numbers).
      Parameters:
      rawText - raw input text
      consumer - receives each token
    • tokenizeAlphanumeric

      public static void tokenizeAlphanumeric(String rawText, Consumer<String> consumer)
      Tokenize the given raw text with full preprocessing, including numeric tokens. Alphabetic words and digit-only runs are emitted as separate tokens. Ideographic text produces character bigrams as usual.
      Parameters:
      rawText - raw input text
      consumer - receives each token
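
The consumer-based overloads push each token to a callback instead of building a list. The sketch below illustrates that pattern together with the letter/digit split described for tokenizeAlphanumeric. It is an approximation under assumptions: the class AlphanumericSketch is hypothetical, preprocessing and ideographic bigrams are omitted, and a switch between letters and digits is assumed to end the current token (so "3x" becomes "3" and "x").

```java
import java.util.function.Consumer;

// Hypothetical illustration, not the Tika implementation.
final class AlphanumericSketch {

    private enum Kind { NONE, LETTER, DIGIT }

    // Streams letter runs and digit runs to the consumer as separate tokens.
    static void tokenize(String text, Consumer<String> consumer) {
        StringBuilder run = new StringBuilder();
        Kind current = Kind.NONE;

        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);

            Kind kind = Character.isLetter(cp) ? Kind.LETTER
                      : Character.isDigit(cp) ? Kind.DIGIT
                      : Kind.NONE;

            // Any change of kind closes the token collected so far.
            if (kind != current && run.length() > 0) {
                consumer.accept(run.toString());
                run.setLength(0);
            }
            if (kind != Kind.NONE) {
                run.appendCodePoint(cp);
            }
            current = kind;
        }

        if (run.length() > 0) {
            consumer.accept(run.toString());
        }
    }
}
```

A caller can pass any Consumer<String>, for example list::add to collect tokens, or token -> counts.merge(token, 1, Integer::sum) to build a frequency map without materializing an intermediate list. The same callbacks fit the documented tokenize(String, Consumer<String>) and tokenizeAlphanumeric(String, Consumer<String>) signatures.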