java.lang.Object

org.apache.tika.langdetect.charsoup.WordTokenizer

public class WordTokenizer extends Object

General-purpose word tokenizer that uses the same preprocessing pipeline as CharSoupFeatureExtractor: NFC normalization, URL/email stripping, and case folding via Character.toLowerCase(int).
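The preprocessing steps named above can be sketched with standard JDK calls. This is an illustration only, not the Tika source: the class name PreprocessSketch and both regular expressions are assumptions, and the truncation step listed under tokenize is left out because its limit is not stated on this page.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

/** Illustrative sketch only; not the Tika implementation. */
final class PreprocessSketch {

    // Hypothetical patterns: the actual URL/email rules are not documented here.
    private static final Pattern URL = Pattern.compile("https?://\\S+");
    private static final Pattern EMAIL =
            Pattern.compile("[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+");

    static String preprocess(String rawText) {
        // Replace URLs and emails with a space so neighbouring words stay separate.
        String s = URL.matcher(rawText).replaceAll(" ");
        s = EMAIL.matcher(s).replaceAll(" ");
        // NFC: compose a base letter and combining mark into one code point where possible.
        s = Normalizer.normalize(s, Normalizer.Form.NFC);
        // Case fold one code point at a time with Character.toLowerCase(int).
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().map(Character::toLowerCase).forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(preprocess("Mail bob@example.com or see https://example.org/docs NOW"));
    }
}
```

Folding by code point rather than by char keeps supplementary characters intact, which is presumably why the class documentation names the int overload.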

This tokenizer is designed to replace Lucene's analyzer pipeline in tika-eval. It handles both alphabetic and ideographic scripts:

- Alphabetic scripts: accumulates letters into words and emits a word at each word boundary (a non-letter codepoint).
- Ideographic characters: emits character bigrams (pairs of adjacent ideographic characters), equivalent to Lucene's CJKBigramFilter.

Mixed runs (e.g., alphabetic followed by ideographic) are handled correctly: the alphabetic word is emitted at the boundary, then ideographic bigrams begin.
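The behaviour described above can be approximated in a few lines. The following is a minimal sketch, not the WordTokenizer source: the class name TokenizeSketch is hypothetical, the input is assumed to be preprocessed already, Character.isIdeographic(int) is assumed as the test for ideographic characters, and a lone ideograph with no neighbour is simply dropped, which may differ from the real class.

```java
import java.util.function.Consumer;

/** Illustrative sketch only; not the Tika implementation. */
final class TokenizeSketch {

    static void tokenize(String text, Consumer<String> consumer) {
        StringBuilder word = new StringBuilder();
        int prevIdeograph = -1; // previous ideographic code point, or -1 if none
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isIdeographic(cp)) {
                // Boundary: emit any pending alphabetic word before bigrams begin.
                flush(word, consumer);
                if (prevIdeograph >= 0) {
                    consumer.accept(new StringBuilder()
                            .appendCodePoint(prevIdeograph)
                            .appendCodePoint(cp)
                            .toString());
                }
                prevIdeograph = cp;
            } else {
                prevIdeograph = -1; // any other code point ends the ideographic run
                if (Character.isLetter(cp)) {
                    word.appendCodePoint(cp);
                } else {
                    flush(word, consumer); // non-letter: word boundary
                }
            }
        }
        flush(word, consumer);
    }

    private static void flush(StringBuilder word, Consumer<String> consumer) {
        if (word.length() > 0) {
            consumer.accept(word.toString());
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        // "\u6771\u4eac\u90fd" is a run of three ideographs.
        tokenize("hello world abc\u6771\u4eac\u90fd", System.out::println);
    }
}
```

Ideographs are tested before letters because Character.isLetter(int) is also true for them; checking in the other order would fold them into alphabetic words.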

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| static List<String> | tokenize(String rawText) | Tokenize the given raw text with full preprocessing (truncate, strip URLs/emails, NFC normalize, case fold) and return tokens as a list. |
| static void | tokenize(String rawText, Consumer<String> consumer) | Tokenize with full preprocessing, streaming tokens to a consumer. |
| static void | tokenizeAlphanumeric(String rawText, Consumer<String> consumer) | Tokenize the given raw text with full preprocessing, including numeric tokens. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
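
The Consumer-based overloads in the summary above let a caller aggregate tokens without building an intermediate list. The sketch below counts token frequencies this way. TokenCountSketch and whitespaceTokenizer are hypothetical names: the stand-in tokenizer only splits on whitespace so that the example runs without Tika on the classpath, and in real use a method reference to the two-argument tokenize would presumably take its place.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

/** Illustrative sketch only; the tokenizer here is a stand-in. */
final class TokenCountSketch {

    /** Counts tokens streamed by any (text, consumer) style tokenizer. */
    static Map<String, Integer> countTokens(
            String rawText, BiConsumer<String, Consumer<String>> tokenizer) {
        Map<String, Integer> counts = new TreeMap<>();
        // Each token increments its own counter as it arrives.
        tokenizer.accept(rawText, token -> counts.merge(token, 1, Integer::sum));
        return counts;
    }

    /** Hypothetical stand-in with the same shape as tokenize(String, Consumer). */
    static void whitespaceTokenizer(String text, Consumer<String> consumer) {
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                consumer.accept(t);
            }
        }
    }

    public static void main(String[] args) {
        // prints {a=2, b=1}
        System.out.println(countTokens("a b a", TokenCountSketch::whitespaceTokenizer));
    }
}
```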

Method Details
- tokenize
  
  public static List<String> tokenize(String rawText)
  
  Tokenize the given raw text with full preprocessing (truncate, strip URLs/emails, NFC normalize, case fold) and return tokens as a list. Only alphabetic and ideographic tokens are emitted (no numbers).
  
  Parameters:
  
  rawText - raw input text
  
  Returns:
  
  list of token strings (words for alphabetic, bigrams for ideographic)
- tokenize
  
  public static void tokenize(String rawText, Consumer<String> consumer)
  
  Tokenize with full preprocessing, streaming tokens to a consumer. Only alphabetic and ideographic tokens are emitted (no numbers).
  
  Parameters:
  
  rawText - raw input text
  
  consumer - receives each token
- tokenizeAlphanumeric
  
  public static void tokenizeAlphanumeric(String rawText, Consumer<String> consumer)
  
  Tokenize the given raw text with full preprocessing, including numeric tokens. Alphabetic words and digit-only runs are emitted as separate tokens. Ideographic text produces character bigrams as usual.
  
  Parameters:
  
  rawText - raw input text
  
  consumer - receives each token
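
To make the difference between tokenize and tokenizeAlphanumeric concrete, here is a minimal sketch of letter runs and digit-only runs being emitted as separate tokens. It is not the Tika source: AlphanumericSketch is a hypothetical name, the ideographic bigram handling is omitted for brevity, and splitting "abc123" into two tokens at the letter/digit boundary is an assumption drawn from the description above.

```java
import java.util.function.Consumer;

/** Illustrative sketch only; not the Tika implementation. */
final class AlphanumericSketch {

    private enum Kind { NONE, LETTERS, DIGITS }

    static void tokenizeAlphanumeric(String text, Consumer<String> consumer) {
        StringBuilder run = new StringBuilder();
        Kind kind = Kind.NONE;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            Kind next = Character.isLetter(cp) ? Kind.LETTERS
                    : Character.isDigit(cp) ? Kind.DIGITS
                    : Kind.NONE;
            // A change of kind (letter, digit, other) closes the current run.
            if (next != kind && run.length() > 0) {
                consumer.accept(run.toString());
                run.setLength(0);
            }
            kind = next;
            if (next != Kind.NONE) {
                run.appendCodePoint(cp);
            }
        }
        if (run.length() > 0) {
            consumer.accept(run.toString());
        }
    }

    public static void main(String[] args) {
        // prints abc, 123, def, 42 on separate lines
        tokenizeAlphanumeric("abc123 def 42", System.out::println);
    }
}
```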
