Class WordTokenizer
java.lang.Object
org.apache.tika.langdetect.charsoup.WordTokenizer
General-purpose word tokenizer that shares the same preprocessing pipeline as CharSoupFeatureExtractor: NFC normalization, URL/email stripping, and case folding via Character.toLowerCase(int).
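The preprocessing steps named above can be sketched roughly as follows. This is a minimal illustration, not the Tika implementation: the class name PreprocessSketch, the method preprocess, and both regular expressions are hypothetical, and the truncation step mentioned in the method descriptions is omitted because its limit is not stated on this page.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical sketch of the described pipeline; not the actual Tika code.
final class PreprocessSketch {
    // Simplified patterns chosen for illustration; the real stripping rules are an assumption.
    private static final Pattern URL = Pattern.compile("(?i)\\b(?:https?|ftp)://\\S+");
    private static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+");

    static String preprocess(String raw) {
        // 1. Strip URLs and emails, leaving a space so neighbouring words do not fuse.
        String s = URL.matcher(raw).replaceAll(" ");
        s = EMAIL.matcher(s).replaceAll(" ");
        // 2. Normalize to NFC so composed and decomposed forms compare equal.
        s = Normalizer.normalize(s, Normalizer.Form.NFC);
        // 3. Case fold one code point at a time with Character.toLowerCase(int).
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().map(Character::toLowerCase).forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

Folding by code point rather than by char keeps supplementary-plane letters intact, which is presumably why the int overload of Character.toLowerCase is the one cited.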
This tokenizer is designed to replace Lucene's analyzer pipeline in tika-eval. It handles both alphabetic and ideographic scripts:
- Alphabetic scripts: accumulates letters into words, emits on word boundary (non-letter codepoint)
- Ideographic characters: emits character bigrams (pairs of adjacent ideographic characters), equivalent to Lucene's CJKBigramFilter
Mixed runs (e.g., alphabetic followed by ideographic) are handled correctly: the alphabetic word is emitted at the boundary, then ideographic bigrams begin.
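The word and bigram behavior described above can be illustrated with a small standalone sketch. It is written under stated assumptions and is not the WordTokenizer source: the class BigramTokenizerSketch is hypothetical, it uses Character.isIdeographic(int) to detect ideographs, and it emits nothing for a lone ideograph with no ideographic neighbor, which may differ from the real class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of word/bigram emission; not the Tika implementation.
final class BigramTokenizerSketch {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        int prevIdeograph = -1; // previous ideographic code point, or -1 if none
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isIdeographic(cp)) {
                // Script boundary: emit any pending alphabetic word first.
                flush(word, tokens);
                if (prevIdeograph != -1) {
                    // Emit the pair formed by the previous and current ideograph.
                    tokens.add(new StringBuilder()
                            .appendCodePoint(prevIdeograph)
                            .appendCodePoint(cp)
                            .toString());
                }
                prevIdeograph = cp;
            } else {
                prevIdeograph = -1;
                if (Character.isLetter(cp)) {
                    word.appendCodePoint(Character.toLowerCase(cp));
                } else {
                    // Any non-letter code point (digit, space, punctuation) ends the word.
                    flush(word, tokens);
                }
            }
        }
        flush(word, tokens);
        return tokens;
    }

    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }
}
```

In this sketch, an input such as "abc" followed by three ideographs yields the word "abc" and then two overlapping bigrams, matching the mixed-run behavior described above.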
Method Summary
- tokenize(String rawText): Tokenize the given raw text with full preprocessing (truncate, strip URLs/emails, NFC normalize, case fold) and return tokens as a list.
- static void tokenize(String rawText, Consumer<String> consumer): Tokenize with full preprocessing, streaming tokens to a consumer.
- static void tokenizeAlphanumeric(String rawText, Consumer<String> consumer): Tokenize the given raw text with full preprocessing, including numeric tokens.
Method Details
tokenize
Tokenize the given raw text with full preprocessing (truncate, strip URLs/emails, NFC normalize, case fold) and return tokens as a list. Only alphabetic and ideographic tokens are emitted (no numbers).
Parameters:
rawText - raw input text
Returns:
list of token strings (words for alphabetic, bigrams for ideographic)
tokenize
Tokenize with full preprocessing, streaming tokens to a consumer. Only alphabetic and ideographic tokens are emitted (no numbers).
Parameters:
rawText - raw input text
consumer - receives each token
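A consumer-based method like this lets callers aggregate tokens without building an intermediate list. The sketch below counts token frequencies through a callback. The class TokenCountSketch, the method countTokens, and the stand-in whitespaceTokenizer are all hypothetical; the stand-in only mimics the (rawText, consumer) shape so that the example runs without Tika on the classpath.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

// Hypothetical usage sketch: counting tokens delivered through a Consumer<String>.
final class TokenCountSketch {
    // Accepts any tokenizer with the (rawText, consumer) shape described above.
    static Map<String, Integer> countTokens(String rawText,
                                            BiConsumer<String, Consumer<String>> tokenizer) {
        Map<String, Integer> counts = new TreeMap<>();
        // Each streamed token increments its count; no token list is materialized.
        tokenizer.accept(rawText, token -> counts.merge(token, 1, Integer::sum));
        return counts;
    }

    // Stand-in tokenizer (an assumption for this example, NOT the Tika behavior):
    // lowercases the text and splits on whitespace.
    static void whitespaceTokenizer(String rawText, Consumer<String> consumer) {
        for (String t : rawText.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!t.isEmpty()) {
                consumer.accept(t);
            }
        }
    }
}
```

Assuming the signature listed in the method summary, passing a method reference to the two-argument tokenize in place of the stand-in should work the same way.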
tokenizeAlphanumeric
Tokenize the given raw text with full preprocessing, including numeric tokens. Alphabetic words and digit-only runs are emitted as separate tokens. Ideographic text produces character bigrams as usual.
Parameters:
rawText - raw input text
consumer - receives each token
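The separation of alphabetic words from digit-only runs can be sketched as below. This is an illustration under assumptions rather than the Tika source: the class AlphanumericSketch is hypothetical, it classifies code points with Character.isLetter and Character.isDigit, and it leaves out ideographic bigram handling for brevity, so ideographs are treated as ordinary letters here.

```java
import java.util.function.Consumer;

// Hypothetical sketch of letter-run / digit-run splitting; not the Tika implementation.
final class AlphanumericSketch {
    private enum Kind { NONE, LETTER, DIGIT }

    static void tokenizeAlphanumeric(String text, Consumer<String> consumer) {
        StringBuilder run = new StringBuilder();
        Kind current = Kind.NONE;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            Kind kind = Character.isLetter(cp) ? Kind.LETTER
                    : Character.isDigit(cp) ? Kind.DIGIT
                    : Kind.NONE;
            if (kind != current) {
                // A change of kind (letter, digit, other) closes the pending run.
                if (run.length() > 0) {
                    consumer.accept(run.toString());
                    run.setLength(0);
                }
                current = kind;
            }
            if (kind != Kind.NONE) {
                run.appendCodePoint(Character.toLowerCase(cp));
            }
        }
        if (run.length() > 0) {
            consumer.accept(run.toString());
        }
    }
}
```

With this sketch, a string such as "Room101" is delivered to the consumer as two tokens, the word and the digit run, rather than one mixed token.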