Package org.apache.tika.eval.core.tokens
Class TikaEvalTokenizer
java.lang.Object
org.apache.tika.eval.core.tokens.TikaEvalTokenizer
Tokenizer for tika-eval text analysis. Provides two modes:
- TikaEvalTokenizer.Mode.STANDARD: for general token counting. Emits all alphabetic, ideographic, and numeric tokens with no minimum length and no skip list. Used by AnalyzerManager for NUM_TOKENS/NUM_UNIQUE_TOKENS.
- TikaEvalTokenizer.Mode.COMMON_TOKENS: for building and querying common-token frequency lists. Alphabetic only (no numbers), minimum 3 characters, common HTML markup terms excluded. Used by CommonTokenCountManager and the common token generator.
Both modes share the same preprocessing pipeline:
- URL/email stripping and truncation via CharSoupFeatureExtractor.preprocess(String)
- NFKD normalization for accent-insensitive matching (combining marks are dropped by CharSoupFeatureExtractor.isTransparent(int))
- Case folding via Character.toLowerCase(int)
- CJK character bigrams (no unigrams)
This class is intentionally separate from WordTokenizer to avoid parameterization in the language-detection hot path.
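The shared normalization steps can be sketched with the JDK alone. This is a minimal illustration, not the Tika implementation: the class PipelineSketch and its methods fold and cjkBigrams are hypothetical names, combining marks are detected here by Unicode general category rather than by CharSoupFeatureExtractor.isTransparent(int), and URL/email stripping and truncation are left out.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the normalization steps; not the Tika code.
final class PipelineSketch {

    // NFKD-decompose, drop combining marks, then lower-case each code point.
    static String fold(String raw) {
        String decomposed = Normalizer.normalize(raw, Normalizer.Form.NFKD);
        StringBuilder sb = new StringBuilder(decomposed.length());
        decomposed.codePoints().forEach(cp -> {
            int type = Character.getType(cp);
            // Assumption: "combining mark" means general category Mn, Mc, or Me.
            boolean combining = type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK;
            if (!combining) {
                sb.appendCodePoint(Character.toLowerCase(cp));
            }
        });
        return sb.toString();
    }

    // Emit overlapping bigrams for a run of CJK characters; no unigrams,
    // so in this sketch a single-character run yields nothing.
    static List<String> cjkBigrams(String run) {
        int[] cps = run.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < cps.length; i++) {
            out.add(new String(cps, i, 2));
        }
        return out;
    }

    public static void main(String[] args) {
        // "Caf\u00e9 R\u00c9SUM\u00c9" is "Café RÉSUMÉ" written with escapes.
        System.out.println(fold("Caf\u00e9 R\u00c9SUM\u00c9")); // prints "cafe resume"
        // Three ideographs produce two overlapping bigrams.
        System.out.println(cjkBigrams("\u6771\u4eac\u90fd").size()); // prints 2
    }
}
```

Decomposing before dropping marks is what makes the matching accent-insensitive: the accented letter splits into a base letter plus a combining mark, and only the base letter survives.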
-
Nested Class Summary
Nested Classes -
Method Summary
Modifier and Type, Method, Description:
- tokenize(String rawText): Tokenize in TikaEvalTokenizer.Mode.COMMON_TOKENS mode and return tokens as a list.
- static void tokenize(String rawText, Consumer<String> consumer): Tokenize in TikaEvalTokenizer.Mode.COMMON_TOKENS mode, streaming tokens to a consumer.
- tokenize(String rawText, TikaEvalTokenizer.Mode mode): Tokenize in the specified mode and return tokens as a list.
- static void tokenize(String rawText, TikaEvalTokenizer.Mode mode, int maxTokens, Consumer<String> consumer): Tokenize in the specified mode, streaming at most maxTokens tokens to a consumer.
- static void tokenize(String rawText, TikaEvalTokenizer.Mode mode, Consumer<String> consumer): Tokenize in the specified mode, streaming tokens to a consumer.
-
Method Details
-
tokenize
Tokenize in TikaEvalTokenizer.Mode.COMMON_TOKENS mode and return tokens as a list.
Parameters:
- rawText - raw input text
Returns:
- filtered token list
-
tokenize
Tokenize in the specified mode and return tokens as a list.
Parameters:
- rawText - raw input text
- mode - tokenization mode
Returns:
- token list
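How the mode changes which tokens survive can be sketched as a filter over already-split tokens. This is an illustration under stated assumptions, not the library's code: ModeFilterSketch, its local Mode enum, and the SKIP set are hypothetical, the three skip terms are placeholders because this page does not list the excluded HTML markup terms, and "alphabetic" is approximated with Character.isLetter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of per-mode filtering; not the Tika code.
final class ModeFilterSketch {

    // Local stand-in for TikaEvalTokenizer.Mode.
    enum Mode { STANDARD, COMMON_TOKENS }

    // Placeholder skip list (assumption: the real list is not shown here).
    static final Set<String> SKIP = Set.of("div", "span", "href");

    static boolean keep(String token, Mode mode) {
        if (token.isEmpty()) {
            return false;
        }
        if (mode == Mode.STANDARD) {
            // No minimum length, no skip list, numbers allowed.
            return true;
        }
        // COMMON_TOKENS: letters only, at least 3 characters, not in the skip list.
        boolean lettersOnly = token.codePoints().allMatch(Character::isLetter);
        int length = token.codePointCount(0, token.length());
        return lettersOnly && length >= 3 && !SKIP.contains(token);
    }

    static List<String> filter(List<String> tokens, Mode mode) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (keep(t, mode)) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("the", "of", "42", "div", "token");
        System.out.println(filter(tokens, Mode.STANDARD));      // prints [the, of, 42, div, token]
        System.out.println(filter(tokens, Mode.COMMON_TOKENS)); // prints [the, token]
    }
}
```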
-
tokenize
Tokenize in TikaEvalTokenizer.Mode.COMMON_TOKENS mode, streaming tokens to a consumer.
Parameters:
- rawText - raw input text
- consumer - receives each token
-
tokenize
Tokenize in the specified mode, streaming tokens to a consumer.
Parameters:
- rawText - raw input text
- mode - tokenization mode
- consumer - receives each token
-
tokenize
public static void tokenize(String rawText, TikaEvalTokenizer.Mode mode, int maxTokens, Consumer<String> consumer)
Tokenize in the specified mode, streaming at most maxTokens tokens to a consumer. Iteration stops as soon as the limit is reached, so no work is spent on the remainder of the string.
Parameters:
- rawText - raw input text
- mode - tokenization mode
- maxTokens - maximum number of tokens to emit; use Integer.MAX_VALUE for no limit
- consumer - receives each token
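The early-stop behaviour can be illustrated with a standalone sketch. It is not the Tika implementation: BoundedStreamSketch and stream are hypothetical names, and plain whitespace splitting stands in for the real tokenization. The method returns the index where scanning stopped, so it is visible that the rest of the input is never examined once the limit is reached.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of bounded streaming; not the Tika code.
final class BoundedStreamSketch {

    // Streams at most maxTokens whitespace-delimited tokens to the consumer
    // and returns the character index at which scanning stopped.
    static int stream(String text, int maxTokens, Consumer<String> consumer) {
        int emitted = 0;
        int i = 0;
        int n = text.length();
        while (i < n && emitted < maxTokens) {
            // Skip leading whitespace.
            while (i < n && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume one token.
            while (i < n && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                consumer.accept(text.substring(start, i));
                emitted++;
            }
        }
        return i;
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();
        int stoppedAt = stream("a b c d e", 2, tokens::add);
        System.out.println(tokens);    // prints [a, b]
        System.out.println(stoppedAt); // prints 3
    }
}
```

Passing Integer.MAX_VALUE as the limit makes the loop run to the end of the input, which matches the "no limit" convention described for maxTokens.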
-