Package org.apache.tika.langdetect.charsoup
package org.apache.tika.langdetect.charsoup
Class descriptions:

Extracts character n-gram features from text using the hashing trick (FNV-1a).
CharSoup language detector using INT8-quantized multinomial logistic regression trained on Wikipedia (primary corpus) with MADLAD supplements for thin languages.
A MetadataFilter that runs CharSoup language detection on the extracted text content and writes the detected language and confidence into the metadata.
INT8-quantized multinomial logistic regression model for language detection.
Loads the shared confusable language groups from confusables.txt on the classpath.
Common interface for feature extractors used by the bigram language detector.
Feature extractor using positional salt (BOW/EOW/FULL_WORD) instead of sentinel characters in n-grams.
Production feature extractor for the CharSoup language detection model.
Coarse Unicode script categories for language detection.
Production feature extractor for the CharSoup short-text language detection model.
General-purpose word tokenizer that shares the same preprocessing pipeline as CharSoupFeatureExtractor: NFC normalization, URL/email stripping, case folding via Character.toLowerCase(int).
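
The descriptions above name two techniques: a preprocessing pipeline (NFC normalization, URL/email stripping, case folding via Character.toLowerCase(int)) and character n-gram features hashed with FNV-1a. The following is a minimal sketch of how those steps could fit together. It is not the package's actual implementation: the class name HashedNgramSketch, the simplified URL and email patterns, the n-gram length, and the bucket count are all assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of org.apache.tika.langdetect.charsoup.
final class HashedNgramSketch {

    // Standard 32-bit FNV-1a constants.
    private static final int FNV_OFFSET_BASIS = 0x811C9DC5;
    private static final int FNV_PRIME = 0x01000193;

    // Deliberately simplified patterns (assumptions, not the library's rules).
    private static final Pattern URL = Pattern.compile("https?://\\S+");
    private static final Pattern EMAIL = Pattern.compile("\\S+@\\S+\\.\\S+");

    // NFC normalization, URL/email stripping, then per-code-point case folding.
    static String preprocess(String text) {
        String s = Normalizer.normalize(text, Normalizer.Form.NFC);
        s = URL.matcher(s).replaceAll(" ");
        s = EMAIL.matcher(s).replaceAll(" ");
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().map(Character::toLowerCase).forEach(sb::appendCodePoint);
        return sb.toString();
    }

    // 32-bit FNV-1a: XOR each byte into the hash, then multiply by the prime.
    static int fnv1a(byte[] data) {
        int h = FNV_OFFSET_BASIS;
        for (byte b : data) {
            h ^= (b & 0xFF);
            h *= FNV_PRIME;
        }
        return h;
    }

    // Hashing trick: each character n-gram increments one of numBuckets counters,
    // so the feature vector has a fixed size regardless of vocabulary.
    static int[] extract(String text, int n, int numBuckets) {
        int[] counts = new int[numBuckets];
        int[] cps = preprocess(text).codePoints().toArray();
        for (int i = 0; i + n <= cps.length; i++) {
            String gram = new String(cps, i, n);
            int h = fnv1a(gram.getBytes(StandardCharsets.UTF_8));
            counts[Math.floorMod(h, numBuckets)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Bucket count of 16 is arbitrary; a real model would use far more.
        int[] features = extract("Bonjour le monde", 2, 16);
        System.out.println(java.util.Arrays.toString(features));
    }
}
```

Hashing into a fixed number of buckets trades occasional collisions between distinct n-grams for a bounded feature vector, which is what lets a compact linear model be stored and evaluated cheaply.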