org.apache.tika.langdetect.charsoup (Apache Tika 4.0.0-alpha-1 API)

package org.apache.tika.langdetect.charsoup

Related Packages

Package

Description

org.apache.tika.langdetect

org.apache.tika.langdetect.lingo24

org.apache.tika.langdetect.mitll

org.apache.tika.langdetect.opennlp

org.apache.tika.langdetect.optimaize

Class

Description

CharSoupFeatureExtractor

Extracts character n-gram features from text using the hashing trick (FNV-1a).

CharSoupLanguageDetector

CharSoup language detector that uses an INT8-quantized multinomial logistic regression model trained primarily on Wikipedia, with MADLAD data added for languages that are thinly covered there.

CharSoupMetadataFilter

A MetadataFilter that runs CharSoup language detection on the extracted text content and writes the detected language and confidence into the metadata.

CharSoupModel

INT8-quantized multinomial logistic regression model for language detection.

ConfusableGroups

Loads the shared confusable language groups from confusables.txt on the classpath.

FeatureExtractor

Common interface for feature extractors used by the bigram language detector.

SaltedNgramFeatureExtractor

Feature extractor that encodes n-gram position with a positional salt (BOW/EOW/FULL_WORD) instead of adding sentinel characters to the n-grams.

ScriptAwareFeatureExtractor

Production feature extractor for the CharSoup language detection model.

ScriptCategory

Coarse Unicode script categories for language detection.

ShortTextFeatureExtractor

Production feature extractor for the CharSoup short-text language detection model.

WordTokenizer

General-purpose word tokenizer that uses the same preprocessing pipeline as CharSoupFeatureExtractor: NFC normalization, URL/email stripping, and case folding via Character.toLowerCase(int).
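
The preprocessing steps and the hashing trick named in the class summaries above can be illustrated with a minimal sketch that uses only the JDK. It does not reproduce the Tika classes: the class name HashedNgramSketch, the URL/email pattern, the byte order fed to the hash, and the bucket count are all assumptions made for illustration. Only the 32-bit FNV-1a constants and the java.text.Normalizer and Character.toLowerCase(int) calls are standard.

```java
import java.text.Normalizer;

// Hypothetical illustration; not the CharSoupFeatureExtractor implementation.
class HashedNgramSketch {
    // Standard 32-bit FNV-1a offset basis and prime.
    private static final int FNV_OFFSET_BASIS = 0x811C9DC5;
    private static final int FNV_PRIME = 0x01000193;

    // Assumed pattern; the real stripping rules are not given in this summary.
    private static final String URL_OR_EMAIL = "(https?://\\S+)|(\\S+@\\S+\\.\\S+)";

    // NFC normalization, URL/email stripping, then per-code-point lowercasing.
    static String preprocess(String text) {
        String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
        String stripped = nfc.replaceAll(URL_OR_EMAIL, " ");
        StringBuilder sb = new StringBuilder(stripped.length());
        stripped.codePoints().forEach(cp -> sb.appendCodePoint(Character.toLowerCase(cp)));
        return sb.toString();
    }

    // FNV-1a over the four bytes of each code point, low byte first (an assumed encoding).
    static int fnv1a(int[] codePoints, int start, int len) {
        int h = FNV_OFFSET_BASIS;
        for (int i = start; i < start + len; i++) {
            int cp = codePoints[i];
            for (int shift = 0; shift < 32; shift += 8) {
                h ^= (cp >>> shift) & 0xFF;
                h *= FNV_PRIME;
            }
        }
        return h;
    }

    // Hashing trick: each character n-gram increments one of numBuckets counters.
    static int[] extract(String text, int n, int numBuckets) {
        int[] cps = preprocess(text).codePoints().toArray();
        int[] counts = new int[numBuckets];
        for (int i = 0; i + n <= cps.length; i++) {
            int bucket = Math.floorMod(fnv1a(cps, i, n), numBuckets);
            counts[bucket]++;
        }
        return counts;
    }
}
```

Calling HashedNgramSketch.extract("Hello World", 2, 1024) gives the same vector as the lowercase input, because case folding runs before hashing. The fixed bucket count keeps the feature vector the same size regardless of how many distinct n-grams occur, at the cost of occasional collisions.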

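
As a rough illustration of what inference with an INT8-quantized multinomial logistic regression model involves, the following sketch scores a feature-count vector against byte-valued weights and applies a softmax. The class name QuantizedSoftmaxSketch, the per-class scale factors, and the array layout are assumptions chosen for clarity. The actual storage format and dequantization scheme of CharSoupModel are not described in this summary.

```java
// Hypothetical illustration; not the CharSoupModel implementation.
class QuantizedSoftmaxSketch {

    // weights[c][f] holds INT8 weights, scales[c] is an assumed per-class
    // dequantization factor, and biases[c] is a float bias per class.
    static float[] predict(byte[][] weights, float[] scales, float[] biases, int[] features) {
        int numClasses = weights.length;
        float[] logits = new float[numClasses];
        for (int c = 0; c < numClasses; c++) {
            long acc = 0; // integer accumulation, dequantized once per class
            for (int f = 0; f < features.length; f++) {
                acc += (long) weights[c][f] * features[f];
            }
            logits[c] = scales[c] * acc + biases[c];
        }
        return softmax(logits);
    }

    // Numerically stable softmax: subtract the maximum logit before exponentiating.
    static float[] softmax(float[] logits) {
        float max = Float.NEGATIVE_INFINITY;
        for (float v : logits) {
            max = Math.max(max, v);
        }
        double[] exps = new double[logits.length];
        double sum = 0.0;
        for (int i = 0; i < logits.length; i++) {
            exps[i] = Math.exp(logits[i] - max);
            sum += exps[i];
        }
        float[] probs = new float[logits.length];
        for (int i = 0; i < logits.length; i++) {
            probs[i] = (float) (exps[i] / sum);
        }
        return probs;
    }
}
```

Storing weights as bytes cuts the model to roughly a quarter of its float32 size, and the inner loop uses only integer arithmetic. The largest probability in the returned array could serve as a confidence value of the kind CharSoupMetadataFilter writes to the metadata, though how Tika derives that value is not stated here.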