Package org.apache.tika.ml.chardetect
Class descriptions:

Charset relationships used for lenient evaluation of charset detectors.
Cheap byte-wise decode-equivalence check for single-byte charsets.
Byte-level HTML tag stripper used as a preprocessing step for charset detection.
Result of a strip operation: new content length and the number of well-formed tags (including comments) successfully parsed.
Naive-Bayes pipeline detector: structural checks for wide Unicode and BOMs before falling through to the bigram NB classifier for everything else.
Naive-Bayes byte-bigram charset classifier.
Pooled candidate from LogLinearCombiner: label, raw summed score (larger is better, not normalized), and the specialists that contributed.
Raw per-class logits from a single MoE specialist.
SPI contract for an MoE charset-detection specialist.
Fast, rule-based encoding checks that run before the statistical model.
Outcome of the UTF-8 structural check.
Feature extractor for the UTF-16 specialist of the mixture-of-experts charset detector.
UTF-16 specialist detector of the mixture-of-experts charset detection architecture.
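
The rule-based checks and the UTF-8 structural outcome described above can be illustrated with a small standalone sketch. This is not the package's implementation: the class name Utf8StructuralCheckSketch, the Outcome enum, and its three values are hypothetical and were chosen only for this example. The sketch assumes that a "structural check" means validating UTF-8 byte sequences (lead byte, continuation bytes, no overlong forms, no surrogates) and reporting whether any multi-byte sequence was seen.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: a UTF-8 structural check with hypothetical names. */
public class Utf8StructuralCheckSketch {

    /** Hypothetical outcome type; the real package's result type may differ. */
    public enum Outcome { ASCII_ONLY, VALID_UTF8, NOT_UTF8 }

    public static Outcome check(byte[] data) {
        boolean sawMultiByte = false;
        int i = 0;
        while (i < data.length) {
            int b = data[i] & 0xFF;
            if (b < 0x80) {          // plain ASCII byte
                i++;
                continue;
            }
            int extra;               // number of continuation bytes expected
            int cp;                  // code point being assembled
            int min;                 // smallest code point allowed for this length
            if (b >= 0xC2 && b <= 0xDF) {
                extra = 1; cp = b & 0x1F; min = 0x80;
            } else if (b >= 0xE0 && b <= 0xEF) {
                extra = 2; cp = b & 0x0F; min = 0x800;
            } else if (b >= 0xF0 && b <= 0xF4) {
                extra = 3; cp = b & 0x07; min = 0x10000;
            } else {
                // 0x80..0xC1 and 0xF5..0xFF can never start a sequence.
                return Outcome.NOT_UTF8;
            }
            // This sketch is strict: a sequence cut off at the end of the
            // buffer is rejected. A detector reading a truncated probe
            // might reasonably choose to tolerate that case instead.
            if (i + extra >= data.length) {
                return Outcome.NOT_UTF8;
            }
            for (int k = 1; k <= extra; k++) {
                int c = data[i + k] & 0xFF;
                if ((c & 0xC0) != 0x80) {   // continuation bytes are 10xxxxxx
                    return Outcome.NOT_UTF8;
                }
                cp = (cp << 6) | (c & 0x3F);
            }
            boolean overlong = cp < min;
            boolean surrogate = cp >= 0xD800 && cp <= 0xDFFF;
            if (overlong || surrogate || cp > 0x10FFFF) {
                return Outcome.NOT_UTF8;
            }
            sawMultiByte = true;
            i += extra + 1;
        }
        return sawMultiByte ? Outcome.VALID_UTF8 : Outcome.ASCII_ONLY;
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo";
        System.out.println(check("hello".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(check(text.getBytes(StandardCharsets.UTF_8)));
        System.out.println(check(text.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

In a pipeline of the kind described above, a NOT_UTF8 or ASCII_ONLY result would presumably be the point at which detection falls through to the statistical classifier, since pure ASCII gives no evidence for one encoding over another; that routing is an assumption of this sketch, not something the summary states.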