Package org.apache.tika.quality
Interface TextQualityDetector
- All Known Implementing Classes:
JunkDetector
public interface TextQualityDetector
Scores a string for text quality and arbitrates between two candidate strings.
Implementations are expected to be immutable and thread-safe after construction.
Implementations are registered via the standard Java ServiceLoader
mechanism: place the fully-qualified class name in
META-INF/services/org.apache.tika.quality.TextQualityDetector.
Typical usage:
TextQualityDetector detector = ServiceLoader.load(TextQualityDetector.class)
        .findFirst().orElseThrow();

// Score a string
TextQualityScore score = detector.score(text);
if (score.getZScore() < -2.0) {
    // flag or re-process
}

// Arbitrate between two charset decodings
TextQualityComparison cmp = detector.compare("cp1252", decodedAsCp1252,
        "cp1251", decodedAsCp1251);
String winner = cmp.winner(); // "A" or "B"
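The typical usage calls orElseThrow(), which throws NoSuchElementException when no provider is registered. The following is a minimal sketch of a lookup that treats a missing provider as an optional condition instead. QualityService is a hypothetical stand-in so that the example compiles without Tika on the classpath; with Tika available you would presumably pass TextQualityDetector.class.

```java
import java.util.Optional;
import java.util.ServiceLoader;

public class DetectorLookup {

    /** Hypothetical stand-in for a service interface such as TextQualityDetector. */
    public interface QualityService {
        double score(String text);
    }

    /** Returns the first registered provider, or empty if none is found. */
    public static <T> Optional<T> firstProvider(Class<T> service) {
        return ServiceLoader.load(service).findFirst();
    }

    public static void main(String[] args) {
        Optional<QualityService> detector = firstProvider(QualityService.class);
        if (detector.isPresent()) {
            System.out.println("score: " + detector.get().score("hello"));
        } else {
            // No META-INF/services entry names an implementation.
            System.out.println("no provider registered; skipping quality check");
        }
    }
}
```

Whether a missing detector should be fatal or merely skipped depends on the application; the sketch only shows that the choice is available.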
Method Summary
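Modifier and Type | Method | Description
TextQualityComparison | compare(labelA, candidateA, labelB, candidateB) | Compares two candidate strings and returns which is higher-quality (cleaner text).
TextQualityScore | score(text) | Scores the given string for text quality.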
Method Details
score
Scores the given string for text quality.
- Parameters:
text - the string to score; must not be null
- Returns:
a TextQualityScore; check TextQualityScore.isUnknown() if the input is empty or the script is not covered by the model
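One way to combine the isUnknown() check with the z-score threshold from the usage example is sketched below. The helper takes the two values as plain arguments so that it runs without Tika; the Action enum, the triage method, and the decision to test isUnknown() before reading the z-score are all assumptions made for illustration, not documented behavior.

```java
public class ScoreTriage {

    /** Hypothetical outcomes of inspecting a score. */
    public enum Action { ACCEPT, REPROCESS, NO_VERDICT }

    // Threshold taken from the usage example; tune it for your own data.
    static final double Z_THRESHOLD = -2.0;

    /**
     * @param unknown the value of TextQualityScore.isUnknown()
     * @param zScore  the value of TextQualityScore.getZScore()
     */
    public static Action triage(boolean unknown, double zScore) {
        if (unknown) {
            // Empty input or uncovered script: the model offers no opinion.
            return Action.NO_VERDICT;
        }
        return zScore < Z_THRESHOLD ? Action.REPROCESS : Action.ACCEPT;
    }

    public static void main(String[] args) {
        System.out.println(triage(false, -3.1));
        System.out.println(triage(false, 0.4));
        System.out.println(triage(true, 0.0));
    }
}
```

With a real detector the call would look something like triage(score.isUnknown(), score.getZScore()).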
compare
Compares two candidate strings and returns which is higher-quality (cleaner text).
A common use case is charset-decoding arbitration: given raw bytes decoded via two different charsets, pass each decoded string here with a human-readable label (e.g. the charset name) and the detector will pick the one that looks more like natural language.
- Parameters:
labelA - human-readable label for candidate A (e.g. "cp1252")
candidateA - first candidate string
labelB - human-readable label for candidate B (e.g. "cp1251")
candidateB - second candidate string
- Returns:
a TextQualityComparison with the winning label and confidence delta
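To illustrate where the two candidates come from in the charset-arbitration case, the sketch below decodes the same raw bytes under two charsets using only the JDK. The class and method names are hypothetical, the sample bytes are the Russian word for "hello" encoded in windows-1251, and the final compare call is left as a comment because it requires a Tika detector instance.

```java
import java.nio.charset.Charset;

public class CandidateDecoding {

    /** Decodes raw bytes under the named charset. */
    public static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Six bytes that spell a Russian greeting in windows-1251.
        byte[] raw = {(byte) 0xCF, (byte) 0xF0, (byte) 0xE8,
                      (byte) 0xE2, (byte) 0xE5, (byte) 0xF2};

        String decodedAsCp1252 = decode(raw, "windows-1252"); // accented Latin letters
        String decodedAsCp1251 = decode(raw, "windows-1251"); // Cyrillic letters

        System.out.println("cp1252: " + decodedAsCp1252);
        System.out.println("cp1251: " + decodedAsCp1251);

        // With a detector available, the two strings would be passed on as:
        // detector.compare("cp1252", decodedAsCp1252, "cp1251", decodedAsCp1251);
    }
}
```

Both decodings succeed without error, which is why a quality comparison is needed: the bytes alone do not say which interpretation is the intended one.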