Interface TextQualityDetector

All Known Implementing Classes:
JunkDetector

public interface TextQualityDetector
Scores a string for text quality and arbitrates between two candidate strings.

Implementations are expected to be immutable and thread-safe after construction.

Implementations are registered via the standard Java ServiceLoader mechanism: place the fully-qualified class name in META-INF/services/org.apache.tika.quality.TextQualityDetector.
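
The lookup side of this registration can be sketched as follows. This is a minimal illustration, not Tika code: SketchDetector is a hypothetical stand-in for the real interface so that the snippet compiles on its own, and com.example.MyDetector in the comment is an invented class name.

```java
import java.util.Optional;
import java.util.ServiceLoader;

// Hypothetical stand-in for org.apache.tika.quality.TextQualityDetector,
// declared here only so the sketch compiles without Tika on the classpath.
interface SketchDetector {
}

class DetectorLookup {

    // Returns the first provider that ServiceLoader can find for the given
    // service type, or an empty Optional when none is registered.
    static <T> Optional<T> firstProvider(Class<T> service) {
        return ServiceLoader.load(service).findFirst();
    }

    public static void main(String[] args) {
        // A provider is registered by a classpath resource named
        //   META-INF/services/<fully-qualified name of the service interface>
        // containing one implementation class name per line, for example
        //   com.example.MyDetector   (hypothetical class name)
        Optional<SketchDetector> detector = firstProvider(SketchDetector.class);
        System.out.println(detector.isPresent()
                ? "found " + detector.get().getClass().getName()
                : "no provider registered");
    }
}
```

Returning an Optional lets the caller decide what to do when no implementation is on the classpath, instead of failing inside the lookup.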

Typical usage:


 TextQualityDetector detector = ServiceLoader.load(TextQualityDetector.class)
         .findFirst().orElseThrow();

 // Score a string
 TextQualityScore score = detector.score(text);
 if (score.getZScore() < -2.0) { /* flag or re-process */ }

 // Arbitrate between two charset decodings
 TextQualityComparison cmp = detector.compare("cp1252", decodedAsCp1252,
                                               "cp1251", decodedAsCp1251);
 String winner = cmp.winner();  // "A" or "B"
 
  • Method Details

    • score

      TextQualityScore score(String text)
      Scores the given string for text quality.
      Parameters:
      text - the string to score; must not be null
      Returns:
      a TextQualityScore; check TextQualityScore.isUnknown() before using it, because the score is unknown if the input is empty or the script is not covered by the model
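
      One way a caller might combine the isUnknown() check with a z-score threshold is sketched below. SketchScore and SketchScorer are hypothetical stand-ins that mirror only the methods named on this page, the Verdict enum and the fixed(...) stub are invented for illustration, and the -2.0 threshold is taken from the usage example above rather than from any documented default.

```java
import java.util.Objects;

// Hypothetical stand-ins that mirror only the methods named on this page.
interface SketchScore {
    double getZScore();
    boolean isUnknown();
}

interface SketchScorer {
    SketchScore score(String text);
}

class ScoreTriage {

    enum Verdict { UNKNOWN, SUSPECT, OK }

    // Checks isUnknown() first, then compares the z-score to the threshold.
    static Verdict triage(SketchScorer detector, String text, double threshold) {
        Objects.requireNonNull(text, "text must not be null");
        SketchScore s = detector.score(text);
        if (s.isUnknown()) {
            return Verdict.UNKNOWN; // empty input or script not covered
        }
        return s.getZScore() < threshold ? Verdict.SUSPECT : Verdict.OK;
    }

    // Stub scorer that returns the same canned score for every input;
    // it exists only to exercise triage() without a real detector.
    static SketchScorer fixed(double z, boolean unknown) {
        return text -> new SketchScore() {
            @Override public double getZScore() { return z; }
            @Override public boolean isUnknown() { return unknown; }
        };
    }

    public static void main(String[] args) {
        System.out.println(triage(fixed(-3.1, false), "some text", -2.0)); // SUSPECT
        System.out.println(triage(fixed(0.4, false), "some text", -2.0));  // OK
        System.out.println(triage(fixed(0.0, true), "", -2.0));            // UNKNOWN
    }
}
```

      Treating the unknown case as its own outcome keeps a caller from acting on a z-score that carries no information.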
    • compare

      TextQualityComparison compare(String labelA, String candidateA, String labelB, String candidateB)
      Compares two candidate strings and reports which one is of higher quality (cleaner text).

      A common use case is charset-decoding arbitration: given raw bytes decoded via two different charsets, pass each decoded string here with a human-readable label (e.g. the charset name) and the detector will pick the one that looks more like natural language.

      Parameters:
      labelA - human-readable label for candidate A (e.g. "cp1252")
      candidateA - first candidate string
      labelB - human-readable label for candidate B (e.g. "cp1251")
      candidateB - second candidate string
      Returns:
      a TextQualityComparison with the winning label and confidence delta
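
      The charset-arbitration flow described above can be sketched as follows. SketchComparison and SketchComparer are hypothetical stand-ins for the real types, and TOY is a deliberately naive detector (it only counts U+FFFD replacement characters) that says nothing about how an actual implementation scores text. UTF-8 and ISO-8859-1 are used instead of cp1252 and cp1251 only because StandardCharsets guarantees them on every JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-ins that mirror only the methods named on this page.
interface SketchComparison {
    String winner();
}

interface SketchComparer {
    SketchComparison compare(String labelA, String candidateA,
                             String labelB, String candidateB);
}

class CharsetArbitration {

    // Decodes the same bytes with both charsets and hands the two results
    // to the detector, using each charset's name as the label.
    static SketchComparison arbitrate(SketchComparer detector, byte[] raw,
                                      Charset a, Charset b) {
        return detector.compare(a.name(), new String(raw, a),
                                b.name(), new String(raw, b));
    }

    // Counts U+FFFD, which the String constructor substitutes for malformed input.
    static long replacementCount(String s) {
        return s.chars().filter(c -> c == '\uFFFD').count();
    }

    // Toy detector (an assumption, not the real algorithm): the candidate
    // with fewer replacement characters wins, and ties go to A. The "A"/"B"
    // return values follow the comment in the usage example.
    static final SketchComparer TOY = (labelA, candA, labelB, candB) -> {
        String w = replacementCount(candA) <= replacementCount(candB) ? "A" : "B";
        return () -> w;
    };

    public static void main(String[] args) {
        // Bytes that are valid ISO-8859-1 but malformed as UTF-8.
        byte[] raw = "na\u00efve caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        SketchComparison cmp = arbitrate(TOY, raw,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(cmp.winner()); // B
    }
}
```

      Passing the charset name as the label keeps the result traceable to the decoding that produced each candidate.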