Class TextQualityScore

java.lang.Object
org.apache.tika.quality.TextQualityScore

public final class TextQualityScore extends Object
Result of scoring a string for text quality via a TextQualityDetector.

zScore is the primary output: the number of standard deviations by which this string's score differs from typical clean text under its dominant script's model. 0 means average clean text; negative means worse than average, and more negative means worse.

pClean is a probability estimate in [0,1] that this is clean text.

ciLow / ciHigh are the 95% confidence interval bounds on zScore. For short strings these bounds are wide; for long strings they narrow. Because ciLow ≤ zScore ≤ ciHigh, requiring ciHigh < threshold is stricter than zScore < threshold; prefer it when triggering actions on low-quality text, to reduce false positives on short strings.
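The threshold comparisons described above differ only in which number is held against the cutoff. The following is a minimal sketch, not Tika code: the class name QualityGate, the cutoff of -2.0, and the sample values are all assumptions made for illustration. In real use the three numbers would come from getZScore(), getCiLow() and getCiHigh(). For a well-formed interval (ciLow ≤ zScore ≤ ciHigh), the ciLow rule fires at least as often as the zScore rule, and the ciHigh rule at most as often.

```java
public final class QualityGate {

    // Hypothetical cutoff for illustration only; the class documentation does not suggest a value.
    public static final float THRESHOLD = -2.0f;

    /** Point-estimate rule: fires whenever the z-score itself is below the threshold. */
    public static boolean belowByZScore(float zScore, float threshold) {
        return zScore < threshold;
    }

    /** Loosest rule: fires as soon as the lower bound dips below the threshold. */
    public static boolean belowByCiLow(float ciLow, float threshold) {
        return ciLow < threshold;
    }

    /** Strictest rule: fires only when the whole interval lies below the threshold. */
    public static boolean belowByCiHigh(float ciHigh, float threshold) {
        return ciHigh < threshold;
    }

    public static void main(String[] args) {
        // Made-up numbers: same z-score, once with a wide interval (short string)
        // and once with a narrow interval (long string).
        float[][] samples = {
            {-2.5f, -4.5f, -0.5f},  // zScore, ciLow, ciHigh for a short string
            {-2.5f, -2.9f, -2.1f},  // zScore, ciLow, ciHigh for a long string
        };
        for (float[] s : samples) {
            System.out.printf("z=%.1f ci=[%.1f, %.1f] zRule=%b lowRule=%b highRule=%b%n",
                    s[0], s[1], s[2],
                    belowByZScore(s[0], THRESHOLD),
                    belowByCiLow(s[1], THRESHOLD),
                    belowByCiHigh(s[2], THRESHOLD));
        }
    }
}
```

With these sample values the wide interval straddles the cutoff, so only the narrow-interval case satisfies the strictest rule; which rule suits a given pipeline depends on whether a missed bad string or a spurious trigger is more costly.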

  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final float
    UNKNOWN
    Sentinel z-score returned when scoring could not be run (e.g. null or empty input).
  • Constructor Summary

    Constructors
    Constructor
    Description
    TextQualityScore(float zScore, float pClean, float ciLow, float ciHigh, String dominantScript)
  • Method Summary

    Modifier and Type
    Method
    Description
    float
    getCiHigh()
    Upper bound of the 95% confidence interval on zScore.
    float
    getCiLow()
    Lower bound of the 95% confidence interval on zScore.
    String
    getDominantScript()
    Name of the dominant Unicode script detected, e.g. "LATIN", "CYRILLIC", "ARABIC".
    float
    getPClean()
    Probability in [0,1] that this string is clean text.
    float
    getZScore()
    Z-score relative to clean text for the detected script. 0 = average clean; negative = worse.
    boolean
    isUnknown()
    True if scoring could not be performed (e.g. empty or unsupported-script input).
    String
    toString()

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
  • Field Details

    • UNKNOWN

      public static final float UNKNOWN
      Sentinel z-score returned when scoring could not be run (e.g. null or empty input).
  • Constructor Details

    • TextQualityScore

      public TextQualityScore(float zScore, float pClean, float ciLow, float ciHigh, String dominantScript)
  • Method Details

    • getZScore

      public float getZScore()
      Z-score relative to clean text for the detected script. 0 = average clean; negative = worse.
    • getPClean

      public float getPClean()
      Probability in [0,1] that this string is clean text.
    • getCiLow

      public float getCiLow()
      Lower bound of the 95% confidence interval on zScore.
    • getCiHigh

      public float getCiHigh()
      Upper bound of the 95% confidence interval on zScore.
    • getDominantScript

      public String getDominantScript()
      Name of the dominant Unicode script detected, e.g. "LATIN", "CYRILLIC", "ARABIC".
    • isUnknown

      public boolean isUnknown()
      True if scoring could not be performed (e.g. empty or unsupported-script input).
    • toString

      public String toString()
      Overrides:
      toString in class Object
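Because UNKNOWN is a sentinel rather than a real measurement, callers will usually want to check isUnknown() before reading any numeric getter. The sketch below illustrates that ordering under stated assumptions: ScoreReport, its nested Score interface, and the fixed() helper are hypothetical stand-ins that only mirror the getter signatures documented above, so the example compiles without Tika on the classpath. With the real library, a TextQualityScore instance would take the place of the stand-in.

```java
import java.util.Locale;

public final class ScoreReport {

    /**
     * Hypothetical stand-in listing the getters documented for TextQualityScore.
     * It is not part of Tika.
     */
    public interface Score {
        float getZScore();
        float getPClean();
        float getCiLow();
        float getCiHigh();
        String getDominantScript();
        boolean isUnknown();
    }

    /** Builds a one-line summary, checking isUnknown() before reading any numeric field. */
    public static String summarize(Score s) {
        if (s.isUnknown()) {
            // The numeric fields carry no meaning here, so they are not reported.
            return "unscored";
        }
        return String.format(Locale.ROOT, "%s z=%.2f [%.2f, %.2f] pClean=%.2f",
                s.getDominantScript(), s.getZScore(),
                s.getCiLow(), s.getCiHigh(), s.getPClean());
    }

    /** Fixed-value implementation used only to exercise summarize(). */
    public static Score fixed(float z, float p, float lo, float hi,
                              String script, boolean unknown) {
        return new Score() {
            @Override public float getZScore() { return z; }
            @Override public float getPClean() { return p; }
            @Override public float getCiLow() { return lo; }
            @Override public float getCiHigh() { return hi; }
            @Override public String getDominantScript() { return script; }
            @Override public boolean isUnknown() { return unknown; }
        };
    }

    public static void main(String[] args) {
        // Made-up values for illustration.
        System.out.println(summarize(fixed(-1.5f, 0.25f, -2.5f, -0.5f, "LATIN", false)));
        System.out.println(summarize(fixed(0f, 0f, 0f, 0f, "", true)));
    }
}
```

Keeping the isUnknown() check first means a sentinel z-score is never compared against a threshold by accident.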