Package org.apache.tika.quality
Class TextQualityScore
java.lang.Object
org.apache.tika.quality.TextQualityScore
Result of scoring a string for text quality via a
TextQualityDetector.
zScore is the primary output: the number of standard deviations by which
this string's score differs from typical clean text on its dominant script's model.
0 means average clean text; negative means worse, and more negative means worse still.
pClean is a probability estimate in [0,1] that this is clean text.
ciLow / ciHigh are the 95% confidence interval bounds on
zScore. For short strings these bounds are wide; for long strings
they narrow. Prefer ciLow < threshold over zScore < threshold
when triggering actions, to reduce false positives on short strings.
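As an illustration of thresholding with the interval rather than the point estimate alone, the sketch below sorts a score into three cases according to where its 95% interval sits relative to a caller-chosen threshold. The class IntervalGateSketch, the enum Verdict, the method classify and the threshold value -2.0 are all hypothetical names and values chosen for this example; none of them are part of Tika. The method takes the plain floats that getCiLow() and getCiHigh() would return, so it compiles without Tika on the classpath.

```java
public class IntervalGateSketch {

    /** Where the 95% interval on zScore sits relative to a threshold. */
    enum Verdict { BELOW, STRADDLES, ABOVE }

    // Hypothetical helper, not part of Tika: takes the values that
    // getCiLow() and getCiHigh() would return, plus a caller-chosen threshold.
    static Verdict classify(float ciLow, float ciHigh, float threshold) {
        if (ciHigh < threshold) {
            return Verdict.BELOW;      // even the upper bound is under the threshold
        }
        if (ciLow < threshold) {
            return Verdict.STRADDLES;  // the interval contains the threshold
        }
        return Verdict.ABOVE;          // even the lower bound is at or over the threshold
    }

    public static void main(String[] args) {
        float threshold = -2.0f; // illustrative value, not a Tika default
        System.out.println(classify(-3.5f, -2.4f, threshold)); // BELOW
        System.out.println(classify(-4.0f, 1.0f, threshold));  // STRADDLES
        System.out.println(classify(-0.8f, 0.3f, threshold));  // ABOVE
    }
}
```

A wide interval, as expected for a short string, tends to land in STRADDLES. How that case is handled (act, skip, or gather more text) is left to the caller.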
Field Summary
Fields
Modifier and Type     Field      Description
static final float    UNKNOWN    Sentinel z-score returned when scoring could not be run (e.g. null or empty input).
Constructor Summary
Constructors
Constructor                                                                                       Description
TextQualityScore(float zScore, float pClean, float ciLow, float ciHigh, String dominantScript)
Method Summary
Modifier and Type    Method                 Description
float                getCiHigh()            Upper bound of the 95% confidence interval on zScore.
float                getCiLow()             Lower bound of the 95% confidence interval on zScore.
String               getDominantScript()    Name of the dominant Unicode script detected, e.g. "LATIN", "CYRILLIC", "ARABIC".
float                getPClean()            Probability in [0,1] that this string is clean text.
float                getZScore()            Z-score relative to clean text for the detected script. 0 = average clean; negative = worse.
boolean              isUnknown()            True if scoring could not be performed (e.g. empty or unsupported-script input).
String               toString()
Field Details
UNKNOWN
public static final float UNKNOWN
Sentinel z-score returned when scoring could not be run (e.g. null or empty input).
Constructor Details
TextQualityScore
public TextQualityScore(float zScore, float pClean, float ciLow, float ciHigh, String dominantScript)
Method Details
getZScore
public float getZScore()
Z-score relative to clean text for the detected script. 0 = average clean; negative = worse.
getPClean
public float getPClean()
Probability in [0,1] that this string is clean text.
getCiLow
public float getCiLow()
Lower bound of the 95% confidence interval on zScore.
getCiHigh
public float getCiHigh()
Upper bound of the 95% confidence interval on zScore.
getDominantScript
public String getDominantScript()
Name of the dominant Unicode script detected, e.g. "LATIN", "CYRILLIC", "ARABIC".
isUnknown
public boolean isUnknown()
True if scoring could not be performed (e.g. empty or unsupported-script input).
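Because UNKNOWN is a sentinel rather than a real score, a caller that thresholds getZScore() directly may end up comparing the sentinel against its threshold. A minimal sketch of checking isUnknown() first follows. The interface Score is a hypothetical stand-in that copies two accessor names from TextQualityScore so the example compiles without Tika; usableZScore and fixed are likewise hypothetical helpers, and in real code the score would come from a TextQualityDetector.

```java
import java.util.Optional;

public class UnknownGuardSketch {

    // Hypothetical stand-in with the same accessor names as TextQualityScore,
    // so the sketch compiles without Tika on the classpath.
    interface Score {
        boolean isUnknown();
        float getZScore();
    }

    // Returns the z-score only when scoring actually ran, so callers cannot
    // compare the sentinel value against a threshold by accident.
    static Optional<Float> usableZScore(Score score) {
        if (score == null || score.isUnknown()) {
            return Optional.empty();
        }
        return Optional.of(score.getZScore());
    }

    // Small factory for demo values only (assumption: not how Tika builds scores).
    static Score fixed(boolean unknown, float z) {
        return new Score() {
            @Override public boolean isUnknown() { return unknown; }
            @Override public float getZScore() { return z; }
        };
    }

    public static void main(String[] args) {
        System.out.println(usableZScore(fixed(false, -1.25f))); // Optional[-1.25]
        System.out.println(usableZScore(fixed(true, 0f)));      // Optional.empty
    }
}
```

Returning an Optional keeps the "no score" case explicit at the call site instead of relying on the caller to remember the sentinel.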
toString
public String toString()
Overrides: toString in class Object