Package org.apache.tika.detect
Class EncodingResult
java.lang.Object
org.apache.tika.detect.EncodingResult
A charset detection result pairing a Charset with a confidence score and an EncodingResult.ResultType indicating the nature of the evidence.
Result types
- EncodingResult.ResultType.DECLARATIVE: the document explicitly stated its encoding (BOM, HTML <meta charset>). These are authoritative claims about author intent and get preference over inferred results when consistent with the actual bytes.
- EncodingResult.ResultType.STRUCTURAL: byte-grammar proof (ISO-2022 escape sequences, UTF-8 multibyte validation). The encoding is proven by the byte structure itself, independent of any declaration.
- EncodingResult.ResultType.STATISTICAL: probabilistic inference from a statistical model. The confidence float is meaningful here for ranking among candidates; for DECLARATIVE and STRUCTURAL results it is conventionally 1.0 but carries no additional information.
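One way to read "consistent with the actual bytes" is that the input decodes under the declared charset without errors. The sketch below illustrates that reading using only the JDK's strict CharsetDecoder. The class DeclaredCharsetCheck and its methods decodesCleanly and choose are hypothetical names made up for illustration; they are not part of Tika, and the library's actual arbitration may differ.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a Tika class.
public class DeclaredCharsetCheck {

    /** Returns true if the bytes decode under the charset with no malformed or unmappable input. */
    public static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Assumed policy: keep the declared charset if the bytes agree with it, else use the inferred one. */
    public static Charset choose(byte[] bytes, Charset declared, Charset inferred) {
        return decodesCleanly(bytes, declared) ? declared : inferred;
    }

    public static void main(String[] args) {
        // "caf\u00e9" in ISO-8859-1 ends with the single byte 0xE9, which is not valid UTF-8 on its own.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(choose(latin1, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));

        // The same text encoded as UTF-8 agrees with a UTF-8 declaration.
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(choose(utf8, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```

A strict decode is only a necessary check: many single-byte charsets accept any byte sequence, so passing it does not prove the declaration is right.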
- Since:
  Apache Tika 4.0
Nested Class Summary
Nested Classes
- static enum EncodingResult.ResultType: The nature of the evidence that produced this result.
Constructor Summary
Constructors
- EncodingResult(Charset charset, float confidence): Constructs a STATISTICAL result.
- EncodingResult(Charset charset, float confidence, String label): Constructs a STATISTICAL result with a detector-specific label.
- EncodingResult(Charset charset, float confidence, String label, EncodingResult.ResultType resultType): Constructs a result with an explicit EncodingResult.ResultType.
Method Summary
Constructor Details
EncodingResult
Constructs a STATISTICAL result. Existing detectors that do not yet classify their evidence type default to statistical (probabilistic) treatment, which is the safe assumption because statistical results remain open to arbitration.
- Parameters:
  charset - the detected charset; must not be null
  confidence - detection confidence in [0.0, 1.0]
EncodingResult
Constructs a STATISTICAL result with a detector-specific label.
- Parameters:
  charset - the detected charset; must not be null
  confidence - detection confidence in [0.0, 1.0]
  label - the detector's original label (e.g. "IBM420-ltr"); if null, defaults to charset.name()
EncodingResult
public EncodingResult(Charset charset, float confidence, String label, EncodingResult.ResultType resultType)
Constructs a result with an explicit EncodingResult.ResultType.
- Parameters:
  charset - the detected charset; must not be null
  confidence - detection confidence in [0.0, 1.0]
  label - the detector's original label; if null, defaults to charset.name()
  resultType - the nature of the evidence; must not be null
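The three constructors share one contract: STATISTICAL is the default type, a null label falls back to charset.name(), and charset and resultType must not be null. The stand-in below sketches one way to get that behavior by delegating to the four-argument constructor. EncodingResultSketch is a hypothetical class written for illustration and is not the Tika source. The use of Objects.requireNonNull and the absence of range checking on confidence are assumptions, since the documentation does not say how violations are handled.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Objects;

// Hypothetical stand-in mirroring the documented constructors; not the real EncodingResult.
public class EncodingResultSketch {

    public enum ResultType { DECLARATIVE, STRUCTURAL, STATISTICAL }

    private final Charset charset;
    private final float confidence;
    private final String label;
    private final ResultType resultType;

    public EncodingResultSketch(Charset charset, float confidence) {
        this(charset, confidence, null, ResultType.STATISTICAL);
    }

    public EncodingResultSketch(Charset charset, float confidence, String label) {
        this(charset, confidence, label, ResultType.STATISTICAL);
    }

    public EncodingResultSketch(Charset charset, float confidence, String label, ResultType resultType) {
        // Assumption: null arguments are rejected with a NullPointerException.
        this.charset = Objects.requireNonNull(charset, "charset");
        this.resultType = Objects.requireNonNull(resultType, "resultType");
        // Assumption: the confidence is stored as given, with no clamping to [0.0, 1.0].
        this.confidence = confidence;
        // Documented behavior: a null label defaults to the charset name.
        this.label = (label != null) ? label : charset.name();
    }

    public Charset getCharset() { return charset; }
    public float getConfidence() { return confidence; }
    public String getLabel() { return label; }
    public ResultType getResultType() { return resultType; }

    public static void main(String[] args) {
        EncodingResultSketch guessed = new EncodingResultSketch(StandardCharsets.ISO_8859_1, 0.42f);
        EncodingResultSketch declared =
                new EncodingResultSketch(StandardCharsets.UTF_8, 1.0f, null, ResultType.DECLARATIVE);
        System.out.println(guessed.getLabel() + " " + guessed.getResultType());
        System.out.println(declared.getLabel() + " " + declared.getResultType());
    }
}
```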
Method Details
getCharset
getConfidence
public float getConfidence()
Detection confidence in [0.0, 1.0]. Meaningful for ranking among EncodingResult.ResultType.STATISTICAL candidates. For EncodingResult.ResultType.DECLARATIVE and EncodingResult.ResultType.STRUCTURAL results the value is conventionally 1.0 but carries no additional information beyond the type itself.
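Because the 1.0 on DECLARATIVE and STRUCTURAL results is a convention rather than a score, a caller ranking by confidence would likely want to compare STATISTICAL candidates only. The sketch below shows that filter-then-rank step. StatisticalRanking, Candidate, and bestStatistical are hypothetical names introduced here; Candidate is a minimal stand-in for a result object, not the Tika class.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration; none of these types come from Tika.
public class StatisticalRanking {

    public enum ResultType { DECLARATIVE, STRUCTURAL, STATISTICAL }

    /** Minimal stand-in carrying only the fields needed for ranking. */
    public static final class Candidate {
        public final Charset charset;
        public final float confidence;
        public final ResultType type;

        public Candidate(Charset charset, float confidence, ResultType type) {
            this.charset = charset;
            this.confidence = confidence;
            this.type = type;
        }
    }

    /** Returns the highest-confidence STATISTICAL candidate, ignoring all other result types. */
    public static Optional<Candidate> bestStatistical(List<Candidate> candidates) {
        return candidates.stream()
                .filter(c -> c.type == ResultType.STATISTICAL)
                .max(Comparator.comparingDouble((Candidate c) -> c.confidence));
    }

    public static void main(String[] args) {
        List<Candidate> candidates = Arrays.asList(
                new Candidate(StandardCharsets.UTF_8, 1.0f, ResultType.DECLARATIVE),
                new Candidate(StandardCharsets.ISO_8859_1, 0.61f, ResultType.STATISTICAL),
                new Candidate(StandardCharsets.UTF_16LE, 0.17f, ResultType.STATISTICAL));

        // The DECLARATIVE entry is skipped even though its confidence is numerically highest.
        bestStatistical(candidates)
                .ifPresent(c -> System.out.println(c.charset.name() + " " + c.confidence));
    }
}
```

How the winner of this ranking is then weighed against DECLARATIVE or STRUCTURAL results is a separate arbitration step that this sketch does not cover.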
getResultType
The nature of the evidence that produced this result.
getLabel
The detector's original label for this result. Usually identical to getCharset().name(), but preserved when the detector uses finer-grained labels than the Java charset registry supports (e.g. "IBM420-ltr", "IBM420-rtl", "windows-874").
toString