Package org.apache.tika.detect
Class CharsetSupersets
java.lang.Object
org.apache.tika.detect.CharsetSupersets
Maps detected charsets to safer superset charsets for decoding.
When Tika detects a charset that is a strict subset of a broader encoding, decoding with the superset is safer: the superset decodes every byte sequence the subset can produce identically, and it also decodes the extension characters the subset cannot represent. Decoding with the subset alone risks mojibake on any extension characters present in the document.
Policy: Content-Type and detected-encoding metadata report the detected
charset. Actual stream decoding uses the superset. The superset used is recorded
in TikaCoreProperties.DECODED_CHARSET.
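The policy can be illustrated with a minimal, self-contained sketch. It does not use Tika itself: the class DecodePolicySketch, its local supersetOf stand-in (which models only the EUC-KR pair), and the metadata keys "detectedCharset" and "decodedCharset" are all hypothetical names chosen for illustration, and a plain Map stands in for Tika metadata.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

class DecodePolicySketch {

    // Hypothetical stand-in for CharsetSupersets.supersetOf; only EUC-KR is modelled.
    static Charset supersetOf(Charset detected) {
        if ("EUC-KR".equals(detected.name()) && Charset.isSupported("x-windows-949")) {
            return Charset.forName("x-windows-949");
        }
        return null; // no override known to this sketch
    }

    // Decodes the raw bytes and records both charsets in a map standing in for metadata.
    static String decode(byte[] raw, Charset detected, Map<String, String> meta) {
        Charset superset = supersetOf(detected);
        Charset decodeWith = (superset != null) ? superset : detected;
        meta.put("detectedCharset", detected.name());  // hypothetical key: what is reported
        meta.put("decodedCharset", decodeWith.name()); // hypothetical key: what actually decoded
        return new String(raw, decodeWith);
    }

    public static void main(String[] args) {
        Charset detected = Charset.forName("EUC-KR");
        Charset superset = supersetOf(detected);
        if (superset == null) {
            System.out.println("x-windows-949 is not available in this JVM");
            return;
        }
        // U+B620 is a Hangul syllable assumed here to lie outside the set EUC-KR encodes.
        String original = "\uB620";
        byte[] raw = original.getBytes(superset);

        Map<String, String> meta = new LinkedHashMap<>();
        String text = decode(raw, detected, meta);
        System.out.println(meta);
        System.out.println("round trip via superset: " + text.equals(original));
        System.out.println("round trip via subset only: "
                + new String(raw, detected).equals(original));
    }
}
```

The reported charset and the decoding charset are kept as separate values on purpose, so that downstream consumers still see what the detector found.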
Superset map
- EUC-KR → x-windows-949 (MS949 is a strict superset: all EUC-KR byte sequences decode identically, while the extension characters that x-windows-949 adds would become mojibake if decoded as EUC-KR)
- Big5 → Big5-HKSCS (HKSCS adds Hong Kong Supplementary Characters)
- GB2312 → GB18030 (GB18030 is a strict superset of both GB2312 and GBK)
- GBK → GB18030 (GB18030 is a strict superset; enables 4-byte extension sequences)
- Shift_JIS → windows-31j (MS932 is a strict superset with NEC/IBM extensions)
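The list above can be sketched as a small lookup table keyed by canonical charset name. This is only an illustration of the data shape: SupersetMapSketch, SUPERSETS, and supersetNameOf are hypothetical names, and the actual field type used by CharsetSupersets is not stated on this page.

```java
import java.nio.charset.Charset;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

class SupersetMapSketch {

    // Detected canonical name -> superset canonical name, mirroring the list above.
    static final Map<String, String> SUPERSETS;
    static {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("EUC-KR", "x-windows-949");
        m.put("Big5", "Big5-HKSCS");
        m.put("GB2312", "GB18030");
        m.put("GBK", "GB18030");
        m.put("Shift_JIS", "windows-31j");
        SUPERSETS = Collections.unmodifiableMap(m);
    }

    // Returns the superset name, or null when the detected name has no entry.
    // The lookup is case-sensitive, so "euc-kr" does not match "EUC-KR".
    static String supersetNameOf(String detectedCanonicalName) {
        return SUPERSETS.get(detectedCanonicalName);
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : SUPERSETS.entrySet()) {
            // Extended charsets are optional in some JVM builds, so check before use.
            boolean available = Charset.isSupported(e.getValue());
            System.out.println(e.getKey() + " -> " + e.getValue()
                    + " (available in this JVM: " + available + ")");
        }
    }
}
```

Checking Charset.isSupported before relying on a superset is a defensive choice in this sketch; whether Tika performs the same check is not stated here.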
Field Summary
Fields
Field: SUPERSET_MAP
Description: Maps detected charset canonical names (case-sensitive, as returned by Charset.name()) to their superset charset canonical name.
Method Summary
Modifier and Type: static Charset
Method: supersetOf(Charset detected)
Description: Returns the superset charset to use for decoding, or null if detected has no superset override.
Field Details
SUPERSET_MAP
Maps detected charset canonical names (case-sensitive, as returned by Charset.name()) to their superset charset canonical name.
Method Details
supersetOf
Returns the superset charset to use for decoding, or null if detected has no superset override.
Parameters:
detected - the charset returned by the encoding detector
Returns:
superset charset, or null if none is defined
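Because the method returns null when no override exists, callers need a fallback to the detected charset. The sketch below shows one way to handle that contract. SupersetLookupSketch and charsetForDecoding are hypothetical names, the local supersetOf is a stand-in that models only three entries, and returning null when the superset is unavailable in the JVM is an assumption of this sketch rather than documented behaviour.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SupersetLookupSketch {

    // Stand-in with the documented contract: a superset, or null when there is no override.
    static Charset supersetOf(Charset detected) {
        String target;
        switch (detected.name()) { // canonical name, so aliases are already normalized
            case "GB2312":
            case "GBK":
                target = "GB18030";
                break;
            case "Shift_JIS":
                target = "windows-31j";
                break;
            default:
                return null;
        }
        // Assumption of this sketch: treat an unavailable superset as "no override".
        return Charset.isSupported(target) ? Charset.forName(target) : null;
    }

    // Null-safe helper: fall back to the detected charset when there is no superset.
    static Charset charsetForDecoding(Charset detected) {
        Charset superset = supersetOf(detected);
        return (superset != null) ? superset : detected;
    }

    public static void main(String[] args) {
        // UTF-8 has no entry, so the detected charset is used unchanged.
        System.out.println(charsetForDecoding(StandardCharsets.UTF_8).name());

        // A lower-case label still resolves, because Charset.forName is
        // case-insensitive and name() returns the canonical form.
        if (Charset.isSupported("gbk")) {
            Charset detected = Charset.forName("gbk");
            System.out.println(detected.name() + " -> " + charsetForDecoding(detected).name());
        }
    }
}
```

Passing a Charset rather than a raw label string means the case-sensitive lookup always sees the canonical name.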