Class CharsetSupersets

java.lang.Object
org.apache.tika.detect.CharsetSupersets

public final class CharsetSupersets extends Object
Maps detected charsets to safer superset charsets for decoding.

When Tika detects a charset that is a strict subset of a broader encoding, it is safer to decode with the superset: the superset decodes every byte sequence the subset can produce, and also the extension characters the subset cannot represent. Decoding with the subset alone risks mojibake on any extension characters present in the document.

Policy: Content-Type and detected-encoding metadata report the detected charset. Actual stream decoding uses the superset. The superset used is recorded in TikaCoreProperties.DECODED_CHARSET.
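
The policy can be illustrated with a minimal sketch. It does not use Tika's classes: the metadata is a plain Map, the two key strings are hypothetical placeholders (not the actual property names behind Content-Type or TikaCoreProperties.DECODED_CHARSET), and SUPERSETS is a local copy of a single map entry. It assumes a JVM that ships the extended charsets.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

class DecodePolicySketch {

    // Hypothetical metadata keys for this illustration only; not Tika's property names.
    static final String DETECTED_KEY = "detected-encoding";
    static final String DECODED_KEY = "decoded-charset";

    // Local copy of a single superset entry, keyed by canonical charset name.
    static final Map<String, String> SUPERSETS = Map.of("EUC-KR", "x-windows-949");

    /** Decodes with the superset when one is defined and available, and records both charsets. */
    static String decode(byte[] bytes, Charset detected, Map<String, String> metadata) {
        String supersetName = SUPERSETS.get(detected.name());
        Charset decodeWith = (supersetName != null && Charset.isSupported(supersetName))
                ? Charset.forName(supersetName)
                : detected;
        metadata.put(DETECTED_KEY, detected.name());   // reporting keeps the detected charset
        metadata.put(DECODED_KEY, decodeWith.name());  // charset actually used for decoding
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        Charset detected = Charset.forName("EUC-KR");
        Map<String, String> metadata = new LinkedHashMap<>();
        // U+D55C U+AE00 is representable in EUC-KR, so the bytes are valid in both charsets.
        byte[] bytes = "\uD55C\uAE00".getBytes(detected);
        String text = decode(bytes, detected, metadata);
        System.out.println(text + " " + metadata);
    }
}
```

Keeping the detected name and the decoding name in separate entries lets consumers see what the detector concluded while still benefiting from the broader decoder.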

Superset map

  • EUC-KR → x-windows-949 (MS949 is a strict superset: all EUC-KR byte sequences decode identically, while MS949 extension characters would turn into mojibake under EUC-KR)
  • Big5 → Big5-HKSCS (HKSCS adds Hong Kong Supplementary Characters)
  • GB2312 → GB18030 (GB18030 is a strict superset of both GB2312 and GBK)
  • GBK → GB18030 (GB18030 is a strict superset; enables 4-byte extension sequences)
  • Shift_JIS → windows-31j (MS932 is a strict superset with NEC/IBM extensions)

  • Field Details

    • SUPERSET_MAP

      public static final Map<String,String> SUPERSET_MAP
      Maps each detected charset's canonical name (case-sensitive, as returned by Charset.name()) to the canonical name of its superset charset.
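
Because the keys are case-sensitive canonical names, a raw label such as "euc-kr" would miss. The sketch below shows one way to canonicalize a label before the lookup. SUPERSETS is a local stand-in that copies two entries in the same key style, and lookup is a hypothetical helper, not part of this class.

```java
import java.nio.charset.Charset;
import java.util.Map;

class CanonicalKeySketch {

    // Stand-in using the same key style as SUPERSET_MAP: canonical names, case-sensitive.
    static final Map<String, String> SUPERSETS = Map.of(
            "EUC-KR", "x-windows-949",
            "Shift_JIS", "windows-31j");

    /** Hypothetical helper: canonicalizes an arbitrary label, then looks up its superset name. */
    static String lookup(String label) {
        try {
            // Charset.forName accepts aliases in any case; name() yields the canonical form.
            return SUPERSETS.get(Charset.forName(label).name());
        } catch (IllegalArgumentException e) {
            // Covers null, illegal, and unsupported charset names.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(SUPERSETS.get("euc-kr")); // direct lookup with a non-canonical key
        System.out.println(lookup("euc-kr"));        // lookup after canonicalization
    }
}
```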
  • Method Details

    • supersetOf

      public static Charset supersetOf(Charset detected)
      Returns the superset charset to use for decoding, or null if detected has no superset override.
      Parameters:
      detected - the charset returned by the encoding detector
      Returns:
      superset charset, or null if none is defined
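
Since the method returns null when no override exists, callers need a fallback to the detected charset. The sketch below shows that pattern. The local supersetOf is a hypothetical stand-in covering a single entry so the example runs without Tika on the classpath; with the real class the call would be CharsetSupersets.supersetOf(detected). decodingCharset is likewise an illustrative helper name.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SupersetOfUsageSketch {

    /** Hypothetical stand-in for CharsetSupersets.supersetOf, covering only GBK. */
    static Charset supersetOf(Charset detected) {
        if ("GBK".equals(detected.name()) && Charset.isSupported("GB18030")) {
            return Charset.forName("GB18030");
        }
        return null; // no superset override defined
    }

    /** Picks the charset to decode with, falling back to the detected one on null. */
    static Charset decodingCharset(Charset detected) {
        Charset superset = supersetOf(detected);
        return superset != null ? superset : detected;
    }

    public static void main(String[] args) throws IOException {
        Charset detected = StandardCharsets.UTF_8; // has no override in this sketch
        byte[] bytes = "plain text".getBytes(detected);
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), decodingCharset(detected))) {
            int c;
            while ((c = reader.read()) != -1) {
                out.append((char) c);
            }
        }
        System.out.println(out);
    }
}
```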