org.apache.tika.detect.CharsetSupersets

public final class CharsetSupersets extends Object

Maps detected charsets to safer superset charsets for decoding.

When Tika detects a charset that is a strict subset of a broader encoding, it is safer to decode with the superset: the superset decodes every byte sequence the subset can produce and also handles the extension characters the subset cannot represent. Decoding with the subset alone risks mojibake on any extension characters present in the document.
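
The failure mode described above can be reproduced with the JDK's own charsets. The sketch below is illustrative only and does not use Tika: the class name SubsetMojibakeDemo and its transcode helper are hypothetical, and it assumes the running JVM provides both EUC-KR and x-windows-949 (standard JDK builds that include the extended charsets do). It encodes U+B620, a Hangul syllable that x-windows-949 covers but that is generally understood to lie outside the base EUC-KR repertoire, and then decodes the same bytes with each charset.

```java
import java.nio.charset.Charset;

public class SubsetMojibakeDemo {

    /** Encodes text with one charset and decodes the resulting bytes with another. */
    static String transcode(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        // Assumption: both charsets are available in this JVM.
        Charset subset = Charset.forName("EUC-KR");
        Charset superset = Charset.forName("x-windows-949");

        // U+D55C is in both repertoires; U+B620 is the extension character.
        String text = "\uD55C\uB620";

        String viaSuperset = transcode(text, superset, superset);
        String viaSubset = transcode(text, superset, subset);

        System.out.println("superset decode intact: " + text.equals(viaSuperset));
        System.out.println("subset decode intact:   " + text.equals(viaSubset));
    }
}
```

Comparing the two printed lines shows whether the subset decoder preserved the extension character on a given JVM.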

Policy: Content-Type and detected-encoding metadata report the detected charset. Actual stream decoding uses the superset. The superset used is recorded in TikaCoreProperties.DECODED_CHARSET.

Superset map

EUC-KR → x-windows-949 (MS949 is a strict superset: all EUC-KR byte sequences decode identically, while x-windows-949 extension characters would turn into mojibake under EUC-KR)
Big5 → Big5-HKSCS (HKSCS adds Hong Kong Supplementary Characters)
GB2312 → GB18030 (GB18030 is a strict superset of both GB2312 and GBK)
GBK → GB18030 (GB18030 is a strict superset; enables 4-byte extension sequences)
Shift_JIS → windows-31j (MS932 is a strict superset with NEC/IBM extensions)
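
The table above can be pictured as a small lookup keyed by canonical charset name. The following is a minimal sketch, not Tika's actual source: the class SupersetMapSketch is hypothetical, its map is a stand-in copied from the entries listed here, and the handling of a null argument or of a superset the JVM does not provide (returning null) is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class SupersetMapSketch {

    // Hypothetical stand-in for SUPERSET_MAP; entries copied from the table above.
    static final Map<String, String> SUPERSET_MAP = Map.of(
            "EUC-KR", "x-windows-949",
            "Big5", "Big5-HKSCS",
            "GB2312", "GB18030",
            "GBK", "GB18030",
            "Shift_JIS", "windows-31j");

    /**
     * Assumed behavior: returns null for a null input, an unmapped charset,
     * or a superset that this JVM does not support.
     */
    static Charset supersetOf(Charset detected) {
        if (detected == null) {
            return null;
        }
        // Keys are canonical names, so look up by Charset.name(), not by a raw label.
        String supersetName = SUPERSET_MAP.get(detected.name());
        if (supersetName == null || !Charset.isSupported(supersetName)) {
            return null;
        }
        return Charset.forName(supersetName);
    }

    public static void main(String[] args) {
        // Charset.forName is case-insensitive; name() yields the canonical key.
        Charset detected = Charset.forName("gbk");
        System.out.println(detected.name() + " -> " + supersetOf(detected));
        System.out.println("UTF-8 -> " + supersetOf(StandardCharsets.UTF_8));
    }
}
```

Looking up by Charset.name() rather than by the label found in a document keeps the case-sensitive map keys reliable, because the JDK normalizes aliases and letter case when it resolves a charset.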

Field Summary

Fields

Modifier and Type

Field

Description

static final Map<String,String>

SUPERSET_MAP

Maps detected charset canonical names (case-sensitive, as returned by Charset.name()) to their superset charset canonical name.

Method Summary

Modifier and Type

Method

Description

static Charset

supersetOf(Charset detected)

Returns the superset charset to use for decoding, or null if detected has no superset override.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- SUPERSET_MAP
  
  public static final Map<String,String> SUPERSET_MAP
  
  Maps detected charset canonical names (case-sensitive, as returned by Charset.name()) to their superset charset canonical name.

Method Details
- supersetOf
  
  public static Charset supersetOf(Charset detected)
  
  Returns the superset charset to use for decoding, or null if detected has no superset override.
  
  Parameters:
  
  detected - the charset returned by the encoding detector
  
  Returns:
  
  superset charset, or null if none is defined
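
Because supersetOf returns null when no override exists, callers need a fallback to the detected charset. The sketch below shows one way a caller might apply the policy described for this class. It is an assumption-laden illustration: SupersetDecodingSketch is a hypothetical class, the lookup parameter stands in for CharsetSupersets::supersetOf, the metadata object is a plain map with made-up keys rather than Tika's Metadata class, and recording the decoded charset only when a superset was applied is a guess at the behavior. The demo uses a toy lookup (US-ASCII to UTF-8) that is not part of the superset map, so it runs on any JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

public class SupersetDecodingSketch {

    /**
     * Decodes bytes with the superset when one exists, else with the detected charset.
     * The detected charset is always reported; the superset is recorded only when used.
     */
    static String decode(byte[] bytes, Charset detected,
                         UnaryOperator<Charset> lookup, Map<String, String> metadata) {
        Charset superset = lookup.apply(detected);
        // Null means "no override", so fall back to the detected charset.
        Charset decodeWith = (superset != null) ? superset : detected;

        metadata.put("detected-charset", detected.name());     // hypothetical key
        if (superset != null) {
            metadata.put("decoded-charset", superset.name());  // hypothetical key
        }
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        // Toy lookup for the demo only; with Tika on the classpath a caller
        // would pass CharsetSupersets::supersetOf instead.
        UnaryOperator<Charset> toyLookup =
                cs -> StandardCharsets.US_ASCII.equals(cs) ? StandardCharsets.UTF_8 : null;

        byte[] bytes = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        Map<String, String> metadata = new HashMap<>();

        String text = decode(bytes, StandardCharsets.US_ASCII, toyLookup, metadata);
        System.out.println(text.equals("caf\u00E9") + " " + metadata);
    }
}
```

Keeping the reported charset and the decoding charset in separate metadata entries lets downstream consumers see what was detected without losing track of what was actually used to decode.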
