org.apache.tika.ml.chardetect.CharsetConfusables

public final class CharsetConfusables extends Object

Charset relationships used for lenient evaluation of charset detectors.

Two kinds of relationship are modelled:

Symmetric confusable groups

A pair (or larger set) of charsets is symmetrically confusable when a significant fraction of real-world byte sequences are identical under all members, making it unreasonable to penalise a detector for choosing either direction. Examples:

ISO-8859-1 / ISO-8859-15 / windows-1252 — differ only in 8 code points and the C1 control range (0x80–0x9F)
UTF-16-LE / UTF-16-BE without a BOM
IBM424-ltr / IBM424-rtl — same code page, differ only in text-reversal convention
KOI8-R / KOI8-U — share all Cyrillic letters, differ only in four Ukrainian characters
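
The first example can be checked directly against the JDK's own charset tables. The sketch below is illustrative only: the class `SymmetricConfusableDemo` and its helper `decodesIdentically` are hypothetical names, not part of Tika, and it assumes the runtime provides windows-1252 (standard JDK builds do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Tika.
final class SymmetricConfusableDemo {

    /** Returns true when both charsets decode the given bytes to the same string. */
    static boolean decodesIdentically(byte[] bytes, Charset a, Charset b) {
        return new String(bytes, a).equals(new String(bytes, b));
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        Charset cp1252 = Charset.forName("windows-1252");

        // "café": the only high byte is 0xE9, which lies outside the C1 range.
        byte[] plain = {'c', 'a', 'f', (byte) 0xE9};
        // 0x93 lies in 0x80-0x9F: a C1 control in ISO-8859-1, a curly quote in windows-1252.
        byte[] quoted = {(byte) 0x93, 'h', 'i'};

        System.out.println(decodesIdentically(plain, latin1, cp1252));  // true
        System.out.println(decodesIdentically(quoted, latin1, cp1252)); // false
    }
}
```

For the first probe no decoder-based test can tell the two charsets apart, which is why either prediction is treated as acceptable.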

Superset / subset chains

Some charsets stand in a strict superset/subset relationship:

GB2312 ⊂ GBK ⊂ GB18030
Big5 ⊂ Big5-HKSCS

For these, the relationship is directional:

Predicting the superset when the true charset is the subset is always safe — the superset decoder handles all subset byte sequences correctly (e.g. predicting GB18030 for a GB2312 file).
Predicting the subset when the true charset is the superset is not safe — the subset decoder may corrupt characters that exist only in the superset (e.g. predicting GB2312 for a file that uses GB18030-only characters).
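
The two rules can be sketched as a small lookup. This is not Tika's implementation: the class `LenientMatchSketch` is hypothetical, its tables copy only the examples listed on this page (the real constants may contain more entries), and treating an exact match as lenient is an assumption.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not Tika's implementation.
final class LenientMatchSketch {

    // Assumed contents: only the symmetric examples named in this documentation.
    static final List<Set<String>> SYMMETRIC = List.of(
            Set.of("ISO-8859-1", "ISO-8859-15", "windows-1252"),
            Set.of("KOI8-R", "KOI8-U"));

    // Key is a charset, value is its immediate superset.
    static final Map<String, String> SUPERSET_OF = Map.of(
            "GB2312", "GBK",
            "GBK", "GB18030",
            "Big5", "Big5-HKSCS");

    static boolean isLenientMatch(String actual, String predicted) {
        // Assumption: an exact match also counts as lenient.
        if (actual.equals(predicted)) {
            return true;
        }
        // Symmetric groups: either direction is acceptable.
        for (Set<String> group : SYMMETRIC) {
            if (group.contains(actual) && group.contains(predicted)) {
                return true;
            }
        }
        // Superset chains: walk upward from the true charset only.
        for (String s = SUPERSET_OF.get(actual); s != null; s = SUPERSET_OF.get(s)) {
            if (s.equals(predicted)) {
                return true;
            }
        }
        return false;
    }
}
```

Because the chain is walked only upward from `actual`, `isLenientMatch("GB2312", "GB18030")` succeeds while `isLenientMatch("GB18030", "GB2312")` does not.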

Summary

Use isLenientMatch(String, String) for evaluation: it returns true when the prediction is acceptable given the above rules. Use GROUPS + buildGroupIndices(java.lang.String[]) + collapseGroups(float[], int[][]) for inference-time probability pooling (direction does not matter when merging probability mass).

This class is kept in sync with the charset configuration in BuildCharsetTrainingData (see CHARSET_JAVA and the language-to-charset mappings).

Field Summary

- static final List<Set<String>> GROUPS
  All confusable groups (both symmetric and superset chains), used for probability collapsing during inference via collapseGroups(float[], int[][]).
- static final Map<String,String> ISO_TO_WINDOWS
  Maps each ISO-8859-X charset to its Windows-12XX equivalent.
- static final Set<String> SBCS_LATIN_FAMILY
  Single-byte Latin-family charsets that may decode byte-identically to windows-1252 on sparse probes (where the only high bytes present fall in positions the family agrees on, e.g. 0xE4='ä' in every member).
- static final Map<String,String> SUPERSET_OF
  Directional superset relationships: key is a charset, value is its immediate superset.
- static final List<Set<String>> SYMMETRIC_GROUPS
  Symmetric-only confusable groups.

Method Summary

- static int[][] buildGroupIndices(String[] labels)
  Build a per-class group-index array from a label array (e.g. from a LinearModel), using GROUPS (both symmetric and superset chains) for probability collapsing in inference.
- static float[] collapseGroups(float[] probs, int[][] groupIndices)
  Collapse confusable group probabilities: within each group, sum all members' probabilities and assign the total to the highest-scoring member; the other members get 0.
- static boolean isLenientMatch(String actual, String predicted)
  Return true if predicting predicted when the true charset is actual is an acceptable ("lenient") result.
- static Set<String> symmetricPeersOf(String charset)
  Return the set of charsets that are symmetrically confusable with charset, not including charset itself.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- GROUPS
  
  public static final List<Set<String>> GROUPS
  
  All confusable groups (both symmetric and superset chains), used for probability collapsing during inference via collapseGroups(float[], int[][]). Direction does not matter here — we just want to pool probability mass among related charsets and award it to the highest-scoring member.
- SYMMETRIC_GROUPS
  
  public static final List<Set<String>> SYMMETRIC_GROUPS
  
  Symmetric-only confusable groups. Both directions are equally acceptable as lenient matches (neither charset is "safer" than the other).
- ISO_TO_WINDOWS
  
  public static final Map<String,String> ISO_TO_WINDOWS
  
  Maps each ISO-8859-X charset to its Windows-12XX equivalent.
  When bytes in the C1 range (0x80–0x9F) are present in a probe, the ISO encoding is ruled out — those byte values are C1 control characters in every ISO-8859-X standard and never appear in real text. The corresponding Windows-12XX encoding uses that range for printable characters (smart quotes, Euro sign, em-dash, …) and is the correct attribution.
  
  Note that ISO-8859-15 maps to windows-1252 because ISO-8859-15 also leaves 0x80–0x9F as C1 controls; its differences from ISO-8859-1 are all in the 0xA0–0xFF region.
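
  The C1 rule described above can be sketched as a byte scan followed by a map lookup. The class `C1UpgradeSketch` and its methods are hypothetical, not part of Tika. Of the two map entries shown, only ISO-8859-15 → windows-1252 is stated in this documentation; the ISO-8859-1 entry is an assumption.

  ```java
  import java.util.Map;

  // Hypothetical sketch, not Tika's implementation.
  final class C1UpgradeSketch {

      // Illustrative subset. The ISO-8859-1 entry is assumed, not documented here.
      static final Map<String, String> ISO_TO_WINDOWS = Map.of(
              "ISO-8859-1", "windows-1252",
              "ISO-8859-15", "windows-1252");

      /** Returns true if any byte of the probe falls in the C1 range 0x80-0x9F. */
      static boolean hasC1Bytes(byte[] probe) {
          for (byte b : probe) {
              int unsigned = b & 0xFF;
              if (unsigned >= 0x80 && unsigned <= 0x9F) {
                  return true;
              }
          }
          return false;
      }

      /** Swaps an ISO candidate for its Windows equivalent when C1 bytes are present. */
      static String refine(String candidate, byte[] probe) {
          if (hasC1Bytes(probe)) {
              return ISO_TO_WINDOWS.getOrDefault(candidate, candidate);
          }
          return candidate;
      }
  }
  ```

  Candidates that are absent from the map pass through unchanged, so the rule only ever affects ISO-8859-X attributions.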
- SUPERSET_OF
  
  public static final Map<String,String> SUPERSET_OF
  Directional superset relationships: key is a charset, value is its immediate superset. A chain is expressed as successive entries:
  GB2312 → GBK → GB18030
  Big5 → Big5-HKSCS
  Predicting any ancestor of actual in this map is a lenient match. Predicting any descendant is not a lenient match.
- SBCS_LATIN_FAMILY
  
  public static final Set<String> SBCS_LATIN_FAMILY
  
  Single-byte Latin-family charsets that may decode byte-identically to windows-1252 on sparse probes (where the only high bytes present fall in positions the family agrees on — e.g. 0xE4='ä' in every member).
  Used by the Latin-windows-1252 fallback rule in MojibusterEncodingDetector: if the top candidate is a member of this set AND the probe decodes byte-identically under windows-1252, swap to windows-1252 as the unmarked Latin default. This is a narrower replacement for an earlier general "decode-equivalence expansion" design — see charset-detection.md for the full design-options discussion.
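
A minimal sketch of the fallback rule, under stated assumptions: the class `LatinFallbackSketch` is hypothetical, the family membership shown is invented for illustration (the real set is not listed on this page), and "decodes byte-identically" is interpreted here as "both decoders produce the same string".

```java
import java.nio.charset.Charset;
import java.util.Set;

// Hypothetical sketch, not the rule as implemented in MojibusterEncodingDetector.
final class LatinFallbackSketch {

    // Assumed membership, for illustration only.
    static final Set<String> SBCS_LATIN_FAMILY =
            Set.of("ISO-8859-2", "ISO-8859-15", "windows-1250");

    static String applyFallback(String topCandidate, byte[] probe) {
        if (!SBCS_LATIN_FAMILY.contains(topCandidate)) {
            return topCandidate;
        }
        String asCandidate = new String(probe, Charset.forName(topCandidate));
        String asDefault = new String(probe, Charset.forName("windows-1252"));
        // Swap only when the probe carries no evidence for the candidate over windows-1252.
        return asCandidate.equals(asDefault) ? "windows-1252" : topCandidate;
    }
}
```

With this sketch, a probe whose only high byte is 0xE4 falls back to windows-1252, while a probe containing 0xA4 keeps an ISO-8859-15 attribution, because that byte is the euro sign in ISO-8859-15 and the generic currency sign in windows-1252.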
Method Details
- isLenientMatch
  
  public static boolean isLenientMatch(String actual, String predicted)
  
  Return true if predicting predicted when the true charset is actual is an acceptable ("lenient") result.
  Symmetric groups: both directions are acceptable (e.g. predicting ISO-8859-1 for windows-1252 or vice versa).
  Superset chains: only predicting a superset of actual is acceptable (e.g. predicting GB18030 for GB2312), because the superset decoder can always handle the subset's byte sequences. Predicting a subset is not acceptable because the subset decoder may corrupt extended characters.
- symmetricPeersOf
  
  public static Set<String> symmetricPeersOf(String charset)
  
  Return the set of charsets that are symmetrically confusable with charset, not including charset itself. Returns an empty set for charsets not in any symmetric group.
- buildGroupIndices
  
  public static int[][] buildGroupIndices(String[] labels)
  
  Build a per-class group-index array from a label array (e.g. from a LinearModel), using GROUPS (both symmetric and superset chains) for probability collapsing in inference.
  
  Parameters:
  
  labels - class labels in model order
  
  Returns:
  
  int[numClasses][] group-index map
- collapseGroups
  
  public static float[] collapseGroups(float[] probs, int[][] groupIndices)
  
  Collapse confusable group probabilities: within each group, sum all members' probabilities and assign the total to the highest-scoring member; the other members get 0.
  
  Parameters:
  
  probs - raw softmax output (not modified)
  
  groupIndices - result of buildGroupIndices(String[])
  
  Returns:
  
  new probability array with group probabilities collapsed
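
The two inference-time helpers can be sketched together. This is one plausible reading, not Tika's code: the class `GroupCollapseSketch` is hypothetical, the groups are passed in as a parameter instead of being read from GROUPS, and the group-index layout (row i lists the indices of every class sharing a group with class i, or just i itself) is an assumption. The sketch also assumes groups are disjoint.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not Tika's implementation.
final class GroupCollapseSketch {

    /** Row i holds the indices of all labels sharing a group with labels[i]. */
    static int[][] buildGroupIndices(String[] labels, List<Set<String>> groups) {
        int[][] result = new int[labels.length][];
        for (int i = 0; i < labels.length; i++) {
            List<Integer> members = new ArrayList<>();
            for (Set<String> group : groups) {
                if (!group.contains(labels[i])) {
                    continue;
                }
                for (int j = 0; j < labels.length; j++) {
                    if (group.contains(labels[j]) && !members.contains(j)) {
                        members.add(j);
                    }
                }
            }
            if (members.isEmpty()) {
                members.add(i); // ungrouped class: a group of one
            }
            result[i] = members.stream().mapToInt(Integer::intValue).toArray();
        }
        return result;
    }

    /** Pools each group's probability mass onto its highest-scoring member. */
    static float[] collapseGroups(float[] probs, int[][] groupIndices) {
        float[] out = new float[probs.length]; // input array is left untouched
        boolean[] done = new boolean[probs.length];
        for (int i = 0; i < probs.length; i++) {
            if (done[i]) {
                continue;
            }
            float sum = 0f;
            int best = i;
            for (int j : groupIndices[i]) {
                sum += probs[j];
                if (probs[j] > probs[best]) {
                    best = j;
                }
                done[j] = true;
            }
            out[best] = sum;
        }
        return out;
    }
}
```

As a usage example, take labels {GB2312, GBK, UTF-8, GB18030} with probabilities {0.2, 0.3, 0.4, 0.1} and a single group containing the three GB charsets. Before collapsing, UTF-8 is the top class; afterwards GBK holds roughly 0.6 and wins, which illustrates why pooling matters when mass is split across related charsets.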
