Class CharsetConfusables

java.lang.Object
org.apache.tika.ml.chardetect.CharsetConfusables

public final class CharsetConfusables extends Object
Charset relationships used for lenient evaluation of charset detectors.

Two kinds of relationship are modelled:

Symmetric confusable groups

A pair (or larger set) of charsets is symmetrically confusable when a significant fraction of real-world byte sequences are identical under all members, making it unreasonable to penalise a detector for choosing either direction. Examples:

  • ISO-8859-1 / ISO-8859-15 / windows-1252 — differ only in 8 code points and the C1 control range (0x80–0x9F)
  • UTF-16-LE / UTF-16-BE without a BOM
  • IBM424-ltr / IBM424-rtl — same code page, differ only in text-reversal convention
  • KOI8-R / KOI8-U — share all Cyrillic letters, differ only in four Ukrainian characters

Superset / subset chains

Some charsets stand in a strict superset/subset relationship:

  • GB2312 ⊂ GBK ⊂ GB18030
  • Big5 ⊂ Big5-HKSCS

For these, the relationship is directional:

  • Predicting the superset when the true charset is the subset is always safe — the superset decoder handles all subset byte sequences correctly (e.g. predicting GB18030 for a GB2312 file).
  • Predicting the subset when the true charset is the superset is not safe — the subset decoder may corrupt characters that exist only in the superset (e.g. predicting GB2312 for a file that uses GB18030-only characters).

Summary

Use isLenientMatch(String, String) for evaluation: it returns true when the prediction is acceptable under the rules above. For inference-time probability pooling, use GROUPS together with buildGroupIndices(String[]) and collapseGroups(float[], int[][]); direction does not matter when merging probability mass.

This class is kept in sync with the charset configuration in BuildCharsetTrainingData (see CHARSET_JAVA and the language-to-charset mappings).

  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final List<Set<String>>
    GROUPS
    All confusable groups (both symmetric and superset chains), used for probability collapsing during inference via collapseGroups(float[], int[][]).
    static final Map<String,String>
    ISO_TO_WINDOWS
    Maps each ISO-8859-X charset to its Windows-12XX equivalent.
    static final Set<String>
    SBCS_LATIN_FAMILY
    Single-byte Latin-family charsets that may decode byte-identically to windows-1252 on sparse probes (where the only high bytes present fall in positions the family agrees on, e.g. 0xE4='ä' in every member).
    static final Map<String,String>
    SUPERSET_OF
    Directional superset relationships: key is a charset, value is its immediate superset.
    static final List<Set<String>>
    SYMMETRIC_GROUPS
    Symmetric-only confusable groups.
  • Method Summary

    Modifier and Type
    Method
    Description
    static int[][]
    buildGroupIndices(String[] labels)
    Build a per-class group-index array from a label array (e.g. from a LinearModel), using GROUPS (both symmetric and superset chains) for probability collapsing in inference.
    static float[]
    collapseGroups(float[] probs, int[][] groupIndices)
    Collapse confusable group probabilities: within each group, sum all members' probabilities and assign the total to the highest-scoring member; the other members get 0.
    static boolean
    isLenientMatch(String actual, String predicted)
    Return true if predicting predicted when the true charset is actual is an acceptable ("lenient") result.
    static Set<String>
    symmetricPeersOf(String charset)
    Return the set of charsets that are symmetrically confusable with charset, not including charset itself.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • GROUPS

      public static final List<Set<String>> GROUPS
      All confusable groups (both symmetric and superset chains), used for probability collapsing during inference via collapseGroups(float[], int[][]). Direction does not matter here — we just want to pool probability mass among related charsets and award it to the highest-scoring member.
    • SYMMETRIC_GROUPS

      public static final List<Set<String>> SYMMETRIC_GROUPS
      Symmetric-only confusable groups. Both directions are equally acceptable as lenient matches (neither charset is "safer" than the other).
    • ISO_TO_WINDOWS

      public static final Map<String,String> ISO_TO_WINDOWS
      Maps each ISO-8859-X charset to its Windows-12XX equivalent.

      When bytes in the C1 range (0x80–0x9F) are present in a probe, the ISO encoding is ruled out — those byte values are C1 control characters in every ISO-8859-X standard and never appear in real text. The corresponding Windows-12XX encoding uses that range for printable characters (smart quotes, Euro sign, em-dash, …) and is the correct attribution.

      Note that ISO-8859-15 maps to windows-1252 because ISO-8859-15 also leaves 0x80–0x9F as C1 controls; its differences from ISO-8859-1 are all in the 0xA0–0xFF region.
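
      The C1 rule described above can be sketched as follows. This is a minimal illustration, not the class's actual code: C1RangeSketch, ISO_TO_WINDOWS_SKETCH, hasC1Bytes and attribute are hypothetical names, and the two map entries are only an illustrative excerpt (the real map may hold more or different entries).

```java
import java.util.Map;

final class C1RangeSketch {
    // Hypothetical excerpt for illustration; not the real ISO_TO_WINDOWS contents.
    static final Map<String, String> ISO_TO_WINDOWS_SKETCH = Map.of(
            "ISO-8859-1", "windows-1252",
            "ISO-8859-15", "windows-1252");

    /** Returns true if any byte of the probe lies in the C1 range 0x80..0x9F. */
    static boolean hasC1Bytes(byte[] probe) {
        for (byte b : probe) {
            int v = b & 0xFF; // Java bytes are signed, so mask to 0..255
            if (v >= 0x80 && v <= 0x9F) {
                return true;
            }
        }
        return false;
    }

    /** Swaps an ISO candidate for its Windows equivalent when C1 bytes are present. */
    static String attribute(String candidate, byte[] probe) {
        if (hasC1Bytes(probe)) {
            // Candidates without a mapping are returned unchanged.
            return ISO_TO_WINDOWS_SKETCH.getOrDefault(candidate, candidate);
        }
        return candidate;
    }
}
```

      For a probe containing 0x93 and 0x94 (curly quotes in windows-1252), the sketch reattributes an ISO-8859-15 candidate to windows-1252; a probe whose only high byte is 0xE9 leaves the candidate untouched.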

    • SUPERSET_OF

      public static final Map<String,String> SUPERSET_OF
      Directional superset relationships: key is a charset, value is its immediate superset. A chain is expressed as successive entries:
         GB2312 → GBK → GB18030
         Big5   → Big5-HKSCS
       
      Predicting any ancestor of actual in this map is a lenient match. Predicting any descendant is not a lenient match.
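
      One way to picture the ancestor rule is a walk up the chain, one immediate superset at a time. The sketch below assumes the map holds exactly the entries shown above; SupersetChainSketch and isAncestor are hypothetical names, not part of this class.

```java
import java.util.Map;

final class SupersetChainSketch {
    // Copy of the chains shown above: key = charset, value = immediate superset.
    static final Map<String, String> SUPERSET_OF_SKETCH = Map.of(
            "GB2312", "GBK",
            "GBK", "GB18030",
            "Big5", "Big5-HKSCS");

    /**
     * Returns true if predicted is reachable from actual by following
     * immediate-superset links. Assumes the map has no cycles and that
     * actual is non-null.
     */
    static boolean isAncestor(String actual, String predicted) {
        String current = SUPERSET_OF_SKETCH.get(actual);
        while (current != null) {
            if (current.equals(predicted)) {
                return true;
            }
            current = SUPERSET_OF_SKETCH.get(current);
        }
        return false;
    }
}
```

      Under this sketch, isAncestor("GB2312", "GB18030") holds because the walk passes through GBK, while isAncestor("GB18030", "GB2312") does not, which mirrors the directional rule.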
    • SBCS_LATIN_FAMILY

      public static final Set<String> SBCS_LATIN_FAMILY
      Single-byte Latin-family charsets that may decode byte-identically to windows-1252 on sparse probes (where the only high bytes present fall in positions the family agrees on — e.g. 0xE4='ä' in every member).

      Used by the Latin-windows-1252 fallback rule in MojibusterEncodingDetector: if the top candidate is a member of this set AND the probe decodes byte-identically under windows-1252, swap to windows-1252 as the unmarked Latin default. This is a narrower replacement for an earlier general "decode-equivalence expansion" design — see charset-detection.md for the full design-options discussion.
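
      The fallback rule can be approximated with the JDK's own decoders. The sketch below is an assumption-laden illustration, not the code in MojibusterEncodingDetector: Windows1252FallbackSketch, FAMILY_SKETCH and applyFallback are hypothetical names, the set members are placeholders (the real contents of SBCS_LATIN_FAMILY are not listed here), and "decodes byte-identically" is interpreted as "both decoders produce the same string".

```java
import java.nio.charset.Charset;
import java.util.Set;

final class Windows1252FallbackSketch {
    static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // Placeholder members for illustration only; not the real SBCS_LATIN_FAMILY.
    static final Set<String> FAMILY_SKETCH = Set.of("ISO-8859-1", "ISO-8859-15");

    /**
     * If the top candidate is in the family and the probe decodes to the same
     * text under windows-1252, report windows-1252; otherwise keep the candidate.
     */
    static String applyFallback(String topCandidate, byte[] probe) {
        if (!FAMILY_SKETCH.contains(topCandidate)) {
            return topCandidate;
        }
        String asCandidate = new String(probe, Charset.forName(topCandidate));
        String asWindows = new String(probe, WINDOWS_1252);
        return asCandidate.equals(asWindows) ? "windows-1252" : topCandidate;
    }
}
```

      A probe whose only high byte is 0xE9 decodes to the same text under ISO-8859-15 and windows-1252, so the sketch swaps to windows-1252. A probe containing 0xA4 does not, because that byte is the Euro sign in ISO-8859-15 and the currency sign in windows-1252.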

  • Method Details

    • isLenientMatch

      public static boolean isLenientMatch(String actual, String predicted)
      Return true if predicting predicted when the true charset is actual is an acceptable ("lenient") result.

      Symmetric groups: both directions are acceptable (e.g. predicting ISO-8859-1 for windows-1252 or vice versa).
      Superset chains: only predicting a superset of actual is acceptable (e.g. predicting GB18030 for GB2312), because the superset decoder can always handle the subset's byte sequences. Predicting a subset is not acceptable because the subset decoder may corrupt extended characters.

    • symmetricPeersOf

      public static Set<String> symmetricPeersOf(String charset)
      Return the set of charsets that are symmetrically confusable with charset, not including charset itself. Returns an empty set for charsets not in any symmetric group.
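
      A peer lookup of this kind might be sketched as below. The group contents are taken from the examples in the class description and are only an excerpt; SymmetricPeersSketch, peersOf and symmetricMatch are hypothetical names, and exact (case-sensitive) name matching is an assumption.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class SymmetricPeersSketch {
    // Illustrative excerpt based on the examples in the class description.
    static final List<Set<String>> SYMMETRIC_GROUPS_SKETCH = List.of(
            Set.of("ISO-8859-1", "ISO-8859-15", "windows-1252"),
            Set.of("KOI8-R", "KOI8-U"));

    /** Returns every other member of any group containing charset (non-null). */
    static Set<String> peersOf(String charset) {
        Set<String> peers = new LinkedHashSet<>();
        for (Set<String> group : SYMMETRIC_GROUPS_SKETCH) {
            if (group.contains(charset)) {
                peers.addAll(group);
            }
        }
        peers.remove(charset); // the charset is not its own peer
        return peers;
    }

    /** Symmetric part of a lenient match: exact match or any peer, in either direction. */
    static boolean symmetricMatch(String actual, String predicted) {
        return actual.equals(predicted) || peersOf(actual).contains(predicted);
    }
}
```

      Because group membership is symmetric, symmetricMatch gives the same answer when its two arguments are swapped, unlike the superset-chain rule.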
    • buildGroupIndices

      public static int[][] buildGroupIndices(String[] labels)
      Build a per-class group-index array from a label array (e.g. from a LinearModel), using GROUPS (both symmetric and superset chains) for probability collapsing in inference.
      Parameters:
      labels - class labels in model order
      Returns:
      int[numClasses][] group-index map
    • collapseGroups

      public static float[] collapseGroups(float[] probs, int[][] groupIndices)
      Collapse confusable group probabilities: within each group, sum all members' probabilities and assign the total to the highest-scoring member; the other members get 0.
      Parameters:
      probs - raw softmax output (not modified)
      groupIndices - result of buildGroupIndices(String[])
      Returns:
      new probability array with group probabilities collapsed
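
      The pooling step can be sketched as follows. This is not the class's implementation: CollapseSketch, groupRows and collapse are hypothetical names, and the index layout used here (one row per group, listing the class indices of its members) is a simplifying assumption that may differ from the int[numClasses][] layout returned by buildGroupIndices(String[]). Groups are assumed to be disjoint.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

final class CollapseSketch {
    /** Builds one row per group, holding the label indices that belong to it. */
    static int[][] groupRows(String[] labels, List<Set<String>> groups) {
        List<int[]> rows = new ArrayList<>();
        for (Set<String> group : groups) {
            int[] members = new int[labels.length];
            int count = 0;
            for (int i = 0; i < labels.length; i++) {
                if (group.contains(labels[i])) {
                    members[count++] = i;
                }
            }
            if (count > 1) { // a group with fewer than two known labels has nothing to pool
                rows.add(Arrays.copyOf(members, count));
            }
        }
        return rows.toArray(new int[0][]);
    }

    /** Sums each group's probabilities onto its highest-scoring member; input is not modified. */
    static float[] collapse(float[] probs, int[][] rows) {
        float[] out = probs.clone();
        for (int[] row : rows) {
            float sum = 0f;
            int best = row[0];
            for (int idx : row) {
                sum += probs[idx];
                if (probs[idx] > probs[best]) {
                    best = idx;
                }
            }
            for (int idx : row) {
                out[idx] = 0f;
            }
            out[best] = sum;
        }
        return out;
    }
}
```

      With labels {GB2312, GBK, GB18030, UTF-8} and probabilities {0.25, 0.5, 0.125, 0.125}, the three Chinese charsets form one group, so GBK receives 0.875, GB2312 and GB18030 drop to 0, and UTF-8 keeps 0.125.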