Class CharsetConfusables
Two kinds of relationship are modelled:
Symmetric confusable groups
A pair (or larger set) of charsets is symmetrically confusable when a significant fraction of real-world byte sequences are identical under all members, making it unreasonable to penalise a detector for choosing either direction. Examples:
- ISO-8859-1 / ISO-8859-15 / windows-1252 — differ only in 8 code points and the C1 control range (0x80–0x9F)
- UTF-16-LE / UTF-16-BE without a BOM
- IBM424-ltr / IBM424-rtl — same code page, differ only in text-reversal convention
- KOI8-R / KOI8-U — share all Russian Cyrillic letters, differ only in the eight positions KOI8-U assigns to four Ukrainian letters (Ґ, Є, І, Ї, upper and lower case)
Superset / subset chains
Some charsets stand in a strict superset/subset relationship:
- GB2312 ⊂ GBK ⊂ GB18030
- Big5 ⊂ Big5-HKSCS
For these, the relationship is directional:
- Predicting the superset when the true charset is the subset is always safe — the superset decoder handles all subset byte sequences correctly (e.g. predicting GB18030 for a GB2312 file).
- Predicting the subset when the true charset is the superset is not safe — the subset decoder may corrupt characters that exist only in the superset (e.g. predicting GB2312 for a file that uses GB18030-only characters).
Summary
- Use isLenientMatch(String, String) for evaluation: it returns true when the prediction is acceptable given the above rules.
- Use GROUPS + buildGroupIndices(java.lang.String[]) + collapseGroups(float[], int[][]) for inference-time probability pooling (direction does not matter when merging probability mass).
This class is kept in sync with the charset configuration in BuildCharsetTrainingData (see CHARSET_JAVA and the language-to-charset mappings).
Field Summary
Fields
- GROUPS: All confusable groups (both symmetric and superset chains), used for probability collapsing during inference via collapseGroups(float[], int[][]).
- ISO_TO_WINDOWS: Maps each ISO-8859-X charset to its Windows-12XX equivalent.
- SBCS_LATIN_FAMILY: Single-byte Latin-family charsets that may decode byte-identically to windows-1252 on sparse probes (where the only high bytes present fall in positions the family agrees on, e.g. 0xE4='ä' in every member).
- SUPERSET_OF: Directional superset relationships: key is a charset, value is its immediate superset.
- SYMMETRIC_GROUPS: Symmetric-only confusable groups.
Method Summary
Methods
- static int[][] buildGroupIndices(String[] labels): Build a per-class group-index array from a label array (e.g. from a LinearModel), using GROUPS (both symmetric and superset chains) for probability collapsing in inference.
- static float[] collapseGroups(float[] probs, int[][] groupIndices): Collapse confusable group probabilities: within each group, sum all members' probabilities and assign the total to the highest-scoring member; the other members get 0.
- static boolean isLenientMatch(String actual, String predicted): Return true if predicting predicted when the true charset is actual is an acceptable ("lenient") result.
- symmetricPeersOf(String charset): Return the set of charsets that are symmetrically confusable with charset, not including charset itself.
Field Details
GROUPS
All confusable groups (both symmetric and superset chains), used for probability collapsing during inference via collapseGroups(float[], int[][]). Direction does not matter here: we just want to pool probability mass among related charsets and award it to the highest-scoring member.
SYMMETRIC_GROUPS
Symmetric-only confusable groups. Both directions are equally acceptable as lenient matches (neither charset is "safer" than the other).
ISO_TO_WINDOWS
Maps each ISO-8859-X charset to its Windows-12XX equivalent.
When bytes in the C1 range (0x80–0x9F) are present in a probe, the ISO encoding is ruled out: those byte values are C1 control characters in every ISO-8859-X standard and never appear in real text. The corresponding Windows-12XX encoding uses that range for printable characters (smart quotes, Euro sign, em-dash, …) and is the correct attribution.
Note that ISO-8859-15 maps to windows-1252 because ISO-8859-15 also leaves 0x80–0x9F as C1 controls; its differences from ISO-8859-1 are all in the 0xA0–0xFF region.
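The C1 rule above can be sketched as follows. This is a minimal illustration, not the detector's actual code: the class name `C1UpgradeSketch`, the helper names, and the two-entry map excerpt are all hypothetical, and the real ISO_TO_WINDOWS contents are not reproduced here.

```java
import java.util.Map;

final class C1UpgradeSketch {
    // Hypothetical two-entry excerpt for illustration only.
    static final Map<String, String> ISO_TO_WINDOWS_EXCERPT = Map.of(
            "ISO-8859-1", "windows-1252",
            "ISO-8859-15", "windows-1252");

    /** Returns true if any byte of the probe lies in the C1 range 0x80-0x9F. */
    static boolean hasC1Bytes(byte[] probe) {
        for (byte b : probe) {
            int v = b & 0xFF; // bytes are signed in Java; mask to 0..255
            if (v >= 0x80 && v <= 0x9F) {
                return true;
            }
        }
        return false;
    }

    /** Swaps an ISO candidate for its Windows equivalent when C1 bytes are present. */
    static String upgradeIfC1(String candidate, byte[] probe) {
        if (!hasC1Bytes(probe)) {
            return candidate;
        }
        return ISO_TO_WINDOWS_EXCERPT.getOrDefault(candidate, candidate);
    }
}
```

With this sketch, a probe containing 0x93 (a left smart quote in windows-1252) turns an ISO-8859-1 candidate into windows-1252, while a probe whose only high byte is 0xE9 leaves the candidate unchanged.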
SUPERSET_OF
Directional superset relationships: key is a charset, value is its immediate superset. A chain is expressed as successive entries:
GB2312 → GBK → GB18030
Big5 → Big5-HKSCS
Predicting any ancestor of actual in this map is a lenient match. Predicting any descendant is not a lenient match.
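The ancestor rule can be illustrated by walking the map from the true charset upward. The sketch below is an assumption about how such a walk might look: the class `SupersetChainSketch` and the method `isAncestor` are hypothetical names, and the map is a local copy of the two chains listed above rather than the real field.

```java
import java.util.Map;

final class SupersetChainSketch {
    // Local copy of the chains described above: key -> immediate superset.
    static final Map<String, String> CHAIN = Map.of(
            "GB2312", "GBK",
            "GBK", "GB18030",
            "Big5", "Big5-HKSCS");

    /**
     * Returns true if predicted is reachable from actual by following
     * immediate-superset links. Assumes the map contains no cycles.
     */
    static boolean isAncestor(String actual, String predicted) {
        String current = CHAIN.get(actual);
        while (current != null) {
            if (current.equals(predicted)) {
                return true;
            }
            current = CHAIN.get(current);
        }
        return false;
    }
}
```

Here `isAncestor("GB2312", "GB18030")` holds because the walk passes through GBK, while `isAncestor("GB18030", "GB2312")` does not, which matches the directional rule.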
SBCS_LATIN_FAMILY
Single-byte Latin-family charsets that may decode byte-identically to windows-1252 on sparse probes (where the only high bytes present fall in positions the family agrees on, e.g. 0xE4='ä' in every member).
Used by the Latin-windows-1252 fallback rule in MojibusterEncodingDetector: if the top candidate is a member of this set AND the probe decodes byte-identically under windows-1252, swap to windows-1252 as the unmarked Latin default. This is a narrower replacement for an earlier general "decode-equivalence expansion" design; see charset-detection.md for the full design-options discussion.
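The fallback rule can be sketched as a comparison of decoded text. This is a hedged illustration only: `LatinFallbackSketch`, `applyFallback`, and the two-member family set are hypothetical, and the actual logic in MojibusterEncodingDetector may differ. It relies on the JDK providing the named charsets.

```java
import java.nio.charset.Charset;
import java.util.Set;

final class LatinFallbackSketch {
    // Hypothetical family members; the real SBCS_LATIN_FAMILY contents are not shown here.
    static final Set<String> FAMILY_EXCERPT = Set.of("ISO-8859-1", "ISO-8859-15");

    /**
     * If the top candidate belongs to the family and the probe decodes to the
     * same text under windows-1252, return windows-1252; otherwise keep the candidate.
     */
    static String applyFallback(String topCandidate, byte[] probe) {
        if (!FAMILY_EXCERPT.contains(topCandidate)) {
            return topCandidate;
        }
        String asCandidate = new String(probe, Charset.forName(topCandidate));
        String asWindows1252 = new String(probe, Charset.forName("windows-1252"));
        return asCandidate.equals(asWindows1252) ? "windows-1252" : topCandidate;
    }
}
```

A probe whose only high byte is 0xE4 decodes to 'ä' under both ISO-8859-1 and windows-1252, so the sketch swaps to windows-1252. A probe containing 0x80 decodes to a C1 control under ISO-8859-1 but to the Euro sign under windows-1252, so the candidate is kept.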
Method Details
isLenientMatch
Return true if predicting predicted when the true charset is actual is an acceptable ("lenient") result.
Symmetric groups: both directions are acceptable (e.g. predicting ISO-8859-1 for windows-1252 or vice versa).
Superset chains: only predicting a superset of actual is acceptable (e.g. predicting GB18030 for GB2312), because the superset decoder can always handle the subset's byte sequences. Predicting a subset is not acceptable because the subset decoder may corrupt extended characters.
symmetricPeersOf
Return the set of charsets that are symmetrically confusable with charset, not including charset itself. Returns an empty set for charsets not in any symmetric group.
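The peer lookup can be pictured as follows. This is a sketch under stated assumptions: the class `SymmetricPeersSketch` and method `peersOf` are hypothetical, the groups are the examples from the class overview, and the real SYMMETRIC_GROUPS field may use a different representation or contain other groups.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class SymmetricPeersSketch {
    // Example groups taken from the class overview; the real field may differ.
    static final List<Set<String>> SYMMETRIC_EXAMPLES = List.of(
            Set.of("ISO-8859-1", "ISO-8859-15", "windows-1252"),
            Set.of("UTF-16-LE", "UTF-16-BE"),
            Set.of("IBM424-ltr", "IBM424-rtl"),
            Set.of("KOI8-R", "KOI8-U"));

    /** Collects every member of any group containing charset, minus charset itself. */
    static Set<String> peersOf(String charset) {
        Set<String> peers = new LinkedHashSet<>();
        for (Set<String> group : SYMMETRIC_EXAMPLES) {
            if (group.contains(charset)) {
                peers.addAll(group);
            }
        }
        peers.remove(charset);
        return peers;
    }
}
```

A charset that appears in no group, such as UTF-8 in this excerpt, yields an empty set, which matches the documented behaviour.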
buildGroupIndices
Build a per-class group-index array from a label array (e.g. from a LinearModel), using GROUPS (both symmetric and superset chains) for probability collapsing in inference.
Parameters:
labels - class labels in model order
Returns:
int[numClasses][] group-index map
collapseGroups
public static float[] collapseGroups(float[] probs, int[][] groupIndices)
Collapse confusable group probabilities: within each group, sum all members' probabilities and assign the total to the highest-scoring member; the other members get 0.
Parameters:
probs - raw softmax output (not modified)
groupIndices - result of buildGroupIndices(String[])
Returns:
new probability array with group probabilities collapsed
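The pooling step can be sketched end to end as below. Several things here are assumptions rather than documented behaviour: the class `GroupCollapseSketch` is hypothetical, the group table is a two-row excerpt, and the layout of the index array (each class maps to the indices of every class in its group, itself included, with ungrouped classes mapping to a singleton) is one plausible reading of "per-class group-index array".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class GroupCollapseSketch {
    // Hypothetical excerpt; the real GROUPS field is not reproduced here.
    static final String[][] GROUPS_EXCERPT = {
            {"GB2312", "GBK", "GB18030"},
            {"KOI8-R", "KOI8-U"},
    };

    /** Assumed layout: result[i] lists the indices of all labels sharing a group with label i. */
    static int[][] buildGroupIndices(String[] labels) {
        int[][] result = new int[labels.length][];
        for (int i = 0; i < labels.length; i++) {
            result[i] = new int[] {i}; // ungrouped labels form a singleton
            for (String[] group : GROUPS_EXCERPT) {
                List<String> names = Arrays.asList(group);
                if (!names.contains(labels[i])) {
                    continue;
                }
                List<Integer> members = new ArrayList<>();
                for (int j = 0; j < labels.length; j++) {
                    if (names.contains(labels[j])) {
                        members.add(j);
                    }
                }
                result[i] = members.stream().mapToInt(Integer::intValue).toArray();
                break;
            }
        }
        return result;
    }

    /** Sums each group's probabilities onto its highest-scoring member; input is not modified. */
    static float[] collapseGroups(float[] probs, int[][] groupIndices) {
        float[] out = probs.clone();
        boolean[] done = new boolean[probs.length];
        for (int i = 0; i < probs.length; i++) {
            if (done[i]) {
                continue; // this class was already handled as part of an earlier group
            }
            int best = i;
            float sum = 0f;
            for (int j : groupIndices[i]) {
                sum += probs[j];
                if (probs[j] > probs[best]) {
                    best = j;
                }
                done[j] = true;
            }
            for (int j : groupIndices[i]) {
                out[j] = 0f;
            }
            out[best] = sum;
        }
        return out;
    }
}
```

For labels UTF-8, GB2312, GBK, GB18030 with probabilities 0.25, 0.25, 0.125, 0.375, the three GB entries pool to 0.75 on GB18030, and UTF-8 keeps its 0.25.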