Class StructuralEncodingRules
Pipeline
- checkAscii(byte[]): no bytes >= 0x80 → UTF-8 (ASCII is a subset)
- detectIso2022(byte[]): ISO-2022 escape sequences present → ISO-2022-JP, ISO-2022-KR, or ISO-2022-CN depending on the designation sequence
- checkUtf8(byte[]): validate UTF-8 multi-byte grammar; returns a StructuralEncodingRules.Utf8Result indicating whether the bytes are definitively UTF-8, definitively not UTF-8, or ambiguous (pass to model).
UTF-16/32 detection is handled upstream by
org.apache.tika.utils.ByteEncodingHint and is not repeated here.
IBM424 (EBCDIC Hebrew) is detected via checkIbm424(byte[]): the Hebrew
letters in this code page occupy bytes 0x41–0x6A, which fall entirely below
the 0x80 threshold used by the statistical model's feature extractor. The
EBCDIC space (0x40) vs ASCII space (0x20) frequency ratio provides a cheap
first-pass EBCDIC gate before the Hebrew letter frequencies are checked.
All methods are stateless and safe to call from multiple threads.
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static enum | StructuralEncodingRules.Utf8Result | Outcome of the UTF-8 structural check.
-
Field Summary
Fields
Modifier and Type | Field | Description
static final int | MIN_COLUMN_ASYMMETRY_PROBE | Minimum probe length before has2ByteColumnAsymmetry(byte[]) produces meaningful diversity counts.
-
Method Summary
Modifier and Type | Method | Description
static boolean | checkAscii(byte[] bytes) | Returns true if bytes contains no bytes with value >= 0x80 (i.e. pure 7-bit ASCII, which is a strict subset of UTF-8).
static boolean | checkAscii(byte[] bytes, int offset, int length) |
static boolean | checkHz(byte[] bytes) | Returns true if HZ-GB-2312 switching sequences are present.
static boolean | checkHz(byte[] bytes, int offset, int length) |
static boolean | checkIbm424(byte[] bytes) | Detects IBM424 (EBCDIC Hebrew) by examining the sub-0x80 byte landscape.
static boolean | checkIbm424(byte[] bytes, int offset, int length) |
static boolean | checkIbm500(byte[] bytes) | Detects IBM500 (International EBCDIC / EBCDIC-500) by looking for the combination of the EBCDIC space byte and high-byte Latin letter density.
static boolean | checkIbm500(byte[] bytes, int offset, int length) |
static boolean | checkIso2022Jp(byte[] bytes) | Deprecated.
static StructuralEncodingRules.Utf8Result | checkUtf8(byte[] bytes) | Validates the UTF-8 byte grammar of the sample and returns one of three outcomes: StructuralEncodingRules.Utf8Result.LIKELY_UTF8: all multi-byte sequences are valid and the sample contains enough high bytes to be informative.
static StructuralEncodingRules.Utf8Result | checkUtf8(byte[] bytes, int offset, int length) |
static int | countUtf8Errors(byte[] bytes) | Counts the number of malformed UTF-8 sequences in the sample: one event per bad lead, orphaned continuation, overlong, surrogate, or out-of-range codepoint, regardless of how many bytes the bad sequence spans.
static int | countUtf8Errors(byte[] bytes, int offset, int length) |
static Charset | detectIso2022(byte[] bytes) | Detects ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN by scanning for their characteristic ESC designation sequences.
static Charset | detectIso2022(byte[] bytes, int offset, int length) |
static boolean | has2ByteColumnAsymmetry(byte[] bytes) | Returns true if the probe's byte distribution across stride-2 columns is sufficiently asymmetric to be plausible UTF-16 of some script.
static boolean | has2ByteColumnAsymmetryEvidence(byte[] bytes) | Evidence-based variant of has2ByteColumnAsymmetry(byte[]) with no conservative short-probe default: returns true only when the bytes themselves demonstrate column asymmetry, regardless of probe length.
static boolean | hasC1Bytes(byte[] bytes) | Returns true if the probe contains any byte in the C1 control range 0x80–0x9F.
static boolean | hasC1Bytes(byte[] bytes, int offset, int length) |
static boolean | hasCrlfBytes(byte[] bytes) | Returns true if the probe contains at least one CRLF pair (0x0D 0x0A).
static boolean | hasCrlfBytes(byte[] bytes, int offset, int length) |
static boolean | hasGb18030FourByteSequence(byte[] bytes) | Returns true if the probe contains at least one GB18030-specific 4-byte sequence.
static boolean | hasGb18030FourByteSequence(byte[] bytes, int offset, int length) |
static boolean | isEbcdicLikely(byte[] bytes) | Returns true if the probe is plausibly EBCDIC based on the word-separator distribution.
-
Field Details
-
MIN_COLUMN_ASYMMETRY_PROBE
public static final int MIN_COLUMN_ASYMMETRY_PROBE
Minimum probe length before has2ByteColumnAsymmetry(byte[]) produces meaningful diversity counts. Short probes or probes with limited vocabulary may have too few distinct byte values per column to compare reliably; on anything below this threshold we fall back to the pre-gate behaviour (model + WideUnicodeDetector positive signal). Set above the size of typical short probes (a few hundred bytes) so real CJK UTF-16 text has room to diversify its high-byte column.
-
-
Method Details
-
checkAscii
public static boolean checkAscii(byte[] bytes)
Returns true if bytes contains no bytes with value >= 0x80 (i.e. pure 7-bit ASCII, which is a strict subset of UTF-8).
-
checkAscii
public static boolean checkAscii(byte[] bytes, int offset, int length) -
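The high-bit test described for checkAscii can be sketched in a few lines. This is an illustration only, not the library's code: the class name AsciiCheckSketch and the method name isPureAscii are hypothetical, and the sketch assumes that offset and length describe a range inside the array.

```java
final class AsciiCheckSketch {
    /** Returns true when no byte in [offset, offset + length) has its high bit set. */
    static boolean isPureAscii(byte[] bytes, int offset, int length) {
        for (int i = offset; i < offset + length; i++) {
            // Java bytes are signed, so test the high bit instead of comparing to 0x80.
            if ((bytes[i] & 0x80) != 0) {
                return false;
            }
        }
        return true;
    }

    /** Convenience overload that scans the whole array. */
    static boolean isPureAscii(byte[] bytes) {
        return isPureAscii(bytes, 0, bytes.length);
    }
}
```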
detectIso2022
Detects ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN by scanning for their characteristic ESC designation sequences.
All three share the ESC $ (0x1B 0x24) prefix, so we must read further to distinguish them:
ISO-2022-JP: ESC $ B     (JIS X 0208-1983)
             ESC $ @     (JIS X 0208-1978)
             ESC $ ( D   (JIS X 0212 supplementary)
ISO-2022-KR: ESC $ ) C
ISO-2022-CN: ESC $ ) A   (GB2312)
             ESC $ ) G   (CNS 11643 plane 1)
             ESC $ * H   (CNS 11643 plane 2)
If ESC $ is found but no recognised third byte follows (or the buffer is too short), ISO-2022-JP is returned as the most common default.
- Returns:
the detected ISO-2022 charset, or null if no ISO-2022 escape sequence is found
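A minimal sketch of the designation-sequence dispatch described for detectIso2022 follows. It is not the library's implementation: the class Iso2022Sketch and its detect method are hypothetical, and the sketch returns the charset name as a String rather than a Charset so that it does not depend on which charsets the running JVM ships.

```java
final class Iso2022Sketch {
    private static final int ESC = 0x1B;

    /** Returns "ISO-2022-JP", "ISO-2022-KR", "ISO-2022-CN", or null when no ESC $ is present. */
    static String detect(byte[] bytes) {
        for (int i = 0; i + 1 < bytes.length; i++) {
            if (bytes[i] != ESC || bytes[i + 1] != '$') {
                continue;
            }
            // Buffer ends right after ESC $: fall back to the JP default named in the description.
            if (i + 2 >= bytes.length) {
                return "ISO-2022-JP";
            }
            int third = bytes[i + 2] & 0xFF;
            int fourth = i + 3 < bytes.length ? bytes[i + 3] & 0xFF : -1;
            if (third == ')' && fourth == 'C') {
                return "ISO-2022-KR";
            }
            if (third == ')' && (fourth == 'A' || fourth == 'G')) {
                return "ISO-2022-CN";
            }
            if (third == '*' && fourth == 'H') {
                return "ISO-2022-CN";
            }
            // ESC $ B, ESC $ @, ESC $ ( D, and anything unrecognised all map to JP.
            return "ISO-2022-JP";
        }
        return null;
    }
}
```

The first ESC $ found decides the result, which keeps the scan linear and matches the "most common default" fallback described above.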
-
detectIso2022
-
checkHz
public static boolean checkHz(byte[] bytes)
Returns true if HZ-GB-2312 switching sequences are present.
HZ is a 7-bit encoding: it uses ~{ (0x7E 0x7B) to enter two-byte GB2312 mode and ~} (0x7E 0x7D) to return to ASCII mode. Like ISO-2022, all bytes are below 0x80, so the model would see no features and must be bypassed with this structural check.
-
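One way to look for the HZ switching sequences described for checkHz is sketched here. The text does not say whether one or both sequences are required, so this sketch assumes that a ~{ shift-in followed later by a ~} shift-out is needed; the names HzSketch and hasHzSwitching are hypothetical.

```java
final class HzSketch {
    /** Returns true when a "~{" shift-in is followed later by a "~}" shift-out. */
    static boolean hasHzSwitching(byte[] bytes) {
        boolean inGbMode = false;
        for (int i = 0; i + 1 < bytes.length; i++) {
            if (bytes[i] != '~') {
                continue;
            }
            if (!inGbMode && bytes[i + 1] == '{') {
                inGbMode = true;
                i++; // skip the '{' that was just consumed
            } else if (inGbMode && bytes[i + 1] == '}') {
                return true;
            }
        }
        return false;
    }
}
```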
checkHz
public static boolean checkHz(byte[] bytes, int offset, int length) -
checkIbm424
public static boolean checkIbm424(byte[] bytes)
Detects IBM424 (EBCDIC Hebrew) by examining the sub-0x80 byte landscape.
Why this is needed
In EBCDIC, the space character is 0x40 (not 0x20 as in ASCII). In IBM424 specifically, the 22 Hebrew base letters plus their five final forms occupy three byte clusters entirely below 0x80:
0x41–0x49   alef … tet      (9 letters)
0x51–0x59   yod … samekh    (9 letters)
0x62–0x6A   ayin … tav      (9 letters + final-pe, tsadi, etc.)
The statistical model ignores all bytes below 0x80, so these letters are invisible to it. This structural rule detects them directly.
Algorithm
- EBCDIC gate: byte 0x40 (EBCDIC space) must appear significantly more often than 0x20 (ASCII space). In normal Latin text 0x40 is the rare @ character; in any EBCDIC text it is the word separator and appears at ~10–20% of bytes.
- Hebrew letter gate: the combined frequency of bytes in the three Hebrew clusters above must exceed 0.12 of the sample length. Genuine Hebrew text has ~65% of its printable characters in these ranges. ASCII text with the same byte values (upper-case A–I, Q–Y, lower-case b–j) stays well below this threshold in practice.
- Returns:
true if the byte stream is almost certainly IBM424
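The two gates described for checkIbm424 can be sketched as below. The Hebrew threshold (0.12 of the sample) comes from the description; the EBCDIC gate is only described as "significantly more often", so the 3× ratio used here is an assumption borrowed from the isEbcdicLikely description. Ibm424Sketch and looksLikeIbm424 are hypothetical names.

```java
final class Ibm424Sketch {
    /** True for bytes in the three IBM424 Hebrew letter clusters. */
    static boolean isHebrewCluster(int b) {
        return (b >= 0x41 && b <= 0x49) || (b >= 0x51 && b <= 0x59) || (b >= 0x62 && b <= 0x6A);
    }

    static boolean looksLikeIbm424(byte[] bytes) {
        if (bytes.length == 0) {
            return false;
        }
        int ebcdicSpace = 0;
        int asciiSpace = 0;
        int hebrew = 0;
        for (byte raw : bytes) {
            int b = raw & 0xFF;
            if (b == 0x40) {
                ebcdicSpace++;
            } else if (b == 0x20) {
                asciiSpace++;
            } else if (isHebrewCluster(b)) {
                hebrew++;
            }
        }
        // Gate 1 (assumed ratio): 0x40 present and at least 3x as frequent as 0x20.
        boolean ebcdicGate = ebcdicSpace > 0 && ebcdicSpace >= 3 * asciiSpace;
        // Gate 2 (threshold from the description): Hebrew clusters exceed 12% of the sample.
        boolean hebrewGate = hebrew > 0.12 * bytes.length;
        return ebcdicGate && hebrewGate;
    }
}
```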
-
isEbcdicLikely
public static boolean isEbcdicLikely(byte[] bytes)
Returns true if the probe is plausibly EBCDIC based on the word-separator distribution. In every EBCDIC variant (IBM420, IBM424, IBM500, IBM1047) the space character is 0x40, not 0x20; a stretch of EBCDIC text therefore has 0x40 as its single most common byte, at roughly 10–20% of the sample. Conversely, any ASCII or ISO-8859-X / windows-12XX / DOS / Mac / CJK text uses 0x20 (or 0x09 / 0x0A) as its whitespace and has 0x40 only as the rare @ character (typically less than 0.1% of bytes).
This is a negative gate: when it returns false, the probe cannot be any EBCDIC variant, and downstream scoring should exclude EBCDIC labels from consideration even if the statistical model ranks them highly.
Threshold rationale: we require both (a) 0x40 at least 3% of the sample and (b) 0x40 at least 3× more frequent than 0x20. Gate (b) alone is not sufficient because sparse binary content can have neither byte; gate (a) alone is not sufficient because some text formats (CSV with @-separated fields, e-mail address lists) can exceed 3% 0x40 while clearly being ASCII-spaced. Both gates together match real EBCDIC text reliably across IBM420/424/500/1047 variants.
- Parameters:
bytes - the probe to analyse
- Returns:
true if the probe's whitespace distribution is consistent with EBCDIC; false if it is clearly ASCII-spaced
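The two thresholds stated for isEbcdicLikely translate directly into integer arithmetic. The sketch below uses the 3% and 3× figures from the description; the class name EbcdicGateSketch is hypothetical, and treating an empty probe as "not EBCDIC" is an assumption.

```java
final class EbcdicGateSketch {
    static boolean isEbcdicLikely(byte[] bytes) {
        if (bytes.length == 0) {
            return false; // assumption: nothing to judge
        }
        int ebcdicSpace = 0; // count of 0x40
        int asciiSpace = 0;  // count of 0x20
        for (byte b : bytes) {
            if (b == 0x40) {
                ebcdicSpace++;
            } else if (b == 0x20) {
                asciiSpace++;
            }
        }
        // Gate (a): 0x40 makes up at least 3% of the sample.
        boolean atLeastThreePercent = ebcdicSpace * 100L >= 3L * bytes.length;
        // Gate (b): 0x40 is at least 3x as frequent as 0x20.
        boolean threeTimesAsciiSpace = ebcdicSpace >= 3L * asciiSpace;
        return atLeastThreePercent && threeTimesAsciiSpace;
    }
}
```

A caller would typically use a false result to mask EBCDIC labels before ranking model output, as the description suggests.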
-
has2ByteColumnAsymmetry
public static boolean has2ByteColumnAsymmetry(byte[] bytes)
Returns true if the probe's byte distribution across stride-2 columns is sufficiently asymmetric to be plausible UTF-16 of some script.
Every UTF-16 variant has one byte column concentrated in a script-specific Unicode block prefix while the other column is diverse: UTF-16 Latin pairs to (ascii, 0x00), so one column is 0x00 (1 value) vs the ASCII range (~70 values); UTF-16 Cyrillic / Greek / Arabic / Hebrew pair to a single high-byte block prefix (0x04, 0x03, 0x06, 0x05); UTF-16 CJK Unified uses 0x4E-0x9F (~80 distinct high bytes) against ~256 low bytes; Hangul uses 0xAC-0xD7 (~44 high bytes).
Non-UTF-16 text, including scattered-null binaries and mixed-content files, has roughly balanced column diversity (both columns saturate near 256 distinct byte values on long probes).
This is a negative gate: when it returns false, the probe cannot be any UTF-16 variant, and UTF-16 labels should be masked from model output even when the stride-2 bigram features score them highly (e.g. a Greek plaintext file with 0.36% scattered nulls being mis-scored as UTF-16-LE).
A diversity ratio of 3× (more diverse column has at least 3× as many distinct values as the more concentrated column) admits all UTF-16 variants including CJK (ratio ~3.2) while rejecting scattered-null false positives (ratio ~1:1).
For probes shorter than MIN_COLUMN_ASYMMETRY_PROBE, this method returns true conservatively: column counts from short samples are not statistically meaningful, so the caller should rely on WideUnicodeDetector positive signal and downstream CharSoup arbitration rather than masking.
- Parameters:
bytes - the probe to analyse
- Returns:
true if the probe has UTF-16-compatible column asymmetry (or is too short to judge); false if column diversity is too balanced to be any UTF-16 variant
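The column-diversity comparison described for has2ByteColumnAsymmetry can be sketched with two BitSets, one per stride-2 column. The 3× ratio and the conservative short-probe default come from the description; the value of MIN_PROBE below is a hypothetical stand-in, because the text does not state the value of MIN_COLUMN_ASYMMETRY_PROBE. ColumnAsymmetrySketch is likewise a hypothetical name.

```java
import java.util.BitSet;

final class ColumnAsymmetrySketch {
    /** Hypothetical stand-in for MIN_COLUMN_ASYMMETRY_PROBE; the real value is not given here. */
    static final int MIN_PROBE = 512;

    static boolean hasColumnAsymmetry(byte[] bytes) {
        if (bytes.length < MIN_PROBE) {
            return true; // too short to judge, so do not mask UTF-16
        }
        BitSet evenColumn = new BitSet(256);
        BitSet oddColumn = new BitSet(256);
        for (int i = 0; i + 1 < bytes.length; i += 2) {
            evenColumn.set(bytes[i] & 0xFF);
            oddColumn.set(bytes[i + 1] & 0xFF);
        }
        int even = evenColumn.cardinality();
        int odd = oddColumn.cardinality();
        int moreDiverse = Math.max(even, odd);
        int moreConcentrated = Math.min(even, odd);
        // The more diverse column needs at least 3x the distinct values of the other one.
        return moreDiverse >= 3 * moreConcentrated;
    }
}
```

Taking the maximum and minimum of the two counts makes the check independent of byte order, so the same test covers UTF-16LE and UTF-16BE.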
-
has2ByteColumnAsymmetryEvidence
public static boolean has2ByteColumnAsymmetryEvidence(byte[] bytes)
Evidence-based variant of has2ByteColumnAsymmetry(byte[]) with no conservative short-probe default: returns true only when the bytes themselves demonstrate column asymmetry, regardless of probe length. Use this to gate positive UTF-16 detection (e.g. invoking Utf16SpecialistEncodingDetector), where absence of evidence must mean "not UTF-16", not "unknown".
Rejects probes below 16 bytes outright: with fewer than 8 pairs, column-distinct counts don't discriminate any UTF-16 variant from legacy double-byte encodings like GBK or Shift_JIS, which also have constrained lead-byte columns on short samples.
-
checkIbm424
public static boolean checkIbm424(byte[] bytes, int offset, int length) -
checkIbm500
public static boolean checkIbm500(byte[] bytes)
Detects IBM500 (International EBCDIC / EBCDIC-500) by looking for the combination of the EBCDIC space byte and high-byte Latin letter density.
Why this is needed
In IBM500 every Latin letter is encoded as a byte ≥ 0x80:
0x81–0x89   a–i (lowercase)
0x91–0x99   j–r (lowercase)
0xA2–0xA9   s–z (lowercase)
0xC1–0xC9   A–I (uppercase)
0xD1–0xD9   J–R (uppercase)
0xE2–0xE9   S–Z (uppercase)
At full probe length the statistical model distinguishes IBM500 from IBM424 without difficulty. At very short probes (≤ 20 bytes) the model sees too few bytes to be confident and tends to confuse the two EBCDIC code pages. This structural gate fires early, before the model, using the cheap EBCDIC-space dominance check followed by a Latin-letter density check.
Algorithm
- EBCDIC gate: same as checkIbm424(byte[]): byte 0x40 must dominate over 0x20. This distinguishes any EBCDIC encoding from ASCII/UTF-8/Latin-1 where 0x40 is the rare @ character.
- Latin letter density: the combined frequency of bytes in the six IBM500 Latin-letter clusters above must exceed 0.25 of the sample. Normal Latin text has ~60–70% letter bytes; the threshold is intentionally conservative to fire reliably at 20 bytes.
checkIbm424(byte[]) should be called first. If it fires the probe is IBM424 (Hebrew EBCDIC); only if it does not fire should this method be consulted for IBM500.
- Returns:
true if the byte stream is almost certainly IBM500
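The IBM500 check follows the same two-gate shape, with the six Latin-letter clusters in place of the Hebrew ones. In the sketch below the 0.25 density threshold comes from the description, while the 3× form of the "0x40 must dominate" gate is an assumption; Ibm500Sketch and looksLikeIbm500 are hypothetical names.

```java
final class Ibm500Sketch {
    /** True for bytes in the six IBM500 Latin-letter clusters listed in the description. */
    static boolean isLatinLetter(int b) {
        return (b >= 0x81 && b <= 0x89) || (b >= 0x91 && b <= 0x99) || (b >= 0xA2 && b <= 0xA9)
                || (b >= 0xC1 && b <= 0xC9) || (b >= 0xD1 && b <= 0xD9) || (b >= 0xE2 && b <= 0xE9);
    }

    static boolean looksLikeIbm500(byte[] bytes) {
        if (bytes.length == 0) {
            return false;
        }
        int ebcdicSpace = 0;
        int asciiSpace = 0;
        int letters = 0;
        for (byte raw : bytes) {
            int b = raw & 0xFF;
            if (b == 0x40) {
                ebcdicSpace++;
            } else if (b == 0x20) {
                asciiSpace++;
            } else if (isLatinLetter(b)) {
                letters++;
            }
        }
        // Gate 1 (assumed ratio): 0x40 present and at least 3x as frequent as 0x20.
        boolean ebcdicGate = ebcdicSpace > 0 && ebcdicSpace >= 3 * asciiSpace;
        // Gate 2 (threshold from the description): letter clusters exceed 25% of the sample.
        boolean letterGate = letters > 0.25 * bytes.length;
        return ebcdicGate && letterGate;
    }
}
```

Following the ordering advice above, a caller would consult a check like this only after the IBM424 check has declined the probe.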
-
checkIbm500
public static boolean checkIbm500(byte[] bytes, int offset, int length) -
hasCrlfBytes
public static boolean hasCrlfBytes(byte[] bytes)
Returns true if the probe contains at least one CRLF pair (0x0D 0x0A).
Files originating on Windows use CRLF as the line separator. The presence of a 0x0D 0x0A pair in a probe that is otherwise 7-bit ASCII is weak evidence that the file was created on Windows and therefore more likely to use a Windows code page (e.g. windows-1252) than a Unix-origin ISO-8859-X encoding for any high-byte content beyond the probe window.
A bare 0x0D without a following 0x0A is not counted: classic Mac OS used bare CR as its line ending, and that is a different case that does not imply Windows origin.
-
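The CRLF rule for hasCrlfBytes amounts to a scan for an adjacent 0x0D 0x0A pair. A minimal sketch, with the hypothetical names CrlfSketch and hasCrlf:

```java
final class CrlfSketch {
    /** Returns true when a 0x0D byte is immediately followed by 0x0A. */
    static boolean hasCrlf(byte[] bytes) {
        for (int i = 0; i + 1 < bytes.length; i++) {
            if (bytes[i] == 0x0D && bytes[i + 1] == 0x0A) {
                return true;
            }
        }
        return false; // bare CR or bare LF does not count
    }
}
```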
hasCrlfBytes
public static boolean hasCrlfBytes(byte[] bytes, int offset, int length) -
hasC1Bytes
public static boolean hasC1Bytes(byte[] bytes)
Returns true if the probe contains any byte in the C1 control range 0x80–0x9F.
In every ISO-8859-X encoding those byte values are C1 control characters that never appear in real text. In every Windows-12XX encoding they are printable characters (smart quotes, Euro sign, em-dash, …). Their presence is therefore definitive proof that the content is not a valid ISO-8859-X encoding and should be attributed to the corresponding Windows-12XX variant instead.
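The C1 range test for hasC1Bytes can be sketched as a single pass over the probe. C1Sketch and hasC1 are hypothetical names; the only subtlety is masking Java's signed bytes before the range comparison.

```java
final class C1Sketch {
    /** Returns true when any byte falls in 0x80-0x9F. */
    static boolean hasC1(byte[] bytes) {
        for (byte raw : bytes) {
            int b = raw & 0xFF; // unsigned view of the signed Java byte
            if (b >= 0x80 && b <= 0x9F) {
                return true;
            }
        }
        return false;
    }
}
```

For example, 0x93 and 0x94 are the curly double quotes in windows-1252, so a probe containing them would steer a caller away from ISO-8859-1.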
-
hasC1Bytes
public static boolean hasC1Bytes(byte[] bytes, int offset, int length) -
hasGb18030FourByteSequence
public static boolean hasGb18030FourByteSequence(byte[] bytes)
Returns true if the probe contains at least one GB18030-specific 4-byte sequence.
GB18030 4-byte structure
Byte 1 (lead):    0x81–0xFE
Byte 2 (second):  0x30–0x39   ← ASCII digits
Byte 3 (third):   0x81–0xFE
Byte 4 (trail):   0x30–0x39   ← ASCII digits
In GBK and GB2312 all trail bytes are in 0x40–0xFE, so a digit (0x30–0x39) in the second or fourth position is impossible. A single matching 4-tuple is therefore definitive proof that the content was encoded with GB18030 and must be decoded with a GB18030-capable codec to avoid replacement characters for the affected code points.
-
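The 4-byte pattern described for hasGb18030FourByteSequence can be matched with a sliding window. This sketch checks every offset rather than tracking character boundaries, which is an assumption about how such a scan might be written; Gb18030Sketch and hasFourByteSequence are hypothetical names.

```java
final class Gb18030Sketch {
    private static boolean isHighByte(byte raw) {
        int b = raw & 0xFF;
        return b >= 0x81 && b <= 0xFE;
    }

    private static boolean isDigitByte(byte raw) {
        return raw >= 0x30 && raw <= 0x39;
    }

    /** Returns true when some window matches high, digit, high, digit. */
    static boolean hasFourByteSequence(byte[] bytes) {
        for (int i = 0; i + 3 < bytes.length; i++) {
            if (isHighByte(bytes[i]) && isDigitByte(bytes[i + 1])
                    && isHighByte(bytes[i + 2]) && isDigitByte(bytes[i + 3])) {
                return true;
            }
        }
        return false;
    }
}
```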
hasGb18030FourByteSequence
public static boolean hasGb18030FourByteSequence(byte[] bytes, int offset, int length) -
checkIso2022Jp
Deprecated. Use detectIso2022(byte[]), which distinguishes JP/KR/CN.
-
checkUtf8
Validates the UTF-8 byte grammar of the sample and returns one of three outcomes:
- StructuralEncodingRules.Utf8Result.LIKELY_UTF8: all multi-byte sequences are valid and the sample contains enough high bytes to be informative. Use UTF-8.
- StructuralEncodingRules.Utf8Result.NOT_UTF8: at least one invalid byte sequence was found. Remove UTF-8 from the candidate set.
- StructuralEncodingRules.Utf8Result.AMBIGUOUS: the sample is structurally valid UTF-8 but contains very few high bytes (almost pure ASCII), so validity is uninformative. Pass to the model.
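The three outcomes can be illustrated with the JDK's strict UTF-8 decoder standing in for a hand-written grammar check. This is a sketch under assumptions, not the documented implementation: Utf8CheckSketch and its Result enum are hypothetical, MIN_HIGH_BYTES is an invented cut-off (the text only says "enough high bytes"), and a multi-byte sequence cut off at the end of the probe is treated as invalid here, which a real probe-based check might tolerate.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8CheckSketch {
    enum Result { LIKELY_UTF8, NOT_UTF8, AMBIGUOUS }

    /** Hypothetical cut-off for "enough high bytes to be informative". */
    static final int MIN_HIGH_BYTES = 4;

    static Result check(byte[] bytes) {
        try {
            // REPORT makes the decoder throw instead of substituting U+FFFD.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
        } catch (CharacterCodingException e) {
            return Result.NOT_UTF8;
        }
        int highBytes = 0;
        for (byte b : bytes) {
            if ((b & 0x80) != 0) {
                highBytes++;
            }
        }
        // Valid but nearly pure ASCII says little, so defer to the model.
        return highBytes >= MIN_HIGH_BYTES ? Result.LIKELY_UTF8 : Result.AMBIGUOUS;
    }
}
```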
-
checkUtf8
-
countUtf8Errors
public static int countUtf8Errors(byte[] bytes)
Counts the number of malformed UTF-8 sequences in the sample: one event per bad lead, orphaned continuation, overlong, surrogate, or out-of-range codepoint, regardless of how many bytes the bad sequence spans. Unlike checkUtf8(byte[]), this does not early-exit on the first bad sequence; it scans the entire range, resyncing after each error. Returns 0 for a clean UTF-8 stream.
Useful for "tolerant" UTF-8 acceptance: a real-world UTF-8 file with a few corrupted sequences (copy-paste artefact, truncated upstream, MIME transport flip) should still be recognized as UTF-8 rather than rejected outright. Caller decides what error count is tolerable (typically as a fraction of probe length).
The count matches the U+FFFD-per-error semantics of Java's new String(bytes, UTF_8) (one replacement per malformed sequence).
- Returns:
number of malformed UTF-8 sequence events
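A resyncing error counter in the spirit of countUtf8Errors can be sketched as follows. The counting rule here is this sketch's own: a lead byte plus whatever continuation bytes follow it form one unit, and the unit counts as one error if it is incomplete, overlong, a surrogate, or above U+10FFFF. Edge cases may differ from the documented method and from the JDK decoder; Utf8ErrorCountSketch and countErrors are hypothetical names.

```java
final class Utf8ErrorCountSketch {
    static int countErrors(byte[] bytes) {
        int errors = 0;
        int i = 0;
        while (i < bytes.length) {
            int b = bytes[i] & 0xFF;
            if (b < 0x80) {
                i++; // ASCII
                continue;
            }
            int needed; // continuation bytes required by this lead
            int min;    // smallest code point that legitimately needs this length
            int cp;     // code point accumulated so far
            if (b >= 0xC0 && b <= 0xDF) {
                needed = 1; min = 0x80; cp = b & 0x1F;
            } else if (b >= 0xE0 && b <= 0xEF) {
                needed = 2; min = 0x800; cp = b & 0x0F;
            } else if (b >= 0xF0 && b <= 0xF7) {
                needed = 3; min = 0x10000; cp = b & 0x07;
            } else {
                errors++; // orphaned continuation (0x80-0xBF) or invalid lead (0xF8-0xFF)
                i++;
                continue;
            }
            int j = i + 1;
            while (j <= i + needed && j < bytes.length && (bytes[j] & 0xC0) == 0x80) {
                cp = (cp << 6) | (bytes[j] & 0x3F);
                j++;
            }
            boolean complete = (j == i + needed + 1);
            boolean surrogate = cp >= 0xD800 && cp <= 0xDFFF;
            if (!complete || cp < min || cp > 0x10FFFF || surrogate) {
                errors++; // one event for the whole unit
            }
            i = j; // resync at the first byte that was not consumed
        }
        return errors;
    }
}
```

A caller implementing tolerant acceptance might compare the returned count against a small fraction of the probe length, as the description suggests.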
-
countUtf8Errors
public static int countUtf8Errors(byte[] bytes, int offset, int length)