org.apache.tika.ml.chardetect.StructuralEncodingRules

public final class StructuralEncodingRules extends Object

Fast, rule-based encoding checks that run before the statistical model.

Pipeline

checkAscii(byte[]): no bytes >= 0x80 → UTF-8 (ASCII is a subset)
detectIso2022(byte[]): ISO-2022 escape sequences present → ISO-2022-JP, ISO-2022-KR, or ISO-2022-CN depending on the designation sequence
checkUtf8(byte[]): validate UTF-8 multi-byte grammar; returns a StructuralEncodingRules.Utf8Result indicating whether the bytes are definitively UTF-8, definitively not UTF-8, or ambiguous (pass to model).

UTF-16/32 detection is handled upstream by org.apache.tika.utils.ByteEncodingHint and is not repeated here.

IBM424 (EBCDIC Hebrew) is detected via checkIbm424(byte[]): the Hebrew letters in this code page occupy bytes 0x41–0x6A, which fall entirely below the 0x80 threshold used by the statistical model's feature extractor. The EBCDIC space (0x40) vs ASCII space (0x20) frequency ratio provides a cheap first-pass EBCDIC gate before the Hebrew letter frequencies are checked.

All methods are stateless and safe to call from multiple threads.

Nested Class Summary

Nested Classes

Modifier and Type

Class

Description

static enum

StructuralEncodingRules.Utf8Result

Outcome of the UTF-8 structural check.
Field Summary

Fields

Modifier and Type

Field

Description

static final int

MIN_COLUMN_ASYMMETRY_PROBE

Minimum probe length before has2ByteColumnAsymmetry(byte[]) produces meaningful diversity counts.
Method Summary

Modifier and Type

Method

Description

static boolean

checkAscii(byte[] bytes)

Returns true if bytes contains no bytes with value >= 0x80 (i.e. pure 7-bit ASCII, which is a strict subset of UTF-8).

static boolean

checkAscii(byte[] bytes, int offset, int length)

static boolean

checkHz(byte[] bytes)

Returns true if HZ-GB-2312 switching sequences are present.

static boolean

checkHz(byte[] bytes, int offset, int length)

static boolean

checkIbm424(byte[] bytes)

Detects IBM424 (EBCDIC Hebrew) by examining the sub-0x80 byte landscape.

static boolean

checkIbm424(byte[] bytes, int offset, int length)

static boolean

checkIbm500(byte[] bytes)

Detects IBM500 (International EBCDIC / EBCDIC-500) by looking for the combination of the EBCDIC space byte and high-byte Latin letter density.

static boolean

checkIbm500(byte[] bytes, int offset, int length)

static boolean

checkIso2022Jp(byte[] bytes)

Deprecated.
Use detectIso2022(byte[]) which distinguishes JP/KR/CN.

static StructuralEncodingRules.Utf8Result

checkUtf8(byte[] bytes)

Validates the UTF-8 byte grammar of the sample and returns one of three outcomes: LIKELY_UTF8, NOT_UTF8, or AMBIGUOUS.

static StructuralEncodingRules.Utf8Result

checkUtf8(byte[] bytes, int offset, int length)

static int

countUtf8Errors(byte[] bytes)

Counts the number of malformed UTF-8 sequences in the sample — one event per bad lead, orphaned continuation, overlong, surrogate, or out-of-range codepoint, regardless of how many bytes the bad sequence spans.

static int

countUtf8Errors(byte[] bytes, int offset, int length)

static Charset

detectIso2022(byte[] bytes)

Detects ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN by scanning for their characteristic ESC designation sequences.

static Charset

detectIso2022(byte[] bytes, int offset, int length)

static boolean

has2ByteColumnAsymmetry(byte[] bytes)

Returns true if the probe's byte distribution across stride-2 columns is sufficiently asymmetric to be plausible UTF-16 of some script.

static boolean

has2ByteColumnAsymmetryEvidence(byte[] bytes)

Evidence-based variant of has2ByteColumnAsymmetry(byte[]) with no conservative short-probe default: returns true only when the bytes themselves demonstrate column asymmetry, regardless of probe length.

static boolean

hasC1Bytes(byte[] bytes)

Returns true if the probe contains any byte in the C1 control range 0x80–0x9F.

static boolean

hasC1Bytes(byte[] bytes, int offset, int length)

static boolean

hasCrlfBytes(byte[] bytes)

Returns true if the probe contains at least one CRLF pair (0x0D 0x0A).

static boolean

hasCrlfBytes(byte[] bytes, int offset, int length)

static boolean

hasGb18030FourByteSequence(byte[] bytes)

Returns true if the probe contains at least one GB18030-specific 4-byte sequence.

static boolean

hasGb18030FourByteSequence(byte[] bytes, int offset, int length)

static boolean

isEbcdicLikely(byte[] bytes)

Returns true if the probe is plausibly EBCDIC based on the word-separator distribution.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- MIN_COLUMN_ASYMMETRY_PROBE
  
  public static final int MIN_COLUMN_ASYMMETRY_PROBE
  
  Minimum probe length before has2ByteColumnAsymmetry(byte[]) produces meaningful diversity counts. Short probes or probes with limited vocabulary may have too few distinct byte values per column to compare reliably; on anything below this threshold we fall back to the pre-gate behaviour (model + WideUnicodeDetector positive signal). Set above the size of typical short probes (a few hundred bytes) so real CJK UTF-16 text has room to diversify its high-byte column.
  See Also:
  
  Constant Field Values
Method Details
- checkAscii
  
  public static boolean checkAscii(byte[] bytes)
  
  Returns true if bytes contains no bytes with value >= 0x80 (i.e. pure 7-bit ASCII, which is a strict subset of UTF-8).
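As an illustration only, the check described above can be sketched in a few lines. The class and method names below are hypothetical and are not part of Tika; the sketch only shows the high-bit test, assuming the whole array is the probe.

```java
import java.nio.charset.StandardCharsets;

class AsciiCheckSketch {
    // True when no byte has its high bit set, i.e. every value is below 0x80.
    static boolean isPureAscii(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) { // Java bytes are signed: 0x80..0xFF read as negative values
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPureAscii("plain text".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(isPureAscii(new byte[] {'a', (byte) 0xE9}));
    }
}
```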
- checkAscii
  
  public static boolean checkAscii(byte[] bytes, int offset, int length)
- detectIso2022
  
  public static Charset detectIso2022(byte[] bytes)
  Detects ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN by scanning for their characteristic ESC designation sequences.
  All three share the ESC $ (0x1B 0x24) prefix, so we must read further to distinguish them:
  
  ISO-2022-JP:
    ESC $ B (JIS X 0208-1983)
    ESC $ @ (JIS X 0208-1978)
    ESC $ ( D (JIS X 0212 supplementary)
  ISO-2022-KR:
    ESC $ ) C
  ISO-2022-CN:
    ESC $ ) A (GB2312)
    ESC $ ) G (CNS 11643 plane 1)
    ESC $ * H (CNS 11643 plane 2)
  
  If ESC $ is found but no recognised third byte follows (or the buffer is too short), ISO-2022-JP is returned as the most common default.
  Returns:
  
  the detected ISO-2022 charset, or null if no ISO-2022 escape sequence is found
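The designation-sequence lookup above might be sketched roughly as follows. This is not the Tika source: the class name is hypothetical, the result is returned as a charset name string rather than a Charset, and the sketch assumes that the first ESC $ found decides the result.

```java
class Iso2022Sketch {
    // Returns an ISO-2022 charset name, or null when no ESC $ prefix is present.
    static String detect(byte[] b) {
        for (int i = 0; i + 1 < b.length; i++) {
            if (b[i] != 0x1B || b[i + 1] != '$') {
                continue; // not the shared ESC $ prefix
            }
            int third = i + 2 < b.length ? b[i + 2] : -1;
            int fourth = i + 3 < b.length ? b[i + 3] : -1;
            if (third == ')' && fourth == 'C') {
                return "ISO-2022-KR";
            }
            if (third == ')' && (fourth == 'A' || fourth == 'G')) {
                return "ISO-2022-CN";
            }
            if (third == '*' && fourth == 'H') {
                return "ISO-2022-CN";
            }
            // ESC $ B, ESC $ @, ESC $ ( D, and anything unrecognised or truncated
            // all fall through to the documented ISO-2022-JP default.
            return "ISO-2022-JP";
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(detect(new byte[] {0x1B, '$', ')', 'C', 'x'}));
        System.out.println(detect(new byte[] {'a', 'b', 'c'}));
    }
}
```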
- detectIso2022
  
  public static Charset detectIso2022(byte[] bytes, int offset, int length)
- checkHz
  
  public static boolean checkHz(byte[] bytes)
  
  Returns true if HZ-GB-2312 switching sequences are present.
  HZ is a 7-bit encoding: it uses ~{ (0x7E 0x7B) to enter two-byte GB2312 mode and ~} (0x7E 0x7D) to return to ASCII mode. Like ISO-2022, all bytes are below 0x80, so the model would see no features and must be bypassed with this structural check.
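A minimal sketch of such a scan is shown below. The names are hypothetical, and the sketch assumes that a single shift-in pair (0x7E 0x7B) is enough to report HZ; the actual rule may require more evidence, such as a matching shift-out.

```java
import java.nio.charset.StandardCharsets;

class HzSketch {
    // True when the two-byte sequence 0x7E 0x7B (shift into GB2312 mode) occurs.
    static boolean hasHzShiftIn(byte[] b) {
        for (int i = 0; i + 1 < b.length; i++) {
            if (b[i] == 0x7E && b[i + 1] == 0x7B) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] hz = "abc ~{<:Ky2;~} def".getBytes(StandardCharsets.US_ASCII);
        System.out.println(hasHzShiftIn(hz));
    }
}
```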
- checkHz
  
  public static boolean checkHz(byte[] bytes, int offset, int length)
- checkIbm424
  
  public static boolean checkIbm424(byte[] bytes)
  Detects IBM424 (EBCDIC Hebrew) by examining the sub-0x80 byte landscape.
  Why this is needed
  
  In EBCDIC, the space character is 0x40 (not 0x20 as in ASCII). In IBM424 specifically, the 22 Hebrew base letters plus their five final forms occupy three byte clusters entirely below 0x80:
  
  0x41–0x49 alef … tet (9 letters)
  0x51–0x59 yod … samekh (9 letters)
  0x62–0x6A ayin … tav (9 letters + final-pe, tsadi, etc.)
  
  The statistical model ignores all bytes below 0x80, so these letters are invisible to it. This structural rule detects them directly.
  
  Algorithm
  
  EBCDIC gate: byte 0x40 (EBCDIC space) must appear significantly more often than 0x20 (ASCII space). In normal Latin text 0x40 is the rare @ character; in any EBCDIC text it is the word separator and appears at ~10–20% of bytes.
  
  Hebrew letter gate: the combined frequency of bytes in the three Hebrew clusters above must exceed 0.12 of the sample length. Genuine Hebrew text has ~65% of its printable characters in these ranges. ASCII text with the same byte values (upper-case A–I, Q–Y, lower-case b–j) stays well below this threshold in practice.
  Returns:
  
  true if the byte stream is almost certainly IBM424
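The two gates in the algorithm above could be sketched as follows. The names are hypothetical, and the 3× space ratio is an assumption: the description only says 0x40 must appear "significantly more often" than 0x20. The 0.12 Hebrew-cluster threshold is the documented one.

```java
class Ibm424Sketch {
    static final int ASSUMED_SPACE_RATIO = 3;     // assumption, not documented for this method
    static final double HEBREW_FRACTION = 0.12;   // documented threshold

    static boolean looksLikeIbm424(byte[] b) {
        if (b.length == 0) {
            return false;
        }
        int ebcdicSpaces = 0;
        int asciiSpaces = 0;
        int hebrew = 0;
        for (byte raw : b) {
            int v = raw & 0xFF;
            if (v == 0x40) {
                ebcdicSpaces++;
            } else if (v == 0x20) {
                asciiSpaces++;
            } else if ((v >= 0x41 && v <= 0x49) || (v >= 0x51 && v <= 0x59)
                    || (v >= 0x62 && v <= 0x6A)) {
                hebrew++; // one of the three Hebrew letter clusters
            }
        }
        boolean ebcdicGate = ebcdicSpaces > 0
                && ebcdicSpaces >= ASSUMED_SPACE_RATIO * asciiSpaces;
        boolean hebrewGate = hebrew > HEBREW_FRACTION * b.length;
        return ebcdicGate && hebrewGate;
    }

    public static void main(String[] args) {
        byte[] probe = {0x41, 0x42, 0x40, 0x51, 0x52, 0x40, 0x62, 0x63};
        System.out.println(looksLikeIbm424(probe));
    }
}
```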
- isEbcdicLikely
  
  public static boolean isEbcdicLikely(byte[] bytes)
  
  Returns true if the probe is plausibly EBCDIC based on the word-separator distribution. In every EBCDIC variant (IBM420, IBM424, IBM500, IBM1047) the space character is 0x40, not 0x20; a stretch of EBCDIC text therefore has 0x40 as its single most common byte, at roughly 10–20% of the sample. Conversely, any ASCII or ISO-8859-X / windows-12XX / DOS / Mac / CJK text uses 0x20 (or 0x09 / 0x0A) as its whitespace and has 0x40 only as the rare @ character (typically less than 0.1% of bytes).
  This is a negative gate: when it returns false, the probe cannot be any EBCDIC variant, and downstream scoring should exclude EBCDIC labels from consideration even if the statistical model ranks them highly.
  
  Threshold rationale: we require both (a) 0x40 making up at least 3% of the sample and (b) 0x40 being at least 3× as frequent as 0x20. Gate (b) alone is not sufficient because sparse binary content can contain neither byte, which satisfies the ratio trivially; gate (a) alone is not sufficient because some text formats (CSV with @-separated fields, e-mail address lists) can exceed 3% 0x40 while clearly being ASCII-spaced. Both gates together match real EBCDIC text reliably across IBM420/424/500/1047 variants.
  
  Parameters:
  
  bytes - the probe to analyse
  
  Returns:
  
  true if the probe's whitespace distribution is consistent with EBCDIC; false if it is clearly ASCII-spaced
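The two thresholds from the rationale above translate into a short counting loop. The sketch below uses hypothetical names and assumes an empty probe is rejected; the 3% and 3× figures are taken from the description.

```java
class EbcdicGateSketch {
    static boolean isEbcdicLikely(byte[] b) {
        if (b.length == 0) {
            return false; // assumption: nothing to judge
        }
        int count40 = 0;
        int count20 = 0;
        for (byte x : b) {
            if (x == 0x40) {
                count40++;
            } else if (x == 0x20) {
                count20++;
            }
        }
        boolean gateA = count40 * 100L >= 3L * b.length; // (a) 0x40 is at least 3% of the sample
        boolean gateB = count40 >= 3L * count20;         // (b) 0x40 is at least 3x as frequent as 0x20
        return gateA && gateB;
    }

    public static void main(String[] args) {
        // "hello world" in an EBCDIC Latin code page: letters are high bytes, space is 0x40.
        byte[] ebcdic = {(byte) 0x88, (byte) 0x85, (byte) 0x93, (byte) 0x93, (byte) 0x96, 0x40,
                (byte) 0xA6, (byte) 0x96, (byte) 0x99, (byte) 0x93, (byte) 0x84};
        System.out.println(isEbcdicLikely(ebcdic));
    }
}
```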
- has2ByteColumnAsymmetry
  
  public static boolean has2ByteColumnAsymmetry(byte[] bytes)
  
  Returns true if the probe's byte distribution across stride-2 columns is sufficiently asymmetric to be plausible UTF-16 of some script.
  Every UTF-16 variant has one byte column concentrated in a script-specific Unicode block prefix while the other column is diverse:
  
  UTF-16 Latin pairs to (ascii, 0x00), so one column is 0x00 (1 value) vs the ASCII range (~70 values).
  UTF-16 Cyrillic / Greek / Arabic / Hebrew pair to a single high-byte block prefix (0x04, 0x03, 0x06, 0x05).
  UTF-16 CJK Unified uses 0x4E-0x9F (~80 distinct high bytes) against ~256 low bytes.
  Hangul uses 0xAC-0xD7 (~44 high bytes).
  
  Non-UTF-16 text — including scattered-null binaries and mixed-content files — has roughly balanced column diversity (both columns saturate near 256 distinct byte values on long probes).
  
  This is a negative gate: when it returns false, the probe cannot be any UTF-16 variant, and UTF-16 labels should be masked from model output even when the stride-2 bigram features score them highly (e.g. a Greek plaintext file with 0.36% scattered nulls being mis-scored as UTF-16-LE).
  
  A diversity ratio of 3× (more diverse column has at least 3× as many distinct values as the more concentrated column) admits all UTF-16 variants including CJK (ratio ~3.2) while rejecting scattered-null false positives (ratio ~1:1).
  
  For probes shorter than MIN_COLUMN_ASYMMETRY_PROBE, this method returns true conservatively — column counts from short samples are not statistically meaningful, so the caller should rely on WideUnicodeDetector positive signal and downstream CharSoup arbitration rather than masking.
  
  Parameters:
  
  bytes - the probe to analyse
  
  Returns:
  
  true if the probe has UTF-16-compatible column asymmetry (or is too short to judge); false if column diversity is too balanced to be any UTF-16 variant
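One way to picture the column-diversity comparison is the sketch below. The names are hypothetical, the minimum probe length is passed in as a parameter because the value of MIN_COLUMN_ASYMMETRY_PROBE is not stated here, and rejecting a probe with an empty column is an assumption. The 3× ratio and the conservative short-probe default follow the description.

```java
import java.nio.charset.StandardCharsets;

class ColumnAsymmetrySketch {
    // Number of distinct byte values at positions start, start + 2, start + 4, ...
    static int distinct(byte[] b, int start) {
        boolean[] seen = new boolean[256];
        int n = 0;
        for (int i = start; i < b.length; i += 2) {
            int v = b[i] & 0xFF;
            if (!seen[v]) {
                seen[v] = true;
                n++;
            }
        }
        return n;
    }

    static boolean hasAsymmetry(byte[] b, int minProbe) {
        if (b.length < minProbe) {
            return true; // too short to judge: conservative default
        }
        int even = distinct(b, 0);
        int odd = distinct(b, 1);
        int lo = Math.min(even, odd);
        int hi = Math.max(even, odd);
        return lo > 0 && hi >= 3 * lo; // diverse column has at least 3x the distinct values
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog";
        System.out.println(hasAsymmetry(text.getBytes(StandardCharsets.UTF_16LE), 16));
        System.out.println(hasAsymmetry(text.getBytes(StandardCharsets.US_ASCII), 16));
    }
}
```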
- has2ByteColumnAsymmetryEvidence
  
  public static boolean has2ByteColumnAsymmetryEvidence(byte[] bytes)
  
  Evidence-based variant of has2ByteColumnAsymmetry(byte[]) with no conservative short-probe default: returns true only when the bytes themselves demonstrate column asymmetry, regardless of probe length. Use this to gate positive UTF-16 detection (e.g. invoking Utf16SpecialistEncodingDetector), where absence of evidence must mean "not UTF-16", not "unknown".
  Rejects probes below 16 bytes outright: with fewer than 8 pairs, column-distinct counts don't discriminate any UTF-16 variant from legacy double-byte encodings like GBK or Shift_JIS, which also have constrained lead-byte columns on short samples.
- checkIbm424
  
  public static boolean checkIbm424(byte[] bytes, int offset, int length)
- checkIbm500
  
  public static boolean checkIbm500(byte[] bytes)
  Detects IBM500 (International EBCDIC / EBCDIC-500) by looking for the combination of the EBCDIC space byte and high-byte Latin letter density.
  Why this is needed
  
  In IBM500 every Latin letter is encoded as a byte ≥ 0x80:
  
  0x81–0x89 a–i (lowercase)
  0x91–0x99 j–r (lowercase)
  0xA2–0xA9 s–z (lowercase)
  0xC1–0xC9 A–I (uppercase)
  0xD1–0xD9 J–R (uppercase)
  0xE2–0xE9 S–Z (uppercase)
  
  At full probe length the statistical model distinguishes IBM500 from IBM424 without difficulty. At very short probes (≤ 20 bytes) the model sees too few bytes to be confident and tends to confuse the two EBCDIC code pages. This structural gate fires early — before the model — using the cheap EBCDIC-space dominance check followed by a Latin-letter density check.
  
  Algorithm
  
  EBCDIC gate: same as checkIbm424(byte[]) — byte 0x40 must dominate over 0x20. This distinguishes any EBCDIC encoding from ASCII/UTF-8/Latin-1 where 0x40 is the rare @ character.
  
  Latin letter density: the combined frequency of bytes in the six IBM500 Latin-letter clusters above must exceed 0.25 of the sample. Normal Latin text has ~60–70% letter bytes; the threshold is intentionally conservative to fire reliably at 20 bytes.
  
  checkIbm424(byte[]) should be called first. If it fires the probe is IBM424 (Hebrew EBCDIC); only if it does not fire should this method be consulted for IBM500.
  Returns:
  
  true if the byte stream is almost certainly IBM500
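The Latin-letter density step can be sketched on its own, as below. The names are hypothetical and the EBCDIC space gate is deliberately left out, so this is only the second half of the algorithm; the six clusters and the 0.25 threshold come from the description.

```java
class Ibm500Sketch {
    // True when v falls in one of the six IBM500 Latin-letter clusters.
    static boolean isIbm500Letter(int v) {
        return (v >= 0x81 && v <= 0x89) || (v >= 0x91 && v <= 0x99) || (v >= 0xA2 && v <= 0xA9)
                || (v >= 0xC1 && v <= 0xC9) || (v >= 0xD1 && v <= 0xD9) || (v >= 0xE2 && v <= 0xE9);
    }

    // Letter-density gate only; the EBCDIC space gate is assumed to have passed already.
    static boolean hasLatinLetterDensity(byte[] b) {
        if (b.length == 0) {
            return false;
        }
        int letters = 0;
        for (byte raw : b) {
            if (isIbm500Letter(raw & 0xFF)) {
                letters++;
            }
        }
        return letters > 0.25 * b.length;
    }

    public static void main(String[] args) {
        // "hello world" in IBM500.
        byte[] probe = {(byte) 0x88, (byte) 0x85, (byte) 0x93, (byte) 0x93, (byte) 0x96, 0x40,
                (byte) 0xA6, (byte) 0x96, (byte) 0x99, (byte) 0x93, (byte) 0x84};
        System.out.println(hasLatinLetterDensity(probe));
    }
}
```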
- checkIbm500
  
  public static boolean checkIbm500(byte[] bytes, int offset, int length)
- hasCrlfBytes
  
  public static boolean hasCrlfBytes(byte[] bytes)
  
  Returns true if the probe contains at least one CRLF pair (0x0D 0x0A).
  Files originating on Windows use CRLF as the line separator. The presence of a 0x0D 0x0A pair in a probe that is otherwise 7-bit ASCII is weak evidence that the file was created on Windows and therefore more likely to use a Windows code page (e.g. windows-1252) than a Unix-origin ISO-8859-X encoding for any high-byte content beyond the probe window.
  
  A bare 0x0D without a following 0x0A is not counted: classic Mac OS used bare CR as its line ending, and that is a different case that does not imply Windows origin.
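A minimal sketch of the pair scan, with hypothetical names, shows why a bare CR or a bare LF does not count:

```java
import java.nio.charset.StandardCharsets;

class CrlfSketch {
    // True only when 0x0D is immediately followed by 0x0A somewhere in the probe.
    static boolean hasCrlf(byte[] b) {
        for (int i = 0; i + 1 < b.length; i++) {
            if (b[i] == 0x0D && b[i + 1] == 0x0A) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasCrlf("line one\r\nline two".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(hasCrlf("line one\rline two".getBytes(StandardCharsets.US_ASCII)));
    }
}
```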
- hasCrlfBytes
  
  public static boolean hasCrlfBytes(byte[] bytes, int offset, int length)
- hasC1Bytes
  
  public static boolean hasC1Bytes(byte[] bytes)
  
  Returns true if the probe contains any byte in the C1 control range 0x80–0x9F.
  In every ISO-8859-X encoding those byte values are C1 control characters that never appear in real text. In the Windows-12XX encodings most of them are printable characters (smart quotes, Euro sign, em-dash, …). Their presence is therefore definitive proof that the content is not a valid ISO-8859-X encoding and should be attributed to the corresponding Windows-12XX variant instead.
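The range test itself is small; the sketch below (hypothetical names) mainly illustrates the unsigned conversion that Java's signed bytes require.

```java
class C1Sketch {
    // True when any byte, read as unsigned, lies in 0x80..0x9F.
    static boolean hasC1(byte[] b) {
        for (byte raw : b) {
            int v = raw & 0xFF; // convert the signed byte to 0..255
            if (v >= 0x80 && v <= 0x9F) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasC1(new byte[] {'a', (byte) 0x93, 'b'})); // 0x93 is a curly quote in windows-1252
        System.out.println(hasC1(new byte[] {'a', (byte) 0xE9, 'b'}));
    }
}
```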
- hasC1Bytes
  
  public static boolean hasC1Bytes(byte[] bytes, int offset, int length)
- hasGb18030FourByteSequence
  
  public static boolean hasGb18030FourByteSequence(byte[] bytes)
  Returns true if the probe contains at least one GB18030-specific 4-byte sequence.
  GB18030 4-byte structure
  
  Byte 1 (lead): 0x81–0xFE
  Byte 2 (second): 0x30–0x39 ← ASCII digits
  Byte 3 (third): 0x81–0xFE
  Byte 4 (trail): 0x30–0x39 ← ASCII digits
  
  In GBK and GB2312 all trail bytes are in 0x40–0xFE, so a digit (0x30–0x39) in the second or fourth position is impossible. A single matching 4-tuple is therefore definitive proof that the content was encoded with GB18030 and must be decoded with a GB18030-capable codec to avoid replacement characters for the affected code points.
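The 4-byte pattern above can be matched with a sliding window, as in this sketch. The names are hypothetical, and the scan checks every offset without tracking two-byte character alignment, which is a simplification that the real method may not share.

```java
class Gb18030Sketch {
    static boolean isLead(byte b) {
        int v = b & 0xFF;
        return v >= 0x81 && v <= 0xFE;
    }

    static boolean isDigit(byte b) {
        return b >= 0x30 && b <= 0x39;
    }

    // True when some window of four bytes matches lead, digit, lead, digit.
    static boolean hasFourByteSequence(byte[] b) {
        for (int i = 0; i + 3 < b.length; i++) {
            if (isLead(b[i]) && isDigit(b[i + 1]) && isLead(b[i + 2]) && isDigit(b[i + 3])) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] fourByte = {(byte) 0x81, 0x30, (byte) 0x81, 0x30};
        byte[] twoByteOnly = {(byte) 0xC4, (byte) 0xE3, (byte) 0xBA, (byte) 0xC3};
        System.out.println(hasFourByteSequence(fourByte));
        System.out.println(hasFourByteSequence(twoByteOnly));
    }
}
```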
- hasGb18030FourByteSequence
  
  public static boolean hasGb18030FourByteSequence(byte[] bytes, int offset, int length)
- checkIso2022Jp
  
  @Deprecated public static boolean checkIso2022Jp(byte[] bytes)
  
  Deprecated.
  Use detectIso2022(byte[]) which distinguishes JP/KR/CN.
- checkUtf8
  
  public static StructuralEncodingRules.Utf8Result checkUtf8(byte[] bytes)
  Validates the UTF-8 byte grammar of the sample and returns one of three outcomes:
  
  StructuralEncodingRules.Utf8Result.LIKELY_UTF8: all multi-byte sequences are valid and the sample contains enough high bytes to be informative. Use UTF-8.
  
  StructuralEncodingRules.Utf8Result.NOT_UTF8: at least one invalid byte sequence was found. Remove UTF-8 from the candidate set.
  
  StructuralEncodingRules.Utf8Result.AMBIGUOUS: the sample is structurally valid UTF-8 but contains very few high bytes (almost pure ASCII), so validity is uninformative. Pass to the model.
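The three-way decision can be approximated with the JDK's strict UTF-8 decoder, as sketched below. This is not how Tika implements it: the class, the local Result enum, and the minHighBytes parameter are all hypothetical, and the cut-off for "enough high bytes" is not stated in this documentation. Note that the strict decoder also rejects a multi-byte sequence truncated at the end of the probe.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8CheckSketch {
    enum Result { LIKELY_UTF8, NOT_UTF8, AMBIGUOUS }

    static Result check(byte[] b, int minHighBytes) {
        try {
            // REPORT makes the decoder throw on the first malformed sequence.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(b));
        } catch (CharacterCodingException e) {
            return Result.NOT_UTF8;
        }
        int high = 0;
        for (byte x : b) {
            if (x < 0) {
                high++; // byte value >= 0x80
            }
        }
        // Valid but nearly pure ASCII: validity alone says little, so defer to the model.
        return high >= minHighBytes ? Result.LIKELY_UTF8 : Result.AMBIGUOUS;
    }

    public static void main(String[] args) {
        System.out.println(check("h\u00e9llo w\u00f6rld".getBytes(StandardCharsets.UTF_8), 2));
        System.out.println(check(new byte[] {'a', (byte) 0xFF}, 2));
        System.out.println(check("plain".getBytes(StandardCharsets.UTF_8), 2));
    }
}
```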
- checkUtf8
  
  public static StructuralEncodingRules.Utf8Result checkUtf8(byte[] bytes, int offset, int length)
- countUtf8Errors
  
  public static int countUtf8Errors(byte[] bytes)
  
  Counts the number of malformed UTF-8 sequences in the sample — one event per bad lead, orphaned continuation, overlong, surrogate, or out-of-range codepoint, regardless of how many bytes the bad sequence spans. Unlike checkUtf8(byte[]), this does not early-exit on the first bad sequence; it scans the entire range, resyncing after each error. Returns 0 for a clean UTF-8 stream.
  Useful for "tolerant" UTF-8 acceptance: a real-world UTF-8 file with a few corrupted sequences (copy-paste artefact, truncated upstream, MIME transport flip) should still be recognized as UTF-8 rather than rejected outright. Caller decides what error count is tolerable (typically as a fraction of probe length).
  
  The count matches the U+FFFD-per-error semantics of Java's new String(bytes, UTF_8) (one replacement per malformed sequence).
  
  Returns:
  
  number of malformed UTF-8 sequence events
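Because the count is described as matching the replacement behaviour of new String(bytes, UTF_8), a rough stand-in is to decode and count U+FFFD characters, as in the sketch below. The names and the maxErrorFraction parameter are hypothetical, and this shortcut also counts any U+FFFD that was validly encoded in the input, so it is only an approximation of the method.

```java
import java.nio.charset.StandardCharsets;

class Utf8ErrorCountSketch {
    // Counts U+FFFD characters produced by the JDK's lenient UTF-8 decoding.
    static int countReplacements(byte[] b) {
        String decoded = new String(b, StandardCharsets.UTF_8);
        int n = 0;
        for (int i = 0; i < decoded.length(); i++) {
            if (decoded.charAt(i) == '\uFFFD') {
                n++;
            }
        }
        return n;
    }

    // "Tolerant" acceptance: the caller picks the tolerable error fraction.
    static boolean tolerablyUtf8(byte[] b, double maxErrorFraction) {
        return b.length > 0 && countReplacements(b) <= maxErrorFraction * b.length;
    }

    public static void main(String[] args) {
        byte[] damaged = {'a', (byte) 0xFF, 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'};
        System.out.println(countReplacements(damaged));
        System.out.println(tolerablyUtf8(damaged, 0.2));
    }
}
```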
- countUtf8Errors
  
  public static int countUtf8Errors(byte[] bytes, int offset, int length)
