Class StructuralEncodingRules

java.lang.Object
org.apache.tika.ml.chardetect.StructuralEncodingRules

public final class StructuralEncodingRules extends Object
Fast, rule-based encoding checks that run before the statistical model.

Pipeline

  1. checkAscii(byte[]): no bytes >= 0x80 → UTF-8 (ASCII is a subset)
  2. detectIso2022(byte[]): ISO-2022 escape sequences present → ISO-2022-JP, ISO-2022-KR, or ISO-2022-CN depending on the designation sequence
  3. checkUtf8(byte[]): validate UTF-8 multi-byte grammar; returns a StructuralEncodingRules.Utf8Result indicating whether the bytes are definitively UTF-8, definitively not UTF-8, or ambiguous (pass to model).

UTF-16/32 detection is handled upstream by org.apache.tika.utils.ByteEncodingHint and is not repeated here.

IBM424 (EBCDIC Hebrew) is detected via checkIbm424(byte[]): the Hebrew letters in this code page occupy bytes 0x41–0x6A, which fall entirely below the 0x80 threshold used by the statistical model's feature extractor. The EBCDIC space (0x40) vs ASCII space (0x20) frequency ratio provides a cheap first-pass EBCDIC gate before the Hebrew letter frequencies are checked.

All methods are stateless and safe to call from multiple threads.

  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Class
    Description
    static enum 
    StructuralEncodingRules.Utf8Result
    Outcome of the UTF-8 structural check.
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final int
    MIN_COLUMN_ASYMMETRY_PROBE
    Minimum probe length before has2ByteColumnAsymmetry(byte[]) produces meaningful diversity counts.
  • Method Summary

    Modifier and Type
    Method
    Description
    static boolean
    checkAscii(byte[] bytes)
    Returns true if bytes contains no bytes with value >= 0x80 (i.e. pure 7-bit ASCII, which is a strict subset of UTF-8).
    static boolean
    checkAscii(byte[] bytes, int offset, int length)
     
    static boolean
    checkHz(byte[] bytes)
    Returns true if HZ-GB-2312 switching sequences are present.
    static boolean
    checkHz(byte[] bytes, int offset, int length)
     
    static boolean
    checkIbm424(byte[] bytes)
    Detects IBM424 (EBCDIC Hebrew) by examining the sub-0x80 byte landscape.
    static boolean
    checkIbm424(byte[] bytes, int offset, int length)
     
    static boolean
    checkIbm500(byte[] bytes)
    Detects IBM500 (International EBCDIC / EBCDIC-500) by looking for the combination of the EBCDIC space byte and high-byte Latin letter density.
    static boolean
    checkIbm500(byte[] bytes, int offset, int length)
     
    static boolean
    checkIso2022Jp(byte[] bytes)
    Deprecated.
    Use detectIso2022(byte[]) which distinguishes JP/KR/CN.
    static StructuralEncodingRules.Utf8Result
    checkUtf8(byte[] bytes)
    Validates the UTF-8 byte grammar of the sample and returns one of three outcomes: StructuralEncodingRules.Utf8Result.LIKELY_UTF8: all multi-byte sequences are valid and the sample contains enough high bytes to be informative.
    static StructuralEncodingRules.Utf8Result
    checkUtf8(byte[] bytes, int offset, int length)
     
    static int
    countUtf8Errors(byte[] bytes)
    Counts the number of malformed UTF-8 sequences in the sample — one event per bad lead, orphaned continuation, overlong, surrogate, or out-of-range codepoint, regardless of how many bytes the bad sequence spans.
    static int
    countUtf8Errors(byte[] bytes, int offset, int length)
     
    static Charset
    detectIso2022(byte[] bytes)
    Detects ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN by scanning for their characteristic ESC designation sequences.
    static Charset
    detectIso2022(byte[] bytes, int offset, int length)
     
    static boolean
    has2ByteColumnAsymmetry(byte[] bytes)
    Returns true if the probe's byte distribution across stride-2 columns is sufficiently asymmetric to be plausible UTF-16 of some script.
    static boolean
    has2ByteColumnAsymmetryEvidence(byte[] bytes)
    Evidence-based variant of has2ByteColumnAsymmetry(byte[]) with no conservative short-probe default: returns true only when the bytes themselves demonstrate column asymmetry, regardless of probe length.
    static boolean
    hasC1Bytes(byte[] bytes)
    Returns true if the probe contains any byte in the C1 control range 0x80–0x9F.
    static boolean
    hasC1Bytes(byte[] bytes, int offset, int length)
     
    static boolean
    hasCrlfBytes(byte[] bytes)
    Returns true if the probe contains at least one CRLF pair (0x0D 0x0A).
    static boolean
    hasCrlfBytes(byte[] bytes, int offset, int length)
     
    static boolean
    hasGb18030FourByteSequence(byte[] bytes)
    Returns true if the probe contains at least one GB18030-specific 4-byte sequence.
    static boolean
    hasGb18030FourByteSequence(byte[] bytes, int offset, int length)
     
    static boolean
    isEbcdicLikely(byte[] bytes)
    Returns true if the probe is plausibly EBCDIC based on the word-separator distribution.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • MIN_COLUMN_ASYMMETRY_PROBE

      public static final int MIN_COLUMN_ASYMMETRY_PROBE
      Minimum probe length before has2ByteColumnAsymmetry(byte[]) produces meaningful diversity counts. Short probes or probes with limited vocabulary may have too few distinct byte values per column to compare reliably; on anything below this threshold we fall back to the pre-gate behaviour (model + WideUnicodeDetector positive signal). Set above the size of typical short probes (a few hundred bytes) so real CJK UTF-16 text has room to diversify its high-byte column.
  • Method Details

    • checkAscii

      public static boolean checkAscii(byte[] bytes)
      Returns true if bytes contains no bytes with value >= 0x80 (i.e. pure 7-bit ASCII, which is a strict subset of UTF-8).
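The check described above amounts to a single pass over the probe. The following is a minimal sketch of that idea, not the Tika implementation; the class and method names are hypothetical.

```java
final class AsciiCheckSketch {
    /** Returns true when no byte in the probe has its high bit set. */
    static boolean isPureAscii(byte[] bytes) {
        for (byte b : bytes) {
            // Java bytes are signed, so any value >= 0x80 appears as a negative number.
            if (b < 0) {
                return false;
            }
        }
        return true;
    }
}
```

Under this sketch an empty probe counts as ASCII; whether the real method treats empty input the same way is not stated here.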
    • checkAscii

      public static boolean checkAscii(byte[] bytes, int offset, int length)
    • detectIso2022

      public static Charset detectIso2022(byte[] bytes)
      Detects ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN by scanning for their characteristic ESC designation sequences.

      All three share the ESC $ (0x1B 0x24) prefix, so we must read further to distinguish them:

         ISO-2022-JP:  ESC $ B  (JIS X 0208-1983)
                       ESC $ @  (JIS X 0208-1978)
                       ESC $ ( D  (JIS X 0212 supplementary)
         ISO-2022-KR:  ESC $ ) C
         ISO-2022-CN:  ESC $ ) A  (GB2312)
                       ESC $ ) G  (CNS 11643 plane 1)
                       ESC $ * H  (CNS 11643 plane 2)
       

      If ESC $ is found but no recognised third byte follows (or the buffer is too short), ISO-2022-JP is returned as the most common default.

      Returns:
      the detected ISO-2022 charset, or null if no ISO-2022 escape sequence is found
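The designation table above can be read as a small decision procedure. The sketch below illustrates it under two assumptions that the description does not confirm: it decides on the first ESC $ it encounters, and it returns charset names as strings rather than Charset objects (so it does not depend on which charsets the running JDK provides). The class and method names are hypothetical.

```java
final class Iso2022Sketch {
    /** Returns an ISO-2022 charset name, or null when no ESC $ prefix is found. */
    static String detect(byte[] b) {
        for (int i = 0; i + 1 < b.length; i++) {
            if (b[i] != 0x1B || b[i + 1] != '$') {
                continue;
            }
            // -1 marks "buffer too short"; it never matches a designator byte.
            int c2 = i + 2 < b.length ? b[i + 2] : -1;
            int c3 = i + 3 < b.length ? b[i + 3] : -1;
            if (c2 == ')' && c3 == 'C') {
                return "ISO-2022-KR";
            }
            if ((c2 == ')' && (c3 == 'A' || c3 == 'G')) || (c2 == '*' && c3 == 'H')) {
                return "ISO-2022-CN";
            }
            // ESC $ B, ESC $ @, ESC $ ( D, and anything unrecognised default to JP.
            return "ISO-2022-JP";
        }
        return null;
    }
}
```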
    • detectIso2022

      public static Charset detectIso2022(byte[] bytes, int offset, int length)
    • checkHz

      public static boolean checkHz(byte[] bytes)
      Returns true if HZ-GB-2312 switching sequences are present.

      HZ is a 7-bit encoding: it uses ~{ (0x7E 0x7B) to enter two-byte GB2312 mode and ~} (0x7E 0x7D) to return to ASCII mode. Like ISO-2022, all bytes are below 0x80, so the model would see no features and must be bypassed with this structural check.
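As an illustration of the switching sequences, the sketch below reports a match only when an opening sequence (0x7E 0x7B) is later followed by a closing sequence (0x7E 0x7D). Requiring a matched pair is an assumption made for this example; the description only says that switching sequences must be present. The class and method names are hypothetical.

```java
final class HzSketch {
    /** Returns true when an HZ shift-in is later followed by a shift-out. */
    static boolean hasHzSwitching(byte[] b) {
        boolean inGbMode = false;
        for (int i = 0; i + 1 < b.length; i++) {
            if (b[i] != '~') {
                continue;
            }
            if (!inGbMode && b[i + 1] == '{') {
                inGbMode = true;
                i++; // skip the '{' that was just consumed
            } else if (inGbMode && b[i + 1] == '}') {
                return true;
            }
        }
        return false;
    }
}
```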

    • checkHz

      public static boolean checkHz(byte[] bytes, int offset, int length)
    • checkIbm424

      public static boolean checkIbm424(byte[] bytes)
      Detects IBM424 (EBCDIC Hebrew) by examining the sub-0x80 byte landscape.

      Why this is needed

      In EBCDIC, the space character is 0x40 (not 0x20 as in ASCII). In IBM424 specifically, the 22 Hebrew base letters plus their five final forms occupy three byte clusters entirely below 0x80:

         0x41–0x49  alef … tet      (9 letters)
         0x51–0x59  yod  … samekh   (9 letters)
         0x62–0x6A  ayin … tav      (9 letters + final-pe, tsadi, etc.)
       

      The statistical model ignores all bytes below 0x80, so these letters are invisible to it. This structural rule detects them directly.

      Algorithm

      1. EBCDIC gate: byte 0x40 (EBCDIC space) must appear significantly more often than 0x20 (ASCII space). In normal Latin text 0x40 is the rare @ character; in any EBCDIC text it is the word separator and appears at ~10–20% of bytes.
      2. Hebrew letter gate: the combined frequency of bytes in the three Hebrew clusters above must exceed 0.12 of the sample length. Genuine Hebrew text has ~65% of its printable characters in these ranges. ASCII text with the same byte values (upper-case A–I, Q–Y, lower-case b–j) stays well below this threshold in practice.
      Returns:
      true if the byte stream is almost certainly IBM424
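The two gates above can be sketched as a single counting pass. The 0.12 Hebrew-letter threshold comes from the description; the 3× space margin is an assumption (the text only says "significantly more often"), and the class and method names are hypothetical. This is an illustration, not the Tika implementation.

```java
final class Ibm424Sketch {
    /** True for byte values in the three Hebrew letter clusters listed above. */
    static boolean isHebrewLetter(int v) {
        return (v >= 0x41 && v <= 0x49) || (v >= 0x51 && v <= 0x59) || (v >= 0x62 && v <= 0x6A);
    }

    static boolean looksLikeIbm424(byte[] bytes) {
        int ebcdicSpace = 0;
        int asciiSpace = 0;
        int hebrew = 0;
        for (byte raw : bytes) {
            int v = raw & 0xFF; // treat the byte as unsigned
            if (v == 0x40) {
                ebcdicSpace++;
            } else if (v == 0x20) {
                asciiSpace++;
            } else if (isHebrewLetter(v)) {
                hebrew++;
            }
        }
        // Gate 1 (assumed margin): 0x40 must clearly outnumber 0x20.
        boolean ebcdicGate = ebcdicSpace > 0 && ebcdicSpace >= 3 * asciiSpace;
        // Gate 2: Hebrew-cluster bytes must exceed 0.12 of the sample length.
        boolean hebrewGate = hebrew > 0.12 * bytes.length;
        return ebcdicGate && hebrewGate;
    }
}
```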
    • isEbcdicLikely

      public static boolean isEbcdicLikely(byte[] bytes)
      Returns true if the probe is plausibly EBCDIC based on the word-separator distribution. In every EBCDIC variant (IBM420, IBM424, IBM500, IBM1047) the space character is 0x40, not 0x20; a stretch of EBCDIC text therefore has 0x40 as its single most common byte, at roughly 10–20% of the sample. Conversely, any ASCII or ISO-8859-X / windows-12XX / DOS / Mac / CJK text uses 0x20 (or 0x09 / 0x0A) as its whitespace and has 0x40 only as the rare @ character (typically less than 0.1% of bytes).

      This is a negative gate: when it returns false, the probe cannot be any EBCDIC variant, and downstream scoring should exclude EBCDIC labels from consideration even if the statistical model ranks them highly.

      Threshold rationale: we require both (a) 0x40 at least 3% of the sample and (b) 0x40 at least 3× more frequent than 0x20. Gate (b) alone is not sufficient because sparse binary content can have neither byte; gate (a) alone is not sufficient because some text formats (CSV with @-separated fields, e-mail address lists) can exceed 3% 0x40 while clearly being ASCII-spaced. Both gates together match real EBCDIC text reliably across IBM420/424/500/1047 variants.

      Parameters:
      bytes - the probe to analyse
      Returns:
      true if the probe's whitespace distribution is consistent with EBCDIC; false if it is clearly ASCII-spaced
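The two thresholds from the rationale above (0x40 at least 3% of the sample, and at least 3× as frequent as 0x20) can be sketched as follows. The class and method names are hypothetical, and edge cases such as empty input are handled in whatever way was simplest for the example.

```java
final class EbcdicGateSketch {
    static boolean isEbcdicLikely(byte[] bytes) {
        long count40 = 0;
        long count20 = 0;
        for (byte b : bytes) {
            if (b == 0x40) {
                count40++;
            } else if (b == 0x20) {
                count20++;
            }
        }
        // Gate (a): 0x40 makes up at least 3% of the sample (integer arithmetic).
        boolean frequentEnough = count40 > 0 && count40 * 100 >= 3L * bytes.length;
        // Gate (b): 0x40 is at least three times as frequent as 0x20.
        boolean dominatesAsciiSpace = count40 >= 3 * count20;
        return frequentEnough && dominatesAsciiSpace;
    }
}
```

An e-mail address list such as `a@b.c d@e.f` passes gate (a) but fails gate (b), which is the case the rationale calls out.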
    • has2ByteColumnAsymmetry

      public static boolean has2ByteColumnAsymmetry(byte[] bytes)
      Returns true if the probe's byte distribution across stride-2 columns is sufficiently asymmetric to be plausible UTF-16 of some script.

      Every UTF-16 variant has one byte column concentrated in a script-specific Unicode block prefix while the other column is diverse. UTF-16 Latin encodes each character as the pair (ASCII byte, 0x00), so one column holds only 0x00 (1 value) while the other spans the ASCII range (~70 values). UTF-16 Cyrillic, Greek, Arabic, and Hebrew each pair to a single high-byte block prefix (0x04, 0x03, 0x06, and 0x05 respectively). UTF-16 CJK Unified uses high bytes 0x4E–0x9F (~80 distinct values) against ~256 low bytes, and Hangul uses high bytes 0xAC–0xD7 (~44 values).

      Non-UTF-16 text — including scattered-null binaries and mixed-content files — has roughly balanced column diversity (both columns saturate near 256 distinct byte values on long probes).

      This is a negative gate: when it returns false, the probe cannot be any UTF-16 variant, and UTF-16 labels should be masked from model output even when the stride-2 bigram features score them highly (e.g. a Greek plaintext file with 0.36% scattered nulls being mis-scored as UTF-16-LE).

      A diversity ratio of 3× (more diverse column has at least 3× as many distinct values as the more concentrated column) admits all UTF-16 variants including CJK (ratio ~3.2) while rejecting scattered-null false positives (ratio ~1:1).

      For probes shorter than MIN_COLUMN_ASYMMETRY_PROBE, this method returns true conservatively — column counts from short samples are not statistically meaningful, so the caller should rely on WideUnicodeDetector positive signal and downstream CharSoup arbitration rather than masking.

      Parameters:
      bytes - the probe to analyse
      Returns:
      true if the probe has UTF-16-compatible column asymmetry (or is too short to judge); false if column diversity is too balanced to be any UTF-16 variant
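The diversity comparison can be sketched by counting distinct byte values in the even and odd columns and applying the 3× ratio. The value of MIN_COLUMN_ASYMMETRY_PROBE is not given here, so this example takes the minimum probe length as a parameter (`minProbe`, a hypothetical name, as are the class and method names). It is an illustration of the idea, not the Tika implementation.

```java
import java.util.BitSet;

final class ColumnAsymmetrySketch {
    static boolean hasAsymmetry(byte[] bytes, int minProbe) {
        if (bytes.length < minProbe) {
            return true; // too short to judge: conservative default
        }
        BitSet evenColumn = new BitSet(256);
        BitSet oddColumn = new BitSet(256);
        for (int i = 0; i + 1 < bytes.length; i += 2) {
            evenColumn.set(bytes[i] & 0xFF);
            oddColumn.set(bytes[i + 1] & 0xFF);
        }
        int even = evenColumn.cardinality();
        int odd = oddColumn.cardinality();
        int concentrated = Math.min(even, odd);
        int diverse = Math.max(even, odd);
        // The more diverse column needs at least 3x the distinct values of the other.
        return concentrated > 0 && diverse >= 3 * concentrated;
    }
}
```

For UTF-16LE English text the odd column contains only 0x00, so the ratio is far above 3; the same text as plain ASCII spreads letters across both columns and fails the gate.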
    • has2ByteColumnAsymmetryEvidence

      public static boolean has2ByteColumnAsymmetryEvidence(byte[] bytes)
      Evidence-based variant of has2ByteColumnAsymmetry(byte[]) with no conservative short-probe default: returns true only when the bytes themselves demonstrate column asymmetry, regardless of probe length. Use this to gate positive UTF-16 detection (e.g. invoking Utf16SpecialistEncodingDetector), where absence of evidence must mean "not UTF-16", not "unknown".

      Rejects probes below 16 bytes outright: with fewer than 8 pairs, column-distinct counts don't discriminate any UTF-16 variant from legacy double-byte encodings like GBK or Shift_JIS, which also have constrained lead-byte columns on short samples.

    • checkIbm424

      public static boolean checkIbm424(byte[] bytes, int offset, int length)
    • checkIbm500

      public static boolean checkIbm500(byte[] bytes)
      Detects IBM500 (International EBCDIC / EBCDIC-500) by looking for the combination of the EBCDIC space byte and high-byte Latin letter density.

      Why this is needed

      In IBM500 every Latin letter is encoded as a byte ≥ 0x80:

         0x81–0x89  a–i    (lowercase)
         0x91–0x99  j–r    (lowercase)
         0xA2–0xA9  s–z    (lowercase)
         0xC1–0xC9  A–I    (uppercase)
         0xD1–0xD9  J–R    (uppercase)
         0xE2–0xE9  S–Z    (uppercase)
       

      At full probe length the statistical model distinguishes IBM500 from IBM424 without difficulty. At very short probes (≤ 20 bytes) the model sees too few bytes to be confident and tends to confuse the two EBCDIC code pages. This structural gate fires early — before the model — using the cheap EBCDIC-space dominance check followed by a Latin-letter density check.

      Algorithm

      1. EBCDIC gate: same as checkIbm424(byte[]) — byte 0x40 must dominate over 0x20. This distinguishes any EBCDIC encoding from ASCII/UTF-8/Latin-1 where 0x40 is the rare @ character.
      2. Latin letter density: the combined frequency of bytes in the six IBM500 Latin-letter clusters above must exceed 0.25 of the sample. Normal Latin text has ~60–70% letter bytes; the threshold is intentionally conservative to fire reliably at 20 bytes.

      checkIbm424(byte[]) should be called first. If it fires the probe is IBM424 (Hebrew EBCDIC); only if it does not fire should this method be consulted for IBM500.

      Returns:
      true if the byte stream is almost certainly IBM500
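A sketch of the two steps above, using the six letter clusters and the 0.25 density threshold from the description. How strongly 0x40 must "dominate" 0x20 is not specified, so this example assumes a plain "more frequent than" comparison; the class and method names are hypothetical.

```java
final class Ibm500Sketch {
    /** True for byte values in the six IBM500 Latin-letter clusters listed above. */
    static boolean isLatinLetter(int v) {
        return (v >= 0x81 && v <= 0x89) || (v >= 0x91 && v <= 0x99) || (v >= 0xA2 && v <= 0xA9)
                || (v >= 0xC1 && v <= 0xC9) || (v >= 0xD1 && v <= 0xD9) || (v >= 0xE2 && v <= 0xE9);
    }

    static boolean looksLikeIbm500(byte[] bytes) {
        int ebcdicSpace = 0;
        int asciiSpace = 0;
        int letters = 0;
        for (byte raw : bytes) {
            int v = raw & 0xFF; // treat the byte as unsigned
            if (v == 0x40) {
                ebcdicSpace++;
            } else if (v == 0x20) {
                asciiSpace++;
            } else if (isLatinLetter(v)) {
                letters++;
            }
        }
        // Step 1 (assumed comparison): EBCDIC space outnumbers ASCII space.
        boolean ebcdicGate = ebcdicSpace > asciiSpace;
        // Step 2: letter-cluster bytes exceed 0.25 of the sample.
        return ebcdicGate && letters > 0.25 * bytes.length;
    }
}
```

Following the ordering note above, a caller would consult a check like this only after the IBM424 check has declined the probe.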
    • checkIbm500

      public static boolean checkIbm500(byte[] bytes, int offset, int length)
    • hasCrlfBytes

      public static boolean hasCrlfBytes(byte[] bytes)
      Returns true if the probe contains at least one CRLF pair (0x0D 0x0A).

      Files originating on Windows use CRLF as the line separator. The presence of a 0x0D 0x0A pair in a probe that is otherwise 7-bit ASCII is weak evidence that the file was created on Windows and therefore more likely to use a Windows code page (e.g. windows-1252) than a Unix-origin ISO-8859-X encoding for any high-byte content beyond the probe window.

      A bare 0x0D without a following 0x0A is not counted: classic Mac OS used bare CR as its line ending, and that is a different case that does not imply Windows origin.
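A minimal sketch of the pair scan, with hypothetical class and method names: only a 0x0D immediately followed by 0x0A counts, so a bare CR or a bare LF does not match.

```java
final class CrlfSketch {
    static boolean hasCrlf(byte[] bytes) {
        for (int i = 0; i + 1 < bytes.length; i++) {
            if (bytes[i] == 0x0D && bytes[i + 1] == 0x0A) {
                return true;
            }
        }
        return false;
    }
}
```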

    • hasCrlfBytes

      public static boolean hasCrlfBytes(byte[] bytes, int offset, int length)
    • hasC1Bytes

      public static boolean hasC1Bytes(byte[] bytes)
      Returns true if the probe contains any byte in the C1 control range 0x80–0x9F.

      In every ISO-8859-X encoding those byte values are C1 control characters that never appear in real text. In every Windows-12XX encoding they are printable characters (smart quotes, Euro sign, em-dash, …). Their presence is therefore definitive proof that the content is not a valid ISO-8859-X encoding and should be attributed to the corresponding Windows-12XX variant instead.
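The range test can be sketched in a few lines. The names are hypothetical; the only subtlety is masking with 0xFF, because Java bytes are signed and 0x80–0x9F would otherwise appear as negative values.

```java
final class C1Sketch {
    static boolean hasC1(byte[] bytes) {
        for (byte raw : bytes) {
            int v = raw & 0xFF; // unsigned view of the byte
            if (v >= 0x80 && v <= 0x9F) {
                return true;
            }
        }
        return false;
    }
}
```

For example, 0x93 (a left double quotation mark in windows-1252) matches, while 0xE9 (é in both ISO-8859-1 and windows-1252) does not.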

    • hasC1Bytes

      public static boolean hasC1Bytes(byte[] bytes, int offset, int length)
    • hasGb18030FourByteSequence

      public static boolean hasGb18030FourByteSequence(byte[] bytes)
      Returns true if the probe contains at least one GB18030-specific 4-byte sequence.

      GB18030 4-byte structure

         Byte 1 (lead):   0x81–0xFE
         Byte 2 (second): 0x30–0x39  ← ASCII digits
         Byte 3 (third):  0x81–0xFE
         Byte 4 (trail):  0x30–0x39  ← ASCII digits
       

      In GBK and GB2312 all trail bytes are in 0x40–0xFE, so a digit (0x30–0x39) in the second or fourth position is impossible. A single matching 4-tuple is therefore definitive proof that the content was encoded with GB18030 and must be decoded with a GB18030-capable codec to avoid replacement characters for the affected code points.
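The 4-byte structure above can be matched with a sliding window. This sketch tests every offset and does not track character alignment, which is a simplification assumed for the example; the class and method names are hypothetical.

```java
final class Gb18030Sketch {
    static boolean hasFourByteSequence(byte[] b) {
        for (int i = 0; i + 3 < b.length; i++) {
            if (isLead(b[i]) && isDigit(b[i + 1]) && isLead(b[i + 2]) && isDigit(b[i + 3])) {
                return true;
            }
        }
        return false;
    }

    /** Bytes 1 and 3: 0x81-0xFE. */
    private static boolean isLead(byte x) {
        int v = x & 0xFF;
        return v >= 0x81 && v <= 0xFE;
    }

    /** Bytes 2 and 4: ASCII digits 0x30-0x39. */
    private static boolean isDigit(byte x) {
        return x >= 0x30 && x <= 0x39;
    }
}
```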

    • hasGb18030FourByteSequence

      public static boolean hasGb18030FourByteSequence(byte[] bytes, int offset, int length)
    • checkIso2022Jp

      @Deprecated public static boolean checkIso2022Jp(byte[] bytes)
      Deprecated.
      Use detectIso2022(byte[]) which distinguishes JP/KR/CN.
    • checkUtf8

      public static StructuralEncodingRules.Utf8Result checkUtf8(byte[] bytes)
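      Validates the UTF-8 byte grammar of the sample and returns one of three outcomes: StructuralEncodingRules.Utf8Result.LIKELY_UTF8 when all multi-byte sequences are valid and the sample contains enough high bytes to be informative, and two further outcomes indicating that the bytes are definitively not UTF-8 or that the result is ambiguous and should be passed to the model.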
    • checkUtf8

      public static StructuralEncodingRules.Utf8Result checkUtf8(byte[] bytes, int offset, int length)
    • countUtf8Errors

      public static int countUtf8Errors(byte[] bytes)
      Counts the number of malformed UTF-8 sequences in the sample — one event per bad lead, orphaned continuation, overlong, surrogate, or out-of-range codepoint, regardless of how many bytes the bad sequence spans. Unlike checkUtf8(byte[]), this does not early-exit on the first bad sequence; it scans the entire range, resyncing after each error. Returns 0 for a clean UTF-8 stream.

      Useful for "tolerant" UTF-8 acceptance: a real-world UTF-8 file with a few corrupted sequences (copy-paste artefact, truncated upstream, MIME transport flip) should still be recognized as UTF-8 rather than rejected outright. Caller decides what error count is tolerable (typically as a fraction of probe length).

      The count follows the same U+FFFD-per-error semantics as Java's new String(bytes, UTF_8): one replacement character per malformed sequence.

      Returns:
      number of malformed UTF-8 sequence events
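One way to approximate a non-early-exit error count with the JDK alone is to drive a CharsetDecoder in REPORT mode and resynchronise after each reported error. This is a sketch under that assumption, not the Tika implementation: the JDK decides how many bytes belong to each malformed event, so its grouping may differ in detail from the event definition above. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8ErrorCountSketch {
    static int countErrors(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(1024); // decoded text is discarded
        int errors = 0;
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isOverflow()) {
                out.clear(); // make room and keep decoding
            } else if (result.isError()) {
                errors++;
                // Skip the bytes the decoder flagged, then resume scanning.
                in.position(in.position() + result.length());
            } else {
                return errors; // underflow: all input consumed
            }
        }
    }
}
```

A caller implementing tolerant acceptance might compare the returned count against a fraction of the probe length, as suggested above.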
    • countUtf8Errors

      public static int countUtf8Errors(byte[] bytes, int offset, int length)