Class DecodeEquivalence

java.lang.Object
org.apache.tika.ml.chardetect.DecodeEquivalence

public final class DecodeEquivalence extends Object
Cheap byte-wise decode-equivalence check for single-byte charsets.

For single-byte codepages, the mapping from byte value (0x00..0xFF) to Unicode codepoint is a fixed table. Two charsets decode a probe byte-for-byte identically iff their byte-to-char tables agree on every byte value that appears in the probe. ASCII bytes (below 0x80) map identically in every Latin-family codepage and are skipped; the check reduces to "do these charsets agree on every high byte present in this probe?"
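The table comparison described above can be sketched with the JDK alone. This is an illustrative approximation, not the class's actual source: the names `ByteTableSketch`, `buildTable`, and `agreeOnProbe` are hypothetical, and the sketch assumes every charset passed in really is single-byte.

```java
import java.nio.charset.Charset;

final class ByteTableSketch {

    /** Builds a 256-entry table: entry i is the char that byte value i decodes to. */
    static char[] buildTable(Charset cs) {
        char[] table = new char[256];
        for (int i = 0; i < 256; i++) {
            // new String(byte[], Charset) substitutes the charset's replacement
            // string for malformed or unmappable input instead of throwing.
            String decoded = new String(new byte[] {(byte) i}, cs);
            // Assumption: anything that is not exactly one char is treated as unmapped.
            table[i] = decoded.length() == 1 ? decoded.charAt(0) : '\uFFFD';
        }
        return table;
    }

    /** Returns true if both tables agree on every high byte that occurs in the probe. */
    static boolean agreeOnProbe(byte[] probe, char[] ta, char[] tb) {
        for (byte raw : probe) {
            int b = raw & 0xFF;        // Java bytes are signed; widen to 0..255
            if (b < 0x80) {
                continue;              // ASCII range is skipped
            }
            if (ta[b] != tb[b]) {
                return false;          // short-circuit on the first disagreement
            }
        }
        return true;
    }
}
```

For instance, ISO-8859-1 and windows-1252 agree on 0xE9 (é) but differ on 0x80, so a probe containing only the former compares as equivalent while one containing the latter does not.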

Cost: O(probe.length) per call in the worst case; the check typically short-circuits on the first disagreement. Byte-to-char tables are computed lazily on first use and cached for the lifetime of the process.
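One common way to get lazy, process-lifetime caching of such tables is a static `ConcurrentHashMap` keyed by charset. The sketch below assumes that approach; the class name `TableCacheSketch` and its methods are hypothetical, and the real class may cache differently.

```java
import java.nio.charset.Charset;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class TableCacheSketch {

    // Static map: entries live as long as the class is loaded.
    private static final Map<Charset, char[]> TABLES = new ConcurrentHashMap<>();

    /** Returns the cached table for cs, building it on the first request only. */
    static char[] tableFor(Charset cs) {
        return TABLES.computeIfAbsent(cs, TableCacheSketch::build);
    }

    // Decodes each byte value once; assumes cs is a single-byte charset.
    private static char[] build(Charset cs) {
        char[] table = new char[256];
        for (int i = 0; i < 256; i++) {
            String decoded = new String(new byte[] {(byte) i}, cs);
            table[i] = decoded.length() == 1 ? decoded.charAt(0) : '\uFFFD';
        }
        return table;
    }
}
```

With this layout the 256 decode calls are paid once per charset, and every later probe check costs only array lookups. The returned array is shared, so callers in this sketch must treat it as read-only.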

This is the inference-time counterpart to the broader CharsetConfusables#POTENTIAL_DECODE_EQUIV_FAMILIES declaration. The families enumerate which charset pairs are potentially byte-identical; this class decides whether a given pair is actually byte-identical on a specific probe.

  • Method Details

    • byteIdenticalOnProbe

      public static boolean byteIdenticalOnProbe(byte[] probe, Charset a, Charset b)
      Returns true if decoding probe under charsets a and b produces identical character sequences. Only the high-byte positions (byte values >= 0x80 when read as unsigned) are compared, because all Latin-family charsets agree on ASCII.

      Returns false (and caches nothing) if either charset's byte table cannot be resolved (e.g. the charset is stateful, multi-byte, or unsupported by the JVM). Callers should restrict invocation to single-byte charsets, typically via CharsetConfusables#potentialDecodeEquivPeersOf(String).
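Callers that hold arbitrary charset names may want a cheap pre-check before calling this method. The following is a heuristic sketch built only on JDK calls; `SingleByteGuardSketch`, `resolveOrNull`, and `looksSingleByte` are hypothetical helpers, not part of Tika, and the encoder-width test is an assumption about how to recognise a single-byte charset rather than the library's own rule.

```java
import java.nio.charset.Charset;

final class SingleByteGuardSketch {

    /** Returns the named charset, or null if the name is illegal or the JVM lacks it. */
    static Charset resolveOrNull(String name) {
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException and UnsupportedCharsetException.
            return null;
        }
    }

    /** Heuristic: a charset whose encoder never emits more than one byte per char. */
    static boolean looksSingleByte(Charset cs) {
        if (cs == null || !cs.canEncode()) {
            return false;              // decode-only charsets are rejected
        }
        try {
            return cs.newEncoder().maxBytesPerChar() == 1.0f;
        } catch (UnsupportedOperationException e) {
            return false;
        }
    }
}
```

A caller would then invoke `byteIdenticalOnProbe` only when both charsets pass the guard, and treat any other pair as not equivalent.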