org.apache.tika.ml.chardetect.DecodeEquivalence

public final class DecodeEquivalence extends Object

Cheap byte-wise decode-equivalence check for single-byte charsets.

For single-byte codepages, the mapping from byte value (0x00..0xFF) to Unicode codepoint is a fixed table. Two charsets decode a probe byte-for-byte identically iff their byte-to-char tables agree on every byte value that appears in the probe. ASCII bytes (below 0x80) map identically in every Latin-family codepage and are skipped; the check reduces to "do these charsets agree on every high byte present in this probe?"
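
The idea above can be sketched with nothing but the JDK. This is a minimal illustration, not the class's actual implementation: the names `HighByteAgreementSketch`, `tableFor`, and `agreeOnHighBytes` are hypothetical, and the sketch assumes the JVM provides `windows-1252`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not part of Tika.
final class HighByteAgreementSketch {

    /** Builds a 256-entry byte-to-char table by decoding each byte value on its own. */
    static char[] tableFor(Charset cs) {
        char[] table = new char[256];
        for (int b = 0; b < 256; b++) {
            String s = new String(new byte[] {(byte) b}, cs);
            // Bytes that do not decode to exactly one char are recorded as U+FFFD.
            table[b] = s.length() == 1 ? s.charAt(0) : '\uFFFD';
        }
        return table;
    }

    /** True if both tables agree on every high byte (>= 0x80) present in the probe. */
    static boolean agreeOnHighBytes(byte[] probe, Charset a, Charset b) {
        char[] ta = tableFor(a);
        char[] tb = tableFor(b);
        for (byte raw : probe) {
            int v = raw & 0xFF;          // Java bytes are signed; recover 0..255
            if (v < 0x80) {
                continue;                // ASCII range is skipped
            }
            if (ta[v] != tb[v]) {
                return false;            // short-circuit on the first disagreement
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        Charset cp1252 = Charset.forName("windows-1252");
        byte[] cafe = {'c', 'a', 'f', (byte) 0xE9};          // 0xE9 is U+00E9 in both
        byte[] quoted = {(byte) 0x93, 'h', 'i', (byte) 0x94}; // 0x93/0x94 differ
        System.out.println(agreeOnHighBytes(cafe, latin1, cp1252));   // true
        System.out.println(agreeOnHighBytes(quoted, latin1, cp1252)); // false
    }
}
```

The same charset pair is equivalent on one probe and not on the other, which is why the check is made per probe rather than per pair.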

Cost: O(probe.length) per call in the worst case; the check typically short-circuits on the first disagreement. Byte-to-char tables are computed lazily on first use and cached for the lifetime of the process.
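
One plausible way to get lazy, process-lifetime caching is a static concurrent map keyed by charset. The sketch below is an assumption about the general technique, not the class's actual code; `TableCacheSketch`, `CACHE`, and `BUILDS` are hypothetical names, and the build counter exists only to make the caching observable.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; not part of Tika.
final class TableCacheSketch {

    // Static map: entries live until the JVM exits.
    private static final ConcurrentHashMap<Charset, char[]> CACHE = new ConcurrentHashMap<>();

    // Counts how many tables were actually built (for demonstration only).
    static final AtomicInteger BUILDS = new AtomicInteger();

    /** Returns the cached table for the charset, building it on first use. */
    static char[] tableFor(Charset cs) {
        return CACHE.computeIfAbsent(cs, TableCacheSketch::build);
    }

    private static char[] build(Charset cs) {
        BUILDS.incrementAndGet();
        char[] table = new char[256];
        for (int b = 0; b < 256; b++) {
            String s = new String(new byte[] {(byte) b}, cs);
            table[b] = s.length() == 1 ? s.charAt(0) : '\uFFFD';
        }
        return table;
    }

    public static void main(String[] args) {
        char[] first = tableFor(StandardCharsets.ISO_8859_1);
        char[] second = tableFor(StandardCharsets.ISO_8859_1);
        // The second lookup returns the same array instance without rebuilding.
        System.out.println(first == second);
    }
}
```

With the tables cached, each call after the first costs only the scan over the probe.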

This is the inference-time counterpart to the broader CharsetConfusables#POTENTIAL_DECODE_EQUIV_FAMILIES declaration: the families enumerate which pairs are potentially byte-identical, while this class decides whether they are actually byte-identical on a specific probe.

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| static boolean | byteIdenticalOnProbe(byte[] probe, Charset a, Charset b) | Returns true if decoding probe under charsets a and b produces bit-identical character sequences. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Method Details
- byteIdenticalOnProbe
  
  public static boolean byteIdenticalOnProbe(byte[] probe, Charset a, Charset b)
  
  Returns true if decoding probe under charsets a and b produces bit-identical character sequences. Only the high-byte positions (bytes >= 0x80) are compared; all Latin-family charsets agree on ASCII.
  Returns false (and caches nothing) if either charset's byte table cannot be resolved (e.g. because the charset is stateful, multi-byte, or unsupported by the JVM). Callers should restrict invocation to single-byte charsets, typically via CharsetConfusables#potentialDecodeEquivPeersOf(String).
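
The "returns false when a table cannot be resolved" behaviour can be illustrated with a guard in front of the table lookup. This sketch is hypothetical throughout: `ResolveGuardSketch` and `tryResolve` are invented names, and the test for "single-byte" (an encoder that never emits more than one byte per char) is an assumption of this sketch, not necessarily how the class itself decides.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not part of Tika.
final class ResolveGuardSketch {

    /** Returns a 256-entry table, or null if the charset does not look single-byte. */
    static char[] tryResolve(Charset cs) {
        // Assumed heuristic: single-byte charsets encode every char to at most one byte.
        if (!cs.canEncode() || cs.newEncoder().maxBytesPerChar() != 1.0f) {
            return null;
        }
        char[] table = new char[256];
        for (int b = 0; b < 256; b++) {
            String s = new String(new byte[] {(byte) b}, cs);
            table[b] = s.length() == 1 ? s.charAt(0) : '\uFFFD';
        }
        return table;
    }

    /** False if either table is unresolvable; otherwise compares high bytes only. */
    static boolean identicalOnProbe(byte[] probe, Charset a, Charset b) {
        char[] ta = tryResolve(a);
        char[] tb = tryResolve(b);
        if (ta == null || tb == null) {
            return false;
        }
        for (byte raw : probe) {
            int v = raw & 0xFF;
            if (v >= 0x80 && ta[v] != tb[v]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] ascii = {'o', 'k'};
        // UTF-8 is multi-byte, so the guard rejects it even for an ASCII-only probe.
        System.out.println(identicalOnProbe(ascii, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));    // false
        // Both charsets are single-byte and the probe has no high bytes.
        System.out.println(identicalOnProbe(ascii, StandardCharsets.ISO_8859_1, StandardCharsets.US_ASCII)); // true
    }
}
```

Because a false result here can mean either "the decodings differ" or "not applicable", it is worth filtering candidates to single-byte charsets before calling, as the method description recommends.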
