Class Utf16ColumnFeatureExtractor

java.lang.Object
org.apache.tika.ml.chardetect.Utf16ColumnFeatureExtractor
All Implemented Interfaces:
FeatureExtractor<byte[]>

public class Utf16ColumnFeatureExtractor extends Object implements FeatureExtractor<byte[]>
Feature extractor for the UTF-16 specialist of the mixture-of-experts charset detector. Produces a small, dense, position-aware feature vector that is immune to HTML markup by construction: features capture the 2-byte alignment asymmetry that UTF-16 content produces and HTML content (which has no 2-byte alignment) cannot.

Feature vector

12 dense integer features: byte counts across six byte-value ranges, split by column (even-offset vs odd-offset in the probe). Indexing:

Index   Feature
0       count_even(0x00)
1       count_odd(0x00)
2       count_even(0x01-0x1F, excluding 0x09/0x0A/0x0D)
3       count_odd(0x01-0x1F, excluding 0x09/0x0A/0x0D)
4       count_even(0x20-0x7E, plus 0x09, 0x0A, 0x0D)
5       count_odd(0x20-0x7E, plus 0x09, 0x0A, 0x0D)
6       count_even(0x7F)
7       count_odd(0x7F)
8       count_even(0x80-0x9F)
9       count_odd(0x80-0x9F)
10      count_even(0xA0-0xFF)
11      count_odd(0xA0-0xFF)
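
The indexing above can be sketched as follows. This is a minimal illustration, not the Tika source: the class name ColumnFeatureSketch, the helper rangeOf, and the choice to return an all-zero vector for null input are assumptions made for the example. The only thing taken from the table is the layout, index = range * 2 + column.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative sketch of the documented feature layout; not the Tika implementation. */
final class ColumnFeatureSketch {
    static final int NUM_RANGES = 6;
    static final int NUM_COLUMNS = 2;
    static final int NUM_FEATURES = NUM_RANGES * NUM_COLUMNS;

    /** Hypothetical helper: maps an unsigned byte value (0..255) to one of the six ranges. */
    static int rangeOf(int b) {
        if (b == 0x00) return 0;
        if (b == 0x09 || b == 0x0A || b == 0x0D) return 2; // tab, LF, CR join the printable range
        if (b <= 0x1F) return 1;
        if (b <= 0x7E) return 2;
        if (b == 0x7F) return 3;
        if (b <= 0x9F) return 4;
        return 5; // 0xA0-0xFF
    }

    /** Counts bytes per (range, column); column 0 is even offsets, column 1 is odd offsets. */
    static int[] extract(byte[] input) {
        int[] features = new int[NUM_FEATURES];
        if (input == null) {
            return features; // assumption: null yields an all-zero vector
        }
        for (int i = 0; i < input.length; i++) {
            int column = i & 1;
            features[rangeOf(input[i] & 0xFF) * NUM_COLUMNS + column]++;
        }
        return features;
    }

    public static void main(String[] args) {
        // "Hi" in UTF-16LE is 48 00 69 00: two printable bytes at even offsets, two nulls at odd offsets.
        int[] f = extract("Hi".getBytes(StandardCharsets.UTF_16LE));
        System.out.println(Arrays.toString(f)); // [0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0]
    }
}
```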

Why this is HTML-immune

HTML has no 2-byte alignment: tags are variable-length (<br> is 4 bytes, <div> is 5, </span> is 7), and entities and whitespace fall at arbitrary offsets. For content with no fixed byte alignment, any byte range has equal expected frequency at even and odd positions. The maxent model paired with this extractor learns weights that reward column asymmetry. HTML produces near-zero asymmetry on every range, and therefore a near-zero contribution to every UTF-16 class logit.

UTF-16 has strict 2-byte alignment by definition. The "high byte" of every 16-bit code unit lands in one column, the "low byte" in the other. This alignment cannot be faked by non-UTF-16 content without deliberately constructing 2-byte-aligned patterns, which organic text content never does.
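
The contrast between the two paragraphs above can be seen on a toy input. The sketch below counts only the printable range (0x20-0x7E) per column. The class and method names are hypothetical, and a short ASCII-only snippet is a deliberately simple case rather than a representative web page.

```java
import java.nio.charset.StandardCharsets;

/** Toy demonstration of column asymmetry; names are illustrative only. */
final class ColumnAsymmetryDemo {
    /** Returns {evenCount, oddCount} for bytes in the printable range 0x20-0x7E. */
    static int[] printableByColumn(byte[] bytes) {
        int[] counts = new int[2];
        for (int i = 0; i < bytes.length; i++) {
            int v = bytes[i] & 0xFF;
            if (v >= 0x20 && v <= 0x7E) {
                counts[i & 1]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] html = printableByColumn("<div>Hello</div>".getBytes(StandardCharsets.US_ASCII));
        int[] le = printableByColumn("Hello".getBytes(StandardCharsets.UTF_16LE));
        int[] be = printableByColumn("Hello".getBytes(StandardCharsets.UTF_16BE));
        System.out.println("HTML     even=" + html[0] + " odd=" + html[1]); // even=8 odd=8
        System.out.println("UTF-16LE even=" + le[0] + " odd=" + le[1]);     // even=5 odd=0
        System.out.println("UTF-16BE even=" + be[0] + " odd=" + be[1]);     // even=0 odd=5
    }
}
```

The single-byte markup spreads evenly over both columns, while the UTF-16 probes put every printable byte in one column and every 0x00 in the other, with the column determined by byte order.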

Why raw counts instead of asymmetry ratios

The maxent model learns asymmetry weights naturally from raw counts: a positive weight on count_even(X) paired with a negative weight on count_odd(X) produces a dot-product proportional to count_even(X) - count_odd(X), which IS the asymmetry signal up to normalization. Explicit asymmetry features would add redundancy without adding information.
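
A small numeric sketch of that argument, under the assumption of a single range X with opposite-sign weights. The weight values (+0.5 and -0.5) and all names are invented for illustration and are not taken from any trained model.

```java
/** Illustrates how opposite-sign weights on raw counts yield an asymmetry term. */
final class RawCountAsymmetryDemo {
    /** Contribution of one range to a class logit: wEven * countEven + wOdd * countOdd. */
    static double logitContribution(double wEven, double wOdd, int countEven, int countOdd) {
        return wEven * countEven + wOdd * countOdd;
    }

    public static void main(String[] args) {
        double w = 0.5; // hypothetical learned magnitude
        // Balanced columns (HTML-like): 0.5 * 8 - 0.5 * 8 = 0.0
        System.out.println(logitContribution(w, -w, 8, 8));
        // One-sided columns (UTF-16-like): 0.5 * 5 - 0.5 * 0 = 2.5, which equals w * (5 - 0)
        System.out.println(logitContribution(w, -w, 5, 0));
    }
}
```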

What it doesn't do

  • No UTF-32 detection. UTF-32 stays structural (4-byte alignment check) and doesn't need a statistical model.
  • No discrimination between UTF-16 content languages (Japanese vs Chinese vs Korean). CharSoup's language scoring handles that after decoding. The UTF-16 specialist returns only UTF-16-LE or UTF-16-BE.
  • No BOM handling. The caller is responsible for stripping any BOM before passing bytes to this extractor.
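
Because BOM stripping is left to the caller, a caller-side helper might look like the sketch below. The class and method names are hypothetical and not part of Tika. The sketch checks only the two UTF-16 BOM byte sequences (FF FE and FE FF) and does not try to distinguish the UTF-32LE BOM (FF FE 00 00), which starts with the same two bytes.

```java
import java.util.Arrays;

/** Caller-side BOM handling sketch; helper names are illustrative only. */
final class BomStripSketch {
    /** Returns 2 if the array starts with a UTF-16 BOM (FF FE or FE FF), otherwise 0. */
    static int utf16BomLength(byte[] b) {
        if (b != null && b.length >= 2) {
            int b0 = b[0] & 0xFF;
            int b1 = b[1] & 0xFF;
            if ((b0 == 0xFF && b1 == 0xFE) || (b0 == 0xFE && b1 == 0xFF)) {
                return 2;
            }
        }
        return 0;
    }

    /** Returns the input without a leading UTF-16 BOM; returns the same array if none is present. */
    static byte[] stripUtf16Bom(byte[] b) {
        int n = utf16BomLength(b);
        return n == 0 ? b : Arrays.copyOfRange(b, n, b.length);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xFF, (byte) 0xFE, 0x48, 0x00};
        System.out.println(stripUtf16Bom(withBom).length); // 2
    }
}
```

To avoid the copy, the value returned by utf16BomLength could instead be passed as the offset to extract(byte[] input, int offset, int length), assuming that overload treats offset and length in the usual way.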
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final int
    NUM_COLUMNS
    Number of columns (even-offset vs odd-offset).
    static final int
    NUM_FEATURES
    Total feature-vector dimension: ranges * columns.
    static final int
    NUM_RANGES
    Number of byte-value ranges tracked.
  • Constructor Summary

    Constructors
    Constructor
    Description
    Utf16ColumnFeatureExtractor()
  • Method Summary

    Modifier and Type
    Method
    Description
    int[]
    extract(byte[] input)
    Extract features from the given input.
    int[]
    extract(byte[] input, int offset, int length)
    Extract from a sub-range of a byte array.
    int
    extractSparseInto(byte[] input, int[] dense, int[] touched)
    Sparse extraction into caller-owned, reusable buffers.
    static String
    featureLabel(int i)
    Human-readable label for feature index i (for debugging).
    int
    getNumBuckets()
    Returns the number of hash buckets (feature-vector dimension).

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
  • Field Details

    • NUM_RANGES

      public static final int NUM_RANGES
      Number of byte-value ranges tracked.
    • NUM_COLUMNS

      public static final int NUM_COLUMNS
      Number of columns (even-offset vs odd-offset).
    • NUM_FEATURES

      public static final int NUM_FEATURES
      Total feature-vector dimension: ranges * columns.
  • Constructor Details

    • Utf16ColumnFeatureExtractor

      public Utf16ColumnFeatureExtractor()
  • Method Details

    • extract

      public int[] extract(byte[] input)
      Description copied from interface: FeatureExtractor
      Extract features from the given input.
      Specified by:
      extract in interface FeatureExtractor<byte[]>
      Parameters:
      input - raw input (may be null)
      Returns:
      int array of length FeatureExtractor.getNumBuckets() with feature counts
    • extract

      public int[] extract(byte[] input, int offset, int length)
      Extract from a sub-range of a byte array.
    • extractSparseInto

      public int extractSparseInto(byte[] input, int[] dense, int[] touched)
      Sparse extraction into caller-owned, reusable buffers. For this small dense vector, "sparse" just means "write non-zero feature indices into touched". Buckets with zero count are not listed.
      Specified by:
      extractSparseInto in interface FeatureExtractor<byte[]>
      Parameters:
      input - raw bytes
      dense - scratch buffer of length NUM_FEATURES, all-zeros on entry; caller clears used entries afterwards
      touched - buffer receiving indices of non-zero features
      Returns:
      number of entries written into touched
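
The buffer contract described above (dense is all-zeros on entry, the caller clears the touched entries afterwards) can be sketched with a self-contained stand-in. The method below mimics the documented signature but is written for this example and is not the Tika implementation. The class name and the rangeOf helper are hypothetical.

```java
/** Stand-in following the documented contract, to show the caller-side reuse pattern. */
final class SparseBufferReuseSketch {
    static final int NUM_FEATURES = 12; // 6 ranges * 2 columns, per the class description

    /** Hypothetical helper: maps an unsigned byte value to one of the six documented ranges. */
    static int rangeOf(int b) {
        if (b == 0x00) return 0;
        if (b == 0x09 || b == 0x0A || b == 0x0D) return 2;
        if (b <= 0x1F) return 1;
        if (b <= 0x7E) return 2;
        if (b == 0x7F) return 3;
        if (b <= 0x9F) return 4;
        return 5;
    }

    /** Accumulates counts in dense and records each index the first time it becomes non-zero. */
    static int extractSparseInto(byte[] input, int[] dense, int[] touched) {
        int n = 0;
        for (int i = 0; i < input.length; i++) {
            int idx = rangeOf(input[i] & 0xFF) * 2 + (i & 1);
            if (dense[idx]++ == 0) {
                touched[n++] = idx;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        int[] dense = new int[NUM_FEATURES];   // allocated once, reused for every probe
        int[] touched = new int[NUM_FEATURES];
        byte[][] probes = {
            {0x48, 0x00, 0x69, 0x00}, // "Hi" as UTF-16LE
            {0x00, 0x48, 0x00, 0x69}  // "Hi" as UTF-16BE
        };
        for (byte[] probe : probes) {
            int n = extractSparseInto(probe, dense, touched);
            for (int k = 0; k < n; k++) {
                int idx = touched[k];
                System.out.println("feature " + idx + " = " + dense[idx]);
                dense[idx] = 0; // caller restores the all-zeros invariant
            }
        }
    }
}
```

Clearing only the touched entries keeps the per-probe cost proportional to the number of non-zero features instead of the full vector. With 12 features the saving is small, but the pattern matches the interface contract.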
    • getNumBuckets

      public int getNumBuckets()
      Specified by:
      getNumBuckets in interface FeatureExtractor<byte[]>
      Returns:
      number of hash buckets (feature-vector dimension)
    • featureLabel

      public static String featureLabel(int i)
      Human-readable label for feature index i (for debugging).
    • toString

      public String toString()
      Overrides:
      toString in class Object