org.apache.tika.ml.chardetect.Utf16ColumnFeatureExtractor

All Implemented Interfaces: FeatureExtractor<byte[]>

public class Utf16ColumnFeatureExtractor extends Object implements FeatureExtractor<byte[]>

Feature extractor for the UTF-16 specialist of the mixture-of-experts charset detector. Produces a small, dense, position-aware feature vector that is immune to HTML markup by construction: features capture the 2-byte alignment asymmetry that UTF-16 content produces and HTML content (which has no 2-byte alignment) cannot.

Feature vector

12 dense integer features: byte counts across six byte-value ranges, split by column (even-offset vs odd-offset in the probe). Indexing:

Index	Feature
0	count_even(0x00)
1	count_odd(0x00)
2	count_even(0x01-0x1F, excluding 0x09/0x0A/0x0D)
3	count_odd(0x01-0x1F, excluding 0x09/0x0A/0x0D)
4	count_even(0x20-0x7E, plus 0x09, 0x0A, 0x0D)
5	count_odd(0x20-0x7E, plus 0x09, 0x0A, 0x0D)
6	count_even(0x7F)
7	count_odd(0x7F)
8	count_even(0x80-0x9F)
9	count_odd(0x80-0x9F)
10	count_even(0xA0-0xFF)
11	count_odd(0xA0-0xFF)
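
The indexing above can be sketched as a small counting loop. This is an illustrative re-implementation based only on the table, not the Tika source: the class name `ColumnCountSketch`, the helper `rangeOf`, and the choice to return an all-zero vector for null input are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative re-implementation of the 12-feature layout; not the Tika source. */
class ColumnCountSketch {
    static final int NUM_RANGES = 6;
    static final int NUM_COLUMNS = 2;
    static final int NUM_FEATURES = NUM_RANGES * NUM_COLUMNS;

    /** Maps an unsigned byte value (0..255) to its range index (0..5), following the table. */
    static int rangeOf(int b) {
        if (b == 0x00) return 0;
        if (b == 0x09 || b == 0x0A || b == 0x0D) return 2; // tab, LF, CR count as printable
        if (b <= 0x1F) return 1;
        if (b <= 0x7E) return 2;
        if (b == 0x7F) return 3;
        if (b <= 0x9F) return 4;
        return 5;
    }

    /** Counts bytes per (range, column); feature index = range * 2 + column. */
    static int[] count(byte[] input) {
        int[] features = new int[NUM_FEATURES];
        if (input == null) return features; // assumption: null yields all zeros
        for (int i = 0; i < input.length; i++) {
            int column = i & 1; // 0 = even offset, 1 = odd offset
            features[rangeOf(input[i] & 0xFF) * NUM_COLUMNS + column]++;
        }
        return features;
    }

    public static void main(String[] args) {
        // "Hi\n" in UTF-16LE is 48 00 69 00 0A 00
        byte[] probe = "Hi\n".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(Arrays.toString(count(probe)));
        // prints [0, 3, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0]
    }
}
```

In the printed vector, index 4 (printable bytes at even offsets) and index 1 (zero bytes at odd offsets) carry all the mass, which is the little-endian signature for ASCII-range text.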

Why this is HTML-immune

HTML has no 2-byte alignment: tags are variable-length (<br> is 4 bytes, <div> is 5, </span> is 7), and entities and whitespace fall at arbitrary offsets. When content has no fixed byte alignment, any byte range has equal expected frequency at even and odd positions. The maxent model paired with this extractor learns weights that reward column asymmetry, so HTML, which produces near-zero asymmetry on every range, contributes near zero to every UTF-16 class logit.

UTF-16 has strict 2-byte alignment by definition. The "high byte" of every 16-bit code unit lands in one column, the "low byte" in the other. This alignment cannot be faked by non-UTF-16 content without deliberately constructing 2-byte-aligned patterns, which organic text content never does.
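
As a rough illustration of the contrast, the sketch below counts bytes by column for the same markup encoded as single-byte ASCII and as UTF-16LE. The class `AlignmentDemo` and the helper `byColumn` are hypothetical names for this example and are not part of the Tika API.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: counts bytes whose value lies in [lo, hi], split by column parity. */
class AlignmentDemo {
    /** Returns {count at even offsets, count at odd offsets}. */
    static int[] byColumn(byte[] bytes, int lo, int hi) {
        int[] counts = new int[2];
        for (int i = 0; i < bytes.length; i++) {
            int v = bytes[i] & 0xFF;
            if (v >= lo && v <= hi) counts[i & 1]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        String html = "<div><br>Hello</div>"; // 20 ASCII characters
        byte[] ascii = html.getBytes(StandardCharsets.US_ASCII);
        byte[] utf16le = html.getBytes(StandardCharsets.UTF_16LE); // no BOM is written

        int[] a = byColumn(ascii, 0x20, 0x7E);
        int[] p = byColumn(utf16le, 0x20, 0x7E);
        int[] z = byColumn(utf16le, 0x00, 0x00);

        System.out.println("ASCII printable even/odd:    " + a[0] + "/" + a[1]); // 10/10
        System.out.println("UTF-16LE printable even/odd: " + p[0] + "/" + p[1]); // 20/0
        System.out.println("UTF-16LE zero even/odd:      " + z[0] + "/" + z[1]); // 0/20
    }
}
```

The single-byte markup splits evenly between columns, while the UTF-16LE bytes put every printable byte in the even column and every zero byte in the odd column.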

Why raw counts instead of asymmetry ratios

The maxent model learns asymmetry weights naturally from raw counts: a positive weight on count_even(X) paired with a negative weight on count_odd(X) produces a dot-product proportional to count_even(X) - count_odd(X), which IS the asymmetry signal up to normalization. Explicit asymmetry features would add redundancy without adding information.
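
The argument can be checked with a few lines of arithmetic. In this sketch the weight values are made up for illustration (they are not trained Tika weights), and the class name `AsymmetryDotProduct` is hypothetical.

```java
/** Illustrative only: opposite-signed weights on a column pair yield the asymmetry. */
class AsymmetryDotProduct {
    /** Plain dot product of a weight vector and a raw-count feature vector. */
    static double dot(double[] weights, int[] features) {
        double sum = 0.0;
        for (int i = 0; i < features.length; i++) {
            sum += weights[i] * features[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical weights: +0.5 on count_even(0x00), -0.5 on count_odd(0x00), zero elsewhere.
        double[] weights = new double[12];
        weights[0] = 0.5;
        weights[1] = -0.5;

        // Hypothetical counts: 10 zero bytes at even offsets, 2 at odd offsets.
        int[] features = new int[12];
        features[0] = 10;
        features[1] = 2;

        // 0.5 * 10 - 0.5 * 2 = 0.5 * (10 - 2) = 4.0
        System.out.println(dot(weights, features)); // prints 4.0
    }
}
```

A model trained on raw counts is free to settle on such paired weights itself, which is why a separate asymmetry feature would carry no extra information.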

What it doesn't do

No UTF-32 detection. UTF-32 stays structural (4-byte alignment check) and doesn't need a statistical model.
No discrimination between UTF-16 content languages (Japanese vs Chinese vs Korean). CharSoup's language scoring handles that after decoding. The UTF-16 specialist returns only UTF-16-LE or UTF-16-BE.
No BOM handling. The caller is responsible for stripping any BOM before feeding bytes to this extractor.
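
Since BOM removal is left to the caller, a minimal pre-processing step might look like the sketch below. `BomStripper` and `stripUtf16Bom` are hypothetical names for this example. The sketch only checks for the two UTF-16 byte-order marks (FF FE and FE FF) and does not tell a UTF-16LE BOM apart from the UTF-32LE BOM (FF FE 00 00), on the assumption that UTF-32 has already been ruled out structurally.

```java
import java.util.Arrays;

/** Illustrative only: drops a leading UTF-16 BOM before feature extraction. */
class BomStripper {
    /** Returns the input without a leading FF FE or FE FF; otherwise returns it unchanged. */
    static byte[] stripUtf16Bom(byte[] input) {
        if (input != null && input.length >= 2) {
            int b0 = input[0] & 0xFF;
            int b1 = input[1] & 0xFF;
            boolean littleEndianBom = b0 == 0xFF && b1 == 0xFE;
            boolean bigEndianBom = b0 == 0xFE && b1 == 0xFF;
            if (littleEndianBom || bigEndianBom) {
                // Removing exactly 2 bytes keeps the even/odd column parity intact.
                return Arrays.copyOfRange(input, 2, input.length);
            }
        }
        return input;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xFF, (byte) 0xFE, 0x48, 0x00};
        System.out.println(Arrays.toString(stripUtf16Bom(withBom))); // prints [72, 0]
    }
}
```

The three-argument extract overload may allow skipping the BOM without a copy, but its offset and length semantics are not spelled out on this page, so the sketch copies instead.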

See Also:

LinearModel

Field Summary

Fields

Modifier and Type	Field	Description
static final int	NUM_COLUMNS	Number of columns (even-offset vs odd-offset).
static final int	NUM_FEATURES	Total feature-vector dimension: ranges * columns.
static final int	NUM_RANGES	Number of byte-value ranges tracked.

Constructor Summary

Constructors

Constructor	Description
Utf16ColumnFeatureExtractor()	

Method Summary

Modifier and Type	Method	Description
int[]	extract(byte[] input)	Extract features from the given input.
int[]	extract(byte[] input, int offset, int length)	Extract from a sub-range of a byte array.
int	extractSparseInto(byte[] input, int[] dense, int[] touched)	Sparse extraction into caller-owned, reusable buffers.
static String	featureLabel(int i)	Human-readable label for feature index i (for debugging).
int	getNumBuckets()	
String	toString()	

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Field Details
- NUM_RANGES
  
  public static final int NUM_RANGES
  
  Number of byte-value ranges tracked.
  See Also:
  
  Constant Field Values
- NUM_COLUMNS
  
  public static final int NUM_COLUMNS
  
  Number of columns (even-offset vs odd-offset).
  See Also:
  
  Constant Field Values
- NUM_FEATURES
  
  public static final int NUM_FEATURES
  
  Total feature-vector dimension: ranges * columns.
  See Also:
  
  Constant Field Values
Constructor Details
- Utf16ColumnFeatureExtractor
  
  public Utf16ColumnFeatureExtractor()
Method Details
- extract
  
  public int[] extract(byte[] input)
  
  Description copied from interface: FeatureExtractor
  
  Extract features from the given input.
  
  Specified by:
  
  extract in interface FeatureExtractor<byte[]>
  
  Parameters:
  
  input - raw input (may be null)
  
  Returns:
  
  int array of length FeatureExtractor.getNumBuckets() with feature counts
- extract
  
  public int[] extract(byte[] input, int offset, int length)
  
  Extract from a sub-range of a byte array.
- extractSparseInto
  
  public int extractSparseInto(byte[] input, int[] dense, int[] touched)
  
  Sparse extraction into caller-owned, reusable buffers. For this small dense vector, "sparse" just means "write non-zero feature indices into touched". Buckets with zero count are not listed.
  
  Specified by:
  
  extractSparseInto in interface FeatureExtractor<byte[]>
  
  Parameters:
  
  input - raw bytes
  
  dense - scratch buffer of length NUM_FEATURES, all-zeros on entry; caller clears used entries afterwards
  
  touched - buffer receiving indices of non-zero features
  
  Returns:
  
  number of entries written into touched
- getNumBuckets
  
  public int getNumBuckets()
  
  Specified by:
  
  getNumBuckets in interface FeatureExtractor<byte[]>
  
  Returns:
  
  number of hash buckets (feature-vector dimension)
- featureLabel
  
  public static String featureLabel(int i)
  
  Human-readable label for feature index i (for debugging).
- toString
  
  public String toString()
  
  Overrides:
  
  toString in class Object
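
The buffer contract described for extractSparseInto (a zeroed `dense` scratch array on entry, non-zero indices written to `touched`, the caller clearing used entries afterwards) can be illustrated with a stand-in. The method `sparseInto` below is a hypothetical re-implementation written from the parameter descriptions, not the Tika code. In particular, the order in which indices appear in `touched` is an assumption.

```java
/** Illustrative stand-in for the documented sparse-extraction buffer contract. */
class SparseContractSketch {
    static final int NUM_FEATURES = 12;

    /** Maps an unsigned byte value to its range index (0..5), per the feature table. */
    static int rangeOf(int b) {
        if (b == 0x00) return 0;
        if (b == 0x09 || b == 0x0A || b == 0x0D) return 2;
        if (b <= 0x1F) return 1;
        if (b <= 0x7E) return 2;
        if (b == 0x7F) return 3;
        if (b <= 0x9F) return 4;
        return 5;
    }

    /** Accumulates counts into dense and records each non-zero index once in touched. */
    static int sparseInto(byte[] input, int[] dense, int[] touched) {
        int n = 0;
        for (int i = 0; i < input.length; i++) {
            int idx = rangeOf(input[i] & 0xFF) * 2 + (i & 1);
            if (dense[idx]++ == 0) {
                touched[n++] = idx; // first hit on this feature: remember its index
            }
        }
        return n;
    }

    public static void main(String[] args) {
        int[] dense = new int[NUM_FEATURES];   // reused across calls, all zeros on entry
        int[] touched = new int[NUM_FEATURES]; // at most NUM_FEATURES distinct indices
        byte[] probe = {0x41, 0x00, 0x42, 0x00}; // "AB" in UTF-16LE

        int n = sparseInto(probe, dense, touched);
        for (int i = 0; i < n; i++) {
            int idx = touched[i];
            System.out.println("feature " + idx + " = " + dense[idx]);
            dense[idx] = 0; // caller restores the all-zeros state for the next call
        }
    }
}
```

Clearing only the touched entries keeps the reset cost proportional to the number of non-zero features. With 12 features the saving is small, which matches the remark that "sparse" here mostly means listing the non-zero indices.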
