Class Utf16ColumnFeatureExtractor
- All Implemented Interfaces:
FeatureExtractor<byte[]>
Feature vector
12 dense integer features: byte counts across six byte-value ranges, split by column (even-offset vs odd-offset in the probe). Indexing:
| Index | Feature |
|---|---|
| 0 | count_even(0x00) |
| 1 | count_odd(0x00) |
| 2 | count_even(0x01-0x1F, excluding 0x09/0x0A/0x0D) |
| 3 | count_odd(0x01-0x1F, excluding 0x09/0x0A/0x0D) |
| 4 | count_even(0x20-0x7E, plus 0x09, 0x0A, 0x0D) |
| 5 | count_odd(0x20-0x7E, plus 0x09, 0x0A, 0x0D) |
| 6 | count_even(0x7F) |
| 7 | count_odd(0x7F) |
| 8 | count_even(0x80-0x9F) |
| 9 | count_odd(0x80-0x9F) |
| 10 | count_even(0xA0-0xFF) |
| 11 | count_odd(0xA0-0xFF) |
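The indexing in the table can be sketched in a few lines. This is a minimal illustration, not the class's actual implementation: the class name `ColumnRangeCountSketch`, the helper `rangeOf`, the layout formula `range * 2 + column`, the constant values, and the all-zeros result for `null` input are assumptions inferred from the table and the "12 dense features" description.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative sketch only; names and null handling are assumptions. */
final class ColumnRangeCountSketch {
    static final int NUM_RANGES = 6;   // assumed from "six byte-value ranges"
    static final int NUM_COLUMNS = 2;  // even-offset vs odd-offset
    static final int NUM_FEATURES = NUM_RANGES * NUM_COLUMNS;

    /** Maps an unsigned byte value (0..255) to one of the six ranges in the table. */
    static int rangeOf(int b) {
        if (b == 0x00) return 0;
        if (b == 0x09 || b == 0x0A || b == 0x0D) return 2; // tab, LF, CR count with printable ASCII
        if (b <= 0x1F) return 1;                            // remaining C0 controls
        if (b <= 0x7E) return 2;                            // printable ASCII
        if (b == 0x7F) return 3;                            // DEL
        if (b <= 0x9F) return 4;                            // 0x80-0x9F
        return 5;                                           // 0xA0-0xFF
    }

    /** Counts bytes per (range, column); index = range * 2 + column, as in the table. */
    static int[] count(byte[] probe) {
        int[] features = new int[NUM_FEATURES];
        if (probe == null) return features; // assumption: null yields an all-zero vector
        for (int i = 0; i < probe.length; i++) {
            int range = rangeOf(probe[i] & 0xFF);
            int column = i & 1; // 0 = even offset, 1 = odd offset
            features[range * NUM_COLUMNS + column]++;
        }
        return features;
    }

    public static void main(String[] args) {
        byte[] probe = "Hi".getBytes(StandardCharsets.UTF_16LE); // 48 00 69 00
        // prints [0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0]
        System.out.println(Arrays.toString(count(probe)));
    }
}
```

For the two-character UTF-16LE probe, both printable bytes fall at even offsets (index 4) and both zero bytes at odd offsets (index 1).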
Why this is HTML-immune
HTML has no 2-byte alignment: tags are variable-length (<br> is
4 bytes, <div> is 5, </span> is 7), and entities and whitespace have
arbitrary lengths. Because byte offsets in such content are effectively
random in parity, any byte range has equal expected frequency at even
and odd positions. The maxent model paired with this extractor learns
weights that reward column asymmetry. HTML produces near-zero asymmetry
on every range, and therefore a near-zero contribution to every UTF-16
class logit.
UTF-16 has strict 2-byte alignment by definition. The "high byte" of every codepoint lands in one column, the "low byte" in the other. This alignment cannot be faked by non-UTF-16 content without deliberately constructing 2-byte-aligned patterns, which organic text content never does.
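The alignment argument can be checked with a small experiment. The sketch below, with a hypothetical class name `ColumnAsymmetryDemo`, encodes the same markup three ways using the standard `java.nio.charset.StandardCharsets` constants and counts zero bytes per column. It only illustrates the signal; it is not part of the extractor.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative sketch: where zero bytes land in UTF-16 vs single-byte markup. */
final class ColumnAsymmetryDemo {
    /** Returns {zero bytes at even offsets, zero bytes at odd offsets}. */
    static int[] nulCountsByColumn(byte[] bytes) {
        int[] counts = new int[2];
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == 0) counts[i & 1]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "<div>Hello</div>"; // 16 ASCII characters
        int[] le = nulCountsByColumn(text.getBytes(StandardCharsets.UTF_16LE));
        int[] be = nulCountsByColumn(text.getBytes(StandardCharsets.UTF_16BE));
        int[] ascii = nulCountsByColumn(text.getBytes(StandardCharsets.US_ASCII));
        System.out.println("UTF-16LE even/odd: " + le[0] + "/" + le[1]);
        System.out.println("UTF-16BE even/odd: " + be[0] + "/" + be[1]);
        System.out.println("ASCII    even/odd: " + ascii[0] + "/" + ascii[1]);
    }
}
```

For ASCII-range text, UTF-16LE puts every high (zero) byte at an odd offset and UTF-16BE puts it at an even offset, while the single-byte encoding contains no zero bytes at all.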
Why raw counts instead of asymmetry ratios
The maxent model learns asymmetry weights naturally from raw counts:
a positive weight on count_even(X) paired with a negative weight
on count_odd(X) produces a dot-product proportional to
count_even(X) - count_odd(X), which IS the asymmetry signal up
to normalization. Explicit asymmetry features would add redundancy
without adding information.
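A toy dot product makes the point concrete. The weights below are hand-picked for illustration (+1 on index 0, -1 on index 1) and are not the trained model's weights; the class name `AsymmetryWeightsSketch` is hypothetical.

```java
/** Illustrative sketch: opposite-signed weights on raw counts yield an asymmetry score. */
final class AsymmetryWeightsSketch {
    /** Plain linear score over a dense feature vector. */
    static double dot(double[] weights, int[] features) {
        double sum = 0.0;
        for (int i = 0; i < features.length; i++) {
            sum += weights[i] * features[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] w = new double[12];
        w[0] = 1.0;   // hand-picked weight on count_even(0x00)
        w[1] = -1.0;  // hand-picked weight on count_odd(0x00)

        int[] bigEndianLike = new int[12];
        bigEndianLike[0] = 16; // zero bytes at even offsets
        bigEndianLike[5] = 16; // printable bytes at odd offsets

        int[] htmlLike = new int[12];
        htmlLike[4] = 8; // printable bytes split evenly across columns
        htmlLike[5] = 8;

        System.out.println(dot(w, bigEndianLike)); // count_even(0x00) - count_odd(0x00)
        System.out.println(dot(w, htmlLike));      // no asymmetry on range 0x00
    }
}
```

The score equals count_even(0x00) - count_odd(0x00), so the linear model recovers the asymmetry without a dedicated ratio feature.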
What it doesn't do
- No UTF-32 detection. UTF-32 stays structural (4-byte alignment check) and doesn't need a statistical model.
- No discrimination between UTF-16 content languages (Japanese vs
Chinese vs Korean). CharSoup's language scoring handles that
after decoding. The UTF-16 specialist returns only
UTF-16-LE or UTF-16-BE.
- No BOM handling. The caller is responsible for stripping the BOM before feeding bytes to this extractor.
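Since BOM stripping is left to the caller, a caller-side helper might look like the sketch below. The class and method names are hypothetical. The sketch recognizes only the two UTF-16 byte order marks (FE FF and FF FE); it assumes UTF-32 has already been ruled out upstream, because the UTF-32LE BOM (FF FE 00 00) begins with the same two bytes.

```java
/** Illustrative caller-side sketch; not part of the extractor. */
final class Utf16BomSketch {
    /** Returns 2 if the buffer starts with a UTF-16 BOM (FE FF or FF FE), else 0. */
    static int utf16BomLength(byte[] bytes) {
        if (bytes == null || bytes.length < 2) {
            return 0;
        }
        int b0 = bytes[0] & 0xFF;
        int b1 = bytes[1] & 0xFF;
        boolean bigEndian = b0 == 0xFE && b1 == 0xFF;
        boolean littleEndian = b0 == 0xFF && b1 == 0xFE;
        // Assumption: UTF-32 was excluded earlier, so FF FE is not the start of FF FE 00 00.
        return (bigEndian || littleEndian) ? 2 : 0;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xFF, (byte) 0xFE, 0x48, 0x00};
        int skip = utf16BomLength(withBom);
        System.out.println("skip " + skip + " bytes, probe " + (withBom.length - skip) + " bytes");
    }
}
```

The returned length could then serve as the offset for the `extract(byte[] input, int offset, int length)` overload, with the probe length reduced by the same amount.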
Field Summary
| Modifier and Type | Field | Description |
|---|---|---|
| static final int | NUM_COLUMNS | Number of columns (even-offset vs odd-offset). |
| static final int | NUM_FEATURES | Total feature-vector dimension: ranges * columns. |
| static final int | NUM_RANGES | Number of byte-value ranges tracked. |
Constructor Summary
| Constructor |
|---|
| Utf16ColumnFeatureExtractor() |
Method Summary
| Modifier and Type | Method | Description |
|---|---|---|
| int[] | extract(byte[] input) | Extract features from the given input. |
| int[] | extract(byte[] input, int offset, int length) | Extract from a sub-range of a byte array. |
| int | extractSparseInto(byte[] input, int[] dense, int[] touched) | Sparse extraction into caller-owned, reusable buffers. |
| static String | featureLabel(int i) | Human-readable label for feature index i (for debugging). |
| int | getNumBuckets() | |
| String | toString() | |
-
Field Details
-
NUM_RANGES
public static final int NUM_RANGES
Number of byte-value ranges tracked.
-
NUM_COLUMNS
public static final int NUM_COLUMNS
Number of columns (even-offset vs odd-offset).
-
NUM_FEATURES
public static final int NUM_FEATURES
Total feature-vector dimension: ranges * columns.
-
-
Constructor Details
-
Utf16ColumnFeatureExtractor
public Utf16ColumnFeatureExtractor()
-
-
Method Details
-
extract
public int[] extract(byte[] input)
Description copied from interface: FeatureExtractor
Extract features from the given input.
- Specified by:
extract in interface FeatureExtractor<byte[]>
- Parameters:
input - raw input (may be null)
- Returns:
int array of length FeatureExtractor.getNumBuckets() with feature counts
-
extract
public int[] extract(byte[] input, int offset, int length)
Extract from a sub-range of a byte array.
extractSparseInto
public int extractSparseInto(byte[] input, int[] dense, int[] touched)
Sparse extraction into caller-owned, reusable buffers. For this small dense vector, "sparse" just means "write non-zero feature indices into touched". Buckets with zero count are not listed.
- Specified by:
extractSparseInto in interface FeatureExtractor<byte[]>
- Parameters:
input - raw bytes
dense - scratch buffer of length NUM_FEATURES, all-zeros on entry; caller clears used entries afterwards
touched - buffer receiving indices of non-zero features
- Returns:
number of entries written into touched
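The buffer contract for sparse extraction (all-zeros `dense` on entry, caller clears used entries afterwards) can be sketched from the caller's side. The nested `SparseExtractor` interface below is a hypothetical stand-in that only mirrors the documented method shape; it is not the library's `FeatureExtractor` interface, and `score` is an illustrative helper.

```java
/** Illustrative caller-side sketch of the reusable-buffer contract. */
final class SparseCallerSketch {
    /** Hypothetical stand-in mirroring the documented method shape. */
    interface SparseExtractor {
        int extractSparseInto(byte[] input, int[] dense, int[] touched);
    }

    /**
     * Scores one input with a linear model, then restores the all-zeros
     * invariant on dense so the same buffers can serve the next call.
     */
    static double score(SparseExtractor extractor, byte[] input,
                        double[] weights, int[] dense, int[] touched) {
        int n = extractor.extractSparseInto(input, dense, touched);
        double sum = 0.0;
        for (int k = 0; k < n; k++) {
            int idx = touched[k];
            sum += weights[idx] * dense[idx];
            dense[idx] = 0; // clear only the entries that were used
        }
        return sum;
    }
}
```

Clearing only the touched entries keeps the per-call cost proportional to the number of non-zero features rather than to the buffer length, which is the usual motivation for this style of API.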
-
getNumBuckets
public int getNumBuckets()
- Specified by:
getNumBuckets in interface FeatureExtractor<byte[]>
- Returns:
number of hash buckets (feature-vector dimension)
-
featureLabel
Human-readable label for feature index i (for debugging).
toString
-