Package org.apache.tika.ml
Interface FeatureExtractor<T>
- Type Parameters:
T - the raw input type (e.g. String for text, byte[] for raw bytes)
- All Known Implementing Classes:
Utf16ColumnFeatureExtractor
public interface FeatureExtractor<T>
Generic feature extractor that maps an input of type T to a fixed-length integer feature vector suitable for a LinearModel.
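To illustrate the fixed-length contract, the following is a minimal sketch of an implementation, not code from Tika. The interface is redeclared locally (only its two abstract methods) so the example compiles on its own, and ByteBucketExtractor, its bucket-by-modulo scheme, and its all-zero result for null input are assumptions made for illustration.

```java
// Local stand-in for the interface documented on this page (abstract methods only),
// declared here so the sketch compiles without Tika on the classpath.
interface FeatureExtractor<T> {
    int[] extract(T input);
    int getNumBuckets();
}

// Hypothetical extractor, not part of Tika: counts unsigned byte values modulo numBuckets.
final class ByteBucketExtractor implements FeatureExtractor<byte[]> {
    private final int numBuckets;

    ByteBucketExtractor(int numBuckets) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        this.numBuckets = numBuckets;
    }

    @Override
    public int[] extract(byte[] input) {
        // The result always has getNumBuckets() entries, whatever the input size.
        int[] counts = new int[numBuckets];
        if (input == null) {
            // Assumed handling of null: an all-zero vector.
            return counts;
        }
        for (byte b : input) {
            counts[(b & 0xFF) % numBuckets]++;
        }
        return counts;
    }

    @Override
    public int getNumBuckets() {
        return numBuckets;
    }
}
```

With four buckets, the bytes {1, 1, 2} produce the vector [0, 2, 1, 0], and a null input produces four zeros.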
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| int[] | extract(T input) | Extract features from the given input. |
| default int | extractSparseInto(T input, int[] dense, int[] touched) | Sparse extraction into caller-owned reusable buffers: populates dense with feature counts, writes the indices of non-zero entries into touched, and returns how many indices were written. |
| int | getNumBuckets() | Returns the number of hash buckets (feature-vector dimension). |
Method Details
extract
int[] extract(T input)
Extract features from the given input.
- Parameters:
input - raw input (may be null)
- Returns:
int array of length getNumBuckets() with feature counts
getNumBuckets
int getNumBuckets()
- Returns:
number of hash buckets (feature-vector dimension)
extractSparseInto
default int extractSparseInto(T input, int[] dense, int[] touched)
Sparse extraction into caller-owned reusable buffers: populates dense with feature counts, writes the indices of non-zero entries into touched, and returns how many indices were written. Callers are responsible for clearing the touched entries of dense before reuse.
Default implementation delegates to extract(T). Extractors that can do better (avoid allocating the full dense vector, or scan the input only once) should override.
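The buffer-reuse contract can be sketched from the caller's side as follows. This is an illustration under stated assumptions, not Tika's source: the interface is redeclared locally, the default method body is one plausible reading of "delegates to extract(T)", and SparseReuseDemo, its character-modulo-8 extractor, and the choice to size touched at getNumBuckets() are hypothetical.

```java
// Local stand-in for the documented interface so the sketch compiles on its own.
interface FeatureExtractor<T> {
    int[] extract(T input);
    int getNumBuckets();

    // Assumed shape of the default: copy the non-zero counts from extract(input)
    // into the caller's buffers and report how many indices were written.
    default int extractSparseInto(T input, int[] dense, int[] touched) {
        int[] full = extract(input);
        int n = 0;
        for (int i = 0; i < full.length; i++) {
            if (full[i] != 0) {
                dense[i] = full[i];
                touched[n++] = i;
            }
        }
        return n;
    }
}

final class SparseReuseDemo {
    // Hypothetical extractor: one bucket per character code modulo 8.
    static final FeatureExtractor<String> CHARS = new FeatureExtractor<String>() {
        @Override
        public int[] extract(String s) {
            int[] counts = new int[getNumBuckets()];
            if (s != null) {
                for (int i = 0; i < s.length(); i++) {
                    counts[s.charAt(i) % 8]++;
                }
            }
            return counts;
        }

        @Override
        public int getNumBuckets() {
            return 8;
        }
    };

    // Sums the feature counts of each input, reusing one pair of buffers throughout.
    static int[] totals(String... inputs) {
        int[] dense = new int[CHARS.getNumBuckets()];
        // At most getNumBuckets() entries can be non-zero, so this size suffices.
        int[] touched = new int[CHARS.getNumBuckets()];
        int[] out = new int[inputs.length];
        for (int k = 0; k < inputs.length; k++) {
            int n = CHARS.extractSparseInto(inputs[k], dense, touched);
            for (int j = 0; j < n; j++) {
                out[k] += dense[touched[j]];
            }
            // Caller's duty: zero only the touched entries before the next call.
            for (int j = 0; j < n; j++) {
                dense[touched[j]] = 0;
            }
        }
        return out;
    }
}
```

Clearing only the touched indices keeps the per-input cost proportional to the number of non-zero features rather than to the full vector length, which is the point of passing reusable buffers.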