Class AbstractEmbeddingFilter
- All Implemented Interfaces:
Closeable, Serializable, AutoCloseable
- Direct Known Subclasses:
OpenAIEmbeddingFilter
The pipeline is:
- Read source text from contentField in metadata
- Chunk it with MarkdownChunker
- Call embed(List, InferenceConfig) to get vectors
- Serialize chunks + vectors as JSON into outputField
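The four pipeline steps can be sketched in plain Java. This is only an illustration of the data flow, not Tika's implementation: `SketchChunk`, the blank-line chunker, the length-based "embedding", the JSON layout, and the field names `content` and `chunks` are all hypothetical stand-ins, and a `Map` stands in for the metadata object.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class EmbeddingPipelineSketch {

    /** Hypothetical stand-in for a chunk: text plus the vector filled in later. */
    static final class SketchChunk {
        final String text;
        float[] vector;

        SketchChunk(String text) {
            this.text = text;
        }
    }

    /** Stand-in chunker: splits on blank lines only (an assumption for this sketch). */
    static List<SketchChunk> chunk(String content) {
        List<SketchChunk> chunks = new ArrayList<>();
        for (String part : content.split("\\n\\s*\\n")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                chunks.add(new SketchChunk(trimmed));
            }
        }
        return chunks;
    }

    /** Stand-in embedder: a one-element vector holding the text length. */
    static void embed(List<SketchChunk> chunks) {
        for (SketchChunk c : chunks) {
            c.vector = new float[] {c.text.length()};
        }
    }

    /** Escapes quotes, backslashes, and control characters for a JSON string. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char ch : s.toCharArray()) {
            if (ch == '"') {
                sb.append("\\\"");
            } else if (ch == '\\') {
                sb.append("\\\\");
            } else if (ch < 0x20) {
                sb.append(String.format("\\u%04x", (int) ch));
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    /** Serializes chunks and vectors as a JSON array (layout is an assumption). */
    static String toJson(List<SketchChunk> chunks) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < chunks.size(); i++) {
            SketchChunk c = chunks.get(i);
            if (i > 0) {
                sb.append(',');
            }
            sb.append("{\"text\":\"").append(escape(c.text)).append("\",\"vector\":[");
            for (int j = 0; j < c.vector.length; j++) {
                if (j > 0) {
                    sb.append(',');
                }
                sb.append(c.vector[j]);
            }
            sb.append("]}");
        }
        return sb.append(']').toString();
    }

    /** Runs read, chunk, embed, serialize against a map standing in for metadata. */
    static void filter(Map<String, String> metadata, String contentField, String outputField) {
        String content = metadata.get(contentField);
        if (content == null || content.isBlank()) {
            return; // nothing to chunk
        }
        List<SketchChunk> chunks = chunk(content);
        embed(chunks);
        metadata.put(outputField, toJson(chunks));
    }
}
```

In the real class, the chunking step is handled by MarkdownChunker and the embedding step by the subclass's embed implementation; only the ordering of the steps is taken from the description above.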
The MarkdownChunker requires markdown-formatted content to split at
semantic boundaries (headings, paragraphs, etc.). The content handler type
must be set to MARKDOWN in the configuration. If the
TikaCoreProperties.TIKA_CONTENT_HANDLER_TYPE metadata field indicates
a different handler type, a warning is logged.
Subclasses implement embed(java.util.List<org.apache.tika.inference.Chunk>, org.apache.tika.inference.InferenceConfig) for their specific API format.
Thread safety: instances are safe for concurrent filter(java.util.List<org.apache.tika.metadata.Metadata>, org.apache.tika.parser.ParseContext) calls once
fully constructed. Setters must not be called concurrently with
filter(java.util.List<org.apache.tika.metadata.Metadata>, org.apache.tika.parser.ParseContext).
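The thread-safety contract amounts to a configure-then-share pattern: finish every setter call on one thread, then hand the instance to worker threads. The sketch below illustrates that pattern with a hypothetical stand-in class (`SketchFilter`) rather than a real Tika filter; its `filter` method and the chunk-count arithmetic are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ConfigureThenShareSketch {

    /** Hypothetical stand-in for a filter with a setter and a read-only filter call. */
    static final class SketchFilter {
        private int maxChunkChars = 1000; // arbitrary value for this sketch

        void setMaxChunkChars(int maxChunkChars) {
            this.maxChunkChars = maxChunkChars;
        }

        /** Returns how many chunks of at most maxChunkChars the content needs. */
        int filter(String content) {
            return (content.length() + maxChunkChars - 1) / maxChunkChars;
        }
    }

    static List<Integer> run(List<String> documents) {
        SketchFilter filter = new SketchFilter();
        // All setters are called here, before any other thread sees the instance.
        filter.setMaxChunkChars(4);

        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String doc : documents) {
                // Submitting a task happens-before its execution, so workers
                // observe the configuration written above.
                Callable<Integer> task = () -> filter.filter(doc);
                futures.add(pool.submit(task));
            }
            List<Integer> counts = new ArrayList<>();
            for (Future<Integer> f : futures) {
                counts.add(f.get());
            }
            return counts;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling a setter from one thread while another is inside `filter` is exactly the case the contract rules out.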
-
Constructor Summary
Constructors
Modifier | Constructor | Description
protected | AbstractEmbeddingFilter() |
protected | AbstractEmbeddingFilter |
-
Method Summary
Modifier and Type | Method | Description
protected abstract void | embed(List<Chunk> chunks, InferenceConfig config) | Call the embeddings endpoint to fill in vectors on each chunk.
void | filter(List<Metadata> metadataList, ParseContext parseContext) | Filters the metadata list in place using per-request context.
int | getMaxBatchSize() |
int | getMaxChunkChars() |
int | getMaxChunks() |
 | getModel() |
int | getOverlapChars() |
int | getTimeoutSeconds() |
boolean | isClearContentAfterChunking() |
boolean | isSkipEmbedding() |
void | setApiKey |
void | setBaseUrl(String baseUrl) |
void | setClearContentAfterChunking(boolean clearContentAfterChunking) |
void | setContentField(String contentField) |
void | setMaxBatchSize(int maxBatchSize) |
void | setMaxChunkChars(int maxChunkChars) |
void | setMaxChunks(int maxChunks) |
void | setModel |
void | setOutputField(String outputField) |
void | setOverlapChars(int overlapChars) |
void | setSkipEmbedding(boolean skipEmbedding) |
void | setTimeoutSeconds(int timeoutSeconds) |
Methods inherited from class org.apache.tika.metadata.filter.MetadataFilter
close, filter
-
Constructor Details
-
AbstractEmbeddingFilter
protected AbstractEmbeddingFilter() -
AbstractEmbeddingFilter
-
-
Method Details
-
embed
protected abstract void embed(List<Chunk> chunks, InferenceConfig config) throws IOException, TikaException
Call the embeddings endpoint to fill in vectors on each chunk. Implementations should call Chunk.setVector(float[]) on each chunk in the list.
- Parameters:
chunks - the text chunks to embed
config - the resolved config for this call
- Throws:
IOException - on HTTP errors
TikaException - on API-level errors
-
filter
Description copied from class: MetadataFilter
Filters the metadata list in place using per-request context. The list and the metadata objects within it may be modified. Callers must pass a mutable list and should make a defensive copy before calling if the original data must be preserved.
- Specified by:
filter in class MetadataFilter
- Parameters:
metadataList - the list to filter (must be mutable)
parseContext - per-request context (e.g. skip flags, runtime config)
- Throws:
TikaException - if filtering fails
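Because both the list and its elements may be modified, a defensive copy has to duplicate the elements as well as the list. The sketch below shows the idea with `Map<String, String>` as a hypothetical stand-in for Metadata; `filterInPlace` and the `content` and `processed` keys are invented for illustration and do not reflect what the real filter writes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DefensiveCopySketch {

    /** Stand-in for an in-place filter: drops records without content, marks the rest. */
    static void filterInPlace(List<Map<String, String>> records) {
        records.removeIf(r -> !r.containsKey("content"));
        for (Map<String, String> r : records) {
            r.put("processed", "true");
        }
    }

    /** Copies the list and each element so the caller's data stays untouched. */
    static List<Map<String, String>> filterPreservingOriginal(List<Map<String, String>> original) {
        List<Map<String, String>> copy = new ArrayList<>(); // mutable list
        for (Map<String, String> r : original) {
            copy.add(new HashMap<>(r)); // mutable copy of each element
        }
        filterInPlace(copy);
        return copy;
    }
}
```

Passing an immutable list such as `List.of(...)` directly to an in-place filter like this one would fail with `UnsupportedOperationException`, which is why the copy is built as an `ArrayList`.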
-
getDefaultConfig
-
getBaseUrl
-
setBaseUrl
- Throws:
TikaConfigException
-
getModel
-
setModel
-
getApiKey
-
setApiKey
- Throws:
TikaConfigException
-
getTimeoutSeconds
public int getTimeoutSeconds() -
setTimeoutSeconds
public void setTimeoutSeconds(int timeoutSeconds) -
getMaxChunkChars
public int getMaxChunkChars() -
setMaxChunkChars
public void setMaxChunkChars(int maxChunkChars) -
getOverlapChars
public int getOverlapChars() -
setOverlapChars
public void setOverlapChars(int overlapChars) -
getContentField
-
setContentField
-
getOutputField
-
setOutputField
-
isSkipEmbedding
public boolean isSkipEmbedding() -
setSkipEmbedding
public void setSkipEmbedding(boolean skipEmbedding) -
isClearContentAfterChunking
public boolean isClearContentAfterChunking() -
setClearContentAfterChunking
public void setClearContentAfterChunking(boolean clearContentAfterChunking) -
getMaxBatchSize
public int getMaxBatchSize() -
setMaxBatchSize
public void setMaxBatchSize(int maxBatchSize) -
getMaxChunks
public int getMaxChunks() -
setMaxChunks
public void setMaxChunks(int maxChunks)
-