Class AbstractEmbeddingFilter

java.lang.Object
org.apache.tika.metadata.filter.MetadataFilter
org.apache.tika.inference.AbstractEmbeddingFilter
All Implemented Interfaces:
Closeable, Serializable, AutoCloseable
Direct Known Subclasses:
OpenAIEmbeddingFilter

public abstract class AbstractEmbeddingFilter extends MetadataFilter
Base class for metadata filters that chunk text content and call a remote embeddings endpoint to produce vectors for each chunk.

The pipeline is:

  1. Read source text from contentField in metadata
  2. Chunk it with MarkdownChunker
  3. Call embed(List, InferenceConfig) to get vectors
  4. Serialize chunks + vectors as JSON into outputField

The MarkdownChunker requires markdown-formatted content to split at semantic boundaries (headings, paragraphs, etc.). The content handler type must be set to MARKDOWN in the configuration. If the TikaCoreProperties.TIKA_CONTENT_HANDLER_TYPE metadata field indicates a different handler type, a warning is logged.
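The four pipeline steps can be sketched offline as follows. This is only an illustration under stated assumptions: `PipelineSketch`, `SketchChunk`, the blank-line splitter, the fake vectors, and the JSON shape are all hypothetical stand-ins, not Tika classes or the documented output schema. A plain `Map` stands in for a metadata object.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins only: nothing in this file is a Tika class.
final class PipelineSketch {

    /** Stand-in chunk: text plus the vector filled in by the embed step. */
    static final class SketchChunk {
        final String text;
        float[] vector;
        SketchChunk(String text) { this.text = text; }
    }

    /** Step 2 stand-in: split on blank lines. The real MarkdownChunker is not reproduced here. */
    static List<SketchChunk> chunk(String markdown) {
        List<SketchChunk> chunks = new ArrayList<>();
        for (String block : markdown.split("\\n\\s*\\n")) {
            String trimmed = block.trim();
            if (!trimmed.isEmpty()) {
                chunks.add(new SketchChunk(trimmed));
            }
        }
        return chunks;
    }

    /** Step 3 stand-in: a fake "embedding" so the example runs without a network call. */
    static void embed(List<SketchChunk> chunks) {
        for (SketchChunk c : chunks) {
            c.vector = new float[] {c.text.length(), c.text.split("\\s+").length};
        }
    }

    /** Step 4 stand-in: hand-rolled JSON with an assumed shape. */
    static String toJson(List<SketchChunk> chunks) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < chunks.size(); i++) {
            SketchChunk c = chunks.get(i);
            if (i > 0) sb.append(',');
            sb.append("{\"text\":\"").append(escape(c.text)).append("\",\"vector\":[");
            for (int j = 0; j < c.vector.length; j++) {
                if (j > 0) sb.append(',');
                sb.append(c.vector[j]);
            }
            sb.append("]}");
        }
        return sb.append(']').toString();
    }

    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    /** Runs steps 1 to 4 against a map that stands in for a metadata object. */
    static void run(Map<String, String> metadata, String contentField, String outputField) {
        String text = metadata.get(contentField);      // step 1: read source text
        if (text == null || text.isEmpty()) {
            return;                                    // nothing to chunk
        }
        List<SketchChunk> chunks = chunk(text);        // step 2: chunk
        embed(chunks);                                 // step 3: fill in vectors
        metadata.put(outputField, toJson(chunks));     // step 4: serialize into output field
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("content", "# Title\n\nFirst paragraph.");
        run(metadata, "content", "embeddings");
        System.out.println(metadata.get("embeddings"));
    }
}
```

The field names `content` and `embeddings` in `main` are placeholders; the real defaults for contentField and outputField are not stated in this page.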

Subclasses implement embed(List<Chunk>, InferenceConfig) for their specific API format.

Thread safety: instances are safe for concurrent filter(List<Metadata>, ParseContext) calls once fully constructed. Setters must not be called concurrently with filter(List<Metadata>, ParseContext).
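One way to respect this contract is to finish all configuration on a single thread and only then hand the instance to worker threads. The sketch below illustrates that ordering with a hypothetical `SketchFilter` stand-in (not a Tika class) whose `filter` method only reads configuration; submitting tasks to an `ExecutorService` publishes the earlier setter calls to the workers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SharedFilterSketch {

    /** Hypothetical stand-in for a configured filter; not a Tika class. */
    static final class SketchFilter {
        private int maxChunkChars = 100;

        /** Configuration: call only before the instance is shared. */
        void setMaxChunkChars(int maxChunkChars) {
            this.maxChunkChars = maxChunkChars;
        }

        /** Reads configuration only; returns how many chunks the content would need. */
        int filter(String content) {
            return (content.length() + maxChunkChars - 1) / maxChunkChars;
        }
    }

    /** Shares one already-configured filter across a small thread pool. */
    static List<Integer> runConcurrently(SketchFilter filter, List<String> docs) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (String doc : docs) {
                tasks.add(() -> filter.filter(doc));
            }
            List<Integer> results = new ArrayList<>();
            // invokeAll returns futures in the same order as the submitted tasks.
            for (Future<Integer> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        SketchFilter filter = new SketchFilter();
        filter.setMaxChunkChars(10);   // configure first, on one thread
        List<String> docs = new ArrayList<>();
        docs.add("abcdefghijklmnopqrstuvwxy");
        docs.add("abcdefghij");
        System.out.println(runConcurrently(filter, docs));   // then share
    }
}
```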

  • Constructor Details

    • AbstractEmbeddingFilter

      protected AbstractEmbeddingFilter()
    • AbstractEmbeddingFilter

      protected AbstractEmbeddingFilter(InferenceConfig config)
  • Method Details

    • embed

      protected abstract void embed(List<Chunk> chunks, InferenceConfig config) throws IOException, TikaException
      Call the embeddings endpoint to fill in vectors on each chunk. Implementations should set Chunk.setVector(float[]) on each chunk in the list.
      Parameters:
      chunks - the text chunks to embed
      config - the resolved config for this call
      Throws:
      IOException - on HTTP errors
      TikaException - on API-level errors
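The contract described for embed (every chunk in the list ends up with a vector) can be illustrated with stand-in types. Everything below is a hypothetical sketch: `SketchChunk`, `SketchConfig`, `SketchEmbeddingFilter`, and `FakeEmbeddingFilter` only mirror the shape of the documented signature, `getText()` and `getVector()` are assumed accessors, and the fake vectors replace what a real subclass would parse from an HTTP response.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class EmbedContractSketch {

    /** Hypothetical chunk; only setVector(float[]) is named in the documentation above. */
    static final class SketchChunk {
        private final String text;
        private float[] vector;
        SketchChunk(String text) { this.text = text; }
        String getText() { return text; }
        float[] getVector() { return vector; }
        void setVector(float[] vector) { this.vector = vector; }
    }

    /** Hypothetical config carrying just a model name. */
    static final class SketchConfig {
        final String model;
        SketchConfig(String model) { this.model = model; }
    }

    /** Mirrors the shape of the abstract embed method with stand-in types. */
    abstract static class SketchEmbeddingFilter {
        protected abstract void embed(List<SketchChunk> chunks, SketchConfig config) throws IOException;

        /** Calls embed, then checks that the implementation set a vector on every chunk. */
        void embedAndVerify(List<SketchChunk> chunks, SketchConfig config) throws IOException {
            embed(chunks, config);
            for (SketchChunk chunk : chunks) {
                if (chunk.getVector() == null) {
                    throw new IllegalStateException("embed() left a chunk without a vector");
                }
            }
        }
    }

    /** Offline implementation: a real subclass would call its provider's endpoint here. */
    static final class FakeEmbeddingFilter extends SketchEmbeddingFilter {
        @Override
        protected void embed(List<SketchChunk> chunks, SketchConfig config) throws IOException {
            for (SketchChunk chunk : chunks) {
                chunk.setVector(new float[] {chunk.getText().length(), config.model.length()});
            }
        }
    }

    /** Convenience wrapper that builds chunks from raw text and embeds them. */
    static List<SketchChunk> embedAll(List<String> texts, String model) {
        List<SketchChunk> chunks = new ArrayList<>();
        for (String text : texts) {
            chunks.add(new SketchChunk(text));
        }
        try {
            new FakeEmbeddingFilter().embedAndVerify(chunks, new SketchConfig(model));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chunks;
    }
}
```

Checking for missing vectors right after the call is a design choice of this sketch, not documented behavior of the base class.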
    • filter

      public void filter(List<Metadata> metadataList, ParseContext parseContext) throws TikaException
      Description copied from class: MetadataFilter
      Filters the metadata list in place using per-request context. The list and the metadata objects within it may be modified. Callers must pass a mutable list and should make a defensive copy before calling if the original data must be preserved.
      Specified by:
      filter in class MetadataFilter
      Parameters:
      metadataList - the list to filter (must be mutable)
      parseContext - per-request context (e.g. skip flags, runtime config)
      Throws:
      TikaException - if filtering fails
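Because both the list and the objects inside it may be modified, a shallow copy of the list is not enough to preserve the original data. The sketch below shows the idea with plain maps standing in for metadata objects; `DefensiveCopySketch`, `deepCopy`, and `filterInPlace` are hypothetical names, and the `embedded` key is invented for the illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DefensiveCopySketch {

    /** Copies the list and each entry, so the filter cannot touch the originals. */
    static List<Map<String, String>> deepCopy(List<Map<String, String>> original) {
        List<Map<String, String>> copy = new ArrayList<>(original.size());
        for (Map<String, String> entry : original) {
            copy.add(new LinkedHashMap<>(entry));
        }
        return copy;
    }

    /** Stand-in for an in-place filter: drops empty entries and rewrites the rest. */
    static void filterInPlace(List<Map<String, String>> metadataList) {
        metadataList.removeIf(Map::isEmpty);         // requires a mutable list
        for (Map<String, String> entry : metadataList) {
            entry.put("embedded", "true");           // mutates the entry itself
        }
    }

    public static void main(String[] args) {
        List<Map<String, String>> original = new ArrayList<>();
        Map<String, String> first = new LinkedHashMap<>();
        first.put("content", "some text");
        original.add(first);
        original.add(new LinkedHashMap<>());

        List<Map<String, String>> working = deepCopy(original);
        filterInPlace(working);

        System.out.println("original: " + original);
        System.out.println("filtered: " + working);
    }
}
```

Passing an unmodifiable list to `filterInPlace` would fail at `removeIf` with an `UnsupportedOperationException`, which is the kind of failure the "must be mutable" requirement guards against.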
    • getDefaultConfig

      public InferenceConfig getDefaultConfig()
    • getBaseUrl

      public String getBaseUrl()
    • setBaseUrl

      public void setBaseUrl(String baseUrl) throws TikaConfigException
      Throws:
      TikaConfigException
    • getModel

      public String getModel()
    • setModel

      public void setModel(String model)
    • getApiKey

      public String getApiKey()
    • setApiKey

      public void setApiKey(String apiKey) throws TikaConfigException
      Throws:
      TikaConfigException
    • getTimeoutSeconds

      public int getTimeoutSeconds()
    • setTimeoutSeconds

      public void setTimeoutSeconds(int timeoutSeconds)
    • getMaxChunkChars

      public int getMaxChunkChars()
    • setMaxChunkChars

      public void setMaxChunkChars(int maxChunkChars)
    • getOverlapChars

      public int getOverlapChars()
    • setOverlapChars

      public void setOverlapChars(int overlapChars)
    • getContentField

      public String getContentField()
    • setContentField

      public void setContentField(String contentField)
    • getOutputField

      public String getOutputField()
    • setOutputField

      public void setOutputField(String outputField)
    • isSkipEmbedding

      public boolean isSkipEmbedding()
    • setSkipEmbedding

      public void setSkipEmbedding(boolean skipEmbedding)
    • isClearContentAfterChunking

      public boolean isClearContentAfterChunking()
    • setClearContentAfterChunking

      public void setClearContentAfterChunking(boolean clearContentAfterChunking)
    • getMaxBatchSize

      public int getMaxBatchSize()
    • setMaxBatchSize

      public void setMaxBatchSize(int maxBatchSize)
    • getMaxChunks

      public int getMaxChunks()
    • setMaxChunks

      public void setMaxChunks(int maxChunks)