Class OpenAIImageEmbeddingParser
- All Implemented Interfaces:
Closeable,Serializable,AutoCloseable,Initializable,SelfConfiguring,Parser
Parser that sends images to an OpenAI-compatible embeddings endpoint
(/v1/embeddings with image input) and
stores the resulting vector in metadata.
This parser registers for the same image/ocr-* media types
used by the PDF renderer's OCR pipeline, so it slots into the
existing ocrStrategy mechanism. When configured, each
rendered page image is sent to the embedding endpoint and the
vector is stored as a serialized Chunk with a
PaginatedLocator (when page number metadata is available).
The image is sent in the Jina CLIP format:
{"input": [{"image": "data:image/png;base64,..."}]}.
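The request shape above can be reproduced with the JDK alone. The following is a minimal sketch, not the parser's actual implementation: the class name ImageEmbeddingPayload and the helper buildPayload are hypothetical, and a real request may carry further fields (for example a model name) that this page does not spell out.

```java
import java.util.Base64;

public class ImageEmbeddingPayload {

    // Hypothetical helper: wraps PNG bytes in a data URI and places it in the
    // {"input": [{"image": ...}]} structure described above.
    static String buildPayload(byte[] pngBytes) {
        // Standard Base64 output uses only A-Z, a-z, 0-9, '+', '/', '=',
        // so it needs no JSON escaping.
        String base64 = Base64.getEncoder().encodeToString(pngBytes);
        return "{\"input\": [{\"image\": \"data:image/png;base64," + base64 + "\"}]}";
    }

    public static void main(String[] args) {
        // Placeholder bytes standing in for a rendered page image.
        byte[] fakeImage = new byte[] {1, 2, 3};
        System.out.println(buildPayload(fakeImage));
    }
}
```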
Configuration key: "openai-image-embedding-parser"
Thread safety: instances are safe for concurrent parse(TikaInputStream, ContentHandler, Metadata, ParseContext) calls once
fully constructed. Setters must not be called concurrently with
parse.
- Since:
- Apache Tika 4.0
Constructor Summary
Constructors -
Method Summary
Modifier and Type | Method | Description
void | close() |
 | getApiKey() |
 | getApiKeyHeaderName() |
 | getApiKeyPrefix() |
 | getBaseUrl() |
 | getEmbeddingsPath() |
long | getMaxFileSizeToEmbed() |
long | getMinFileSizeToEmbed() |
 | getModel() |
 | getSupportedTypes(ParseContext context) | Returns the set of media types supported by this parser when used with the given parse context.
int | getTimeoutSeconds() |
void | initialize | Called after all properties have been set to allow for validation and initialization that depends on multiple properties.
boolean | isSkipEmbedding() |
void | parse(TikaInputStream tis, ContentHandler handler, Metadata metadata, ParseContext parseContext) | Parses a document stream into a sequence of XHTML SAX events.
void | setApiKey |
void | setApiKeyHeaderName(String apiKeyHeaderName) | Set the HTTP header name for API key authentication.
void | setApiKeyPrefix(String apiKeyPrefix) | Set the prefix prepended to the API key in the auth header.
void | setBaseUrl(String baseUrl) |
void | setEmbeddingsPath(String embeddingsPath) | Set the URL path for embeddings requests.
void | setMaxFileSizeToEmbed(long maxFileSizeToEmbed) |
void | setMinFileSizeToEmbed(long minFileSizeToEmbed) |
void | setModel |
void | setSkipEmbedding(boolean skipEmbedding) |
void | setTimeoutSeconds(int timeoutSeconds) |
-
Constructor Details
-
OpenAIImageEmbeddingParser
public OpenAIImageEmbeddingParser()
-
OpenAIImageEmbeddingParser
-
OpenAIImageEmbeddingParser
-
-
Method Details
-
getSupportedTypes
Description copied from interface: Parser
Returns the set of media types supported by this parser when used with the given parse context.
- Specified by:
getSupportedTypes in interface Parser
- Parameters:
context - parse context
- Returns:
- immutable set of media types
-
parse
public void parse(TikaInputStream tis, ContentHandler handler, Metadata metadata, ParseContext parseContext) throws IOException, SAXException, TikaException
Description copied from interface: Parser
Parses a document stream into a sequence of XHTML SAX events. Fills in related document metadata in the given metadata object.
The given document stream is consumed but not closed by this method. The responsibility to close the stream remains on the caller.
Information about the parsing context can be passed in the context parameter. See the parser implementations for the kinds of context information they expect.
- Specified by:
parse in interface Parser
- Parameters:
handler - handler for the XHTML SAX events (output)
metadata - document metadata (input and output)
parseContext - parse context
- Throws:
IOException - if the document stream could not be read
SAXException - if the SAX events could not be processed
TikaException - if the document could not be parsed
-
initialize
Description copied from interface: Initializable
Called after all properties have been set to allow for validation and initialization that depends on multiple properties.
- Specified by:
initialize in interface Initializable
- Throws:
TikaConfigException - if there is a problem with the configuration
-
close
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Throws:
IOException
-
getBaseUrl
-
setBaseUrl
- Throws:
TikaConfigException
-
getModel
-
setModel
-
getApiKey
-
setApiKey
- Throws:
TikaConfigException
-
getTimeoutSeconds
public int getTimeoutSeconds()
-
setTimeoutSeconds
public void setTimeoutSeconds(int timeoutSeconds)
-
isSkipEmbedding
public boolean isSkipEmbedding()
-
setSkipEmbedding
public void setSkipEmbedding(boolean skipEmbedding)
-
getMinFileSizeToEmbed
public long getMinFileSizeToEmbed()
-
setMinFileSizeToEmbed
public void setMinFileSizeToEmbed(long minFileSizeToEmbed)
-
getMaxFileSizeToEmbed
public long getMaxFileSizeToEmbed()
-
setMaxFileSizeToEmbed
public void setMaxFileSizeToEmbed(long maxFileSizeToEmbed)
-
getEmbeddingsPath
-
setEmbeddingsPath
Set the URL path for embeddings requests. Default is /v1/embeddings. For Azure OpenAI, use /openai/deployments/{deployment}/embeddings?api-version=2024-02-01.
-
getApiKeyHeaderName
-
setApiKeyHeaderName
Set the HTTP header name for API key authentication. Default is Authorization. For Azure OpenAI, set to api-key.
-
getApiKeyPrefix
-
setApiKeyPrefix
Set the prefix prepended to the API key in the auth header. Default is "Bearer " (with trailing space). For Azure OpenAI, set to "" (empty string).
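To illustrate how the embeddings path, API key header name, and API key prefix might combine into an HTTP request, here is a minimal sketch using the JDK's java.net.http API. It is an assumption about how such settings typically fit together, not the parser's actual code: the class EmbeddingRequestSketch, the method build, the host names, the deployment name my-deployment, and the key values are all hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class EmbeddingRequestSketch {

    // Hypothetical helper: joins base URL and embeddings path, and sends the
    // key as "<prefix><apiKey>" under the configured header name.
    static HttpRequest build(String baseUrl, String embeddingsPath,
                             String apiKeyHeaderName, String apiKeyPrefix,
                             String apiKey, String jsonBody) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + embeddingsPath))
                .header("Content-Type", "application/json")
                .header(apiKeyHeaderName, apiKeyPrefix + apiKey)
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) {
        // Documented defaults: /v1/embeddings, Authorization, "Bearer ".
        HttpRequest openAiStyle = build("https://api.example.com", "/v1/embeddings",
                "Authorization", "Bearer ", "sk-placeholder", "{}");

        // Azure-style settings from the setter descriptions; the deployment
        // name is a made-up placeholder for {deployment}.
        HttpRequest azureStyle = build("https://example.openai.azure.com",
                "/openai/deployments/my-deployment/embeddings?api-version=2024-02-01",
                "api-key", "", "azure-placeholder", "{}");

        System.out.println(openAiStyle.uri() + " " + openAiStyle.headers().map());
        System.out.println(azureStyle.uri() + " " + azureStyle.headers().map());
    }
}
```

With the Azure-style values the key is sent bare under api-key, which is why the prefix is set to the empty string rather than left at "Bearer ".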
-