# VLM (Vision-Language Model) Parsers
Tika includes a family of parsers that delegate OCR and document understanding to remote Vision-Language Model (VLM) endpoints. These parsers send images (or PDFs) to an external API and convert the model’s markdown response into structured XHTML.
Three implementations are provided out of the box:
| Parser | Endpoint | Config key | SPI auto-loaded? |
|---|---|---|---|
| `OpenAIVLMParser` | Any OpenAI-compatible chat completions endpoint (vLLM, Ollama, local FastAPI, OpenAI) | `openai-vlm-parser` | Yes |
| `ClaudeVLMParser` | Anthropic Messages API | `claude-vlm-parser` | No |
| `GeminiVLMParser` | Google Gemini | `gemini-vlm-parser` | No |
`OpenAIVLMParser` is the only parser loaded by the default parser via SPI. `ClaudeVLMParser` and `GeminiVLMParser` must be explicitly added to your configuration.
## Supported input types

All three parsers handle standard OCR image types (`image/ocr-png`, `image/ocr-jpeg`, etc.). `ClaudeVLMParser` and `GeminiVLMParser` additionally declare `application/pdf` support, meaning they can process PDFs natively using the model's vision capabilities.
## Module dependency

The VLM parsers live in the `tika-parser-vlm-ocr-module` artifact. Add it to your project:

```xml
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parser-vlm-ocr-module</artifactId>
  <version>${tika.version}</version>
</dependency>
```
> **Note:** To run a local open-source VLM without cloud API keys, see Running a Local VLM Server.
## OpenAI-compatible (vLLM, Ollama, etc.)

### Basic Configuration
```json
{
  "parsers": [
    {
      "openai-vlm-parser": {
        "baseUrl": "http://127.0.0.1:8000",
        "model": "jinaai/jina-vlm",
        "timeoutSeconds": 300
      }
    }
  ]
}
```
### Full Configuration
```json
{
  "parsers": [
    {
      "openai-vlm-parser": {
        "baseUrl": "http://127.0.0.1:8000",
        "model": "jinaai/jina-vlm",
        "prompt": "Extract all visible text from this image. Return the text in markdown format, preserving the original structure (headings, lists, tables, paragraphs). Do not describe the image. Only return the extracted text.",
        "maxTokens": 4096,
        "timeoutSeconds": 300,
        "apiKey": "",
        "inlineContent": true,
        "skipOcr": false,
        "minFileSizeToOcr": 0,
        "maxFileSizeToOcr": 52428800
      }
    }
  ]
}
```
The `OpenAIVLMParser` works with any server that exposes a `/v1/chat/completions` endpoint in the OpenAI format, including vLLM, Ollama, local FastAPI servers, and OpenAI itself.

Authentication uses a standard `Authorization: Bearer <apiKey>` header. Leave `apiKey` empty to skip authentication (typical for local servers).
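For context, a request in the OpenAI vision format looks roughly like the following. This is a sketch, not the exact body Tika sends; the base64 payload is truncated and the prompt abbreviated:

```json
{
  "model": "jinaai/jina-vlm",
  "max_tokens": 4096,
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Extract all visible text from this image..." },
        {
          "type": "image_url",
          "image_url": { "url": "data:image/png;base64,iVBORw0KGgo..." }
        }
      ]
    }
  ]
}
```

Any server that accepts this request shape should work with the parser.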
## Anthropic Claude

### Basic Configuration
```json
{
  "parsers": [
    {
      "claude-vlm-parser": {
        "apiKey": "sk-ant-your-key-here",
        "model": "claude-sonnet-4-20250514"
      }
    }
  ]
}
```
### Full Configuration
```json
{
  "parsers": [
    {
      "claude-vlm-parser": {
        "baseUrl": "https://api.anthropic.com",
        "model": "claude-sonnet-4-20250514",
        "prompt": "Extract all visible text from this image. Return the text in markdown format, preserving the original structure (headings, lists, tables, paragraphs). Do not describe the image. Only return the extracted text.",
        "maxTokens": 4096,
        "timeoutSeconds": 300,
        "apiKey": "sk-ant-your-key-here",
        "inlineContent": true,
        "skipOcr": false,
        "minFileSizeToOcr": 0,
        "maxFileSizeToOcr": 52428800
      }
    }
  ]
}
```
The `ClaudeVLMParser` uses the Anthropic Messages API.

Authentication uses the `x-api-key` header (not `Bearer`). The required `anthropic-version` header is sent automatically.

Claude handles images and PDFs natively. For images, the content block type is `image`; for PDFs it is `document`. The parser detects the correct type from the input MIME type.
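For a PDF, a Messages API request body looks roughly like this (a sketch only; the base64 data is truncated and the prompt abbreviated — for an image, the block type would be `image` with an image `media_type`):

```json
{
  "model": "claude-sonnet-4-20250514",
  "max_tokens": 4096,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "document",
          "source": {
            "type": "base64",
            "media_type": "application/pdf",
            "data": "JVBERi0xLjQ..."
          }
        },
        { "type": "text", "text": "Extract all text from this document..." }
      ]
    }
  ]
}
```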
## Google Gemini

### Basic Configuration
```json
{
  "parsers": [
    {
      "gemini-vlm-parser": {
        "apiKey": "your-gemini-api-key",
        "model": "gemini-2.5-flash"
      }
    }
  ]
}
```
### Full Configuration
```json
{
  "parsers": [
    {
      "gemini-vlm-parser": {
        "baseUrl": "https://generativelanguage.googleapis.com",
        "model": "gemini-2.5-flash",
        "prompt": "Extract all visible text from this image. Return the text in markdown format, preserving the original structure (headings, lists, tables, paragraphs). Do not describe the image. Only return the extracted text.",
        "maxTokens": 4096,
        "timeoutSeconds": 300,
        "apiKey": "your-gemini-api-key",
        "inlineContent": true,
        "skipOcr": false,
        "minFileSizeToOcr": 0,
        "maxFileSizeToOcr": 52428800
      }
    }
  ]
}
```
The `GeminiVLMParser` targets the Google Gemini `generateContent` endpoint. The API key is passed as a `key` query parameter. Change `baseUrl` if you are using Vertex AI or a proxy.
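A `generateContent` request for an image looks roughly like the following, POSTed to `{baseUrl}/v1beta/models/{model}:generateContent?key=...`. This is a sketch of the public REST shape, not necessarily the exact body Tika sends; the base64 data is truncated and the prompt abbreviated:

```json
{
  "contents": [
    {
      "parts": [
        { "text": "Extract all visible text from this image..." },
        {
          "inline_data": {
            "mime_type": "image/png",
            "data": "iVBORw0KGgo..."
          }
        }
      ]
    }
  ],
  "generationConfig": { "maxOutputTokens": 4096 }
}
```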
## Using a VLM parser for PDF parsing

Claude and Gemini can process entire PDFs with their vision capabilities. To route PDFs to a VLM parser instead of the default `PDFParser`, exclude the default and add the VLM parser:
```json
{
  "parsers": [
    {
      "default-parser": {
        "exclude": ["pdf-parser"]
      }
    },
    {
      "claude-vlm-parser": {
        "apiKey": "sk-ant-your-key-here",
        "model": "claude-sonnet-4-20250514",
        "prompt": "Extract all text from this document. Return the text in markdown format, preserving the original structure (headings, lists, tables, paragraphs). Do not describe the document. Only return the extracted text."
      }
    }
  ]
}
```
You can substitute `gemini-vlm-parser` for `claude-vlm-parser` above.
## Configuration options reference

All three parsers share the same configuration POJO (`VLMOCRConfig`). Defaults below are taken from the full configuration examples above:

| Property | Default | Description |
|---|---|---|
| `baseUrl` | varies by parser | Base URL of the API endpoint (no trailing slash). |
| `model` | varies by parser | Model identifier sent in the API request. |
| `prompt` | (markdown extraction prompt) | The text prompt sent alongside the image or document. |
| `maxTokens` | `4096` | Maximum tokens the model may generate. |
| `timeoutSeconds` | `300` | HTTP read timeout in seconds. |
| `apiKey` | (empty) | API key. Format depends on the parser (Bearer header, `x-api-key` header, or query parameter). |
| `inlineContent` | `true` | When parsing inline images (embedded resource type `INLINE`), whether to include their extracted content. |
| `skipOcr` | `false` | Runtime kill-switch to disable the parser entirely. |
| `minFileSizeToOcr` | `0` | Minimum input file size in bytes. |
| `maxFileSizeToOcr` | `52428800` | Maximum input file size in bytes (50 MB). |
## Markdown-to-XHTML conversion

The VLM's text response is expected to be markdown. Tika parses it using commonmark-java and emits proper XHTML elements (`<h1>`, `<p>`, `<table>`, `<b>`, `<i>`, etc.) instead of dumping raw text. GFM tables and strikethrough are supported.
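As an illustration, a model response containing a markdown heading `# Invoice` followed by a two-column GFM table would be emitted as XHTML along these lines (illustrative structure, not the exact serializer output):

```xml
<h1>Invoice</h1>
<table>
  <tr><td>Item</td><td>Price</td></tr>
  <tr><td>Widget</td><td>4.99</td></tr>
</table>
```

Downstream consumers therefore receive structured content handler events rather than a single flat text blob.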
## Per-request configuration

You can override configuration per-request by setting a `VLMOCRConfig` instance on the `ParseContext`:

```java
VLMOCRConfig override = new VLMOCRConfig();
override.setModel("claude-opus-4-20250514");
override.setMaxTokens(8192);

ParseContext context = new ParseContext();
context.set(VLMOCRConfig.class, override);
// Pass this context to parser.parse(stream, handler, metadata, context)
```
@since Apache Tika 4.0