VLM (Vision-Language Model) Parsers

Tika includes a family of parsers that delegate OCR and document understanding to remote Vision-Language Model (VLM) endpoints. These parsers send images (or PDFs) to an external API and convert the model’s markdown response into structured XHTML.

Three implementations are provided out of the box:

| Parser | Endpoint | Config key | SPI auto-loaded? |
| --- | --- | --- | --- |
| OpenAIVLMParser | Any OpenAI-compatible chat completions endpoint (vLLM, Ollama, local FastAPI, OpenAI) | openai-vlm-parser | Yes |
| ClaudeVLMParser | Anthropic Messages API | claude-vlm-parser | No |
| GeminiVLMParser | Google Gemini generateContent API | gemini-vlm-parser | No |

OpenAIVLMParser is the only one of the three loaded automatically by the default parser via SPI. ClaudeVLMParser and GeminiVLMParser must be explicitly added to your configuration.

Supported input types

All three parsers handle standard OCR image types (image/ocr-png, image/ocr-jpeg, etc.). ClaudeVLMParser and GeminiVLMParser additionally declare application/pdf support, meaning they can process PDFs natively using the model’s vision capabilities.

Module dependency

The VLM parsers live in the tika-parser-vlm-ocr-module artifact. Add it to your project:

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parser-vlm-ocr-module</artifactId>
  <version>${tika.version}</version>
</dependency>

To run a local open-source VLM without cloud API keys, see Running a Local VLM Server.

OpenAI-compatible (vLLM, Ollama, etc.)

Basic Configuration

{
  "parsers": [
    {
      "openai-vlm-parser": {
        "baseUrl": "http://127.0.0.1:8000",
        "model": "jinaai/jina-vlm",
        "timeoutSeconds": 300
      }
    }
  ]
}

Full Configuration

{
  "parsers": [
    {
      "openai-vlm-parser": {
        "baseUrl": "http://127.0.0.1:8000",
        "model": "jinaai/jina-vlm",
        "prompt": "Extract all visible text from this image. Return the text in markdown format, preserving the original structure (headings, lists, tables, paragraphs). Do not describe the image. Only return the extracted text.",
        "maxTokens": 4096,
        "timeoutSeconds": 300,
        "apiKey": "",
        "inlineContent": true,
        "skipOcr": false,
        "minFileSizeToOcr": 0,
        "maxFileSizeToOcr": 52428800
      }
    }
  ]
}

The OpenAIVLMParser works with any server that exposes a /v1/chat/completions endpoint in the OpenAI format. This includes:

  • vLLM

  • Ollama

  • A local FastAPI / Flask wrapper around a Hugging Face model

  • OpenAI itself

Authentication uses a standard Authorization: Bearer <apiKey> header. Leave apiKey empty to skip authentication (typical for local servers).
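Tika builds this request internally, but the shape is worth seeing. The sketch below uses only the JDK's java.net.http to construct an OpenAI-style chat-completions request for one image; the payload fields follow the public OpenAI API, and the class and method names are illustrative, not Tika's implementation:

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class OpenAIVLMRequestSketch {
    // Builds an OpenAI-compatible chat-completions request carrying one image
    // as a base64 data URL, with the OCR prompt as the text content part.
    static HttpRequest build(String baseUrl, String apiKey, String model, byte[] imageBytes) {
        String dataUrl = "data:image/png;base64,"
                + Base64.getEncoder().encodeToString(imageBytes);
        String body = """
                {"model":"%s","messages":[{"role":"user","content":[
                  {"type":"text","text":"Extract all visible text from this image."},
                  {"type":"image_url","image_url":{"url":"%s"}}]}]}
                """.formatted(model, dataUrl);
        HttpRequest.Builder builder = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8));
        if (apiKey != null && !apiKey.isEmpty()) {
            // Bearer auth is only added when an apiKey is configured,
            // matching the "leave empty for local servers" behaviour above.
            builder.header("Authorization", "Bearer " + apiKey);
        }
        return builder.build();
    }
}
```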

Anthropic Claude

Basic Configuration

{
  "parsers": [
    {
      "claude-vlm-parser": {
        "apiKey": "sk-ant-your-key-here",
        "model": "claude-sonnet-4-20250514"
      }
    }
  ]
}

Full Configuration

{
  "parsers": [
    {
      "claude-vlm-parser": {
        "baseUrl": "https://api.anthropic.com",
        "model": "claude-sonnet-4-20250514",
        "prompt": "Extract all visible text from this image. Return the text in markdown format, preserving the original structure (headings, lists, tables, paragraphs). Do not describe the image. Only return the extracted text.",
        "maxTokens": 4096,
        "timeoutSeconds": 300,
        "apiKey": "sk-ant-your-key-here",
        "inlineContent": true,
        "skipOcr": false,
        "minFileSizeToOcr": 0,
        "maxFileSizeToOcr": 52428800
      }
    }
  ]
}

The ClaudeVLMParser uses the Anthropic Messages API. Authentication uses the x-api-key header (not Bearer). The required anthropic-version header is sent automatically.

Claude handles images and PDFs natively. For images, the content block type is image; for PDFs it is document. The parser detects the correct type from the input MIME type.
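The image-versus-document distinction and the Claude-specific headers can be sketched as follows. The header names and the /v1/messages path come from the public Anthropic API docs; the payload shape and class names are illustrative, not Tika's actual code:

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Base64;

public class ClaudeVLMRequestSketch {
    // Claude distinguishes images from PDFs by content-block type.
    static String blockType(String mimeType) {
        return "application/pdf".equals(mimeType) ? "document" : "image";
    }

    // Builds an Anthropic Messages API request (simplified payload).
    static HttpRequest build(String baseUrl, String apiKey, String model,
                             String mimeType, byte[] data) {
        String body = """
                {"model":"%s","max_tokens":4096,"messages":[{"role":"user","content":[
                  {"type":"%s","source":{"type":"base64","media_type":"%s","data":"%s"}},
                  {"type":"text","text":"Extract all visible text."}]}]}
                """.formatted(model, blockType(mimeType), mimeType,
                Base64.getEncoder().encodeToString(data));
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/v1/messages"))
                .header("x-api-key", apiKey)               // not Authorization: Bearer
                .header("anthropic-version", "2023-06-01") // required by the API
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }
}
```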

Google Gemini

Basic Configuration

{
  "parsers": [
    {
      "gemini-vlm-parser": {
        "apiKey": "your-gemini-api-key",
        "model": "gemini-2.5-flash"
      }
    }
  ]
}

Full Configuration

{
  "parsers": [
    {
      "gemini-vlm-parser": {
        "baseUrl": "https://generativelanguage.googleapis.com",
        "model": "gemini-2.5-flash",
        "prompt": "Extract all visible text from this image. Return the text in markdown format, preserving the original structure (headings, lists, tables, paragraphs). Do not describe the image. Only return the extracted text.",
        "maxTokens": 4096,
        "timeoutSeconds": 300,
        "apiKey": "your-gemini-api-key",
        "inlineContent": true,
        "skipOcr": false,
        "minFileSizeToOcr": 0,
        "maxFileSizeToOcr": 52428800
      }
    }
  ]
}

The GeminiVLMParser targets the Google Gemini generateContent endpoint. The API key is passed as a key query parameter.

Change baseUrl if you are using Vertex AI or a proxy.
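For reference, the resulting URL can be sketched as below. The v1beta path and the key query parameter follow the public Gemini REST docs; whether Tika uses exactly this form is an assumption:

```java
import java.net.URI;

public class GeminiVLMEndpointSketch {
    // Builds the generateContent URL; the API key travels as a query
    // parameter rather than a header (illustrative helper, not Tika code).
    static URI endpoint(String baseUrl, String model, String apiKey) {
        return URI.create(baseUrl + "/v1beta/models/" + model
                + ":generateContent?key=" + apiKey);
    }
}
```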

Using a VLM parser for PDF parsing

Claude and Gemini can process entire PDFs with their vision capabilities. To route PDFs to a VLM parser instead of the default PDFParser, exclude the default and add the VLM parser:

{
  "parsers": [
    {
      "default-parser": {
        "exclude": ["pdf-parser"]
      }
    },
    {
      "claude-vlm-parser": {
        "apiKey": "sk-ant-your-key-here",
        "model": "claude-sonnet-4-20250514",
        "prompt": "Extract all text from this document. Return the text in markdown format, preserving the original structure (headings, lists, tables, paragraphs). Do not describe the document. Only return the extracted text."
      }
    }
  ]
}

You can substitute gemini-vlm-parser for claude-vlm-parser above.

Configuration options reference

All three parsers share the same configuration POJO (VLMOCRConfig):

| Property | Default | Description |
| --- | --- | --- |
| baseUrl | varies by parser | Base URL of the API endpoint (no trailing slash). |
| model | varies by parser | Model identifier sent in the API request. |
| prompt | (markdown extraction prompt) | The text prompt sent alongside the image or document. |
| maxTokens | 4096 | Maximum number of tokens the model may generate. |
| timeoutSeconds | 300 | HTTP read timeout in seconds. |
| apiKey | "" (empty) | API key. How it is sent depends on the parser (Bearer header, x-api-key header, or query parameter). |
| inlineContent | true | When parsing inline images (embedded resource type INLINE), write OCR text into the parent document's content stream. Mirrors TesseractOCRParser inline behaviour. |
| skipOcr | false | Runtime kill-switch that disables the parser entirely. |
| minFileSizeToOcr | 0 | Minimum input file size in bytes; smaller files are skipped. |
| maxFileSizeToOcr | 52428800 (50 MB) | Maximum input file size in bytes; larger files are skipped. |

Markdown-to-XHTML conversion

The VLM’s text response is expected to be markdown. Tika parses it using commonmark-java and emits proper XHTML elements (<h1>, <p>, <table>, <b>, <i>, etc.) instead of dumping raw text. GFM tables and strikethrough are supported.
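The conversion can be approximated with commonmark-java directly. Note that Tika actually emits SAX events into the ContentHandler rather than an HTML string, so this stand-alone sketch (class and method names are mine) only mirrors the markdown side of the pipeline:

```java
import org.commonmark.Extension;
import org.commonmark.ext.gfm.strikethrough.StrikethroughExtension;
import org.commonmark.ext.gfm.tables.TablesExtension;
import org.commonmark.parser.Parser;
import org.commonmark.renderer.html.HtmlRenderer;

import java.util.List;

public class MarkdownToHtmlSketch {
    // Parses the VLM's markdown response with the GFM tables and
    // strikethrough extensions enabled, then renders structured HTML.
    static String toHtml(String markdown) {
        List<Extension> extensions =
                List.of(TablesExtension.create(), StrikethroughExtension.create());
        Parser parser = Parser.builder().extensions(extensions).build();
        HtmlRenderer renderer = HtmlRenderer.builder().extensions(extensions).build();
        return renderer.render(parser.parse(markdown));
    }
}
```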

Per-request configuration

You can override the configuration per request by setting a VLMOCRConfig instance on the ParseContext:

// Per-request overrides take precedence over the JSON config for this parse
VLMOCRConfig override = new VLMOCRConfig();
override.setModel("claude-opus-4-20250514");
override.setMaxTokens(8192);

ParseContext context = new ParseContext();
context.set(VLMOCRConfig.class, override);

// Pass the context on the parse call as usual
parser.parse(stream, handler, metadata, context);

@since Apache Tika 4.0