VLM (Vision-Language Model) Parsers

Tika includes a family of parsers that delegate OCR and document understanding to remote Vision-Language Model (VLM) endpoints. These parsers send images (or PDFs) to an external API and convert the model’s markdown response into structured XHTML.

Three implementations are provided out of the box:

| Parser | Endpoint | Config key | SPI auto-loaded? |
| --- | --- | --- | --- |
| OpenAIVLMParser | Any OpenAI-compatible chat completions endpoint (vLLM, Ollama, local FastAPI, OpenAI) | openai-vlm-parser | Yes |
| ClaudeVLMParser | Anthropic Messages API | claude-vlm-parser | No |
| GeminiVLMParser | Google Gemini generateContent API | gemini-vlm-parser | No |

OpenAIVLMParser is the only one of the three loaded automatically by the default parser via SPI. ClaudeVLMParser and GeminiVLMParser must be explicitly added to your configuration.

Supported input types

All three parsers handle standard OCR image types (image/ocr-png, image/ocr-jpeg, etc.). ClaudeVLMParser and GeminiVLMParser additionally declare application/pdf support, meaning they can process PDFs natively using the model’s vision capabilities.

Module dependency

The VLM parsers live in the tika-parser-vlm-ocr-module artifact. Add it to your project:

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parser-vlm-ocr-module</artifactId>
  <version>${tika.version}</version>
</dependency>

To run a local open-source VLM without cloud API keys, see Running a Local VLM Server.

OpenAI-compatible (vLLM, Ollama, etc.)

Basic Configuration

{
  "parsers": [
    {
      "openai-vlm-parser": {
        "baseUrl": "http://127.0.0.1:8000",
        "model": "jinaai/jina-vlm",
        "timeoutSeconds": 300
      }
    }
  ]
}

Full Configuration

{
  "parsers": [
    {
      "openai-vlm-parser": {
        "baseUrl": "http://127.0.0.1:8000",
        "model": "jinaai/jina-vlm",
        "prompt": "Extract all visible text from this image. Return the text in markdown format, preserving the original structure (headings, lists, tables, paragraphs). Do not describe the image. Only return the extracted text.",
        "maxTokens": 4096,
        "timeoutSeconds": 300,
        "apiKey": "",
        "inlineContent": true,
        "skipOcr": false,
        "minFileSizeToOcr": 0,
        "maxFileSizeToOcr": 52428800
      }
    }
  ]
}

The OpenAIVLMParser works with any server that exposes a /v1/chat/completions endpoint in the OpenAI format. This includes:

  • vLLM

  • Ollama

  • A local FastAPI / Flask wrapper around a Hugging Face model

  • OpenAI itself

Authentication uses a standard Authorization: Bearer <apiKey> header. Leave apiKey empty to skip authentication (typical for local servers).
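Tika builds this request internally, but the shape is worth seeing. The sketch below uses only the JDK's java.net.http to construct an OpenAI-style chat-completions request for one image; the payload fields follow the public OpenAI API, and the class and method names are illustrative, not Tika's implementation:

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class OpenAIVLMRequestSketch {
    // Builds an OpenAI-compatible chat-completions request carrying one image
    // as a base64 data URL, with the OCR prompt as the text content part.
    static HttpRequest build(String baseUrl, String apiKey, String model, byte[] imageBytes) {
        String dataUrl = "data:image/png;base64,"
                + Base64.getEncoder().encodeToString(imageBytes);
        String body = """
                {"model":"%s","messages":[{"role":"user","content":[
                  {"type":"text","text":"Extract all visible text from this image."},
                  {"type":"image_url","image_url":{"url":"%s"}}]}]}
                """.formatted(model, dataUrl);
        HttpRequest.Builder builder = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8));
        if (apiKey != null && !apiKey.isEmpty()) {
            // Bearer auth is only added when an apiKey is configured,
            // matching the "leave empty for local servers" behaviour above.
            builder.header("Authorization", "Bearer " + apiKey);
        }
        return builder.build();
    }
}
```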

Anthropic Claude

Basic Configuration

{
  "parsers": [
    {
      "claude-vlm-parser": {
        "apiKey": "sk-ant-your-key-here",
        "model": "claude-sonnet-4-20250514"
      }
    }
  ]
}

Full Configuration

{
  "parsers": [
    {
      "claude-vlm-parser": {
        "baseUrl": "https://api.anthropic.com",
        "model": "claude-sonnet-4-20250514",
        "prompt": "Extract all visible text from this image. Return the text in markdown format, preserving the original structure (headings, lists, tables, paragraphs). Do not describe the image. Only return the extracted text.",
        "maxTokens": 4096,
        "timeoutSeconds": 300,
        "apiKey": "sk-ant-your-key-here",
        "inlineContent": true,
        "skipOcr": false,
        "minFileSizeToOcr": 0,
        "maxFileSizeToOcr": 52428800
      }
    }
  ]
}

The ClaudeVLMParser uses the Anthropic Messages API. Authentication uses the x-api-key header (not Bearer). The required anthropic-version header is sent automatically.

Claude handles images and PDFs natively. For images, the content block type is image; for PDFs it is document. The parser detects the correct type from the input MIME type.
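The image-versus-document distinction and the Claude-specific headers can be sketched as follows. The header names and the /v1/messages path come from the public Anthropic API docs; the payload shape and class names are illustrative, not Tika's actual code:

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Base64;

public class ClaudeVLMRequestSketch {
    // Claude distinguishes images from PDFs by content-block type.
    static String blockType(String mimeType) {
        return "application/pdf".equals(mimeType) ? "document" : "image";
    }

    // Builds an Anthropic Messages API request (simplified payload).
    static HttpRequest build(String baseUrl, String apiKey, String model,
                             String mimeType, byte[] data) {
        String body = """
                {"model":"%s","max_tokens":4096,"messages":[{"role":"user","content":[
                  {"type":"%s","source":{"type":"base64","media_type":"%s","data":"%s"}},
                  {"type":"text","text":"Extract all visible text."}]}]}
                """.formatted(model, blockType(mimeType), mimeType,
                Base64.getEncoder().encodeToString(data));
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/v1/messages"))
                .header("x-api-key", apiKey)               // not Authorization: Bearer
                .header("anthropic-version", "2023-06-01") // required by the API
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }
}
```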

Google Gemini

Basic Configuration

{
  "parsers": [
    {
      "gemini-vlm-parser": {
        "apiKey": "your-gemini-api-key",
        "model": "gemini-2.5-flash"
      }
    }
  ]
}

Full Configuration

{
  "parsers": [
    {
      "gemini-vlm-parser": {
        "baseUrl": "https://generativelanguage.googleapis.com",
        "model": "gemini-2.5-flash",
        "prompt": "Extract all visible text from this image. Return the text in markdown format, preserving the original structure (headings, lists, tables, paragraphs). Do not describe the image. Only return the extracted text.",
        "maxTokens": 4096,
        "timeoutSeconds": 300,
        "apiKey": "your-gemini-api-key",
        "inlineContent": true,
        "skipOcr": false,
        "minFileSizeToOcr": 0,
        "maxFileSizeToOcr": 52428800
      }
    }
  ]
}

The GeminiVLMParser targets the Google Gemini generateContent endpoint. The API key is passed as a key query parameter.

Change baseUrl if you are using Vertex AI or a proxy.
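For reference, the resulting URL can be sketched as below. The v1beta path and the key query parameter follow the public Gemini REST docs; whether Tika uses exactly this form is an assumption:

```java
import java.net.URI;

public class GeminiVLMEndpointSketch {
    // Builds the generateContent URL; the API key travels as a query
    // parameter rather than a header (illustrative helper, not Tika code).
    static URI endpoint(String baseUrl, String model, String apiKey) {
        return URI.create(baseUrl + "/v1beta/models/" + model
                + ":generateContent?key=" + apiKey);
    }
}
```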

Using a VLM parser for PDF parsing

Claude and Gemini can process entire PDFs with their vision capabilities. To route PDFs to a VLM parser instead of the default PDFParser, exclude the default and add the VLM parser:

{
  "parsers": [
    {
      "default-parser": {
        "exclude": ["pdf-parser"]
      }
    },
    {
      "claude-vlm-parser": {
        "apiKey": "sk-ant-your-key-here",
        "model": "claude-sonnet-4-20250514",
        "prompt": "Extract all text from this document. Return the text in markdown format, preserving the original structure (headings, lists, tables, paragraphs). Do not describe the document. Only return the extracted text."
      }
    }
  ]
}

You can substitute gemini-vlm-parser for claude-vlm-parser above.

Configuration options reference

All three parsers share the same configuration POJO (VLMOCRConfig):

| Property | Default | Description |
| --- | --- | --- |
| baseUrl | varies by parser | Base URL of the API endpoint (no trailing slash). |
| model | varies by parser | Model identifier sent in the API request. |
| prompt | (markdown extraction prompt) | The text prompt sent alongside the image or document. |
| maxTokens | 4096 | Maximum number of tokens the model may generate. |
| timeoutSeconds | 300 | HTTP read timeout in seconds. |
| apiKey | "" (empty) | API key. How it is sent depends on the parser (Bearer header, x-api-key header, or query parameter). |
| inlineContent | true | When parsing inline images (embedded resource type INLINE), write OCR text into the parent document's content stream. Mirrors TesseractOCRParser inline behaviour. |
| skipOcr | false | Runtime kill-switch that disables the parser entirely. |
| minFileSizeToOcr | 0 | Minimum input file size in bytes; smaller files are skipped. |
| maxFileSizeToOcr | 52428800 (50 MB) | Maximum input file size in bytes; larger files are skipped. |

Markdown-to-XHTML conversion

The VLM’s text response is expected to be markdown. Tika parses it using commonmark-java and emits proper XHTML elements (<h1>, <p>, <table>, <b>, <i>, etc.) instead of dumping raw text. GFM tables and strikethrough are supported.
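The conversion can be approximated with commonmark-java directly. Note that Tika actually emits SAX events into the ContentHandler rather than an HTML string, so this stand-alone sketch (class and method names are mine) only mirrors the markdown side of the pipeline:

```java
import org.commonmark.Extension;
import org.commonmark.ext.gfm.strikethrough.StrikethroughExtension;
import org.commonmark.ext.gfm.tables.TablesExtension;
import org.commonmark.parser.Parser;
import org.commonmark.renderer.html.HtmlRenderer;

import java.util.List;

public class MarkdownToHtmlSketch {
    // Parses the VLM's markdown response with the GFM tables and
    // strikethrough extensions enabled, then renders structured HTML.
    static String toHtml(String markdown) {
        List<Extension> extensions =
                List.of(TablesExtension.create(), StrikethroughExtension.create());
        Parser parser = Parser.builder().extensions(extensions).build();
        HtmlRenderer renderer = HtmlRenderer.builder().extensions(extensions).build();
        return renderer.render(parser.parse(markdown));
    }
}
```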

Per-request configuration

You can override the configuration per request by setting a VLMOCRConfig instance on the ParseContext:

// Per-request overrides take precedence over the JSON config for this parse
VLMOCRConfig override = new VLMOCRConfig();
override.setModel("claude-opus-4-20250514");
override.setMaxTokens(8192);

ParseContext context = new ParseContext();
context.set(VLMOCRConfig.class, override);

// Pass the context on the parse call as usual
parser.parse(stream, handler, metadata, context);

@since Apache Tika 4.0