Running a Local VLM Server for OCR
This guide shows how to run an open-source Vision-Language Model (VLM)
locally as an OpenAI-compatible endpoint, so that Tika’s
OpenAIVLMParser can use it
for OCR without any cloud API keys.
This is useful for:

- Air-gapped / offline environments
- Avoiding per-request API costs
- Processing sensitive documents that cannot leave your network
- Evaluation and testing
The server wraps a Hugging Face model in a lightweight FastAPI app that
exposes the /v1/chat/completions endpoint in the OpenAI format.
Supported models
Any Hugging Face causal vision-language model works. Two good options:
| Model | Size | License | Notes |
|---|---|---|---|
| Jina VLM (`jinaai/jina-vlm`) | 2.4B | CC BY-NC 4.0 | Good OCR quality; non-commercial license |
| Qwen2-VL (`Qwen/Qwen2-VL-2B-Instruct`) | 2B | Apache 2.0 | Permissive license; comparable quality |
To use a different model, pass --model <name> when starting the server.
Setup
The server script
Save the following as server.py:
#!/usr/bin/env python3
"""
OpenAI-compatible chat completions server for local VLMs on macOS Apple Silicon.
Loads the model once at startup and serves requests via a local-only HTTP endpoint.
Supports image inputs as URLs, file paths, or base64 data URIs.
Usage:
python server.py
python server.py --port 8000
python server.py --model Qwen/Qwen2-VL-2B-Instruct --port 8000
"""
import argparse
import base64
import io
import time
import uuid
import torch
import uvicorn
from fastapi import FastAPI
from PIL import Image
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
# --- Request / response models (OpenAI-compatible subset) ---
class ChatMessage(BaseModel):
role: str
content: str | list
class ChatRequest(BaseModel):
model: str = "jina-vlm"
messages: list[ChatMessage]
max_tokens: int = 1024
temperature: float = 0.0
class ChatChoice(BaseModel):
index: int = 0
message: dict
finish_reason: str = "stop"
class ChatUsage(BaseModel):
prompt_tokens: int = 0
completion_tokens: int = 0
total_tokens: int = 0
class ChatResponse(BaseModel):
id: str
object: str = "chat.completion"
created: int
model: str
choices: list[ChatChoice]
usage: ChatUsage
app = FastAPI(title="VLM server")
model = None
processor = None
device = None
def load_image(src: str) -> Image.Image:
"""Load an image from a URL, file path, or base64 data URI."""
if src.startswith("data:"):
header, data = src.split(",", 1)
return Image.open(io.BytesIO(base64.b64decode(data)))
elif src.startswith("http://") or src.startswith("https://"):
import requests
return Image.open(io.BytesIO(requests.get(src, timeout=30).content))
else:
return Image.open(src)
def parse_messages(messages: list[ChatMessage]):
"""Convert OpenAI-style messages to transformers conversation + images."""
conversation = []
images = []
for msg in messages:
if isinstance(msg.content, str):
conversation.append({
"role": msg.role,
"content": [{"type": "text", "text": msg.content}]
})
elif isinstance(msg.content, list):
parts = []
for part in msg.content:
if part.get("type") == "text":
parts.append({"type": "text", "text": part["text"]})
elif part.get("type") == "image_url":
url = part["image_url"]["url"]
images.append(load_image(url))
parts.append({"type": "image", "image": url})
conversation.append({"role": msg.role, "content": parts})
return conversation, images
@app.get("/health")
def health():
return {"status": "ok"}
@app.post("/v1/chat/completions")
def chat_completions(req: ChatRequest):
conversation, images = parse_messages(req.messages)
text = processor.apply_chat_template(
conversation, add_generation_prompt=True
)
if images:
inputs = processor(
text=[text], images=images,
padding="longest", return_tensors="pt"
)
else:
inputs = processor(
text=[text], padding="longest", return_tensors="pt"
)
inputs = {
k: v.to(model.device) if isinstance(v, torch.Tensor) else v
for k, v in inputs.items()
}
output = model.generate(
**inputs,
generation_config=GenerationConfig(
max_new_tokens=req.max_tokens,
do_sample=req.temperature > 0,
temperature=req.temperature if req.temperature > 0 else None,
),
return_dict_in_generate=True,
use_model_defaults=True,
)
response_text = processor.tokenizer.decode(
output.sequences[0][inputs["input_ids"].shape[-1]:],
skip_special_tokens=True,
)
prompt_tokens = inputs["input_ids"].shape[-1]
completion_tokens = output.sequences[0].shape[-1] - prompt_tokens
return ChatResponse(
id=f"chatcmpl-{uuid.uuid4().hex[:12]}",
created=int(time.time()),
model=req.model,
choices=[ChatChoice(
message={"role": "assistant", "content": response_text}
)],
usage=ChatUsage(
prompt_tokens=prompt_tokens,
completion_tokens=completion_tokens,
total_tokens=prompt_tokens + completion_tokens,
),
)
def main():
global model, processor, device
parser = argparse.ArgumentParser(
description="VLM OpenAI-compatible server (localhost only)"
)
parser.add_argument(
"--model", type=str, default="jinaai/jina-vlm",
help="Hugging Face model name or local path"
)
parser.add_argument(
"--port", type=int, default=8000, help="Port (default: 8000)"
)
args = parser.parse_args()
if torch.backends.mps.is_available():
device = "mps"
print("Using Apple Silicon GPU (MPS)")
else:
device = "cpu"
print("MPS not available, using CPU (this will be slow)")
print(f"Loading model {args.model}...")
processor = AutoProcessor.from_pretrained(
args.model, use_fast=False, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
args.model, device_map=device,
torch_dtype=torch.float16, trust_remote_code=True,
)
print(f"Model loaded. Starting server on http://127.0.0.1:{args.port}")
uvicorn.run(app, host="127.0.0.1", port=args.port)
if __name__ == "__main__":
main()
Starting the server
source .venv/bin/activate
python server.py # default: jina-vlm on port 8000
python server.py --model Qwen/Qwen2-VL-2B-Instruct # use Qwen instead
python server.py --port 9000 # custom port
The first run downloads ~5 GB of model weights. Subsequent runs use the Hugging Face cache.
The server binds to 127.0.0.1 only — it is not accessible from the
network.
Configure Tika to use it
Point the OpenAIVLMParser at your local server:
{
"parsers": [
{
"openai-vlm-parser": {
"baseUrl": "http://127.0.0.1:8000",
"model": "jinaai/jina-vlm",
"timeoutSeconds": 600
}
}
]
}
A generous timeoutSeconds is recommended — local inference on Apple
Silicon takes 10-60 seconds per page depending on model size and image
resolution.
Testing with curl
Text-only query
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "jina-vlm",
"messages": [{"role": "user", "content": "What is the capital of France?"}]
}'
Image from URL
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "jina-vlm",
"messages": [{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://picsum.photos/800/600"}},
{"type": "text", "text": "Describe this image"}
]
}]
}'
Local image (base64)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "jina-vlm",
"messages": [{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "data:image/png;base64,'"$(base64 -i page.png)"'"
}
},
{
"type": "text",
"text": "Extract all visible text from this image as markdown."
}
]
}]
}'
Troubleshooting
Performance notes

- Jina VLM (2.4B) on M3 Max: ~15-30 seconds per page image
- Model loading at startup: ~20-30 seconds
- Keeping the server running avoids reloading the model per request
- Consider `timeoutSeconds: 600` or higher in the Tika config for large or complex images
Licensing

- Jina VLM (`jinaai/jina-vlm`): CC BY-NC 4.0 (non-commercial). Contact Jina AI for commercial licensing.
- Qwen2-VL (`Qwen/Qwen2-VL-2B-Instruct`): Apache 2.0 (permissive, commercial use OK).