Chunk Strategies for Search Engines
Tika 4.x introduces a unified chunking and embedding pipeline (`tika-inference`) that produces
a `tika:chunks` array in each document's metadata. This page describes the strategies for
emitting those chunks to Elasticsearch and OpenSearch.
Background
The tika-inference module produces chunks from two sources:
- Text chunks — the extracted text is split by the `MarkdownChunker` at heading/paragraph boundaries, then each chunk is sent to a text embedding endpoint (e.g., the OpenAI Embeddings API). Each chunk carries a `TextLocator` (character offsets).
- Image chunks — rendered PDF page images (via PDFBox or Poppler) are sent to a CLIP-style image embedding endpoint (e.g., Jina CLIP v2). Each chunk carries a `PaginatedLocator` (page number) and a vector, but no text.
Both types are stored in the same `tika:chunks` metadata field as a JSON array. When multiple
components produce chunks (e.g., the image embedder runs during parsing, the text embedder runs
as a metadata filter), they are merged into the same array via `ChunkSerializer.mergeInto()`.
A single chunk looks like:
```json
{
  "text": "Revenue grew 15% year-over-year...",
  "vector": "base64-encoded-float32-le",
  "locators": {
    "text": [{"start_offset": 0, "end_offset": 120}],
    "paginated": [{"page": 1}]
  }
}
```
Image-only chunks omit `text`; text-only chunks omit the `paginated` locator.
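Since the vector is base64-encoded little-endian float32, a client can decode it with a few lines of standard-library Python. A minimal sketch (the `decode_vector` helper is hypothetical, not part of Tika):

```python
import base64
import struct

def decode_vector(b64: str) -> list[float]:
    """Decode a tika:chunks vector: base64 of little-endian float32 values."""
    raw = base64.b64decode(b64)
    count = len(raw) // 4  # 4 bytes per float32
    return list(struct.unpack(f"<{count}f", raw))

# Round-trip a hypothetical 3-dimensional vector.
encoded = base64.b64encode(struct.pack("<3f", 0.25, -1.5, 3.0)).decode("ascii")
vec = decode_vector(encoded)  # [0.25, -1.5, 3.0]
```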
The Four Chunk Strategies
Option A: Chunks as Nested Objects
Each file (container or embedded) is one ES/OpenSearch document. The tika:chunks field
is mapped as a nested type.
```json
{"index":{"_id":"report.pdf"}}
{"title":"Q4 Report","mime":"application/pdf","tika:chunks":[
{"text":"Revenue grew...","vector":"...","locators":{"text":[{"start_offset":0,"end_offset":120}]}},
{"text":"Operating costs...","vector":"...","locators":{"text":[{"start_offset":121,"end_offset":300}]}},
{"vector":"...","locators":{"paginated":[{"page":1}]}},
{"vector":"...","locators":{"paginated":[{"page":2}]}}
]}
```
Pros:

- Simple — one document per file, everything together.
- Atomic updates.

Cons:

- `nested` kNN search is available in ES 8.11+ but has limitations.
- Large documents with many chunks can be expensive.
- Individual chunks cannot be retrieved independently.
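For illustration, here is one possible shape of a `nested` kNN search body against this layout. The helper is hypothetical and the exact `inner_hits` options are assumptions — verify the syntax against your ES/OpenSearch version (nested kNN landed in ES 8.11):

```python
def nested_knn_query(query_vector: list[float], k: int = 10,
                     num_candidates: int = 100) -> dict:
    """Build a search body that runs kNN over nested chunk vectors.

    Assumes a mapping where tika:chunks is `nested` and vector is
    `dense_vector` (as shown later on this page)."""
    return {
        "knn": {
            "field": "tika:chunks.vector",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
            # inner_hits surfaces the best-matching chunk, not just the parent doc
            "inner_hits": {"_source": ["tika:chunks.text", "tika:chunks.locators"]},
        }
    }
```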
Option B: Chunks as Separate Documents
Each chunk is its own ES/OpenSearch document with a `parent_doc_id` keyword field
(not a join — just a plain reference). Parent document metadata may be denormalized onto
each chunk.
```json
{"index":{"_id":"report.pdf"}}
{"title":"Q4 Report","mime":"application/pdf"}
{"index":{"_id":"report.pdf-chunk-0"}}
{"parent_doc_id":"report.pdf","text":"Revenue grew...","vector":[0.1,0.2,...],"locators":{...}}
{"index":{"_id":"report.pdf-chunk-1"}}
{"parent_doc_id":"report.pdf","text":"Operating costs...","vector":[0.1,0.2,...],"locators":{...}}
```
Pros:

- Standard RAG pattern — each chunk is independently searchable via kNN.
- Used by LangChain, LlamaIndex, Haystack, and most vector search frameworks.
- Simple mapping: one `dense_vector` field, one vector per document.

Cons:

- Many more documents.
- Parent metadata is either denormalized (duplicated) or requires a second lookup.
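To illustrate the layout, here is a hypothetical Python helper (not part of Tika) that explodes one file document carrying `tika:chunks` into Option B bulk lines:

```python
import json

def explode_chunks(doc_id: str, doc: dict) -> list[str]:
    """Turn one file document into Option B bulk lines: the parent document
    plus one document per chunk, linked by a plain parent_doc_id keyword
    field (no join). Hypothetical helper for illustration."""
    doc = dict(doc)                       # don't mutate the caller's dict
    chunks = doc.pop("tika:chunks", [])
    lines = [json.dumps({"index": {"_id": doc_id}}), json.dumps(doc)]
    for i, chunk in enumerate(chunks):
        lines.append(json.dumps({"index": {"_id": f"{doc_id}-chunk-{i}"}}))
        lines.append(json.dumps({"parent_doc_id": doc_id, **chunk}))
    return lines
```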
Option C: Chunks as Parent-Child (Join)
Chunks use the ES/OpenSearch `join` field as children of the container document.

```json
{"index":{"_id":"report.pdf","routing":"report.pdf"}}
{"title":"Q4 Report","mime":"application/pdf","relation_type":"container"}
{"index":{"_id":"report.pdf-chunk-0","routing":"report.pdf"}}
{"text":"Revenue grew...","vector":[0.1,...],"relation_type":{"name":"chunk","parent":"report.pdf"}}
```
Pros:

- Parent metadata is not duplicated.
- Parent fields can be queried via `has_parent` queries.

Cons:

- Join queries are expensive in ES/OpenSearch.
- Combining kNN with `parent_id` queries is awkward.
- Routing is required — all children must live on the same shard as the parent.
Option D: Separate Documents per File with Inline Chunks (Current Default)
Each file (the container and each embedded file) is a separate document — matching the existing
`SEPARATE_DOCUMENTS` attachment strategy. Chunks are stored as a structured JSON array
within each document (not exploded into separate docs, not a stringified blob).
```json
{"index":{"_id":"email.msg"}}
{"title":"Re: Q4 report","mime":"message/rfc822","tika:chunks":[
{"text":"Hi team, see attached...","vector":"...","locators":{"text":[{"start_offset":0,"end_offset":35}]}}
]}
{"index":{"_id":"email.msg-<uuid>"}}
{"title":"Q4-report.pdf","mime":"application/pdf","parent":"email.msg","tika:chunks":[
{"vector":"...","locators":{"paginated":[{"page":1}]}},
{"vector":"...","locators":{"paginated":[{"page":2}]}},
{"text":"Revenue grew...","vector":"...","locators":{"text":[{"start_offset":0,"end_offset":120}]}},
{"text":"Operating costs...","vector":"...","locators":{"text":[{"start_offset":121,"end_offset":300}]}}
]}
```
Pros:

- Natural model — one document per file matches how users think about documents.
- Embedded files (e.g., a PDF inside an email) each get their own document with their own chunks.
- Chunks are structured JSON (not escaped strings), so ES/OpenSearch can index vectors and locators natively.
- No join fields or routing are needed for the chunk relationship.
- Compatible with ES `nested` kNN when needed.

Cons:

- Same `nested` kNN caveats as Option A when searching within a document's chunks.
- For very large files with many pages, the document can be large.
Current Implementation
Tika 4.x currently implements Option D:
- The `AttachmentStrategy` (`SEPARATE_DOCUMENTS` or `PARENT_CHILD`) controls how embedded files are emitted — each embedded file becomes its own document.
- The `tika:chunks` field on each document holds all chunks (text + image) as a structured JSON array.
- The emitter clients (`ElasticsearchClient`, `OpenSearchClient`) detect the `tika:chunks` field and write it as raw JSON rather than an escaped string, so the nested objects are indexable.
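The raw-JSON detection in the last point can be sketched in a few lines. This is a Python sketch of the idea only — the actual emitter clients are Java, and `build_source` is a hypothetical name:

```python
import json

def build_source(metadata: dict) -> dict:
    """Most metadata values pass through as plain strings, but tika:chunks
    holds serialized JSON that must be re-parsed so ES/OpenSearch indexes
    nested objects instead of one escaped string blob."""
    source = {}
    for key, value in metadata.items():
        if key == "tika:chunks" and isinstance(value, str):
            source[key] = json.loads(value)  # raw JSON array, not a string
        else:
            source[key] = value
    return source
```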
The `tika:chunks` Field
All chunk data — regardless of source — is unified in a single metadata field: `tika:chunks`.
There is no separate field for image embeddings vs. text embeddings. A chunk is a chunk:
- Text chunk: has `text`, `vector`, and a `TextLocator` (character offsets)
- Image chunk: has `vector` and a `PaginatedLocator` (page number), no `text`
- Audio chunk (future): would have `vector` and a `TemporalLocator` (millisecond range)
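A consumer can tell the chunk types apart purely by which fields are present. A hypothetical sketch (the `"temporal"` locator key is an assumed serialization for the future `TemporalLocator`; `"text"` and `"paginated"` follow the schema shown earlier):

```python
def chunk_kind(chunk: dict) -> str:
    """Classify a tika:chunks entry by the fields it carries."""
    if "text" in chunk:
        return "text"
    locators = chunk.get("locators", {})
    if "paginated" in locators:
        return "image"
    if "temporal" in locators:  # assumed key for a future TemporalLocator
        return "audio"
    return "unknown"
```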
When multiple pipeline components produce chunks (e.g., `OpenAIImageEmbeddingParser` for
rendered page images, then `OpenAIEmbeddingFilter` for extracted text), they merge into the
same array via `ChunkSerializer.mergeInto()`.
Elasticsearch/OpenSearch Mapping
For Option D with `nested` kNN support, a mapping like this works:

```json
{
  "mappings": {
    "properties": {
      "title": {"type": "text"},
      "mime": {"type": "keyword"},
      "content": {"type": "text"},
      "parent": {"type": "keyword"},
      "tika:chunks": {
        "type": "nested",
        "properties": {
          "text": {"type": "text"},
          "vector": {
            "type": "dense_vector",
            "dims": 1024,
            "index": true,
            "similarity": "cosine"
          },
          "locators": {
            "properties": {
              "text": {
                "type": "nested",
                "properties": {
                  "start_offset": {"type": "integer"},
                  "end_offset": {"type": "integer"}
                }
              },
              "paginated": {
                "type": "nested",
                "properties": {
                  "page": {"type": "integer"}
                }
              }
            }
          }
        }
      }
    }
  }
}
```
The `vector` field in `tika:chunks` stores base64-encoded float32 arrays during Tika
processing, but the emitter writes them as-is. If your mapping uses `dense_vector`, you may
need an ingest pipeline or custom serialization to convert the base64 vectors to float arrays
at index time. Alternatively, map `vector` as `keyword` and decode at query time.
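As one concrete approach, the conversion can happen client-side before emitting. A hypothetical sketch (`inline_vectors` is not part of Tika), assuming the little-endian float32 encoding described earlier:

```python
import base64
import struct

def inline_vectors(doc: dict) -> dict:
    """Pre-index transform: replace each chunk's base64 vector string with a
    float list so the document satisfies a dense_vector mapping."""
    doc = dict(doc)
    converted = []
    for chunk in doc.get("tika:chunks", []):
        chunk = dict(chunk)
        vec = chunk.get("vector")
        if isinstance(vec, str):
            raw = base64.b64decode(vec)
            chunk["vector"] = list(struct.unpack(f"<{len(raw) // 4}f", raw))
        converted.append(chunk)
    doc["tika:chunks"] = converted
    return doc
```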
Future Work
- Option B support — a `ChunkStrategy.SEPARATE_DOCUMENTS` that explodes chunks into individual ES/OpenSearch documents at emit time, for simpler kNN search without `nested` queries.
- Hybrid search — combining kNN vector search on chunks with BM25 text search on the parent document's `content` field.
- Chunk-level metadata — propagating selected parent metadata onto each chunk for filtering during vector search.