# Content Handler Requirements for Inference
- The Problem
- Options
- Option A: Validate at Load Time (TikaLoader Cross-Component Check)
- Option B: Warn at Runtime via Metadata Field Check
- Option C: Auto-Override the Handler Type
- Option D: Default the Handler to MARKDOWN (Change the Global Default)
- Option E: Require MARKDOWN via the Embedding Filter’s Config Validation
- Current Implementation (Option B + D)
- Configuration Example (Full Inference Pipeline)
- Why Markdown?
The tika-inference module's text embedding pipeline (`AbstractEmbeddingFilter` and its
subclasses, such as `OpenAIEmbeddingFilter`) reads extracted text from the `tika:content`
metadata field, splits it with the `MarkdownChunker`, and sends the resulting chunks
to an embeddings endpoint.
The `MarkdownChunker` is designed to split text at markdown structural boundaries
(headings, paragraph breaks, list items, fenced code blocks). If the content in
`tika:content` is plain text (the default), the chunker loses the ability to split
intelligently at semantic boundaries: it falls back to splitting on blank lines and
then on character limits, which produces lower-quality chunks.

This means the inference pipeline requires (or at minimum strongly prefers) that
the `content-handler-factory` be set to `MARKDOWN`.
## The Problem
The default `content-handler-factory` in Tika is:

```json
{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1
    }
  }
}
```

This produces `tika:content` as plain text: no headings, no structural markers.
A user who adds an `OpenAIEmbeddingFilter` to their metadata filter chain but forgets
to change the handler type to `MARKDOWN` will get working but poor-quality chunks.
The system will not warn or error; it will silently degrade.
## Options
### Option A: Validate at Load Time (TikaLoader Cross-Component Check)
After `TikaLoader` loads both the `MetadataFilter` chain and the `ContentHandlerFactory`,
perform a cross-component validation:

- Walk the loaded `CompositeMetadataFilter` looking for any `AbstractEmbeddingFilter`.
- If found, check whether the `ContentHandlerFactory` is a `BasicContentHandlerFactory` with `type == MARKDOWN`.
- If not, throw a `TikaConfigException` with a clear message.
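A minimal sketch of what such a load-time check could look like. The types here (`EmbeddingFilter`, `HandlerType`, `ConfigException`, `validateForInference`) are simplified stand-ins for illustration, not the actual Tika API:

```java
import java.util.List;

public class LoadTimeValidationSketch {

    // Simplified stand-ins for Tika types; not the real API.
    enum HandlerType { TEXT, MARKDOWN }

    interface MetadataFilter {}

    // Marker for embedding filters in this sketch.
    static class EmbeddingFilter implements MetadataFilter {}

    static class ConfigException extends RuntimeException {
        ConfigException(String msg) { super(msg); }
    }

    // Walk the filter chain; if any embedding filter is present,
    // require the handler type to be MARKDOWN, else fail fast.
    static void validateForInference(List<MetadataFilter> filters, HandlerType handlerType) {
        boolean hasEmbeddingFilter = filters.stream().anyMatch(f -> f instanceof EmbeddingFilter);
        if (hasEmbeddingFilter && handlerType != HandlerType.MARKDOWN) {
            throw new ConfigException(
                "An embedding filter is configured, but the content-handler-factory type is "
                + handlerType + "; set it to MARKDOWN for high-quality chunking.");
        }
    }

    public static void main(String[] args) {
        // Misconfigured: embedding filter + TEXT handler fails at load time.
        try {
            validateForInference(List.of(new EmbeddingFilter()), HandlerType.TEXT);
        } catch (ConfigException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        // A correct configuration passes silently.
        validateForInference(List.of(new EmbeddingFilter()), HandlerType.MARKDOWN);
        System.out.println("ok");
    }
}
```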
Pros:

- Fail-fast: the user knows immediately at startup.
- Impossible to run inference with plain text accidentally.

Cons:

- Adds cross-component coupling to `TikaLoader`.
- `TikaLoader` today loads components independently; this would be the first cross-component validation.
- Makes it impossible to intentionally chunk plain text (some users may want that).
### Option B: Warn at Runtime via Metadata Field Check
The content handler factory writes the handler type (e.g. `MARKDOWN`, `TEXT`) into
`tika:content_handler_type` on the metadata object. The `AbstractEmbeddingFilter`
reads that field and logs a WARN if it is not `MARKDOWN`:

```
WARN - content-handler-factory type is 'TEXT' but the MarkdownChunker requires
MARKDOWN-formatted content for high-quality chunking. Set the
content-handler-factory type to MARKDOWN.
```

This is a deterministic check on a metadata field, not a heuristic content inspection.
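The check itself is trivial. A sketch of the logic, modeling the metadata object as a plain `Map` (the real filter reads a Tika `Metadata` object, and the helper name here is illustrative):

```java
import java.util.Map;

public class RuntimeWarningSketch {

    // Field name as described in this document.
    static final String HANDLER_TYPE_FIELD = "tika:content_handler_type";

    // Returns the warning message to log when the recorded handler type
    // is not MARKDOWN, or null when the configuration is fine.
    static String checkHandlerType(Map<String, String> metadata) {
        String type = metadata.get(HANDLER_TYPE_FIELD);
        if (!"MARKDOWN".equals(type)) {
            return "content-handler-factory type is '" + type
                + "' but the MarkdownChunker requires MARKDOWN-formatted content "
                + "for high-quality chunking. Set the content-handler-factory type to MARKDOWN.";
        }
        return null;
    }

    public static void main(String[] args) {
        // TEXT handler: produces a warning message.
        System.out.println(checkHandlerType(Map.of(HANDLER_TYPE_FIELD, "TEXT")));
        // MARKDOWN handler: null, nothing to warn about.
        System.out.println(checkHandlerType(Map.of(HANDLER_TYPE_FIELD, "MARKDOWN")));
    }
}
```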
Pros:

- No coupling between loader and filters.
- Exact: checks which handler was actually used, not what the content looks like.
- Graceful degradation: everything still works, just with a warning.
- Works even when the embedding filter is used outside of Pipes (e.g., via the Java API).

Cons:

- The warning may be missed in log noise.
- Doesn't prevent the problem, just reports it.
### Option C: Auto-Override the Handler Type
When `TikaLoader` detects an `AbstractEmbeddingFilter` in the metadata filter chain,
automatically switch the content handler to `MARKDOWN` if it is currently `TEXT`:

```java
if (hasEmbeddingFilter && handlerType == TEXT) {
    LOG.info("Inference filter detected, upgrading handler type to MARKDOWN");
    contentHandlerFactory = new BasicContentHandlerFactory(HANDLER_TYPE.MARKDOWN, writeLimit);
}
```
Pros:

- Zero configuration for the common case: just add the embedding filter and it works.
- The user doesn't need to understand the relationship between the handler type and the chunker.

Cons:

- Implicit behavior: the handler type changes without the user explicitly requesting it.
- Could break existing workflows that depend on `TEXT` output in `tika:content` (e.g., downstream consumers that don't expect markdown).
- "Magic" behavior is generally discouraged in Tika's design philosophy.
### Option D: Default the Handler to MARKDOWN (Change the Global Default)
Change `TikaLoader.loadContentHandlerFactory()` to default to `MARKDOWN` instead of `TEXT`
when no `content-handler-factory` section is present:

```java
// Before:
contentHandlerFactory = new BasicContentHandlerFactory(HANDLER_TYPE.TEXT, -1);

// After:
contentHandlerFactory = new BasicContentHandlerFactory(HANDLER_TYPE.MARKDOWN, -1);
```
Pros:

- Simplest change: one line.
- Markdown is a superset of plain text for most purposes.
- Aligns the default with the inference use case, the primary new capability in 4.x.

Cons:

- Breaking change for users upgrading from 3.x who expect plain text.
- Markdown output is slightly larger than plain text (heading markers, etc.).
- Not all parsers produce equally good markdown.
### Option E: Require MARKDOWN via the Embedding Filter's Config Validation
Have `AbstractEmbeddingFilter` implement `Initializable` and accept an optional
`requiredHandlerType` config field (defaulting to `MARKDOWN`). During
`checkInitialization()`, validate that the handler type matches:

```java
@Override
public void checkInitialization(InitializableProblemHandler problemHandler)
        throws TikaConfigException {
    // This runs at load time, but the filter doesn't have access
    // to the ContentHandlerFactory...
}
```
Problem: a `MetadataFilter` runs in the post-parse metadata filter chain and has no
access to the `ContentHandlerFactory` or `ParseContext` at initialization time. This
option would require architectural changes to pass the handler type through.

Verdict: not viable without broader refactoring.
## Current Implementation (Option B + D)
We use a combination of Option B and Option D:

- Change the global default to `MARKDOWN` (Option D). Markdown output is a strict superset of plain text for search/indexing purposes. Users who explicitly need plain text can still set `"type": "TEXT"` in their config. This change aligns the default with the modern inference use case and is appropriate for a major version bump (4.x).
- Record the handler type in metadata. `BasicContentHandlerFactory` writes its `HANDLER_TYPE` enum name to `tika:content_handler_type` on the metadata object. This is set in both the CONCATENATE and RMETA code paths.
- Runtime warning in `AbstractEmbeddingFilter` (Option B) as a safety net. The embedding filter reads `tika:content_handler_type` from metadata and logs a warning if it is not `MARKDOWN`. This catches the case where a user explicitly overrides the handler to `TEXT` but still configures an embedding filter.
With this combination, the default configuration works out of the box for inference:

```json
{
  "metadata-filters": {
    "openai-embedding-filter": {
      "baseUrl": "http://localhost:8000",
      "model": "text-embedding-3-small"
    }
  }
}
```

There is no need to also configure the handler type; it defaults to `MARKDOWN`.
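Conversely, a user who genuinely wants plain-text output can opt back out. This fragment mirrors the default config shown in The Problem section, with the type set explicitly; the embedding filter's runtime warning (Option B) is what flags this combination if an embedding filter is also configured:

```json
{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1
    }
  }
}
```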
## Configuration Example (Full Inference Pipeline)
```json
{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "MARKDOWN",
      "writeLimit": -1
    }
  },
  "metadata-filters": {
    "openai-embedding-filter": {
      "baseUrl": "https://api.openai.com",
      "model": "text-embedding-3-small",
      "apiKey": "${OPENAI_API_KEY}",
      "maxChunkChars": 1500,
      "overlapChars": 200
    }
  },
  "fetchers": {
    "fs": {
      "file-system-fetcher": {
        "basePath": "/data/documents"
      }
    }
  },
  "emitters": {
    "es": {
      "elasticsearch-emitter": {
        "urls": ["http://localhost:9200"],
        "index": "documents"
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "/data/documents",
      "fetcherId": "fs",
      "emitterId": "es"
    }
  },
  "pipes": {
    "parseMode": "CONCATENATE",
    "numClients": 4
  }
}
```
Note: `parseMode` should be `CONCATENATE` (or `RMETA`) for inference. `CONTENT_ONLY`
skips metadata filters entirely, so the embedding filter would never run.
## Why Markdown?
The `MarkdownChunker` splits at these structural boundaries, in priority order:

- Headings (`# H1`, `## H2`, etc.): the strongest semantic boundary
- Thematic breaks (`---`): section separators
- Blank lines: paragraph boundaries
- Sentence endings: the last resort before a hard character split
With plain text, only blank lines and sentence endings are available. This means:

- A 10-page PDF with headings will chunk at heading boundaries with markdown, but at arbitrary paragraph breaks with plain text.
- Tables, code blocks, and lists retain their structure in markdown, improving embedding quality.
- The heading text itself appears in the chunk, giving the embedding model context about what the chunk is about.
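The difference is easy to see in a toy version of the boundary-priority idea. This sketch is not the actual `MarkdownChunker` implementation: it only handles headings and the blank-line fallback, skipping thematic breaks, sentence endings, and character limits.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkBoundarySketch {

    // Split markdown at heading boundaries; if the text has no headings
    // (e.g., plain-text extraction), fall back to blank-line paragraph splits.
    static List<String> chunk(String text) {
        List<String> chunks = new ArrayList<>();
        if (text.lines().anyMatch(l -> l.startsWith("#"))) {
            // Start a new chunk at every heading line; the heading text
            // stays inside its chunk, giving the embedding model context.
            StringBuilder current = new StringBuilder();
            for (String line : text.split("\n", -1)) {
                if (line.startsWith("#") && current.length() > 0) {
                    chunks.add(current.toString().strip());
                    current.setLength(0);
                }
                current.append(line).append('\n');
            }
            if (current.length() > 0) {
                chunks.add(current.toString().strip());
            }
        } else {
            // Plain text: only blank lines are available as boundaries.
            for (String para : text.split("\n\\s*\n")) {
                if (!para.isBlank()) {
                    chunks.add(para.strip());
                }
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Same document extracted as markdown vs. plain text.
        String markdown = "# Intro\nAlpha.\n## Usage\nBeta.\n## Details\nGamma.";
        String plain = "Intro\nAlpha.\nUsage\nBeta.\nDetails\nGamma.";
        System.out.println(chunk(markdown).size()); // 3: one chunk per heading
        System.out.println(chunk(plain).size());    // 1: no usable boundaries at all
    }
}
```

With markdown, each section becomes its own chunk and carries its heading; the plain-text extraction of the same document has no blank lines between sections, so it collapses into a single oversized chunk that would have to be split at arbitrary character limits.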