Parse Modes
Tika Pipes uses ParseMode to control how documents are parsed and how results are emitted.
The parse mode is configured in the pipes section of the JSON config, or overridden per-request
in the parseContext field of a FetchEmitTuple.
Available Parse Modes
| Mode | Description |
|---|---|
| RMETA | Default mode. Each embedded document produces its own Metadata object. |
| CONCATENATE | All embedded-document text is concatenated into a single content field on the container’s Metadata object. |
| CONTENT_ONLY | Same parsing as CONCATENATE, but only the extracted content is emitted (no JSON wrapping, no metadata overhead). |
| NO_PARSE | Skips parsing. Container-level MIME detection and digesting (if configured) still run. See NO_PARSE Mode. |
| UNPACK | Extract raw bytes from embedded documents. See Extracting Embedded Bytes. |
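As a small illustration, the config fragment for a given mode can be generated programmatically. The sketch below assumes only the mode names used on this page and the pipes/parseMode keys shown in the config examples; the helper name pipes_config is hypothetical and not part of Tika.

```python
import json

# Mode names as they appear on this page.
PARSE_MODES = ("RMETA", "CONCATENATE", "CONTENT_ONLY", "NO_PARSE", "UNPACK")


def pipes_config(parse_mode: str) -> str:
    """Return a JSON config fragment that sets the pipes parse mode.

    Hypothetical helper: it only mirrors the {"pipes": {"parseMode": ...}}
    shape used on this page and rejects names that are not listed modes.
    """
    if parse_mode not in PARSE_MODES:
        raise ValueError(f"unknown parse mode: {parse_mode!r}")
    return json.dumps({"pipes": {"parseMode": parse_mode}}, indent=2)


if __name__ == "__main__":
    print(pipes_config("CONCATENATE"))
```

Validating the name before writing the file catches typos early instead of at pipeline start-up.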
Content Handler Types
The content handler type determines the format of the extracted text. It is set on the
ContentHandlerFactory configured in parseContext (or via the CLI --handler flag), and applies
to all modes that produce content (RMETA, CONCATENATE, CONTENT_ONLY).
| Handler | Extension | Description |
|---|---|---|
| | | Plain text output |
| | | HTML output |
| | | XHTML output |
| m | .md | Markdown output |
| | | Body content handler output (text from the document body only) |
CONCATENATE Mode
CONCATENATE merges all extracted text — from the container and all embedded documents — into a single content field on the container’s Metadata object.
{
  "pipes": {
    "parseMode": "CONCATENATE"
  }
}
What’s in the result
- A single Metadata object (the container’s).
- X-TIKA:content contains the concatenated text of the container and all reachable embedded documents.
- Container-level metadata fields (title, author, content type, etc.) are present.
- The handler type used is recorded in X-TIKA:content_handler_type.
What’s NOT in the result
- Per-embedded-document metadata is discarded. If an embedded PDF has its own title and author, those values are not in the output. Only the container’s metadata is returned. Use RMETA if you need per-embedded metadata.
- Individual embedded-document parse exceptions are not surfaced as separate entries. They are handled by Tika’s embedded document extractor and may appear as embedded-exception fields on the container metadata, but there is no per-embedded Metadata object to inspect.
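The effect of concatenation on the result can be pictured with a small conceptual model. This is a sketch only, not Tika's implementation: it assumes per-document records are plain dictionaries with the container first, and the newline separator between texts is an assumption made for the illustration.

```python
from typing import Dict, List

CONTENT_KEY = "X-TIKA:content"


def concatenate(metadata_list: List[Dict[str, str]], sep: str = "\n") -> Dict[str, str]:
    """Collapse per-document records into one container-level record.

    metadata_list[0] is treated as the container; the rest stand in for
    embedded documents. The separator is an assumption, not Tika behavior.
    """
    if not metadata_list:
        raise ValueError("expected at least the container's metadata")
    # Start from the container's own fields (title, author, ...).
    container = dict(metadata_list[0])
    # Gather text from the container and every embedded document.
    texts = [m.get(CONTENT_KEY, "") for m in metadata_list]
    container[CONTENT_KEY] = sep.join(t for t in texts if t)
    # Embedded titles, authors, etc. are never copied over.
    return container


if __name__ == "__main__":
    docs = [
        {"title": "Report", CONTENT_KEY: "container text"},
        {"title": "Attachment", CONTENT_KEY: "embedded text"},
    ]
    print(concatenate(docs))
```

In the demo, the embedded record's title ("Attachment") does not appear in the merged result, which mirrors the loss of per-embedded metadata described above.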
Container-level exceptions
If the container parse fails (SAXException, EncryptedDocumentException, or any other Exception), the stack trace is caught, logged, and stored on the container metadata as X-TIKA:container_exception. The call still returns a result rather than throwing, so callers must check this field if they need to detect failure.
If the configured write limit is reached during concatenation, X-TIKA:write_limit_reached is set to true.
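Because failures are reported through metadata rather than exceptions, a consumer has to inspect each result. The following is a minimal sketch of such a check, assuming the container metadata has been loaded into a dictionary; the function name check_result is hypothetical, and whether the write-limit flag arrives as a JSON boolean or the string "true" is an assumption, so both are accepted.

```python
from typing import Dict, List

CONTAINER_EXCEPTION = "X-TIKA:container_exception"
WRITE_LIMIT_REACHED = "X-TIKA:write_limit_reached"


def check_result(metadata: Dict[str, object]) -> List[str]:
    """Return human-readable problems found on a container metadata record."""
    problems = []
    # A non-empty stack trace means the container parse failed.
    if metadata.get(CONTAINER_EXCEPTION):
        problems.append("container parse failed")
    # Accept a boolean True or the string "true" (the representation is assumed).
    if str(metadata.get(WRITE_LIMIT_REACHED, "")).lower() == "true":
        problems.append("write limit reached; content may be incomplete")
    return problems


if __name__ == "__main__":
    print(check_result({"X-TIKA:content": "ok"}))
    print(check_result({CONTAINER_EXCEPTION: "org.xml.sax.SAXException: ..."}))
```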
CONTENT_ONLY Mode
CONTENT_ONLY is designed for cases where you want just the extracted content
written to storage — no JSON wrapping, no metadata overhead. This is particularly
useful for:
- Extracting markdown files from a document corpus
- Building plain text search indexes
- Generating HTML versions of documents
{
  "pipes": {
    "parseMode": "CONTENT_ONLY"
  }
}
How it works
- Documents are parsed identically to CONCATENATE mode: all embedded text is merged into the container’s content field, and the same caveats around per-embedded metadata apply.
- A metadata filter automatically strips all metadata except X-TIKA:content and X-TIKA:container_exception (for error tracking).
- When the emitter is a StreamEmitter (such as the filesystem or S3 emitter), the raw content string is written directly as bytes, with no JSON serialization.
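The difference between the two output shapes can be sketched as follows. This is a conceptual model rather than the emitter's code: the function name serialize is hypothetical, and UTF-8 is assumed as the output encoding.

```python
import json
from typing import Dict


def serialize(metadata: Dict[str, str], content_only: bool) -> bytes:
    """Model the bytes an emitter would write for one result.

    content_only=True mimics CONTENT_ONLY with a stream emitter: just the
    content string. Otherwise the whole record is written as JSON.
    UTF-8 is an assumption made for this sketch.
    """
    if content_only:
        return metadata.get("X-TIKA:content", "").encode("utf-8")
    return json.dumps(metadata, ensure_ascii=False).encode("utf-8")


if __name__ == "__main__":
    record = {"X-TIKA:content": "# Title\n\nBody text.\n", "title": "Title"}
    print(serialize(record, content_only=True))
    print(serialize(record, content_only=False))
```

With content_only set, the output is the markdown itself and can be opened directly; the JSON form needs a parsing step before the text is usable.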
Metadata filtering
By default, CONTENT_ONLY mode applies an IncludeFieldMetadataFilter that retains
only X-TIKA:content and X-TIKA:container_exception. If you set your own
MetadataFilter on the ParseContext, your filter takes priority.
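The default filtering behaves like an allow-list over field names. A minimal sketch of that idea, using plain dictionaries in place of Tika's Metadata objects (the function name include_fields is hypothetical and this is not the IncludeFieldMetadataFilter source):

```python
from typing import Dict, Iterable

# The two fields retained by default, as described above.
DEFAULT_KEEP = ("X-TIKA:content", "X-TIKA:container_exception")


def include_fields(metadata: Dict[str, str],
                   keep: Iterable[str] = DEFAULT_KEEP) -> Dict[str, str]:
    """Return a copy of metadata containing only the allowed field names."""
    wanted = set(keep)
    return {k: v for k, v in metadata.items() if k in wanted}


if __name__ == "__main__":
    record = {
        "X-TIKA:content": "hello",
        "title": "Greeting",
        "Content-Type": "text/plain",
    }
    print(include_fields(record))
```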
CLI usage
The tika-app batch processor supports CONTENT_ONLY via the --content-only
flag:
java -jar tika-app.jar -i /input -o /output --handler m --content-only
This produces .md files (when using the m handler type) containing only the
extracted markdown content. See Content Handler Types for the available handler types.
NO_PARSE Mode
NO_PARSE skips parsing entirely. The container’s content type is still detected, and any configured digester still runs against the raw bytes. No text is extracted, no embedded documents are recursed into.
{
  "parseContext": {
    "parseMode": "NO_PARSE"
  }
}
What still runs
- MIME detection. The configured Detector runs against the input stream and populates Content-Type and X-TIKA:content_type_parser_override on the container metadata.
- Digesting. If a DigesterFactory is configured on the ParseContext, it runs against the raw bytes and writes the digest fields (e.g., X-TIKA:digest:SHA256) to the container metadata before the parse-mode check.
What does NOT run
- No parser is invoked. X-TIKA:content is empty.
- No embedded documents are extracted.
- No content handler is constructed (handler-type configuration is ignored for this mode).
When to use
- Fetch-and-emit pipelines that move bytes from one store to another and need only the content type and a fixed-bytes digest for downstream routing or deduplication.
- Hash-only inventories of large corpora where parsing every document is too expensive but a stable digest per file is required.
- MIME triage: detect content types across a large set so a downstream pipeline can pick the right parser, parse mode, or skip rule.
Because digest and detection run in _preParse regardless of parse mode, switching between NO_PARSE and the parsing modes leaves digest values stable for the same input — useful for cross-stage joins.
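The deduplication and triage use cases can be sketched on the consumer side. The example below assumes NO_PARSE results have already been collected as (path, metadata) pairs, which is a hypothetical shape chosen for illustration; the field names are the ones mentioned on this page, and a SHA256 digester is assumed to be configured.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

DIGEST_KEY = "X-TIKA:digest:SHA256"
TYPE_KEY = "Content-Type"

Record = Tuple[str, Dict[str, str]]  # (path, container metadata); hypothetical shape


def find_duplicates(records: Iterable[Record]) -> Dict[str, List[str]]:
    """Map each digest seen more than once to the paths that share it."""
    by_digest: Dict[str, List[str]] = defaultdict(list)
    for path, metadata in records:
        digest = metadata.get(DIGEST_KEY)
        if digest:  # skip records without a digest
            by_digest[digest].append(path)
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}


def group_by_type(records: Iterable[Record]) -> Dict[str, List[str]]:
    """Group paths by detected content type for downstream routing."""
    by_type: Dict[str, List[str]] = defaultdict(list)
    for path, metadata in records:
        by_type[metadata.get(TYPE_KEY, "unknown")].append(path)
    return dict(by_type)


if __name__ == "__main__":
    results = [
        ("a.pdf", {TYPE_KEY: "application/pdf", DIGEST_KEY: "abc"}),
        ("copy-of-a.pdf", {TYPE_KEY: "application/pdf", DIGEST_KEY: "abc"}),
        ("notes.txt", {TYPE_KEY: "text/plain", DIGEST_KEY: "def"}),
    ]
    print(find_duplicates(results))
    print(group_by_type(results))
```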
UNPACK Mode
UNPACK extracts the raw bytes of embedded documents (rather than their parsed text) and emits them via the configured emitter. See Extracting Embedded Bytes for the full configuration model.
The recursive parsing pass for UNPACK uses the same code path as RMETA; the difference is at setup and emit time, where mandatory byte extraction is enabled and emitted bytes are routed through the UnpackHandler.