Parse Modes
Tika Pipes uses parse modes to control how documents are parsed and how results are emitted.
The mode is set as parseMode in the pipes section of the JSON config, and can be overridden
per-request from Java code by attaching a ParseMode to the ParseContext on the
FetchEmitTuple you submit.
Available Parse Modes
| Mode | Description |
|---|---|
| RMETA | Default mode. Each embedded document produces its own Metadata object. |
| CONCATENATE | All embedded-document text is concatenated into a single content field on the container’s Metadata object. |
| CONTENT_ONLY | Same parsing as CONCATENATE, but only the extracted content is emitted. |
| NO_PARSE | Skips parsing. Container-level MIME detection and digesting (if configured) still run. See NO_PARSE Mode. |
| UNPACK | Extract raw bytes from embedded documents. See Extracting Embedded Bytes. |
Content Handler Types
The content handler type determines the format of the extracted text. It is set in the
top-level content-handler-factory section of the JSON config (or via the CLI --handler flag),
and applies to all modes that produce content (RMETA, CONCATENATE, CONTENT_ONLY).
{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT"
    }
  }
}
Accepted type values: TEXT, HTML, XML, MARKDOWN, BODY, IGNORE. The CLI
--handler flag uses single-letter shortcuts (t, h, x, m, b, i) that map onto
these values.
| Handler | Extension | Description |
|---|---|---|
| TEXT | | Plain text output |
| HTML | | HTML output |
| XML | | XHTML output |
| MARKDOWN | .md | Markdown output |
| BODY | | Body content handler output (text from the document body only) |
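The parse mode and the handler type live in different sections of the same JSON config. As a minimal sketch, the Python helper below assembles both sections and accepts either a full type value or a CLI shortcut. The helper name `build_config` is hypothetical, and pairing each shortcut letter with the type value that shares its initial is an assumption based on the list above; nothing here is part of Tika itself.

```python
import json

# Parse modes and handler type values named in this document.
PARSE_MODES = ("RMETA", "CONCATENATE", "CONTENT_ONLY", "NO_PARSE", "UNPACK")

# Assumption: each CLI shortcut maps to the type value with the same initial.
HANDLER_SHORTCUTS = {
    "t": "TEXT", "h": "HTML", "x": "XML",
    "m": "MARKDOWN", "b": "BODY", "i": "IGNORE",
}


def build_config(parse_mode, handler=None):
    """Return a config dict with a pipes section and, optionally, a handler section."""
    if parse_mode not in PARSE_MODES:
        raise ValueError(f"unknown parse mode: {parse_mode!r}")
    config = {"pipes": {"parseMode": parse_mode}}
    if handler is not None:
        # Accept a shortcut such as "m" or a full value such as "MARKDOWN".
        handler_type = HANDLER_SHORTCUTS.get(handler.lower(), handler.upper())
        if handler_type not in HANDLER_SHORTCUTS.values():
            raise ValueError(f"unknown handler type: {handler!r}")
        config["content-handler-factory"] = {
            "basic-content-handler-factory": {"type": handler_type}
        }
    return config


if __name__ == "__main__":
    # Print a config that concatenates all text and renders it as Markdown.
    print(json.dumps(build_config("CONCATENATE", "m"), indent=2))
```

Validating the names before writing the file catches typos such as a misspelled mode early, rather than at pipeline start-up.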
CONCATENATE Mode
CONCATENATE merges all extracted text — from the container and all embedded documents — into a single content field on the container’s Metadata object.
{
  "pipes": {
    "parseMode": "CONCATENATE"
  }
}
What’s in the result
- A single Metadata object (the container’s).
- X-TIKA:content contains the concatenated text of the container and all reachable embedded documents.
- Container-level metadata fields (title, author, content type, etc.) are present.
- The handler type used is recorded in X-TIKA:content_handler_type.
What’s NOT in the result
- Per-embedded-document metadata is discarded. If an embedded PDF has its own title and author, those values are not in the output. Only the container’s metadata is returned. Use RMETA if you need per-embedded metadata.
- Individual embedded-document parse exceptions are not surfaced as separate entries. They are handled by Tika’s embedded document extractor and may appear as embedded-exception fields on the container metadata, but there is no per-embedded Metadata object to inspect.
Container-level exceptions
If the container parse fails (SAXException, EncryptedDocumentException, or any other Exception), the exception is caught and logged, and its stack trace is stored on the container metadata as X-TIKA:container_exception. The parse returns normally instead of throwing, so callers must check this field if they need to detect failure.
If the configured write limit is reached during concatenation, X-TIKA:write_limit_reached is set to true.
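Because failures and truncation are reported through metadata fields rather than thrown, downstream code has to inspect the result explicitly. The sketch below shows one way to do that in Python, assuming the container metadata has already been loaded into a plain dict (for example from emitted JSON). The function name `summarize_result` and the shape of its return value are hypothetical.

```python
def summarize_result(metadata):
    """Classify one container-level metadata mapping from a CONCATENATE parse."""
    exception = metadata.get("X-TIKA:container_exception")
    # The flag may arrive as a boolean or as the string "true", so normalize it.
    truncated = str(metadata.get("X-TIKA:write_limit_reached", "")).lower() == "true"
    return {
        "failed": exception is not None,
        "truncated": truncated,
        "content": metadata.get("X-TIKA:content", ""),
        "stack_trace": exception,
    }


if __name__ == "__main__":
    # Illustrative metadata only; the values are made up for this example.
    sample = {
        "X-TIKA:content": "partial text",
        "X-TIKA:write_limit_reached": "true",
    }
    summary = summarize_result(sample)
    if summary["failed"]:
        print("container parse failed")
    elif summary["truncated"]:
        print("content was cut off at the write limit")
    else:
        print("parse completed")
```

A result can be both failed and truncated, so the two flags are reported independently instead of being folded into a single status.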
CONTENT_ONLY Mode
CONTENT_ONLY is designed for cases where you want just the extracted content
written to storage — no JSON wrapping, no metadata overhead. This is particularly
useful for:
- Extracting markdown files from a document corpus
- Building plain text search indexes
- Generating HTML versions of documents
{
  "pipes": {
    "parseMode": "CONTENT_ONLY"
  }
}
How it works
- Documents are parsed identically to CONCATENATE mode: all embedded text is merged into the container’s content field, and the same caveats around per-embedded metadata apply.
- A metadata filter automatically strips all metadata except X-TIKA:content and X-TIKA:container_exception (for error tracking).
- When the emitter is a StreamEmitter (such as the filesystem or S3 emitter), the raw content string is written directly as bytes, with no JSON serialization.
Metadata filtering
By default, CONTENT_ONLY mode applies an IncludeFieldMetadataFilter that retains
only X-TIKA:content and X-TIKA:container_exception. If you set your own
MetadataFilter on the ParseContext, your filter takes priority.
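To make the effect of the default filter concrete, here is a small Python model of it. It is only an illustration of the behaviour described above, not Tika’s implementation: the function names are hypothetical, metadata is represented as a plain dict, and UTF-8 encoding of the emitted bytes is an assumption.

```python
# The two fields the default CONTENT_ONLY filter is described as retaining.
CONTENT_ONLY_FIELDS = ("X-TIKA:content", "X-TIKA:container_exception")


def apply_include_filter(metadata, keep=CONTENT_ONLY_FIELDS):
    """Keep only the listed fields, mimicking an include-field filter."""
    return {key: value for key, value in metadata.items() if key in keep}


def to_stream_bytes(filtered):
    """Model a raw-content write: the content string as bytes, no JSON wrapper."""
    # Assumption: UTF-8; the document does not state the encoding used.
    return filtered.get("X-TIKA:content", "").encode("utf-8")


if __name__ == "__main__":
    # Illustrative metadata only; the values are made up for this example.
    parsed = {
        "X-TIKA:content": "# Title\n\nBody text",
        "Content-Type": "application/pdf",
        "dc:title": "Title",
    }
    filtered = apply_include_filter(parsed)
    print(sorted(filtered))           # only the retained field names
    print(to_stream_bytes(filtered))  # what a stream-style emitter would write
```

Passing a different `keep` tuple mirrors the case where a custom filter replaces the default one.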
CLI usage
The tika-app Pipes processor supports CONTENT_ONLY via the --content-only
flag:
java -jar tika-app.jar -i /input -o /output --handler m --content-only
This produces .md files (when using the m handler type) containing only the
extracted markdown content. See Content Handler Types for the available handler types.
NO_PARSE Mode
NO_PARSE skips parsing entirely. The container’s content type is still detected, and any configured digester still runs against the raw bytes. No text is extracted, and no embedded documents are recursed into.
{
  "pipes": {
    "parseMode": "NO_PARSE"
  }
}
What still runs
- MIME detection. The configured Detector runs against the input stream and populates Content-Type and X-TIKA:content_type_parser_override on the container metadata.
- Digesting. If a DigesterFactory is configured on the ParseContext, it runs against the raw bytes and writes the digest fields (e.g., X-TIKA:digest:SHA256) to the container metadata before the parse-mode check.
What does NOT run
- No parser is invoked. X-TIKA:content is empty.
- No embedded documents are extracted.
- No content handler is constructed (handler-type configuration is ignored for this mode).
When to use
- Fetch-and-emit pipelines that move bytes from one store to another and need only the content type and a fixed-bytes digest for downstream routing or deduplication.
- Hash-only inventories of large corpora where parsing every document is too expensive but a stable digest per file is required.
- MIME triage: detect content types across a large set so a downstream pipeline can pick the right parser, parse mode, or skip rule.
Because digest and detection run in _preParse regardless of parse mode, switching between NO_PARSE and the parsing modes leaves digest values stable for the same input — useful for cross-stage joins.
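The deduplication and triage use cases above can be sketched in a few lines of Python. This assumes each NO_PARSE result has been loaded into a dict containing the metadata fields named in this section, that a SHA256 digester is configured, and that each record carries a `path` key identifying the source file; that key and the function names are hypothetical.

```python
from collections import Counter, defaultdict

DIGEST_KEY = "X-TIKA:digest:SHA256"


def find_duplicates(records):
    """Map each digest seen more than once to the paths that share it."""
    groups = defaultdict(list)
    for record in records:
        digest = record.get(DIGEST_KEY)
        if digest is not None:
            # "path" is a hypothetical identifier for the source file.
            groups[digest].append(record.get("path"))
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}


def count_content_types(records):
    """Tally detected content types so a later stage can pick parsers or skip rules."""
    return Counter(record.get("Content-Type", "unknown") for record in records)


if __name__ == "__main__":
    # Illustrative records only; digests are shortened placeholders.
    inventory = [
        {"path": "a.pdf", "Content-Type": "application/pdf", DIGEST_KEY: "aaa"},
        {"path": "copy-of-a.pdf", "Content-Type": "application/pdf", DIGEST_KEY: "aaa"},
        {"path": "b.docx", "Content-Type": "application/zip", DIGEST_KEY: "bbb"},
    ]
    print(find_duplicates(inventory))
    print(count_content_types(inventory))
```

Because the digest is the same whether or not the document is later parsed, the same key can be used to join this inventory against results from a parsing stage.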
UNPACK Mode
UNPACK extracts the raw bytes of embedded documents (rather than their parsed text) and emits them via the configured emitter. See Extracting Embedded Bytes for the full configuration model.
The recursive parsing pass for UNPACK uses the same code path as RMETA; the difference is at setup and emit time, where mandatory byte extraction is enabled and emitted bytes are routed through the UnpackHandler.