Parse Modes

Tika Pipes uses ParseMode to control how documents are parsed and how results are emitted. The parse mode is configured in the pipes section of the JSON config, or overridden per-request in the parseContext field of a FetchEmitTuple.

Available Parse Modes

Mode          Description

RMETA         Default mode. Each embedded document produces its own Metadata object. Results are returned as a JSON array of metadata objects, preserving per-embedded metadata.

CONCATENATE   All embedded-document text is concatenated into a single content field on the container’s Metadata object. Per-embedded metadata is not retained in the result. See CONCATENATE Mode.

CONTENT_ONLY  Same parsing as CONCATENATE, but emitters write only the raw content, with no JSON wrapper and no metadata fields. See CONTENT_ONLY Mode.

NO_PARSE      Skips parsing. Container-level MIME detection and digesting (if configured) still run. See NO_PARSE Mode.

UNPACK        Extract raw bytes from embedded documents. See Extracting Embedded Bytes.

Content Handler Types

The content handler type determines the format of the extracted text. It is set on the ContentHandlerFactory configured in parseContext (or via the CLI --handler flag), and applies to all modes that produce content (RMETA, CONCATENATE, CONTENT_ONLY).

Handler        Extension   Description

t (text)       .txt        Plain text output

h (html)       .html       HTML output

x (xml)        .xml        XHTML output

m (markdown)   .md         Markdown output

b (body)       .txt        Body content handler output (text from the document body only)
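
The handler-to-extension mapping above can be written as a small lookup. This is only an illustrative sketch: the function name output_name is hypothetical, and appending the extension to the full source name is an assumption here, not a statement of how Tika names its output files.

```python
# Handler type -> file extension, copied from the table above.
HANDLER_EXTENSIONS = {
    "t": ".txt",   # plain text
    "h": ".html",  # HTML
    "x": ".xml",   # XHTML
    "m": ".md",    # markdown
    "b": ".txt",   # body text only
}


def output_name(source_name, handler):
    """Hypothetical helper: build an output file name for a handler type.

    Assumption: the extension is appended to the source name as-is.
    """
    if handler not in HANDLER_EXTENSIONS:
        raise ValueError(f"unknown handler type: {handler!r}")
    return source_name + HANDLER_EXTENSIONS[handler]


print(output_name("report.pdf", "m"))
```

Note that t and b share the .txt extension, so the extension alone does not identify which handler produced a file.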

CONCATENATE Mode

CONCATENATE merges all extracted text — from the container and all embedded documents — into a single content field on the container’s Metadata object.

{
  "pipes": {
    "parseMode": "CONCATENATE"
  }
}

What’s in the result

  • A single Metadata object (the container’s).

  • X-TIKA:content contains the concatenated text of the container and all reachable embedded documents.

  • Container-level metadata fields (title, author, content type, etc.) are present.

  • The handler type used is recorded in X-TIKA:content_handler_type.

What’s NOT in the result

  • Per-embedded-document metadata is discarded. If an embedded PDF has its own title and author, those values are not in the output. Only the container’s metadata is returned. Use RMETA if you need per-embedded metadata.

  • Individual embedded-document parse exceptions are not surfaced as separate entries. They are handled by Tika’s embedded document extractor and may appear as embedded-exception fields on the container metadata, but there is no per-embedded Metadata object to inspect.
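
To make the difference in result shape concrete, here is a rough Python model that collapses an RMETA-style list into a CONCATENATE-style object. The sample data, the dc:title key, the container-first ordering, and the newline separator are assumptions for illustration; Tika performs the merge during parsing, not by joining strings after the fact.

```python
import json

CONTENT_KEY = "X-TIKA:content"

# Hypothetical RMETA-style result: one metadata object per document,
# assumed here to list the container first, then embedded documents.
rmeta_result = [
    {"Content-Type": "application/zip", "dc:title": "archive",
     CONTENT_KEY: "container text"},
    {"Content-Type": "application/pdf", "dc:title": "Embedded report",
     CONTENT_KEY: "pdf text"},
]


def concatenate_view(rmeta):
    """Model the CONCATENATE shape: container metadata plus merged text."""
    container = dict(rmeta[0])  # only the container's metadata survives
    parts = [m.get(CONTENT_KEY, "") for m in rmeta]
    container[CONTENT_KEY] = "\n".join(p for p in parts if p)
    return container


merged = concatenate_view(rmeta_result)
print(json.dumps(merged, indent=2))
```

The embedded PDF's title does not appear anywhere in the merged object, which is the trade-off described above.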

Container-level exceptions

If the container parse fails (SAXException, EncryptedDocumentException, or any other Exception), the exception is caught and logged, and its stack trace is stored on the container metadata as X-TIKA:container_exception. The parse call then returns normally instead of throwing, so callers must check this field if they need to detect failure.

If the configured write limit is reached during concatenation, X-TIKA:write_limit_reached is set to true.
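
Because failures are reported through metadata fields rather than thrown, a consumer has to inspect the result. A minimal sketch of such a check, assuming the result has already been loaded as a Python dict (the function name classify_result and the three status labels are hypothetical):

```python
def classify_result(metadata):
    """Classify a CONCATENATE result by the status fields it carries."""
    if "X-TIKA:container_exception" in metadata:
        return "failed"
    # Accept either a JSON boolean or the string "true".
    limit = str(metadata.get("X-TIKA:write_limit_reached", "")).lower()
    if limit == "true":
        return "truncated"
    return "ok"


print(classify_result({"X-TIKA:content": "hello"}))
```

Checking the exception field first means a document that both failed and hit the write limit is reported as failed.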

CONTENT_ONLY Mode

CONTENT_ONLY is designed for cases where you want just the extracted content written to storage — no JSON wrapping, no metadata overhead. This is particularly useful for:

  • Extracting markdown files from a document corpus

  • Building plain text search indexes

  • Generating HTML versions of documents

{
  "pipes": {
    "parseMode": "CONTENT_ONLY"
  }
}

How it works

  1. Documents are parsed identically to CONCATENATE mode — all embedded text is merged into the container’s content field, and the same caveats around per-embedded metadata apply.

  2. A metadata filter automatically strips all metadata except X-TIKA:content and X-TIKA:container_exception (for error tracking).

  3. When the emitter is a StreamEmitter (such as the filesystem or S3 emitter), the raw content string is written directly as bytes — no JSON serialization.
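
The difference in step 3 between a JSON-wrapped emit and a raw-content emit can be sketched as follows. This is a simplified Python model, not the StreamEmitter code; the emit function is hypothetical and UTF-8 encoding is an assumption.

```python
import io
import json


def emit(metadata, out, content_only):
    """Write either the raw content or a JSON-wrapped metadata list."""
    if content_only:
        # Raw content only: no JSON wrapper, no metadata fields.
        out.write(metadata.get("X-TIKA:content", "").encode("utf-8"))
    else:
        out.write(json.dumps([metadata]).encode("utf-8"))


doc = {"X-TIKA:content": "# Title\n\nBody text", "dc:title": "Title"}

raw_out = io.BytesIO()
emit(doc, raw_out, content_only=True)

json_out = io.BytesIO()
emit(doc, json_out, content_only=False)

print(raw_out.getvalue())
print(json_out.getvalue())
```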

Metadata filtering

By default, CONTENT_ONLY mode applies an IncludeFieldMetadataFilter that retains only X-TIKA:content and X-TIKA:container_exception. If you set your own MetadataFilter on the ParseContext, your filter takes priority.
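
The filter precedence described above can be modeled in a few lines. This is an illustrative Python sketch rather than Tika's IncludeFieldMetadataFilter; the names include_fields and apply_filter are hypothetical.

```python
# Fields the default CONTENT_ONLY filter retains, per the text above.
DEFAULT_KEEP = frozenset({"X-TIKA:content", "X-TIKA:container_exception"})


def include_fields(metadata, keep=DEFAULT_KEEP):
    """Keep only the listed fields, dropping everything else."""
    return {k: v for k, v in metadata.items() if k in keep}


def apply_filter(metadata, user_filter=None):
    """Use the caller's filter if one is supplied, else the default."""
    chosen = user_filter if user_filter is not None else include_fields
    return chosen(metadata)


doc = {"X-TIKA:content": "text", "dc:title": "Title", "Content-Type": "application/pdf"}

print(apply_filter(doc))
print(apply_filter(doc, user_filter=lambda m: {"dc:title": m["dc:title"]}))
```

In this model a user-supplied filter replaces the default entirely, so it must retain X-TIKA:content itself if the content is still wanted.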

CLI usage

The tika-app batch processor supports CONTENT_ONLY via the --content-only flag:

java -jar tika-app.jar -i /input -o /output --handler m --content-only

This produces .md files (when using the m handler type) containing only the extracted markdown content. See Content Handler Types for the available handler types.

NO_PARSE Mode

NO_PARSE skips parsing entirely. The container’s content type is still detected, and any configured digester still runs against the raw bytes. No text is extracted, and embedded documents are not recursed into.

{
  "parseContext": {
    "parseMode": "NO_PARSE"
  }
}

What still runs

  • MIME detection. The configured Detector runs against the input stream and populates Content-Type and X-TIKA:content_type_parser_override on the container metadata.

  • Digesting. If a DigesterFactory is configured on the ParseContext, it runs against the raw bytes and writes the digest fields (e.g., X-TIKA:digest:SHA256) to the container metadata before the parse-mode check.

What does NOT run

  • No parser is invoked. X-TIKA:content is empty.

  • No embedded documents are extracted.

  • No content handler is constructed (handler-type configuration is ignored for this mode).
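
The ordering described in the two lists above (detection and digesting first, then the parse-mode check) can be sketched as a toy pipeline. Everything here is a simplified stand-in: detect looks only at a PDF magic prefix, the "parser" just decodes bytes, the hex digest encoding is an assumption, and empty content is modeled as a missing key.

```python
import hashlib


def detect(raw):
    """Toy detector: recognizes only the PDF magic prefix."""
    return "application/pdf" if raw.startswith(b"%PDF") else "application/octet-stream"


def pre_parse(raw):
    """Steps that run for every parse mode: detection and digesting."""
    return {
        "Content-Type": detect(raw),
        "X-TIKA:digest:SHA256": hashlib.sha256(raw).hexdigest(),
    }


def process(raw, parse_mode):
    metadata = pre_parse(raw)
    if parse_mode == "NO_PARSE":
        return metadata  # no parser, no content handler, no content
    # Placeholder for real parsing.
    metadata["X-TIKA:content"] = raw.decode("latin-1")
    return metadata


data = b"%PDF-1.7 example bytes"
skipped = process(data, "NO_PARSE")
parsed = process(data, "RMETA")
print(skipped)
```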

When to use

  • Fetch-and-emit pipelines that move bytes from one store to another and need only the content type and a digest of the raw bytes for downstream routing or deduplication.

  • Hash-only inventories of large corpora where parsing every document is too expensive but a stable digest per file is required.

  • MIME triage: detect content types across a large set so a downstream pipeline can pick the right parser, parse mode, or skip rule.
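
As an illustration of the MIME triage case, a downstream stage might map the detected Content-Type to a parse mode. The routing policy below is entirely hypothetical and only shows the shape of such a rule; choose_parse_mode is not part of Tika.

```python
def choose_parse_mode(content_type):
    """Hypothetical routing rule from a detected type to a parse mode."""
    # Drop parameters such as "; charset=UTF-8" before matching.
    base = content_type.split(";", 1)[0].strip().lower()
    if base.startswith(("audio/", "video/")):
        return "NO_PARSE"   # assumed too costly to parse in this pipeline
    if base in {"application/zip", "application/x-tar"}:
        return "UNPACK"     # assumed policy: pull out embedded bytes
    return "RMETA"          # the default mode


for ct in ("video/mp4", "application/zip", "text/html; charset=UTF-8"):
    print(ct, "->", choose_parse_mode(ct))
```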

Because digest and detection run in _preParse regardless of parse mode, switching between NO_PARSE and the parsing modes leaves digest values stable for the same input — useful for cross-stage joins.
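
A cross-stage join on the digest might look like the following sketch. The path field, the sample bytes, and the lowercase hex encoding of the digest are assumptions for illustration; use whatever encoding your digester is configured to emit.

```python
import hashlib

DIGEST_KEY = "X-TIKA:digest:SHA256"


def digest(raw):
    """SHA-256 over the raw bytes, hex-encoded (assumed encoding)."""
    return hashlib.sha256(raw).hexdigest()


files = {"a.pdf": b"%PDF-1.7 first", "b.bin": b"\x00\x01 second"}

# Stage 1: NO_PARSE inventory of every file.
inventory = [{"path": p, DIGEST_KEY: digest(raw)} for p, raw in files.items()]

# Stage 2: a later parsing run over a subset of the same inputs.
parsed = [{DIGEST_KEY: digest(files["a.pdf"]), "X-TIKA:content": "first"}]

# Join the two stages on the digest value.
by_digest = {row[DIGEST_KEY]: row for row in parsed}
joined = [
    {**inv, "content": by_digest[inv[DIGEST_KEY]]["X-TIKA:content"]}
    for inv in inventory
    if inv[DIGEST_KEY] in by_digest
]
print(joined)
```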

UNPACK Mode

UNPACK extracts the raw bytes of embedded documents (rather than their parsed text) and emits them via the configured emitter. See Extracting Embedded Bytes for the full configuration model.

The recursive parsing pass for UNPACK uses the same code path as RMETA; the difference is at setup and emit time, where mandatory byte extraction is enabled and emitted bytes are routed through the UnpackHandler.