Parse Modes
Tika Pipes uses ParseMode to control how documents are parsed and how results are emitted.
The parse mode is set on the ParseContext or configured in PipesConfig.
Available Parse Modes
| Mode | Description |
|---|---|
| RMETA | Default mode. Each embedded document produces a separate Metadata object. |
| CONCATENATE | All content from embedded documents is concatenated into a single content field. Results are returned as a single Metadata object. |
| CONTENT_ONLY | Parses like CONCATENATE, but emits only the extracted content, with no JSON wrapping or metadata. |
| NO_PARSE | Skip parsing entirely. Useful for pipelines that only need to fetch and emit raw bytes. |
| UNPACK | Extract raw bytes from embedded documents. See Extracting Embedded Bytes. |
CONCATENATE Mode
CONCATENATE merges all content from embedded documents into a single content field
while preserving all metadata from parsing:
{
  "parseContext": {
    "parseMode": "CONCATENATE"
  }
}
The result is a single Metadata object containing the concatenated content in
X-TIKA:content along with all other metadata fields (title, author, content type, etc.).
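As a rough mental model of the result shape, the following Python sketch merges per-document records into one. It is an illustration only, not Tika's implementation: the `concatenate` helper, the dict-based records, the newline separator, and the choice to take non-content fields from the container document are all assumptions. Only the `X-TIKA:content` key comes from the text above.

```python
CONTENT_KEY = "X-TIKA:content"  # field name taken from the text above


def concatenate(documents, separator="\n"):
    """Merge a container document and its embedded documents into one record.

    `documents` is a list of dicts with the container first. Assumption:
    non-content fields are taken from the container; Tika's actual merge
    rules may differ.
    """
    if not documents:
        raise ValueError("expected at least the container document")
    # Start from the container's metadata, minus its content field.
    merged = {k: v for k, v in documents[0].items() if k != CONTENT_KEY}
    # Collect non-empty content from every document, in order.
    parts = [d[CONTENT_KEY] for d in documents if d.get(CONTENT_KEY)]
    merged[CONTENT_KEY] = separator.join(parts)
    return merged


docs = [
    {"dc:title": "Report", "Content-Type": "application/pdf", CONTENT_KEY: "Main body."},
    {"Content-Type": "text/plain", CONTENT_KEY: "Attachment text."},
]
result = concatenate(docs)
print(result)
```

The point of the sketch is the shape of the output: one record, one content field, with the other metadata fields still present.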
CONTENT_ONLY Mode
CONTENT_ONLY is designed for use cases where you want just the extracted content
written to storage — no JSON wrapping, no metadata overhead. This is particularly
useful for:
- Extracting markdown files from a document corpus
- Building plain text search indexes
- Generating HTML versions of documents
{
  "parseContext": {
    "parseMode": "CONTENT_ONLY"
  }
}
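If the configuration is generated by a script rather than written by hand, the fragment above can be merged into an existing configuration without disturbing other settings. The sketch below assumes the configuration is handled as a plain Python dict; `with_parse_mode` is a hypothetical helper, and `someOtherSetting` is a placeholder key rather than a real Tika option.

```python
import copy
import json


def with_parse_mode(config, parse_mode):
    """Return a copy of `config` with parseContext.parseMode set.

    The input dict is left untouched so callers can reuse a base config.
    """
    updated = copy.deepcopy(config)
    updated.setdefault("parseContext", {})["parseMode"] = parse_mode
    return updated


# Placeholder base config; "someOtherSetting" is not a real Tika option.
base = {"parseContext": {"someOtherSetting": True}}
config = with_parse_mode(base, "CONTENT_ONLY")
print(json.dumps(config, indent=2))
```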
How It Works
- Documents are parsed identically to CONCATENATE mode: all embedded content is merged into a single content field.
- A metadata filter automatically strips all metadata except X-TIKA:content and X-TIKA:CONTAINER_EXCEPTION (for error tracking).
- When the emitter is a StreamEmitter (such as the filesystem or S3 emitter), the raw content string is written directly as bytes, with no JSON serialization.
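The difference in the final step, raw bytes versus a JSON document, can be sketched as follows. This is a toy model rather than Tika's emitter code: the `emit` function and its `content_only` flag are hypothetical, and UTF-8 encoding is an assumption of the sketch.

```python
import io
import json

CONTENT_KEY = "X-TIKA:content"  # field name taken from the text above


def emit(metadata, stream, content_only):
    """Write one parsed record to a binary stream and return the byte count.

    content_only=True mimics the raw-content path: only the content string
    is written, as bytes. Otherwise the whole record is serialized as JSON.
    """
    if content_only:
        payload = metadata.get(CONTENT_KEY, "").encode("utf-8")
    else:
        payload = json.dumps(metadata).encode("utf-8")
    stream.write(payload)
    return len(payload)


record = {CONTENT_KEY: "# Title\n\nBody text.", "Content-Type": "application/pdf"}
raw, wrapped = io.BytesIO(), io.BytesIO()
emit(record, raw, content_only=True)
emit(record, wrapped, content_only=False)
print(raw.getvalue())
print(wrapped.getvalue())
```

In the raw case the output can be opened directly as a markdown, text, or HTML file; in the wrapped case a consumer first has to parse JSON and pull out the content field.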
Metadata Filtering
By default, CONTENT_ONLY mode applies an IncludeFieldMetadataFilter that retains
only X-TIKA:content and X-TIKA:CONTAINER_EXCEPTION. If you set your own
MetadataFilter on the ParseContext, your filter takes priority.
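The precedence rule can be pictured with a small sketch. It models filters as plain Python functions over dicts and does not use Tika's MetadataFilter API: `include_fields` and `resolve_filter` are hypothetical names, and only the two default field names come from the text above.

```python
# Fields retained by default in CONTENT_ONLY mode, per the text above.
DEFAULT_KEEP = ("X-TIKA:content", "X-TIKA:CONTAINER_EXCEPTION")


def include_fields(*fields):
    """Build a filter that keeps only the named metadata fields."""
    keep = frozenset(fields)

    def apply(metadata):
        return {k: v for k, v in metadata.items() if k in keep}

    return apply


def resolve_filter(user_filter=None):
    """A user-supplied filter wins; otherwise fall back to the default."""
    if user_filter is not None:
        return user_filter
    return include_fields(*DEFAULT_KEEP)


record = {
    "X-TIKA:content": "Body text.",
    "dc:title": "Report",
    "Content-Type": "application/pdf",
}
default_result = resolve_filter()(record)
custom_result = resolve_filter(include_fields("X-TIKA:content", "dc:title"))(record)
print(default_result)
print(custom_result)
```

With no filter supplied, only the content field survives here because the sample record has no container exception; with a custom filter, the title is kept as well.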