unpack-config: Extracting Embedded Document Bytes
When processing container files (ZIP, DOCX, PDF with attachments, etc.), you may want to
extract the raw bytes of embedded documents in addition to parsing them. The
unpack-config component (Java: UnpackConfig) controls how embedded bytes are
extracted and emitted.
Quick Start
To turn on byte extraction for every document the pipeline processes, set
parseMode to UNPACK in the pipes section of your tika-config.json.
That is the minimum configuration; the extraction defaults are fine for most cases.
{
  "pipes": {
    "parseMode": "UNPACK"
  }
}
To tune extraction (size limits, naming, ZIP output, etc.), add an unpack-config
block under the top-level parse-context section. All the options listed below
live inside that block:
{
  "pipes": {
    "parseMode": "UNPACK"
  },
  "parse-context": {
    "unpack-config": {
      "maxUnpackBytes": 104857600,
      "zipEmbeddedFiles": true
    }
  }
}
UNPACK mode extracts both metadata (as RMETA mode does) and embedded document bytes.
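For pipelines that generate their configuration, the nesting shown in the JSON above can be sketched in Python. The helper name `build_unpack_config` is a hypothetical illustration; only the JSON keys come from this page.

```python
import json


def build_unpack_config(**unpack_options):
    """Return a tika-config dict with UNPACK mode enabled.

    Hypothetical helper: keyword arguments are placed under
    parse-context.unpack-config, mirroring the JSON shown above.
    """
    config = {"pipes": {"parseMode": "UNPACK"}}
    if unpack_options:
        # Tuning options live under the top-level parse-context section,
        # not inside the pipes section.
        config["parse-context"] = {"unpack-config": dict(unpack_options)}
    return config


# 100 MiB limit, matching the 104857600 value used in the JSON example.
config = build_unpack_config(maxUnpackBytes=100 * 1024 * 1024, zipEmbeddedFiles=True)
config_json = json.dumps(config, indent=2)
print(config_json)
```

Calling the helper with no arguments yields the minimal configuration from the first Quick Start example.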
Configuration Options
All options below are fields of the unpack-config block — nest them inside
parse-context.unpack-config as shown in the Quick Start.
| Property | Type | Default | Description |
|---|---|---|---|
| emitter | String | (from FetchEmitTuple) | Emitter name for embedded bytes. Falls back to the FetchEmitTuple’s emitterId. |
| maxUnpackBytes | long | 10GB | Maximum total bytes to extract per file. Set to -1 to disable the limit. |
| includeOriginal | boolean | | Include the container document itself in the output. |
| zipEmbeddedFiles | boolean | | Collect all embedded files into a single ZIP archive. |
| includeMetadataInZip | boolean | | Include metadata in the ZIP archive. |
| zeroPadName | int | | Zero-pad embedded IDs in output names. |
| suffixStrategy | NONE, EXISTING, DETECTED | | How to determine file extensions for extracted files. |
| embeddedIdPrefix | String | | Prefix between base name and embedded ID. |
| keyBaseStrategy | DEFAULT, CUSTOM | | Strategy for generating emit keys. |
| emitKeyBase | String | | Custom base path when the key base strategy is CUSTOM. |
Examples
Basic Byte Extraction
Extract embedded bytes with default naming:
{
  "pipes": {
    "parseMode": "UNPACK"
  }
}
ZIP Output with Metadata
Collect all embedded files into a ZIP with metadata:
{
  "pipes": {
    "parseMode": "UNPACK"
  },
  "parse-context": {
    "unpack-config": {
      "zipEmbeddedFiles": true,
      "includeMetadataInZip": true,
      "includeOriginal": true
    }
  }
}
Suffix Strategies
- NONE: No file extension is added to extracted files.
- EXISTING: Use the file extension from the embedded document’s resource name.
- DETECTED: Use the file extension based on the detected MIME type.
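The difference between the three strategies can be sketched in a few lines of Python. This is an illustration only, not Tika's implementation: the function name `choose_suffix` and the small MIME-to-extension table are assumptions made for the example.

```python
import os

# Hypothetical MIME-to-extension table for this sketch; Tika relies on its
# own MIME registry for the DETECTED strategy.
EXTENSION_BY_MIME = {
    "application/pdf": ".pdf",
    "image/png": ".png",
    "text/plain": ".txt",
}


def choose_suffix(strategy, resource_name=None, detected_mime=None):
    """Illustrate the three suffix strategies described above."""
    if strategy == "NONE":
        return ""
    if strategy == "EXISTING":
        # Reuse whatever extension the embedded resource name already has.
        return os.path.splitext(resource_name or "")[1]
    if strategy == "DETECTED":
        # Map the detected MIME type to an extension; empty if unknown.
        return EXTENSION_BY_MIME.get(detected_mime, "")
    raise ValueError(f"unknown suffix strategy: {strategy}")


# A PNG image that was embedded under a misleading name.
print(choose_suffix("EXISTING", "scan.jpeg", "image/png"))
print(choose_suffix("DETECTED", "scan.jpeg", "image/png"))
```

The two strategies disagree exactly when an embedded file's name does not match its content, which is the case where DETECTED is most useful.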
Key Base Strategies
- DEFAULT: Output key is {containerKey}-{embeddedIdPrefix}{id}{suffix}
- CUSTOM: Output key uses emitKeyBase as the prefix.
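As a rough illustration of how these pieces combine, the following Python sketch assembles a key from the DEFAULT pattern above. The function and parameter names are hypothetical, and the CUSTOM branch assumes that emitKeyBase simply takes the place of the container key, which this page does not spell out.

```python
def build_emit_key(container_key, embedded_id, *, embedded_id_prefix="",
                   zero_pad=0, suffix="", strategy="DEFAULT", emit_key_base=""):
    """Sketch of the key patterns above; not Tika's implementation."""
    # Zero-pad the numeric embedded ID to the requested width (0 = no padding).
    padded_id = str(embedded_id).zfill(zero_pad)
    if strategy == "DEFAULT":
        base = container_key
    elif strategy == "CUSTOM":
        # Assumption: the custom base replaces the container key.
        base = emit_key_base
    else:
        raise ValueError(f"unknown key base strategy: {strategy}")
    return f"{base}-{embedded_id_prefix}{padded_id}{suffix}"


print(build_emit_key("report.docx", 3, zero_pad=8, suffix=".png"))
```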
Safety Limits
The maxUnpackBytes setting protects against zip bombs and other malicious files that
expand to enormous sizes. The default 10GB limit should be appropriate for most use cases.
When the limit is reached:
- Extraction stops for the current file
- An exception is logged
- Parsing continues (already-extracted bytes are kept)
- The parse result status is PARSE_SUCCESS_WITH_EXCEPTION
Set maxUnpackBytes=-1 to disable the limit. This is not recommended for untrusted input.
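The limit behaves like a running byte budget per container file. The Python sketch below models that behaviour under stated assumptions: the function name is hypothetical, a file that would cross the limit is assumed to be skipped whole, and the `PARSE_SUCCESS` label for the case where the limit is never hit is also an assumption.

```python
def unpack_with_budget(embedded_files, max_unpack_bytes):
    """Collect (name, payload) pairs until the byte budget is exhausted.

    Returns (extracted, status). Illustrative model only.
    """
    extracted = []
    total = 0
    for name, payload in embedded_files:
        # A limit of -1 disables the check entirely.
        if max_unpack_bytes != -1 and total + len(payload) > max_unpack_bytes:
            # Assumption: extraction stops here; earlier files are kept.
            return extracted, "PARSE_SUCCESS_WITH_EXCEPTION"
        extracted.append((name, payload))
        total += len(payload)
    return extracted, "PARSE_SUCCESS"


files = [("a", b"x" * 40), ("b", b"y" * 40), ("c", b"z" * 40)]
kept, status = unpack_with_budget(files, 100)
print([name for name, _ in kept], status)
```

With a 100-byte budget and three 40-byte payloads, the third payload would push the total past the limit, so only the first two are kept.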
Frictionless Data Package Output
The UNPACK mode can output files in Frictionless Data Package format,
a standard for packaging data files with their metadata. This format includes a datapackage.json
manifest with file checksums and MIME types, making it easy to verify and process extracted files.
Enabling Frictionless Output
Set outputFormat to FRICTIONLESS in your unpack-config:
{
  "pipes": {
    "parseMode": "UNPACK"
  },
  "parse-context": {
    "unpack-config": {
      "outputFormat": "FRICTIONLESS",
      "includeFullMetadata": true
    }
  }
}
Output Structure
When using Frictionless output format, the ZIP archive contains:
output.zip
├── datapackage.json    # Manifest with file list, SHA256 hashes, mimetypes
├── metadata.json       # Full RMETA metadata (if includeFullMetadata=true)
└── unpacked/
    ├── 00000001.pdf
    ├── 00000002.png
    └── ...
The datapackage.json file contains:
- List of all extracted files as "resources"
- SHA256 hash for each file
- MIME type for each file
- File size in bytes
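Because the manifest carries a hash per file, a consumer can check the archive before processing it. The Python sketch below shows one way to do that. The resource field names `path` and `hash`, and the optional `sha256:` prefix, follow the Frictionless Data Package specification and are assumptions here; they have not been checked against the manifest Tika actually writes.

```python
import hashlib
import json
import zipfile


def verify_data_package(zip_file):
    """Return resource paths whose SHA256 does not match the manifest.

    zip_file may be a path or a binary file-like object.
    """
    mismatches = []
    with zipfile.ZipFile(zip_file) as archive:
        manifest = json.loads(archive.read("datapackage.json"))
        for resource in manifest.get("resources", []):
            # Accept both "sha256:<hex>" and bare "<hex>" hash values.
            expected = resource["hash"].split(":", 1)[-1].lower()
            actual = hashlib.sha256(archive.read(resource["path"])).hexdigest()
            if actual != expected:
                mismatches.append(resource["path"])
    return mismatches
```

An empty return value means every listed file matched its recorded hash, for example `verify_data_package("output.zip") == []`.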
Frictionless Configuration Options
| Property | Type | Default | Description |
|---|---|---|---|
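| outputFormat | STANDARD, FRICTIONLESS | | Output format for the ZIP archive. Use FRICTIONLESS for the Frictionless Data Package format. |
| includeFullMetadata | boolean | | Include a metadata.json file with the full RMETA metadata. |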
CLI Usage
Extract files in Frictionless format using the CLI. The -Z flag turns on recursive
unpack (the Pipes-mode counterpart of standard-mode -z), and -i/-o are the
Pipes input/output directories:
java -jar tika-app.jar -Z --unpack-format=FRICTIONLESS -i /path/to/input -o /path/to/output
-i expects a directory of containers to unpack, not a single file. For
one-off unpacking of a single document, see the standard-mode -z/--extract
flag. Note that as of 4.x that path also routes through the Pipes machinery and
expects an input directory.
Code Examples
For working code examples, see:
- tika-pipes/tika-pipes-integration-tests/src/test/java/org/apache/tika/pipes/core/UnpackModeTest.java
- tika-server/tika-server-standard/src/test/java/org/apache/tika/server/standard/TikaPipesTest.java
These test files demonstrate all configuration options with assertions.