UnpackConfig: Extracting Embedded Document Bytes
When processing container files (ZIP, DOCX, PDF with attachments, etc.), you may want to
extract the raw bytes of embedded documents in addition to parsing them. UnpackConfig
controls how embedded bytes are extracted and emitted.
Quick Start
Use ParseMode.UNPACK to automatically extract embedded document bytes:
{
  "id": "doc1",
  "fetchKey": {"fetcherId": "fsf", "fetchKey": "container.docx"},
  "emitKey": {"emitterId": "fse", "emitKey": "container.docx"},
  "parseContext": {
    "parseMode": "UNPACK"
  }
}
This extracts both metadata (like RMETA mode) and embedded document bytes.
Configuration Options
| Property | Type | Default | Description |
|---|---|---|---|
| | String | (from FetchEmitTuple) | Emitter name for embedded bytes. Falls back to the FetchEmitTuple’s emitterId. |
| maxUnpackBytes | long | 10GB | Maximum total bytes to extract per file. Set to -1 to disable the limit. |
| includeOriginal | boolean | | Include the container document itself in the output. |
| zipEmbeddedFiles | boolean | | Collect all embedded files into a single ZIP archive. |
| includeMetadataInZip | boolean | | Include metadata in the ZIP archive. |
| | int | | Zero-pad embedded IDs in output names. |
| | NONE, EXISTING, DETECTED | | How to determine file extensions for extracted files. |
| embeddedIdPrefix | String | | Prefix between base name and embedded ID. |
| | DEFAULT, CUSTOM | | Strategy for generating emit keys. |
| emitKeyBase | String | | Custom base path when the key base strategy is CUSTOM. |
Examples
Basic Byte Extraction
Extract embedded bytes with default naming:
{
  "parseContext": {
    "parseMode": "UNPACK"
  }
}
ZIP Output with Metadata
Collect all embedded files into a ZIP with metadata:
{
  "parseContext": {
    "parseMode": "UNPACK",
    "unpack-config": {
      "zipEmbeddedFiles": true,
      "includeMetadataInZip": true,
      "includeOriginal": true
    }
  }
}
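A request like the one above can also be assembled programmatically. The following is a minimal sketch in Python, not part of Tika itself: the helper name build_unpack_request is hypothetical, the field names mirror the JSON examples in this document, and the default fetcher and emitter IDs are copied from the Quick Start. It only builds the JSON and does not submit it anywhere.

```python
import json


def build_unpack_request(doc_id, key, unpack_config=None,
                         fetcher_id="fsf", emitter_id="fse"):
    """Build a FetchEmitTuple-style dict that uses ParseMode.UNPACK.

    Hypothetical helper: field names follow the JSON examples in this
    document; the default fetcher/emitter IDs come from the Quick Start.
    """
    parse_context = {"parseMode": "UNPACK"}
    if unpack_config:
        # Only add "unpack-config" when options were actually supplied.
        parse_context["unpack-config"] = dict(unpack_config)
    return {
        "id": doc_id,
        "fetchKey": {"fetcherId": fetcher_id, "fetchKey": key},
        "emitKey": {"emitterId": emitter_id, "emitKey": key},
        "parseContext": parse_context,
    }


request = build_unpack_request(
    "doc1",
    "container.docx",
    {"zipEmbeddedFiles": True, "includeMetadataInZip": True, "includeOriginal": True},
)
print(json.dumps(request, indent=2))
```

Omitting the third argument yields the basic request with only parseMode set, which matches the Basic Byte Extraction example.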
Suffix Strategies
- NONE: No file extension added to extracted files.
- EXISTING: Use the file extension from the embedded document’s resource name.
- DETECTED: Use the file extension based on the detected MIME type.
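To make the three strategies concrete, here is a rough Python sketch of how a suffix could be chosen under each one. It is an illustration only: choose_suffix is a hypothetical function, and Python's standard mimetypes table stands in for Tika's own MIME registry, so the exact extensions Tika picks may differ.

```python
import mimetypes
import os


def choose_suffix(strategy, resource_name=None, mime_type=None):
    """Return the file extension (with leading dot) for an extracted file."""
    if strategy == "NONE":
        return ""
    if strategy == "EXISTING":
        # Take whatever extension the embedded resource name already has.
        return os.path.splitext(resource_name or "")[1]
    if strategy == "DETECTED":
        # Map the detected MIME type to an extension; empty if unknown.
        # ASSUMPTION: mimetypes is a stand-in for Tika's MIME registry.
        return mimetypes.guess_extension(mime_type or "") or ""
    raise ValueError(f"unknown suffix strategy: {strategy}")


print(choose_suffix("EXISTING", resource_name="chart.png"))
print(choose_suffix("DETECTED", mime_type="application/pdf"))
```

EXISTING trusts the name stored in the container, while DETECTED relies on content detection, which is usually the safer choice when embedded resource names are missing or misleading.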
Key Base Strategies
- DEFAULT: Output key is {containerKey}-{embeddedIdPrefix}{id}{suffix}
- CUSTOM: Output key uses emitKeyBase as the prefix.
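The DEFAULT template can be sketched as a small string-building function. This is an illustration under stated assumptions, not Tika's implementation: build_emit_key is a hypothetical name, the "emb-" prefix in the usage line is made up, and substituting emitKeyBase for the container key is only a guess at how CUSTOM composes the rest of the key.

```python
def build_emit_key(container_key, embedded_id, *, id_prefix="", suffix="",
                   zero_pad=0, emit_key_base=None):
    """Compose an output key as {base}-{idPrefix}{id}{suffix}.

    With emit_key_base=None this follows the DEFAULT template from the text.
    Using emit_key_base in place of the container key is an ASSUMPTION about
    how the CUSTOM strategy behaves.
    """
    base = container_key if emit_key_base is None else emit_key_base
    # zfill(0) leaves the ID unchanged, so zero_pad=0 means "no padding".
    padded_id = str(embedded_id).zfill(zero_pad)
    return f"{base}-{id_prefix}{padded_id}{suffix}"


print(build_emit_key("container.docx", 2, id_prefix="emb-",
                     suffix=".png", zero_pad=8))
```

Zero-padding keeps keys in numeric order when they are sorted lexicographically, which helps when browsing emitted files in a directory or bucket listing.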
Safety Limits
The maxUnpackBytes setting protects against zip bombs and other malicious files that
expand to enormous sizes. The default 10GB limit should be appropriate for most use cases.
When the limit is reached:
- Extraction stops for the current file
- An exception is logged
- Parsing continues (already-extracted bytes are kept)
- The parse result status is PARSE_SUCCESS_WITH_EXCEPTION
Set maxUnpackBytes=-1 to disable the limit. This is not recommended for untrusted input.
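The behavior described in this section amounts to a running byte budget. The sketch below illustrates that idea in Python; it is not Tika's code. The function name extract_with_budget is hypothetical, and stopping at whole-file boundaries is an assumption, since the text does not say whether a file that crosses the limit is truncated or skipped.

```python
def extract_with_budget(embedded_files, max_unpack_bytes):
    """Collect (name, data) pairs until the byte budget would be exceeded.

    Returns (kept, limit_hit). A budget of -1 disables the check.
    ASSUMPTION: the cut-off happens at whole-file boundaries.
    """
    kept, total = [], 0
    for name, data in embedded_files:
        if max_unpack_bytes != -1 and total + len(data) > max_unpack_bytes:
            # Stop extracting, but keep what was already collected.
            return kept, True
        kept.append((name, data))
        total += len(data)
    return kept, False


files = [("a.bin", b"x" * 4), ("b.bin", b"y" * 4), ("c.bin", b"z" * 4)]
kept, limit_hit = extract_with_budget(files, 10)
print([name for name, _ in kept], limit_hit)
```

A caller would treat limit_hit as the signal to log the exception and report a partial result rather than discard the bytes already extracted.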
Frictionless Data Package Output
The UNPACK mode can output files in Frictionless Data Package format,
a standard for packaging data files with their metadata. This format includes a datapackage.json
manifest with file checksums and MIME types, making it easy to verify and process extracted files.
Enabling Frictionless Output
Set outputFormat to FRICTIONLESS in your UnpackConfig:
{
  "parseContext": {
    "parseMode": "UNPACK",
    "unpack-config": {
      "outputFormat": "FRICTIONLESS",
      "includeFullMetadata": true
    }
  }
}
Output Structure
When using Frictionless output format, the ZIP archive contains:
output.zip
├── datapackage.json   # Manifest with file list, SHA256 hashes, mimetypes
├── metadata.json      # Full RMETA metadata (if includeFullMetadata=true)
└── unpacked/
    ├── 00000001.pdf
    ├── 00000002.png
    └── ...
The datapackage.json file contains:
- List of all extracted files as "resources"
- SHA256 hash for each file
- MIME type for each file
- File size in bytes
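Because the manifest carries a hash and size for every file, a consumer can check the archive before using it. The Python sketch below shows one way to do that. The resource field names path, hash, and bytes are assumptions taken from the Frictionless Data Package specification and have not been checked against Tika's actual output; the function name verify_data_package is hypothetical.

```python
import hashlib
import io
import json
import zipfile


def verify_data_package(zip_bytes):
    """Return the paths of resources whose SHA-256 or size does not match.

    ASSUMPTION: each resource has "path" and "hash" (optionally "bytes"),
    as in the Frictionless Data Package specification.
    """
    mismatches = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        manifest = json.loads(zf.read("datapackage.json"))
        for resource in manifest.get("resources", []):
            data = zf.read(resource["path"])
            # Tolerate an optional "sha256:" algorithm prefix on the hash.
            expected = resource["hash"].split(":", 1)[-1].lower()
            actual = hashlib.sha256(data).hexdigest()
            size_ok = resource.get("bytes", len(data)) == len(data)
            if actual != expected or not size_ok:
                mismatches.append(resource["path"])
    return mismatches
```

Calling verify_data_package on the bytes of the emitted ZIP returns an empty list when every file matches its manifest entry, and the offending paths otherwise.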
Frictionless Configuration Options
| Property | Type | Default | Description |
|---|---|---|---|
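| outputFormat | STANDARD, FRICTIONLESS | | Output format for the ZIP archive. Use FRICTIONLESS for Frictionless Data Package output. |
| includeFullMetadata | boolean | | Include a metadata.json file with the full RMETA metadata. |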
Code Examples
For working code examples, see:
- tika-pipes/tika-pipes-integration-tests/src/test/java/org/apache/tika/pipes/core/UnpackModeTest.java
- tika-server/tika-server-standard/src/test/java/org/apache/tika/server/standard/TikaPipesTest.java
These test files demonstrate all configuration options with assertions.