Class UnpackerResource
java.lang.Object
org.apache.tika.server.core.resource.UnpackerResource
JAX-RS resource for unpacking embedded documents from container files.
This endpoint uses process-isolated parsing via tika-pipes with ParseMode.UNPACK. Embedded documents are extracted and returned as a zip archive.
Endpoints:
- PUT /unpack - Extract embedded documents (raw body)
- POST /unpack - Extract with config (multipart: file + optional JSON config)
- PUT /unpack/all - Extract embedded + container text/metadata
- POST /unpack/all - Extract all with config (multipart)
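As a minimal client-side sketch of the two PUT endpoints listed above, the following builds (but does not send) a raw-body request with the JDK's java.net.http API. The class name UnpackPutSketch, the helper buildPut, and the base URL http://localhost:9998 are assumptions made for illustration; they are not part of UnpackerResource.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Tika.
class UnpackPutSketch {

    // Builds a PUT request whose raw body is the document to unpack.
    // path is expected to be "/unpack" or "/unpack/all".
    static HttpRequest buildPut(String baseUrl, String path, byte[] document) {
        return HttpRequest.newBuilder(URI.create(baseUrl + path))
                .header("Accept", "application/zip") // the endpoints produce application/zip
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder bytes; a real call would read a container file from disk.
        byte[] doc = "placeholder bytes".getBytes(StandardCharsets.UTF_8);
        // Assumed host and port; adjust to wherever the server is listening.
        HttpRequest req = buildPut("http://localhost:9998", "/unpack/all", doc);
        System.out.println(req.method() + " " + req.uri());
    }
}
```

To actually send the request, pass it to HttpClient.newHttpClient().send(...) with a body handler such as HttpResponse.BodyHandlers.ofFile(...), which writes the returned zip to disk.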
Configuration Requirements:
Your tika-config.json must include:
{
  "fetchers": {
    "file-system-fetcher": {
      "class": "org.apache.tika.pipes.fetcher.fs.FileSystemFetcher",
      "allowAbsolutePaths": true
    }
  },
  "emitters": {
    "unpack-emitter": {
      "class": "org.apache.tika.pipes.emitter.fs.FileSystemEmitter",
      "basePath": "/tmp/tika-unpack",
      "onExists": "replace"
    }
  }
}
Multipart Configuration (POST endpoints):
Submit as multipart/form-data with:
- "file" part: the document to unpack
- "config" part (optional): JSON configuration
Example config JSON:
{
  "parse-context": {
    "unpack-config": {
      "suffixStrategy": "DETECTED",
      "includeOriginal": true
    },
    "standard-unpack-selector": {
      "includeMimeTypes": ["image/jpeg", "image/png"],
      "excludeMimeTypes": ["application/pdf"]
    },
    "embedded-limits": {
      "maxDepth": 5,
      "maxCount": 100
    }
  }
}
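The multipart layout described above ("file" part plus optional "config" part) can be sketched by assembling the multipart/form-data body by hand. The class UnpackMultipartSketch, the method buildBody, the boundary string, and the per-part Content-Type values are assumptions chosen for illustration; only the part names "file" and "config" come from this page.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Tika.
class UnpackMultipartSketch {

    // Assembles a multipart/form-data body with a "file" part and,
    // when configJson is non-null, a "config" part.
    static byte[] buildBody(String boundary, String fileName, byte[] file, String configJson) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeText(out, "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"file\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/octet-stream\r\n\r\n");
        out.write(file, 0, file.length);
        writeText(out, "\r\n");
        if (configJson != null) {
            writeText(out, "--" + boundary + "\r\n"
                    + "Content-Disposition: form-data; name=\"config\"\r\n"
                    + "Content-Type: application/json\r\n\r\n"
                    + configJson + "\r\n");
        }
        // Closing delimiter ends the multipart body.
        writeText(out, "--" + boundary + "--\r\n");
        return out.toByteArray();
    }

    private static void writeText(ByteArrayOutputStream out, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
    }

    public static void main(String[] args) {
        // Config reuses keys from the example JSON above.
        String config = "{\"parse-context\":{\"embedded-limits\":{\"maxDepth\":5,\"maxCount\":100}}}";
        byte[] body = buildBody("sketch-boundary", "container.docx",
                "placeholder bytes".getBytes(StandardCharsets.UTF_8), config);
        System.out.println(new String(body, StandardCharsets.UTF_8));
    }
}
```

When sending this body in a POST, the request's Content-Type header must carry the same boundary, for example multipart/form-data; boundary=sketch-boundary.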
Frictionless Data Package Format:
To receive output in Frictionless Data Package format (with a datapackage.json manifest, SHA256 hashes, and the extracted files in an unpacked/ subdirectory), use:
{
  "parse-context": {
    "unpack-config": {
      "outputFormat": "FRICTIONLESS",
      "outputMode": "ZIPPED",
      "includeFullMetadata": true
    }
  }
}
The Frictionless zip structure:
output.zip
├── datapackage.json   # Manifest with file list, SHA256 hashes, MIME types
├── metadata.json      # Full RMETA metadata (if includeFullMetadata=true)
└── unpacked/
    ├── 00000001.pdf
    ├── 00000002.png
    └── ...
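A client that receives this archive could walk it with java.util.zip, as in the sketch below. The class FrictionlessZipSketch and its methods are hypothetical, and sampleZip() only fabricates a tiny stand-in archive with the layout shown above so the example runs without a server; the entry contents are placeholders, not real Tika output.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, not part of Tika.
class FrictionlessZipSketch {

    // Returns the entry names of a zip stream in archive order.
    static List<String> listEntries(InputStream zipStream) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(zipStream)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    // Keeps only file entries under unpacked/ (skips the manifest and directories).
    static List<String> unpackedFiles(List<String> names) {
        List<String> files = new ArrayList<>();
        for (String name : names) {
            if (name.startsWith("unpacked/") && !name.endsWith("/")) {
                files.add(name);
            }
        }
        return files;
    }

    // Demo-only stand-in for a server response, with placeholder contents.
    static byte[] sampleZip() {
        String[] names = {"datapackage.json", "metadata.json",
                "unpacked/00000001.pdf", "unpacked/00000002.png"};
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(buffer)) {
            for (String name : names) {
                zout.putNextEntry(new ZipEntry(name));
                zout.write(new byte[] {0});
                zout.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        List<String> names = listEntries(new ByteArrayInputStream(sampleZip()));
        System.out.println("all entries: " + names);
        System.out.println("unpacked files: " + unpackedFiles(names));
    }
}
```

In real use, the InputStream passed to listEntries would be the response body of one of the /unpack endpoints rather than the fabricated sample.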
Breaking Changes from Pre-4.0:
- Parsing now runs in a separate process for memory safety
- Configuration via HTTP headers is no longer supported; use multipart JSON config
- Custom EmbeddedDocumentExtractor in ParseContext is ignored; use UnpackSelector
- The unpackMaxBytes header is removed; use embedded-limits in config
Constructor Summary
- UnpackerResource()
Method Summary
All methods return jakarta.ws.rs.core.Response.
- unpack(InputStream is, jakarta.ws.rs.core.HttpHeaders httpHeaders, jakarta.ws.rs.core.UriInfo info): Extracts embedded documents from a container file (simple PUT, no config).
- unpackAll(InputStream is, jakarta.ws.rs.core.HttpHeaders httpHeaders, jakarta.ws.rs.core.UriInfo info): Extracts embedded documents plus original document and metadata (simple PUT).
- unpackAllWithConfig(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, jakarta.ws.rs.core.HttpHeaders httpHeaders, jakarta.ws.rs.core.UriInfo info): Extracts embedded documents plus original/metadata with config (multipart POST).
- unpackWithConfig(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, jakarta.ws.rs.core.HttpHeaders httpHeaders, jakarta.ws.rs.core.UriInfo info): Extracts embedded documents with configuration (multipart POST).
Constructor Details
UnpackerResource
public UnpackerResource()
Method Details
unpack
@Path("/{id:(/.*)?}")
@PUT
@Produces("application/zip")
public jakarta.ws.rs.core.Response unpack(InputStream is, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders, @Context jakarta.ws.rs.core.UriInfo info) throws Exception
Extracts embedded documents from a container file (simple PUT, no config). Returns a zip archive containing the extracted files.
Parameters:
- is - input stream containing the document
- httpHeaders - HTTP headers
- info - URI info
Returns:
- streaming zip response
Throws:
- Exception
unpackWithConfig
@Path("/{id:(/.*)?}")
@POST
@Consumes("multipart/form-data")
@Produces("application/zip")
public jakarta.ws.rs.core.Response unpackWithConfig(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders, @Context jakarta.ws.rs.core.UriInfo info) throws Exception
Extracts embedded documents with configuration (multipart POST). Accepts multipart/form-data with "file" and optional "config" parts.
Parameters:
- attachments - multipart attachments
- httpHeaders - HTTP headers
- info - URI info
Returns:
- streaming zip response
Throws:
- Exception
unpackAll
@Path("/all{id:(/.*)?}")
@PUT
@Produces("application/zip")
public jakarta.ws.rs.core.Response unpackAll(InputStream is, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders, @Context jakarta.ws.rs.core.UriInfo info) throws Exception
Extracts embedded documents plus original document and metadata (simple PUT). Returns a zip archive containing extracted files, original document, and metadata.
Parameters:
- is - input stream containing the document
- httpHeaders - HTTP headers
- info - URI info
Returns:
- streaming zip response
Throws:
- Exception
unpackAllWithConfig
@Path("/all{id:(/.*)?}")
@POST
@Consumes("multipart/form-data")
@Produces("application/zip")
public jakarta.ws.rs.core.Response unpackAllWithConfig(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, @Context jakarta.ws.rs.core.HttpHeaders httpHeaders, @Context jakarta.ws.rs.core.UriInfo info) throws Exception
Extracts embedded documents plus original/metadata with config (multipart POST). Accepts multipart/form-data with "file" and optional "config" parts.
Parameters:
- attachments - multipart attachments
- httpHeaders - HTTP headers
- info - URI info
Returns:
- streaming zip response
Throws:
- Exception