Class UnpackerResource

java.lang.Object
org.apache.tika.server.core.resource.UnpackerResource

@Path("/unpack") public class UnpackerResource extends Object
JAX-RS resource for unpacking embedded documents from container files.

This resource uses process-isolated parsing via tika-pipes with ParseMode.UNPACK: each request is parsed in a separate process, and the embedded documents it extracts are returned as a zip archive.

Endpoints:

  • PUT /unpack - Extract embedded documents (raw body)
  • POST /unpack - Extract with config (multipart: file + optional JSON config)
  • PUT /unpack/all - Extract embedded documents plus the container's text and metadata
  • POST /unpack/all - Extract all with config (multipart)
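
As a rough illustration of the two PUT endpoints, the sketch below builds a request with the JDK's java.net.http API. The class name UnpackPutExample, the helper buildUnpackPut, and the base URL http://localhost:9998 are assumptions made for this example; substitute the address your server actually listens on.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

final class UnpackPutExample {

    // Builds a PUT request whose raw body is the container document.
    // baseUrl is an assumption: use the address of your own tika-server.
    // all=false targets /unpack, all=true targets /unpack/all.
    static HttpRequest buildUnpackPut(String baseUrl, byte[] document, boolean all) {
        String path = all ? "/unpack/all" : "/unpack";
        return HttpRequest.newBuilder(URI.create(baseUrl + path))
                .header("Accept", "application/zip")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder bytes stand in for a real container file.
        byte[] document = "placeholder bytes".getBytes(StandardCharsets.UTF_8);
        HttpRequest request = buildUnpackPut("http://localhost:9998", document, false);
        System.out.println(request.method() + " " + request.uri());
        // To send it against a running server:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofByteArray());
    }
}
```

The request is only built here, not sent, so the example runs without a server. The response body of a real call would be the zip archive described above.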

Configuration Requirements:

Your tika-config.json must include:

 {
   "fetchers": {
     "file-system-fetcher": {
       "class": "org.apache.tika.pipes.fetcher.fs.FileSystemFetcher",
       "allowAbsolutePaths": true
     }
   },
   "emitters": {
     "unpack-emitter": {
       "class": "org.apache.tika.pipes.emitter.fs.FileSystemEmitter",
       "basePath": "/tmp/tika-unpack",
       "onExists": "replace"
     }
   }
 }
 

Multipart Configuration (POST endpoints):

Submit as multipart/form-data with:

  • "file" part: the document to unpack
  • "config" part (optional): JSON configuration

Example config JSON:

 {
   "parse-context": {
     "unpack-config": {
       "suffixStrategy": "DETECTED",
       "includeOriginal": true
     },
     "standard-unpack-selector": {
       "includeMimeTypes": ["image/jpeg", "image/png"],
       "excludeMimeTypes": ["application/pdf"]
     },
     "embedded-limits": {
       "maxDepth": 5,
       "maxCount": 100
     }
   }
 }
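
One way to assemble such a multipart/form-data body by hand is sketched below, using only the JDK. The class UnpackMultipartExample and its helpers are hypothetical names for this example, and the per-part Content-Type values and the filename attribute are assumptions; the source only specifies the part names "file" and "config".

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class UnpackMultipartExample {

    // Appends one multipart/form-data part. fileName may be null for non-file parts.
    private static void writePart(ByteArrayOutputStream out, String boundary, String name,
                                  String fileName, String contentType, byte[] content) {
        StringBuilder head = new StringBuilder();
        head.append("--").append(boundary).append("\r\n");
        head.append("Content-Disposition: form-data; name=\"").append(name).append('"');
        if (fileName != null) {
            head.append("; filename=\"").append(fileName).append('"');
        }
        head.append("\r\n");
        head.append("Content-Type: ").append(contentType).append("\r\n\r\n");
        out.writeBytes(head.toString().getBytes(StandardCharsets.UTF_8));
        out.writeBytes(content);
        out.writeBytes("\r\n".getBytes(StandardCharsets.UTF_8));
    }

    // Builds the request body: a required "file" part and an optional "config" part.
    // Pass configJson as null to omit the config part.
    static byte[] buildBody(String boundary, String fileName, byte[] file, String configJson) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writePart(out, boundary, "file", fileName, "application/octet-stream", file);
        if (configJson != null) {
            writePart(out, boundary, "config", null, "application/json",
                    configJson.getBytes(StandardCharsets.UTF_8));
        }
        // The closing delimiter ends the multipart body.
        out.writeBytes(("--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String config = "{\"parse-context\":{\"embedded-limits\":{\"maxDepth\":5,\"maxCount\":100}}}";
        byte[] body = buildBody("example-boundary", "sample.docx", new byte[] {1, 2, 3}, config);
        System.out.println(body.length + " bytes; send with the header");
        System.out.println("Content-Type: multipart/form-data; boundary=example-boundary");
    }
}
```

The boundary string must not occur inside the file or config content; a random value per request is the usual choice.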
 

Frictionless Data Package Format:

To receive output in Frictionless Data Package format (a datapackage.json manifest, SHA-256 hashes, and the extracted files in an unpacked/ subdirectory), use:

 {
   "parse-context": {
     "unpack-config": {
       "outputFormat": "FRICTIONLESS",
       "outputMode": "ZIPPED",
       "includeFullMetadata": true
     }
   }
 }
 

The Frictionless zip structure:

 output.zip
 ├── datapackage.json      # Manifest with file list, SHA256 hashes, mimetypes
 ├── metadata.json         # Full RMETA metadata (if includeFullMetadata=true)
 └── unpacked/
     ├── 00000001.pdf
     ├── 00000002.png
     └── ...
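
A client might walk an archive with this layout as sketched below, using java.util.zip. The class FrictionlessZipExample and its methods are hypothetical, and sampleZip() only fabricates a tiny stand-in archive with the same entry names so the example runs without a server; it is not real server output.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class FrictionlessZipExample {

    // Returns every entry name in the archive, in archive order.
    static List<String> entryNames(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    // True if the archive carries the datapackage.json manifest at its root.
    static boolean hasManifest(byte[] zipBytes) {
        return entryNames(zipBytes).contains("datapackage.json");
    }

    // Returns only the extracted files, i.e. non-directory entries under unpacked/.
    static List<String> listUnpacked(byte[] zipBytes) {
        List<String> unpacked = new ArrayList<>();
        for (String name : entryNames(zipBytes)) {
            if (name.startsWith("unpacked/") && !name.endsWith("/")) {
                unpacked.add(name);
            }
        }
        return unpacked;
    }

    // Builds a stand-in archive with the layout shown above (placeholder contents).
    static byte[] sampleZip() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
            String[] names = {"datapackage.json", "metadata.json",
                    "unpacked/00000001.pdf", "unpacked/00000002.png"};
            for (String name : names) {
                zout.putNextEntry(new ZipEntry(name));
                zout.write("placeholder".getBytes(StandardCharsets.UTF_8));
                zout.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] zip = sampleZip();
        System.out.println("manifest present: " + hasManifest(zip));
        System.out.println("extracted files: " + listUnpacked(zip));
    }
}
```

In real use, zipBytes would be the response body of an unpack request made with the FRICTIONLESS output format.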
 

Breaking Changes from Pre-4.0:

  • Parsing now runs in a separate process for memory safety
  • Configuration via HTTP headers is no longer supported; use multipart JSON config
  • Custom EmbeddedDocumentExtractor in ParseContext is ignored; use UnpackSelector
  • The unpackMaxBytes header is removed; use embedded-limits in config

  • Constructor Summary

    Constructors
    Constructor
    Description
    UnpackerResource()
  • Method Summary

    Modifier and Type
    Method
    Description
    jakarta.ws.rs.core.Response
    unpack(InputStream is, jakarta.ws.rs.core.HttpHeaders httpHeaders, jakarta.ws.rs.core.UriInfo info)
    Extracts embedded documents from a container file (simple PUT, no config).
    jakarta.ws.rs.core.Response
    unpackAll(InputStream is, jakarta.ws.rs.core.HttpHeaders httpHeaders, jakarta.ws.rs.core.UriInfo info)
    Extracts embedded documents plus the original document and metadata (simple PUT).
    jakarta.ws.rs.core.Response
    unpackAllWithConfig(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, jakarta.ws.rs.core.HttpHeaders httpHeaders, jakarta.ws.rs.core.UriInfo info)
    Extracts embedded documents plus original/metadata with config (multipart POST).
    jakarta.ws.rs.core.Response
    unpackWithConfig(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments, jakarta.ws.rs.core.HttpHeaders httpHeaders, jakarta.ws.rs.core.UriInfo info)
    Extracts embedded documents with configuration (multipart POST).

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • UnpackerResource

      public UnpackerResource()
  • Method Details

    • unpack

      @Path("/{id:(/.*)?}")
      @PUT
      @Produces("application/zip")
      public jakarta.ws.rs.core.Response unpack(InputStream is,
          @Context jakarta.ws.rs.core.HttpHeaders httpHeaders,
          @Context jakarta.ws.rs.core.UriInfo info) throws Exception
      Extracts embedded documents from a container file (simple PUT, no config). Returns a zip archive containing the extracted files.
      Parameters:
      is - input stream containing the document
      httpHeaders - HTTP headers
      info - URI info
      Returns:
      streaming zip response
      Throws:
      Exception
    • unpackWithConfig

      @Path("/{id:(/.*)?}")
      @POST
      @Consumes("multipart/form-data")
      @Produces("application/zip")
      public jakarta.ws.rs.core.Response unpackWithConfig(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments,
          @Context jakarta.ws.rs.core.HttpHeaders httpHeaders,
          @Context jakarta.ws.rs.core.UriInfo info) throws Exception
      Extracts embedded documents with configuration (multipart POST). Accepts multipart/form-data with "file" and optional "config" parts.
      Parameters:
      attachments - multipart attachments
      httpHeaders - HTTP headers
      info - URI info
      Returns:
      streaming zip response
      Throws:
      Exception
    • unpackAll

      @Path("/all{id:(/.*)?}")
      @PUT
      @Produces("application/zip")
      public jakarta.ws.rs.core.Response unpackAll(InputStream is,
          @Context jakarta.ws.rs.core.HttpHeaders httpHeaders,
          @Context jakarta.ws.rs.core.UriInfo info) throws Exception
      Extracts embedded documents plus the original document and metadata (simple PUT). Returns a zip archive containing the extracted files, the original document, and metadata.
      Parameters:
      is - input stream containing the document
      httpHeaders - HTTP headers
      info - URI info
      Returns:
      streaming zip response
      Throws:
      Exception
    • unpackAllWithConfig

      @Path("/all{id:(/.*)?}")
      @POST
      @Consumes("multipart/form-data")
      @Produces("application/zip")
      public jakarta.ws.rs.core.Response unpackAllWithConfig(List<org.apache.cxf.jaxrs.ext.multipart.Attachment> attachments,
          @Context jakarta.ws.rs.core.HttpHeaders httpHeaders,
          @Context jakarta.ws.rs.core.UriInfo info) throws Exception
      Extracts embedded documents plus original/metadata with config (multipart POST). Accepts multipart/form-data with "file" and optional "config" parts.
      Parameters:
      attachments - multipart attachments
      httpHeaders - HTTP headers
      info - URI info
      Returns:
      streaming zip response
      Throws:
      Exception