Class FrictionlessUnpackHandler

java.lang.Object
org.apache.tika.pipes.core.extractor.AbstractUnpackHandler
org.apache.tika.pipes.core.extractor.FrictionlessUnpackHandler
All Implemented Interfaces:
Closeable, AutoCloseable, UnpackHandler

public class FrictionlessUnpackHandler extends AbstractUnpackHandler implements Closeable
An UnpackHandler that collects embedded files for Frictionless Data Package output. Files are stored in a temporary directory, under an "unpacked/" subdirectory. The SHA-256 hash of each file is computed inside add(), using a DigestInputStream. After parsing completes, buildDataPackage() creates the manifest from the collected files. Output structure:
 temp-dir/
 └── unpacked/
     ├── 00000001.pdf
     ├── 00000002.png
     └── ...
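
The hash-while-storing step described above can be sketched with plain JDK classes. This is a minimal illustration, not the handler's actual code: the class HashingCopySketch, the method storeAndHash, and the zero-padded "%08d" naming rule are assumptions inferred from the example file names in the tree, and nothing from Tika is used.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper class; it is not part of Tika.
class HashingCopySketch {

    // Copies the stream to <unpackedDir>/<zero-padded id><extension> and returns
    // the SHA-256 of the copied bytes as lowercase hex. The naming scheme is an
    // assumption based on names such as 00000001.pdf.
    static String storeAndHash(int id, String extension, InputStream in, Path unpackedDir)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        Path target = unpackedDir.resolve(String.format(Locale.ROOT, "%08d%s", id, extension));

        // Every byte read through the DigestInputStream also updates the digest,
        // so the file is written and hashed in a single pass.
        try (DigestInputStream digestIn = new DigestInputStream(in, sha256);
             OutputStream out = Files.newOutputStream(target)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = digestIn.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }

        // Convert the digest bytes to a hex string.
        StringBuilder hex = new StringBuilder();
        for (byte b : sha256.digest()) {
            hex.append(String.format(Locale.ROOT, "%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        Path tempDir = Files.createTempDirectory("unpack-sketch");
        Path unpacked = Files.createDirectories(tempDir.resolve("unpacked"));
        byte[] data = "abc".getBytes(StandardCharsets.UTF_8);
        String hash = storeAndHash(1, ".txt", new ByteArrayInputStream(data), unpacked);
        System.out.println(unpacked.resolve("00000001.txt") + " " + hash);
    }
}
```

Hashing during the copy avoids reading each embedded file a second time once parsing has finished.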
 
  • Constructor Details

    • FrictionlessUnpackHandler

      public FrictionlessUnpackHandler(EmitKey containerEmitKey, UnpackConfig unpackConfig) throws IOException
      Creates a new FrictionlessUnpackHandler.
      Parameters:
      containerEmitKey - the emit key for the container document
      unpackConfig - the unpack configuration
      Throws:
      IOException - if temp directory creation fails
  • Method Details

    • add

      public void add(int id, Metadata metadata, InputStream inputStream) throws IOException
      Specified by:
      add in interface UnpackHandler
      Overrides:
      add in class AbstractUnpackHandler
      Throws:
      IOException
    • storeOriginalDocument

      public void storeOriginalDocument(InputStream inputStream, String fileName) throws IOException
      Stores the original container document for optional inclusion.
      Parameters:
      inputStream - the original document input stream
      fileName - the file name for the original document
      Throws:
      IOException - if storing fails
    • buildDataPackage

      public DataPackage buildDataPackage(String containerName)
      Builds the DataPackage manifest from collected files.
      Parameters:
      containerName - the name of the container document
      Returns:
      the built DataPackage
    • getTempDirectory

      public Path getTempDirectory()
      Returns the temporary directory where files are stored.
    • getUnpackedDirectory

      public Path getUnpackedDirectory()
      Returns the unpacked subdirectory where embedded files are stored.
    • getEmbeddedFiles

      Returns information about all embedded files.
    • hasEmbeddedFiles

      public boolean hasEmbeddedFiles()
      Returns true if there are any embedded files.
    • getOriginalDocumentPath

      public Path getOriginalDocumentPath()
      Returns the path to the original document if stored.
    • getOriginalDocumentName

      public String getOriginalDocumentName()
      Returns the name of the original document if stored.
    • hasOriginalDocument

      public boolean hasOriginalDocument()
      Returns true if the original document was stored.
    • getUnpackConfig

      public UnpackConfig getUnpackConfig()
      Returns the UnpackConfig used by this handler.
    • getContainerEmitKey

      public EmitKey getContainerEmitKey()
      Returns the container emit key.
    • close

      public void close() throws IOException
      Specified by:
      close in interface AutoCloseable
      Specified by:
      close in interface Closeable
      Throws:
      IOException