Class PipesParsingHelper

java.lang.Object
org.apache.tika.server.core.resource.PipesParsingHelper

public class PipesParsingHelper extends Object
Helper class for pipes-based parsing in tika-server endpoints. Handles temp file management, FetchEmitTuple creation, and result processing.

The helper manages a dedicated temp directory for input files. A file-system-fetcher is configured with basePath pointing to this directory, so child processes can access only files inside the designated temp directory; absolute paths are not accepted.
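The basePath restriction described above can be illustrated with a small, self-contained sketch. It is not Tika's fetcher implementation. The class name `BasePathConfinementSketch`, the method `resolveWithinBase`, and the directory name `tika-input-tmp` are hypothetical and exist only for illustration. The sketch shows one common way to confine a relative key to a base directory: reject absolute keys, then normalize and check that the result still lies under the base.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative only: NOT the Tika file-system-fetcher. All names here are hypothetical.
public class BasePathConfinementSketch {

    /**
     * Resolves a relative key against basePath and rejects any key that is
     * absolute or that escapes basePath once ".." segments are normalized.
     */
    public static Path resolveWithinBase(Path basePath, String fetchKey) {
        Path key = Paths.get(fetchKey);
        if (key.isAbsolute()) {
            throw new IllegalArgumentException("absolute paths are not allowed: " + fetchKey);
        }
        Path base = basePath.toAbsolutePath().normalize();
        Path resolved = base.resolve(key).normalize();
        if (!resolved.startsWith(base)) {
            throw new IllegalArgumentException("path escapes basePath: " + fetchKey);
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Assumed directory name, used only for the demonstration.
        Path base = Paths.get("tika-input-tmp");

        // A plain relative filename resolves to a path under the base directory.
        System.out.println(resolveWithinBase(base, "upload-1234.bin"));

        // A traversal attempt is rejected.
        try {
            resolveWithinBase(base, "../outside.txt");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Checking the normalized path, rather than scanning the raw string for "..", also covers keys such as `sub/../../outside.txt`.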

  • Field Details

    • DEFAULT_FETCHER_ID

      public static final String DEFAULT_FETCHER_ID
      The fetcher ID used for reading temp files. This fetcher is configured with basePath = inputTempDirectory.
    • UNPACK_EMITTER_ID

      public static final String UNPACK_EMITTER_ID
      Name of the file-system emitter used for UNPACK mode. This emitter must be configured in tika-config.json with a basePath pointing to a writable temp directory.
  • Constructor Details

    • PipesParsingHelper

      public PipesParsingHelper(PipesParser pipesParser, PipesConfig pipesConfig, Path inputTempDirectory, Path unpackEmitterBasePath)
      Creates a PipesParsingHelper.
      Parameters:
      pipesParser - the PipesParser instance
      pipesConfig - the PipesConfig instance
      inputTempDirectory - the temp directory for input files. The file-system-fetcher is configured with basePath = this directory.
      unpackEmitterBasePath - the basePath where the unpack-emitter writes files. This is where the server will find the zip files created by UNPACK mode. May be null if UNPACK mode won't be used.
  • Method Details

    • getInputTempDirectory

      public Path getInputTempDirectory()
      Gets the input temp directory path.
      Returns:
      the input temp directory
    • parse

      public List<Metadata> parse(TikaInputStream tis, Metadata metadata, ParseContext parseContext, ParseMode parseMode) throws IOException
      Parses content using pipes-based parsing with process isolation.

      This method spools the input to the dedicated temp directory and uses a relative filename in the FetchKey. The file-system-fetcher is configured with basePath pointing to this directory, so the child process can only access files there.

      The caller is responsible for closing the TikaInputStream.

      Parameters:
      tis - the TikaInputStream containing the content to parse
      metadata - metadata to pass to the parser (may include filename, content-type, etc.)
      parseContext - parse context with handler configuration
      parseMode - the parse mode (RMETA or CONCATENATE)
      Returns:
      list of metadata objects from parsing
      Throws:
      IOException - if temp file operations fail
      TikaServerParseException - if parsing fails
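The spooling step described for parse can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the helper's actual code: the class `SpoolToTempSketch`, the method `spool`, and the `tika-input-` prefix are hypothetical. It copies an input stream into a dedicated directory and hands back only the bare filename, which is the kind of relative key a basePath-restricted fetcher could resolve.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Illustrative only: names and file prefix are hypothetical, not taken from Tika.
public class SpoolToTempSketch {

    /** Copies the stream into a new file under inputTempDir and returns the bare filename. */
    public static String spool(InputStream in, Path inputTempDir) throws IOException {
        Path tmp = Files.createTempFile(inputTempDir, "tika-input-", ".bin");
        // createTempFile already created an empty file, so the copy must replace it.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp.getFileName().toString();
    }

    public static void main(String[] args) throws IOException {
        Path inputTempDir = Files.createTempDirectory("tika-server-input");
        byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);

        String relativeKey;
        // The caller owns the stream and closes it, mirroring the contract stated for parse.
        try (InputStream in = new ByteArrayInputStream(payload)) {
            relativeKey = spool(in, inputTempDir);
        }

        Path spooled = inputTempDir.resolve(relativeKey);
        try {
            System.out.println(relativeKey + " holds " + Files.size(spooled) + " bytes");
        } finally {
            // Clean up the spooled file and the demo directory.
            Files.deleteIfExists(spooled);
            Files.deleteIfExists(inputTempDir);
        }
    }
}
```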
    • mapStatusToHttpResponse

      public static jakarta.ws.rs.core.Response.Status mapStatusToHttpResponse(PipesResult.RESULT_STATUS status)
      Maps PipesResult status to HTTP response status.
    • getPipesParser

      public PipesParser getPipesParser()
      Gets the PipesParser instance.
    • getPipesConfig

      public PipesConfig getPipesConfig()
      Gets the PipesConfig instance.
    • parseUnpack

      public PipesParsingHelper.UnpackResult parseUnpack(TikaInputStream tis, Metadata metadata, ParseContext parseContext, boolean saveAll) throws IOException
      Parses content using UNPACK mode and returns a path to the zip file containing extracted embedded documents.

      The method works as follows:
      1. The input is spooled to the dedicated temp directory.
      2. UnpackConfig is configured with zipEmbeddedFiles=true.
      3. The pipes child process extracts the embedded files and creates a zip.
      4. The zip is emitted to the configured file-system emitter.
      5. The path to the zip file is returned for streaming.

      The caller is responsible for deleting the zip file after streaming.

      Parameters:
      tis - the TikaInputStream containing the content to parse
      metadata - metadata to pass to the parser
      parseContext - parse context (may contain UnpackConfig, UnpackSelector, EmbeddedLimits)
      saveAll - if true, includes container text and metadata in the zip
      Returns:
      UnpackResult containing path to zip file and metadata list
      Throws:
      IOException - if parsing or file operations fail
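Because the caller of parseUnpack is responsible for deleting the zip file after streaming, the cleanup is easy to get wrong when the copy fails midway. The following sketch shows one way to pair the two steps using only the standard library. It does not call parseUnpack or use UnpackResult; the class `StreamThenDeleteSketch` and the method `streamAndDelete` are hypothetical, and the temp file stands in for the zip path that the real call would return.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only: a stand-in for caller-side cleanup; names are hypothetical.
public class StreamThenDeleteSketch {

    /**
     * Copies the file at zipPath to out and deletes the file afterwards,
     * even if the copy throws. Returns the number of bytes copied.
     */
    public static long streamAndDelete(Path zipPath, OutputStream out) throws IOException {
        try {
            return Files.copy(zipPath, out);
        } finally {
            Files.deleteIfExists(zipPath);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the zip produced by UNPACK mode (assumed content).
        Path zip = Files.createTempFile("unpack-result-", ".zip");
        Files.write(zip, new byte[] {0x50, 0x4B, 0x05, 0x06});

        ByteArrayOutputStream responseBody = new ByteArrayOutputStream();
        long copied = streamAndDelete(zip, responseBody);

        System.out.println("streamed " + copied + " bytes, file still exists: " + Files.exists(zip));
    }
}
```

Putting the delete in a finally block keeps the emitter directory from accumulating zips when a client disconnects during the response.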