Class PipesForkParserExample

java.lang.Object
org.apache.tika.example.PipesForkParserExample

public class PipesForkParserExample extends Object
Examples of how to use the PipesForkParser to parse documents in a forked JVM process.

The PipesForkParser provides isolation from crashes, memory leaks, and other issues that can occur when parsing untrusted or malformed documents. If parsing fails catastrophically (out-of-memory error, infinite loop, etc.), only the forked process is affected; your main application continues running.

Key features:

  • Process isolation - crashes don't affect your main JVM
  • Automatic process restart after crashes
  • Configurable timeouts to prevent infinite loops
  • Memory isolation - each forked process has its own heap
  • Thread-safe - can be shared across multiple threads

IMPORTANT - Resource Management:

  • Always close both the PipesForkParser and TikaInputStream using try-with-resources or explicit close() calls
  • TikaInputStream may create temporary files when parsing from streams - these are only cleaned up when the stream is closed
  • PipesForkParser manages forked JVM processes - closing it terminates these processes and cleans up the temporary config file
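
The closing rules above map onto a single try-with-resources statement. The following is a minimal sketch that uses a hypothetical `Tracked` stand-in (not a Tika class) to show that resources declared together are closed in reverse order, so a stream declared after the parser is closed before it:

```java
import java.util.ArrayList;
import java.util.List;

public class CloseOrderSketch {
    /** Hypothetical stand-in for a closeable resource; records when it is closed. */
    static final class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> events;

        Tracked(String name, List<String> events) {
            this.name = name;
            this.events = events;
        }

        @Override
        public void close() {
            events.add("close " + name);
        }
    }

    /** Opens two resources, does some work, and returns the recorded event order. */
    public static List<String> run() {
        List<String> events = new ArrayList<>();
        // Parser declared first, stream second: Java closes them in reverse order.
        try (Tracked parser = new Tracked("parser", events);
             Tracked stream = new Tracked("stream", events)) {
            events.add("parse");
        }
        return events;
    }

    public static void main(String[] args) {
        // Prints [parse, close stream, close parser]
        System.out.println(run());
    }
}
```

Both resources are closed even if the body throws, which is the property the guidance above relies on.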

Performance Tip: Tika is significantly more efficient on some file types (especially those requiring random access like ZIP, OLE2/Office, PDF) when you have a file on disk and use TikaInputStream.get(Path) instead of TikaInputStream.get(Files.newInputStream(path)). The latter will cause TikaInputStream to spool the entire stream to a temporary file before parsing, which adds overhead. If you already have a file, always use the Path-based method.
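
To illustrate why random access favors a file on disk, here is a small JDK-only sketch (it does not use Tika, and the class and method names are made up for the illustration). `java.util.zip.ZipFile` requires a file and looks an entry up by name through the archive's central directory, so it can read the second entry without scanning the first:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class RandomAccessSketch {
    /** Writes a small ZIP with entries a.txt and b.txt to a temporary file. */
    public static Path writeSampleZip() throws IOException {
        Path zip = Files.createTempFile("sample", ".zip");
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip))) {
            out.putNextEntry(new ZipEntry("a.txt"));
            out.write("first".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
            out.putNextEntry(new ZipEntry("b.txt"));
            out.write("second".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return zip;
    }

    /** Reads one entry by name; ZipFile locates it via the central directory. */
    public static String readEntry(Path zip, String name) throws IOException {
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            ZipEntry entry = zf.getEntry(name);
            if (entry == null) {
                throw new IOException("No such entry: " + name);
            }
            return new String(zf.getInputStream(entry).readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

A parser handed only a stream has to copy the bytes somewhere seekable before it can do this kind of lookup, which is the spooling cost described above.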

  • Constructor Details

    • PipesForkParserExample

      public PipesForkParserExample()
  • Method Details

    • parseFileBasic

      public String parseFileBasic(Path filePath) throws IOException, InterruptedException, TikaException, PipesException
      Basic example of parsing a file using PipesForkParser with default settings.

      This is the simplest way to use PipesForkParser. It uses the default configuration, which includes:

      • Single forked process
      • TEXT output (plain text extraction)
      • RMETA mode (separate metadata for the container and for each embedded document)

      Note: This example uses result.getContent() which only returns the container document's content. For files with embedded documents (ZIP, email, Office docs with attachments), embedded content is NOT included. See parseEmbeddedDocumentsRmeta(Path) for the proper way to access all content including embedded documents.

      Parameters:
      filePath - the path to the file to parse
      Returns:
      the container document's extracted text content (embedded content not included)
      Throws:
      IOException - if an I/O error occurs
      InterruptedException - if parsing is interrupted
      TikaException - if a Tika error occurs
      PipesException - if a pipes infrastructure error occurs
    • parseFileAllContent

      public String parseFileAllContent(Path filePath) throws IOException, InterruptedException, TikaException, PipesException
      Example of parsing a file and getting ALL content (container + embedded documents).

      This is the recommended approach when using RMETA mode (the default) if you need all content from a document that may contain embedded files.

      This method iterates over all metadata objects and concatenates their content, giving you content from the container AND all embedded documents.
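
The difference between container-only and all content can be sketched without Tika. In the example below each document is modeled as a plain `Map`, and the `"content"` key is a hypothetical stand-in for wherever the real metadata objects keep their extracted text:

```java
import java.util.List;
import java.util.Map;

public class AllContentSketch {
    /** Hypothetical key under which each per-document map stores its text. */
    static final String CONTENT_KEY = "content";

    /** Returns only the first (container) document's text. */
    public static String containerOnly(List<Map<String, String>> docs) {
        return docs.isEmpty() ? "" : docs.get(0).getOrDefault(CONTENT_KEY, "");
    }

    /** Concatenates the text of the container and every embedded document. */
    public static String allContent(List<Map<String, String>> docs) {
        StringBuilder sb = new StringBuilder();
        for (Map<String, String> doc : docs) {
            String text = doc.get(CONTENT_KEY);
            if (text == null || text.isEmpty()) {
                continue; // some documents may carry metadata but no text
            }
            if (sb.length() > 0) {
                sb.append('\n');
            }
            sb.append(text);
        }
        return sb.toString();
    }
}
```

The list order is assumed to be container first, then embedded documents, matching the description of RMETA output above.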

      Parameters:
      filePath - the path to the file to parse
      Returns:
      all extracted text content (container + all embedded documents)
      Throws:
      IOException - if an I/O error occurs
      InterruptedException - if parsing is interrupted
      TikaException - if a Tika error occurs
      PipesException - if a pipes infrastructure error occurs
    • parseInputStream

      public String parseInputStream(InputStream inputStream) throws IOException, InterruptedException, TikaException, PipesException
      Example of parsing from an InputStream.

      When parsing from an InputStream (as opposed to a file), TikaInputStream will automatically spool the stream to a temporary file. This is necessary because the forked process needs file system access.

      Performance Note: If you already have a file on disk, use parseFileBasic(Path) with TikaInputStream.get(Path) instead. This avoids the overhead of spooling the stream to a temporary file. For file types that require random access (ZIP, OLE2/Office documents, PDF), the performance difference can be significant.

      The temporary file is automatically cleaned up when the TikaInputStream is closed. Always close the TikaInputStream to ensure temp files are deleted.
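
The spool-then-clean-up life cycle can be pictured with plain JDK calls. This is only a sketch of the idea, with a hypothetical `SpooledFile` class; it is not how TikaInputStream is implemented:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SpoolSketch {
    /** Hypothetical wrapper: copies a stream to a temp file and deletes it on close. */
    public static final class SpooledFile implements AutoCloseable {
        private final Path path;

        public SpooledFile(InputStream in) throws IOException {
            Path tmp = Files.createTempFile("spool-", ".tmp");
            try {
                // The whole stream is copied before any parsing could start.
                Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException e) {
                Files.deleteIfExists(tmp); // do not leak the file on a failed copy
                throw e;
            }
            this.path = tmp;
        }

        public Path path() {
            return path;
        }

        @Override
        public void close() throws IOException {
            // If close() is never called, the temp file stays on disk.
            Files.deleteIfExists(path);
        }
    }
}
```

Wrapping the object in try-with-resources guarantees the delete runs, which is why the text above insists on closing the stream.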

      Parameters:
      inputStream - the input stream to parse
      Returns:
      the extracted text content
      Throws:
      IOException - if an I/O error occurs
      InterruptedException - if parsing is interrupted
      TikaException - if a Tika error occurs
      PipesException - if a pipes infrastructure error occurs
    • parseWithCustomConfig

      public String parseWithCustomConfig(Path filePath) throws IOException, InterruptedException, TikaException, PipesException
      Example of parsing with custom configuration.

      This example shows how to configure:

      • HTML output instead of plain text
      • Parse timeout of 60 seconds
      • JVM memory settings for the forked process
      • Maximum files before process restart (to prevent memory leaks)
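
The last setting can be pictured as a simple counter. The sketch below is a hypothetical model of "restart after N files" (the class and method names are invented for illustration and are not Tika configuration API); it only counts how many worker starts a given limit would cause:

```java
public class RestartPolicySketch {
    private final int maxFilesPerProcess;
    private int filesSinceStart = 0;
    private int processStarts = 0;

    public RestartPolicySketch(int maxFilesPerProcess) {
        if (maxFilesPerProcess < 1) {
            throw new IllegalArgumentException("maxFilesPerProcess must be >= 1");
        }
        this.maxFilesPerProcess = maxFilesPerProcess;
    }

    /** Call once before each parse; starts or recycles the stand-in worker when needed. */
    public void beforeParse() {
        if (processStarts == 0 || filesSinceStart >= maxFilesPerProcess) {
            processStarts++;      // a real implementation would fork a fresh JVM here
            filesSinceStart = 0;  // leaked memory from earlier files is discarded with the old process
        }
        filesSinceStart++;
    }

    public int processStarts() {
        return processStarts;
    }
}
```

With a limit of 3, seven calls to `beforeParse()` lead to three starts: a lower limit bounds leak growth more tightly but pays the process start-up cost more often.
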
      Parameters:
      filePath - the path to the file to parse
      Returns:
      the extracted HTML content
      Throws:
      IOException - if an I/O error occurs
      InterruptedException - if parsing is interrupted
      TikaException - if a Tika error occurs
      PipesException - if a pipes infrastructure error occurs
    • parseWithMetadata

      public void parseWithMetadata(Path filePath) throws IOException, InterruptedException, TikaException, PipesException
      Example of parsing with metadata extraction.

      This example demonstrates how to access both content and metadata from the parse result.

      Parameters:
      filePath - the path to the file to parse
      Throws:
      IOException - if an I/O error occurs
      InterruptedException - if parsing is interrupted
      TikaException - if a Tika error occurs
      PipesException - if a pipes infrastructure error occurs
    • parseEmbeddedDocumentsRmeta

      public void parseEmbeddedDocumentsRmeta(Path filePath) throws IOException, InterruptedException, TikaException, PipesException
      Example of parsing documents with embedded files using RMETA mode.

      Both RMETA and CONCATENATE modes parse embedded content. The key differences are:

      RMETA mode (recommended for most use cases):

      • Returns separate metadata objects for the container and each embedded document
      • Preserves per-document metadata (author, title, dates, etc.) for each embedded file
      • Exceptions from embedded documents are captured in each document's metadata (via TikaCoreProperties.EMBEDDED_EXCEPTION) - they are NOT silently swallowed
      • You can see which embedded document caused a problem

      CONCATENATE mode (legacy behavior):

      • Returns a single metadata object with all content concatenated together
      • Embedded document metadata is lost (only container metadata is preserved)
      • Exceptions from embedded documents may be silently swallowed
      • Simpler output but less visibility into what happened
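
The visibility difference can be shown with a small model. In this sketch `DocRecord` is a hypothetical stand-in for one per-document metadata object, with a nullable `exception` field playing the role of the embedded-exception property; it is not the Tika metadata class:

```java
import java.util.ArrayList;
import java.util.List;

public class EmbeddedFailureSketch {
    /** Hypothetical per-document record, as an RMETA-style result would provide. */
    public static final class DocRecord {
        public final String path;
        public final String exception; // null when the document parsed cleanly

        public DocRecord(String path, String exception) {
            this.path = path;
            this.exception = exception;
        }
    }

    /** Returns the paths of all documents that recorded an exception. */
    public static List<String> failedPaths(List<DocRecord> records) {
        List<String> failed = new ArrayList<>();
        for (DocRecord record : records) {
            if (record.exception != null) {
                failed.add(record.path);
            }
        }
        return failed;
    }
}
```

A single concatenated string has no per-document field to inspect, so this kind of query is only possible with one record per document.
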
      Parameters:
      filePath - the path to the file to parse
      Throws:
      IOException - if an I/O error occurs
      InterruptedException - if parsing is interrupted
      TikaException - if a Tika error occurs
      PipesException - if a pipes infrastructure error occurs
    • parseEmbeddedDocumentsConcatenate

      public void parseEmbeddedDocumentsConcatenate(Path filePath) throws IOException, InterruptedException, TikaException, PipesException
      Example of parsing documents with embedded files using CONCATENATE mode (legacy).

      Both RMETA and CONCATENATE modes parse embedded content. However, CONCATENATE mode provides less visibility into the parsing process:

      • All content from container and embedded documents is concatenated into one string
      • Only a single metadata object is returned (container metadata only)
      • Per-embedded-document metadata is lost
      • Exceptions from embedded documents may be silently swallowed

      Recommendation: Use RMETA mode (parseEmbeddedDocumentsRmeta(Path)) unless you specifically need the legacy concatenation behavior. RMETA gives you visibility into embedded document exceptions and preserves metadata for each document.

      Parameters:
      filePath - the path to the file to parse
      Throws:
      IOException - if an I/O error occurs
      InterruptedException - if parsing is interrupted
      TikaException - if a Tika error occurs
      PipesException - if a pipes infrastructure error occurs
    • parseWithErrorHandling

      public String parseWithErrorHandling(Path filePath)
      Example of proper error handling with PipesForkParser.

      There are three categories of results to handle:

      1. Success - Parsing completed successfully
      2. Process crash - The forked JVM crashed (OOM, timeout, etc.). The parser will automatically restart for the next parse.
      3. Application error - Configuration or infrastructure error. These throw PipesForkParserException.
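
The three categories suggest a handling pattern along these lines. Everything in this sketch (`Status`, `Result`, `ApplicationException`, `StandInParser`) is a hypothetical stand-in used to show the control flow; the real result and exception types may expose different methods:

```java
public class ResultHandlingSketch {
    /** Hypothetical outcomes that are reported through the result object. */
    public enum Status { SUCCESS, PROCESS_CRASH }

    public static final class Result {
        public final Status status;
        public final String content;

        public Result(Status status, String content) {
            this.status = status;
            this.content = content;
        }
    }

    /** Hypothetical stand-in for a configuration or infrastructure failure. */
    public static class ApplicationException extends Exception {
        public ApplicationException(String message) {
            super(message);
        }
    }

    /** Hypothetical parser interface so the sketch runs without a forked JVM. */
    public interface StandInParser {
        Result parse(String path) throws ApplicationException;
    }

    /** Returns the content on success, otherwise a message describing the failure. */
    public static String parseSafely(StandInParser parser, String path) {
        try {
            Result result = parser.parse(path);
            if (result.status == Status.SUCCESS) {
                return result.content;
            }
            // Category 2: only this file is lost; the next parse can proceed.
            return "Process crash while parsing " + path;
        } catch (ApplicationException e) {
            // Category 3: likely to affect every file until the cause is fixed.
            return "Application error: " + e.getMessage();
        }
    }
}
```

The design point is that a crash is reported as a result for one file, while an application error is thrown because retrying other files is unlikely to help.
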
      Parameters:
      filePath - the path to the file to parse
      Returns:
      the extracted content, or an error message if parsing failed
    • parseManyFiles

      public void parseManyFiles(List<Path> filePaths) throws IOException, InterruptedException, TikaException, PipesException
      Example of reusing PipesForkParser for multiple documents.

      PipesForkParser is designed to be reused. Creating a new parser for each document is inefficient because it requires starting a new forked JVM process.

      This example shows the recommended pattern: create the parser once and reuse it for multiple documents.
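
The cost argument can be made concrete by counting start-ups. In this sketch `CostlyParser` is a hypothetical stand-in whose constructor represents forking a JVM; it is not the Tika class:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class ReuseSketch {
    /** Hypothetical parser whose construction stands for an expensive process start. */
    static final class CostlyParser implements AutoCloseable {
        CostlyParser(AtomicInteger starts) {
            starts.incrementAndGet();
        }

        String parse(String file) {
            return "parsed " + file;
        }

        @Override
        public void close() {
            // A real parser would terminate its forked process here.
        }
    }

    /** Anti-pattern: one parser (and one start-up) per file. */
    public static int startsWithParserPerFile(List<String> files) {
        AtomicInteger starts = new AtomicInteger();
        for (String file : files) {
            try (CostlyParser parser = new CostlyParser(starts)) {
                parser.parse(file);
            }
        }
        return starts.get();
    }

    /** Recommended: create the parser once and reuse it for every file. */
    public static int startsWithSharedParser(List<String> files) {
        AtomicInteger starts = new AtomicInteger();
        try (CostlyParser parser = new CostlyParser(starts)) {
            for (String file : files) {
                parser.parse(file);
            }
        }
        return starts.get();
    }
}
```

For a list of three files the first method pays for three start-ups and the second for one; the gap grows linearly with the number of files.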

      Parameters:
      filePaths - the files to parse
      Throws:
      IOException - if an I/O error occurs
      InterruptedException - if parsing is interrupted
      TikaException - if a Tika error occurs
      PipesException - if a pipes infrastructure error occurs
    • parseWithContentTypeHint

      public String parseWithContentTypeHint(Path filePath, String contentType) throws IOException, InterruptedException, TikaException, PipesException
      Example of providing initial metadata hints.

      You can provide metadata hints to the parser, such as the content type if you already know it. A known content type can assist type detection, which may improve parsing accuracy or performance.

      Parameters:
      filePath - the path to the file to parse
      contentType - the known content type
      Returns:
      the extracted content
      Throws:
      IOException - if an I/O error occurs
      InterruptedException - if parsing is interrupted
      TikaException - if a Tika error occurs
      PipesException - if a pipes infrastructure error occurs
    • main

      public static void main(String[] args) throws Exception
      Throws:
      Exception