org.apache.tika.example.PipesForkParserExample

public class PipesForkParserExample extends Object

Examples of how to use the PipesForkParser to parse documents in a forked JVM process.

The PipesForkParser provides isolation from crashes, memory leaks, and other issues that can occur during parsing of untrusted or malformed documents. If parsing fails catastrophically (OOM, infinite loop, etc.), only the forked process is affected - your main application continues running.

Key features:

Process isolation - crashes don't affect your main JVM
Automatic process restart after crashes
Configurable timeouts to prevent infinite loops
Memory isolation - each forked process has its own heap
Thread-safe - can be shared across multiple threads

IMPORTANT - Resource Management:

Always close both the PipesForkParser and TikaInputStream using try-with-resources or explicit close() calls
TikaInputStream may create temporary files when parsing from streams - these are only cleaned up when the stream is closed
PipesForkParser manages forked JVM processes - closing it terminates these processes and cleans up the temporary config file
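
The close-both rule above relies on try-with-resources closing resources in reverse declaration order. The following is a minimal sketch of that ordering only. `Tracked` is a hypothetical stand-in class, not a Tika type, and the names "parser" and "stream" are labels chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in resources only: these are NOT Tika classes. They record the order
// in which try-with-resources closes them.
class CloseOrderSketch {

    /** Hypothetical resource that logs its name when closed. */
    static final class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.add(name);
        }
    }

    /** Opens a "parser", then a "stream", and returns the order in which they were closed. */
    static List<String> closeOrder() {
        List<String> log = new ArrayList<>();
        try (Tracked parser = new Tracked("parser", log);
             Tracked stream = new Tracked("stream", log)) {
            // Parsing would happen here.
        }
        return log;
    }

    public static void main(String[] args) {
        // Resources are closed in reverse order of declaration.
        System.out.println(closeOrder());
    }
}
```

Declaring the long-lived resource first means the per-document resource is released before it, which matches the advice to close the stream as well as the parser.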

Performance Tip: Tika is significantly more efficient on some file types (especially those requiring random access like ZIP, OLE2/Office, PDF) when you have a file on disk and use TikaInputStream.get(Path) instead of TikaInputStream.get(Files.newInputStream(path)). The latter will cause TikaInputStream to spool the entire stream to a temporary file before parsing, which adds overhead. If you already have a file, always use the Path-based method.

Constructor Summary
- PipesForkParserExample()

Method Summary
- static void main(String[] args)
- void parseEmbeddedDocumentsConcatenate(Path filePath)
  Example of parsing documents with embedded files using CONCATENATE mode (legacy).
- void parseEmbeddedDocumentsRmeta(Path filePath)
  Example of parsing documents with embedded files using RMETA mode.
- String parseFileAllContent(Path filePath)
  Example of parsing a file and getting ALL content (container + embedded documents).
- String parseFileBasic(Path filePath)
  Basic example of parsing a file using PipesForkParser with default settings.
- String parseInputStream(InputStream inputStream)
  Example of parsing from an InputStream.
- void parseManyFiles(List<Path> filePaths)
  Example of reusing PipesForkParser for multiple documents.
- String parseWithContentTypeHint(Path filePath, String contentType)
  Example of providing initial metadata hints.
- String parseWithCustomConfig(Path filePath)
  Example of parsing with custom configuration.
- String parseWithErrorHandling(Path filePath)
  Example of proper error handling with PipesForkParser.
- void parseWithMetadata(Path filePath)
  Example of parsing with metadata extraction.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- PipesForkParserExample
  
  public PipesForkParserExample()
Method Details
- parseFileBasic
  
  public String parseFileBasic(Path filePath) throws IOException, InterruptedException, TikaException, PipesException
  Basic example of parsing a file using PipesForkParser with default settings.
  This is the simplest way to use PipesForkParser. It uses default configuration which includes:
  
  Single forked process
  
  TEXT output (plain text extraction)
  
  RMETA mode (separate metadata for container and each embedded document)
  
  Note: This example uses result.getContent(), which returns only the container document's content. For files with embedded documents (ZIP, email, Office docs with attachments), embedded content is NOT included. See parseEmbeddedDocumentsRmeta(Path) for the proper way to access all content including embedded documents.
  Parameters:
  
  filePath - the path to the file to parse
  
  Returns:
  
  the container document's extracted text content (embedded content not included)
  
  Throws:
  
  IOException - if an I/O error occurs
  
  InterruptedException - if parsing is interrupted
  
  TikaException - if a Tika error occurs
  
  PipesException - if a pipes infrastructure error occurs
  
  See Also:
  
  parseEmbeddedDocumentsRmeta(Path) for accessing all content including embedded documents
- parseFileAllContent
  
  public String parseFileAllContent(Path filePath) throws IOException, InterruptedException, TikaException, PipesException
  
  Example of parsing a file and getting ALL content (container + embedded documents).
  This is the recommended approach when using RMETA mode (the default) if you need all content from a document that may contain embedded files.
  This method iterates over all metadata objects and concatenates their content, giving you content from the container AND all embedded documents.
  
  Parameters:
  
  filePath - the path to the file to parse
  
  Returns:
  
  all extracted text content (container + all embedded documents)
  
  Throws:
  
  IOException - if an I/O error occurs
  
  InterruptedException - if parsing is interrupted
  
  TikaException - if a Tika error occurs
  
  PipesException - if a pipes infrastructure error occurs
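
The iterate-and-concatenate step described for parseFileAllContent can be sketched without any Tika types. This is an illustration under stated assumptions: each metadata object is modeled as a plain `Map<String, String>`, and the key `"content"` is a hypothetical placeholder for wherever the real metadata object stores extracted text.

```java
import java.util.List;
import java.util.Map;

// Sketch only: plain maps stand in for per-document metadata objects.
class AllContentSketch {

    // Hypothetical key; the real property name is not shown in this page.
    static final String CONTENT_KEY = "content";

    /** Joins the content of the container and every embedded document, in order. */
    static String joinContent(List<Map<String, String>> metadataList) {
        StringBuilder sb = new StringBuilder();
        for (Map<String, String> metadata : metadataList) {
            String content = metadata.get(CONTENT_KEY);
            if (content == null || content.isEmpty()) {
                continue; // some documents (e.g. images) may carry no text
            }
            if (sb.length() > 0) {
                sb.append('\n');
            }
            sb.append(content);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Map<String, String>> metadataList = List.of(
                Map.of(CONTENT_KEY, "container text"),
                Map.of("title", "image without text"),
                Map.of(CONTENT_KEY, "attachment text"));
        System.out.println(joinContent(metadataList));
    }
}
```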
- parseInputStream
  
  public String parseInputStream(InputStream inputStream) throws IOException, InterruptedException, TikaException, PipesException
  
  Example of parsing from an InputStream.
  When parsing from an InputStream (as opposed to a file), TikaInputStream will automatically spool the stream to a temporary file. This is necessary because the forked process needs file system access.
  Performance Note: If you already have a file on disk, use parseFileBasic(Path) with TikaInputStream.get(Path) instead. This avoids the overhead of spooling the stream to a temporary file. For file types that require random access (ZIP, OLE2/Office documents, PDF), the performance difference can be significant.
  The temporary file is automatically cleaned up when the TikaInputStream is closed. Always close the TikaInputStream to ensure temp files are deleted.
  
  Parameters:
  
  inputStream - the input stream to parse
  
  Returns:
  
  the extracted text content
  
  Throws:
  
  IOException - if an I/O error occurs
  
  InterruptedException - if parsing is interrupted
  
  TikaException - if a Tika error occurs
  
  PipesException - if a pipes infrastructure error occurs
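
To make the spooling cost in parseInputStream concrete, here is a rough sketch of what "spool to a temporary file, delete on close" involves, using only the JDK. It is not TikaInputStream's implementation. `SpooledFile` and `spool` are hypothetical names introduced for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

// Sketch only: a JDK-only stand-in for stream spooling, not Tika code.
class SpoolSketch {

    /** Hypothetical handle that owns a temp file and deletes it on close. */
    static final class SpooledFile implements AutoCloseable {
        private final Path path;

        SpooledFile(Path path) {
            this.path = path;
        }

        Path path() {
            return path;
        }

        @Override
        public void close() throws IOException {
            Files.deleteIfExists(path);
        }
    }

    /** Copies the whole stream to a new temp file; this full copy is the overhead. */
    static SpooledFile spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spool-sketch-", ".tmp");
        try {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            Files.deleteIfExists(tmp); // do not leak the file on a failed copy
            throw e;
        }
        return new SpooledFile(tmp);
    }

    /** Returns {bytes matched while open, file still exists after close}. */
    static boolean[] demo() {
        byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
        try {
            Path seen;
            boolean matched;
            try (SpooledFile file = spool(new ByteArrayInputStream(payload))) {
                seen = file.path();
                matched = Arrays.equals(payload, Files.readAllBytes(seen));
            }
            return new boolean[] {matched, Files.exists(seen)};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo()));
    }
}
```

If the handle is never closed, the temp file stays on disk, which is why the description insists on closing the stream.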
- parseWithCustomConfig
  
  public String parseWithCustomConfig(Path filePath) throws IOException, InterruptedException, TikaException, PipesException
  Example of parsing with custom configuration.
  This example shows how to configure:
  
  HTML output instead of plain text
  
  Parse timeout of 60 seconds
  
  JVM memory settings for the forked process
  
  Maximum files before process restart (to prevent memory leaks)
  Parameters:
  
  filePath - the path to the file to parse
  
  Returns:
  
  the extracted HTML content
  
  Throws:
  
  IOException - if an I/O error occurs
  
  InterruptedException - if parsing is interrupted
  
  TikaException - if a Tika error occurs
  
  PipesException - if a pipes infrastructure error occurs
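
The "maximum files before process restart" setting listed for parseWithCustomConfig can be pictured with a small counter model. This is a sketch of the idea only: `RecyclingWorkerSketch` and its fields are hypothetical, no process is actually forked, and the real parser may restart at a different moment (for example eagerly rather than on the next file).

```java
// Sketch only: models a worker that is recycled after a fixed number of files.
class RecyclingWorkerSketch {
    private final int maxFilesPerProcess;
    private int handledByCurrentProcess;
    private int processStarts;

    RecyclingWorkerSketch(int maxFilesPerProcess) {
        if (maxFilesPerProcess < 1) {
            throw new IllegalArgumentException("maxFilesPerProcess must be >= 1");
        }
        this.maxFilesPerProcess = maxFilesPerProcess;
    }

    /** Handles one file, starting a fresh (simulated) process when the limit is reached. */
    void handleOneFile() {
        if (processStarts == 0 || handledByCurrentProcess >= maxFilesPerProcess) {
            processStarts++;              // simulated fork of a new JVM
            handledByCurrentProcess = 0;  // leaked memory is gone with the old process
        }
        handledByCurrentProcess++;
    }

    int processStarts() {
        return processStarts;
    }

    /** Convenience: number of simulated process starts needed for a batch. */
    static int startsFor(int files, int maxFilesPerProcess) {
        RecyclingWorkerSketch worker = new RecyclingWorkerSketch(maxFilesPerProcess);
        for (int i = 0; i < files; i++) {
            worker.handleOneFile();
        }
        return worker.processStarts();
    }

    public static void main(String[] args) {
        System.out.println(startsFor(10, 4));
    }
}
```

A lower limit bounds how much leaked memory one process can accumulate, at the price of more process starts per batch.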
- parseWithMetadata
  
  public void parseWithMetadata(Path filePath) throws IOException, InterruptedException, TikaException, PipesException
  
  Example of parsing with metadata extraction.
  This example demonstrates how to access both content and metadata from the parse result.
  
  Parameters:
  
  filePath - the path to the file to parse
  
  Throws:
  
  IOException - if an I/O error occurs
  
  InterruptedException - if parsing is interrupted
  
  TikaException - if a Tika error occurs
  
  PipesException - if a pipes infrastructure error occurs
- parseEmbeddedDocumentsRmeta
  
  public void parseEmbeddedDocumentsRmeta(Path filePath) throws IOException, InterruptedException, TikaException, PipesException
  Example of parsing documents with embedded files using RMETA mode.
  Both RMETA and CONCATENATE modes parse embedded content. The key differences are:
  RMETA mode (recommended for most use cases):
  
  Returns separate metadata objects for the container and each embedded document
  
  Preserves per-document metadata (author, title, dates, etc.) for each embedded file
  
  Exceptions from embedded documents are captured in each document's metadata (via TikaCoreProperties.EMBEDDED_EXCEPTION) - they are NOT silently swallowed
  
  You can see which embedded document caused a problem
  
  CONCATENATE mode (legacy behavior):
  
  Returns a single metadata object with all content concatenated together
  
  Embedded document metadata is lost (only container metadata is preserved)
  
  Exceptions from embedded documents may be silently swallowed
  
  Simpler output but less visibility into what happened
  Parameters:
  
  filePath - the path to the file to parse
  
  Throws:
  
  IOException - if an I/O error occurs
  
  InterruptedException - if parsing is interrupted
  
  TikaException - if a Tika error occurs
  
  PipesException - if a pipes infrastructure error occurs
  
  See Also:
  
  parseEmbeddedDocumentsConcatenate(Path) for the legacy CONCATENATE mode example
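
Because RMETA mode keeps one metadata object per document, finding the embedded document that failed is a simple scan. The sketch below assumes plain maps in place of metadata objects. The keys `"embeddedException"` and `"resourcePath"` are hypothetical stand-ins (the page names TikaCoreProperties.EMBEDDED_EXCEPTION but does not show its string value).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch only: plain maps stand in for per-document metadata objects.
class EmbeddedExceptionSketch {

    // Hypothetical keys chosen for this illustration.
    static final String EXCEPTION_KEY = "embeddedException";
    static final String PATH_KEY = "resourcePath";

    /** Returns an identifier for every document whose metadata records an exception. */
    static List<String> failedDocuments(List<Map<String, String>> metadataList) {
        List<String> failed = new ArrayList<>();
        for (int i = 0; i < metadataList.size(); i++) {
            Map<String, String> metadata = metadataList.get(i);
            if (metadata.containsKey(EXCEPTION_KEY)) {
                // Fall back to the list position when no path was recorded.
                failed.add(metadata.getOrDefault(PATH_KEY, "#" + i));
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<Map<String, String>> metadataList = List.of(
                Map.of(PATH_KEY, "/"),
                Map.of(PATH_KEY, "/broken.doc", EXCEPTION_KEY, "stack trace text"),
                Map.of(PATH_KEY, "/image.png"));
        System.out.println(failedDocuments(metadataList));
    }
}
```

In CONCATENATE mode there is only one metadata object, so a scan like this has nothing to attribute a failure to.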
- parseEmbeddedDocumentsConcatenate
  
  public void parseEmbeddedDocumentsConcatenate(Path filePath) throws IOException, InterruptedException, TikaException, PipesException
  Example of parsing documents with embedded files using CONCATENATE mode (legacy).
  Both RMETA and CONCATENATE modes parse embedded content. However, CONCATENATE mode provides less visibility into the parsing process:
  
  All content from container and embedded documents is concatenated into one string
  
  Only a single metadata object is returned (container metadata only)
  
  Per-embedded-document metadata is lost
  
  Exceptions from embedded documents may be silently swallowed
  
  Recommendation: Use RMETA mode (parseEmbeddedDocumentsRmeta(Path)) unless you specifically need the legacy concatenation behavior. RMETA gives you visibility into embedded document exceptions and preserves metadata for each document.
  Parameters:
  
  filePath - the path to the file to parse
  
  Throws:
  
  IOException - if an I/O error occurs
  
  InterruptedException - if parsing is interrupted
  
  TikaException - if a Tika error occurs
  
  PipesException - if a pipes infrastructure error occurs
- parseWithErrorHandling
  
  public String parseWithErrorHandling(Path filePath)
  Example of proper error handling with PipesForkParser.
  There are three categories of results to handle:
  
  Success - Parsing completed successfully
  
  Process crash - The forked JVM crashed (OOM, timeout, etc.). The parser will automatically restart for the next parse.
  
  Application error - Configuration or infrastructure error. These throw PipesForkParserException.
  Parameters:
  
  filePath - the path to the file to parse
  
  Returns:
  
  the extracted content, or error message if parsing failed
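
The three result categories above map naturally onto "two statuses plus one exception". The following sketch shows that shape with hypothetical types: `Status`, `Result`, `Parser`, and `ApplicationErrorException` are invented for this illustration and are not the PipesForkParser API.

```java
// Sketch only: every type here is a hypothetical stand-in.
class ErrorHandlingSketch {

    enum Status { SUCCESS, PROCESS_CRASH }

    static final class Result {
        final Status status;
        final String content;

        Result(Status status, String content) {
            this.status = status;
            this.content = content;
        }
    }

    /** Stands in for a configuration or infrastructure failure. */
    static final class ApplicationErrorException extends Exception {
        ApplicationErrorException(String message) {
            super(message);
        }
    }

    interface Parser {
        Result parse(String path) throws ApplicationErrorException;
    }

    /** Returns the content on success, otherwise a short error message. */
    static String parseOrExplain(Parser parser, String path) {
        try {
            Result result = parser.parse(path);
            if (result.status == Status.SUCCESS) {
                return result.content;
            }
            // The forked process died; the caller can keep using the parser.
            return "process crash: " + path;
        } catch (ApplicationErrorException e) {
            // Retrying the same call is unlikely to help here.
            return "application error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrExplain(p -> new Result(Status.SUCCESS, "text"), "a.pdf"));
        System.out.println(parseOrExplain(p -> new Result(Status.PROCESS_CRASH, null), "b.pdf"));
        System.out.println(parseOrExplain(p -> {
            throw new ApplicationErrorException("bad config");
        }, "c.pdf"));
    }
}
```

Keeping crashes as a result value rather than an exception reflects that a crash is an expected outcome for hostile input, while an application error signals a setup problem.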
- parseManyFiles
  
  public void parseManyFiles(List<Path> filePaths) throws IOException, InterruptedException, TikaException, PipesException
  
  Example of reusing PipesForkParser for multiple documents.
  PipesForkParser is designed to be reused. Creating a new parser for each document is inefficient because it requires starting a new forked JVM process.
  This example shows the recommended pattern: create the parser once and reuse it for multiple documents.
  
  Parameters:
  
  filePaths - the files to parse
  
  Throws:
  
  IOException - if an I/O error occurs
  
  InterruptedException - if parsing is interrupted
  
  TikaException - if a Tika error occurs
  
  PipesException - if a pipes infrastructure error occurs
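
The cost difference between reusing a parser and creating one per document can be shown by counting simulated process starts. This is a sketch under stated assumptions: `FakeForkParser` is a hypothetical stand-in whose constructor represents the expensive fork, and nothing is actually forked or parsed.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: counts how often a simulated forked process would be started.
class ReuseSketch {

    /** Hypothetical parser; constructing it stands in for forking a JVM. */
    static final class FakeForkParser implements AutoCloseable {
        FakeForkParser(AtomicInteger startCounter) {
            startCounter.incrementAndGet();
        }

        String parse(String document) {
            return "parsed:" + document;
        }

        @Override
        public void close() {
            // A real parser would terminate its forked process here.
        }
    }

    /** Recommended pattern: one parser for the whole batch. */
    static int startsWhenReused(List<String> documents) {
        AtomicInteger starts = new AtomicInteger();
        try (FakeForkParser parser = new FakeForkParser(starts)) {
            for (String document : documents) {
                parser.parse(document);
            }
        }
        return starts.get();
    }

    /** Inefficient pattern: a new parser (and process) per document. */
    static int startsWhenRecreated(List<String> documents) {
        AtomicInteger starts = new AtomicInteger();
        for (String document : documents) {
            try (FakeForkParser parser = new FakeForkParser(starts)) {
                parser.parse(document);
            }
        }
        return starts.get();
    }

    public static void main(String[] args) {
        List<String> documents = List.of("a.pdf", "b.docx", "c.zip");
        System.out.println(startsWhenReused(documents));
        System.out.println(startsWhenRecreated(documents));
    }
}
```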
- parseWithContentTypeHint
  
  public String parseWithContentTypeHint(Path filePath, String contentType) throws IOException, InterruptedException, TikaException, PipesException
  
  Example of providing initial metadata hints.
  You can provide metadata hints to the parser, such as the content type if you already know it. This can improve parsing accuracy or performance.
  
  Parameters:
  
  filePath - the path to the file to parse
  
  contentType - the known content type
  
  Returns:
  
  the extracted content
  
  Throws:
  
  IOException - if an I/O error occurs
  
  InterruptedException - if parsing is interrupted
  
  TikaException - if a Tika error occurs
  
  PipesException - if a pipes infrastructure error occurs
- main
  
  public static void main(String[] args) throws Exception
  
  Throws:
  
  Exception
