Class PipesForkParserExample
Examples of using PipesForkParser to parse documents in a forked JVM process.
The PipesForkParser provides isolation from crashes, memory leaks, and other issues that can occur during parsing of untrusted or malformed documents. If parsing fails catastrophically (OOM, infinite loop, etc.), only the forked process is affected - your main application continues running.
Key features:
- Process isolation - crashes don't affect your main JVM
- Automatic process restart after crashes
- Configurable timeouts to prevent infinite loops
- Memory isolation - each forked process has its own heap
- Thread-safe - can be shared across multiple threads
IMPORTANT - Resource Management:
- Always close both the PipesForkParser and TikaInputStream using try-with-resources or explicit close() calls
- TikaInputStream may create temporary files when parsing from streams - these are only cleaned up when the stream is closed
- PipesForkParser manages forked JVM processes - closing it terminates these processes and cleans up the temporary config file
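The close-both rule can be sketched with plain JDK types. `ForkedWorker` and `SpooledInput` below are hypothetical stand-ins, not Tika classes, for PipesForkParser and TikaInputStream. The sketch only illustrates that try-with-resources closes resources in reverse declaration order, so the temporary file is removed before the worker is shut down.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class ResourceOrderSketch {

    // Records what happened so the close order can be inspected.
    static final List<String> EVENTS = new ArrayList<>();

    // Hypothetical stand-in for a parser that owns a forked process.
    static final class ForkedWorker implements AutoCloseable {
        @Override
        public void close() {
            EVENTS.add("worker closed");
        }
    }

    // Hypothetical stand-in for a stream that owns a temporary file.
    static final class SpooledInput implements AutoCloseable {
        final Path tempFile;

        SpooledInput() throws IOException {
            tempFile = Files.createTempFile("spool-", ".tmp");
        }

        @Override
        public void close() throws IOException {
            Files.deleteIfExists(tempFile);
            EVENTS.add("temp file deleted");
        }
    }

    // Returns the temp file path so callers can check that it was removed.
    static Path run() throws IOException {
        EVENTS.clear();
        try (ForkedWorker worker = new ForkedWorker();
             SpooledInput input = new SpooledInput()) {
            EVENTS.add("parsing");
            return input.tempFile;
        }
    }

    public static void main(String[] args) throws IOException {
        Path temp = run();
        System.out.println(EVENTS + " tempFileExists=" + Files.exists(temp));
    }
}
```

Declaring the longer-lived resource first means it is closed last, which matches the pattern of opening the parser once and opening one stream per document.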
Performance Tip: Tika is significantly more efficient on some file types
(especially those requiring random access like ZIP, OLE2/Office, PDF) when you have
a file on disk and use TikaInputStream.get(Path) instead of
TikaInputStream.get(Files.newInputStream(path)). The latter will cause
TikaInputStream to spool the entire stream to a temporary file before parsing,
which adds overhead. If you already have a file, always use the Path-based method.
-
Constructor Summary
Constructors
- PipesForkParserExample()
Method Summary
Modifier and Type, Method, Description:
- static void main
- void parseEmbeddedDocumentsConcatenate(Path filePath): Example of parsing documents with embedded files using CONCATENATE mode (legacy).
- void parseEmbeddedDocumentsRmeta(Path filePath): Example of parsing documents with embedded files using RMETA mode.
- String parseFileAllContent(Path filePath): Example of parsing a file and getting ALL content (container + embedded documents).
- String parseFileBasic(Path filePath): Basic example of parsing a file using PipesForkParser with default settings.
- String parseInputStream(InputStream inputStream): Example of parsing from an InputStream.
- void parseManyFiles(List<Path> filePaths): Example of reusing PipesForkParser for multiple documents.
- String parseWithContentTypeHint(Path filePath, String contentType): Example of providing initial metadata hints.
- String parseWithCustomConfig(Path filePath): Example of parsing with custom configuration.
- parseWithErrorHandling(Path filePath): Example of proper error handling with PipesForkParser.
- void parseWithMetadata(Path filePath): Example of parsing with metadata extraction.
-
Constructor Details
-
PipesForkParserExample
public PipesForkParserExample()
-
-
Method Details
-
parseFileBasic
public String parseFileBasic(Path filePath) throws IOException, InterruptedException, TikaException, PipesException
Basic example of parsing a file using PipesForkParser with default settings.
This is the simplest way to use PipesForkParser. It uses the default configuration, which includes:
- Single forked process
- TEXT output (plain text extraction)
- RMETA mode (separate metadata for container and each embedded document)
Note: This example uses result.getContent(), which only returns the container document's content. For files with embedded documents (ZIP, email, Office docs with attachments), embedded content is NOT included. See parseEmbeddedDocumentsRmeta(Path) for the proper way to access all content including embedded documents.
Parameters:
filePath - the path to the file to parse
Returns:
the container document's extracted text content (embedded content not included)
Throws:
IOException - if an I/O error occurs
InterruptedException - if parsing is interrupted
TikaException - if a Tika error occurs
PipesException - if a pipes infrastructure error occurs
-
parseFileAllContent
public String parseFileAllContent(Path filePath) throws IOException, InterruptedException, TikaException, PipesException
Example of parsing a file and getting ALL content (container + embedded documents).
This is the recommended approach when using RMETA mode (the default) if you need all content from a document that may contain embedded files.
This method iterates over all metadata objects and concatenates their content, giving you content from the container AND all embedded documents.
Parameters:
filePath - the path to the file to parse
Returns:
all extracted text content (container + all embedded documents)
Throws:
IOException - if an I/O error occurs
InterruptedException - if parsing is interrupted
TikaException - if a Tika error occurs
PipesException - if a pipes infrastructure error occurs
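The iterate-and-concatenate step described above can be sketched without Tika on the classpath. In this minimal sketch each metadata object is modelled as a plain `Map<String, String>`, and the content key `X-TIKA:content` is an assumption used for illustration; the real example works on Tika's own metadata objects.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class AllContentSketch {

    // Assumption: the key under which each document's extracted text is stored.
    static final String CONTENT_KEY = "X-TIKA:content";

    // Joins the text of the container (first entry) and every embedded document,
    // skipping entries that carry no text.
    static String allContent(List<Map<String, String>> metadataList) {
        return metadataList.stream()
                .map(metadata -> metadata.get(CONTENT_KEY))
                .filter(content -> content != null && !content.isBlank())
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        List<Map<String, String>> metadataList = List.of(
                Map.of(CONTENT_KEY, "container text"),
                Map.of("title", "image with no text"),
                Map.of(CONTENT_KEY, "attachment text"));
        System.out.println(allContent(metadataList));
    }
}
```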
-
parseInputStream
public String parseInputStream(InputStream inputStream) throws IOException, InterruptedException, TikaException, PipesException
Example of parsing from an InputStream.
When parsing from an InputStream (as opposed to a file), TikaInputStream will automatically spool the stream to a temporary file. This is necessary because the forked process needs file system access.
Performance Note: If you already have a file on disk, use parseFileBasic(Path) with TikaInputStream.get(Path) instead. This avoids the overhead of spooling the stream to a temporary file. For file types that require random access (ZIP, OLE2/Office documents, PDF), the performance difference can be significant.
The temporary file is automatically cleaned up when the TikaInputStream is closed. Always close the TikaInputStream to ensure temp files are deleted.
Parameters:
inputStream - the input stream to parse
Returns:
the extracted text content
Throws:
IOException - if an I/O error occurs
InterruptedException - if parsing is interrupted
TikaException - if a Tika error occurs
PipesException - if a pipes infrastructure error occurs
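To make the cost of spooling concrete, here is a minimal JDK-only sketch of what "spool the stream to a temporary file" amounts to. It is an illustration of the idea, not TikaInputStream's implementation: every byte is copied to disk before anything can read the file, and the file has to be deleted afterwards.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class SpoolSketch {

    // Copies the whole stream to a temp file, reports its size, and always
    // deletes the file, mirroring the cleanup that closing the stream performs.
    static long spoolAndMeasure(InputStream in) throws IOException {
        Path temp = Files.createTempFile("spool-", ".bin");
        try {
            // createTempFile already created the file, so it must be replaced.
            Files.copy(in, temp, StandardCopyOption.REPLACE_EXISTING);
            return Files.size(temp);
        } finally {
            Files.deleteIfExists(temp);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[4096];
        long spooled = spoolAndMeasure(new ByteArrayInputStream(data));
        System.out.println("bytes spooled to disk: " + spooled);
    }
}
```

When the document already exists as a file, this whole copy is unnecessary, which is why the Path-based method is preferred.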
-
parseWithCustomConfig
public String parseWithCustomConfig(Path filePath) throws IOException, InterruptedException, TikaException, PipesException
Example of parsing with custom configuration.
This example shows how to configure:
- HTML output instead of plain text
- Parse timeout of 60 seconds
- JVM memory settings for the forked process
- Maximum files before process restart (to prevent memory leaks)
Parameters:
filePath - the path to the file to parse
Returns:
the extracted HTML content
Throws:
IOException - if an I/O error occurs
InterruptedException - if parsing is interrupted
TikaException - if a Tika error occurs
PipesException - if a pipes infrastructure error occurs
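Two of the settings listed above, the JVM memory limit and the timeout, map onto ordinary JDK process handling. The sketch below shows that underlying mechanism with `ProcessBuilder` and `Process.waitFor`; it deliberately does not use the PipesForkParser configuration API, and the main class name `Worker` is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

class ForkSketch {

    // Builds a command line for a child JVM with its own heap limit.
    static List<String> buildCommand(String maxHeap, String mainClass) {
        String javaBin = Path.of(System.getProperty("java.home"), "bin", "java").toString();
        List<String> command = new ArrayList<>();
        command.add(javaBin);
        command.add("-Xmx" + maxHeap); // heap limit applies to the child only
        command.add(mainClass);
        return command;
    }

    // Returns true if the child exits within the timeout; otherwise kills it
    // and returns false, leaving the calling JVM unaffected.
    static boolean runWithTimeout(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process child = new ProcessBuilder(command).inheritIO().start();
        if (child.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            return true;
        }
        child.destroyForcibly();
        return false;
    }

    public static void main(String[] args) {
        // "Worker" is a hypothetical main class; only the command is printed here.
        System.out.println(buildCommand("512m", "Worker"));
    }
}
```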
-
parseWithMetadata
public void parseWithMetadata(Path filePath) throws IOException, InterruptedException, TikaException, PipesException
Example of parsing with metadata extraction.
This example demonstrates how to access both content and metadata from the parse result.
Parameters:
filePath - the path to the file to parse
Throws:
IOException - if an I/O error occurs
InterruptedException - if parsing is interrupted
TikaException - if a Tika error occurs
PipesException - if a pipes infrastructure error occurs
-
parseEmbeddedDocumentsRmeta
public void parseEmbeddedDocumentsRmeta(Path filePath) throws IOException, InterruptedException, TikaException, PipesException
Example of parsing documents with embedded files using RMETA mode.
Both RMETA and CONCATENATE modes parse embedded content. The key differences are:
RMETA mode (recommended for most use cases):
- Returns separate metadata objects for the container and each embedded document
- Preserves per-document metadata (author, title, dates, etc.) for each embedded file
- Exceptions from embedded documents are captured in each document's metadata (via TikaCoreProperties.EMBEDDED_EXCEPTION) - they are NOT silently swallowed
- You can see which embedded document caused a problem
CONCATENATE mode (legacy behavior):
- Returns a single metadata object with all content concatenated together
- Embedded document metadata is lost (only container metadata is preserved)
- Exceptions from embedded documents may be silently swallowed
- Simpler output but less visibility into what happened
Parameters:
filePath - the path to the file to parse
Throws:
IOException - if an I/O error occurs
InterruptedException - if parsing is interrupted
TikaException - if a Tika error occurs
PipesException - if a pipes infrastructure error occurs
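The visibility that RMETA mode provides can be sketched as a scan over the per-document metadata for recorded exceptions. As a simplification, each metadata object is modelled as a `Map<String, String>`; both key strings below are placeholders chosen for this sketch, and real code would look up TikaCoreProperties.EMBEDDED_EXCEPTION on Tika's metadata objects instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class EmbeddedExceptionSketch {

    // Placeholder keys for illustration only (assumptions, not Tika constants).
    static final String EMBEDDED_EXCEPTION_KEY = "embedded-exception";
    static final String NAME_KEY = "resourceName";

    // Lists "index:name" for every document whose metadata records an exception.
    static List<String> failedDocuments(List<Map<String, String>> metadataList) {
        List<String> failed = new ArrayList<>();
        for (int i = 0; i < metadataList.size(); i++) {
            Map<String, String> metadata = metadataList.get(i);
            if (metadata.containsKey(EMBEDDED_EXCEPTION_KEY)) {
                failed.add(i + ":" + metadata.getOrDefault(NAME_KEY, "unknown"));
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<Map<String, String>> metadataList = List.of(
                Map.of(NAME_KEY, "container.zip"),
                Map.of(NAME_KEY, "broken.doc", EMBEDDED_EXCEPTION_KEY, "stack trace text"),
                Map.of(NAME_KEY, "ok.txt"));
        System.out.println(failedDocuments(metadataList));
    }
}
```

In CONCATENATE mode there is only one metadata object, so a per-document scan like this has nothing to iterate over.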
-
parseEmbeddedDocumentsConcatenate
public void parseEmbeddedDocumentsConcatenate(Path filePath) throws IOException, InterruptedException, TikaException, PipesException
Example of parsing documents with embedded files using CONCATENATE mode (legacy).
Both RMETA and CONCATENATE modes parse embedded content. However, CONCATENATE mode provides less visibility into the parsing process:
- All content from container and embedded documents is concatenated into one string
- Only a single metadata object is returned (container metadata only)
- Per-embedded-document metadata is lost
- Exceptions from embedded documents may be silently swallowed
Recommendation: Use RMETA mode (parseEmbeddedDocumentsRmeta(Path)) unless you specifically need the legacy concatenation behavior. RMETA gives you visibility into embedded document exceptions and preserves metadata for each document.
Parameters:
filePath - the path to the file to parse
Throws:
IOException - if an I/O error occurs
InterruptedException - if parsing is interrupted
TikaException - if a Tika error occurs
PipesException - if a pipes infrastructure error occurs
-
parseWithErrorHandling
Example of proper error handling with PipesForkParser.
There are three categories of results to handle:
- Success - Parsing completed successfully
- Process crash - The forked JVM crashed (OOM, timeout, etc.). The parser will automatically restart for the next parse.
- Application error - Configuration or infrastructure error. These throw PipesForkParserException.
Parameters:
filePath - the path to the file to parse
Returns:
the extracted content, or error message if parsing failed
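The three categories can be sketched as two result states plus one exception path. All types here (`Status`, `Result`, `InfrastructureException`, `Parser`) are hypothetical stand-ins defined for this sketch; they only mirror the shape described above, where crashes come back as a result and application errors are thrown.

```java
class ErrorHandlingSketch {

    // Hypothetical result states: crashes are reported, not thrown.
    enum Status { SUCCESS, PROCESS_CRASH }

    static final class Result {
        final Status status;
        final String content;
        final String message;

        Result(Status status, String content, String message) {
            this.status = status;
            this.content = content;
            this.message = message;
        }
    }

    // Hypothetical stand-in for the exception thrown on application errors.
    static final class InfrastructureException extends Exception {
        InfrastructureException(String message) {
            super(message);
        }
    }

    // Hypothetical parser abstraction so the handling logic can be exercised.
    interface Parser {
        Result parse(String document) throws InfrastructureException;
    }

    // Returns the content on success, otherwise a message naming the category.
    static String parseSafely(Parser parser, String document) {
        try {
            Result result = parser.parse(document);
            if (result.status == Status.SUCCESS) {
                return result.content;
            }
            // The worker died; the next parse call can still proceed.
            return "Process crashed: " + result.message;
        } catch (InfrastructureException e) {
            return "Application error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        Parser crashing = doc -> new Result(Status.PROCESS_CRASH, null, "OOM");
        System.out.println(parseSafely(crashing, "big.pdf"));
    }
}
```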
-
parseManyFiles
public void parseManyFiles(List<Path> filePaths) throws IOException, InterruptedException, TikaException, PipesException
Example of reusing PipesForkParser for multiple documents.
PipesForkParser is designed to be reused. Creating a new parser for each document is inefficient because it requires starting a new forked JVM process.
This example shows the recommended pattern: create the parser once and reuse it for multiple documents.
Parameters:
filePaths - the files to parse
Throws:
IOException - if an I/O error occurs
InterruptedException - if parsing is interrupted
TikaException - if a Tika error occurs
PipesException - if a pipes infrastructure error occurs
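The create-once, reuse-many pattern can be illustrated with a hypothetical `Worker` class whose constructor stands in for the expensive step of starting a forked JVM. The counter exists only to show that the startup cost is paid once regardless of how many documents are parsed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class ReuseSketch {

    // Counts how many times the expensive startup step ran.
    static final AtomicInteger STARTS = new AtomicInteger();

    // Hypothetical stand-in for a parser backed by a forked process.
    static final class Worker implements AutoCloseable {
        Worker() {
            STARTS.incrementAndGet(); // stands in for launching a child JVM
        }

        String parse(String name) {
            return "parsed:" + name;
        }

        @Override
        public void close() {
            // A real parser would terminate its forked process here.
        }
    }

    // Opens one worker for the whole batch instead of one per document.
    static List<String> parseAll(List<String> names) {
        List<String> results = new ArrayList<>();
        try (Worker worker = new Worker()) {
            for (String name : names) {
                results.add(worker.parse(name));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> results = parseAll(List.of("a.pdf", "b.docx", "c.zip"));
        System.out.println(results + " starts=" + STARTS.get());
    }
}
```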
-
parseWithContentTypeHint
public String parseWithContentTypeHint(Path filePath, String contentType) throws IOException, InterruptedException, TikaException, PipesException
Example of providing initial metadata hints.
You can provide metadata hints to the parser, such as the content type if you already know it. This can improve parsing accuracy or performance.
Parameters:
filePath - the path to the file to parse
contentType - the known content type
Returns:
the extracted content
Throws:
IOException - if an I/O error occurs
InterruptedException - if parsing is interrupted
TikaException - if a Tika error occurs
PipesException - if a pipes infrastructure error occurs
-
main
Throws:
Exception
-