Class PipesForkParser

java.lang.Object
org.apache.tika.pipes.fork.PipesForkParser
All Implemented Interfaces:
Closeable, AutoCloseable

public class PipesForkParser extends Object implements Closeable
A ForkParser implementation backed by PipesParser.

This class is intended to replace the legacy org.apache.tika.fork.ForkParser. The legacy ForkParser streamed SAX events between processes, which was complex and error-prone. This implementation uses the modern pipes infrastructure and returns parsed content in the metadata (via TikaCoreProperties.TIKA_CONTENT).

This parser runs parsing in forked JVM processes, providing isolation from crashes, memory leaks, and other issues that can occur during parsing. Multiple forked processes can be used for concurrent parsing.

Getting Started: This class is designed as a simple entry point to help users get started with forked parsing using files on the local filesystem. Under the hood, it uses a FileSystemFetcher to read files. For more advanced use cases, the Tika Pipes infrastructure supports many other sources and destinations through plugins:

  • Fetchers (read from): S3, Azure Blob, Google Cloud Storage, HTTP, Microsoft Graph, and more
  • Emitters (write to): OpenSearch, Solr, S3, filesystem, and more
  • Pipes Iterators (batch processing): JDBC, CSV, filesystem crawling, and more
See the tika-pipes module and its submodules for available plugins. For production batch processing, consider using AsyncProcessor or the tika-pipes-cli directly with a JSON configuration file.

Thread Safety: This class is thread-safe. Multiple threads can call parse(java.nio.file.Path) concurrently, and requests will be distributed across the pool of forked processes.
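The sharing pattern described above can be sketched as follows. This is a minimal, self-contained illustration rather than Tika code: `PathParser` is a hypothetical stand-in for a single shared parser instance (in real use it would delegate to `parser.parse(path)`), and `parseAll` and the thread count are assumptions made for the example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentParseSketch {

    // Hypothetical stand-in for one shared, thread-safe parser instance.
    interface PathParser<R> {
        R parse(Path path) throws Exception;
    }

    // Submits one task per path to a fixed thread pool. Every task calls the
    // same parser instance. Results are returned in input order.
    static <R> List<R> parseAll(PathParser<R> parser, List<Path> paths, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (Path p : paths) {
                Callable<R> task = () -> parser.parse(p);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> f : futures) {
                results.add(f.get()); // blocks until that task has finished
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Fake parser for demonstration only; no forked process is involved.
        PathParser<String> fake = p -> "parsed:" + p.getFileName();
        List<Path> paths = List.of(Paths.get("a.pdf"), Paths.get("b.docx"));
        System.out.println(parseAll(fake, paths, 2));
    }
}
```

With a real PipesForkParser, the calling thread count would typically be chosen with the size of the forked-process pool in mind, since requests are distributed across that pool.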

Error Handling:

  • Application errors (initialization failures, config errors) throw PipesForkParserException
  • Process crashes (OOM, timeout) are returned in the result; the next parse call automatically restarts the forked process
  • Per-document errors (fetch/parse exceptions) are returned in the result
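One way a batch caller might act on these three categories is sketched below. Everything in it is a hypothetical stand-in, not the Tika API: `Status`, `Parser`, and `ApplicationErrorException` only mirror the bullets above (the last one plays the role of PipesForkParserException), and the real result type may expose crash and per-document errors differently.

```java
import java.util.ArrayList;
import java.util.List;

public class ErrorTierSketch {

    // Hypothetical outcome categories carried in a result.
    enum Status { OK, PROCESS_CRASH, DOCUMENT_ERROR }

    // Hypothetical stand-in for an application-level failure that is thrown.
    static class ApplicationErrorException extends Exception {
        ApplicationErrorException(String message) {
            super(message);
        }
    }

    // Hypothetical stand-in for a parse call.
    interface Parser {
        Status parse(String file) throws ApplicationErrorException;
    }

    // Parses every file and returns a description of each one that did not
    // parse cleanly. Crashes and per-document errors are recorded and the
    // loop continues; application errors propagate and abort the batch.
    static List<String> parseBatch(Parser parser, List<String> files)
            throws ApplicationErrorException {
        List<String> failed = new ArrayList<>();
        for (String file : files) {
            Status status = parser.parse(file);
            if (status != Status.OK) {
                failed.add(file + ":" + status);
            }
        }
        return failed;
    }

    public static void main(String[] args) throws Exception {
        // Fake parser for demonstration: any file starting with "bad" crashes.
        Parser fake = f -> f.startsWith("bad") ? Status.PROCESS_CRASH : Status.OK;
        System.out.println(parseBatch(fake, List.of("ok.pdf", "bad.pdf", "ok2.pdf")));
    }
}
```

The loop keeps going after a crash because, as noted above, the forked process is restarted automatically on the next parse.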

Example usage:

 PipesForkParserConfig config = new PipesForkParserConfig();
 config.setHandlerType(HANDLER_TYPE.TEXT);
 config.setParseMode(ParseMode.RMETA);

 try (PipesForkParser parser = new PipesForkParser(config)) {
     // Parse a file by Path
     Path file = Paths.get("/path/to/file.pdf");
     PipesForkResult result = parser.parse(file);
     for (Metadata m : result.getMetadataList()) {
         String content = m.get(TikaCoreProperties.TIKA_CONTENT);
         // process content and metadata
     }

     // Or parse from an existing InputStream (it will be spooled to a temp file)
     try (TikaInputStream tis = TikaInputStream.get(inputStream)) {
         result = parser.parse(tis);
         // ...
     }
 }
 
See Also: