org.apache.tika.pipes.fork.PipesForkParser

All Implemented Interfaces: Closeable, AutoCloseable

public class PipesForkParser extends Object implements Closeable

A ForkParser implementation backed by PipesParser.

This class is intended to replace the legacy org.apache.tika.fork.ForkParser. The legacy ForkParser streamed SAX events between processes, which was complex and error-prone. This implementation uses the modern pipes infrastructure and returns parsed content in the metadata (via TikaCoreProperties.TIKA_CONTENT).

This parser runs parsing in forked JVM processes, providing isolation from crashes, memory leaks, and other issues that can occur during parsing. Multiple forked processes can be used for concurrent parsing.

Getting Started: This class is designed as a simple entry point to help users get started with forked parsing using files on the local filesystem. Under the hood, it uses a FileSystemFetcher to read files. For more advanced use cases, the Tika Pipes infrastructure supports many other sources and destinations through plugins:

Fetchers (read from): S3, Azure Blob, Google Cloud Storage, HTTP, Microsoft Graph, and more
Emitters (write to): OpenSearch, Solr, S3, filesystem, and more
Pipes Iterators (batch processing): JDBC, CSV, filesystem crawling, and more

See the tika-pipes module and its submodules for available plugins. For production batch processing, consider using AsyncProcessor or the tika-pipes-cli directly with a JSON configuration file.

Thread Safety: This class is thread-safe. Multiple threads can call parse(java.nio.file.Path) concurrently, and requests will be distributed across the pool of forked processes.
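As an illustration of the concurrent use described above, the following is a minimal sketch of several threads sharing one parse function. SharedParserDemo and parseAll are hypothetical names, not part of Tika, and the Function parameter stands in for a call on a single shared parser instance so that the sketch runs with the JDK alone.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Sketch only: SharedParserDemo and parseAll are hypothetical names, and the
// Function<Path, R> argument stands in for a call such as parser.parse(path)
// on one shared, thread-safe parser instance.
final class SharedParserDemo {

    // Submits one task per path to a fixed-size thread pool. Every task calls
    // the same shared parse function. Results are returned in input order.
    static <R> List<R> parseAll(List<Path> paths, Function<Path, R> parse, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (Path p : paths) {
                Callable<R> task = () -> parse.apply(p);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> f : futures) {
                results.add(f.get()); // blocks until that task has finished
            }
            return results;
        } finally {
            pool.shutdown(); // stop accepting tasks; submitted ones are already done
        }
    }
}
```

With the real class, the function would wrap a call such as parser.parse(path). Because that method declares checked exceptions, the lambda would have to catch them and either rethrow them unchecked or return them as part of its result.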

Error Handling:

Application errors (initialization failures, configuration errors) throw PipesForkParserException.
Process crashes (OOM, timeout) are returned in the result; the next parse automatically restarts the forked process.
Per-document errors (fetch or parse exceptions) are returned in the result.
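The split between thrown and returned errors suggests a caller policy along the following lines. This is only a sketch under stated assumptions: Outcome, AppError, and handle are hypothetical stand-ins invented for illustration, and the accessors that PipesForkResult actually offers for telling a crash from a per-document error are not shown on this page.

```java
import java.util.function.Supplier;

// Sketch only: none of these names are Tika API. Outcome, AppError and
// handle(...) are hypothetical stand-ins for the three error categories.
final class ErrorPolicyDemo {

    // Hypothetical classification of what a returned result reports.
    enum Outcome { SUCCESS, PROCESS_CRASH, DOCUMENT_ERROR }

    // Hypothetical stand-in for an application-level exception such as
    // PipesForkParserException. Unchecked here only to keep the sketch short.
    static final class AppError extends RuntimeException {
        AppError(String message) { super(message); }
    }

    // Returns the action a caller might take for one parse attempt.
    static String handle(Supplier<Outcome> parseAttempt) {
        Outcome outcome;
        try {
            outcome = parseAttempt.get();
        } catch (AppError e) {
            // Thrown, not returned: initialization or configuration is broken,
            // so later documents would most likely fail the same way.
            return "abort";
        }
        switch (outcome) {
            case PROCESS_CRASH:
                // Reported in the result. The forked process is restarted on
                // the next parse, so the caller can move on to the next file.
                return "record-crash-and-continue";
            case DOCUMENT_ERROR:
                // Only this document failed; the process is still healthy.
                return "record-error-and-continue";
            default:
                return "use-content";
        }
    }
}
```

The design point is that only the thrown category should stop a batch. The two returned categories affect a single document and leave the parser usable.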

Example usage:

 PipesForkParserConfig config = new PipesForkParserConfig();
 config.setHandlerType(HANDLER_TYPE.TEXT);
 config.setParseMode(ParseMode.RMETA);

 try (PipesForkParser parser = new PipesForkParser(config)) {
     // Parse a file by Path
     Path file = Paths.get("/path/to/file.pdf");
     PipesForkResult result = parser.parse(file);
     for (Metadata m : result.getMetadataList()) {
         String content = m.get(TikaCoreProperties.TIKA_CONTENT);
         // process content and metadata
     }

     // Or parse from an InputStream (will be spooled to temp file).
     // Here inputStream is any open java.io.InputStream supplied by the caller.
     try (TikaInputStream tis = TikaInputStream.get(inputStream)) {
         result = parser.parse(tis);
         // ...
     }
 }
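The InputStream variant in the example above mentions spooling to a temporary file, presumably because the forked process reads documents from the filesystem rather than from the caller's memory. The general idea of spooling can be sketched with the JDK alone. SpoolDemo and spoolToTempFile are hypothetical names, and this is not how TikaInputStream implements it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch only: SpoolDemo and spoolToTempFile are hypothetical names. This
// shows the general idea of spooling, not TikaInputStream's implementation.
final class SpoolDemo {

    // Copies the whole stream into a new temp file and returns its path.
    // The caller is responsible for deleting the file afterwards.
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".bin");
        try {
            // createTempFile already created an empty file, so overwrite it.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            Files.deleteIfExists(tmp); // do not leave a partial file behind
            throw e;
        }
    }
}
```

Spooling costs one extra copy of the document on disk, so when the data already lives in a file, passing its Path directly avoids that copy.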

See Also:

for batch processing

Field Summary

Fields

Modifier and Type      Field and Description
static final String    DEFAULT_FETCHER_NAME

Constructor Summary

Constructors

Constructor and Description
PipesForkParser()
    Creates a new PipesForkParser with default configuration.
PipesForkParser(PipesForkParserConfig config)
    Creates a new PipesForkParser with the specified configuration.

Method Summary

Modifier and Type    Method and Description
void                 close()
PipesForkResult      parse(Path path)
                     Parse a file in a forked JVM process.
PipesForkResult      parse(Path path, Metadata metadata)
                     Parse a file in a forked JVM process with the specified metadata.
PipesForkResult      parse(Path path, Metadata metadata, ParseContext parseContext)
                     Parse a file in a forked JVM process with the specified metadata and parse context.
PipesForkResult      parse(TikaInputStream tis)
                     Parse a file in a forked JVM process.
PipesForkResult      parse(TikaInputStream tis, Metadata metadata)
                     Parse a file in a forked JVM process with the specified metadata.
PipesForkResult      parse(TikaInputStream tis, Metadata metadata, ParseContext parseContext)
                     Parse a file in a forked JVM process with the specified metadata and parse context.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- DEFAULT_FETCHER_NAME
  
  public static final String DEFAULT_FETCHER_NAME
  See Also:
  
  Constant Field Values

Constructor Details
- PipesForkParser
  
  public PipesForkParser() throws IOException, TikaConfigException
  
  Creates a new PipesForkParser with default configuration.
  
  Throws:
  
  IOException - if the temporary config file cannot be created
  
  TikaConfigException - if configuration is invalid
- PipesForkParser
  
  public PipesForkParser(PipesForkParserConfig config) throws IOException, TikaConfigException
  
  Creates a new PipesForkParser with the specified configuration.
  
  Parameters:
  
  config - the configuration for this parser
  
  Throws:
  
  IOException - if the temporary config file cannot be created
  
  TikaConfigException - if configuration is invalid

Method Details
- parse
  
  public PipesForkResult parse(Path path) throws IOException, InterruptedException, PipesException, TikaException
  
  Parse a file in a forked JVM process.
  
  Parameters:
  
  path - the path to the file to parse
  
  Returns:
  
  the parse result containing metadata and content
  
  Throws:
  
  IOException - if an I/O error occurs
  
  InterruptedException - if the parsing is interrupted
  
  PipesException - if a pipes infrastructure error occurs
  
  PipesForkParserException - if an application error occurs (initialization failure or configuration error)
  
  TikaException
- parse
  
  public PipesForkResult parse(Path path, Metadata metadata) throws IOException, InterruptedException, PipesException, TikaException
  
  Parse a file in a forked JVM process with the specified metadata.
  
  Parameters:
  
  path - the path to the file to parse
  
  metadata - initial metadata (e.g., content type hint)
  
  Returns:
  
  the parse result containing metadata and content
  
  Throws:
  
  IOException - if an I/O error occurs
  
  InterruptedException - if the parsing is interrupted
  
  PipesException - if a pipes infrastructure error occurs
  
  PipesForkParserException - if an application error occurs (initialization failure or configuration error)
  
  TikaException
- parse
  
  public PipesForkResult parse(Path path, Metadata metadata, ParseContext parseContext) throws IOException, InterruptedException, PipesException, TikaException
  
  Parse a file in a forked JVM process with the specified metadata and parse context.
  
  Parameters:
  
  path - the path to the file to parse
  
  metadata - initial metadata (e.g., content type hint)
  
  parseContext - the parse context
  
  Returns:
  
  the parse result containing metadata and content
  
  Throws:
  
  IOException - if an I/O error occurs
  
  InterruptedException - if the parsing is interrupted
  
  PipesException - if a pipes infrastructure error occurs
  
  PipesForkParserException - if an application error occurs (initialization failure or configuration error)
  
  TikaException
- parse
  
  public PipesForkResult parse(TikaInputStream tis) throws IOException, InterruptedException, PipesException, TikaException
  
  Parse a file in a forked JVM process.
  
  Parameters:
  
  tis - the TikaInputStream to parse. If the stream doesn't have an underlying file, it will be spooled to a temporary file. The caller must keep the TikaInputStream open until this method returns.
  
  Returns:
  
  the parse result containing metadata and content
  
  Throws:
  
  IOException - if an I/O error occurs
  
  InterruptedException - if the parsing is interrupted
  
  PipesException - if a pipes infrastructure error occurs
  
  PipesForkParserException - if an application error occurs (initialization failure or configuration error)
  
  TikaException
- parse
  
  public PipesForkResult parse(TikaInputStream tis, Metadata metadata) throws IOException, InterruptedException, PipesException, TikaException
  
  Parse a file in a forked JVM process with the specified metadata.
  
  Parameters:
  
  tis - the TikaInputStream to parse. If the stream doesn't have an underlying file, it will be spooled to a temporary file. The caller must keep the TikaInputStream open until this method returns.
  
  metadata - initial metadata (e.g., content type hint)
  
  Returns:
  
  the parse result containing metadata and content
  
  Throws:
  
  IOException - if an I/O error occurs
  
  InterruptedException - if the parsing is interrupted
  
  PipesException - if a pipes infrastructure error occurs
  
  PipesForkParserException - if an application error occurs (initialization failure or configuration error)
  
  TikaException
- parse
  
  public PipesForkResult parse(TikaInputStream tis, Metadata metadata, ParseContext parseContext) throws IOException, InterruptedException, PipesException, TikaException
  
  Parse a file in a forked JVM process with the specified metadata and parse context.
  
  Parameters:
  
  tis - the TikaInputStream to parse. If the stream doesn't have an underlying file, it will be spooled to a temporary file. The caller must keep the TikaInputStream open until this method returns.
  
  metadata - initial metadata (e.g., content type hint)
  
  parseContext - the parse context
  
  Returns:
  
  the parse result containing metadata and content
  
  Throws:
  
  IOException - if an I/O error occurs
  
  InterruptedException - if the parsing is interrupted
  
  PipesException - if a pipes infrastructure error occurs
  
  PipesForkParserException - if an application error occurs (initialization failure or configuration error)
  
  TikaException
- close
  
  public void close() throws IOException
  
  Specified by:
  
  close in interface AutoCloseable
  
  Specified by:
  
  close in interface Closeable
  
  Throws:
  
  IOException
