Class PipesForkParser
- All Implemented Interfaces:
Closeable, AutoCloseable
PipesParser.
This class is intended to replace the legacy
org.apache.tika.fork.ForkParser. The legacy ForkParser streamed
SAX events between processes, which was complex and error-prone. This implementation
uses the modern pipes infrastructure and returns parsed content in the metadata
(via TikaCoreProperties.TIKA_CONTENT).
This parser runs parsing in forked JVM processes, providing isolation from crashes, memory leaks, and other issues that can occur during parsing. Multiple forked processes can be used for concurrent parsing.
Getting Started: This class is designed as a simple entry point
to help users get started with forked parsing using files on the local filesystem.
Under the hood, it uses a FileSystemFetcher to read files. For more advanced
use cases, the Tika Pipes infrastructure supports many other sources and destinations
through plugins:
- Fetchers (read from): S3, Azure Blob, Google Cloud Storage, HTTP, Microsoft Graph, and more
- Emitters (write to): OpenSearch, Solr, S3, filesystem, and more
- Pipes Iterators (batch processing): JDBC, CSV, filesystem crawling, and more
See the tika-pipes module and its submodules for available plugins. For
production batch processing, consider using AsyncProcessor or the
tika-pipes-cli directly with a JSON configuration file.
Thread Safety: This class is thread-safe. Multiple threads can
call parse(java.nio.file.Path) concurrently, and requests will be distributed across the
pool of forked processes.
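The concurrent calling pattern described above can be sketched with a plain thread pool. This is a minimal, self-contained illustration, not code from Tika: `ConcurrentParseSketch`, `parseAll`, and `parseFn` are hypothetical names, and `parseFn` stands in for a call to `parse(Path)` on one shared parser instance. Because the real `parse` methods declare checked exceptions, an actual caller would need to wrap them inside the function passed in.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ConcurrentParseSketch {

    /**
     * Submits one task per path to a fixed-size thread pool and returns the
     * results in input order. parseFn is a hypothetical stand-in for calling
     * parse(Path) on a single shared, thread-safe parser.
     */
    public static <R> List<R> parseAll(List<Path> paths, Function<Path, R> parseFn, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (Path p : paths) {
                Callable<R> task = () -> parseFn.apply(p);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> f : futures) {
                results.add(f.get()); // blocks until that task has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a parse", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a parse task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Path> paths = List.of(Paths.get("a.pdf"), Paths.get("b.docx"));
        // Stand-in "parser": echoes the file name instead of parsing anything.
        List<String> out = parseAll(paths, p -> "parsed:" + p.getFileName(), 2);
        System.out.println(out);
    }
}
```

Sizing the thread pool to roughly match the number of forked processes is one plausible choice, since extra threads would only wait for a free process; that sizing is an assumption here, not something this page specifies.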
Error Handling:
- Application errors (initialization failures, config errors) throw PipesForkParserException
- Process crashes (OOM, timeout) are returned in the result; the next parse will automatically restart the forked process
- Per-document errors (fetch/parse exceptions) are returned in the result
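One way a caller might act on these three categories is sketched below. Everything in the block is a hypothetical stand-in rather than the Tika API: the `Outcome` enum models what a result could report, `ApplicationError` models PipesForkParserException, and retrying once after a crash is an illustrative caller policy that relies only on the statement above that the next parse restarts the forked process.

```java
import java.util.function.Supplier;

public class ErrorHandlingSketch {

    /** Hypothetical stand-in for the categories a parse result can report. */
    public enum Outcome { SUCCESS, DOCUMENT_ERROR, PROCESS_CRASH }

    /** Hypothetical stand-in for PipesForkParserException. */
    public static class ApplicationError extends RuntimeException {
        public ApplicationError(String message) {
            super(message);
        }
    }

    /**
     * Runs one parse attempt. An ApplicationError is not caught, so it reaches
     * the caller. A process crash is retried exactly once. A per-document
     * error is not retried, because the same input would likely fail again.
     */
    public static String handle(Supplier<Outcome> parseAttempt) {
        Outcome first = parseAttempt.get();
        if (first == Outcome.PROCESS_CRASH) {
            Outcome second = parseAttempt.get();
            return second == Outcome.SUCCESS ? "ok-after-retry" : "skipped";
        }
        return first == Outcome.SUCCESS ? "ok" : "skipped";
    }

    public static void main(String[] args) {
        System.out.println(handle(() -> Outcome.SUCCESS));
        System.out.println(handle(() -> Outcome.DOCUMENT_ERROR));
    }
}
```

Whether a crashed document deserves a retry at all depends on the workload; a file that exhausts memory once may well do so again.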
Example usage:
PipesForkParserConfig config = new PipesForkParserConfig();
config.setHandlerType(HANDLER_TYPE.TEXT);
config.setParseMode(ParseMode.RMETA);
try (PipesForkParser parser = new PipesForkParser(config)) {
    // Parse a file by Path
    Path file = Paths.get("/path/to/file.pdf");
    PipesForkResult result = parser.parse(file);
    for (Metadata m : result.getMetadataList()) {
        String content = m.get(TikaCoreProperties.TIKA_CONTENT);
        // process content and metadata
    }
    // Or parse from an InputStream (will be spooled to temp file)
    try (TikaInputStream tis = TikaInputStream.get(inputStream)) {
        result = parser.parse(tis);
        // ...
    }
}
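The spooling step mentioned in the example can be pictured with the standard library alone. The sketch below is an illustration of the general idea (copying a stream to a temporary file so that it can be read by path), not the code TikaInputStream actually runs; `SpoolSketch` and `spoolToTempFile` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SpoolSketch {

    /**
     * Copies the whole stream into a newly created temporary file and returns
     * its path. The caller is responsible for deleting the file afterwards.
     */
    public static Path spoolToTempFile(InputStream in) {
        try {
            Path tmp = Files.createTempFile("spool-", ".tmp");
            // createTempFile already created an empty file, so overwrite it.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        Path tmp = spoolToTempFile(in);
        System.out.println(Files.size(tmp) + " bytes spooled to " + tmp);
        Files.delete(tmp);
    }
}
```

A temporary file is needed because the parse reads files from the local filesystem through a FileSystemFetcher, so content that exists only as an in-memory stream has to be given a path first.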
Field Summary
- DEFAULT_FETCHER_NAME
Constructor Summary
- PipesForkParser(): Creates a new PipesForkParser with default configuration.
- PipesForkParser(PipesForkParserConfig config): Creates a new PipesForkParser with the specified configuration.
Method Summary
- void close()
- PipesForkResult parse(Path path): Parse a file in a forked JVM process.
- PipesForkResult parse(Path path, Metadata metadata): Parse a file in a forked JVM process with the specified metadata.
- PipesForkResult parse(Path path, Metadata metadata, ParseContext parseContext): Parse a file in a forked JVM process with the specified metadata and parse context.
- PipesForkResult parse(TikaInputStream tis): Parse a file in a forked JVM process.
- PipesForkResult parse(TikaInputStream tis, Metadata metadata): Parse a file in a forked JVM process with the specified metadata.
- PipesForkResult parse(TikaInputStream tis, Metadata metadata, ParseContext parseContext): Parse a file in a forked JVM process with the specified metadata and parse context.
-
Field Details
-
DEFAULT_FETCHER_NAME
-
Constructor Details
-
PipesForkParser
Creates a new PipesForkParser with default configuration.
- Throws:
IOException - if the temporary config file cannot be created
TikaConfigException - if configuration is invalid
-
PipesForkParser
Creates a new PipesForkParser with the specified configuration.
- Parameters:
config - the configuration for this parser
- Throws:
IOException - if the temporary config file cannot be created
TikaConfigException - if configuration is invalid
-
-
Method Details
-
parse
public PipesForkResult parse(Path path) throws IOException, InterruptedException, PipesException, TikaException
Parse a file in a forked JVM process.
- Parameters:
path - the path to the file to parse
- Returns:
the parse result containing metadata and content
- Throws:
IOException - if an I/O error occurs
InterruptedException - if the parsing is interrupted
PipesException - if a pipes infrastructure error occurs
PipesForkParserException - if an application error occurs (initialization failure or configuration error)
TikaException
-
parse
public PipesForkResult parse(Path path, Metadata metadata) throws IOException, InterruptedException, PipesException, TikaException
Parse a file in a forked JVM process with the specified metadata.
- Parameters:
path - the path to the file to parse
metadata - initial metadata (e.g., content type hint)
- Returns:
the parse result containing metadata and content
- Throws:
IOException - if an I/O error occurs
InterruptedException - if the parsing is interrupted
PipesException - if a pipes infrastructure error occurs
PipesForkParserException - if an application error occurs (initialization failure or configuration error)
TikaException
-
parse
public PipesForkResult parse(Path path, Metadata metadata, ParseContext parseContext) throws IOException, InterruptedException, PipesException, TikaException
Parse a file in a forked JVM process with the specified metadata and parse context.
- Parameters:
path - the path to the file to parse
metadata - initial metadata (e.g., content type hint)
parseContext - the parse context
- Returns:
the parse result containing metadata and content
- Throws:
IOException - if an I/O error occurs
InterruptedException - if the parsing is interrupted
PipesException - if a pipes infrastructure error occurs
PipesForkParserException - if an application error occurs (initialization failure or configuration error)
TikaException
-
parse
public PipesForkResult parse(TikaInputStream tis) throws IOException, InterruptedException, PipesException, TikaException
Parse a file in a forked JVM process.
- Parameters:
tis - the TikaInputStream to parse. If the stream doesn't have an underlying file, it will be spooled to a temporary file. The caller must keep the TikaInputStream open until this method returns.
- Returns:
the parse result containing metadata and content
- Throws:
IOException - if an I/O error occurs
InterruptedException - if the parsing is interrupted
PipesException - if a pipes infrastructure error occurs
PipesForkParserException - if an application error occurs (initialization failure or configuration error)
TikaException
-
parse
public PipesForkResult parse(TikaInputStream tis, Metadata metadata) throws IOException, InterruptedException, PipesException, TikaException
Parse a file in a forked JVM process with the specified metadata.
- Parameters:
tis - the TikaInputStream to parse. If the stream doesn't have an underlying file, it will be spooled to a temporary file. The caller must keep the TikaInputStream open until this method returns.
metadata - initial metadata (e.g., content type hint)
- Returns:
the parse result containing metadata and content
- Throws:
IOException - if an I/O error occurs
InterruptedException - if the parsing is interrupted
PipesException - if a pipes infrastructure error occurs
PipesForkParserException - if an application error occurs (initialization failure or configuration error)
TikaException
-
parse
public PipesForkResult parse(TikaInputStream tis, Metadata metadata, ParseContext parseContext) throws IOException, InterruptedException, PipesException, TikaException
Parse a file in a forked JVM process with the specified metadata and parse context.
- Parameters:
tis - the TikaInputStream to parse. If the stream doesn't have an underlying file, it will be spooled to a temporary file. The caller must keep the TikaInputStream open until this method returns.
metadata - initial metadata (e.g., content type hint)
parseContext - the parse context
- Returns:
the parse result containing metadata and content
- Throws:
IOException - if an I/O error occurs
InterruptedException - if the parsing is interrupted
PipesException - if a pipes infrastructure error occurs
PipesForkParserException - if an application error occurs (initialization failure or configuration error)
TikaException
-
close
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Throws:
IOException
-