Class PipesForkParserConfig
Configuration for PipesForkParser.
This provides a simplified configuration API that abstracts away the complexity of the pipes infrastructure.
-
Constructor Summary
PipesForkParserConfig()
Method Summary
Method: Description
- addJvmArg: Add a JVM argument for the forked process.
- getContentHandlerFactory: Get the content handler factory that specifies how content should be handled.
- getEmbeddedLimits: Get the embedded limits configuration.
- getFetcherName: Get the fetcher name used for file system fetching.
- getNumClients() (returns int): Get the number of forked JVM processes configured.
- getParseMode: Get the parse mode.
- getPipesConfig: Get the underlying PipesConfig for advanced configuration.
- getPluginsDir: Get the plugins directory.
- getTimeoutLimits: Get the timeout limits.
- getUserConfigPath: Get the user-provided configuration file path.
- setContentHandlerFactory(ContentHandlerFactory contentHandlerFactory): Set the content handler factory.
- setEmbeddedLimits(EmbeddedLimits embeddedLimits): Set the embedded limits configuration.
- setFetcherName(String fetcherName): Set the fetcher name.
- setHandlerType: Set the handler type (TEXT, HTML, XML, etc.).
- setJavaPath(String javaPath): Set the Java executable path.
- setJvmArgs(List<String> jvmArgs): Set the JVM arguments for the forked process.
- setMaxEmbeddedCount(int maxEmbeddedCount): Set the maximum number of embedded resources to process.
- setMaxFilesPerProcess(int maxFiles): Set the maximum number of files to process before restarting the forked process.
- setNumClients(int numClients): EXPERT: Set the number of forked JVM processes (clients) to use for parsing.
- setParseMode(ParseMode parseMode): Set the parse mode (RMETA for recursive metadata, CONCATENATE for single document).
- setPluginsDir(Path pluginsDir): Set the plugins directory where plugin zips are located.
- setStartupTimeoutMillis(long startupTimeoutMillis): Set the startup timeout in milliseconds.
- setTimeoutLimits(TimeoutLimits timeoutLimits): Set the timeout limits for parsing operations.
- setUserConfigPath(Path userConfigPath): Set a user-provided configuration file path.
- setWriteLimit(int writeLimit): Set the write limit for content extraction.
-
Constructor Details
-
PipesForkParserConfig
public PipesForkParserConfig()
-
Method Details
-
getPipesConfig
Get the underlying PipesConfig for advanced configuration.
Returns:
- the pipes configuration
-
getContentHandlerFactory
Get the content handler factory that specifies how content should be handled.
Returns:
- the content handler factory
-
setContentHandlerFactory
Set the content handler factory.
Parameters:
- contentHandlerFactory - the content handler factory
Returns:
- this config for chaining
-
getParseMode
Get the parse mode.
Returns:
- the parse mode
-
setHandlerType
Set the handler type (TEXT, HTML, XML, etc.).
Parameters:
- type - the handler type
Returns:
- this config for chaining
-
setParseMode
Set the parse mode (RMETA for recursive metadata, CONCATENATE for single document).
Parameters:
- parseMode - the parse mode
Returns:
- this config for chaining
-
setWriteLimit
Set the write limit for content extraction.
Parameters:
- writeLimit - the maximum characters to extract (-1 for unlimited)
Returns:
- this config for chaining
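The -1 sentinel can be illustrated with a small stand-alone sketch. `WriteLimitSketch` and `applyWriteLimit` are hypothetical names, not part of Tika, and the real content handler may signal a reached limit differently (for example by stopping the parse rather than truncating a string). The sketch also treats every negative value as unlimited, which is an assumption; the documentation only names -1.

```java
// Hypothetical helper, NOT part of Tika: shows a character cap
// where a negative limit (documented as -1) means "no limit".
final class WriteLimitSketch {

    static String applyWriteLimit(String extracted, int writeLimit) {
        if (writeLimit < 0 || extracted.length() <= writeLimit) {
            return extracted; // unlimited, or already within the cap
        }
        return extracted.substring(0, writeLimit); // keep the first writeLimit chars
    }

    public static void main(String[] args) {
        System.out.println(applyWriteLimit("hello world", 5));
        System.out.println(applyWriteLimit("hello world", -1));
    }
}
```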
-
setMaxEmbeddedCount
Set the maximum number of embedded resources to process. This sets the maxCount on EmbeddedLimits which will be applied to ParseContext.
Parameters:
- maxEmbeddedCount - the maximum embedded count (-1 for unlimited)
Returns:
- this config for chaining
-
getEmbeddedLimits
Get the embedded limits configuration.
Returns:
- the embedded limits, or null if not set
-
setEmbeddedLimits
Set the embedded limits configuration.
Parameters:
- embeddedLimits - the embedded limits
Returns:
- this config for chaining
-
getFetcherName
Get the fetcher name used for file system fetching.
Returns:
- the fetcher name
-
setFetcherName
Set the fetcher name.
Parameters:
- fetcherName - the fetcher name
Returns:
- this config for chaining
-
setTimeoutLimits
Set the timeout limits for parsing operations.
The progress timeout bounds the time between progress updates (catches hung parsers). The total task timeout bounds overall wall-clock time.
Parameters:
- timeoutLimits - the timeout limits
Returns:
- this config for chaining
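The difference between the two timeouts can be sketched as a pure check over elapsed times. This is only an illustration of the two bounds described above; `TimeoutSketch`, `Verdict`, and `check` are hypothetical names, and the source does not say how TimeoutLimits is actually evaluated or which bound takes precedence when both are exceeded.

```java
import java.time.Duration;

// Hypothetical illustration, NOT Tika code: a task is flagged when either
// the gap since the last progress update or the total elapsed time
// exceeds its bound.
final class TimeoutSketch {

    enum Verdict { OK, PROGRESS_TIMEOUT, TOTAL_TIMEOUT }

    static Verdict check(Duration sinceStart, Duration sinceLastProgress,
                         Duration progressTimeout, Duration totalTimeout) {
        // Assumption: the total wall-clock bound is checked first.
        if (sinceStart.compareTo(totalTimeout) > 0) {
            return Verdict.TOTAL_TIMEOUT;
        }
        // A parser that stops reporting progress is treated as hung.
        if (sinceLastProgress.compareTo(progressTimeout) > 0) {
            return Verdict.PROGRESS_TIMEOUT;
        }
        return Verdict.OK;
    }

    public static void main(String[] args) {
        System.out.println(check(Duration.ofSeconds(10), Duration.ofSeconds(6),
                Duration.ofSeconds(5), Duration.ofSeconds(60)));
    }
}
```

A slow but steadily progressing parse stays within the progress bound and is only stopped by the total bound, while a hung parse is caught by the progress bound long before the total elapses.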
-
getTimeoutLimits
Get the timeout limits.
Returns:
- the timeout limits, or null if not set (defaults will be used)
-
setJvmArgs
Set the JVM arguments for the forked process.
Parameters:
- jvmArgs - the JVM arguments (e.g., "-Xmx512m")
Returns:
- this config for chaining
-
addJvmArg
Add a JVM argument for the forked process.
Parameters:
- arg - the JVM argument to add
Returns:
- this config for chaining
-
setJavaPath
Set the Java executable path.
Parameters:
- javaPath - path to the java executable
Returns:
- this config for chaining
-
setMaxFilesPerProcess
Set the maximum number of files to process before restarting the forked process. This helps prevent memory leaks from accumulating.
Parameters:
- maxFiles - the maximum files per process (-1 for unlimited)
Returns:
- this config for chaining
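The restart-after-N-files idea can be sketched as a simple counter. `RestartPolicySketch` is a hypothetical class written for illustration; it is not how Tika implements the restart, and treating only positive limits as active is an assumption (the documentation only states that -1 means unlimited).

```java
// Hypothetical illustration, NOT Tika code: counts files handled by one
// worker process and reports when the worker should be recycled.
final class RestartPolicySketch {
    private final int maxFilesPerProcess;
    private int filesSinceStart = 0;

    RestartPolicySketch(int maxFilesPerProcess) {
        this.maxFilesPerProcess = maxFilesPerProcess;
    }

    /** Records one processed file; returns true when a restart is due. */
    boolean recordFileAndCheckRestart() {
        filesSinceStart++;
        // Assumption: non-positive limits (such as -1) disable restarts.
        if (maxFilesPerProcess > 0 && filesSinceStart >= maxFilesPerProcess) {
            filesSinceStart = 0; // a fresh process starts counting from zero
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        RestartPolicySketch policy = new RestartPolicySketch(3);
        for (int i = 1; i <= 4; i++) {
            System.out.println("file " + i + " restart=" + policy.recordFileAndCheckRestart());
        }
    }
}
```

Recycling the process on a fixed schedule bounds how much leaked memory can build up in any single JVM, at the cost of a periodic startup delay.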
-
setNumClients
EXPERT: Set the number of forked JVM processes (clients) to use for parsing.
This enables concurrent parsing across multiple forked processes. Each client is an independent JVM that can parse documents in parallel. When multiple threads call PipesForkParser.parse(java.nio.file.Path), requests are distributed across the pool of forked processes.
When to use: Set this higher than 1 when you need to parse many documents concurrently and have sufficient CPU cores and memory. Each forked process consumes memory independently (based on your JVM args like -Xmx).
Default: 1 (single forked process, suitable for simple sequential use)
Parameters:
- numClients - the number of forked JVM processes (must be >= 1)
Returns:
- this config for chaining
Throws:
- IllegalArgumentException - if numClients is less than 1
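The documented contract of setNumClients (default of 1, rejection of values below 1, and a chainable return value) can be mirrored in a small stand-in class. `ForkConfigSketch` is hypothetical and is not the real PipesForkParserConfig; the `approxTotalHeapMb` helper is likewise an assumed rule of thumb that follows from each client being a separate JVM.

```java
// Hypothetical stand-in, NOT the real PipesForkParserConfig: it only mirrors
// the documented contract of setNumClients (default 1, must be >= 1, chainable).
final class ForkConfigSketch {
    private int numClients = 1; // documented default: one forked process

    ForkConfigSketch setNumClients(int numClients) {
        if (numClients < 1) {
            throw new IllegalArgumentException("numClients must be >= 1, got " + numClients);
        }
        this.numClients = numClients;
        return this; // returned so calls can be chained
    }

    int getNumClients() {
        return numClients;
    }

    // Assumed rule of thumb: each client is its own JVM, so the heap budget
    // is roughly numClients times the per-process -Xmx value.
    static long approxTotalHeapMb(int numClients, long heapMbPerClient) {
        return (long) numClients * heapMbPerClient;
    }

    public static void main(String[] args) {
        ForkConfigSketch config = new ForkConfigSketch().setNumClients(4);
        System.out.println(config.getNumClients());
        System.out.println(approxTotalHeapMb(config.getNumClients(), 512));
    }
}
```

With four clients and -Xmx512m per process, the heap budget alone is about 2048 MB, before any non-heap JVM overhead.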
-
getNumClients
public int getNumClients()
Get the number of forked JVM processes configured.
Returns:
- the number of clients
-
setStartupTimeoutMillis
Set the startup timeout in milliseconds.
Parameters:
- startupTimeoutMillis - the startup timeout
Returns:
- this config for chaining
-
getPluginsDir
Get the plugins directory.
Returns:
- the plugins directory, or null if not set
-
setPluginsDir
Set the plugins directory where plugin zips are located. This directory should contain the tika-pipes-file-system zip and any other required plugins.
Parameters:
- pluginsDir - the plugins directory
Returns:
- this config for chaining
-
getUserConfigPath
Get the user-provided configuration file path. If set, this config will be merged with the generated configuration.
Returns:
- the user config path, or null if not set
-
setUserConfigPath
Set a user-provided configuration file path. The user's configuration will be merged with the automatically generated configuration for PipesForkParser. User settings are preserved, except for the internal fetcher, which is always added.
Parameters:
- userConfigPath - path to the user's configuration file
Returns:
- this config for chaining
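One possible reading of the merge rule is sketched below with flat key/value maps. This is an assumption-heavy illustration: the source does not describe the configuration file format or the merge algorithm, and `ConfigMergeSketch`, the `"fetcher"` key, and all values are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, NOT Tika code: flat maps stand in for the real
// configuration format, which the documentation does not describe.
final class ConfigMergeSketch {

    static Map<String, String> merge(Map<String, String> generated,
                                     Map<String, String> user,
                                     String internalFetcherKey,
                                     String internalFetcherValue) {
        Map<String, String> merged = new LinkedHashMap<>(generated);
        merged.putAll(user); // user settings are preserved over generated ones
        // The internal fetcher entry is always added, whatever the user supplied.
        merged.put(internalFetcherKey, internalFetcherValue);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> generated = new LinkedHashMap<>();
        generated.put("parseMode", "RMETA");
        Map<String, String> user = new LinkedHashMap<>();
        user.put("parseMode", "CONCATENATE");
        System.out.println(merge(generated, user, "fetcher", "internal-fs"));
    }
}
```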
-