Class PipesForkParserConfig

java.lang.Object
org.apache.tika.pipes.fork.PipesForkParserConfig

public class PipesForkParserConfig extends Object
Configuration for PipesForkParser.

This provides a simplified configuration API that abstracts away the complexity of the pipes infrastructure.

  • Constructor Details

    • PipesForkParserConfig

      public PipesForkParserConfig()
  • Method Details

    • getPipesConfig

      public PipesConfig getPipesConfig()
      Get the underlying PipesConfig for advanced configuration.
      Returns:
      the pipes configuration
    • getContentHandlerFactory

      public ContentHandlerFactory getContentHandlerFactory()
      Get the content handler factory that specifies how content should be handled.
      Returns:
      the content handler factory
    • setContentHandlerFactory

      public PipesForkParserConfig setContentHandlerFactory(ContentHandlerFactory contentHandlerFactory)
      Set the content handler factory.
      Parameters:
      contentHandlerFactory - the content handler factory
      Returns:
      this config for chaining
    • getParseMode

      public ParseMode getParseMode()
      Get the parse mode.
      Returns:
      the parse mode
    • setHandlerType

      Set the handler type (TEXT, HTML, XML, etc.).
      Parameters:
      type - the handler type
      Returns:
      this config for chaining
    • setParseMode

      public PipesForkParserConfig setParseMode(ParseMode parseMode)
      Set the parse mode (RMETA for recursive metadata, CONCATENATE for single document).
      Parameters:
      parseMode - the parse mode
      Returns:
      this config for chaining
    • setWriteLimit

      public PipesForkParserConfig setWriteLimit(int writeLimit)
      Set the write limit for content extraction.
      Parameters:
      writeLimit - the maximum characters to extract (-1 for unlimited)
      Returns:
      this config for chaining
    • setMaxEmbeddedCount

      public PipesForkParserConfig setMaxEmbeddedCount(int maxEmbeddedCount)
      Set the maximum number of embedded resources to process. This sets maxCount on the EmbeddedLimits, which are then applied to the ParseContext.
      Parameters:
      maxEmbeddedCount - the maximum embedded count (-1 for unlimited)
      Returns:
      this config for chaining
    • getEmbeddedLimits

      public EmbeddedLimits getEmbeddedLimits()
      Get the embedded limits configuration.
      Returns:
      the embedded limits, or null if not set
    • setEmbeddedLimits

      public PipesForkParserConfig setEmbeddedLimits(EmbeddedLimits embeddedLimits)
      Set the embedded limits configuration.
      Parameters:
      embeddedLimits - the embedded limits
      Returns:
      this config for chaining
    • getFetcherName

      public String getFetcherName()
      Get the fetcher name used for file system fetching.
      Returns:
      the fetcher name
    • setFetcherName

      public PipesForkParserConfig setFetcherName(String fetcherName)
      Set the fetcher name.
      Parameters:
      fetcherName - the fetcher name
      Returns:
      this config for chaining
    • setTimeoutLimits

      public PipesForkParserConfig setTimeoutLimits(TimeoutLimits timeoutLimits)
      Set the timeout limits for parsing operations.

      The progress timeout bounds the time between progress updates (catches hung parsers). The total task timeout bounds overall wall-clock time.

      Parameters:
      timeoutLimits - the timeout limits
      Returns:
      this config for chaining
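To make the two bounds concrete, the following is a minimal sketch of how a watchdog could evaluate them. It is an illustrative model only, not Tika's implementation: the class `TimeoutModel`, its `check` method, and the millisecond parameters are hypothetical names introduced for this example.

```java
/** Hypothetical model of the two timeout bounds; not part of Tika. */
final class TimeoutModel {

    enum Verdict { OK, PROGRESS_TIMEOUT, TOTAL_TIMEOUT }

    /**
     * @param startMillis           when the task started
     * @param lastProgressMillis    when the most recent progress update arrived
     * @param nowMillis             the current time
     * @param progressTimeoutMillis maximum allowed gap between progress updates
     * @param totalTimeoutMillis    maximum allowed wall-clock time for the whole task
     */
    static Verdict check(long startMillis, long lastProgressMillis, long nowMillis,
                         long progressTimeoutMillis, long totalTimeoutMillis) {
        // The total bound applies even when the parser keeps reporting progress.
        if (nowMillis - startMillis > totalTimeoutMillis) {
            return Verdict.TOTAL_TIMEOUT;
        }
        // The progress bound catches a parser that has gone silent (for example, hung).
        if (nowMillis - lastProgressMillis > progressTimeoutMillis) {
            return Verdict.PROGRESS_TIMEOUT;
        }
        return Verdict.OK;
    }

    public static void main(String[] args) {
        // Started at t=0, last progress at t=5s, now t=40s;
        // progress bound 30s, total bound 120s.
        System.out.println(check(0, 5_000, 40_000, 30_000, 120_000));
    }
}
```

In this model a slow but steadily progressing parse is stopped only by the total bound, while a silent parser is stopped by the progress bound well before the total bound is reached.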
    • getTimeoutLimits

      public TimeoutLimits getTimeoutLimits()
      Get the timeout limits.
      Returns:
      the timeout limits, or null if not set (defaults will be used)
    • setJvmArgs

      public PipesForkParserConfig setJvmArgs(List<String> jvmArgs)
      Set the JVM arguments for the forked process.
      Parameters:
      jvmArgs - the JVM arguments (e.g., "-Xmx512m")
      Returns:
      this config for chaining
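As a sketch of how the argument list might be assembled, the helper below builds a `List<String>` with one JVM option per element. The class `ForkJvmArgs` and the `maxHeapMb` parameter are hypothetical, and the sketch assumes each list element reaches the forked JVM as a single argument, so options are not joined into one space-separated string.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for assembling forked-JVM arguments; not part of Tika. */
final class ForkJvmArgs {

    /** Builds an argument list starting with a heap cap, followed by any extra options. */
    static List<String> build(int maxHeapMb, String... extra) {
        if (maxHeapMb <= 0) {
            throw new IllegalArgumentException("maxHeapMb must be positive");
        }
        List<String> jvmArgs = new ArrayList<>();
        jvmArgs.add("-Xmx" + maxHeapMb + "m"); // e.g. "-Xmx512m"
        for (String option : extra) {
            jvmArgs.add(option);               // one option per list element
        }
        return jvmArgs;
    }

    public static void main(String[] args) {
        List<String> jvmArgs = build(512, "-Dfile.encoding=UTF-8");
        System.out.println(jvmArgs);
        // With Tika on the classpath, the list would then be handed to the config:
        //   new PipesForkParserConfig().setJvmArgs(jvmArgs);
    }
}
```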
    • addJvmArg

      public PipesForkParserConfig addJvmArg(String arg)
      Add a JVM argument for the forked process.
      Parameters:
      arg - the JVM argument to add
      Returns:
      this config for chaining
    • setJavaPath

      public PipesForkParserConfig setJavaPath(String javaPath)
      Set the Java executable path.
      Parameters:
      javaPath - path to the java executable
      Returns:
      this config for chaining
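One plausible value for the Java path is the executable of the JVM that is already running. The sketch below derives it from the standard `java.home` system property; the class `JavaPathSketch` is a hypothetical name, and whether the same JVM is the right choice for the forked process depends on the deployment.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

/** Hypothetical helper that locates the current JVM's executable; not part of Tika. */
final class JavaPathSketch {

    /** Returns <java.home>/bin/java (java.exe on Windows). */
    static Path currentJavaExecutable() {
        String os = System.getProperty("os.name").toLowerCase(Locale.ROOT);
        String executable = os.contains("win") ? "java.exe" : "java";
        return Paths.get(System.getProperty("java.home"), "bin", executable);
    }

    public static void main(String[] args) {
        Path java = currentJavaExecutable();
        System.out.println(java);
        // With Tika on the classpath, the path would be passed as a String:
        //   config.setJavaPath(java.toString());
    }
}
```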
    • setMaxFilesPerProcess

      public PipesForkParserConfig setMaxFilesPerProcess(int maxFiles)
      Set the maximum number of files to process before restarting the forked process. Periodic restarts keep memory leaks in the forked JVM from accumulating.
      Parameters:
      maxFiles - the maximum files per process (-1 for unlimited)
      Returns:
      this config for chaining
    • setNumClients

      public PipesForkParserConfig setNumClients(int numClients)
      EXPERT: Set the number of forked JVM processes (clients) to use for parsing.

      This enables concurrent parsing across multiple forked processes. Each client is an independent JVM that can parse documents in parallel. When multiple threads call PipesForkParser.parse(java.nio.file.Path), requests are distributed across the pool of forked processes.

      When to use: Set this higher than 1 when you need to parse many documents concurrently and have sufficient CPU cores and memory. Each forked process consumes memory independently, as determined by your JVM arguments (for example, -Xmx).

      Default: 1 (single forked process, suitable for simple sequential use)

      Parameters:
      numClients - the number of forked JVM processes (must be >= 1)
      Returns:
      this config for chaining
      Throws:
      IllegalArgumentException - if numClients is less than 1
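Because each client is a separate JVM, the heap reserved across the pool grows linearly with `numClients`. The following sketch works through that arithmetic and shows one possible sizing policy. The class `ForkPoolSizing`, its methods, and the policy of taking the smaller of the core count and the heap budget are assumptions made for illustration, not a Tika recommendation.

```java
/** Hypothetical sizing helper for the forked-process pool; not part of Tika. */
final class ForkPoolSizing {

    /** Total maximum heap across all forked JVMs, in MB. */
    static long totalHeapMb(int numClients, long heapPerClientMb) {
        if (numClients < 1) {
            throw new IllegalArgumentException("numClients must be >= 1");
        }
        return Math.multiplyExact((long) numClients, heapPerClientMb);
    }

    /** Largest client count that fits both the core count and a heap budget, never below 1. */
    static int suggestNumClients(int cores, long heapBudgetMb, long heapPerClientMb) {
        if (heapPerClientMb <= 0) {
            throw new IllegalArgumentException("heapPerClientMb must be positive");
        }
        long byMemory = heapBudgetMb / heapPerClientMb;
        long clients = Math.min(cores, byMemory);
        return (int) Math.max(1, clients);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // Four clients at -Xmx512m reserve up to 4 * 512 = 2048 MB of heap.
        System.out.println(totalHeapMb(4, 512));
        // Suggested pool size for a 4096 MB heap budget at 512 MB per client.
        System.out.println(suggestNumClients(cores, 4096, 512));
    }
}
```

Heap is only part of a JVM's footprint (metaspace, thread stacks, and native memory add to it), so a real budget should leave headroom beyond the `-Xmx` total.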
    • getNumClients

      public int getNumClients()
      Get the number of forked JVM processes configured.
      Returns:
      the number of clients
    • setStartupTimeoutMillis

      public PipesForkParserConfig setStartupTimeoutMillis(long startupTimeoutMillis)
      Set the startup timeout in milliseconds.
      Parameters:
      startupTimeoutMillis - the startup timeout
      Returns:
      this config for chaining
    • getPluginsDir

      public Path getPluginsDir()
      Get the plugins directory.
      Returns:
      the plugins directory, or null if not set
    • setPluginsDir

      public PipesForkParserConfig setPluginsDir(Path pluginsDir)
      Set the plugins directory where plugin zips are located. This directory should contain the tika-pipes-file-system zip and any other required plugins.
      Parameters:
      pluginsDir - the plugins directory
      Returns:
      this config for chaining
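Before passing a directory to this setter, it can help to confirm that it actually contains plugin zips. The sketch below lists the `*.zip` files in a directory and checks for one whose name begins with `tika-pipes-file-system`. The class `PluginsDirCheck`, the default `plugins` directory, and the assumption that the zip's file name starts with that prefix are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical pre-flight check for a plugins directory; not part of Tika. */
final class PluginsDirCheck {

    /** Returns the sorted file names of all *.zip entries directly inside pluginsDir. */
    static List<String> listPluginZips(Path pluginsDir) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> zips = Files.newDirectoryStream(pluginsDir, "*.zip")) {
            for (Path zip : zips) {
                names.add(zip.getFileName().toString());
            }
        }
        Collections.sort(names);
        return names;
    }

    /** Assumes the file-system plugin zip is named with the prefix below. */
    static boolean hasFileSystemPlugin(List<String> zipNames) {
        for (String name : zipNames) {
            if (name.startsWith("tika-pipes-file-system")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : "plugins");
        if (!Files.isDirectory(dir)) {
            System.out.println("Not a directory: " + dir);
            return;
        }
        List<String> zips = listPluginZips(dir);
        System.out.println("Plugin zips: " + zips);
        System.out.println("File-system plugin present: " + hasFileSystemPlugin(zips));
        // With Tika on the classpath: config.setPluginsDir(dir);
    }
}
```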
    • getUserConfigPath

      public Path getUserConfigPath()
      Get the user-provided configuration file path. If set, this config will be merged with the generated configuration.
      Returns:
      the user config path, or null if not set
    • setUserConfigPath

      public PipesForkParserConfig setUserConfigPath(Path userConfigPath)
      Set a user-provided configuration file path. The user's configuration will be merged with the automatically generated configuration for PipesForkParser. User settings are preserved, except that the internal fetcher is always added.
      Parameters:
      userConfigPath - path to the user's configuration file
      Returns:
      this config for chaining