Class PipesConfig

java.lang.Object
org.apache.tika.pipes.core.PipesConfig

public class PipesConfig extends Object
  • Field Details

    • DEFAULT_STARTUP_TIMEOUT_MILLIS

      public static final long DEFAULT_STARTUP_TIMEOUT_MILLIS
    • DEFAULT_SHUTDOWN_CLIENT_AFTER_MILLS

      public static final long DEFAULT_SHUTDOWN_CLIENT_AFTER_MILLS
    • DEFAULT_NUM_CLIENTS

      public static final int DEFAULT_NUM_CLIENTS
    • DEFAULT_MAX_FILES_PROCESSED_PER_PROCESS

      public static final int DEFAULT_MAX_FILES_PROCESSED_PER_PROCESS
    • DEFAULT_MAX_WAIT_FOR_CLIENT_MS

      public static final long DEFAULT_MAX_WAIT_FOR_CLIENT_MS
    • DEFAULT_SOCKET_TIMEOUT_MS

      public static final long DEFAULT_SOCKET_TIMEOUT_MS
    • DEFAULT_HEARTBEAT_INTERVAL_MS

      public static final long DEFAULT_HEARTBEAT_INTERVAL_MS
    • DEFAULT_USE_SHARED_SERVER

      public static final boolean DEFAULT_USE_SHARED_SERVER
    • DEFAULT_STALE_FETCHER_TIMEOUT_SECONDS

      public static final int DEFAULT_STALE_FETCHER_TIMEOUT_SECONDS
    • DEFAULT_STALE_FETCHER_DELAY_SECONDS

      public static final int DEFAULT_STALE_FETCHER_DELAY_SECONDS
    • DEFAULT_EMIT_WITHIN_MILLIS

      public static final long DEFAULT_EMIT_WITHIN_MILLIS
    • DEFAULT_EMIT_MAX_ESTIMATED_BYTES

      public static final long DEFAULT_EMIT_MAX_ESTIMATED_BYTES
    • DEFAULT_QUEUE_SIZE

      public static final int DEFAULT_QUEUE_SIZE
    • DEFAULT_NUM_EMITTERS

      public static final int DEFAULT_NUM_EMITTERS
  • Constructor Details

    • PipesConfig

      public PipesConfig()
  • Method Details

    • load

      public static PipesConfig load(TikaJsonConfig tikaJsonConfig) throws IOException, TikaConfigException
      Loads PipesConfig from the "pipes" section of the JSON configuration.

      This configuration is used by both PipesServer (forking process) and AsyncProcessor (async processing). Some fields are specific to each:

      • PipesServer uses: numClients, timeoutMillis, directEmitThresholdBytes, etc.
      • AsyncProcessor uses: emitWithinMillis, queueSize, numEmitters, etc.
      Unused fields in each context are simply ignored.
      Parameters:
      tikaJsonConfig - the JSON configuration to load from
      Returns:
      the loaded PipesConfig, or a new default instance if not found in config
      Throws:
      IOException - if deserialization fails
      TikaConfigException - if configuration is invalid
    • getSocketTimeoutMs

      public long getSocketTimeoutMs()
    • setSocketTimeoutMs

      public void setSocketTimeoutMs(long socketTimeoutMs)
      Socket timeout in milliseconds for reading from the forked process. If no data is received within this time, the connection is considered timed out. This differs from timeoutMillis, which is the parse/processing timeout.
      Parameters:
      socketTimeoutMs - the socket read timeout in milliseconds
    • getHeartbeatIntervalMs

      public long getHeartbeatIntervalMs()
    • setHeartbeatIntervalMs

      public void setHeartbeatIntervalMs(long heartbeatIntervalMs)
      Interval in milliseconds between heartbeat messages sent from server to client. Should be significantly less than socketTimeoutMs so that the client does not time out. WARNING: Setting this >= socketTimeoutMs will cause socket timeouts during normal processing. This setter exists only for testing. We encourage you never to use it.
      Parameters:
      heartbeatIntervalMs - the heartbeat interval in milliseconds
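
      The relationship between the heartbeat interval and the socket timeout can be sketched as a pre-flight check. This is a minimal, self-contained illustration, not Tika code: the class HeartbeatSettingsCheck, its method, and the safety factor of 2 are assumptions made for this example (the documentation above only says "significantly less").

```java
/** Hypothetical helper, not part of Tika: sanity-checks heartbeat vs. socket timeout. */
final class HeartbeatSettingsCheck {

    // Assumed margin for illustration only; the source text gives no concrete factor.
    static final long ASSUMED_SAFETY_FACTOR = 2;

    private HeartbeatSettingsCheck() {
    }

    /**
     * Throws if the heartbeat interval is >= the socket timeout (the case the
     * warning above describes). Otherwise returns true when the interval is at
     * most socketTimeoutMs / ASSUMED_SAFETY_FACTOR.
     */
    static boolean hasComfortableMargin(long heartbeatIntervalMs, long socketTimeoutMs) {
        if (heartbeatIntervalMs <= 0 || socketTimeoutMs <= 0) {
            throw new IllegalArgumentException("both values must be positive");
        }
        if (heartbeatIntervalMs >= socketTimeoutMs) {
            throw new IllegalArgumentException(
                    "heartbeatIntervalMs (" + heartbeatIntervalMs
                            + ") must be less than socketTimeoutMs (" + socketTimeoutMs + ")");
        }
        // Division instead of multiplication avoids long overflow for large intervals.
        return heartbeatIntervalMs <= socketTimeoutMs / ASSUMED_SAFETY_FACTOR;
    }
}
```

      A caller could run such a check on the values it intends to pass to setHeartbeatIntervalMs and setSocketTimeoutMs before applying them.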
    • getShutdownClientAfterMillis

      public long getShutdownClientAfterMillis()
    • setShutdownClientAfterMillis

      public void setShutdownClientAfterMillis(long shutdownClientAfterMillis)
      If the client has been inactive for this many milliseconds, shut it down.
      Parameters:
      shutdownClientAfterMillis - the inactivity period in milliseconds after which the client is shut down
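
      The inactivity rule can be pictured as a simple elapsed-time comparison. The sketch below is a hypothetical stand-in, not Tika's implementation: the class IdleClientCheck, its parameters, and the choice of >= rather than > at the boundary are assumptions.

```java
/** Hypothetical stand-in, not Tika code: decides whether an idle client is due for shutdown. */
final class IdleClientCheck {

    private IdleClientCheck() {
    }

    /**
     * Returns true when the time since the last activity has reached the
     * configured limit. All arguments are in milliseconds.
     */
    static boolean shouldShutDown(long lastActivityMillis, long nowMillis,
                                  long shutdownClientAfterMillis) {
        long idleMillis = nowMillis - lastActivityMillis;
        return idleMillis >= shutdownClientAfterMillis;
    }
}
```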
    • getNumClients

      public int getNumClients()
    • setNumClients

      public void setNumClients(int numClients)
    • setForkedJvmArgs

      public void setForkedJvmArgs(ArrayList<String> jvmArgs)
    • getForkedJvmArgs

      public ArrayList<String> getForkedJvmArgs()
    • setStartupTimeoutMillis

      public void setStartupTimeoutMillis(long startupTimeoutMillis)
    • getMaxFilesProcessedPerProcess

      public int getMaxFilesProcessedPerProcess()
      Restart the forked PipesServer after it has processed this many files to avoid slow-building memory leaks.
      Returns:
      the number of files after which the forked PipesServer is restarted
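
      The restart-after-N-files idea can be sketched with a small counter. This is an illustration under stated assumptions, not Tika's implementation: the class RestartCounter and its method are hypothetical, and the sketch requires a positive limit because the documentation above does not say how other values are treated.

```java
/** Hypothetical stand-in, not Tika code: tracks when a forked process is due for a restart. */
final class RestartCounter {

    private final int maxFilesProcessedPerProcess;
    private int filesProcessed;

    RestartCounter(int maxFilesProcessedPerProcess) {
        if (maxFilesProcessedPerProcess <= 0) {
            // Assumption of this sketch only; Tika's handling of such values is not documented here.
            throw new IllegalArgumentException("limit must be positive in this sketch");
        }
        this.maxFilesProcessedPerProcess = maxFilesProcessedPerProcess;
    }

    /** Records one processed file and returns true when the limit has been reached. */
    boolean recordFileAndCheckRestart() {
        filesProcessed++;
        if (filesProcessed >= maxFilesProcessedPerProcess) {
            filesProcessed = 0; // a freshly started process begins counting from zero
            return true;
        }
        return false;
    }
}
```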
    • setMaxFilesProcessedPerProcess

      public void setMaxFilesProcessedPerProcess(int maxFilesProcessedPerProcess)
    • getJavaPath

      public String getJavaPath()
    • setJavaPath

      public void setJavaPath(String javaPath)
    • getStartupTimeoutMillis

      public long getStartupTimeoutMillis()
    • getEmitStrategy

      public EmitStrategyConfig getEmitStrategy()
      Get the emit strategy configuration.
      Returns:
      the emit strategy configuration
    • setEmitStrategy

      public void setEmitStrategy(EmitStrategyConfig emitStrategy)
      Set the emit strategy configuration.
      Parameters:
      emitStrategy - the emit strategy configuration
    • getSleepOnStartupTimeoutMillis

      public long getSleepOnStartupTimeoutMillis()
    • setSleepOnStartupTimeoutMillis

      public void setSleepOnStartupTimeoutMillis(long sleepOnStartupTimeoutMillis)
    • getStaleFetcherTimeoutSeconds

      public int getStaleFetcherTimeoutSeconds()
    • setStaleFetcherTimeoutSeconds

      public void setStaleFetcherTimeoutSeconds(int staleFetcherTimeoutSeconds)
    • getStaleFetcherDelaySeconds

      public int getStaleFetcherDelaySeconds()
    • setStaleFetcherDelaySeconds

      public void setStaleFetcherDelaySeconds(int staleFetcherDelaySeconds)
    • getMaxWaitForClientMillis

      public long getMaxWaitForClientMillis()
    • setMaxWaitForClientMillis

      public void setMaxWaitForClientMillis(long maxWaitForClientMillis)
    • getEmitWithinMillis

      public long getEmitWithinMillis()
    • setEmitWithinMillis

      public void setEmitWithinMillis(long emitWithinMillis)
      If nothing has been emitted within this amount of time and the getEmitMaxEstimatedBytes() threshold has not yet been reached, emit whatever is in the emit queue.
      Parameters:
      emitWithinMillis - time in milliseconds
    • getEmitMaxEstimatedBytes

      public long getEmitMaxEstimatedBytes()
      When the emit queue hits this estimated size (sum of estimated extract sizes), emit the batch.
      Returns:
      the maximum estimated bytes before emitting
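
      Taken together, emitWithinMillis and emitMaxEstimatedBytes describe a "flush on size or on time" rule. The sketch below shows one way to express that rule. It is a hypothetical stand-in, not Tika's emitter code: the class EmitBatchPolicy, its method names, and the decision to never emit an empty queue are assumptions.

```java
/** Hypothetical stand-in, not Tika code: decides when a queued batch should be emitted. */
final class EmitBatchPolicy {

    private final long emitWithinMillis;
    private final long emitMaxEstimatedBytes;

    EmitBatchPolicy(long emitWithinMillis, long emitMaxEstimatedBytes) {
        if (emitWithinMillis <= 0 || emitMaxEstimatedBytes <= 0) {
            throw new IllegalArgumentException("both thresholds must be positive");
        }
        this.emitWithinMillis = emitWithinMillis;
        this.emitMaxEstimatedBytes = emitMaxEstimatedBytes;
    }

    /**
     * @param queuedEstimatedBytes sum of the estimated sizes of the queued extracts
     * @param millisSinceLastEmit  time elapsed since the last emit
     */
    boolean shouldEmit(long queuedEstimatedBytes, long millisSinceLastEmit) {
        if (queuedEstimatedBytes <= 0) {
            return false; // nothing queued, so there is nothing to emit
        }
        if (queuedEstimatedBytes >= emitMaxEstimatedBytes) {
            return true; // size threshold reached
        }
        return millisSinceLastEmit >= emitWithinMillis; // time threshold reached
    }
}
```

      With a small batch in the queue, the time threshold bounds how long data can wait; with a steady stream of large extracts, the size threshold triggers first.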
    • setEmitMaxEstimatedBytes

      public void setEmitMaxEstimatedBytes(long emitMaxEstimatedBytes)
    • getQueueSize

      public int getQueueSize()
      Size of the FetchEmitTuple queue.
      Returns:
      the queue size
    • setQueueSize

      public void setQueueSize(int queueSize)
    • getNumEmitters

      public int getNumEmitters()
      Number of emitters
      Returns:
      the number of emitters
    • setNumEmitters

      public void setNumEmitters(int numEmitters)
    • isEmitIntermediateResults

      public boolean isEmitIntermediateResults()
    • setEmitIntermediateResults

      public void setEmitIntermediateResults(boolean emitIntermediateResults)
    • isStopOnlyOnFatal

      public boolean isStopOnlyOnFatal()
      When true, only stop processing on fatal errors (FAILED_TO_INITIALIZE). When false (default), also stop on initialization failures (FETCHER_INITIALIZATION_EXCEPTION, EMITTER_INITIALIZATION_EXCEPTION, CLIENT_UNAVAILABLE_WITHIN_MS) and not-found errors (FETCHER_NOT_FOUND, EMITTER_NOT_FOUND).

      Use true for server mode (tika-server /pipes, /async) where different requests may use different fetchers/emitters - a bad request shouldn't kill the server. Use false (default) for CLI batch mode where all tasks typically use the same fetcher/emitter configuration - no point continuing if configuration is wrong.

      Returns:
      true if only fatal errors should stop processing
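
      The two modes described above amount to a small decision table. The sketch below encodes it using the status names listed in this section. It is a hypothetical illustration, not Tika's code: the class StopPolicySketch, its nested enum, and the extra OTHER constant (standing in for any status not named above) are assumptions.

```java
/** Hypothetical stand-in, not Tika code: mirrors the stop rules described for stopOnlyOnFatal. */
final class StopPolicySketch {

    /** Status names taken from the text above; OTHER is an assumed placeholder. */
    enum Status {
        FAILED_TO_INITIALIZE,
        FETCHER_INITIALIZATION_EXCEPTION,
        EMITTER_INITIALIZATION_EXCEPTION,
        CLIENT_UNAVAILABLE_WITHIN_MS,
        FETCHER_NOT_FOUND,
        EMITTER_NOT_FOUND,
        OTHER
    }

    private StopPolicySketch() {
    }

    static boolean shouldStop(Status status, boolean stopOnlyOnFatal) {
        switch (status) {
            case FAILED_TO_INITIALIZE:
                return true; // fatal: stops processing in both modes
            case FETCHER_INITIALIZATION_EXCEPTION:
            case EMITTER_INITIALIZATION_EXCEPTION:
            case CLIENT_UNAVAILABLE_WITHIN_MS:
            case FETCHER_NOT_FOUND:
            case EMITTER_NOT_FOUND:
                return !stopOnlyOnFatal; // stops only in the default (batch) mode
            default:
                return false;
        }
    }
}
```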
    • setStopOnlyOnFatal

      public void setStopOnlyOnFatal(boolean stopOnlyOnFatal)
    • getParseMode

      public ParseMode getParseMode()
      Gets the default parse mode for how embedded documents are handled.
      Returns:
      the default parse mode
    • setParseMode

      public void setParseMode(ParseMode parseMode)
      Sets the default parse mode for how embedded documents are handled. This can be overridden per-file via ParseContext.
      Parameters:
      parseMode - the parse mode (RMETA or CONCATENATE)
    • setParseMode

      public void setParseMode(String parseMode)
      Sets the default parse mode from a string.
      Parameters:
      parseMode - the parse mode name (rmeta or concatenate)
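
      Mapping a mode name such as rmeta or concatenate onto an enum constant could look like the sketch below. It uses a stand-in enum, ParseModeSketch, rather than Tika's ParseMode. The case-insensitive matching and the exception thrown for unknown names are assumptions of this example, not documented behavior of setParseMode(String).

```java
/** Hypothetical stand-in for illustration; this is not Tika's ParseMode enum. */
enum ParseModeSketch {
    RMETA,
    CONCATENATE;

    /** Parses a mode name, ignoring case and surrounding whitespace (assumed behavior). */
    static ParseModeSketch fromString(String name) {
        if (name == null) {
            throw new IllegalArgumentException("parse mode must not be null");
        }
        String normalized = name.trim().toUpperCase(java.util.Locale.ROOT);
        try {
            return ParseModeSketch.valueOf(normalized);
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException(
                    "unknown parse mode '" + name + "' (expected rmeta or concatenate)", e);
        }
    }
}
```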
    • getOnParseException

      public FetchEmitTuple.ON_PARSE_EXCEPTION getOnParseException()
      Gets the default behavior when a parse exception occurs.
      Returns:
      the parse exception behavior
    • setOnParseException

      public void setOnParseException(FetchEmitTuple.ON_PARSE_EXCEPTION onParseException)
      Sets the default behavior when a parse exception occurs.
      Parameters:
      onParseException - the parse exception behavior
    • getConfigStoreType

      public String getConfigStoreType()
    • setConfigStoreType

      public void setConfigStoreType(String configStoreType)
    • getConfigStoreParams

      public String getConfigStoreParams()
    • setConfigStoreParams

      public void setConfigStoreParams(String configStoreParams)
    • getTempDirectory

      public String getTempDirectory()
      Gets the directory for temporary files during pipes-based parsing.
      Returns:
      the temp directory path, or null to use system default
    • setTempDirectory

      public void setTempDirectory(String tempDirectory)
      Sets the directory for temporary files during pipes-based parsing. If not set, the system default temp directory is used. Consider using a RAM-backed filesystem (e.g., /dev/shm or another tmpfs mount) for better performance.
      Parameters:
      tempDirectory - the temp directory path, or null to use system default
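
      The "null means system default" convention maps directly onto the two Files.createTempFile overloads in the JDK. The sketch below shows that pattern. The class TempFileSketch and the file prefix and suffix are hypothetical, and this is not how Tika is known to create its temp files; it only illustrates the convention.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper, not Tika code: honors an optional temp directory setting. */
final class TempFileSketch {

    private TempFileSketch() {
    }

    /**
     * Creates a temp file in the given directory, or in the system default
     * temp directory when tempDirectory is null. The directory must already exist.
     */
    static Path createTempFile(String tempDirectory) throws IOException {
        if (tempDirectory == null) {
            return Files.createTempFile("pipes-sketch-", ".tmp");
        }
        return Files.createTempFile(Paths.get(tempDirectory), "pipes-sketch-", ".tmp");
    }
}
```

      Passing a path such as /dev/shm, where available, would place the file on a RAM-backed filesystem, as suggested above.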
    • isUseSharedServer

      public boolean isUseSharedServer()
      Returns whether shared server mode is enabled.
      Returns:
      true if shared server mode is enabled
    • setUseSharedServer

      public void setUseSharedServer(boolean useSharedServer)
      Sets whether to use shared server mode.

      When true, multiple PipesClients connect to a single shared PipesServer process instead of each client having its own dedicated server. This reduces memory overhead but sacrifices isolation: one crash affects all in-flight requests.

      Not recommended for production. See the Tika Pipes documentation for limitations and guidance.

      Parameters:
      useSharedServer - true to enable shared server mode, false for per-client mode (default)