Package org.apache.tika.pipes.core
Class PipesConfig
java.lang.Object
org.apache.tika.pipes.core.PipesConfig
-
Field Summary
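Fields
Modifier and Type, Field
static final long DEFAULT_EMIT_MAX_ESTIMATED_BYTES
static final long DEFAULT_EMIT_WITHIN_MILLIS
static final long DEFAULT_HEARTBEAT_INTERVAL_MS
static final int DEFAULT_MAX_FILES_PROCESSED_PER_PROCESS
static final long DEFAULT_MAX_WAIT_FOR_CLIENT_MS
static final int DEFAULT_NUM_CLIENTS
static final int DEFAULT_NUM_EMITTERS
static final int DEFAULT_QUEUE_SIZE
static final long DEFAULT_SHUTDOWN_CLIENT_AFTER_MILLS
static final long DEFAULT_SOCKET_TIMEOUT_MS
static final int DEFAULT_STALE_FETCHER_DELAY_SECONDS
static final int DEFAULT_STALE_FETCHER_TIMEOUT_SECONDS
static final long DEFAULT_STARTUP_TIMEOUT_MILLIS
static final boolean DEFAULT_USE_SHARED_SERVER
-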
Constructor Summary
Constructors
PipesConfig()
-
Method Summary
Modifier and Type, Method: Description
getConfigStoreParams()
getConfigStoreType()
long getEmitMaxEstimatedBytes(): When the emit queue hits this estimated size (sum of estimated extract sizes), emit the batch.
getEmitStrategy(): Get the emit strategy configuration.
long getEmitWithinMillis()
getForkedJvmArgs()
long getHeartbeatIntervalMs()
getJavaPath()
int getMaxFilesProcessedPerProcess(): Restart the forked PipesServer after it has processed this many files to avoid slow-building memory leaks.
long getMaxWaitForClientMillis()
int getNumClients()
int getNumEmitters(): Number of emitters
getOnParseException(): Gets the default behavior when a parse exception occurs.
getParseMode(): Gets the default parse mode for how embedded documents are handled.
int getQueueSize(): FetchEmitTuple queue size
long getShutdownClientAfterMillis()
long getSleepOnStartupTimeoutMillis()
long getSocketTimeoutMs()
int getStaleFetcherDelaySeconds()
int getStaleFetcherTimeoutSeconds()
long getStartupTimeoutMillis()
getTempDirectory(): Gets the directory for temporary files during pipes-based parsing.
boolean isEmitIntermediateResults()
boolean isStopOnlyOnFatal(): When true, only stop processing on fatal errors (FAILED_TO_INITIALIZE).
boolean isUseSharedServer(): Returns whether shared server mode is enabled.
static PipesConfig load(TikaJsonConfig tikaJsonConfig): Loads PipesConfig from the "pipes" section of the JSON configuration.
void setConfigStoreParams(String configStoreParams)
void setConfigStoreType(String configStoreType)
void setEmitIntermediateResults(boolean emitIntermediateResults)
void setEmitMaxEstimatedBytes(long emitMaxEstimatedBytes)
void setEmitStrategy(EmitStrategyConfig emitStrategy): Set the emit strategy configuration.
void setEmitWithinMillis(long emitWithinMillis): If nothing has been emitted in this amount of time and the getEmitMaxEstimatedBytes() threshold has not been reached yet, emit what's in the emit queue.
void setForkedJvmArgs(ArrayList<String> jvmArgs)
void setHeartbeatIntervalMs(long heartbeatIntervalMs): Interval in milliseconds between heartbeat messages sent from server to client.
void setJavaPath(String javaPath)
void setMaxFilesProcessedPerProcess(int maxFilesProcessedPerProcess)
void setMaxWaitForClientMillis(long maxWaitForClientMillis)
void setNumClients(int numClients)
void setNumEmitters(int numEmitters)
void setOnParseException(FetchEmitTuple.ON_PARSE_EXCEPTION onParseException): Sets the default behavior when a parse exception occurs.
void setParseMode(String parseMode): Sets the default parse mode from a string.
void setParseMode(ParseMode parseMode): Sets the default parse mode for how embedded documents are handled.
void setQueueSize(int queueSize)
void setShutdownClientAfterMillis(long shutdownClientAfterMillis): If the client has been inactive for this many milliseconds, shut it down.
void setSleepOnStartupTimeoutMillis(long sleepOnStartupTimeoutMillis)
void setSocketTimeoutMs(long socketTimeoutMs): Socket timeout in milliseconds for reading from the forked process.
void setStaleFetcherDelaySeconds(int staleFetcherDelaySeconds)
void setStaleFetcherTimeoutSeconds(int staleFetcherTimeoutSeconds)
void setStartupTimeoutMillis(long startupTimeoutMillis)
void setStopOnlyOnFatal(boolean stopOnlyOnFatal)
void setTempDirectory(String tempDirectory): Sets the directory for temporary files during pipes-based parsing.
void setUseSharedServer(boolean useSharedServer): Sets whether to use shared server mode.
-
Field Details
-
DEFAULT_STARTUP_TIMEOUT_MILLIS
public static final long DEFAULT_STARTUP_TIMEOUT_MILLIS
-
DEFAULT_SHUTDOWN_CLIENT_AFTER_MILLS
public static final long DEFAULT_SHUTDOWN_CLIENT_AFTER_MILLS
-
DEFAULT_NUM_CLIENTS
public static final int DEFAULT_NUM_CLIENTS
-
DEFAULT_MAX_FILES_PROCESSED_PER_PROCESS
public static final int DEFAULT_MAX_FILES_PROCESSED_PER_PROCESS
-
DEFAULT_MAX_WAIT_FOR_CLIENT_MS
public static final long DEFAULT_MAX_WAIT_FOR_CLIENT_MS
-
DEFAULT_SOCKET_TIMEOUT_MS
public static final long DEFAULT_SOCKET_TIMEOUT_MS
-
DEFAULT_HEARTBEAT_INTERVAL_MS
public static final long DEFAULT_HEARTBEAT_INTERVAL_MS
-
DEFAULT_USE_SHARED_SERVER
public static final boolean DEFAULT_USE_SHARED_SERVER
-
DEFAULT_STALE_FETCHER_TIMEOUT_SECONDS
public static final int DEFAULT_STALE_FETCHER_TIMEOUT_SECONDS
-
DEFAULT_STALE_FETCHER_DELAY_SECONDS
public static final int DEFAULT_STALE_FETCHER_DELAY_SECONDS
-
DEFAULT_EMIT_WITHIN_MILLIS
public static final long DEFAULT_EMIT_WITHIN_MILLIS
-
DEFAULT_EMIT_MAX_ESTIMATED_BYTES
public static final long DEFAULT_EMIT_MAX_ESTIMATED_BYTES
-
DEFAULT_QUEUE_SIZE
public static final int DEFAULT_QUEUE_SIZE
-
DEFAULT_NUM_EMITTERS
public static final int DEFAULT_NUM_EMITTERS
-
-
Constructor Details
-
PipesConfig
public PipesConfig()
-
-
Method Details
-
load
public static PipesConfig load(TikaJsonConfig tikaJsonConfig) throws IOException, TikaConfigException
Loads PipesConfig from the "pipes" section of the JSON configuration.
This configuration is used by both PipesServer (forking process) and AsyncProcessor (async processing). Some fields are specific to each:
- PipesServer uses: numClients, timeoutMillis, directEmitThresholdBytes, etc.
- AsyncProcessor uses: emitWithinMillis, queueSize, numEmitters, etc.
Parameters:
tikaJsonConfig - the JSON configuration to load from
Returns:
the loaded PipesConfig, or a new default instance if not found in config
Throws:
IOException - if deserialization fails
TikaConfigException - if configuration is invalid
-
getSocketTimeoutMs
public long getSocketTimeoutMs() -
setSocketTimeoutMs
public void setSocketTimeoutMs(long socketTimeoutMs)
Socket timeout in milliseconds for reading from the forked process. If no data is received within this time, the connection is considered timed out. This is different from timeoutMillis, which is the parse/processing timeout.
Parameters:
socketTimeoutMs - the socket timeout in milliseconds
-
getHeartbeatIntervalMs
public long getHeartbeatIntervalMs() -
setHeartbeatIntervalMs
public void setHeartbeatIntervalMs(long heartbeatIntervalMs)
Interval in milliseconds between heartbeat messages sent from server to client. Should be significantly less than socketTimeoutMs so that the client doesn't time out.
WARNING: Setting this >= socketTimeoutMs will cause socket timeouts during normal processing. This exists only for testing. We encourage you never to use it.
Parameters:
heartbeatIntervalMs - the heartbeat interval in milliseconds
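The relationship between the two timing values can be checked before any configuration is applied. The following is a minimal sketch, assuming a standalone helper: `HeartbeatCheck` and its methods are hypothetical names and not part of Tika, and the numbers are illustrative rather than Tika's defaults. It enforces only "strictly less than", which is weaker than the "significantly less" recommended above.

```java
// Hypothetical helper for illustration; not part of Tika.
final class HeartbeatCheck {
    private HeartbeatCheck() {
    }

    /** True when the interval is positive and strictly below the socket timeout. */
    static boolean isSafe(long heartbeatIntervalMs, long socketTimeoutMs) {
        return heartbeatIntervalMs > 0 && heartbeatIntervalMs < socketTimeoutMs;
    }

    /** Throws if the interval is non-positive or not below the socket timeout. */
    static void validate(long heartbeatIntervalMs, long socketTimeoutMs) {
        if (!isSafe(heartbeatIntervalMs, socketTimeoutMs)) {
            throw new IllegalArgumentException(
                    "heartbeatIntervalMs (" + heartbeatIntervalMs
                            + ") must be positive and less than socketTimeoutMs ("
                            + socketTimeoutMs + ")");
        }
    }

    public static void main(String[] args) {
        // Illustrative values only; these are not Tika's defaults.
        validate(1_000L, 60_000L);
        System.out.println(isSafe(60_000L, 60_000L)); // prints false
    }
}
```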
-
getShutdownClientAfterMillis
public long getShutdownClientAfterMillis() -
setShutdownClientAfterMillis
public void setShutdownClientAfterMillis(long shutdownClientAfterMillis)
If the client has been inactive for this many milliseconds, shut it down.
Parameters:
shutdownClientAfterMillis - the inactivity period in milliseconds
-
getNumClients
public int getNumClients() -
setNumClients
public void setNumClients(int numClients) -
setForkedJvmArgs
public void setForkedJvmArgs(ArrayList<String> jvmArgs)
-
getForkedJvmArgs
-
setStartupTimeoutMillis
public void setStartupTimeoutMillis(long startupTimeoutMillis) -
getMaxFilesProcessedPerProcess
public int getMaxFilesProcessedPerProcess()
Restart the forked PipesServer after it has processed this many files to avoid slow-building memory leaks.
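The restart rule amounts to a per-process file counter. The sketch below is an assumption about how such a counter could look, not Tika's implementation: `RestartCounter` is a hypothetical class, it requires a positive limit, and it signals a restart on exactly the Nth file.

```java
// Hypothetical counter for illustration; not part of Tika.
final class RestartCounter {
    private final int maxFilesPerProcess;
    private int filesProcessed;

    RestartCounter(int maxFilesPerProcess) {
        // Assumption of this sketch: the limit must be positive.
        if (maxFilesPerProcess <= 0) {
            throw new IllegalArgumentException("maxFilesPerProcess must be positive");
        }
        this.maxFilesPerProcess = maxFilesPerProcess;
    }

    /** Records one processed file; returns true when the process should be restarted. */
    boolean recordFileAndCheckRestart() {
        filesProcessed++;
        if (filesProcessed >= maxFilesPerProcess) {
            filesProcessed = 0; // the replacement process starts from zero
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        RestartCounter counter = new RestartCounter(3);
        for (int i = 1; i <= 4; i++) {
            System.out.println("file " + i + " restart=" + counter.recordFileAndCheckRestart());
        }
    }
}
```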
-
setMaxFilesProcessedPerProcess
public void setMaxFilesProcessedPerProcess(int maxFilesProcessedPerProcess) -
getJavaPath
-
setJavaPath
public void setJavaPath(String javaPath)
-
getStartupTimeoutMillis
public long getStartupTimeoutMillis() -
getEmitStrategy
Get the emit strategy configuration.
Returns:
the emit strategy configuration
-
setEmitStrategy
public void setEmitStrategy(EmitStrategyConfig emitStrategy)
Set the emit strategy configuration.
Parameters:
emitStrategy - the emit strategy configuration
-
getSleepOnStartupTimeoutMillis
public long getSleepOnStartupTimeoutMillis() -
setSleepOnStartupTimeoutMillis
public void setSleepOnStartupTimeoutMillis(long sleepOnStartupTimeoutMillis) -
getStaleFetcherTimeoutSeconds
public int getStaleFetcherTimeoutSeconds() -
setStaleFetcherTimeoutSeconds
public void setStaleFetcherTimeoutSeconds(int staleFetcherTimeoutSeconds) -
getStaleFetcherDelaySeconds
public int getStaleFetcherDelaySeconds() -
setStaleFetcherDelaySeconds
public void setStaleFetcherDelaySeconds(int staleFetcherDelaySeconds) -
getMaxWaitForClientMillis
public long getMaxWaitForClientMillis() -
setMaxWaitForClientMillis
public void setMaxWaitForClientMillis(long maxWaitForClientMillis) -
getEmitWithinMillis
public long getEmitWithinMillis() -
setEmitWithinMillis
public void setEmitWithinMillis(long emitWithinMillis)
If nothing has been emitted in this amount of time and the getEmitMaxEstimatedBytes() threshold has not been reached yet, emit what's in the emit queue.
Parameters:
emitWithinMillis - time in milliseconds
-
getEmitMaxEstimatedBytes
public long getEmitMaxEstimatedBytes()
When the emit queue hits this estimated size (sum of estimated extract sizes), emit the batch.
Returns:
the maximum estimated bytes before emitting
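Taken together, emitWithinMillis and emitMaxEstimatedBytes describe a size-or-time batching rule. The sketch below illustrates that rule under stated assumptions: `EmitPolicy` is a hypothetical class, not Tika's emitter code, and skipping the emit when the queue is empty is a choice made for this example.

```java
// Hypothetical batching rule for illustration; not part of Tika.
final class EmitPolicy {
    private final long emitWithinMillis;
    private final long emitMaxEstimatedBytes;

    EmitPolicy(long emitWithinMillis, long emitMaxEstimatedBytes) {
        this.emitWithinMillis = emitWithinMillis;
        this.emitMaxEstimatedBytes = emitMaxEstimatedBytes;
    }

    /**
     * Emit when the queued batch is large enough, or when enough time has
     * passed since the last emit and something is waiting in the queue.
     */
    boolean shouldEmit(long queuedEstimatedBytes, long millisSinceLastEmit) {
        if (queuedEstimatedBytes <= 0) {
            return false; // assumption of this sketch: never emit an empty batch
        }
        return queuedEstimatedBytes >= emitMaxEstimatedBytes
                || millisSinceLastEmit >= emitWithinMillis;
    }

    public static void main(String[] args) {
        // Illustrative thresholds only; these are not Tika's defaults.
        EmitPolicy policy = new EmitPolicy(10_000L, 1_000_000L);
        System.out.println(policy.shouldEmit(2_000_000L, 5L));  // size threshold reached
        System.out.println(policy.shouldEmit(500L, 20_000L));   // time threshold reached
        System.out.println(policy.shouldEmit(500L, 5L));        // neither reached
    }
}
```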
-
setEmitMaxEstimatedBytes
public void setEmitMaxEstimatedBytes(long emitMaxEstimatedBytes) -
getQueueSize
public int getQueueSize()
FetchEmitTuple queue size
Returns:
the queue size
-
setQueueSize
public void setQueueSize(int queueSize) -
getNumEmitters
public int getNumEmitters()
Number of emitters
Returns:
the number of emitters
-
setNumEmitters
public void setNumEmitters(int numEmitters) -
isEmitIntermediateResults
public boolean isEmitIntermediateResults() -
setEmitIntermediateResults
public void setEmitIntermediateResults(boolean emitIntermediateResults) -
isStopOnlyOnFatal
public boolean isStopOnlyOnFatal()
When true, only stop processing on fatal errors (FAILED_TO_INITIALIZE). When false (default), also stop on initialization failures (FETCHER_INITIALIZATION_EXCEPTION, EMITTER_INITIALIZATION_EXCEPTION, CLIENT_UNAVAILABLE_WITHIN_MS) and not-found errors (FETCHER_NOT_FOUND, EMITTER_NOT_FOUND).
Use true for server mode (tika-server /pipes, /async) where different requests may use different fetchers/emitters: a bad request shouldn't kill the server. Use false (default) for CLI batch mode where all tasks typically use the same fetcher/emitter configuration: there is no point continuing if the configuration is wrong.
Returns:
true if only fatal errors should stop processing
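The two modes described above can be expressed as a small decision function. This is a sketch under assumptions: `StopPolicy` and its nested `Status` enum are hypothetical stand-ins that reuse the status names from the description, and `OTHER` is a placeholder for every status not listed there. It is not Tika's actual result-status type.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical decision helper for illustration; not part of Tika.
final class StopPolicy {
    /** Stand-in enum; OTHER is a placeholder for any status not named in the docs above. */
    enum Status {
        FAILED_TO_INITIALIZE,
        FETCHER_INITIALIZATION_EXCEPTION,
        EMITTER_INITIALIZATION_EXCEPTION,
        CLIENT_UNAVAILABLE_WITHIN_MS,
        FETCHER_NOT_FOUND,
        EMITTER_NOT_FOUND,
        OTHER
    }

    // Always stops processing, regardless of stopOnlyOnFatal.
    private static final Set<Status> FATAL = EnumSet.of(Status.FAILED_TO_INITIALIZE);

    // Stops processing only when stopOnlyOnFatal is false (the default).
    private static final Set<Status> STOP_UNLESS_FATAL_ONLY = EnumSet.of(
            Status.FETCHER_INITIALIZATION_EXCEPTION,
            Status.EMITTER_INITIALIZATION_EXCEPTION,
            Status.CLIENT_UNAVAILABLE_WITHIN_MS,
            Status.FETCHER_NOT_FOUND,
            Status.EMITTER_NOT_FOUND);

    private StopPolicy() {
    }

    static boolean shouldStop(Status status, boolean stopOnlyOnFatal) {
        if (FATAL.contains(status)) {
            return true;
        }
        return !stopOnlyOnFatal && STOP_UNLESS_FATAL_ONLY.contains(status);
    }

    public static void main(String[] args) {
        // Server mode: a missing fetcher on one request does not stop processing.
        System.out.println(shouldStop(Status.FETCHER_NOT_FOUND, true));  // prints false
        // CLI batch mode: the same status stops the run.
        System.out.println(shouldStop(Status.FETCHER_NOT_FOUND, false)); // prints true
    }
}
```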
-
setStopOnlyOnFatal
public void setStopOnlyOnFatal(boolean stopOnlyOnFatal) -
getParseMode
Gets the default parse mode for how embedded documents are handled.
Returns:
the default parse mode
-
setParseMode
public void setParseMode(ParseMode parseMode)
Sets the default parse mode for how embedded documents are handled. This can be overridden per-file via ParseContext.
Parameters:
parseMode - the parse mode (RMETA or CONCATENATE)
-
setParseMode
public void setParseMode(String parseMode)
Sets the default parse mode from a string.
Parameters:
parseMode - the parse mode name (rmeta or concatenate)
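Mapping a mode name to an enum constant could look like the sketch below. `ParseModeNames` and its nested `Mode` enum are hypothetical stand-ins for Tika's ParseMode, and the case-insensitive, whitespace-trimming lookup is an assumption of this example rather than documented behavior.

```java
import java.util.Locale;

// Hypothetical name-to-enum mapping for illustration; not part of Tika.
final class ParseModeNames {
    /** Stand-in for Tika's ParseMode, using the two values named above. */
    enum Mode { RMETA, CONCATENATE }

    private ParseModeNames() {
    }

    /** Assumption of this sketch: names are matched case-insensitively after trimming. */
    static Mode fromString(String name) {
        if (name == null) {
            throw new IllegalArgumentException("parse mode must not be null");
        }
        try {
            return Mode.valueOf(name.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException(
                    "Unknown parse mode '" + name + "'; expected rmeta or concatenate", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fromString("rmeta"));       // prints RMETA
        System.out.println(fromString("Concatenate")); // prints CONCATENATE
    }
}
```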
-
getOnParseException
Gets the default behavior when a parse exception occurs.
Returns:
the parse exception behavior
-
setOnParseException
public void setOnParseException(FetchEmitTuple.ON_PARSE_EXCEPTION onParseException)
Sets the default behavior when a parse exception occurs.
Parameters:
onParseException - the parse exception behavior
-
getConfigStoreType
-
setConfigStoreType
public void setConfigStoreType(String configStoreType)
-
getConfigStoreParams
-
setConfigStoreParams
public void setConfigStoreParams(String configStoreParams)
-
getTempDirectory
Gets the directory for temporary files during pipes-based parsing.
Returns:
the temp directory path, or null to use system default
-
setTempDirectory
public void setTempDirectory(String tempDirectory)
Sets the directory for temporary files during pipes-based parsing. If not set, the system default temp directory will be used. Consider using a RAM-backed filesystem (e.g., /dev/shm or another tmpfs mount) for better performance.
Parameters:
tempDirectory - the temp directory path, or null to use system default
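The null-means-default convention can be resolved in one place by the caller. The sketch below assumes a hypothetical helper, `TempDirResolver`, which is not part of Tika. It reads the JVM's standard `java.io.tmpdir` system property when no directory is configured.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical resolver for illustration; not part of Tika.
final class TempDirResolver {
    private TempDirResolver() {
    }

    /** Returns the configured directory, or the JVM default temp directory when null. */
    static Path resolve(String configuredTempDirectory) {
        String dir = configuredTempDirectory != null
                ? configuredTempDirectory
                : System.getProperty("java.io.tmpdir");
        return Paths.get(dir);
    }

    public static void main(String[] args) {
        System.out.println(resolve(null));       // system default temp directory
        System.out.println(resolve("/dev/shm")); // RAM-backed directory on many Linux systems
    }
}
```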
-