Class ForkParser
- java.lang.Object
-
- org.apache.tika.fork.ForkParser
-
- All Implemented Interfaces:
Closeable, Serializable, AutoCloseable, Parser
public class ForkParser extends Object implements Parser, Closeable
- See Also:
- Serialized Form
-
-
Constructor Summary
Constructors
Constructor - Description
ForkParser()
ForkParser(ClassLoader loader)
ForkParser(ClassLoader loader, Parser parser)
ForkParser(Path tikaBin, ParserFactoryFactory factoryFactory)
If you have a directory containing, say, tika-app.jar and you want the forked process/server to build a parser from it and run it there, so that you can keep all of those dependencies out of your client code, use this constructor.
ForkParser(Path tikaBin, ParserFactoryFactory parserFactoryFactory, ClassLoader classLoader)
EXPERT
-
Method Summary
Modifier and Type - Method - Description
void close()
List<String> getJavaCommandAsList()
Returns the command used to start the forked server process.
int getPoolSize()
Returns the size of the process pool.
Set<MediaType> getSupportedTypes(ParseContext context)
Returns the set of media types supported by this parser when used with the given parse context.
void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
This sends the objects to the server for parsing, and the server, via the proxies, acts on the handler as if it were updating it directly.
void setJavaCommand(List<String> java)
Sets the command used to start the forked server process.
void setMaxFilesProcessedPerServer(int maxFilesProcessedPerClient)
If there is a slowly building memory leak in one of the parsers, it is useful to set a limit on the number of files processed by a server before it is shut down and restarted.
void setPoolSize(int poolSize)
Sets the size of the process pool.
void setServerParseTimeoutMillis(long serverParseTimeoutMillis)
The maximum amount of time allowed for the server to try to parse a file.
void setServerPulseMillis(long serverPulseMillis)
The amount of time in milliseconds that the server should wait before checking to see if the parse has timed out or if the wait has timed out. The default is 5 seconds.
void setServerWaitTimeoutMillis(long serverWaitTimeoutMillis)
The maximum amount of time allowed for the server to wait for a new request to parse a file.
-
-
-
Constructor Detail
-
ForkParser
public ForkParser(Path tikaBin, ParserFactoryFactory factoryFactory)
If you have a directory containing, say, tika-app.jar and you want the forked process/server to build a parser from it and run it there, so that you can keep all of those dependencies out of your client code, use this constructor.
- Parameters:
tikaBin
- directory containing tika-app.jar or a similar full jar that includes tika-core and all desired parsers and dependencies
factoryFactory
-
-
ForkParser
public ForkParser(Path tikaBin, ParserFactoryFactory parserFactoryFactory, ClassLoader classLoader)
EXPERT
- Parameters:
tikaBin
- directory containing tika-app.jar or a similar full jar that includes tika-core and all desired parsers and dependencies
parserFactoryFactory
- the factory to use to generate the parser factory in the forked process/server
classLoader
- the class loader to use for all classes besides the parser in the forked process/server
-
ForkParser
public ForkParser(ClassLoader loader, Parser parser)
- Parameters:
loader
- The ClassLoader to use
parser
- the parser to delegate to. This one cannot be another ForkParser
-
ForkParser
public ForkParser(ClassLoader loader)
-
ForkParser
public ForkParser()
-
-
Method Detail
-
getPoolSize
public int getPoolSize()
Returns the size of the process pool.
- Returns:
- process pool size
-
setPoolSize
public void setPoolSize(int poolSize)
Sets the size of the process pool.
- Parameters:
poolSize
- process pool size
-
setJavaCommand
public void setJavaCommand(List<String> java)
Sets the command used to start the forked server process. The arguments "-jar" and "/path/to/bootstrap.jar", or "-cp" and "/path/to/tika_bin", are appended to the given command when starting the process. The default setting is {"java", "-Xmx32m"}. A defensive copy of the given list is made.
- Parameters:
java
- java command line
-
getJavaCommandAsList
public List<String> getJavaCommandAsList()
Returns the command used to start the forked server process. The returned list is unmodifiable.
- Returns:
- java command line args
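The copy-on-set and read-only-on-get behaviour described for setJavaCommand and getJavaCommandAsList can be illustrated with a small stand-alone sketch. This is not Tika's implementation: the class JavaCommandHolder and its buildCommand helper are hypothetical names, and only the "-jar" variant of the appended arguments is shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in that mimics the documented command handling; not Tika code.
class JavaCommandHolder {
    // The documented default command.
    private List<String> java = Arrays.asList("java", "-Xmx32m");

    void setJavaCommand(List<String> java) {
        // Defensive copy: later changes to the caller's list are not visible here.
        this.java = new ArrayList<>(java);
    }

    List<String> getJavaCommandAsList() {
        // Read-only view: callers cannot modify the stored command through it.
        return Collections.unmodifiableList(java);
    }

    // Assumed helper: appends "-jar" and the jar path to the configured command.
    List<String> buildCommand(String bootstrapJar) {
        List<String> command = new ArrayList<>(java);
        command.add("-jar");
        command.add(bootstrapJar);
        return command;
    }
}
```

With this shape, a caller that wants more heap passes a new list such as ["java", "-Xmx256m"] to the setter, and mutating that list afterwards has no effect on the stored command.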
-
getSupportedTypes
public Set<MediaType> getSupportedTypes(ParseContext context)
Description copied from interface: Parser
Returns the set of media types supported by this parser when used with the given parse context.
- Specified by:
getSupportedTypes in interface Parser
- Parameters:
context
- parse context
- Returns:
- immutable set of media types
-
parse
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
This sends the objects to the server for parsing, and the server, via the proxies, acts on the handler as if it were updating it directly.
If using a RecursiveParserWrapper, there are two options:
- Send in a class that extends RecursiveParserWrapperHandler, and the server will proxy back the data as best it can [0].
- Send in a class that extends AbstractRecursiveParserWrapperHandler, and the server will act on the class but not proxy back the data. For example, if all you want to do is write to disc, extend AbstractRecursiveParserWrapperHandler to write to disc when AbstractRecursiveParserWrapperHandler.endDocument(ContentHandler, Metadata) is called, and the server will take care of the writing via the handler.
NOTE: [0] "the server will proxy back the data as best it can". If the handler implements Serializable and is actually serializable, the server will send it and the Metadata back upon RecursiveParserWrapperHandler.endEmbeddedDocument(ContentHandler, Metadata). If the handler does not implement Serializable, or if a NotSerializableException is thrown during serialization, the server will call ContentHandler#toString() on the ContentHandler, set that value with the TikaCoreProperties.TIKA_CONTENT key, and then serialize and proxy that data back.
- Specified by:
parse in interface Parser
- Parameters:
stream
- the document stream (input)
handler
- handler for the XHTML SAX events (output)
metadata
- document metadata (input and output)
context
- parse context
- Throws:
IOException
SAXException
TikaException
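The "as best it can" rule in the note above amounts to: try Java serialization first, and fall back to the handler's toString() value when that is not possible. The sketch below shows that decision using only the JDK. It is an illustration under assumptions, not Tika's code: the class HandlerFallback and its methods are hypothetical, and a plain Object stands in for the ContentHandler.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical illustration of the documented fallback; not Tika code.
class HandlerFallback {

    // Returns the serialized bytes, or null when the handler cannot be serialized.
    static byte[] trySerialize(Object handler) throws IOException {
        if (!(handler instanceof Serializable)) {
            return null; // does not implement Serializable at all
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(handler);
        } catch (NotSerializableException e) {
            return null; // declared Serializable, but some part of it is not
        }
        return buffer.toByteArray();
    }

    // Describes what would be sent back: the handler itself or its string form.
    static String describeTransfer(Object handler) throws IOException {
        if (trySerialize(handler) != null) {
            return "serialized handler";
        }
        return "string content: " + handler.toString();
    }
}
```

The second case matters in practice: a handler can declare Serializable and still fail at run time because one of its fields is not serializable, which is why the sketch catches NotSerializableException rather than relying on the instanceof check alone.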
-
close
public void close()
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
-
setServerPulseMillis
public void setServerPulseMillis(long serverPulseMillis)
The amount of time in milliseconds that the server should wait before checking to see if the parse has timed out or if the wait has timed out. The default is 5 seconds.
- Parameters:
serverPulseMillis
- milliseconds to sleep before checking if there has been any activity
-
setServerParseTimeoutMillis
public void setServerParseTimeoutMillis(long serverParseTimeoutMillis)
The maximum amount of time allowed for the server to try to parse a file. If more than this time elapses, the server shuts down, and the ForkParser throws an exception.
- Parameters:
serverParseTimeoutMillis
-
-
setServerWaitTimeoutMillis
public void setServerWaitTimeoutMillis(long serverWaitTimeoutMillis)
The maximum amount of time allowed for the server to wait for a new request to parse a file. The server will shut down after this amount of time, and a new server will have to be started by a new client.
- Parameters:
serverWaitTimeoutMillis
-
-
setMaxFilesProcessedPerServer
public void setMaxFilesProcessedPerServer(int maxFilesProcessedPerClient)
If there is a slowly building memory leak in one of the parsers, it is useful to set a limit on the number of files processed by a server before it is shut down and restarted. Default value is -1.
- Parameters:
maxFilesProcessedPerClient
- maximum number of files that a server can handle before the parser shuts down a client and creates a new process. If set to -1, the server is never restarted because of the number of files handled.
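The recycling rule can be pictured as a counter that triggers a restart once the limit is reached. The sketch below is a simplified model, not Tika's implementation: ServerRecycler is a hypothetical class, the restart is only counted rather than performed, and treating every non-positive limit as "never restart" is an assumption (the documentation only specifies -1).

```java
// Hypothetical model of restart-after-N-files; not Tika code.
class ServerRecycler {
    private final int maxFilesProcessedPerServer; // -1 disables recycling
    private int filesProcessedByCurrentServer = 0;
    private int restarts = 0;

    ServerRecycler(int maxFilesProcessedPerServer) {
        this.maxFilesProcessedPerServer = maxFilesProcessedPerServer;
    }

    // Records one processed file and "restarts" the server when the limit is reached.
    void fileProcessed() {
        filesProcessedByCurrentServer++;
        if (maxFilesProcessedPerServer > 0
                && filesProcessedByCurrentServer >= maxFilesProcessedPerServer) {
            restarts++;                         // stands in for shutting down and forking anew
            filesProcessedByCurrentServer = 0;  // the fresh server starts from zero
        }
    }

    int getRestarts() {
        return restarts;
    }
}
```

For example, with a limit of 3, processing 7 files leads to two restarts in this model, while the default of -1 never restarts on file count.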
-
-