public class ForkParser extends Object implements Parser, Closeable
  • Constructor Details

    • ForkParser

      public ForkParser(Path tikaBin, ParserFactoryFactory factoryFactory)
      If you have a directory containing, say, tika-app.jar and you want the forked process/server to build a parser and run it from there, so that you can keep all of those dependencies out of your client code, use this constructor.
      tikaBin - directory containing the tika-app.jar or similar -- full jar including tika-core and all desired parsers and dependencies
      factoryFactory - the factory to use to generate the parser factory in the forked process/server
    • ForkParser

      public ForkParser(Path tikaBin, ParserFactoryFactory parserFactoryFactory, ClassLoader classLoader)
      tikaBin - directory containing the tika-app.jar or similar -- full jar including tika-core and all desired parsers and dependencies
      parserFactoryFactory - the factory to use to generate the parser factory in the forked process/server
      classLoader - to use for all classes besides the parser in the forked process/server
    • ForkParser

      public ForkParser(ClassLoader loader, Parser parser)
      loader - The ClassLoader to use
      parser - the parser to delegate to. This cannot be another ForkParser.
    • ForkParser

      public ForkParser(ClassLoader loader)
    • ForkParser

      public ForkParser()
  • Method Details

    • getPoolSize

      public int getPoolSize()
      Returns the size of the process pool.
      process pool size
    • setPoolSize

      public void setPoolSize(int poolSize)
      Sets the size of the process pool.
      poolSize - process pool size
    • setJavaCommand

      public void setJavaCommand(List<String> java)
      Sets the command used to start the forked server process. The arguments "-jar" and "/path/to/bootstrap.jar" or "-cp" and "/path/to/tika_bin" are appended to the given command when starting the process. The default setting is {"java", "-Xmx32m"}.

      Creates a defensive copy.

      java - java command line
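
As a rough illustration of the description above, the sketch below models how a base command might be extended with the documented trailing arguments. It is not the Tika implementation: the class `ForkCommandSketch` and both method names are hypothetical, and only the arguments named in the description are modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative model only; the class and method names are hypothetical.
class ForkCommandSketch {

    // Returns a new list: the base command followed by "-jar <bootstrapJar>".
    static List<String> withBootstrapJar(List<String> javaCommand, String bootstrapJar) {
        List<String> full = new ArrayList<>(javaCommand); // the caller's list is left untouched
        full.add("-jar");
        full.add(bootstrapJar);
        return full;
    }

    // Returns a new list: the base command followed by "-cp <tikaBin>".
    static List<String> withClasspath(List<String> javaCommand, String tikaBin) {
        List<String> full = new ArrayList<>(javaCommand);
        full.add("-cp");
        full.add(tikaBin);
        return full;
    }

    public static void main(String[] args) {
        List<String> base = Arrays.asList("java", "-Xmx32m"); // the documented default
        System.out.println(withBootstrapJar(base, "/path/to/bootstrap.jar"));
        System.out.println(withClasspath(base, "/path/to/tika_bin"));
    }
}
```

Because extra arguments are appended after the given command, JVM options such as a larger heap setting belong in the list passed to setJavaCommand.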
    • getJavaCommandAsList

      public List<String> getJavaCommandAsList()
      Returns the command used to start the forked server process.

      Returned list is unmodifiable.

      java command line args
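
The "defensive copy" and "unmodifiable" notes on this setter/getter pair can be pictured with a small stand-in class. This is a minimal sketch under the assumption that the command is held in a plain list field; `JavaCommandHolder` is a hypothetical name and the `-Xmx256m` value in the usage note is only an example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the setter/getter pair; not the Tika source.
class JavaCommandHolder {
    private List<String> java = Arrays.asList("java", "-Xmx32m"); // documented default

    void setJavaCommand(List<String> java) {
        // Defensive copy: later changes to the caller's list do not leak in.
        this.java = new ArrayList<>(java);
    }

    List<String> getJavaCommandAsList() {
        // Read-only view: callers cannot alter the stored command through it.
        return Collections.unmodifiableList(java);
    }
}
```

With this model, adding "-verbose" to a list after passing it to setJavaCommand leaves the stored command unchanged, and calling add on the returned list throws UnsupportedOperationException.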
    • getSupportedTypes

      public Set<MediaType> getSupportedTypes(ParseContext context)
      Returns the set of media types supported by this parser when used with the given parse context.
      context - parse context
      immutable set of media types
    • parse

      public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
      Sends the objects to the server for parsing. The server, via proxies, acts on the handler as if it were updating it directly.

      If using a RecursiveParserWrapper, there are two options:

      1. Send in a class that extends RecursiveParserWrapperHandler, and the server will proxy back the data as best it can[0].
      2. Send in a class that extends AbstractRecursiveParserWrapperHandler, and the server will act on the class but not proxy back the data. For example, if all you want to do is write to disk, extend AbstractRecursiveParserWrapperHandler to write to disk when AbstractRecursiveParserWrapperHandler.endDocument(ContentHandler, Metadata) is called, and the server will take care of the writing via the handler.

      NOTE [0]: "the server will proxy back the data as best it can". If the handler implements Serializable and is actually serializable, the server will send it and the Metadata back upon RecursiveParserWrapperHandler.endEmbeddedDocument(ContentHandler, Metadata) or RecursiveParserWrapperHandler.endDocument(ContentHandler, Metadata). If the handler does not implement Serializable, or if a NotSerializableException is thrown during serialization, the server will call toString() on the ContentHandler, store that value in the Metadata under the TikaCoreProperties.TIKA_CONTENT key, and then serialize and proxy that data back.

      stream - the document stream (input)
      handler - handler for the XHTML SAX events (output)
      metadata - document metadata (input and output)
      context - parse context
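
The fallback rule in note [0] can be sketched with plain Java serialization. This is an illustration of the decision only, not the server's code: `HandlerFallbackSketch`, `isActuallySerializable`, and `payloadFor` are hypothetical names, and a generic `Object` stands in for the ContentHandler.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Illustrative model of "proxy back the data as best it can"; names are hypothetical.
class HandlerFallbackSketch {

    // True only if the object implements Serializable AND a trial write succeeds.
    static boolean isActuallySerializable(Object handler) {
        if (!(handler instanceof Serializable)) {
            return false;
        }
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(handler);
            return true;
        } catch (NotSerializableException e) {
            return false; // e.g. a Serializable class holding a non-serializable field
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // What would be sent back: the handler itself, or its toString() text as a fallback.
    static Object payloadFor(Object handler) {
        return isActuallySerializable(handler) ? handler : handler.toString();
    }
}
```

In the behavior described above, the fallback text is stored in the Metadata under TikaCoreProperties.TIKA_CONTENT rather than returned directly; the sketch leaves that step out to stay free of Tika dependencies.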
    • close

      public void close()
    • setServerPulseMillis

      public void setServerPulseMillis(long serverPulseMillis)
      The amount of time in milliseconds that the server should wait before checking whether the parse has timed out or the wait has timed out. The default is 5 seconds.
      serverPulseMillis - milliseconds to sleep before checking if there has been any activity
    • setServerParseTimeoutMillis

      public void setServerParseTimeoutMillis(long serverParseTimeoutMillis)
      The maximum amount of time allowed for the server to try to parse a file. If more than this time elapses, the server shuts down, and the ForkParser throws an exception.
      serverParseTimeoutMillis - maximum time in milliseconds allowed for parsing a single file
    • setServerWaitTimeoutMillis

      public void setServerWaitTimeoutMillis(long serverWaitTimeoutMillis)
      The maximum amount of time allowed for the server to wait for a new request to parse a file. The server will shut down after this amount of time, and a new server will have to be started by a new client.
      serverWaitTimeoutMillis - maximum time in milliseconds that the server waits for a new request
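
To show how the pulse interval interacts with the two timeouts, here is a minimal model of a periodic check. It assumes the server simply compares elapsed time against the relevant limit on each pulse; the class `ServerWatchdogSketch`, its methods, and the 12-second and 60-second limits in the usage note are illustrative assumptions, not Tika defaults or Tika code.

```java
// Illustrative model of a pulse-driven timeout check; all names are hypothetical.
class ServerWatchdogSketch {

    enum Decision { KEEP_RUNNING, PARSE_TIMED_OUT, WAIT_TIMED_OUT }

    // One check: compare the time spent in the current state against its limit.
    static Decision check(boolean parsing, long elapsedMillis,
                          long parseTimeoutMillis, long waitTimeoutMillis) {
        if (parsing && elapsedMillis > parseTimeoutMillis) {
            return Decision.PARSE_TIMED_OUT;
        }
        if (!parsing && elapsedMillis > waitTimeoutMillis) {
            return Decision.WAIT_TIMED_OUT;
        }
        return Decision.KEEP_RUNNING;
    }

    // Simulated clock: returns the elapsed time at which a timeout is first noticed
    // when checks happen only once per pulse.
    static long detectionTimeMillis(boolean parsing, long pulseMillis,
                                    long parseTimeoutMillis, long waitTimeoutMillis) {
        if (pulseMillis <= 0) {
            throw new IllegalArgumentException("pulseMillis must be positive");
        }
        long elapsed = 0;
        while (check(parsing, elapsed, parseTimeoutMillis, waitTimeoutMillis)
                == Decision.KEEP_RUNNING) {
            elapsed += pulseMillis; // stands in for sleeping one pulse
        }
        return elapsed;
    }
}
```

Under this model, a 5000 ms pulse with a 12000 ms parse limit notices the timeout at 15000 ms, so a timeout can be detected up to one pulse late; a shorter pulse tightens that bound at the cost of more frequent checks.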
    • setMaxFilesProcessedPerServer

      public void setMaxFilesProcessedPerServer(int maxFilesProcessedPerClient)
      If there is a slowly building memory leak in one of the parsers, it is useful to set a limit on the number of files processed by a server before it is shut down and restarted. The default value is -1.
      maxFilesProcessedPerClient - maximum number of files that a server can handle before the parser shuts down a client and creates a new process. If set to -1, the server is never restarted because of the number of files handled.
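
The restart rule can be expressed as a one-line predicate. This sketch assumes that any negative limit disables the check and that the restart happens once the count reaches the limit; the source only specifies the -1 case, so both details, along with the names `FileLimitSketch` and `shouldRestart`, are assumptions.

```java
// Illustrative predicate for the file-count limit; names are hypothetical.
class FileLimitSketch {

    // True when the server has handled enough files to warrant a restart.
    // Assumption: negative limits (including the default -1) disable the check.
    static boolean shouldRestart(int filesProcessed, int maxFilesProcessedPerServer) {
        return maxFilesProcessedPerServer > -1
                && filesProcessed >= maxFilesProcessedPerServer;
    }

    public static void main(String[] args) {
        System.out.println(shouldRestart(1000, -1)); // limit disabled
        System.out.println(shouldRestart(9, 10));    // below the limit
        System.out.println(shouldRestart(10, 10));   // limit reached
    }
}
```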