Class ForkParser

    • Constructor Detail

      • ForkParser

        public ForkParser​(Path tikaBin,
                          ParserFactoryFactory factoryFactory)
        If you have a directory containing, say, tika-app.jar and you want the forked process/server to build a parser and run it from there, so that all of those dependencies stay out of your client code, use this constructor.
        Parameters:
        tikaBin - directory containing the tika-app.jar or similar -- full jar including tika-core and all desired parsers and dependencies
        factoryFactory - the factory to use to generate the parser factory in the forked process/server
      • ForkParser

        public ForkParser​(Path tikaBin,
                          ParserFactoryFactory parserFactoryFactory,
                          ClassLoader classLoader)
        EXPERT
        Parameters:
        tikaBin - directory containing the tika-app.jar or similar -- full jar including tika-core and all desired parsers and dependencies
        parserFactoryFactory - the factory to use to generate the parser factory in the forked process/server
        classLoader - the class loader to use for all classes besides the parser in the forked process/server
      • ForkParser

        public ForkParser​(ClassLoader loader,
                          Parser parser)
        Parameters:
        loader - The ClassLoader to use
        parser - the parser to delegate to; this cannot be another ForkParser
      • ForkParser

        public ForkParser​(ClassLoader loader)
      • ForkParser

        public ForkParser()
    • Method Detail

      • getPoolSize

        public int getPoolSize()
        Returns the size of the process pool.
        Returns:
        process pool size
      • setPoolSize

        public void setPoolSize​(int poolSize)
        Sets the size of the process pool.
        Parameters:
        poolSize - process pool size
      • setJavaCommand

        public void setJavaCommand​(List<String> java)
        Sets the command used to start the forked server process. The arguments "-jar" and "/path/to/bootstrap.jar" or "-cp" and "/path/to/tika_bin" are appended to the given command when starting the process. The default setting is {"java", "-Xmx32m"}.

        Creates a defensive copy.

        Parameters:
        java - java command line
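
        As a rough illustration of the appending described above, the sketch below builds the two command shapes from a base command. ForkCommandSketch and its methods are hypothetical stand-ins written for this example, not Tika's implementation, and the real process launch may append further arguments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative stand-in only; this is not Tika's implementation.
final class ForkCommandSketch {

    // Appends "-jar <bootstrapJar>" to a copy of the base command.
    static List<String> withBootstrapJar(List<String> java, String bootstrapJar) {
        List<String> command = new ArrayList<>(java); // copy, so the caller's list is untouched
        command.add("-jar");
        command.add(bootstrapJar);
        return command;
    }

    // Appends "-cp <tikaBin>" to a copy of the base command.
    static List<String> withClasspath(List<String> java, String tikaBin) {
        List<String> command = new ArrayList<>(java);
        command.add("-cp");
        command.add(tikaBin);
        return command;
    }

    public static void main(String[] args) {
        // "-Xmx256m" is an arbitrary value chosen for the example.
        List<String> base = Arrays.asList("java", "-Xmx256m");
        System.out.println(withBootstrapJar(base, "/path/to/bootstrap.jar"));
        System.out.println(withClasspath(base, "/path/to/tika_bin"));
    }
}
```

        Since only the leading part of the command is configurable, JVM options such as heap size belong in the list passed to setJavaCommand.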
      • getJavaCommandAsList

        public List<String> getJavaCommandAsList()
        Returns the command used to start the forked server process.

        Returned list is unmodifiable.

        Returns:
        java command line args
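
        The copy-on-set and read-only-on-get behavior documented for the Java command can be pictured with the following minimal sketch. JavaCommandHolder is a hypothetical class that only mimics the documented contract; it is not ForkParser's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical holder that mimics the documented copy-in / read-only-out contract.
final class JavaCommandHolder {
    private List<String> java = Arrays.asList("java", "-Xmx32m");

    void setJavaCommand(List<String> java) {
        this.java = new ArrayList<>(java); // defensive copy of the caller's list
    }

    List<String> getJavaCommandAsList() {
        return Collections.unmodifiableList(java); // read-only view
    }

    public static void main(String[] args) {
        JavaCommandHolder holder = new JavaCommandHolder();
        List<String> mine = new ArrayList<>(Arrays.asList("java", "-Xmx1g"));
        holder.setJavaCommand(mine);
        mine.add("-verbose:gc"); // later changes to the caller's list do not reach the holder
        System.out.println(holder.getJavaCommandAsList());
        try {
            holder.getJavaCommandAsList().add("-server");
        } catch (UnsupportedOperationException e) {
            System.out.println("returned list is read-only");
        }
    }
}
```

        Under this contract, changing the command means building a new list and passing it to the setter again.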
      • getSupportedTypes

        public Set<MediaType> getSupportedTypes​(ParseContext context)
        Description copied from interface: Parser
        Returns the set of media types supported by this parser when used with the given parse context.
        Specified by:
        getSupportedTypes in interface Parser
        Parameters:
        context - parse context
        Returns:
        immutable set of media types
      • setServerPulseMillis

        public void setServerPulseMillis​(long serverPulseMillis)
        The amount of time in milliseconds that the server should wait before checking whether the parse has timed out or the wait has timed out. The default is 5 seconds.
        Parameters:
        serverPulseMillis - milliseconds to sleep before checking if there has been any activity
      • setServerParseTimeoutMillis

        public void setServerParseTimeoutMillis​(long serverParseTimeoutMillis)
        The maximum amount of time allowed for the server to try to parse a file. If more than this time elapses, the server shuts down, and the ForkParser throws an exception.
        Parameters:
        serverParseTimeoutMillis - maximum time in milliseconds allowed for the server to parse a single file
      • setServerWaitTimeoutMillis

        public void setServerWaitTimeoutMillis​(long serverWaitTimeoutMillis)
        The maximum amount of time allowed for the server to wait for a new request to parse a file. The server shuts down after this amount of time, and a new server has to be started by a new client.
        Parameters:
        serverWaitTimeoutMillis - maximum time in milliseconds that the server waits for a new parse request before shutting down
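
        One way to picture how the pulse interval, the parse timeout and the wait timeout relate is the simplified model below. It assumes the server wakes once per pulse and compares elapsed time against whichever timeout applies; ServerWatchdogSketch, its Action enum and the timeout values used are hypothetical and are not taken from Tika's server code.

```java
// Simplified model of the documented timeouts; not Tika's server implementation.
final class ServerWatchdogSketch {

    enum Action { KEEP_RUNNING, SHUT_DOWN_PARSE_TIMEOUT, SHUT_DOWN_WAIT_TIMEOUT }

    private final long parseTimeoutMillis;
    private final long waitTimeoutMillis;

    ServerWatchdogSketch(long parseTimeoutMillis, long waitTimeoutMillis) {
        this.parseTimeoutMillis = parseTimeoutMillis;
        this.waitTimeoutMillis = waitTimeoutMillis;
    }

    // Called once per pulse. 'parsing' says whether a parse is in progress;
    // 'elapsedMillis' is the time since that parse started or, when idle,
    // the time since the server began waiting for a request.
    Action check(boolean parsing, long elapsedMillis) {
        if (parsing && elapsedMillis > parseTimeoutMillis) {
            return Action.SHUT_DOWN_PARSE_TIMEOUT;
        }
        if (!parsing && elapsedMillis > waitTimeoutMillis) {
            return Action.SHUT_DOWN_WAIT_TIMEOUT;
        }
        return Action.KEEP_RUNNING;
    }

    public static void main(String[] args) {
        long pulseMillis = 5_000L; // the documented default pulse
        // Both timeout values are arbitrary, chosen for the example.
        ServerWatchdogSketch watchdog = new ServerWatchdogSketch(12_000L, 30_000L);
        // Simulate a parse that never finishes, checking once per pulse.
        for (long elapsed = pulseMillis; elapsed <= 15_000L; elapsed += pulseMillis) {
            System.out.println(elapsed + " ms: " + watchdog.check(true, elapsed));
        }
    }
}
```

        In this model a timeout is only noticed at the next pulse, so a shorter pulse detects a stuck parse sooner at the cost of more frequent wake-ups.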
      • setMaxFilesProcessedPerServer

        public void setMaxFilesProcessedPerServer​(int maxFilesProcessedPerClient)
        If there is a slowly building memory leak in one of the parsers, it is useful to set a limit on the number of files processed by a server before it is shut down and restarted. Default value is -1.
        Parameters:
        maxFilesProcessedPerClient - maximum number of files that a server can handle before the parser shuts down a client and creates a new process. If set to -1, the server is never restarted because of the number of files handled.
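
        The restart rule can be read as a simple counter check, sketched below. RestartPolicySketch is a hypothetical helper; treating every negative value like -1 and restarting exactly when the count reaches the limit are assumptions made for the example, not confirmed details of ForkParser.

```java
// Hypothetical illustration of the documented limit; not ForkParser's code.
final class RestartPolicySketch {

    // Returns true when a server that has handled 'filesProcessed' files
    // should be replaced by a fresh process.
    static boolean shouldRestart(int filesProcessed, int maxFilesProcessedPerServer) {
        if (maxFilesProcessedPerServer < 0) {
            return false; // -1 (assumed: any negative value) disables the limit
        }
        return filesProcessed >= maxFilesProcessedPerServer;
    }

    public static void main(String[] args) {
        System.out.println(shouldRestart(1_000_000, -1)); // limit disabled
        System.out.println(shouldRestart(99, 100));       // below the limit
        System.out.println(shouldRestart(100, 100));      // limit reached
    }
}
```

        A lower limit bounds how much memory a leaking parser can accumulate, at the price of paying the process start-up cost more often.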