public class ExternalParser extends AbstractParser implements Initializable
This parser relies more on configuration than on the SPI model. Further, users can specify a parser to handle the output of the external process.

Modifier and Type | Field and Description |
---|---|
static long | DEFAULT_TIMEOUT_MS |
static String | INPUT_FILE_TOKEN |
static String | OUTPUT_FILE_TOKEN |
Constructor and Description |
---|
ExternalParser() |
Modifier and Type | Method and Description |
---|---|
void | checkInitialization(InitializableProblemHandler problemHandler) |
Parser | getOutputParser() |
Set<MediaType> | getSupportedTypes(ParseContext context) Returns the set of media types supported by this parser when used with the given parse context. |
void | initialize(Map<String,Param> params) |
void | parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) Parses a document stream into a sequence of XHTML SAX events. |
void | setCommandLine(List<String> commandLine) Use this to specify the full commandLine. |
void | setMaxStdErr(int maxStdErr) |
void | setMaxStdOut(int maxStdOut) |
void | setOutputParser(Parser parser) This parser is called on the output of the process. |
void | setReturnStderr(boolean returnStderr) If set to true, this will return the stderr in the metadata via ExternalProcess.STD_ERR. |
void | setReturnStdout(boolean returnStdout) If set to true, this will return the stdout in the metadata via ExternalProcess.STD_OUT. |
void | setSupportedTypes(List<String> supportedTypes) This is set during initialization from a tika-config. |
void | setTimeoutMs(long timeoutMs) |
Methods inherited from class AbstractParser: parse
public static final long DEFAULT_TIMEOUT_MS
public static final String INPUT_FILE_TOKEN
public static final String OUTPUT_FILE_TOKEN
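
public Set<MediaType> getSupportedTypes(ParseContext context)
Returns the set of media types supported by this parser when used with the given parse context.
Specified by: getSupportedTypes in interface Parser
Parameters:
context - parse context

public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
Parses a document stream into a sequence of XHTML SAX events.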
The given document stream is consumed but not closed by this method. The responsibility to close the stream remains on the caller.
Information about the parsing context can be passed in the context parameter. See the parser implementations for the kinds of context information they expect.
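
The stream contract described above (consumed by the parser, closed by the caller) can be illustrated with a minimal sketch. It does not call Tika: `consume` is a hypothetical stand-in for a `parse` call, and `TrackingStream` is an illustrative helper that records whether `close()` was invoked.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class StreamOwnershipSketch {

    /** In-memory stream that records whether close() was called. */
    static final class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /**
     * Hypothetical stand-in for a parse call: reads the stream to the end
     * and deliberately does not close it.
     */
    static long consume(InputStream stream) throws IOException {
        long total = 0;
        byte[] buffer = new byte[4096];
        int n;
        while ((n = stream.read(buffer)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        // The caller opens the stream and owns closing it (try-with-resources).
        try (TrackingStream stream = new TrackingStream(data)) {
            long read = consume(stream);
            System.out.println("bytes read: " + read + ", closed inside block: " + stream.closed);
        }
    }
}
```

Wrapping the stream in try-with-resources on the caller side keeps the ownership rule explicit regardless of what the parser does internally.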
Specified by: parse in interface Parser
Parameters:
stream - the document stream (input)
handler - handler for the XHTML SAX events (output)
metadata - document metadata (input and output)
context - parse context
Throws:
IOException - if the document stream could not be read
SAXException - if the SAX events could not be processed
TikaException - if the document could not be parsed

@Field public void setSupportedTypes(List<String> supportedTypes)
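This is set during initialization from a tika-config. Calls made after initialization result in an IllegalStateException.
Parameters:
supportedTypes

@Field public void setTimeoutMs(long timeoutMs)

@Field public void setMaxStdErr(int maxStdErr)

@Field public void setMaxStdOut(int maxStdOut)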

@Field public void setCommandLine(List<String> commandLine)
Use this to specify the full commandLine. Refer to the input file with INPUT_FILE_TOKEN. If the external process writes to an output file, specify OUTPUT_FILE_TOKEN.
Parameters:
commandLine
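
A command line built around INPUT_FILE_TOKEN and OUTPUT_FILE_TOKEN can be pictured as a template whose placeholders are swapped for real paths before the process starts. The sketch below is not Tika's implementation: the token values `${INPUT_FILE}` and `${OUTPUT_FILE}` are assumptions (check the actual constants), and `mytool`, `substitute`, and the paths are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CommandLineTokenSketch {

    // Assumed placeholder values; verify against the real ExternalParser constants.
    static final String INPUT_FILE_TOKEN = "${INPUT_FILE}";
    static final String OUTPUT_FILE_TOKEN = "${OUTPUT_FILE}";

    /** Returns a copy of the template with both tokens replaced by concrete paths. */
    static List<String> substitute(List<String> template, String inputPath, String outputPath) {
        List<String> resolved = new ArrayList<>(template.size());
        for (String arg : template) {
            // String.replace performs literal (non-regex) replacement.
            resolved.add(arg.replace(INPUT_FILE_TOKEN, inputPath)
                            .replace(OUTPUT_FILE_TOKEN, outputPath));
        }
        return resolved;
    }

    public static void main(String[] args) {
        // "mytool" is a hypothetical executable.
        List<String> template = Arrays.asList(
                "mytool", "--in", INPUT_FILE_TOKEN, "--out=" + OUTPUT_FILE_TOKEN);
        System.out.println(substitute(template, "/tmp/in.bin", "/tmp/out.txt"));
    }
}
```

Keeping the command line as a list of arguments, rather than one string, avoids shell quoting problems when a substituted path contains spaces.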
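
@Field public void setReturnStdout(boolean returnStdout)
If set to true, this will return the stdout in the metadata via ExternalProcess.STD_OUT. Default is false because this should normally be handled by the outputParser.
Parameters:
returnStdout

@Field public void setReturnStderr(boolean returnStderr)
If set to true, this will return the stderr in the metadata via ExternalProcess.STD_ERR. Default is true.
Parameters:
returnStderr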
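
@Field public void setOutputParser(Parser parser)
This parser is called on the output of the process. If the process writes to an output file specified by OUTPUT_FILE_TOKEN, this parser will parse that file, otherwise it will parse the UTF-8 encoded bytes from the process' STD_OUT.
Parameters:
parser

public Parser getOutputParser()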
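
The rule stated for setOutputParser (parse the output file when one is used, otherwise the UTF-8 bytes of STD_OUT) amounts to a simple choice of input source. The following is a rough sketch of that choice only, not Tika's code; `bytesForOutputParser` and its parameters are hypothetical names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class OutputSourceSketch {

    /**
     * Picks the bytes an output parser would read: the output file when the
     * command line used an output-file token, otherwise the captured stdout
     * encoded as UTF-8.
     */
    static byte[] bytesForOutputParser(boolean usesOutputFile, Path outputFile, String stdout)
            throws IOException {
        if (usesOutputFile) {
            return Files.readAllBytes(outputFile);
        }
        return stdout.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("external-output", ".txt");
        try {
            Files.write(tmp, "from file".getBytes(StandardCharsets.UTF_8));
            byte[] fromFile = bytesForOutputParser(true, tmp, "from stdout");
            byte[] fromStdout = bytesForOutputParser(false, tmp, "from stdout");
            System.out.println(new String(fromFile, StandardCharsets.UTF_8));
            System.out.println(new String(fromStdout, StandardCharsets.UTF_8));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```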

public void initialize(Map<String,Param> params) throws TikaConfigException
Specified by: initialize in interface Initializable
Parameters:
params - params to use for initialization
Throws:
TikaConfigException

public void checkInitialization(InitializableProblemHandler problemHandler) throws TikaConfigException
Specified by: checkInitialization in interface Initializable
Parameters:
problemHandler - if there is a problem and no custom initializableProblemHandler has been configured via Initializable parameters, this is called to respond.
Throws:
TikaConfigException

Copyright © 2007–2023 The Apache Software Foundation. All rights reserved.