public abstract class AbstractMultipleParser extends AbstractParser
End users should normally use FallbackParser or SupplementingParser along with a Strategy.
Note that unless you supply a ContentHandlerFactory, the content from every parser tried ends up mixed together in a single output.
Modifier and Type | Class and Description |
---|---|
static class | AbstractMultipleParser.MetadataPolicy: The various strategies for handling metadata emitted by multiple parsers. |
Modifier and Type | Field and Description |
---|---|
protected static String | METADATA_POLICY_CONFIG_KEY |
Constructor and Description |
---|
AbstractMultipleParser(MediaTypeRegistry registry, AbstractMultipleParser.MetadataPolicy policy, Collection<? extends Parser> parsers) |
AbstractMultipleParser(MediaTypeRegistry registry, AbstractMultipleParser.MetadataPolicy policy, Parser... parsers) |
AbstractMultipleParser(MediaTypeRegistry registry, Collection<? extends Parser> parsers, Map<String,Param> params) |
Modifier and Type | Method and Description |
---|---|
List<Parser> | getAllParsers() |
MediaTypeRegistry | getMediaTypeRegistry(): Returns the media type registry used to infer type relationships. |
AbstractMultipleParser.MetadataPolicy | getMetadataPolicy() |
protected static AbstractMultipleParser.MetadataPolicy | getMetadataPolicy(Map<String,Param> params) |
Set<MediaType> | getSupportedTypes(ParseContext context): Returns the set of media types supported by this parser when used with the given parse context. |
protected static Metadata | mergeMetadata(Metadata newMetadata, Metadata lastMetadata, AbstractMultipleParser.MetadataPolicy policy) |
void | parse(InputStream stream, ContentHandlerFactory handlers, Metadata metadata, ParseContext context): Deprecated. The ContentHandlerFactory override is still experimental and the method signature is subject to change before Tika 2.0. |
void | parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context): Processes the given stream through one or more parsers, resetting things between parsers as requested by policy. |
protected abstract boolean | parserCompleted(Parser parser, Metadata metadata, ContentHandler handler, ParseContext context, Exception exception): Used to notify implementations that a Parser has finished or failed, and to allow them to decide whether to continue or abort further parsing. |
protected void | parserPrepare(Parser parser, Metadata metadata, ParseContext context): Used to allow implementations to prepare or change things before parsing occurs. |
void | setMediaTypeRegistry(MediaTypeRegistry registry): Sets the media type registry used to infer type relationships. |
Methods inherited from class AbstractParser: parse
protected static final String METADATA_POLICY_CONFIG_KEY
public AbstractMultipleParser(MediaTypeRegistry registry, Collection<? extends Parser> parsers, Map<String,Param> params)
public AbstractMultipleParser(MediaTypeRegistry registry, AbstractMultipleParser.MetadataPolicy policy, Parser... parsers)
public AbstractMultipleParser(MediaTypeRegistry registry, AbstractMultipleParser.MetadataPolicy policy, Collection<? extends Parser> parsers)
protected static AbstractMultipleParser.MetadataPolicy getMetadataPolicy(Map<String,Param> params)
protected static Metadata mergeMetadata(Metadata newMetadata, Metadata lastMetadata, AbstractMultipleParser.MetadataPolicy policy)
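To make the idea of a metadata policy more concrete, here is a minimal, self-contained sketch of how metadata from an earlier parser and a later parser might be merged. It does not use Tika classes: the `Policy` names, the map-based metadata model, and the merge semantics are assumptions inferred from the method signature above, not the library's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MetadataMergeSketch {
    /** Illustrative policy names; assumed for this sketch, not listed on this page. */
    enum Policy { DISCARD_ALL, FIRST_WINS, LAST_WINS, KEEP_ALL }

    /** Merges the latest parser's metadata with the metadata kept from earlier parsers. */
    static Map<String, List<String>> merge(Map<String, List<String>> newer,
                                           Map<String, List<String>> last,
                                           Policy policy) {
        // Start from a copy of what the latest parser produced.
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : newer.entrySet()) {
            result.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        if (policy == Policy.DISCARD_ALL) {
            return result; // earlier metadata is thrown away
        }
        for (Map.Entry<String, List<String>> e : last.entrySet()) {
            String key = e.getKey();
            List<String> current = result.get(key);
            if (current == null) {
                // No conflict: the earlier value is carried over.
                result.put(key, new ArrayList<>(e.getValue()));
            } else if (policy == Policy.FIRST_WINS) {
                // Conflict: the earlier parser's value replaces the newer one.
                result.put(key, new ArrayList<>(e.getValue()));
            } else if (policy == Policy.KEEP_ALL) {
                // Conflict: keep both, earlier values first, without duplicates.
                List<String> combined = new ArrayList<>(e.getValue());
                for (String v : current) {
                    if (!combined.contains(v)) {
                        combined.add(v);
                    }
                }
                result.put(key, combined);
            }
            // LAST_WINS: the newer value already in result is kept.
        }
        return result;
    }

    /** Runs the merge on a small fixed example and returns its string form. */
    static String demo(Policy policy) {
        Map<String, List<String>> last = new LinkedHashMap<>();
        last.put("title", Arrays.asList("A"));
        last.put("lang", Arrays.asList("en"));
        Map<String, List<String>> newer = new LinkedHashMap<>();
        newer.put("title", Arrays.asList("B"));
        newer.put("author", Arrays.asList("X"));
        return merge(newer, last, policy).toString();
    }

    public static void main(String[] args) {
        for (Policy p : Policy.values()) {
            System.out.println(p + " -> " + demo(p));
        }
    }
}
```

In this model only conflicting keys are affected by the policy; keys that appear in just one of the two inputs survive under every policy except the discard one.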
public MediaTypeRegistry getMediaTypeRegistry()
public void setMediaTypeRegistry(MediaTypeRegistry registry)
registry - media type registry
public Set<MediaType> getSupportedTypes(ParseContext context)
Specified by: getSupportedTypes in interface Parser
context - parse context
public AbstractMultipleParser.MetadataPolicy getMetadataPolicy()
protected void parserPrepare(Parser parser, Metadata metadata, ParseContext context)
protected abstract boolean parserCompleted(Parser parser, Metadata metadata, ContentHandler handler, ParseContext context, Exception exception)
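The interplay of the two hooks above can be illustrated with a simplified, self-contained model that does not depend on Tika. All types here (`MiniParser`, `MiniMultipleParser`, `MiniFallbackParser`) are hypothetical stand-ins, and the convention that returning true from `parserCompleted` means "try the next parser" is an assumption of this sketch. It also shows why output from a failed parser can end up mixed with later output when a single shared handler is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MultiParserSketch {
    /** Hypothetical stand-in for a parser: appends text to a shared buffer or throws. */
    interface MiniParser {
        void parse(String input, StringBuilder out) throws Exception;
    }

    /** Simplified model of a multi-parser wrapper; not Tika's implementation. */
    abstract static class MiniMultipleParser {
        private final List<MiniParser> parsers;

        MiniMultipleParser(List<MiniParser> parsers) {
            this.parsers = new ArrayList<>(parsers);
        }

        /** Hook called before each parser runs; does nothing by default. */
        protected void parserPrepare(MiniParser parser) { }

        /** Hook called after each parser; assumed to return true to continue. */
        protected abstract boolean parserCompleted(MiniParser parser, Exception failure);

        /** Runs the parsers in order against the same input and one shared buffer. */
        String parse(String input) {
            StringBuilder out = new StringBuilder();
            for (MiniParser p : parsers) {
                parserPrepare(p);
                Exception failure = null;
                try {
                    p.parse(input, out);
                } catch (Exception e) {
                    failure = e; // remember the failure and let the hook decide
                }
                if (!parserCompleted(p, failure)) {
                    break;
                }
            }
            return out.toString();
        }
    }

    /** Fallback flavour: stop at the first parser that finishes without an exception. */
    static class MiniFallbackParser extends MiniMultipleParser {
        MiniFallbackParser(List<MiniParser> parsers) {
            super(parsers);
        }

        @Override
        protected boolean parserCompleted(MiniParser parser, Exception failure) {
            return failure != null; // continue only if this parser failed
        }
    }

    /** Runs a failing parser, a working parser, and one that should never be reached. */
    static String run() {
        MiniParser broken = (in, out) -> {
            out.append("[partial]");
            throw new IllegalStateException("cannot parse");
        };
        MiniParser working = (in, out) -> out.append(in.toUpperCase());
        MiniParser unused = (in, out) -> out.append("[never reached]");
        return new MiniFallbackParser(Arrays.asList(broken, working, unused)).parse("hello");
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "[partial]HELLO"
    }
}
```

The partial text written by the failing parser stays in the shared buffer, which is the kind of mixing the class description warns about.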
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
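Processes the given stream through one or more Parsers, resetting things between parsers as requested by policy.
Note that this way you get the text from every parser in the same handler. To control which content comes from which parser, call the overload that takes a ContentHandlerFactory instead.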
stream - the document stream (input)
handler - handler for the XHTML SAX events (output)
metadata - document metadata (input and output)
context - parse context
IOException - if the document stream could not be read
SAXException - if the SAX events could not be processed
TikaException - if the document could not be parsed
public void parse(InputStream stream, ContentHandlerFactory handlers, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
Deprecated. The ContentHandlerFactory override is still experimental and the method signature is subject to change before Tika 2.0.
Processes the given stream through one or more Parsers, resetting things between parsers as requested by policy.
You will get one ContentHandler fetched for each Parser used.
TODO Do we need to return all the ContentHandler instances we created?
IOException
SAXException
TikaException
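The "one handler per parser" idea can be sketched without Tika as follows. This is only a model under stated assumptions: a `Supplier<StringBuilder>` stands in for the handler factory, a `BiConsumer` stands in for a parser, and the method returns every buffer it created, which is one possible answer to the TODO above rather than what the library does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Supplier;

class PerParserHandlerSketch {
    /** Runs every parser with its own freshly created buffer and returns the buffers in order. */
    static List<StringBuilder> parseAll(String input,
                                        List<BiConsumer<String, StringBuilder>> parsers,
                                        Supplier<StringBuilder> handlerFactory) {
        List<StringBuilder> handlers = new ArrayList<>();
        for (BiConsumer<String, StringBuilder> parser : parsers) {
            StringBuilder handler = handlerFactory.get(); // one handler per parser
            handlers.add(handler);
            parser.accept(input, handler);
        }
        return handlers;
    }

    /** Runs two toy parsers on the same input and collects their separate outputs. */
    static List<String> demo() {
        BiConsumer<String, StringBuilder> upper =
                (in, out) -> out.append(in.toUpperCase());
        BiConsumer<String, StringBuilder> reversed =
                (in, out) -> out.append(new StringBuilder(in).reverse().toString());
        List<String> texts = new ArrayList<>();
        for (StringBuilder sb : parseAll("abc", Arrays.asList(upper, reversed), StringBuilder::new)) {
            texts.add(sb.toString());
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "[ABC, cba]"
    }
}
```

Because each parser writes to its own buffer, the caller can tell which content came from which parser instead of receiving one combined stream.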
Copyright © 2007–2023 The Apache Software Foundation. All rights reserved.