Package org.apache.tika.parser.multiple
Class AbstractMultipleParser
java.lang.Object
org.apache.tika.parser.multiple.AbstractMultipleParser
- All Implemented Interfaces:
Serializable, Parser
- Direct Known Subclasses:
FallbackParser, PickBestTextEncodingParser, SupplementingParser
Abstract base class for parser wrappers which may or will process a given stream multiple times, merging the results of the various parsers used.
End users should normally use FallbackParser or SupplementingParser along with a Strategy.
Note that unless you supply a ContentHandlerFactory, the content from every parser tried is combined into a single output.
- Since:
Apache Tika 1.18
Nested Class Summary
- static enum AbstractMultipleParser.MetadataPolicy
The various strategies for handling metadata emitted by multiple parsers.
-
Field Summary
-
Constructor Summary
- AbstractMultipleParser(MediaTypeRegistry registry, Collection<? extends Parser> parsers, Map<String, Param> params)
- AbstractMultipleParser(MediaTypeRegistry registry, AbstractMultipleParser.MetadataPolicy policy, Collection<? extends Parser> parsers)
- AbstractMultipleParser(MediaTypeRegistry registry, AbstractMultipleParser.MetadataPolicy policy, Parser... parsers)
-
Method Summary
- getMediaTypeRegistry()
Returns the media type registry used to infer type relationships.
- protected static AbstractMultipleParser.MetadataPolicy getMetadataPolicy(Map<String, Param> params)
- getSupportedTypes(ParseContext context)
Returns the set of media types supported by this parser when used with the given parse context.
- protected static Metadata mergeMetadata(Metadata newMetadata, Metadata lastMetadata, AbstractMultipleParser.MetadataPolicy policy)
- void parse(InputStream stream, ContentHandlerFactory handlers, Metadata metadata, ParseContext context)
Deprecated.
- void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
Processes the given stream through one or more parsers, resetting things between parsers as requested by policy.
- protected abstract boolean parserCompleted(Parser parser, Metadata metadata, ContentHandler handler, ParseContext context, Exception exception)
Used to notify implementations that a Parser has finished or failed, and to allow them to decide whether to continue or abort further parsing.
- protected void parserPrepare(Parser parser, Metadata metadata, ParseContext context)
Used to allow implementations to prepare or change things before parsing occurs.
- void setMediaTypeRegistry(MediaTypeRegistry registry)
Sets the media type registry used to infer type relationships.
-
Field Details
-
METADATA_POLICY_CONFIG_KEY
-
Constructor Details
-
AbstractMultipleParser
public AbstractMultipleParser(MediaTypeRegistry registry, Collection<? extends Parser> parsers, Map<String, Param> params)
-
AbstractMultipleParser
public AbstractMultipleParser(MediaTypeRegistry registry, AbstractMultipleParser.MetadataPolicy policy, Parser... parsers)
-
AbstractMultipleParser
public AbstractMultipleParser(MediaTypeRegistry registry, AbstractMultipleParser.MetadataPolicy policy, Collection<? extends Parser> parsers)
-
-
Method Details
-
getMetadataPolicy
-
mergeMetadata
protected static Metadata mergeMetadata(Metadata newMetadata, Metadata lastMetadata, AbstractMultipleParser.MetadataPolicy policy)
-
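The page does not describe how mergeMetadata combines the two Metadata objects under each policy. The following is a minimal sketch of one plausible reading, not Tika's implementation: plain maps stand in for Metadata, and the policy names (DISCARD_ALL, FIRST_WINS, LAST_WINS, KEEP_ALL) are assumptions made for illustration, since the enum's constants are not listed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetadataMergeSketch {
    // Hypothetical policy names for this sketch only; the actual constants of
    // AbstractMultipleParser.MetadataPolicy are not listed on this page.
    enum Policy { DISCARD_ALL, FIRST_WINS, LAST_WINS, KEEP_ALL }

    // Merges the metadata of the latest parser (newMetadata) with what earlier
    // parsers produced (lastMetadata). Plain maps stand in for Tika's Metadata.
    static Map<String, List<String>> merge(Map<String, List<String>> newMetadata,
                                           Map<String, List<String>> lastMetadata,
                                           Policy policy) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : newMetadata.entrySet()) {
            merged.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        if (policy == Policy.DISCARD_ALL) {
            return merged; // earlier metadata is thrown away
        }
        for (Map.Entry<String, List<String>> e : lastMetadata.entrySet()) {
            List<String> current = merged.get(e.getKey());
            if (current == null || policy == Policy.FIRST_WINS) {
                // Key only seen earlier, or the earlier parser takes priority.
                merged.put(e.getKey(), new ArrayList<>(e.getValue()));
            } else if (policy == Policy.KEEP_ALL) {
                // Earlier values first, then new values not already present.
                List<String> combined = new ArrayList<>(e.getValue());
                for (String value : current) {
                    if (!combined.contains(value)) {
                        combined.add(value);
                    }
                }
                merged.put(e.getKey(), combined);
            }
            // LAST_WINS: the newer values already in 'merged' are kept.
        }
        return merged;
    }

    // Convenience helper: builds a one-key metadata map.
    static Map<String, List<String>> single(String key, String... values) {
        Map<String, List<String>> map = new LinkedHashMap<>();
        map.put(key, new ArrayList<>(Arrays.asList(values)));
        return map;
    }

    public static void main(String[] args) {
        Map<String, List<String>> earlier = single("title", "Draft");
        Map<String, List<String>> latest = single("title", "Final");
        for (Policy policy : Policy.values()) {
            System.out.println(policy + " -> " + merge(latest, earlier, policy));
        }
    }
}
```

The argument order mirrors the signature above (new metadata first, previous metadata second), so the sketch can be read side by side with the real method.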
getMediaTypeRegistry
Returns the media type registry used to infer type relationships.
- Returns:
media type registry
-
setMediaTypeRegistry
Sets the media type registry used to infer type relationships.
- Parameters:
registry - media type registry
-
getSupportedTypes
Description copied from interface: Parser
Returns the set of media types supported by this parser when used with the given parse context.
- Specified by:
getSupportedTypes in interface Parser
- Parameters:
context - parse context
- Returns:
immutable set of media types
-
getMetadataPolicy
-
getAllParsers
-
parserPrepare
protected void parserPrepare(Parser parser, Metadata metadata, ParseContext context)
Used to allow implementations to prepare or change things before parsing occurs.
-
parserCompleted
protected abstract boolean parserCompleted(Parser parser, Metadata metadata, ContentHandler handler, ParseContext context, Exception exception)
Used to notify implementations that a Parser has finished or failed, and to allow them to decide whether to continue or abort further parsing.
-
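To illustrate how a boolean returned from a completion hook can drive the choice between continuing and aborting, here is a simplified, self-contained model. MiniParser, CompletionHook, STOP_ON_SUCCESS and ALWAYS_CONTINUE are hypothetical names invented for this sketch; it assumes that returning true means "try the next parser", and it does not reproduce the behaviour of FallbackParser or SupplementingParser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParserLoopSketch {
    // Hypothetical stand-in for a Parser: returns extracted text or throws.
    interface MiniParser {
        String name();
        String parse(String input) throws Exception;
    }

    // Hypothetical stand-in for parserCompleted; in this sketch, returning
    // true means "try the next parser" (an assumption, see the lead-in).
    interface CompletionHook {
        boolean parserCompleted(MiniParser parser, Exception failure);
    }

    // Stop as soon as one parser succeeds (a fallback-style choice).
    static final CompletionHook STOP_ON_SUCCESS = (parser, failure) -> failure != null;
    // Always carry on, so later parsers can add to earlier results.
    static final CompletionHook ALWAYS_CONTINUE = (parser, failure) -> true;

    // Runs the parsers in order and returns the names of those that were tried.
    static List<String> run(List<MiniParser> parsers, String input, CompletionHook hook) {
        List<String> tried = new ArrayList<>();
        for (MiniParser parser : parsers) {
            Exception failure = null;
            try {
                parser.parse(input);
            } catch (Exception e) {
                failure = e; // handed to the hook rather than rethrown here
            }
            tried.add(parser.name());
            if (!hook.parserCompleted(parser, failure)) {
                break; // the hook asked to abort further parsing
            }
        }
        return tried;
    }

    // Builds a demo parser that either succeeds or always fails.
    static MiniParser parser(final String name, final boolean fails) {
        return new MiniParser() {
            public String name() {
                return name;
            }

            public String parse(String input) throws Exception {
                if (fails) {
                    throw new Exception(name + " cannot handle input");
                }
                return name + ":" + input;
            }
        };
    }

    static List<MiniParser> demoParsers() {
        return Arrays.asList(parser("broken", true), parser("primary", false), parser("extra", false));
    }

    public static void main(String[] args) {
        System.out.println(run(demoParsers(), "doc", STOP_ON_SUCCESS));
        System.out.println(run(demoParsers(), "doc", ALWAYS_CONTINUE));
    }
}
```

With the stop-on-success hook the third parser is never invoked, while the always-continue hook visits all three.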
parse
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
Processes the given stream through one or more parsers, resetting things between parsers as requested by policy. The actual processing is delegated to one or more Parsers.
Note that this method delivers the text from every parser to the same handler. To control which content comes from which parser, call the method with a ContentHandlerFactory instead.
- Specified by:
parse in interface Parser
- Parameters:
stream - the document stream (input)
handler - handler for the XHTML SAX events (output)
metadata - document metadata (input and output)
context - parse context
- Throws:
IOException - if the document stream could not be read
SAXException - if the SAX events could not be processed
TikaException - if the document could not be parsed
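The difference between passing a single handler and passing a factory can be pictured with a small, self-contained model. This is only a sketch under simplifying assumptions: a StringBuilder stands in for a ContentHandler, a Supplier stands in for ContentHandlerFactory, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

public class HandlerFactorySketch {
    // One shared handler: every parser writes into the same buffer, so the
    // outputs can no longer be told apart afterwards.
    static String withSharedHandler(List<String> parserOutputs) {
        StringBuilder shared = new StringBuilder();
        for (String output : parserOutputs) {
            shared.append(output);
        }
        return shared.toString();
    }

    // A factory hands out a fresh handler per parser, keeping outputs separate.
    static List<String> withHandlerFactory(List<String> parserOutputs,
                                           Supplier<StringBuilder> factory) {
        List<String> perParser = new ArrayList<>();
        for (String output : parserOutputs) {
            StringBuilder handler = factory.get();
            handler.append(output);
            perParser.add(handler.toString());
        }
        return perParser;
    }

    public static void main(String[] args) {
        List<String> outputs = Arrays.asList("alpha", "beta");
        System.out.println(withSharedHandler(outputs));
        System.out.println(withHandlerFactory(outputs, StringBuilder::new));
    }
}
```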
-
parse
@Deprecated public void parse(InputStream stream, ContentHandlerFactory handlers, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
Deprecated. The ContentHandlerFactory override is still experimental and the method signature is subject to change before Tika 2.0.
Processes the given stream through one or more parsers, resetting things between parsers as requested by policy. The actual processing is delegated to one or more Parsers. One ContentHandler is fetched from the factory for each Parser used.
TODO Do we need to return all the ContentHandler instances we created?
- Throws:
IOException
SAXException
TikaException