Package org.apache.tika.parser
Class CompositeParser
- java.lang.Object
  - org.apache.tika.parser.AbstractParser
    - org.apache.tika.parser.CompositeParser
- All Implemented Interfaces:
Serializable, Parser
- Direct Known Subclasses:
AutoDetectParser, CompositeExternalParser, DefaultParser
public class CompositeParser extends AbstractParser
Composite parser that delegates parsing tasks to a component parser based on the declared content type of the incoming document. A fallback parser is defined for cases where a parser for the given content type is not available.
- See Also:
- Serialized Form
-
Constructor Summary
Constructor Description
CompositeParser()
CompositeParser(MediaTypeRegistry registry, List<Parser> parsers)
CompositeParser(MediaTypeRegistry registry, List<Parser> parsers, Collection<Class<? extends Parser>> excludeParsers)
CompositeParser(MediaTypeRegistry registry, Parser... parsers)
-
Method Summary
Modifier and Type Method Description
Map<MediaType,List<Parser>> findDuplicateParsers(ParseContext context)
Utility method that goes through all the component parsers and finds all media types for which more than one parser declares support.
List<Parser> getAllComponentParsers()
Returns all parsers registered with the Composite Parser, including ones which may not currently be active.
Parser getFallback()
Returns the fallback parser.
MediaTypeRegistry getMediaTypeRegistry()
Returns the media type registry used to infer type relationships.
protected Parser getParser(Metadata metadata)
Returns the parser that best matches the given metadata.
protected Parser getParser(Metadata metadata, ParseContext context)
Map<MediaType,Parser> getParsers()
Returns the component parsers.
Map<MediaType,Parser> getParsers(ParseContext context)
Set<MediaType> getSupportedTypes(ParseContext context)
Returns the set of media types supported by this parser when used with the given parse context.
void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
Delegates the call to the matching component parser.
void setFallback(Parser fallback)
Sets the fallback parser.
void setMediaTypeRegistry(MediaTypeRegistry registry)
Sets the media type registry used to infer type relationships.
void setParsers(Map<MediaType,Parser> parsers)
Sets the component parsers.
-
Methods inherited from class org.apache.tika.parser.AbstractParser
parse
-
Constructor Detail
-
CompositeParser
public CompositeParser(MediaTypeRegistry registry, List<Parser> parsers, Collection<Class<? extends Parser>> excludeParsers)
-
CompositeParser
public CompositeParser(MediaTypeRegistry registry, List<Parser> parsers)
-
CompositeParser
public CompositeParser(MediaTypeRegistry registry, Parser... parsers)
-
CompositeParser
public CompositeParser()
-
Method Detail
-
getParsers
public Map<MediaType,Parser> getParsers(ParseContext context)
-
findDuplicateParsers
public Map<MediaType,List<Parser>> findDuplicateParsers(ParseContext context)
Utility method that goes through all the component parsers and finds all media types for which more than one parser declares support. This is useful in tracking down conflicting parser definitions.
- Parameters:
context - parsing context
- Returns:
- media types that are supported by at least two component parsers
- Since:
- Apache Tika 0.10
- See Also:
- TIKA-660
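The duplicate check described above can be illustrated with a minimal, library-free sketch. It is not Tika's implementation: the class `DuplicateParserSketch`, the method `findDuplicates`, and the use of plain strings for parser names and media types are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; parsers and media types are modelled as strings.
final class DuplicateParserSketch {

    // Groups parser names by declared media type and keeps only the
    // types that are claimed by two or more parsers.
    static Map<String, List<String>> findDuplicates(Map<String, List<String>> typesByParser) {
        Map<String, List<String>> parsersByType = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : typesByParser.entrySet()) {
            for (String type : entry.getValue()) {
                parsersByType.computeIfAbsent(type, k -> new ArrayList<>()).add(entry.getKey());
            }
        }
        // Drop every media type that has a single, unambiguous parser.
        parsersByType.values().removeIf(names -> names.size() < 2);
        return parsersByType;
    }

    // Builds a small sample configuration (made-up parser names) and checks it.
    static Map<String, List<String>> demo() {
        Map<String, List<String>> typesByParser = new LinkedHashMap<>();
        typesByParser.put("textParser", Arrays.asList("text/plain"));
        typesByParser.put("htmlParser", Arrays.asList("text/html", "application/xhtml+xml"));
        typesByParser.put("otherHtmlParser", Arrays.asList("text/html"));
        return findDuplicates(typesByParser);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sample only `text/html` is reported, because it is the only type declared by more than one of the made-up parsers.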
-
getMediaTypeRegistry
public MediaTypeRegistry getMediaTypeRegistry()
Returns the media type registry used to infer type relationships.
- Returns:
- media type registry
- Since:
- Apache Tika 0.8
-
setMediaTypeRegistry
public void setMediaTypeRegistry(MediaTypeRegistry registry)
Sets the media type registry used to infer type relationships.
- Parameters:
registry - media type registry
- Since:
- Apache Tika 0.8
-
getAllComponentParsers
public List<Parser> getAllComponentParsers()
Returns all parsers registered with the Composite Parser, including ones which may not currently be active. This does not include the fallback parser, if one is defined.
-
getParsers
public Map<MediaType,Parser> getParsers()
Returns the component parsers.
- Returns:
- component parsers, keyed by media type
-
setParsers
public void setParsers(Map<MediaType,Parser> parsers)
Sets the component parsers.
- Parameters:
parsers - component parsers, keyed by media type
-
getFallback
public Parser getFallback()
Returns the fallback parser.
- Returns:
- fallback parser
-
setFallback
public void setFallback(Parser fallback)
Sets the fallback parser.
- Parameters:
fallback - fallback parser
-
getParser
protected Parser getParser(Metadata metadata)
Returns the parser that best matches the given metadata. By default, this looks for a parser that matches the content type metadata property and uses the fallback parser if no better match is found. The type hierarchy information in the configured media type registry is used when looking for a matching parser instance.
Subclasses can override this method to provide more accurate parser resolution.
- Parameters:
metadata - document metadata
- Returns:
- matching parser
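The resolution strategy described above (exact type first, then supertypes, then the fallback) can be sketched without any Tika classes. Everything in this example is hypothetical: `ParserResolutionSketch`, its string-based maps, and the sample type relationships stand in for the real registry and parsers and do not reflect Tika's actual code or type data. The sketch also assumes the supertype chain contains no cycles.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; media types and parsers are modelled as strings.
final class ParserResolutionSketch {

    // Assumed type hierarchy: media type -> its direct supertype.
    private final Map<String, String> supertypes = new HashMap<>();
    // Assumed component parsers, keyed by media type.
    private final Map<String, String> parsers = new HashMap<>();
    private final String fallback;

    ParserResolutionSketch(String fallback) {
        this.fallback = fallback;
    }

    void registerSupertype(String type, String supertype) {
        supertypes.put(type, supertype);
    }

    void registerParser(String type, String parserName) {
        parsers.put(type, parserName);
    }

    // Walks from the declared type up through its supertypes and returns the
    // first registered parser; returns the fallback when the chain is exhausted.
    String resolve(String contentType) {
        String type = contentType;
        while (type != null) {
            String parser = parsers.get(type);
            if (parser != null) {
                return parser;
            }
            type = supertypes.get(type);
        }
        return fallback;
    }

    // Sample setup with made-up parser names and an assumed type relationship.
    static ParserResolutionSketch sample() {
        ParserResolutionSketch sketch = new ParserResolutionSketch("emptyParser");
        sketch.registerSupertype("image/svg+xml", "application/xml");
        sketch.registerParser("application/xml", "xmlParser");
        sketch.registerParser("text/plain", "textParser");
        return sketch;
    }

    public static void main(String[] args) {
        ParserResolutionSketch sketch = sample();
        System.out.println(sketch.resolve("application/xml"));
        System.out.println(sketch.resolve("image/svg+xml"));
        System.out.println(sketch.resolve("video/mp4"));
    }
}
```

In the sample, `image/svg+xml` has no parser of its own, so the lookup climbs to `application/xml`; `video/mp4` matches nothing and resolves to the fallback.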
-
getParser
protected Parser getParser(Metadata metadata, ParseContext context)
-
getSupportedTypes
public Set<MediaType> getSupportedTypes(ParseContext context)
Description copied from interface: Parser
Returns the set of media types supported by this parser when used with the given parse context.
- Parameters:
context - parse context
- Returns:
- immutable set of media types
-
parse
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
Delegates the call to the matching component parser.
Potential RuntimeExceptions, IOExceptions and SAXExceptions unrelated to the given input stream and content handler are automatically wrapped into TikaExceptions to better honor the Parser contract.
- Parameters:
stream - the document stream (input)
handler - handler for the XHTML SAX events (output)
metadata - document metadata (input and output)
context - parse context
- Throws:
IOException - if the document stream could not be read
SAXException - if the SAX events could not be processed
TikaException - if the document could not be parsed
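The wrapping rule above can be illustrated with a simplified, library-free sketch. It is an assumption-laden model, not Tika's code: `ParseFailure` stands in for a parse exception, `StreamIOException` is a hypothetical marker for I/O errors that truly originate from the caller's stream, and `Task` stands in for a component parser call. How the real class decides that an exception is "unrelated" to the stream or handler is not shown in this document, so the marker type is purely illustrative.

```java
import java.io.IOException;

// Hypothetical illustration only; none of these types belong to Tika.
final class ExceptionWrappingSketch {

    // Stand-in for a checked "document could not be parsed" exception.
    static final class ParseFailure extends Exception {
        ParseFailure(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Assumed marker for I/O errors raised by the caller's own input stream.
    static final class StreamIOException extends IOException {
        StreamIOException(String message) {
            super(message);
        }
    }

    // Stand-in for a call into a component parser.
    interface Task {
        void run() throws IOException;
    }

    // Runs the delegate; stream-related I/O errors pass through unchanged,
    // while unrelated I/O errors and runtime exceptions are wrapped.
    static void parseSafely(Task delegate) throws IOException, ParseFailure {
        try {
            delegate.run();
        } catch (StreamIOException e) {
            throw e;
        } catch (IOException e) {
            throw new ParseFailure("I/O error unrelated to the input stream", e);
        } catch (RuntimeException e) {
            throw new ParseFailure("Unexpected failure in component parser", e);
        }
    }

    // Summarizes what a caller of parseSafely would observe.
    static String outcome(Task delegate) {
        try {
            parseSafely(delegate);
            return "ok";
        } catch (ParseFailure e) {
            return "wrapped:" + e.getCause().getClass().getSimpleName();
        } catch (IOException e) {
            return "io:" + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(outcome(() -> { throw new IllegalStateException("boom"); }));
        System.out.println(outcome(() -> { throw new StreamIOException("read failed"); }));
    }
}
```

The point of the design is that callers only need to handle the declared checked exceptions, and an `IOException` reaching them signals a genuine problem with the stream they supplied.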
-