org.apache.tika.parser
Class CompositeParser

java.lang.Object
  extended by org.apache.tika.parser.AbstractParser
      extended by org.apache.tika.parser.CompositeParser
All Implemented Interfaces:
Serializable, Parser
Direct Known Subclasses:
AutoDetectParser, CompositeExternalParser, DefaultParser

public class CompositeParser
extends AbstractParser

Composite parser that delegates parsing tasks to a component parser based on the declared content type of the incoming document. A fallback parser is defined for cases where a parser for the given content type is not available.
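
The delegation described above can be illustrated with a small self-contained model. This is only a sketch of the idea, not the Tika implementation: the class DelegationSketch, its methods, and the use of plain strings and functions in place of MediaType and Parser are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Simplified model of composite delegation; hypothetical, not Tika code. */
final class DelegationSketch {
    // Hypothetical stand-in for component parsers, keyed by content type string.
    private final Map<String, Function<String, String>> parsers = new HashMap<>();

    // Used when no component parser is registered for the declared type.
    private Function<String, String> fallback = text -> "fallback:" + text;

    void register(String contentType, Function<String, String> parser) {
        parsers.put(contentType, parser);
    }

    void setFallback(Function<String, String> fallback) {
        this.fallback = fallback;
    }

    /** Picks the parser registered for the declared type, else the fallback. */
    String parse(String declaredContentType, String text) {
        Function<String, String> parser =
                parsers.getOrDefault(declaredContentType, fallback);
        return parser.apply(text);
    }
}
```

In this model, a document declared as a registered type goes to that type's parser, and any other declared type goes to the fallback.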

See Also:
Serialized Form

Constructor Summary
CompositeParser()
CompositeParser(MediaTypeRegistry registry, List<Parser> parsers)
CompositeParser(MediaTypeRegistry registry, Parser... parsers)

Method Summary
 Map<MediaType,List<Parser>> findDuplicateParsers(ParseContext context)
          Utility method that goes through all the component parsers and finds all media types for which more than one parser declares support.
 Parser getFallback()
          Returns the fallback parser.
 MediaTypeRegistry getMediaTypeRegistry()
          Returns the media type registry used to infer type relationships.
protected  Parser getParser(Metadata metadata)
          Returns the parser that best matches the given metadata.
protected  Parser getParser(Metadata metadata, ParseContext context)
           
 Map<MediaType,Parser> getParsers()
          Returns the component parsers.
 Map<MediaType,Parser> getParsers(ParseContext context)
           
 Set<MediaType> getSupportedTypes(ParseContext context)
          Returns the set of media types supported by this parser when used with the given parse context.
 void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
          Delegates the call to the matching component parser.
 void setFallback(Parser fallback)
          Sets the fallback parser.
 void setMediaTypeRegistry(MediaTypeRegistry registry)
          Sets the media type registry used to infer type relationships.
 void setParsers(Map<MediaType,Parser> parsers)
          Sets the component parsers.
 
Methods inherited from class org.apache.tika.parser.AbstractParser
parse
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

CompositeParser

public CompositeParser(MediaTypeRegistry registry,
                       List<Parser> parsers)

CompositeParser

public CompositeParser(MediaTypeRegistry registry,
                       Parser... parsers)

CompositeParser

public CompositeParser()

Method Detail

getParsers

public Map<MediaType,Parser> getParsers(ParseContext context)

findDuplicateParsers

public Map<MediaType,List<Parser>> findDuplicateParsers(ParseContext context)
Utility method that goes through all the component parsers and finds all media types for which more than one parser declares support. This is useful in tracking down conflicting parser definitions.

Parameters:
context - parsing context
Returns:
media types that are supported by at least two component parsers
Since:
Apache Tika 0.10
See Also:
TIKA-660
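
A duplicate check of this kind might look like the sketch below. It is a hedged illustration rather than the actual method body: DuplicateFinderSketch is a hypothetical class, and parser names and media types are modelled as plain strings instead of Parser and MediaType objects.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of a duplicate-parser check; not Tika code. */
final class DuplicateFinderSketch {
    /**
     * @param declaredTypes parser name mapped to the media types it declares
     * @return media types declared by at least two parsers, with those parsers
     */
    static Map<String, List<String>> findDuplicates(
            Map<String, Set<String>> declaredTypes) {
        // Invert the mapping: media type -> every parser that declares it.
        Map<String, List<String>> byType = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> entry : declaredTypes.entrySet()) {
            for (String type : entry.getValue()) {
                byType.computeIfAbsent(type, key -> new ArrayList<>())
                      .add(entry.getKey());
            }
        }
        // Keep only the types claimed by more than one parser.
        byType.values().removeIf(names -> names.size() < 2);
        return byType;
    }
}
```

An empty result means no two component parsers in the model claim the same media type.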

getMediaTypeRegistry

public MediaTypeRegistry getMediaTypeRegistry()
Returns the media type registry used to infer type relationships.

Returns:
media type registry
Since:
Apache Tika 0.8

setMediaTypeRegistry

public void setMediaTypeRegistry(MediaTypeRegistry registry)
Sets the media type registry used to infer type relationships.

Parameters:
registry - media type registry
Since:
Apache Tika 0.8

getParsers

public Map<MediaType,Parser> getParsers()
Returns the component parsers.

Returns:
component parsers, keyed by media type

setParsers

public void setParsers(Map<MediaType,Parser> parsers)
Sets the component parsers.

Parameters:
parsers - component parsers, keyed by media type

getFallback

public Parser getFallback()
Returns the fallback parser.

Returns:
fallback parser

setFallback

public void setFallback(Parser fallback)
Sets the fallback parser.

Parameters:
fallback - fallback parser

getParser

protected Parser getParser(Metadata metadata)
Returns the parser that best matches the given metadata. By default, this method looks for a parser that matches the content type metadata property and uses the fallback parser if no better match is found. The type hierarchy information in the configured media type registry is used when looking for a matching parser instance.

Subclasses can override this method to provide more accurate parser resolution.

Parameters:
metadata - document metadata
Returns:
matching parser
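
One way to picture a lookup that uses type hierarchy information is to walk from the declared type up through its supertypes until a registered parser is found. The sketch below makes that assumption explicit; HierarchyLookupSketch and its string-based types and parser names are hypothetical, and the real resolution logic may differ.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of hierarchy-aware parser lookup; not Tika code. */
final class HierarchyLookupSketch {
    // Stand-in for a media type registry: type -> its direct supertype.
    private final Map<String, String> supertypes = new HashMap<>();
    // Stand-in for the component parsers: type -> parser name.
    private final Map<String, String> parsers = new HashMap<>();
    private final String fallback;

    HierarchyLookupSketch(String fallback) {
        this.fallback = fallback;
    }

    void addSupertype(String type, String supertype) {
        supertypes.put(type, supertype);
    }

    void addParser(String type, String parserName) {
        parsers.put(type, parserName);
    }

    /** Returns the parser for the type or its nearest supertype, else the fallback. */
    String getParser(String contentType) {
        Set<String> visited = new HashSet<>(); // guards against cyclic entries
        String type = contentType;
        while (type != null && visited.add(type)) {
            String parser = parsers.get(type);
            if (parser != null) {
                return parser;
            }
            type = supertypes.get(type); // climb one level in the hierarchy
        }
        return fallback;
    }
}
```

With a registry entry that makes one type a subtype of another, a parser registered only for the supertype also handles the subtype in this model.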

getParser

protected Parser getParser(Metadata metadata,
                           ParseContext context)

getSupportedTypes

public Set<MediaType> getSupportedTypes(ParseContext context)
Description copied from interface: Parser
Returns the set of media types supported by this parser when used with the given parse context.

Parameters:
context - parse context
Returns:
immutable set of media types

parse

public void parse(InputStream stream,
                  ContentHandler handler,
                  Metadata metadata,
                  ParseContext context)
           throws IOException,
                  SAXException,
                  TikaException
Delegates the call to the matching component parser.

Potential RuntimeExceptions, IOExceptions and SAXExceptions unrelated to the given input stream and content handler are automatically wrapped into TikaExceptions to better honor the Parser contract.

Parameters:
stream - the document stream (input)
handler - handler for the XHTML SAX events (output)
metadata - document metadata (input and output)
context - parse context
Throws:
IOException - if the document stream could not be read
SAXException - if the SAX events could not be processed
TikaException - if the document could not be parsed
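
The wrapping behaviour can be sketched with a self-contained model that marks failures coming from the caller's stream so they can be told apart from other I/O errors. This is an illustration under stated assumptions, not the Tika source: WrappingSketch, Component, ParseFailure (standing in for TikaException), and the tagging stream are hypothetical names, and content handler failures are left out for brevity.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical model of exception wrapping during delegation; not Tika code. */
final class WrappingSketch {
    /** Stand-in for TikaException. */
    static final class ParseFailure extends Exception {
        ParseFailure(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Stand-in for a component parser. */
    interface Component {
        void parse(InputStream stream) throws IOException;
    }

    /** Marks an IOException as having been raised by the caller's stream. */
    private static final class StreamFailure extends IOException {
        StreamFailure(IOException cause) {
            super(cause.getMessage(), cause);
        }
    }

    /** Rethrows read failures of the underlying stream as StreamFailure. */
    private static final class TaggingStream extends FilterInputStream {
        TaggingStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            try {
                return super.read();
            } catch (IOException e) {
                throw new StreamFailure(e);
            }
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            try {
                return super.read(b, off, len);
            } catch (IOException e) {
                throw new StreamFailure(e);
            }
        }
    }

    static void parse(Component component, InputStream stream)
            throws IOException, ParseFailure {
        try {
            component.parse(new TaggingStream(stream));
        } catch (StreamFailure e) {
            // The caller's stream failed: pass the original error through.
            throw (IOException) e.getCause();
        } catch (IOException e) {
            // I/O error unrelated to the caller's stream: wrap it.
            throw new ParseFailure("Unrelated I/O failure in parser", e);
        } catch (RuntimeException e) {
            // Unexpected parser bug: wrap it as well.
            throw new ParseFailure("Unexpected parser failure", e);
        }
    }
}
```

The point of the design is that callers see IOException only for problems with the stream they supplied, while parser-internal failures arrive as a single checked exception type.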


Copyright © 2007-2011 The Apache Software Foundation. All Rights Reserved.