Class CompositeParser

All Implemented Interfaces:
Serializable, Parser
Direct Known Subclasses:
AutoDetectParser, CompositeExternalParser, DefaultParser

public class CompositeParser extends AbstractParser
Composite parser that delegates parsing tasks to a component parser based on the declared content type of the incoming document. A fallback parser is defined for cases where a parser for the given content type is not available.
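The delegation described above can be illustrated with a minimal, standard-library-only sketch. It is not Tika's implementation: the class name CompositeDispatchSketch, the use of plain strings for content types, and the Function-based handlers are all hypothetical stand-ins for MediaType and Parser.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative model only; every name here is an assumption, not part of Tika.
final class CompositeDispatchSketch {
    // Component handlers keyed by declared content type (stand-in for Map<MediaType,Parser>).
    private final Map<String, Function<String, String>> handlers = new HashMap<>();
    // Handler used when no component is registered for the declared type.
    private final Function<String, String> fallback;

    CompositeDispatchSketch(Function<String, String> fallback) {
        this.fallback = fallback;
    }

    void register(String contentType, Function<String, String> handler) {
        handlers.put(contentType, handler);
    }

    // Delegates to the handler registered for the declared type, else to the fallback.
    String parse(String declaredContentType, String document) {
        return handlers.getOrDefault(declaredContentType, fallback).apply(document);
    }
}
```

In this sketch a document declared as a registered type goes to that type's handler, and any other declared type goes to the fallback.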
  • Constructor Details

  • Method Details

    • getParsers

      public Map<MediaType,Parser> getParsers(ParseContext context)
    • findDuplicateParsers

      public Map<MediaType,List<Parser>> findDuplicateParsers(ParseContext context)
      Utility method that goes through all the component parsers and finds all media types for which more than one parser declares support. This is useful in tracking down conflicting parser definitions.
      Parameters:
      context - parsing context
      Returns:
      media types that are supported by at least two component parsers
      Since:
      Apache Tika 0.10
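A rough sketch of the duplicate search, using only the standard library. It assumes each component parser can be modeled as a name mapped to the set of type strings it claims; the class and method names below are hypothetical and do not reproduce Tika's code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative model only; names and string-based types are assumptions.
final class DuplicateTypeSketch {
    // Input: parser name -> media types it declares support for.
    // Output: media type -> names of all parsers claiming it, for types claimed at least twice.
    static Map<String, List<String>> findDuplicates(Map<String, Set<String>> typesByParser) {
        Map<String, List<String>> parsersByType = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> entry : typesByParser.entrySet()) {
            for (String type : entry.getValue()) {
                parsersByType.computeIfAbsent(type, t -> new ArrayList<>()).add(entry.getKey());
            }
        }
        // Keep only the types that more than one parser declares.
        parsersByType.values().removeIf(names -> names.size() < 2);
        return parsersByType;
    }
}
```

The result groups every conflicting parser under the contested type, which is the shape a caller would need to decide which definition to keep.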
    • getMediaTypeRegistry

      public MediaTypeRegistry getMediaTypeRegistry()
      Returns the media type registry used to infer type relationships.
      Returns:
      media type registry
      Since:
      Apache Tika 0.8
    • setMediaTypeRegistry

      public void setMediaTypeRegistry(MediaTypeRegistry registry)
      Sets the media type registry used to infer type relationships.
      Parameters:
      registry - media type registry
      Since:
      Apache Tika 0.8
    • getAllComponentParsers

      public List<Parser> getAllComponentParsers()
      Returns all parsers registered with this composite parser, including any that are not currently active. The fallback parser, if defined, is not included.
    • getParsers

      public Map<MediaType,Parser> getParsers()
      Returns the component parsers.
      Returns:
      component parsers, keyed by media type
    • setParsers

      public void setParsers(Map<MediaType,Parser> parsers)
      Sets the component parsers.
      Parameters:
      parsers - component parsers, keyed by media type
    • getFallback

      public Parser getFallback()
      Returns the fallback parser.
      Returns:
      fallback parser
    • setFallback

      public void setFallback(Parser fallback)
      Sets the fallback parser.
      Parameters:
      fallback - fallback parser
    • getParser

      protected Parser getParser(Metadata metadata)
      Returns the parser that best matches the given metadata. By default looks for a parser that matches the content type metadata property, and uses the fallback parser if a better match is not found. The type hierarchy information included in the configured media type registry is used when looking for a matching parser instance.

      Subclasses can override this method to provide more accurate parser resolution.

      Parameters:
      metadata - document metadata
      Returns:
      matching parser
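The lookup described above (try the declared type, then climb the type hierarchy, then fall back) might be sketched as follows. This is a standard-library model under stated assumptions: types and parsers are plain strings, the supertype map stands in for the media type registry, and the class name is hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative model only; not Tika's resolution code.
final class HierarchyLookupSketch {
    // Returns the parser for the declared type or its nearest registered supertype,
    // or the fallback when no ancestor has a parser.
    static String resolve(String declaredType,
                          Map<String, String> parsersByType,
                          Map<String, String> supertypeOf,
                          String fallback) {
        Set<String> visited = new HashSet<>(); // guards against cycles in the supertype map
        String type = declaredType;
        while (type != null && visited.add(type)) {
            String parser = parsersByType.get(type);
            if (parser != null) {
                return parser;
            }
            type = supertypeOf.get(type); // climb one level in the hierarchy
        }
        return fallback;
    }
}
```

With an illustrative hierarchy in which image/svg+xml has supertype application/xml, a document declared as SVG resolves to the XML parser when no SVG-specific parser is registered.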
    • getParser

      protected Parser getParser(Metadata metadata, ParseContext context)
    • getSupportedTypes

      public Set<MediaType> getSupportedTypes(ParseContext context)
      Description copied from interface: Parser
      Returns the set of media types supported by this parser when used with the given parse context.
      Parameters:
      context - parse context
      Returns:
      immutable set of media types
    • parse

      public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
      Delegates the call to the matching component parser.

      Potential RuntimeExceptions, IOExceptions and SAXExceptions unrelated to the given input stream and content handler are automatically wrapped into TikaExceptions to better honor the Parser contract.

      Parameters:
      stream - the document stream (input)
      handler - handler for the XHTML SAX events (output)
      metadata - document metadata (input and output)
      context - parse context
      Throws:
      IOException - if the document stream could not be read
      SAXException - if the SAX events could not be processed
      TikaException - if the document could not be parsed
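The wrapping behavior can be illustrated with a small standard-library sketch. It assumes a simplified model in which errors from the caller's own stream carry a marker type; ParseFailure, StreamIOException, and Delegate are hypothetical stand-ins and do not show how Tika itself distinguishes the two cases.

```java
import java.io.IOException;

// Illustrative model only; all types here are assumptions, not Tika classes.
final class ExceptionWrappingSketch {
    // Stand-in for TikaException.
    static final class ParseFailure extends Exception {
        ParseFailure(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Marker for I/O errors that originate from the caller's input stream.
    static final class StreamIOException extends IOException {
        StreamIOException(String message) {
            super(message);
        }
    }

    // Stand-in for the call into the matching component parser.
    interface Delegate {
        void run() throws IOException;
    }

    static void parse(Delegate delegate) throws IOException, ParseFailure {
        try {
            delegate.run();
        } catch (StreamIOException e) {
            throw e; // related to the given stream: propagate unchanged
        } catch (IOException e) {
            throw new ParseFailure("I/O error unrelated to the input stream", e);
        } catch (RuntimeException e) {
            throw new ParseFailure("Unexpected runtime failure in component parser", e);
        }
    }
}
```

Callers of such a method see stream problems as IOException and everything else from the component as a single checked parse failure, with the original exception preserved as the cause.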