org.apache.tika.extractor
Class ParserContainerExtractor

java.lang.Object
  extended by org.apache.tika.extractor.ParserContainerExtractor
All Implemented Interfaces:
java.io.Serializable, ContainerExtractor

public class ParserContainerExtractor
extends java.lang.Object
implements ContainerExtractor

An implementation of ContainerExtractor powered by the regular Parser classes. This allows you to easily extract all the embedded resources from within container files, whilst using the normal parsers to do the work. By default the AutoDetectParser is used, to allow extraction from the widest range of containers.

See Also:
Serialized Form

Constructor Summary
ParserContainerExtractor()
           
ParserContainerExtractor(Parser parser, Detector detector)
           
ParserContainerExtractor(TikaConfig config)
           
 
Method Summary
 void extract(TikaInputStream stream, ContainerExtractor recurseExtractor, EmbeddedResourceHandler handler)
          Processes a container file, and extracts all the embedded resources from within it.
 boolean isSupported(TikaInputStream input)
          Is this Container Extractor able to process the supplied container?
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ParserContainerExtractor

public ParserContainerExtractor()

ParserContainerExtractor

public ParserContainerExtractor(TikaConfig config)

ParserContainerExtractor

public ParserContainerExtractor(Parser parser,
                                Detector detector)

Method Detail

isSupported

public boolean isSupported(TikaInputStream input)
                    throws java.io.IOException
Description copied from interface: ContainerExtractor
Is this Container Extractor able to process the supplied container?

Specified by:
isSupported in interface ContainerExtractor
Throws:
java.io.IOException

extract

public void extract(TikaInputStream stream,
                    ContainerExtractor recurseExtractor,
                    EmbeddedResourceHandler handler)
             throws java.io.IOException,
                    TikaException
Description copied from interface: ContainerExtractor
Processes a container file, and extracts all the embedded resources from within it.

The EmbeddedResourceHandler you supply will be called for each embedded resource in the container. It is up to you whether you process the contents of the resource or not.

The given document stream is consumed but not closed by this method. The responsibility for closing the stream remains with the caller.

If required, nested containers (such as a .docx within a .zip) can automatically be recursed into and processed inline. If no recurseExtractor is given, nested containers are treated like any other embedded resource.

Specified by:
extract in interface ContainerExtractor
Parameters:
stream - the document stream (input)
recurseExtractor - the extractor to use on any embedded containers
handler - handler for the embedded files (output)
Throws:
java.io.IOException - if the document stream could not be read
TikaException - if the container could not be parsed
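Example (illustrative only): the recursion contract described for extract can be sketched with nothing but the JDK. The block below does not use Tika. ResourceHandler, Extractor, RecursionSketch and the helper methods are hypothetical stand-ins invented for this sketch; it handles ZIP input only, and it recognises nested containers by a ".zip" file name suffix, which is a simplification assumed here rather than how Tika detects types. It shows the three behaviours stated above: the handler is called once per embedded resource, a null recurseExtractor leaves nested containers as ordinary resources, and the stream is read but closed only by the caller.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class RecursionSketch {

    /** Hypothetical stand-in for EmbeddedResourceHandler: one callback per resource. */
    interface ResourceHandler {
        void handle(String name, InputStream content) throws IOException;
    }

    /** Hypothetical stand-in for ContainerExtractor, limited to ZIP input. */
    interface Extractor {
        void extract(InputStream stream, Extractor recurseExtractor, ResourceHandler handler)
                throws IOException;
    }

    /** Reads the ZIP stream to its end but never closes it; closing is the caller's job. */
    static final Extractor ZIP = new Extractor() {
        @Override
        public void extract(InputStream stream, Extractor recurseExtractor, ResourceHandler handler)
                throws IOException {
            ZipInputStream zip = new ZipInputStream(stream);
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                byte[] data = readEntry(zip);
                // Simplification: nested containers are recognised by file name only.
                boolean nested = entry.getName().endsWith(".zip");
                if (nested && recurseExtractor != null) {
                    // Recurse: the nested container's resources are reported inline.
                    recurseExtractor.extract(new ByteArrayInputStream(data), recurseExtractor, handler);
                } else {
                    // No recursion: the nested container is just another resource.
                    handler.handle(entry.getName(), new ByteArrayInputStream(data));
                }
            }
        }
    };

    /** Buffers the current ZIP entry (read returns -1 at the end of the entry). */
    static byte[] readEntry(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Builds an in-memory ZIP with the given entry names and contents. */
    static byte[] zipOf(String[] names, byte[][] contents) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (int i = 0; i < names.length; i++) {
                zip.putNextEntry(new ZipEntry(names[i]));
                zip.write(contents[i]);
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Collects the names passed to the handler; the caller opens and closes the stream. */
    static List<String> listNames(byte[] container, boolean recurse) {
        final List<String> names = new ArrayList<String>();
        try (InputStream in = new ByteArrayInputStream(container)) {
            ZIP.extract(in, recurse ? ZIP : null, new ResourceHandler() {
                @Override
                public void handle(String name, InputStream content) {
                    names.add(name);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** An outer ZIP holding readme.txt and inner.zip, which in turn holds a.txt. */
    static byte[] sampleContainer() {
        byte[] text = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] inner = zipOf(new String[] {"a.txt"}, new byte[][] {text});
        return zipOf(new String[] {"readme.txt", "inner.zip"}, new byte[][] {text, inner});
    }

    public static void main(String[] args) {
        byte[] outer = sampleContainer();
        System.out.println(listNames(outer, false)); // [readme.txt, inner.zip]
        System.out.println(listNames(outer, true));  // [readme.txt, a.txt]
    }
}
```

Passing the extractor to itself as recurseExtractor, as listNames does when recurse is true, lets the recursion continue to any depth. With real Tika classes the same choice would presumably be made through the recurseExtractor argument of extract, either null or an extractor instance.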


Copyright © 2007-2010 The Apache Software Foundation. All Rights Reserved.