org.apache.tika.parser.external
Class ExternalParser

java.lang.Object
  extended by org.apache.tika.parser.AbstractParser
      extended by org.apache.tika.parser.external.ExternalParser
All Implemented Interfaces:
java.io.Serializable, Parser

public class ExternalParser
extends AbstractParser

Parser that uses an external program (like catdoc or pdf2txt) to extract text content and metadata from a given document.

See Also:
Serialized Form

Field Summary
static java.lang.String INPUT_FILE_TOKEN
          Token that, if present in the command string, is replaced with the input filename.
static java.lang.String OUTPUT_FILE_TOKEN
          Token that, if present in the command string, is replaced with the output filename.
 
Constructor Summary
ExternalParser()
           
 
Method Summary
static boolean check(java.lang.String[] checkCmd, int... errorValue)
           
static boolean check(java.lang.String checkCmd, int... errorValue)
          Checks whether the command can be run.
 java.lang.String[] getCommand()
           
 java.util.Map<java.util.regex.Pattern,java.lang.String> getMetadataExtractionPatterns()
           
 java.util.Set<MediaType> getSupportedTypes()
           
 java.util.Set<MediaType> getSupportedTypes(ParseContext context)
          Returns the set of media types supported by this parser when used with the given parse context.
 void parse(java.io.InputStream stream, org.xml.sax.ContentHandler handler, Metadata metadata, ParseContext context)
          Executes the configured external command on the given document stream and passes the extracted content as a simple XHTML document to the given SAX content handler.
 void setCommand(java.lang.String... command)
          Sets the command to be run.
 void setMetadataExtractionPatterns(java.util.Map<java.util.regex.Pattern,java.lang.String> patterns)
          Sets the map of regular expression patterns and Metadata keys.
 void setSupportedTypes(java.util.Set<MediaType> supportedTypes)
           
 
Methods inherited from class org.apache.tika.parser.AbstractParser
parse
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

INPUT_FILE_TOKEN

public static final java.lang.String INPUT_FILE_TOKEN
Token that, if present in the command string, is replaced with the input filename. Alternatively, the input data can be streamed over STDIN.

See Also:
Constant Field Values

OUTPUT_FILE_TOKEN

public static final java.lang.String OUTPUT_FILE_TOKEN
Token that, if present in the command string, is replaced with the output filename. Alternatively, the output data can be collected on STDOUT.

See Also:
Constant Field Values
Constructor Detail

ExternalParser

public ExternalParser()
Method Detail

getSupportedTypes

public java.util.Set<MediaType> getSupportedTypes(ParseContext context)
Description copied from interface: Parser
Returns the set of media types supported by this parser when used with the given parse context.

Parameters:
context - parse context
Returns:
immutable set of media types

getSupportedTypes

public java.util.Set<MediaType> getSupportedTypes()

setSupportedTypes

public void setSupportedTypes(java.util.Set<MediaType> supportedTypes)

getCommand

public java.lang.String[] getCommand()

setCommand

public void setCommand(java.lang.String... command)
Sets the command to be run. The command may include INPUT_FILE_TOKEN, OUTPUT_FILE_TOKEN, or both if the external program needs filenames.

See Also:
Runtime.exec(String[])
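
As a rough illustration of the token substitution described above, the following standalone sketch replaces both tokens in a command array before it would be handed to Runtime.exec(String[]). It is not Tika's implementation: the class name CommandTokenSketch, the method substitute, and the literal token values are assumptions made for the example. Real code should reference ExternalParser.INPUT_FILE_TOKEN and ExternalParser.OUTPUT_FILE_TOKEN rather than hard-coded strings.

```java
import java.util.Arrays;

final class CommandTokenSketch {
    // Assumed placeholder values, for illustration only. Real code should use
    // ExternalParser.INPUT_FILE_TOKEN and ExternalParser.OUTPUT_FILE_TOKEN.
    static final String INPUT_FILE_TOKEN = "${INPUT}";
    static final String OUTPUT_FILE_TOKEN = "${OUTPUT}";

    // Returns a copy of the command in which every occurrence of either
    // token is replaced with the corresponding file path.
    static String[] substitute(String[] command, String inputPath, String outputPath) {
        String[] resolved = new String[command.length];
        for (int i = 0; i < command.length; i++) {
            resolved[i] = command[i]
                    .replace(INPUT_FILE_TOKEN, inputPath)
                    .replace(OUTPUT_FILE_TOKEN, outputPath);
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Hypothetical command template and paths.
        String[] template = {"catdoc", INPUT_FILE_TOKEN};
        String[] resolved = substitute(template, "/tmp/in.doc", "/tmp/out.txt");
        System.out.println(Arrays.toString(resolved));
    }
}
```

A command that contains neither token is returned unchanged, which corresponds to the case where data is exchanged over STDIN and STDOUT instead of files.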

getMetadataExtractionPatterns

public java.util.Map<java.util.regex.Pattern,java.lang.String> getMetadataExtractionPatterns()

setMetadataExtractionPatterns

public void setMetadataExtractionPatterns(java.util.Map<java.util.regex.Pattern,java.lang.String> patterns)
Sets the map from regular expression patterns to Metadata keys. When a pattern matches, the Metadata entry under its associated key is set. Pass null to disable metadata extraction.
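
The pattern-to-key mapping can be pictured with the standalone sketch below. It is only an approximation under stated assumptions: a plain Map<String,String> stands in for Tika's Metadata class, patterns are applied line by line, and the first capture group is taken as the value. The class name MetadataPatternSketch and the method extract are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetadataPatternSketch {
    // Applies each pattern to each line of the command output. On a match,
    // stores the first capture group under the key mapped to that pattern.
    // A null pattern map disables extraction and yields an empty result.
    static Map<String, String> extract(Map<Pattern, String> patterns, String output) {
        Map<String, String> metadata = new LinkedHashMap<>();
        if (patterns == null) {
            return metadata;
        }
        for (String line : output.split("\\R")) {
            for (Map.Entry<Pattern, String> entry : patterns.entrySet()) {
                Matcher matcher = entry.getKey().matcher(line);
                // Assumption: the value of interest is capture group 1.
                if (matcher.find() && matcher.groupCount() >= 1) {
                    metadata.put(entry.getValue(), matcher.group(1));
                }
            }
        }
        return metadata;
    }

    public static void main(String[] args) {
        Map<Pattern, String> patterns = new LinkedHashMap<>();
        patterns.put(Pattern.compile("^Title:\\s*(.*)$"), "title"); // hypothetical key
        System.out.println(extract(patterns, "Title: Report\nPages: 3\n"));
    }
}
```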


parse

public void parse(java.io.InputStream stream,
                  org.xml.sax.ContentHandler handler,
                  Metadata metadata,
                  ParseContext context)
           throws java.io.IOException,
                  org.xml.sax.SAXException,
                  TikaException
Executes the configured external command on the given document stream and passes the extracted content as a simple XHTML document to the given SAX content handler. Metadata is extracted only if patterns have been set with setMetadataExtractionPatterns(Map).

Parameters:
stream - the document stream (input)
handler - handler for the XHTML SAX events (output)
metadata - document metadata (input and output)
context - parse context
Throws:
java.io.IOException - if the document stream could not be read
org.xml.sax.SAXException - if the SAX events could not be processed
TikaException - if the document could not be parsed
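
To make "a simple XHTML document" more concrete, the sketch below shows one way text produced by an external program could be reported to a SAX ContentHandler. It uses only the JDK's org.xml.sax classes and is not taken from Tika: the class name XhtmlEventSketch, the method emit, and the choice of one p element per line of text are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

final class XhtmlEventSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Reads plain text and reports it to the handler as html/body/p events.
    static void emit(Reader text, ContentHandler handler) throws IOException, SAXException {
        AttributesImpl noAttributes = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML, "html", "html", noAttributes);
        handler.startElement(XHTML, "body", "body", noAttributes);
        BufferedReader reader = new BufferedReader(text);
        String line;
        while ((line = reader.readLine()) != null) {
            // Illustrative choice: wrap each line of text in its own paragraph.
            handler.startElement(XHTML, "p", "p", noAttributes);
            char[] chars = line.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endElement(XHTML, "p", "p");
        }
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }

    public static void main(String[] args) throws Exception {
        // A handler that prints only the character data it receives.
        emit(new StringReader("first line\nsecond line"), new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                System.out.println(new String(ch, start, length));
            }
        });
    }
}
```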

check

public static boolean check(java.lang.String checkCmd,
                            int... errorValue)
Checks whether the command can be run. Typically used with something like "myapp --version" to verify that "myapp" is installed and on the path.

Parameters:
checkCmd - The check command to run
errorValue - exit values that should be treated as errors
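
The kind of availability check described above can be sketched with the JDK's ProcessBuilder. This is a hedged approximation rather than Tika's code: the class name CommandCheckSketch, the method canRun, and the default error value of 127 (the exit status many shells use for "command not found") are assumptions.

```java
import java.io.IOException;
import java.io.InputStream;

final class CommandCheckSketch {
    // Returns true if the command starts and exits with a value that is not
    // listed in errorValues; returns false if it cannot be started at all.
    static boolean canRun(String[] checkCmd, int... errorValues) {
        if (errorValues.length == 0) {
            errorValues = new int[] {127}; // assumed default error value
        }
        try {
            Process process = new ProcessBuilder(checkCmd)
                    .redirectErrorStream(true)
                    .start();
            // Drain the output so the child process cannot block on a full pipe.
            try (InputStream out = process.getInputStream()) {
                byte[] buffer = new byte[4096];
                while (out.read(buffer) != -1) {
                    // Output is discarded; only the exit value matters here.
                }
            }
            int exitValue = process.waitFor();
            for (int error : errorValues) {
                if (exitValue == error) {
                    return false;
                }
            }
            return true;
        } catch (IOException e) {
            // Typically means the program is not installed or not on the path.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // "myapp" is the hypothetical program name used in the description above.
        System.out.println(canRun(new String[] {"myapp", "--version"}));
    }
}
```

With ExternalParser itself, the equivalent call would be ExternalParser.check("myapp --version") or the String[] overload, with any additional error values passed as trailing arguments.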

check

public static boolean check(java.lang.String[] checkCmd,
                            int... errorValue)


Copyright © 2007-2011 The Apache Software Foundation. All Rights Reserved.