org.apache.tika.parser.pdf.PDFParser

All Implemented Interfaces: Serializable, SelfConfiguring, Parser, RenderingParser

public class PDFParser extends Object implements Parser, RenderingParser

PDF parser.

This parser can also process encrypted PDF documents if the required password is given as part of the input metadata associated with a document. If no password is given, the parser tries to decrypt the document using the empty password that is often used with PDFs. If the PDF contains any embedded documents (for example, as part of a PDF package), the parser uses the EmbeddedDocumentExtractor to handle them.
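The password fallback described above can be sketched in plain Java. This is only an illustration of the rule, not PDFParser's actual code: the metadata is modeled as a `Map`, and the key name `PASSWORD_KEY` and the method `resolvePassword` are hypothetical.

```java
import java.util.Map;

final class PasswordFallbackSketch {
    // Hypothetical metadata key, used only in this sketch.
    static final String PASSWORD_KEY = "password";

    // Returns the supplied password if present, otherwise the empty password.
    static String resolvePassword(Map<String, String> metadata) {
        String supplied = metadata.get(PASSWORD_KEY);
        return supplied != null ? supplied : "";
    }

    public static void main(String[] args) {
        // A password supplied with the document metadata is used as-is.
        System.out.println("[" + resolvePassword(Map.of(PASSWORD_KEY, "s3cret")) + "]");
        // Without a password, the sketch falls back to the empty string.
        System.out.println("[" + resolvePassword(Map.of()) + "]");
    }
}
```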

As of Tika 1.6, it is possible to extract inline images with the EmbeddedDocumentExtractor as if they were regular attachments. By default, this feature is turned off because of the potentially enormous number and size of inline images. To turn this feature on, see PDFParserConfig.setExtractInlineImages(boolean).

Please note that many PDFs do not store table structures, so you should not expect table markup for what looks like a table. It takes significant computation to identify and then correctly extract tables from PDFs. As of this writing, the PDFParser extracts text within tables, but it does not compute table cell boundaries or table row boundaries. Please see tabula for one project that tries to maintain the structure of tables represented in PDFs. If your PDFs contain marked content or tags, consider PDFParserConfig.setExtractMarkedContent(boolean).

See Also:

Serialized Form

Field Summary

Fields

Modifier and Type

Field

Description

static final MediaType

MEDIA_TYPE
Constructor Summary

Constructors

Constructor

Description

PDFParser()

PDFParser(JsonConfig jsonConfig)

Constructor for JSON configuration.

PDFParser(PDFParserConfig config)

Constructor with explicit PDFParserConfig object.
Method Summary

Modifier and Type

Method

Description

PDFParserConfig

getDefaultConfig()

protected org.apache.pdfbox.pdmodel.PDDocument

getPDDocument(Path path, String password, org.apache.pdfbox.io.RandomAccessStreamCache.StreamCacheCreateFunction streamCacheCreateFunction, Metadata metadata, ParseContext parseContext)

protected org.apache.pdfbox.pdmodel.PDDocument

getPDDocument(TikaInputStream tis, String password, org.apache.pdfbox.io.RandomAccessStreamCache.StreamCacheCreateFunction streamCacheCreateFunction, Metadata metadata, ParseContext context)

protected org.apache.pdfbox.pdmodel.PDDocument

getPDDocumentFromStream(InputStream inputStream, String password, org.apache.pdfbox.io.RandomAccessStreamCache.StreamCacheCreateFunction streamCacheCreateFunction, Metadata metadata, ParseContext parseContext)

PDFParserConfig

getPDFParserConfig()

Renderer

getRenderer()

Set<MediaType>

getSupportedTypes(ParseContext context)

Returns the set of media types supported by this parser when used with the given parse context.

void

parse(TikaInputStream tis, ContentHandler handler, Metadata metadata, ParseContext context)

Parses a document stream into a sequence of XHTML SAX events.

void

setRenderer(Renderer renderer)

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- MEDIA_TYPE
  
  public static final MediaType MEDIA_TYPE
Constructor Details
- PDFParser
  
  public PDFParser()
- PDFParser
  
  public PDFParser(PDFParserConfig config)
  
  Constructor with explicit PDFParserConfig object.
  
  Parameters:
  
  config - the configuration
- PDFParser
  
  public PDFParser(JsonConfig jsonConfig)
  
  Constructor for JSON configuration. Requires Jackson on the classpath.
  
  Parameters:
  
  jsonConfig - JSON configuration
Method Details
- getSupportedTypes
  
  public Set<MediaType> getSupportedTypes(ParseContext context)
  
  Description copied from interface: Parser
  
  Returns the set of media types supported by this parser when used with the given parse context.
  
  Specified by:
  
  getSupportedTypes in interface Parser
  
  Parameters:
  
  context - parse context
  
  Returns:
  
  immutable set of media types
- parse
  
  public void parse(TikaInputStream tis, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
  
  Description copied from interface: Parser
  
  Parses a document stream into a sequence of XHTML SAX events. Fills in related document metadata in the given metadata object.
  The given document stream is consumed but not closed by this method. The responsibility to close the stream remains on the caller.
  Information about the parsing context can be passed in the context parameter. See the parser implementations for the kinds of context information they expect.
  
  Specified by:
  
  parse in interface Parser
  
  Parameters:
  
  handler - handler for the XHTML SAX events (output)
  
  metadata - document metadata (input and output)
  
  context - parse context
  
  Throws:
  
  IOException - if the document stream could not be read
  
  SAXException - if the SAX events could not be processed
  
  TikaException - if the document could not be parsed
- getPDDocument
  
  protected org.apache.pdfbox.pdmodel.PDDocument getPDDocument(TikaInputStream tis, String password, org.apache.pdfbox.io.RandomAccessStreamCache.StreamCacheCreateFunction streamCacheCreateFunction, Metadata metadata, ParseContext context) throws IOException, EncryptedDocumentException
  
  Throws:
  
  IOException
  
  EncryptedDocumentException
- getPDDocumentFromStream
  
  protected org.apache.pdfbox.pdmodel.PDDocument getPDDocumentFromStream(InputStream inputStream, String password, org.apache.pdfbox.io.RandomAccessStreamCache.StreamCacheCreateFunction streamCacheCreateFunction, Metadata metadata, ParseContext parseContext) throws IOException
  
  Throws:
  
  IOException
- getPDDocument
  
  protected org.apache.pdfbox.pdmodel.PDDocument getPDDocument(Path path, String password, org.apache.pdfbox.io.RandomAccessStreamCache.StreamCacheCreateFunction streamCacheCreateFunction, Metadata metadata, ParseContext parseContext) throws IOException
  
  Throws:
  
  IOException
- getPDFParserConfig
  
  public PDFParserConfig getPDFParserConfig()
- getDefaultConfig
  
  public PDFParserConfig getDefaultConfig()
- setRenderer
  
  public void setRenderer(Renderer renderer)
  
  Specified by:
  
  setRenderer in interface RenderingParser
- getRenderer
  
  public Renderer getRenderer()
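
The contract described for parse (the stream is consumed but not closed, and output arrives as SAX events) can be illustrated with JDK classes alone. The sketch below is an assumption-laden stand-in, not Tika code: `ParseContractSketch.parse` is a hypothetical method that mimics the contract, `TrackingStream` records whether close() was called, and `TextCollector` is a simple handler that accumulates character events.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

final class ParseContractSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** In-memory stream that records whether close() was called. */
    static final class TrackingStream extends ByteArrayInputStream {
        boolean closed;
        TrackingStream(byte[] data) { super(data); }
        @Override public void close() throws IOException { closed = true; super.close(); }
    }

    /** Handler that accumulates the text carried by character events. */
    static final class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        @Override public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Hypothetical stand-in for a parser: reads the whole stream, emits SAX events, never closes. */
    static void parse(InputStream stream, ContentHandler handler) throws IOException, SAXException {
        char[] chars = new String(stream.readAllBytes(), StandardCharsets.UTF_8).toCharArray();
        handler.startDocument();
        handler.startElement(XHTML, "p", "p", new AttributesImpl());
        handler.characters(chars, 0, chars.length);
        handler.endElement(XHTML, "p", "p");
        handler.endDocument();
    }

    /** Runs the stand-in parser; the caller, not the parser, closes the stream. */
    static String demo() {
        TextCollector collector = new TextCollector();
        TrackingStream stream = new TrackingStream("hello".getBytes(StandardCharsets.UTF_8));
        boolean closedByParse = false;
        try (stream) {
            parse(stream, collector);
            closedByParse = stream.closed; // still false: parse left the stream open
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return "text=" + collector.text + " closedByParse=" + closedByParse
                + " closedByCaller=" + stream.closed;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "text=hello closedByParse=false closedByCaller=true"
    }
}
```

With the real parser, the same pattern would apply: open the stream in a try-with-resources block in the calling code and pass a handler that collects or forwards the events.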
