org.apache.tika.parser.pdf.PDFParser

All Implemented Interfaces: Serializable, Initializable, Parser, RenderingParser

public class PDFParser extends Object implements Parser, RenderingParser, Initializable

PDF parser.

This parser can also process encrypted PDF documents if the required password is given as part of the input metadata associated with a document. If no password is given, this parser tries to decrypt the document using the empty password that is often used with PDFs. If the PDF contains any embedded documents (for example, as part of a PDF package), this parser uses the EmbeddedDocumentExtractor to handle them.
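
The password fallback described above can be pictured with a small, self-contained sketch. It is not Tika's implementation: the metadata is modeled as a plain map, and the key name `password` is hypothetical, chosen only for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class PasswordFallbackSketch {

    // Hypothetical metadata key used only in this sketch; not a Tika constant.
    static final String PASSWORD_KEY = "password";

    /** Returns the supplied password, or the empty password when none was given. */
    static String resolvePassword(Map<String, String> metadata) {
        String supplied = metadata.get(PASSWORD_KEY);
        return supplied != null ? supplied : "";
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        // No password supplied: fall back to the empty password.
        System.out.println("[" + resolvePassword(metadata) + "]"); // prints "[]"

        metadata.put(PASSWORD_KEY, "s3cret");
        // Password supplied: use it as given.
        System.out.println("[" + resolvePassword(metadata) + "]"); // prints "[s3cret]"
    }
}
```

The point of the sketch is only the order of preference: a supplied password wins, and the empty password is the default attempt.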

As of Tika 1.6, it is possible to extract inline images with the EmbeddedDocumentExtractor as if they were regular attachments. By default, this feature is turned off because of the potentially enormous number and size of inline images. To turn this feature on, see PDFParserConfig.setExtractInlineImages(boolean).

Please note that many PDFs do not store table structures, so you should not expect table markup for what looks like a table. It takes significant computation to identify and then correctly extract tables from PDFs. As of this writing, the PDFParser extracts text within tables, but it does not compute table cell boundaries or table row boundaries. Please see tabula for one project that tries to maintain the structure of tables represented in PDFs. If your PDFs contain marked content or tags, consider PDFParserConfig.setExtractMarkedContent(boolean).

See Also:

Serialized Form

Field Summary

Fields

| Modifier and Type | Field | Description |
| --- | --- | --- |
| `static final MediaType` | `MEDIA_TYPE` | |

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| `PDFParser()` | |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| `void` | `checkInitialization(InitializableProblemHandler handler)` | |
| `float` | `getAverageCharTolerance()` | |
| `float` | `getDropThreshold()` | |
| `ImageGraphicsEngineFactory` | `getImageGraphicsEngineFactory()` | |
| `String` | `getImageStrategy()` | |
| `int` | `getMaxIncrementalUpdates()` | |
| `long` | `getMaxMainMemoryBytes()` | |
| `int` | `getOcrDPI()` | |
| `String` | `getOcrImageFormatName()` | |
| `float` | `getOcrImageQuality()` | |
| `String` | `getOcrImageType()` | |
| `String` | `getOcrRenderingStrategy()` | |
| `String` | `getOcrStrategy()` | |
| `String` | `getOcrStrategyAuto()` | |
| `protected org.apache.pdfbox.pdmodel.PDDocument` | `getPDDocument(InputStream inputStream, String password, org.apache.pdfbox.io.RandomAccessStreamCache.StreamCacheCreateFunction streamCacheCreateFunction, Metadata metadata, ParseContext parseContext)` | |
| `protected org.apache.pdfbox.pdmodel.PDDocument` | `getPDDocument(InputStream stream, TikaInputStream tstream, String password, org.apache.pdfbox.io.RandomAccessStreamCache.StreamCacheCreateFunction streamCacheCreateFunction, Metadata metadata, ParseContext context)` | |
| `protected org.apache.pdfbox.pdmodel.PDDocument` | `getPDDocument(Path path, String password, org.apache.pdfbox.io.RandomAccessStreamCache.StreamCacheCreateFunction streamCacheCreateFunction, Metadata metadata, ParseContext parseContext)` | |
| `PDFParserConfig` | `getPDFParserConfig()` | |
| `Renderer` | `getRenderer()` | |
| `float` | `getSpacingTolerance()` | |
| `Set<MediaType>` | `getSupportedTypes(ParseContext context)` | Returns the set of media types supported by this parser when used with the given parse context. |
| `void` | `initialize(Map<String,Param> params)` | This is a no-op. |
| `boolean` | `isAllowExtractionForAccessibility()` | |
| `boolean` | `isCatchIntermediateExceptions()` | |
| `boolean` | `isDetectAngles()` | |
| `boolean` | `isEnableAutoSpace()` | |
| `boolean` | `isExtractAcroFormContent()` | |
| `boolean` | `isExtractActions()` | |
| `boolean` | `isExtractAnnotationText()` | If true, text in annotations will be extracted. |
| `boolean` | `isExtractBookmarksText()` | |
| `boolean` | `isExtractFontNames()` | |
| `boolean` | `isExtractIncrementalUpdateInfo()` | |
| `boolean` | `isExtractInlineImageMetadataOnly()` | |
| `boolean` | `isExtractInlineImages()` | |
| `boolean` | `isExtractMarkedContent()` | |
| `boolean` | `isExtractUniqueInlineImagesOnly()` | |
| `boolean` | `isIfXFAExtractOnlyXFA()` | |
| `boolean` | `isIgnoreContentStreamSpaceGlyphs()` | |
| `boolean` | `isParseIncrementalUpdates()` | |
| `boolean` | `isSetKCMS()` | |
| `boolean` | `isSortByPosition()` | |
| `boolean` | `isSuppressDuplicateOverlappingText()` | |
| `boolean` | `isThrowOnEncryptedPayload()` | |
| `void` | `parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)` | Parses a document stream into a sequence of XHTML SAX events. |
| `void` | `setAllowExtractionForAccessibility(boolean allowExtractionForAccessibility)` | |
| `void` | `setAverageCharTolerance(float averageCharTolerance)` | |
| `void` | `setCatchIntermediateExceptions(boolean catchIntermediateExceptions)` | |
| `void` | `setDetectAngles(boolean detectAngles)` | |
| `void` | `setDropThreshold(float dropThreshold)` | |
| `void` | `setEnableAutoSpace(boolean v)` | If true (the default), the parser should estimate where spaces should be inserted between words. |
| `void` | `setExtractAcroFormContent(boolean extractAcroFormContent)` | |
| `void` | `setExtractActions(boolean extractActions)` | |
| `void` | `setExtractAnnotationText(boolean v)` | If true (the default), text in annotations will be extracted. |
| `void` | `setExtractBookmarksText(boolean extractBookmarksText)` | |
| `void` | `setExtractFontNames(boolean extractFontNames)` | |
| `void` | `setExtractIncrementalUpdateInfo(boolean setExtractIncrementalUpdateInfo)` | Whether or not to scan a PDF for incremental updates. |
| `void` | `setExtractInlineImageMetadataOnly(boolean extractInlineImageMetadataOnly)` | |
| `void` | `setExtractInlineImages(boolean extractInlineImages)` | |
| `void` | `setExtractMarkedContent(boolean extractMarkedContent)` | |
| `void` | `setExtractUniqueInlineImagesOnly(boolean extractUniqueInlineImagesOnly)` | |
| `void` | `setIfXFAExtractOnlyXFA(boolean ifXFAExtractOnlyXFA)` | |
| `void` | `setIgnoreContentStreamSpaceGlyphs(boolean v)` | If true, the parser should ignore spaces in the content stream and rely purely on the algorithm to determine where word breaks are (PDFBOX-3774). |
| `void` | `setImageGraphicsEngineFactory(ImageGraphicsEngineFactory imageGraphicsEngineFactory)` | |
| `void` | `setImageStrategy(String imageStrategy)` | |
| `void` | `setMaxIncrementalUpdates(int maxIncrementalUpdates)` | Set the maximum number of incremental updates to parse. |
| `void` | `setMaxMainMemoryBytes(long maxMainMemoryBytes)` | |
| `void` | `setOcrDPI(int dpi)` | |
| `void` | `setOcrImageFormatName(String formatName)` | |
| `void` | `setOcrImageQuality(float imageQuality)` | |
| `void` | `setOcrImageType(String imageType)` | |
| `void` | `setOcrRenderingStrategy(String ocrRenderingStrategy)` | |
| `void` | `setOcrStrategy(String ocrStrategyString)` | |
| `void` | `setOcrStrategyAuto(String ocrStrategyAuto)` | |
| `void` | `setParseIncrementalUpdates(boolean parseIncrementalUpdates)` | If set to true, this will parse incremental updates if they exist within a PDF. |
| `void` | `setPDFParserConfig(PDFParserConfig config)` | |
| `void` | `setRenderer(Renderer renderer)` | |
| `void` | `setSetKCMS(boolean setKCMS)` | |
| `void` | `setSortByPosition(boolean v)` | If true, sort text tokens by their x/y position before extracting text. |
| `void` | `setSpacingTolerance(float spacingTolerance)` | |
| `void` | `setSuppressDuplicateOverlappingText(boolean v)` | If true, the parser should try to remove duplicated text over the same region. |
| `void` | `setThrowOnEncryptedPayload(boolean throwOnEncryptedPayload)` | If the file is a 'Collection' and contains an embedded file with a defined 'AssociatedFile' value of 'EncryptedPayload', then throw an EncryptedDocumentException. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- MEDIA_TYPE
  
  public static final MediaType MEDIA_TYPE
Constructor Details
- PDFParser
  
  public PDFParser()
Method Details
- getSupportedTypes
  
  public Set<MediaType> getSupportedTypes(ParseContext context)
  
  Description copied from interface: Parser
  
  Returns the set of media types supported by this parser when used with the given parse context.
  
  Specified by:
  
  getSupportedTypes in interface Parser
  
  Parameters:
  
  context - parse context
  
  Returns:
  
  immutable set of media types
- parse
  
  public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
  
  Description copied from interface: Parser
  
  Parses a document stream into a sequence of XHTML SAX events. Fills in related document metadata in the given metadata object.
  The given document stream is consumed but not closed by this method. The responsibility to close the stream remains on the caller.
  Information about the parsing context can be passed in the context parameter. See the parser implementations for the kinds of context information they expect.
  
  Specified by:
  
  parse in interface Parser
  
  Parameters:
  
  stream - the document stream (input)
  
  handler - handler for the XHTML SAX events (output)
  
  metadata - document metadata (input and output)
  
  context - parse context
  
  Throws:
  
  IOException - if the document stream could not be read
  
  SAXException - if the SAX events could not be processed
  
  TikaException - if the document could not be parsed
- getPDDocument
  
  protected org.apache.pdfbox.pdmodel.PDDocument getPDDocument(InputStream stream, TikaInputStream tstream, String password, org.apache.pdfbox.io.RandomAccessStreamCache.StreamCacheCreateFunction streamCacheCreateFunction, Metadata metadata, ParseContext context) throws IOException, EncryptedDocumentException
  
  Throws:
  
  IOException
  
  EncryptedDocumentException
- getPDDocument
  
  protected org.apache.pdfbox.pdmodel.PDDocument getPDDocument(InputStream inputStream, String password, org.apache.pdfbox.io.RandomAccessStreamCache.StreamCacheCreateFunction streamCacheCreateFunction, Metadata metadata, ParseContext parseContext) throws IOException
  
  Throws:
  
  IOException
- getPDDocument
  
  protected org.apache.pdfbox.pdmodel.PDDocument getPDDocument(Path path, String password, org.apache.pdfbox.io.RandomAccessStreamCache.StreamCacheCreateFunction streamCacheCreateFunction, Metadata metadata, ParseContext parseContext) throws IOException
  
  Throws:
  
  IOException
- getPDFParserConfig
  
  public PDFParserConfig getPDFParserConfig()
- setPDFParserConfig
  
  public void setPDFParserConfig(PDFParserConfig config)
- isEnableAutoSpace
  
  public boolean isEnableAutoSpace()
  See Also:
  
  setEnableAutoSpace(boolean)
- setEnableAutoSpace
  
  @Field public void setEnableAutoSpace(boolean v)
  
  If true (the default), the parser should estimate where spaces should be inserted between words. For many PDFs this is necessary as they do not include explicit whitespace characters.
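
  To make the idea of estimating spaces concrete, here is a minimal sketch that inserts a space whenever the horizontal gap between two glyphs exceeds a fraction of the average glyph width. This is an illustration only, not the algorithm PDFBox or Tika uses; the `Glyph` type, the `gapFactor` parameter, and the coordinates are all assumptions made for the example.

  ```java
  import java.util.Arrays;
  import java.util.List;

  final class AutoSpaceSketch {

      /** One glyph on a text line: its text, left edge, and width (hypothetical model). */
      static final class Glyph {
          final String text;
          final double x;
          final double width;

          Glyph(String text, double x, double width) {
              this.text = text;
              this.x = x;
              this.width = width;
          }
      }

      /**
       * Joins glyphs left to right, adding a space when the gap to the previous
       * glyph is larger than gapFactor times the average glyph width.
       */
      static String joinWithEstimatedSpaces(List<Glyph> line, double gapFactor) {
          if (line.isEmpty()) {
              return "";
          }
          double totalWidth = 0;
          for (Glyph g : line) {
              totalWidth += g.width;
          }
          double threshold = gapFactor * (totalWidth / line.size());

          StringBuilder out = new StringBuilder();
          Glyph previous = null;
          for (Glyph g : line) {
              if (previous != null && g.x - (previous.x + previous.width) > threshold) {
                  out.append(' ');
              }
              out.append(g.text);
              previous = g;
          }
          return out.toString();
      }

      public static void main(String[] args) {
          // No explicit space glyph exists; only the 5-unit gap before "c" hints at one.
          List<Glyph> line = Arrays.asList(
                  new Glyph("a", 0, 5), new Glyph("b", 5, 5),
                  new Glyph("c", 15, 5), new Glyph("d", 20, 5));
          System.out.println(joinWithEstimatedSpaces(line, 0.5)); // prints "ab cd"
      }
  }
  ```

  Without such a heuristic, the same line would come out as "abcd", which is why the setting defaults to true.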
- isExtractAnnotationText
  
  public boolean isExtractAnnotationText()
  
  If true, text in annotations will be extracted.
- setExtractAnnotationText
  
  @Field public void setExtractAnnotationText(boolean v)
  
  If true (the default), text in annotations will be extracted.
- isSuppressDuplicateOverlappingText
  
  public boolean isSuppressDuplicateOverlappingText()
  See Also:
  
  setSuppressDuplicateOverlappingText(boolean)
- setIgnoreContentStreamSpaceGlyphs
  
  @Field public void setIgnoreContentStreamSpaceGlyphs(boolean v)
  
  If true, the parser should ignore spaces in the content stream and rely purely on the algorithm to determine where word breaks are (PDFBOX-3774). This can improve text extraction results where the content stream is sorted by position and has text overlapping spaces, but could cause some word breaks to not be added to the output. By default this is disabled.
- isIgnoreContentStreamSpaceGlyphs
  
  public boolean isIgnoreContentStreamSpaceGlyphs()
  See Also:
  
  setIgnoreContentStreamSpaceGlyphs(boolean)
- setSuppressDuplicateOverlappingText
  
  @Field public void setSuppressDuplicateOverlappingText(boolean v)
  
  If true, the parser should try to remove duplicated text over the same region. This is needed for some PDFs that achieve bolding by re-writing the same text in the same area. Note that this can slow down extraction substantially (PDFBOX-956) and sometimes remove characters that were not in fact duplicated (PDFBOX-1155). By default this is disabled.
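
  The trade-off described above can be seen in a minimal sketch of duplicate suppression. It is not the PDFBox implementation; the `Glyph` type and the `tolerance` value are assumptions for illustration. A glyph is dropped when an already-kept glyph has the same text within the tolerance on both axes.

  ```java
  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;

  final class DuplicateTextSketch {

      /** A glyph with its text and page position (hypothetical model). */
      static final class Glyph {
          final String text;
          final double x;
          final double y;

          Glyph(String text, double x, double y) {
              this.text = text;
              this.x = x;
              this.y = y;
          }
      }

      /**
       * Keeps a glyph only if no already-kept glyph has the same text within
       * the tolerance on both axes. Each glyph is compared against every kept
       * glyph, so the cost grows quadratically with the glyph count.
       */
      static String suppressDuplicates(List<Glyph> glyphs, double tolerance) {
          List<Glyph> kept = new ArrayList<>();
          StringBuilder out = new StringBuilder();
          for (Glyph g : glyphs) {
              boolean duplicate = false;
              for (Glyph k : kept) {
                  if (k.text.equals(g.text)
                          && Math.abs(k.x - g.x) < tolerance
                          && Math.abs(k.y - g.y) < tolerance) {
                      duplicate = true;
                      break;
                  }
              }
              if (!duplicate) {
                  kept.add(g);
                  out.append(g.text);
              }
          }
          return out.toString();
      }

      public static void main(String[] args) {
          // "Hi" drawn twice with a 0.3-unit offset to fake bold text.
          List<Glyph> fakeBold = Arrays.asList(
                  new Glyph("H", 10, 10), new Glyph("H", 10.3, 10),
                  new Glyph("i", 18, 10), new Glyph("i", 18.3, 10));
          System.out.println(suppressDuplicates(fakeBold, 1.0)); // prints "Hi"

          // Two genuine narrow "l" glyphs only 0.8 units apart are merged as well.
          List<Glyph> narrow = Arrays.asList(
                  new Glyph("a", 10, 10), new Glyph("l", 16, 10), new Glyph("l", 16.8, 10));
          System.out.println(suppressDuplicates(narrow, 1.0)); // prints "al"
      }
  }
  ```

  The second case shows how a position-based check can remove characters that were not duplicates, and the nested loop hints at why the option can be slow on large pages.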
- isSortByPosition
  
  public boolean isSortByPosition()
  See Also:
  
  setSortByPosition(boolean)
- setSortByPosition
  
  @Field public void setSortByPosition(boolean v)
  
  If true, sort text tokens by their x/y position before extracting text. This may be necessary for some PDFs (if the text tokens are not rendered "in order"), while for other PDFs it can produce the wrong result (for example if there are 2 columns, the text will be interleaved). Default is false.
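
  Both outcomes mentioned above can be reproduced with a small sketch that sorts tokens top to bottom, then left to right. This is a simplified model, not the PDFBox sorting code; the `Token` type and the coordinates (with y growing downward) are assumptions for the example.

  ```java
  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.Comparator;
  import java.util.List;

  final class SortByPositionSketch {

      /** A text token and its page position; y grows downward in this sketch. */
      static final class Token {
          final String text;
          final double x;
          final double y;

          Token(String text, double x, double y) {
              this.text = text;
              this.x = x;
              this.y = y;
          }
      }

      /** Joins token texts in the order given (content-stream order). */
      static String join(List<Token> tokens) {
          StringBuilder out = new StringBuilder();
          for (Token t : tokens) {
              if (out.length() > 0) {
                  out.append(' ');
              }
              out.append(t.text);
          }
          return out.toString();
      }

      /** Sorts a copy of the tokens by y, then by x, and joins the result. */
      static String joinSortedByPosition(List<Token> tokens) {
          List<Token> sorted = new ArrayList<>(tokens);
          sorted.sort(Comparator.comparingDouble((Token t) -> t.y)
                  .thenComparingDouble(t -> t.x));
          return join(sorted);
      }

      public static void main(String[] args) {
          // Tokens written out of order on one line: sorting repairs the text.
          List<Token> scrambled = Arrays.asList(
                  new Token("world", 100, 50), new Token("Hello", 20, 50));
          System.out.println(joinSortedByPosition(scrambled)); // prints "Hello world"

          // Two columns, each written top to bottom: sorting interleaves them.
          List<Token> columns = Arrays.asList(
                  new Token("L1", 20, 50), new Token("L2", 20, 70),
                  new Token("R1", 300, 50), new Token("R2", 300, 70));
          System.out.println(join(columns));                 // prints "L1 L2 R1 R2"
          System.out.println(joinSortedByPosition(columns)); // prints "L1 R1 L2 R2"
      }
  }
  ```

  For the two-column page, the content-stream order was already the reading order, which is why the default of false is the safer choice for multi-column documents.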
- setOcrStrategy
  
  @Field public void setOcrStrategy(String ocrStrategyString)
- getOcrStrategy
  
  public String getOcrStrategy()
- setOcrStrategyAuto
  
  @Field public void setOcrStrategyAuto(String ocrStrategyAuto)
- getOcrStrategyAuto
  
  public String getOcrStrategyAuto()
- setOcrRenderingStrategy
  
  @Field public void setOcrRenderingStrategy(String ocrRenderingStrategy)
- getOcrRenderingStrategy
  
  public String getOcrRenderingStrategy()
- setOcrImageType
  
  @Field public void setOcrImageType(String imageType)
- getOcrImageType
  
  public String getOcrImageType()
- setOcrDPI
  
  @Field public void setOcrDPI(int dpi)
- getOcrDPI
  
  public int getOcrDPI()
- setOcrImageQuality
  
  @Field public void setOcrImageQuality(float imageQuality)
- getOcrImageQuality
  
  public float getOcrImageQuality()
- setOcrImageFormatName
  
  @Field public void setOcrImageFormatName(String formatName)
- getOcrImageFormatName
  
  public String getOcrImageFormatName()
- setExtractBookmarksText
  
  @Field public void setExtractBookmarksText(boolean extractBookmarksText)
- isExtractBookmarksText
  
  public boolean isExtractBookmarksText()
- setExtractInlineImages
  
  @Field public void setExtractInlineImages(boolean extractInlineImages)
- isExtractInlineImages
  
  public boolean isExtractInlineImages()
- setExtractInlineImageMetadataOnly
  
  @Field public void setExtractInlineImageMetadataOnly(boolean extractInlineImageMetadataOnly)
- isExtractInlineImageMetadataOnly
  
  public boolean isExtractInlineImageMetadataOnly()
- setAverageCharTolerance
  
  @Field public void setAverageCharTolerance(float averageCharTolerance)
- getAverageCharTolerance
  
  public float getAverageCharTolerance()
- setSpacingTolerance
  
  @Field public void setSpacingTolerance(float spacingTolerance)
- getSpacingTolerance
  
  public float getSpacingTolerance()
- setCatchIntermediateExceptions
  
  @Field public void setCatchIntermediateExceptions(boolean catchIntermediateExceptions)
- isCatchIntermediateExceptions
  
  public boolean isCatchIntermediateExceptions()
- setExtractAcroFormContent
  
  @Field public void setExtractAcroFormContent(boolean extractAcroFormContent)
- isExtractAcroFormContent
  
  public boolean isExtractAcroFormContent()
- setIfXFAExtractOnlyXFA
  
  @Field public void setIfXFAExtractOnlyXFA(boolean ifXFAExtractOnlyXFA)
- isIfXFAExtractOnlyXFA
  
  public boolean isIfXFAExtractOnlyXFA()
- setAllowExtractionForAccessibility
  
  @Field public void setAllowExtractionForAccessibility(boolean allowExtractionForAccessibility)
- isAllowExtractionForAccessibility
  
  public boolean isAllowExtractionForAccessibility()
- setExtractUniqueInlineImagesOnly
  
  @Field public void setExtractUniqueInlineImagesOnly(boolean extractUniqueInlineImagesOnly)
- isExtractUniqueInlineImagesOnly
  
  public boolean isExtractUniqueInlineImagesOnly()
- setExtractActions
  
  @Field public void setExtractActions(boolean extractActions)
- isExtractActions
  
  public boolean isExtractActions()
- setExtractFontNames
  
  @Field public void setExtractFontNames(boolean extractFontNames)
- isExtractFontNames
  
  public boolean isExtractFontNames()
- setSetKCMS
  
  @Field public void setSetKCMS(boolean setKCMS)
- isSetKCMS
  
  public boolean isSetKCMS()
- setDetectAngles
  
  @Field public void setDetectAngles(boolean detectAngles)
- isDetectAngles
  
  public boolean isDetectAngles()
- setExtractMarkedContent
  
  @Field public void setExtractMarkedContent(boolean extractMarkedContent)
- isExtractMarkedContent
  
  public boolean isExtractMarkedContent()
- setDropThreshold
  
  @Field public void setDropThreshold(float dropThreshold)
- getDropThreshold
  
  public float getDropThreshold()
- setMaxMainMemoryBytes
  
  @Field public void setMaxMainMemoryBytes(long maxMainMemoryBytes)
- setExtractIncrementalUpdateInfo
  
  @Field public void setExtractIncrementalUpdateInfo(boolean setExtractIncrementalUpdateInfo)
  
  Whether or not to scan a PDF for incremental updates.
  
  Parameters:
  
  setExtractIncrementalUpdateInfo - whether to scan the PDF for incremental updates
- getMaxMainMemoryBytes
  
  public long getMaxMainMemoryBytes()
- isExtractIncrementalUpdateInfo
  
  public boolean isExtractIncrementalUpdateInfo()
- setParseIncrementalUpdates
  
  @Field public void setParseIncrementalUpdates(boolean parseIncrementalUpdates)
  
  If set to true, this will parse incremental updates if they exist within a PDF. If set to true, this will override setExtractIncrementalUpdateInfo(boolean).
  
  Parameters:
  
  parseIncrementalUpdates - whether to parse incremental updates if they exist within the PDF
- isParseIncrementalUpdates
  
  public boolean isParseIncrementalUpdates()
- setMaxIncrementalUpdates
  
  @Field public void setMaxIncrementalUpdates(int maxIncrementalUpdates)
  
  Set the maximum number of incremental updates to parse.
  
  Parameters:
  
  maxIncrementalUpdates - the maximum number of incremental updates to parse
- getMaxIncrementalUpdates
  
  public int getMaxIncrementalUpdates()
- setThrowOnEncryptedPayload
  
  @Field public void setThrowOnEncryptedPayload(boolean throwOnEncryptedPayload)
  
  If the file is a 'Collection' and contains an embedded file with a defined 'AssociatedFile' value of 'EncryptedPayload', then throw an EncryptedDocumentException.
  Microsoft IRM v2 wraps the encrypted document inside a container PDF. See TIKA-4082.
  The goal of this is to make the user experience the same for traditionally encrypted files and PDFs that are containers for `EncryptedPayload`s.
  The default value is false.
  
  Parameters:
  
  throwOnEncryptedPayload - whether to throw an EncryptedDocumentException when such an embedded file is found
- isThrowOnEncryptedPayload
  
  public boolean isThrowOnEncryptedPayload()
- initialize
  
  public void initialize(Map<String,Param> params) throws TikaConfigException
  
  This is a no-op. There is no need to initialize multiple fields. The regular field loading should happen without this.
  
  Specified by:
  
  initialize in interface Initializable
  
  Parameters:
  
  params - params to use for initialization
  
  Throws:
  
  TikaConfigException
- checkInitialization
  
  public void checkInitialization(InitializableProblemHandler handler) throws TikaConfigException
  
  Specified by:
  
  checkInitialization in interface Initializable
  
  Parameters:
  
  handler - if there is a problem and no custom initializableProblemHandler has been configured via Initializable parameters, this is called to respond.
  
  Throws:
  
  TikaConfigException
- setRenderer
  
  public void setRenderer(Renderer renderer)
  
  Specified by:
  
  setRenderer in interface RenderingParser
- getRenderer
  
  public Renderer getRenderer()
- setImageGraphicsEngineFactory
  
  @Field public void setImageGraphicsEngineFactory(ImageGraphicsEngineFactory imageGraphicsEngineFactory)
- getImageGraphicsEngineFactory
  
  public ImageGraphicsEngineFactory getImageGraphicsEngineFactory()
- setImageStrategy
  
  @Field public void setImageStrategy(String imageStrategy)
- getImageStrategy
  
  public String getImageStrategy()
