java.lang.Object

org.apache.tika.parser.pdf.PDFParserConfig

All Implemented Interfaces: Serializable

public class PDFParserConfig extends Object implements Serializable

Config for PDFParser.

This allows parameters to be set programmatically in two ways:

1. Calls on the PDFParser itself, e.g. parser.getPDFParserConfig().setEnableAutoSpace() (as before)
2. Passing a config to PDFParser through a ParseContext: context.set(PDFParserConfig.class, config);
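
The ParseContext route works by looking a config object up by its class. A minimal sketch of that type-keyed lookup follows. It uses hypothetical stand-in classes (TypeKeyedContext and DemoConfig), not Tika's own ParseContext and PDFParserConfig. It also assumes the parser falls back to its own config when the context holds none, which the text above does not state.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a type-keyed context; this is not Tika's ParseContext. */
final class TypeKeyedContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    /** Registers an instance under its class, replacing any earlier one. */
    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    /** Returns the registered instance, or the fallback if none was set. */
    <T> T get(Class<T> key, T fallback) {
        Object value = entries.get(key);
        return value == null ? fallback : key.cast(value);
    }
}

/** Hypothetical config with a single flag, standing in for PDFParserConfig. */
final class DemoConfig {
    private boolean sortByPosition; // false until set

    boolean isSortByPosition() {
        return sortByPosition;
    }

    void setSortByPosition(boolean sortByPosition) {
        this.sortByPosition = sortByPosition;
    }
}
```

In this sketch a per-parse DemoConfig registered with set(DemoConfig.class, config) takes precedence over the fallback passed to get, which mirrors the idea of overriding the parser-level settings for a single parse.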

See Also:

Serialized Form

Nested Class Summary

Nested Classes

Modifier and Type

Class

Description

static enum

PDFParserConfig.IMAGE_STRATEGY

static enum

PDFParserConfig.OCR_RENDERING_STRATEGY

static enum

PDFParserConfig.OCR_STRATEGY

static class

PDFParserConfig.OCRStrategyAuto

Encapsulate the numbers used to control OCR Strategy when set to auto

static enum

PDFParserConfig.TikaImageType

Constructor Summary

Constructors

Constructor

Description

PDFParserConfig()

Method Summary

Modifier and Type

Method

Description

PDFParserConfig

cloneAndUpdate(PDFParserConfig updates)

void

configure(org.apache.tika.parser.pdf.PDF2XHTML pdf2XHTML)

Configures the given pdf2XHTML.

AccessChecker

getAccessChecker()

Float

getAverageCharTolerance()

Float

getDropThreshold()

ImageGraphicsEngineFactory

getImageGraphicsEngineFactory()

PDFParserConfig.IMAGE_STRATEGY

getImageStrategy()

int

getMaxIncrementalUpdates()

long

getMaxMainMemoryBytes()

The maximum amount of memory to use when loading a pdf into a PDDocument.

int

getOcrDPI()

Dots per inch used to render the page image for OCR

String

getOcrImageFormatName()

String representation of the image format used to render the page image for OCR (examples: png, tiff, jpeg)

float

getOcrImageQuality()

Image quality used to render the page image for OCR.

PDFParserConfig.TikaImageType

getOcrImageType()

Image type used to render the page image for OCR.

PDFParserConfig.OCR_RENDERING_STRATEGY

getOcrRenderingStrategy()

PDFParserConfig.OCR_STRATEGY

getOcrStrategy()

PDFParserConfig.OCRStrategyAuto

getOcrStrategyAuto()

Renderer

getRenderer()

Float

getSpacingTolerance()

boolean

isCatchIntermediateIOExceptions()

See setCatchIntermediateIOExceptions(boolean)

boolean

isDetectAngles()

boolean

isEnableAutoSpace()

boolean

isExtractAcroFormContent()

boolean

isExtractActions()

boolean

isExtractAnnotationText()

boolean

isExtractBookmarksText()

boolean

isExtractFontNames()

boolean

isExtractIncrementalUpdateInfo()

boolean

isExtractInlineImageMetadataOnly()

boolean

isExtractInlineImages()

boolean

isExtractMarkedContent()

boolean

isExtractUniqueInlineImagesOnly()

boolean

isIfXFAExtractOnlyXFA()

boolean

isParseIncrementalUpdates()

boolean

isSetKCMS()

boolean

isSortByPosition()

boolean

isSuppressDuplicateOverlappingText()

boolean

isThrowOnEncryptedPayload()

void

setAccessChecker(AccessChecker accessChecker)

void

setAverageCharTolerance(Float averageCharTolerance)

See PDFTextStripper.setAverageCharTolerance(float)

void

setCatchIntermediateIOExceptions(boolean catchIntermediateIOExceptions)

The PDFBox parser will throw an IOException if there is a problem with a stream.

void

setDetectAngles(boolean detectAngles)

void

setDropThreshold(Float dropThreshold)

See PDFTextStripper.setDropThreshold(float)

void

setEnableAutoSpace(boolean enableAutoSpace)

If true (the default), the parser should estimate where spaces should be inserted between words.

void

setExtractAcroFormContent(boolean extractAcroFormContent)

If true (the default), extract content from AcroForms at the end of the document.

void

setExtractActions(boolean v)

Whether or not to extract PDActions from the file.

void

setExtractAnnotationText(boolean extractAnnotationText)

If true (the default), text in annotations will be extracted.

void

setExtractBookmarksText(boolean extractBookmarksText)

If true, extract bookmarks (document outline) text.

void

setExtractFontNames(boolean extractFontNames)

Extract font names into a metadata field

void

setExtractIncrementalUpdateInfo(boolean extractIncrementalUpdateInfo)

void

setExtractInlineImageMetadataOnly(boolean extractInlineImageMetadataOnly)

Use this when you want to know how many images of what formats are in a PDF but you don't need to render the images (e.g. for OCR).

void

setExtractInlineImages(boolean extractInlineImages)

If true, extract the literal inline embedded OBXImages.

void

setExtractMarkedContent(boolean extractMarkedContent)

If the PDF contains marked content, try to extract text and its marked structure.

void

setExtractUniqueInlineImagesOnly(boolean extractUniqueInlineImagesOnly)

Multiple pages within a PDF file might refer to the same underlying image.

void

setIfXFAExtractOnlyXFA(boolean ifXFAExtractOnlyXFA)

If false (the default), extract content from the full PDF as well as the XFA form.

void

setImageGraphicsEngineFactory(ImageGraphicsEngineFactory imageGraphicsEngineFactory)

EXPERT: Customize the class that handles inline images within a PDF page.

void

setImageStrategy(String imageStrategy)

void

setImageStrategy(PDFParserConfig.IMAGE_STRATEGY imageStrategy)

void

setMaxIncrementalUpdates(int maxIncrementalUpdates)

The maximum number of incremental updates to parse.

void

setMaxMainMemoryBytes(long maxMainMemoryBytes)

void

setOcrDPI(int ocrDPI)

Dots per inch used to render the page image for OCR.

void

setOcrImageFormatName(String ocrImageFormatName)

void

setOcrImageQuality(float ocrImageQuality)

Image quality used to render the page image for OCR.

void

setOcrImageType(String ocrImageTypeString)

Image type used to render the page image for OCR.

void

setOcrImageType(PDFParserConfig.TikaImageType ocrImageType)

Image type used to render the page image for OCR.

void

setOcrRenderingStrategy(String ocrRenderingStrategyString)

void

setOcrRenderingStrategy(PDFParserConfig.OCR_RENDERING_STRATEGY ocrRenderingStrategy)

When rendering the page for OCR, ALL includes the rendering of the electronic text, while NO_TEXT runs OCR only on the images and vector graphics.

void

setOcrStrategy(String ocrStrategyString)

Which strategy to use for OCR

void

setOcrStrategy(PDFParserConfig.OCR_STRATEGY ocrStrategy)

Which strategy to use for OCR

void

setOcrStrategyAuto(String ocrStrategyAuto)

void

setParseIncrementalUpdates(boolean parseIncrementalUpdates)

void

setRenderer(Renderer renderer)

void

setSetKCMS(boolean setKCMS)

Whether to call System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider").

void

setSortByPosition(boolean sortByPosition)

If true, sort text tokens by their x/y position before extracting text.

void

setSpacingTolerance(Float spacingTolerance)

See PDFTextStripper.setSpacingTolerance(float)

void

setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText)

If true, the parser should try to remove duplicated text over the same region.

void

setThrowOnEncryptedPayload(boolean throwOnEncryptedPayload)

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- PDFParserConfig
  
  public PDFParserConfig()
Method Details
- isExtractInlineImageMetadataOnly
  
  public boolean isExtractInlineImageMetadataOnly()
  
  Returns:
  
  whether or not to extract only inline image metadata and not render the images
- setExtractInlineImageMetadataOnly
  
  public void setExtractInlineImageMetadataOnly(boolean extractInlineImageMetadataOnly)
  
  Use this when you want to know how many images of what formats are in a PDF but you don't need to render the images (e.g. for OCR). This is far faster than extractInlineImages because it doesn't have to render the images, which can be very slow. This does not extract metadata from within each image; rather, it extracts the XMP that may be stored external to an image in PDImageXObjects.
  
  Parameters:
  
  extractInlineImageMetadataOnly -
  
  Since:
  
  1.25
- isExtractMarkedContent
  
  public boolean isExtractMarkedContent()
- setExtractMarkedContent
  
  public void setExtractMarkedContent(boolean extractMarkedContent)
  
  If the PDF contains marked content, try to extract text and its marked structure. If the PDF does not contain marked content, fall back to the regular PDF2XHTML for text extraction. As of 1.24, this is an "alpha" version.
  
  Parameters:
  
  extractMarkedContent -
  
  Since:
  
  1.24
- configure
  
  public void configure(org.apache.tika.parser.pdf.PDF2XHTML pdf2XHTML)
  
  Configures the given pdf2XHTML.
  
  Parameters:
  
  pdf2XHTML -
- isExtractAcroFormContent
  
  public boolean isExtractAcroFormContent()
  See Also:
  
  setExtractAcroFormContent(boolean)
- setExtractAcroFormContent
  
  public void setExtractAcroFormContent(boolean extractAcroFormContent)
  
  If true (the default), extract content from AcroForms at the end of the document. If an XFA form is found, try to process that; otherwise, process the AcroForm.
  
  Parameters:
  
  extractAcroFormContent -
- isIfXFAExtractOnlyXFA
  
  public boolean isIfXFAExtractOnlyXFA()
  Returns:
  
  how to handle XFA data if it exists
  
  See Also:
  
  setIfXFAExtractOnlyXFA(boolean)
- setIfXFAExtractOnlyXFA
  
  public void setIfXFAExtractOnlyXFA(boolean ifXFAExtractOnlyXFA)
  
  If false (the default), extract content from the full PDF as well as the XFA form. This will likely lead to some duplicative content.
  
  Parameters:
  
  ifXFAExtractOnlyXFA -
- isExtractBookmarksText
  
  public boolean isExtractBookmarksText()
  See Also:
  
  setExtractBookmarksText(boolean)
- setExtractBookmarksText
  
  public void setExtractBookmarksText(boolean extractBookmarksText)
  
  If true, extract bookmarks (document outline) text.
  The default is true.
  
  Parameters:
  
  extractBookmarksText -
- isExtractFontNames
  
  public boolean isExtractFontNames()
- setExtractFontNames
  
  public void setExtractFontNames(boolean extractFontNames)
  
  Extract font names into a metadata field
  
  Parameters:
  
  extractFontNames -
- isExtractInlineImages
  
  public boolean isExtractInlineImages()
  See Also:
  
  setExtractInlineImages(boolean)
- setExtractInlineImages
  
  public void setExtractInlineImages(boolean extractInlineImages)
  
  If true, extract the literal inline embedded OBXImages.
  Beware: some PDF documents of modest size (~4MB) can contain thousands of embedded images totaling > 2.5 GB. Also, at least as of PDFBox 1.8.5, there can be surprisingly large memory consumption and/or out of memory errors.
  Along the same lines, note that this does not extract "logical" images. Some PDF writers break up a single logical image into hundreds of little images. With this option set to true, you might get those hundreds of little images.
  NOTE ALSO: this extracts the raw images without clipping, rotation, masks, color inversion, etc. The images that this extracts may look nothing like what a human would expect given the appearance of the PDF.
  Set to true only with the greatest caution. The default is false.
  Parameters:
  
  extractInlineImages -
  
  See Also:
  
  setExtractUniqueInlineImagesOnly(boolean)
- isExtractUniqueInlineImagesOnly
  
  public boolean isExtractUniqueInlineImagesOnly()
  See Also:
  
  setExtractUniqueInlineImagesOnly(boolean)
- setExtractUniqueInlineImagesOnly
  
  public void setExtractUniqueInlineImagesOnly(boolean extractUniqueInlineImagesOnly)
  
  Multiple pages within a PDF file might refer to the same underlying image. If extractUniqueInlineImagesOnly is set to false, the parser will call the EmbeddedExtractor each time the image appears on a page. This might be desired for some use cases. However, to avoid duplication of extracted images, set this to true. The default is true.
  Note that uniqueness is determined only by the underlying PDF COSObject id, not by file hash or similar equality metric. If the PDF actually contains multiple copies of the same image -- all with different object ids -- then all images will be extracted.
  For this parameter to have any effect, extractInlineImages must be set to true.
  Because of TIKA-1742 -- to avoid infinite recursion -- no matter the setting of this parameter, the extractor will only pull out one copy of each image per page. This parameter tries to capture uniqueness across the entire document.
  
  Parameters:
  
  extractUniqueInlineImagesOnly -
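
The interaction between the per-page rule and the document-wide flag can be sketched with plain collections. This is an illustration of the behavior described above, not Tika's implementation. The class InlineImageDedupSketch and its method are hypothetical, and integer ids stand in for PDF COSObject ids.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class InlineImageDedupSketch {
    /**
     * Returns the object ids that would be handed to the extractor, in order.
     * Each inner list holds the image object ids referenced by one page.
     */
    static List<Integer> idsToExtract(List<List<Integer>> pages, boolean uniqueOnly) {
        List<Integer> extracted = new ArrayList<>();
        Set<Integer> seenInDocument = new HashSet<>();
        for (List<Integer> page : pages) {
            Set<Integer> seenOnPage = new HashSet<>();
            for (Integer id : page) {
                // At most one copy of an image per page, whatever the setting.
                if (!seenOnPage.add(id)) {
                    continue;
                }
                // Document-wide uniqueness applies only when the flag is on.
                if (uniqueOnly && !seenInDocument.add(id)) {
                    continue;
                }
                extracted.add(id);
            }
        }
        return extracted;
    }
}
```

With page 1 referencing ids 7, 7, 9 and page 2 referencing id 7, the sketch yields 7, 9 when uniqueOnly is true and 7, 9, 7 when it is false. Two byte-identical images with different ids would both be extracted either way, since only the id is compared.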
- isEnableAutoSpace
  
  public boolean isEnableAutoSpace()
  See Also:
  
  setEnableAutoSpace(boolean)
- setEnableAutoSpace
  
  public void setEnableAutoSpace(boolean enableAutoSpace)
  
  If true (the default), the parser should estimate where spaces should be inserted between words. For many PDFs this is necessary as they do not include explicit whitespace characters.
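
One way to picture the estimation is a gap test between neighbouring glyphs. The sketch below is a simplified heuristic under stated assumptions, not the algorithm PDFBox or Tika uses: the Glyph type, the gapFactor parameter, and the rule "insert a space when the gap exceeds a fraction of the previous glyph's width" are all hypothetical.

```java
import java.util.List;

final class AutoSpaceSketch {
    /** A character with its left edge and advance width on one text line. */
    static final class Glyph {
        final char ch;
        final float x;
        final float width;

        Glyph(char ch, float x, float width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /** Joins glyphs, optionally inserting a space wherever the horizontal gap is large. */
    static String toText(List<Glyph> glyphs, boolean enableAutoSpace, float gapFactor) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (enableAutoSpace && prev != null) {
                float gap = g.x - (prev.x + prev.width);
                // Hypothetical rule: a gap wider than gapFactor * previous width is a word break.
                if (gap > gapFactor * prev.width) {
                    sb.append(' ');
                }
            }
            sb.append(g.ch);
            prev = g;
        }
        return sb.toString();
    }
}
```

For glyphs a, b, c, d with no explicit whitespace and a visible gap between b and c, the sketch produces "ab cd" when enabled and "abcd" when disabled.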
- isSuppressDuplicateOverlappingText
  
  public boolean isSuppressDuplicateOverlappingText()
  See Also:
  
  setSuppressDuplicateOverlappingText(boolean)
- setSuppressDuplicateOverlappingText
  
  public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText)
  
  If true, the parser should try to remove duplicated text over the same region. This is needed for some PDFs that achieve bolding by re-writing the same text in the same area. Note that this can slow down extraction substantially (PDFBOX-956) and sometimes remove characters that were not in fact duplicated (PDFBOX-1155). By default this is disabled.
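
The trade-off described above can be illustrated with a small position-based filter. This is a sketch under assumptions, not PDFBox's implementation: the Glyph type and the single tolerance parameter are hypothetical, and the rule is simply "drop a character if the same character was already kept within the tolerance on both axes".

```java
import java.util.ArrayList;
import java.util.List;

final class OverlapSuppressionSketch {
    /** A positioned character; a simplified stand-in for a text position. */
    static final class Glyph {
        final char ch;
        final float x;
        final float y;

        Glyph(char ch, float x, float y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    /** Keeps a glyph unless the same character was already kept within the tolerance. */
    static String suppress(List<Glyph> glyphs, float tolerance) {
        List<Glyph> kept = new ArrayList<>();
        StringBuilder out = new StringBuilder();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            // Every new glyph is compared with all kept glyphs, so the cost grows with text size.
            for (Glyph k : kept) {
                if (k.ch == g.ch
                        && Math.abs(k.x - g.x) < tolerance
                        && Math.abs(k.y - g.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(g);
                out.append(g.ch);
            }
        }
        return out.toString();
    }
}
```

A fake-bold "Hi" drawn twice with a 0.2 unit offset collapses to "Hi" with a tolerance of 1. The same tolerance also collapses two genuinely distinct, tightly set "o" glyphs into one, which is the kind of false positive the note above warns about.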
- isExtractAnnotationText
  
  public boolean isExtractAnnotationText()
  See Also:
  
  setExtractAnnotationText(boolean)
- setExtractAnnotationText
  
  public void setExtractAnnotationText(boolean extractAnnotationText)
  
  If true (the default), text in annotations will be extracted.
- isSortByPosition
  
  public boolean isSortByPosition()
  See Also:
  
  setSortByPosition(boolean)
- setSortByPosition
  
  public void setSortByPosition(boolean sortByPosition)
  
  If true, sort text tokens by their x/y position before extracting text. This may be necessary for some PDFs (if the text tokens are not rendered "in order"), while for other PDFs it can produce the wrong result (for example if there are 2 columns, the text will be interleaved). Default is false.
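
Both outcomes can be reproduced with a plain sort. The sketch below is illustrative only: the Token type is hypothetical, the sort key (y first, then x) is an assumption about what "by position" means, and y is assumed to grow down the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SortByPositionSketch {
    /** A text token with the position of its start point. */
    static final class Token {
        final String text;
        final float x;
        final float y;

        Token(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins tokens in stream order, or ordered by y then x when sortByPosition is true. */
    static String join(List<Token> tokens, boolean sortByPosition) {
        List<Token> ordered = new ArrayList<>(tokens);
        if (sortByPosition) {
            ordered.sort(Comparator.<Token>comparingDouble(t -> t.y)
                    .thenComparingDouble(t -> t.x));
        }
        StringBuilder sb = new StringBuilder();
        for (Token t : ordered) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(t.text);
        }
        return sb.toString();
    }
}
```

Tokens written to the stream as "world" then "Hello" on the same line come out as "Hello world" once sorted. A two-column page stored in reading order (A1, A2, B1, B2) comes out as "A1 B1 A2 B2" once sorted, because rows at the same height are merged across columns.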
- getAverageCharTolerance
  
  public Float getAverageCharTolerance()
  See Also:
  
  setAverageCharTolerance(Float)
- setAverageCharTolerance
  
  public void setAverageCharTolerance(Float averageCharTolerance)
  
  See PDFTextStripper.setAverageCharTolerance(float)
- getSpacingTolerance
  
  public Float getSpacingTolerance()
  See Also:
  
  setSpacingTolerance(Float)
- setSpacingTolerance
  
  public void setSpacingTolerance(Float spacingTolerance)
  
  See PDFTextStripper.setSpacingTolerance(float)
- getDropThreshold
  
  public Float getDropThreshold()
  See Also:
  
  setDropThreshold(Float)
- setDropThreshold
  
  public void setDropThreshold(Float dropThreshold)
  
  See PDFTextStripper.setDropThreshold(float)
- getAccessChecker
  
  public AccessChecker getAccessChecker()
- setAccessChecker
  
  public void setAccessChecker(AccessChecker accessChecker)
- isCatchIntermediateIOExceptions
  
  public boolean isCatchIntermediateIOExceptions()
  
  See setCatchIntermediateIOExceptions(boolean)
  
  Returns:
  
  whether or not to catch IOExceptions
- setCatchIntermediateIOExceptions
  
  public void setCatchIntermediateIOExceptions(boolean catchIntermediateIOExceptions)
  
  The PDFBox parser will throw an IOException if there is a problem with a stream. If this is set to true, Tika's PDFParser will catch these exceptions and try to parse the rest of the document. After the parse is completed, Tika's PDFParser will throw the first caught exception.
  
  Parameters:
  
  catchIntermediateIOExceptions -
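
The catch, continue, and rethrow-first pattern described above can be sketched as follows. The PageReader interface and parseAll method are hypothetical and only model the control flow. Output goes into a caller-supplied list so that whatever was parsed is still available after the deferred exception is thrown.

```java
import java.io.IOException;
import java.util.List;

final class DeferredIOExceptionSketch {
    /** Hypothetical per-page step that may fail on a bad stream. */
    interface PageReader {
        String read(int pageIndex) throws IOException;
    }

    /**
     * Reads every page into out. When catchIntermediate is true, failures are
     * skipped and the first one is thrown only after all pages were attempted.
     */
    static void parseAll(int pageCount, PageReader reader, boolean catchIntermediate,
                         List<String> out) throws IOException {
        IOException first = null;
        for (int i = 0; i < pageCount; i++) {
            try {
                out.add(reader.read(i));
            } catch (IOException e) {
                if (!catchIntermediate) {
                    throw e; // fail fast: nothing after this page is attempted
                }
                if (first == null) {
                    first = e; // remember only the first failure
                }
            }
        }
        if (first != null) {
            throw first; // surfaced once the rest of the document was processed
        }
    }
}
```

With three pages and a failure on the second, the lenient mode collects the first and third pages before throwing, while the strict mode stops after the first page.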
- getOcrStrategy
  
  public PDFParserConfig.OCR_STRATEGY getOcrStrategy()
  
  Returns:
  
  strategy to use for OCR
- getOcrStrategyAuto
  
  public PDFParserConfig.OCRStrategyAuto getOcrStrategyAuto()
  
  Returns:
  
  the auto OCR strategy settings to use when the OCR strategy is set to auto
- setOcrStrategy
  
  public void setOcrStrategy(PDFParserConfig.OCR_STRATEGY ocrStrategy)
  
  Which strategy to use for OCR
  
  Parameters:
  
  ocrStrategy -
- setOcrStrategyAuto
  
  public void setOcrStrategyAuto(String ocrStrategyAuto)
- setOcrStrategy
  
  public void setOcrStrategy(String ocrStrategyString)
  
  Which strategy to use for OCR
  
  Parameters:
  
  ocrStrategyString -
- getOcrRenderingStrategy
  
  public PDFParserConfig.OCR_RENDERING_STRATEGY getOcrRenderingStrategy()
- setOcrRenderingStrategy
  
  public void setOcrRenderingStrategy(String ocrRenderingStrategyString)
- setOcrRenderingStrategy
  
  public void setOcrRenderingStrategy(PDFParserConfig.OCR_RENDERING_STRATEGY ocrRenderingStrategy)
  
  When rendering the page for OCR, ALL includes the rendering of the electronic text, while NO_TEXT runs OCR only on the images and vector graphics.
  
  Parameters:
  
  ocrRenderingStrategy -
- getOcrImageFormatName
  
  public String getOcrImageFormatName()
  
  String representation of the image format used to render the page image for OCR (examples: png, tiff, jpeg).
  
  Returns:
  
  the image format name
- setOcrImageFormatName
  
  public void setOcrImageFormatName(String ocrImageFormatName)
  Parameters:
  
  ocrImageFormatName - name of image format used to render page image
  
  See Also:
  
  getOcrImageFormatName()
- getOcrImageType
  
  public PDFParserConfig.TikaImageType getOcrImageType()
  
  Image type used to render the page image for OCR.
  Returns:
  
  image type
  
  See Also:
  
  setOcrImageType(TikaImageType)
- setOcrImageType
  
  public void setOcrImageType(PDFParserConfig.TikaImageType ocrImageType)
  
  Image type used to render the page image for OCR.
  
  Parameters:
  
  ocrImageType -
- setOcrImageType
  
  public void setOcrImageType(String ocrImageTypeString)
  
  Image type used to render the page image for OCR.
  See Also:
  
  setOcrImageType(TikaImageType)
- getOcrDPI
  
  public int getOcrDPI()
  
  Dots per inch used to render the page image for OCR
  
  Returns:
  
  dots per inch
- setOcrDPI
  
  public void setOcrDPI(int ocrDPI)
  
  Dots per inch used to render the page image for OCR. This does not apply to all image formats.
  
  Parameters:
  
  ocrDPI -
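
To get a feel for what a DPI value implies, the rendered size can be estimated from the page size. The sketch assumes the page is rasterised at the given DPI from PDF user-space units of 1/72 inch, and that the rounding is to the nearest pixel. Neither is stated above, and the class and method names are hypothetical.

```java
final class OcrRenderSizeSketch {
    /** Default PDF user-space resolution: 72 units (points) per inch. */
    static final float POINTS_PER_INCH = 72f;

    /** Pixel length of a page edge given in points when rendered at the given DPI. */
    static int pixels(float pagePoints, int dpi) {
        return Math.round(pagePoints * dpi / POINTS_PER_INCH);
    }

    /** Uncompressed size in bytes of a raster with the given bytes per pixel. */
    static long rasterBytes(int widthPx, int heightPx, int bytesPerPixel) {
        return (long) widthPx * heightPx * bytesPerPixel;
    }
}
```

A US Letter page (612 by 792 points) at an illustrative 300 DPI comes to 2550 by 3300 pixels, or about 25 MB uncompressed at 3 bytes per pixel, so doubling the DPI roughly quadruples the memory per page.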
- getOcrImageQuality
  
  public float getOcrImageQuality()
  
  Image quality used to render the page image for OCR. This does not apply to all image formats.
  
  Returns:
  
  the image quality
- setOcrImageQuality
  
  public void setOcrImageQuality(float ocrImageQuality)
  
  Image quality used to render the page image for OCR. This does not apply to all image formats
- isExtractActions
  
  public boolean isExtractActions()
  Returns:
  
  whether or not to extract PDActions
  
  See Also:
  
  setExtractActions(boolean)
- setExtractActions
  
  public void setExtractActions(boolean v)
  
  Whether or not to extract PDActions from the file. Most Action types are handled inline; javascript macros are processed as embedded documents.
  
  Parameters:
  
  v -
- getMaxMainMemoryBytes
  
  public long getMaxMainMemoryBytes()
  
  The maximum amount of memory to use when loading a PDF into a PDDocument. Additional buffering is done using a temp file. The default is 512MB.
  
  Returns:
  
  the maximum number of bytes of main memory to use
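
Because the limit is expressed in bytes as a long, converting from megabytes is worth doing in long arithmetic. The helper below is a hypothetical convenience, not part of Tika, and it assumes "MB" means 1024 × 1024 bytes, which the text above does not specify.

```java
final class MemoryBudgetSketch {
    /** Converts megabytes to bytes using long arithmetic (1 MB taken as 1024 * 1024 bytes). */
    static long megabytesToBytes(long megabytes) {
        return megabytes * 1024L * 1024L;
    }

    /** The same conversion in int arithmetic, shown only to illustrate overflow. */
    static int megabytesToBytesNaive(int megabytes) {
        return megabytes * 1024 * 1024;
    }
}
```

Under that assumption 512 MB is 536870912 bytes. The int variant already overflows at 2048 MB, which is why a value destined for setMaxMainMemoryBytes(long) is best computed with long operands.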
- setMaxMainMemoryBytes
  
  public void setMaxMainMemoryBytes(long maxMainMemoryBytes)
- isSetKCMS
  
  public boolean isSetKCMS()
- setSetKCMS
  
  public void setSetKCMS(boolean setKCMS)
  
  Whether to call System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider"). KCMS is the unmaintained, legacy provider and is far faster than the newer replacement. However, there are stability and security risks with using the unmaintained legacy provider.
  
  Note, of course, that this is not thread safe. If the value is false in your first thread, and the second thread changes this to true, the system property in the first thread will now be true.
  
  Default is false.
  
  Parameters:
  
  setKCMS - whether or not to set KCMS
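
The thread-safety caveat follows from system properties being shared across the whole JVM. A minimal demonstration is sketched below, using a hypothetical helper and a made-up property key rather than sun.java2d.cmm, so that running it does not alter colour management.

```java
final class GlobalPropertySketch {
    /**
     * Sets a system property on a second thread, waits for it to finish,
     * and returns the value the calling thread then observes.
     */
    static String setOnOtherThread(String key, String value) {
        Thread other = new Thread(() -> System.setProperty(key, value));
        other.start();
        try {
            other.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
        }
        return System.getProperty(key);
    }
}
```

The calling thread sees the value written by the other thread, which is why two parsers configured with different setKCMS values in one JVM cannot each keep their own setting.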
- isDetectAngles
  
  public boolean isDetectAngles()
- setDetectAngles
  
  public void setDetectAngles(boolean detectAngles)
- cloneAndUpdate
  
  public PDFParserConfig cloneAndUpdate(PDFParserConfig updates) throws TikaException
  
  Throws:
  
  TikaException
- setRenderer
  
  public void setRenderer(Renderer renderer)
- getRenderer
  
  public Renderer getRenderer()
- setImageStrategy
  
  public void setImageStrategy(String imageStrategy)
- setImageStrategy
  
  public void setImageStrategy(PDFParserConfig.IMAGE_STRATEGY imageStrategy)
- setImageGraphicsEngineFactory
  
  public void setImageGraphicsEngineFactory(ImageGraphicsEngineFactory imageGraphicsEngineFactory)
  
  EXPERT: Customize the class that handles inline images within a PDF page.
  
  Parameters:
  
  imageGraphicsEngineFactory -
- getImageGraphicsEngineFactory
  
  public ImageGraphicsEngineFactory getImageGraphicsEngineFactory()
- getImageStrategy
  
  public PDFParserConfig.IMAGE_STRATEGY getImageStrategy()
- isExtractIncrementalUpdateInfo
  
  public boolean isExtractIncrementalUpdateInfo()
- setExtractIncrementalUpdateInfo
  
  public void setExtractIncrementalUpdateInfo(boolean extractIncrementalUpdateInfo)
- isParseIncrementalUpdates
  
  public boolean isParseIncrementalUpdates()
- setParseIncrementalUpdates
  
  public void setParseIncrementalUpdates(boolean parseIncrementalUpdates)
- getMaxIncrementalUpdates
  
  public int getMaxIncrementalUpdates()
- setMaxIncrementalUpdates
  
  public void setMaxIncrementalUpdates(int maxIncrementalUpdates)
  
  The maximum number of incremental updates to parse.
  
  Parameters:
  
  maxIncrementalUpdates -
- setThrowOnEncryptedPayload
  
  public void setThrowOnEncryptedPayload(boolean throwOnEncryptedPayload)
- isThrowOnEncryptedPayload
  
  public boolean isThrowOnEncryptedPayload()
