Class PDFParserConfig

java.lang.Object
org.apache.tika.parser.pdf.PDFParserConfig
All Implemented Interfaces:
Serializable

public class PDFParserConfig extends Object implements Serializable
Config for PDFParser.

This allows parameters to be set programmatically:

  1. Calls to PDFParser, e.g. parser.getPDFParserConfig().setEnableAutoSpace() (as before)
  2. Passing to PDFParser through a ParseContext: context.set(PDFParserConfig.class, config);

  • Constructor Details

    • PDFParserConfig

      public PDFParserConfig()
  • Method Details

    • isExtractInlineImageMetadataOnly

      public boolean isExtractInlineImageMetadataOnly()
      Returns:
      whether or not to extract only inline image metadata and not render the images
    • setExtractInlineImageMetadataOnly

      public void setExtractInlineImageMetadataOnly(boolean extractInlineImageMetadataOnly)
      Use this when you want to know how many images of which formats are in a PDF but you don't need to render the images (e.g. for OCR). This is far faster than extractInlineImages because it doesn't have to render the images, which can be very slow. This does not extract metadata from within each image; rather, it extracts the XMP that may be stored external to an image in PDImageXObjects.
      Parameters:
      extractInlineImageMetadataOnly -
      Since:
      1.25
    • isExtractMarkedContent

      public boolean isExtractMarkedContent()
    • setExtractMarkedContent

      public void setExtractMarkedContent(boolean extractMarkedContent)
      If the PDF contains marked content, try to extract text and its marked structure. If the PDF does not contain marked content, fall back to the regular PDF2XHTML for text extraction. As of 1.24, this is an "alpha" version.
      Parameters:
      extractMarkedContent -
      Since:
      1.24
    • configure

      public void configure(org.apache.tika.parser.pdf.PDF2XHTML pdf2XHTML)
      Configures the given pdf2XHTML.
      Parameters:
      pdf2XHTML -
    • isExtractAcroFormContent

      public boolean isExtractAcroFormContent()
    • setExtractAcroFormContent

      public void setExtractAcroFormContent(boolean extractAcroFormContent)
      If true (the default), extract content from AcroForms at the end of the document. If an XFA is found, try to process that; otherwise, process the AcroForm.
      Parameters:
      extractAcroFormContent -
    • isIfXFAExtractOnlyXFA

      public boolean isIfXFAExtractOnlyXFA()
      Returns:
      whether to extract only the XFA content when an XFA form exists
    • setIfXFAExtractOnlyXFA

      public void setIfXFAExtractOnlyXFA(boolean ifXFAExtractOnlyXFA)
      If false (the default), extract content from the full PDF as well as the XFA form. This will likely lead to some duplicative content.
      Parameters:
      ifXFAExtractOnlyXFA -
    • isExtractBookmarksText

      public boolean isExtractBookmarksText()
    • setExtractBookmarksText

      public void setExtractBookmarksText(boolean extractBookmarksText)
      If true, extract bookmarks (document outline) text.

      The default is true.

      Parameters:
      extractBookmarksText -
    • isExtractFontNames

      public boolean isExtractFontNames()
    • setExtractFontNames

      public void setExtractFontNames(boolean extractFontNames)
      If true, extract font names into a metadata field.
      Parameters:
      extractFontNames -
    • isExtractInlineImages

      public boolean isExtractInlineImages()
    • setExtractInlineImages

      public void setExtractInlineImages(boolean extractInlineImages)
      If true, extract the literal inline embedded OBXImages.

      Beware: some PDF documents of modest size (~4MB) can contain thousands of embedded images totaling > 2.5 GB. Also, at least as of PDFBox 1.8.5, there can be surprisingly large memory consumption and/or out of memory errors.

      Along the same lines, note that this does not extract "logical" images. Some PDF writers break up a single logical image into hundreds of little images. With this option set to true, you might get those hundreds of little images.

      NOTE ALSO: this extracts the raw images without clipping, rotation, masks, color inversion, etc. The images that this extracts may look nothing like what a human would expect given the appearance of the PDF.

      Set to true only with the greatest caution. The default is false.

      Parameters:
      extractInlineImages -
    • isExtractUniqueInlineImagesOnly

      public boolean isExtractUniqueInlineImagesOnly()
    • setExtractUniqueInlineImagesOnly

      public void setExtractUniqueInlineImagesOnly(boolean extractUniqueInlineImagesOnly)
      Multiple pages within a PDF file might refer to the same underlying image. If extractUniqueInlineImagesOnly is set to false, the parser will call the EmbeddedExtractor each time the image appears on a page. This might be desired for some use cases. However, to avoid duplication of extracted images, set this to true. The default is true.

      Note that uniqueness is determined only by the underlying PDF COSObject id, not by file hash or similar equality metric. If the PDF actually contains multiple copies of the same image -- all with different object ids -- then all images will be extracted.

      For this parameter to have any effect, extractInlineImages must be set to true.

      Because of TIKA-1742 -- to avoid infinite recursion -- no matter the setting of this parameter, the extractor will only pull out one copy of each image per page. This parameter tries to capture uniqueness across the entire document.

      Parameters:
      extractUniqueInlineImagesOnly -
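The interaction between the per-page rule and the document-wide rule can be sketched with a small standalone model. This is not Tika or PDFBox code: the class `UniqueImageSketch`, the method `extractedIds`, and the use of plain integers in place of COSObject ids are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Standalone model (hypothetical names); integers stand in for COSObject ids.
final class UniqueImageSketch {

    // Returns the ids that would be handed to an embedded extractor, in order.
    static List<Integer> extractedIds(List<List<Integer>> pages, boolean uniqueOnly) {
        List<Integer> extracted = new ArrayList<>();
        Set<Integer> seenInDocument = new HashSet<>();
        for (List<Integer> page : pages) {
            Set<Integer> seenOnPage = new HashSet<>();
            for (Integer id : page) {
                // Per-page de-duplication applies regardless of the flag.
                if (!seenOnPage.add(id)) {
                    continue;
                }
                // Document-wide de-duplication applies only when uniqueOnly is true.
                if (uniqueOnly && !seenInDocument.add(id)) {
                    continue;
                }
                extracted.add(id);
            }
        }
        return extracted;
    }

    public static void main(String[] args) {
        // Page 1 draws image 7 twice and image 8 once; page 2 reuses image 7.
        List<List<Integer>> pages = List.of(List.of(7, 7, 8), List.of(7, 9));
        System.out.println(extractedIds(pages, true));  // [7, 8, 9]
        System.out.println(extractedIds(pages, false)); // [7, 8, 7, 9]
    }
}
```

Because the model keys on the id alone, two byte-identical images stored under different ids would both be extracted, which matches the note above about uniqueness not being hash-based.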
    • isEnableAutoSpace

      public boolean isEnableAutoSpace()
    • setEnableAutoSpace

      public void setEnableAutoSpace(boolean enableAutoSpace)
      If true (the default), the parser should estimate where spaces should be inserted between words. For many PDFs this is necessary as they do not include explicit whitespace characters.
    • isSuppressDuplicateOverlappingText

      public boolean isSuppressDuplicateOverlappingText()
    • setSuppressDuplicateOverlappingText

      public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText)
      If true, the parser should try to remove duplicated text over the same region. This is needed for some PDFs that achieve bolding by re-writing the same text in the same area. Note that this can slow down extraction substantially (PDFBOX-956) and sometimes remove characters that were not in fact duplicated (PDFBOX-1155). By default this is disabled.
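As a rough illustration of the idea (not the PDFBox algorithm itself), the sketch below drops a glyph when an identical character has already been kept at nearly the same position. The `Glyph` class, the `tolerance` parameter, and the pairwise comparison are assumptions chosen for clarity.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone model (hypothetical names); not PDFBox's implementation.
final class OverlapSketch {

    static final class Glyph {
        final char ch;
        final float x;
        final float y;

        Glyph(char ch, float x, float y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    // Keeps a glyph unless the same character was already kept within `tolerance` units.
    static String suppressDuplicates(List<Glyph> glyphs, float tolerance) {
        List<Glyph> kept = new ArrayList<>();
        StringBuilder out = new StringBuilder();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            for (Glyph k : kept) {
                if (k.ch == g.ch
                        && Math.abs(k.x - g.x) < tolerance
                        && Math.abs(k.y - g.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(g);
                out.append(g.ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "Hi" drawn twice with a 0.3 unit offset to fake bold text.
        List<Glyph> bold = List.of(
                new Glyph('H', 10f, 50f), new Glyph('i', 18f, 50f),
                new Glyph('H', 10.3f, 50f), new Glyph('i', 18.3f, 50f));
        System.out.println(suppressDuplicates(bold, 1.0f)); // Hi
    }
}
```

In this model every glyph is compared against all previously kept glyphs, so the work grows with the amount of text on the page, and a tolerance that is too generous would also discard legitimately repeated characters.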
    • isIgnoreContentStreamSpaceGlyphs

      public boolean isIgnoreContentStreamSpaceGlyphs()
    • setIgnoreContentStreamSpaceGlyphs

      public void setIgnoreContentStreamSpaceGlyphs(boolean ignoreContentStreamSpaceGlyphs)
      If true, the parser should ignore spaces in the content stream and rely purely on the algorithm to determine where word breaks are (PDFBOX-3774). This can improve text extraction results where the content stream is sorted by position and has text overlapping spaces, but could cause some word breaks to not be added to the output. By default this is disabled.
    • isExtractAnnotationText

      public boolean isExtractAnnotationText()
    • setExtractAnnotationText

      public void setExtractAnnotationText(boolean extractAnnotationText)
      If true (the default), text in annotations will be extracted.
    • isSortByPosition

      public boolean isSortByPosition()
    • setSortByPosition

      public void setSortByPosition(boolean sortByPosition)
      If true, sort text tokens by their x/y position before extracting text. This may be necessary for some PDFs (if the text tokens are not rendered "in order"), while for other PDFs it can produce the wrong result (for example if there are 2 columns, the text will be interleaved). Default is false.
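Both outcomes described above can be reproduced with a toy model. The sketch assumes a hypothetical `Token` type whose `y` value is the distance from the top of the page, and it sorts by `y` first and `x` second; the real stripper's ordering rules are more involved.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Standalone model (hypothetical names); y is measured from the top of the page.
final class SortByPositionSketch {

    static final class Token {
        final String text;
        final float x;
        final float y;

        Token(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Joins tokens in content-stream order, or in top-to-bottom, left-to-right order.
    static String extract(List<Token> streamOrder, boolean sortByPosition) {
        List<Token> tokens = new ArrayList<>(streamOrder);
        if (sortByPosition) {
            tokens.sort(Comparator.comparingDouble((Token t) -> t.y)
                    .thenComparingDouble(t -> t.x));
        }
        List<String> words = new ArrayList<>();
        for (Token t : tokens) {
            words.add(t.text);
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // Tokens written out of order: sorting repairs the line.
        List<Token> shuffled = List.of(new Token("world", 50f, 0f), new Token("Hello", 0f, 0f));
        System.out.println(extract(shuffled, true)); // Hello world

        // Two columns written column by column: sorting interleaves them.
        List<Token> columns = List.of(
                new Token("A1", 0f, 0f), new Token("A2", 0f, 1f),
                new Token("B1", 100f, 0f), new Token("B2", 100f, 1f));
        System.out.println(extract(columns, false)); // A1 A2 B1 B2
        System.out.println(extract(columns, true));  // A1 B1 A2 B2
    }
}
```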
    • getAverageCharTolerance

      public Float getAverageCharTolerance()
    • setAverageCharTolerance

      public void setAverageCharTolerance(Float averageCharTolerance)
      See PDFTextStripper.setAverageCharTolerance(float)
    • getSpacingTolerance

      public Float getSpacingTolerance()
    • setSpacingTolerance

      public void setSpacingTolerance(Float spacingTolerance)
      See PDFTextStripper.setSpacingTolerance(float)
    • getDropThreshold

      public Float getDropThreshold()
    • setDropThreshold

      public void setDropThreshold(Float dropThreshold)
      See PDFTextStripper.setDropThreshold(float)
    • getAccessCheckMode

      public PDFParserConfig.AccessCheckMode getAccessCheckMode()
    • setAccessCheckMode

      public void setAccessCheckMode(PDFParserConfig.AccessCheckMode accessCheckMode)
    • isCatchIntermediateIOExceptions

      public boolean isCatchIntermediateIOExceptions()
      Returns:
      whether or not to catch IOExceptions
    • setCatchIntermediateIOExceptions

      public void setCatchIntermediateIOExceptions(boolean catchIntermediateIOExceptions)
      The PDFBox parser will throw an IOException if there is a problem with a stream. If this is set to true, Tika's PDFParser will catch these exceptions and try to parse the rest of the document. After the parse is completed, Tika's PDFParser will throw the first caught exception.
      Parameters:
      catchIntermediateIOExceptions -
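The catch-continue-rethrow behavior described above follows a common pattern, sketched here in isolation. The page model (a list of strings where `null` marks a damaged stream) and the names `processPage`, `parse`, and `describe` are hypothetical and are not part of Tika.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Standalone model (hypothetical names); a null page stands for a damaged stream.
final class IntermediateIOExceptionSketch {

    static String processPage(String page, int pageNumber) throws IOException {
        if (page == null) {
            throw new IOException("bad stream on page " + pageNumber);
        }
        return page;
    }

    // Collects page text into `out`; optionally defers the first IOException to the end.
    static void parse(List<String> pages, boolean catchIntermediate, List<String> out)
            throws IOException {
        IOException first = null;
        for (int i = 0; i < pages.size(); i++) {
            try {
                out.add(processPage(pages.get(i), i + 1));
            } catch (IOException e) {
                if (!catchIntermediate) {
                    throw e; // fail fast: nothing after this page is processed
                }
                if (first == null) {
                    first = e; // remember only the first failure
                }
            }
        }
        if (first != null) {
            throw first; // surface the first failure once parsing has finished
        }
    }

    // Summarizes what was extracted and which error, if any, reached the caller.
    static String describe(List<String> pages, boolean catchIntermediate) {
        List<String> out = new ArrayList<>();
        try {
            parse(pages, catchIntermediate, out);
            return "extracted=" + out + "; error=none";
        } catch (IOException e) {
            return "extracted=" + out + "; error=" + e.getMessage();
        }
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("p1", null, "p3", null);
        System.out.println(describe(pages, true));  // extracted=[p1, p3]; error=bad stream on page 2
        System.out.println(describe(pages, false)); // extracted=[p1]; error=bad stream on page 2
    }
}
```

In both modes the caller still sees an exception; the difference in this model is how much content has been collected by the time it arrives.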
    • getOcr

      public OcrConfig getOcr()
      Returns:
      the OCR configuration
    • setOcr

      public void setOcr(OcrConfig ocr)
      Parameters:
      ocr - the OCR configuration
    • getOcrStrategy

      public OcrConfig.Strategy getOcrStrategy()
      Returns:
      strategy to use for OCR
    • getOcrStrategyAuto

      public OcrConfig.StrategyAuto getOcrStrategyAuto()
      Returns:
      OCR auto strategy to use when ocr_strategy = Auto
    • setOcrStrategy

      public void setOcrStrategy(OcrConfig.Strategy ocrStrategy)
      Sets which strategy to use for OCR.
    • setOcrStrategyAuto

      public void setOcrStrategyAuto(OcrConfig.StrategyAuto ocrStrategyAuto)
      Sets the OCR strategy auto configuration.
    • getOcrRenderingStrategy

      public OcrConfig.RenderingStrategy getOcrRenderingStrategy()
    • setOcrRenderingStrategy

      public void setOcrRenderingStrategy(OcrConfig.RenderingStrategy ocrRenderingStrategy)
      Controls what is rendered on the page image passed to OCR: ALL includes the rendering of the electronic text, while NO_TEXT renders only the images and vector graphics.
    • getOcrImageFormat

      public OcrConfig.ImageFormat getOcrImageFormat()
    • setOcrImageFormat

      public void setOcrImageFormat(OcrConfig.ImageFormat ocrImageFormat)
    • getOcrImageType

      public OcrConfig.ImageType getOcrImageType()
    • setOcrImageType

      public void setOcrImageType(OcrConfig.ImageType ocrImageType)
    • getOcrDPI

      public int getOcrDPI()
      Returns:
      dots per inch used to render the page image for OCR
    • setOcrDPI

      public void setOcrDPI(int ocrDPI)
      Dots per inch used to render the page image for OCR.
    • getOcrImageQuality

      public float getOcrImageQuality()
      Returns:
      image quality used to render the page image for OCR
    • setOcrImageQuality

      public void setOcrImageQuality(float ocrImageQuality)
      Image quality used to render the page image for OCR.
    • getOcrMaxImagePixels

      public long getOcrMaxImagePixels()
      Returns:
      maximum total pixels (width × height) allowed for a rendered page image before OCR is skipped
    • setOcrMaxImagePixels

      public void setOcrMaxImagePixels(long ocrMaxImagePixels)
      Set the maximum total pixels (width × height) for a rendered page image. Pages exceeding this limit are skipped for OCR. Default is 100,000,000 (100 megapixels).
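To get a feel for where the default limit falls, the arithmetic can be worked through for a US Letter page. The sketch assumes the rendered size is derived from the page size in PDF points (1/72 inch) and the OCR DPI, and that "exceeding" means strictly greater than the limit; the class and method names are hypothetical, and how Tika itself computes the rendered dimensions is not stated here.

```java
// Standalone arithmetic sketch (hypothetical names); not Tika's implementation.
final class OcrPixelLimitSketch {

    static final long DEFAULT_MAX_PIXELS = 100_000_000L; // default quoted above

    // Page dimensions are in PDF points (1/72 inch); long math avoids int overflow.
    static long renderedPixels(double widthPt, double heightPt, int dpi) {
        long width = Math.round(widthPt / 72.0 * dpi);
        long height = Math.round(heightPt / 72.0 * dpi);
        return width * height;
    }

    static boolean skipOcr(double widthPt, double heightPt, int dpi, long maxPixels) {
        return renderedPixels(widthPt, heightPt, dpi) > maxPixels;
    }

    public static void main(String[] args) {
        // US Letter is 612 x 792 points (8.5 x 11 inches).
        System.out.println(renderedPixels(612, 792, 300));  // 2550 * 3300 = 8415000
        System.out.println(renderedPixels(612, 792, 1200)); // 10200 * 13200 = 134640000
        System.out.println(skipOcr(612, 792, 300, DEFAULT_MAX_PIXELS));  // false
        System.out.println(skipOcr(612, 792, 1200, DEFAULT_MAX_PIXELS)); // true
    }
}
```

Under these assumptions an ordinary page at 300 DPI sits far below the default, and the limit only comes into play for very high DPI settings or very large page sizes.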
    • getOcrMaxPagesToOcr

      public int getOcrMaxPagesToOcr()
      Returns:
      maximum number of pages to OCR per document, or -1 for no limit
    • setOcrMaxPagesToOcr

      public void setOcrMaxPagesToOcr(int ocrMaxPagesToOcr)
      Set the maximum number of pages to OCR per document. Default is -1 (no limit).
    • isExtractActions

      public boolean isExtractActions()
      Returns:
      whether or not to extract PDActions
    • setExtractActions

      public void setExtractActions(boolean v)
      Whether or not to extract PDActions from the file. Most Action types are handled inline; JavaScript macros are processed as embedded documents.
      Parameters:
      v -
    • getMaxMainMemoryBytes

      public long getMaxMainMemoryBytes()
      The maximum amount of memory to use when loading a PDF into a PDDocument. Additional buffering is done using a temp file. The default is 512MB.
      Returns:
    • setMaxMainMemoryBytes

      public void setMaxMainMemoryBytes(long maxMainMemoryBytes)
    • isSetKCMS

      public boolean isSetKCMS()
    • setSetKCMS

      public void setSetKCMS(boolean setKCMS)

      Whether to call System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider"). KCMS is the unmaintained, legacy provider and is far faster than the newer replacement. However, there are stability and security risks with using the unmaintained legacy provider.

      Note, of course, that this is not thread safe. If the value is false in your first thread, and the second thread changes this to true, the system property in the first thread will now be true.

      Default is false.

      Parameters:
      setKCMS - whether or not to set KCMS
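The thread-safety caveat comes from the fact that Java system properties are global to the JVM, not local to a thread. The sketch below demonstrates this with a made-up property key, `sketch.cmm.provider`, so that it does not touch the real `sun.java2d.cmm` setting; the class and method names are likewise hypothetical.

```java
// Standalone demonstration (hypothetical names and property key).
final class SystemPropertySketch {

    // Made-up key so the demo leaves the real "sun.java2d.cmm" property alone.
    static final String KEY = "sketch.cmm.provider";

    // Returns the value seen by the calling thread before and after another thread sets it.
    static String[] observe() {
        System.clearProperty(KEY);
        String[] seen = new String[2];
        seen[0] = System.getProperty(KEY); // not set yet
        Thread second = new Thread(() -> System.setProperty(KEY, "legacy"));
        second.start();
        try {
            second.join(); // wait until the second thread has made its change
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        seen[1] = System.getProperty(KEY); // the change is visible in this thread too
        return seen;
    }

    public static void main(String[] args) {
        String[] seen = observe();
        System.out.println(seen[0]); // null
        System.out.println(seen[1]); // legacy
    }
}
```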
    • isDetectAngles

      public boolean isDetectAngles()
    • setDetectAngles

      public void setDetectAngles(boolean detectAngles)
    • setImageStrategy

      public void setImageStrategy(PDFParserConfig.IMAGE_STRATEGY imageStrategy)
    • setImageGraphicsEngineFactory

      public void setImageGraphicsEngineFactory(ImageGraphicsEngineFactory imageGraphicsEngineFactory)
      EXPERT: Customize the class that handles inline images within a PDF page.
      Parameters:
      imageGraphicsEngineFactory -
    • setImageGraphicsEngineFactoryClass

      public void setImageGraphicsEngineFactoryClass(String className)
      EXPERT: Customize the class that handles inline images within a PDF page. Use this setter when specifying the factory class name in JSON config.
      Parameters:
      className - fully qualified class name of an ImageGraphicsEngineFactory implementation
    • getImageGraphicsEngineFactory

      public ImageGraphicsEngineFactory getImageGraphicsEngineFactory()
    • getImageStrategy

      public PDFParserConfig.IMAGE_STRATEGY getImageStrategy()
    • isExtractIncrementalUpdateInfo

      public boolean isExtractIncrementalUpdateInfo()
    • setExtractIncrementalUpdateInfo

      public void setExtractIncrementalUpdateInfo(boolean extractIncrementalUpdateInfo)
    • isParseIncrementalUpdates

      public boolean isParseIncrementalUpdates()
    • setParseIncrementalUpdates

      public void setParseIncrementalUpdates(boolean parseIncrementalUpdates)
    • getMaxIncrementalUpdates

      public int getMaxIncrementalUpdates()
    • setMaxIncrementalUpdates

      public void setMaxIncrementalUpdates(int maxIncrementalUpdates)
      The maximum number of incremental updates to parse if setParseIncrementalUpdates(boolean) is set to true.
      Parameters:
      maxIncrementalUpdates -
    • setThrowOnEncryptedPayload

      public void setThrowOnEncryptedPayload(boolean throwOnEncryptedPayload)
    • isThrowOnEncryptedPayload

      public boolean isThrowOnEncryptedPayload()