Package org.apache.tika.parser.pdf
Class PDFParserConfig
- java.lang.Object
-
- org.apache.tika.parser.pdf.PDFParserConfig
-
- All Implemented Interfaces:
Serializable
public class PDFParserConfig extends Object implements Serializable
Config for PDFParser. This allows parameters to be set programmatically:
- Calls to PDFParser, e.g. parser.getPDFParserConfig().setEnableAutoSpace() (as before)
- Passing to PDFParser through a ParseContext: context.set(PDFParserConfig.class, config);
- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes
Modifier and Type Class Description
static class
PDFParserConfig.IMAGE_STRATEGY
static class
PDFParserConfig.OCR_RENDERING_STRATEGY
static class
PDFParserConfig.OCR_STRATEGY
static class
PDFParserConfig.OCRStrategyAuto
Encapsulate the numbers used to control OCR Strategy when set to auto
-
Constructor Summary
Constructors
Constructor Description
PDFParserConfig()
-
Method Summary
All Methods Instance Methods Concrete Methods
Modifier and Type Method Description
PDFParserConfig
cloneAndUpdate(PDFParserConfig updates)
void
configure(org.apache.tika.parser.pdf.PDF2XHTML pdf2XHTML)
Configures the given pdf2XHTML.
boolean
equals(Object o)
AccessChecker
getAccessChecker()
Float
getAverageCharTolerance()
Float
getDropThreshold()
ImageGraphicsEngineFactory
getImageGraphicsEngineFactory()
PDFParserConfig.IMAGE_STRATEGY
getImageStrategy()
long
getMaxMainMemoryBytes()
The maximum amount of memory to use when loading a pdf into a PDDocument.
int
getOcrDPI()
Dots per inch used to render the page image for OCR.
String
getOcrImageFormatName()
String representation of the image format used to render the page image for OCR (examples: png, tiff, jpeg)
float
getOcrImageQuality()
Image quality used to render the page image for OCR.
org.apache.pdfbox.rendering.ImageType
getOcrImageType()
Image type used to render the page image for OCR.
PDFParserConfig.OCR_RENDERING_STRATEGY
getOcrRenderingStrategy()
PDFParserConfig.OCR_STRATEGY
getOcrStrategy()
PDFParserConfig.OCRStrategyAuto
getOcrStrategyAuto()
Renderer
getRenderer()
Float
getSpacingTolerance()
int
hashCode()
boolean
isCatchIntermediateIOExceptions()
boolean
isDetectAngles()
boolean
isEnableAutoSpace()
boolean
isExtractAcroFormContent()
boolean
isExtractActions()
boolean
isExtractAnnotationText()
boolean
isExtractBookmarksText()
boolean
isExtractFontNames()
boolean
isExtractInlineImageMetadataOnly()
boolean
isExtractInlineImages()
boolean
isExtractMarkedContent()
boolean
isExtractUniqueInlineImagesOnly()
boolean
isIfXFAExtractOnlyXFA()
boolean
isSetKCMS()
boolean
isSortByPosition()
boolean
isSuppressDuplicateOverlappingText()
void
setAccessChecker(AccessChecker accessChecker)
void
setAverageCharTolerance(Float averageCharTolerance)
See PDFTextStripper.setAverageCharTolerance(float)
void
setCatchIntermediateIOExceptions(boolean catchIntermediateIOExceptions)
The PDFBox parser will throw an IOException if there is a problem with a stream.
void
setDetectAngles(boolean detectAngles)
void
setDropThreshold(Float dropThreshold)
See PDFTextStripper.setDropThreshold(float)
void
setEnableAutoSpace(boolean enableAutoSpace)
If true (the default), the parser should estimate where spaces should be inserted between words.
void
setExtractAcroFormContent(boolean extractAcroFormContent)
If true (the default), extract content from AcroForms at the end of the document.
void
setExtractActions(boolean v)
Whether or not to extract PDActions from the file.
void
setExtractAnnotationText(boolean extractAnnotationText)
If true (the default), text in annotations will be extracted.
void
setExtractBookmarksText(boolean extractBookmarksText)
If true, extract bookmarks (document outline) text.
void
setExtractFontNames(boolean extractFontNames)
Extract font names into a metadata field.
void
setExtractInlineImages(boolean extractInlineImages)
If true, extract the literal inline embedded OBXImages.
void
setExtractMarkedContent(boolean extractMarkedContent)
If the PDF contains marked content, try to extract text and its marked structure.
void
setExtractUniqueInlineImagesOnly(boolean extractUniqueInlineImagesOnly)
Multiple pages within a PDF file might refer to the same underlying image.
void
setIfXFAExtractOnlyXFA(boolean ifXFAExtractOnlyXFA)
If false (the default), extract content from the full PDF as well as the XFA form.
void
setImageGraphicsEngineFactory(ImageGraphicsEngineFactory imageGraphicsEngineFactory)
EXPERT: Customize the class that handles inline images within a PDF page.
void
setImageStrategy(String imageStrategy)
void
setImageStrategy(PDFParserConfig.IMAGE_STRATEGY imageStrategy)
void
setMaxMainMemoryBytes(long maxMainMemoryBytes)
void
setOcrDPI(int ocrDPI)
Dots per inch used to render the page image for OCR.
void
setOcrImageFormatName(String ocrImageFormatName)
void
setOcrImageQuality(float ocrImageQuality)
Image quality used to render the page image for OCR.
void
setOcrImageType(String ocrImageTypeString)
Image type used to render the page image for OCR.
void
setOcrImageType(org.apache.pdfbox.rendering.ImageType ocrImageType)
Image type used to render the page image for OCR.
void
setOcrRenderingStrategy(String ocrRenderingStrategyString)
void
setOcrRenderingStrategy(PDFParserConfig.OCR_RENDERING_STRATEGY ocrRenderingStrategy)
When rendering the page for OCR, do you want to include the rendering of the electronic text, ALL, or do you only want to run OCR on the images and vector graphics (NO_TEXT)?
void
setOcrStrategy(String ocrStrategyString)
Which strategy to use for OCR
void
setOcrStrategy(PDFParserConfig.OCR_STRATEGY ocrStrategy)
Which strategy to use for OCR
void
setOcrStrategyAuto(String ocrStrategyAuto)
void
setRenderer(Renderer renderer)
void
setSetKCMS(boolean setKCMS)
Whether to call System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider").
void
setSortByPosition(boolean sortByPosition)
If true, sort text tokens by their x/y position before extracting text.
void
setSpacingTolerance(Float spacingTolerance)
See PDFTextStripper.setSpacingTolerance(float)
void
setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText)
If true, the parser should try to remove duplicated text over the same region.
-
-
-
Method Detail
-
isExtractInlineImageMetadataOnly
public boolean isExtractInlineImageMetadataOnly()
- Returns:
- whether or not to extract only inline image metadata and not render the images
-
isExtractMarkedContent
public boolean isExtractMarkedContent()
-
setExtractMarkedContent
public void setExtractMarkedContent(boolean extractMarkedContent)
If the PDF contains marked content, try to extract text and its marked structure. If the PDF does not contain marked content, back off to the regular PDF2XHTML for text extraction. As of 1.24, this is an "alpha" version.
- Parameters:
extractMarkedContent
-
- Since:
- 1.24
-
configure
public void configure(org.apache.tika.parser.pdf.PDF2XHTML pdf2XHTML)
Configures the given pdf2XHTML.
- Parameters:
pdf2XHTML
-
-
isExtractAcroFormContent
public boolean isExtractAcroFormContent()
- See Also:
setExtractAcroFormContent(boolean)
-
setExtractAcroFormContent
public void setExtractAcroFormContent(boolean extractAcroFormContent)
If true (the default), extract content from AcroForms at the end of the document. If an XFA is found, try to process that; otherwise, process the AcroForm.
- Parameters:
extractAcroFormContent
-
-
isIfXFAExtractOnlyXFA
public boolean isIfXFAExtractOnlyXFA()
- Returns:
- how to handle XFA data if it exists
- See Also:
setIfXFAExtractOnlyXFA(boolean)
-
setIfXFAExtractOnlyXFA
public void setIfXFAExtractOnlyXFA(boolean ifXFAExtractOnlyXFA)
If false (the default), extract content from the full PDF as well as the XFA form. This will likely lead to some duplicated content.
- Parameters:
ifXFAExtractOnlyXFA
-
-
isExtractBookmarksText
public boolean isExtractBookmarksText()
- See Also:
setExtractBookmarksText(boolean)
-
setExtractBookmarksText
public void setExtractBookmarksText(boolean extractBookmarksText)
If true, extract bookmarks (document outline) text. The default is true.
- Parameters:
extractBookmarksText
-
-
isExtractFontNames
public boolean isExtractFontNames()
-
setExtractFontNames
public void setExtractFontNames(boolean extractFontNames)
Extract font names into a metadata field.
- Parameters:
extractFontNames
-
-
isExtractInlineImages
public boolean isExtractInlineImages()
- See Also:
setExtractInlineImages(boolean)
-
setExtractInlineImages
public void setExtractInlineImages(boolean extractInlineImages)
If true, extract the literal inline embedded OBXImages. Beware: some PDF documents of modest size (~4MB) can contain thousands of embedded images totaling > 2.5 GB. Also, at least as of PDFBox 1.8.5, there can be surprisingly large memory consumption and/or out of memory errors. Along the same lines, note that this does not extract "logical" images. Some PDF writers break up a single logical image into hundreds of little images. With this option set to true, you might get those hundreds of little images. NOTE ALSO: this extracts the raw images without clipping, rotation, masks, color inversion, etc. The images that this extracts may look nothing like what a human would expect given the appearance of the PDF. Set to true only with the greatest caution. The default is false.
- Parameters:
extractInlineImages
-
- See Also:
setExtractUniqueInlineImagesOnly(boolean)
-
isExtractUniqueInlineImagesOnly
public boolean isExtractUniqueInlineImagesOnly()
-
setExtractUniqueInlineImagesOnly
public void setExtractUniqueInlineImagesOnly(boolean extractUniqueInlineImagesOnly)
Multiple pages within a PDF file might refer to the same underlying image. If extractUniqueInlineImagesOnly is set to false, the parser will call the EmbeddedExtractor each time the image appears on a page. This might be desired for some use cases. However, to avoid duplication of extracted images, set this to true. The default is true.
Note that uniqueness is determined only by the underlying PDF COSObject id, not by file hash or similar equality metric. If the PDF actually contains multiple copies of the same image -- all with different object ids -- then all images will be extracted.
For this parameter to have any effect, extractInlineImages must be set to true.
Because of TIKA-1742 -- to avoid infinite recursion -- no matter the setting of this parameter, the extractor will only pull out one copy of each image per page. This parameter tries to capture uniqueness across the entire document.
- Parameters:
extractUniqueInlineImagesOnly
-
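The id-based uniqueness rule described above can be pictured with a small standalone sketch. It is not Tika's implementation: the ImageRef class, its fields, and the uniqueByObjectId helper are hypothetical names used only to show that two references sharing an object id collapse to one, while a byte-identical copy stored under a different id is still kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: not Tika's implementation. ImageRef and its fields are hypothetical.
class UniqueImageSketch {

    // Stand-in for an image reference on a page: an object id plus the image bytes.
    static final class ImageRef {
        final long objectId;
        final byte[] data;

        ImageRef(long objectId, byte[] data) {
            this.objectId = objectId;
            this.data = data;
        }
    }

    // Keep only the first reference seen for each object id; image bytes are never compared.
    static List<ImageRef> uniqueByObjectId(List<ImageRef> refs) {
        Set<Long> seen = new HashSet<>();
        List<ImageRef> out = new ArrayList<>();
        for (ImageRef ref : refs) {
            if (seen.add(ref.objectId)) {
                out.add(ref);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] logo = {1, 2, 3};
        List<ImageRef> refs = Arrays.asList(
                new ImageRef(7L, logo),          // first appearance
                new ImageRef(7L, logo),          // same object id on a later page: skipped
                new ImageRef(9L, logo.clone())); // identical bytes, different id: kept
        System.out.println(uniqueByObjectId(refs).size()); // prints 2
    }
}
```

Deduplicating by content hash instead would collapse the third entry as well, which is exactly what the description says does not happen.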
-
isEnableAutoSpace
public boolean isEnableAutoSpace()
- See Also:
setEnableAutoSpace(boolean)
-
setEnableAutoSpace
public void setEnableAutoSpace(boolean enableAutoSpace)
If true (the default), the parser should estimate where spaces should be inserted between words. For many PDFs this is necessary as they do not include explicit whitespace characters.
-
isSuppressDuplicateOverlappingText
public boolean isSuppressDuplicateOverlappingText()
-
setSuppressDuplicateOverlappingText
public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText)
If true, the parser should try to remove duplicated text over the same region. This is needed for some PDFs that achieve bolding by re-writing the same text in the same area. Note that this can slow down extraction substantially (PDFBOX-956) and sometimes remove characters that were not in fact duplicated (PDFBOX-1155). By default this is disabled.
-
isExtractAnnotationText
public boolean isExtractAnnotationText()
- See Also:
setExtractAnnotationText(boolean)
-
setExtractAnnotationText
public void setExtractAnnotationText(boolean extractAnnotationText)
If true (the default), text in annotations will be extracted.
-
isSortByPosition
public boolean isSortByPosition()
- See Also:
setSortByPosition(boolean)
-
setSortByPosition
public void setSortByPosition(boolean sortByPosition)
If true, sort text tokens by their x/y position before extracting text. This may be necessary for some PDFs (if the text tokens are not rendered "in order"), while for other PDFs it can produce the wrong result (for example if there are 2 columns, the text will be interleaved). Default is false.
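The two-column caveat can be seen in a toy example. The sketch below is not PDFBox's sorting algorithm; Token, its coordinates, and readInPositionOrder are hypothetical, and y is assumed to be measured from the top of the page. It only shows how a naive top-to-bottom, left-to-right sort interleaves two columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative only: not PDFBox's algorithm. Token and its coordinates are hypothetical.
class SortByPositionSketch {

    static final class Token {
        final float x;
        final float y; // assumed: distance from the top of the page
        final String text;

        Token(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Order tokens top-to-bottom, then left-to-right, and join their text with spaces.
    static String readInPositionOrder(List<Token> tokens) {
        List<Token> sorted = new ArrayList<>(tokens);
        sorted.sort(Comparator.comparingDouble((Token t) -> t.y)
                .thenComparingDouble(t -> t.x));
        StringBuilder sb = new StringBuilder();
        for (Token t : sorted) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(t.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stream order reads the left column (A1, A2) and then the right column (B1, B2).
        List<Token> tokens = Arrays.asList(
                new Token(50f, 100f, "A1"),
                new Token(50f, 120f, "A2"),
                new Token(300f, 100f, "B1"),
                new Token(300f, 120f, "B2"));
        System.out.println(readInPositionOrder(tokens)); // prints A1 B1 A2 B2
    }
}
```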
-
getAverageCharTolerance
public Float getAverageCharTolerance()
- See Also:
setAverageCharTolerance(Float)
-
setAverageCharTolerance
public void setAverageCharTolerance(Float averageCharTolerance)
See PDFTextStripper.setAverageCharTolerance(float)
-
getSpacingTolerance
public Float getSpacingTolerance()
- See Also:
setSpacingTolerance(Float)
-
setSpacingTolerance
public void setSpacingTolerance(Float spacingTolerance)
See PDFTextStripper.setSpacingTolerance(float)
-
getDropThreshold
public Float getDropThreshold()
- See Also:
setDropThreshold(Float)
-
setDropThreshold
public void setDropThreshold(Float dropThreshold)
See PDFTextStripper.setDropThreshold(float)
-
getAccessChecker
public AccessChecker getAccessChecker()
-
setAccessChecker
public void setAccessChecker(AccessChecker accessChecker)
-
isCatchIntermediateIOExceptions
public boolean isCatchIntermediateIOExceptions()
- Returns:
- whether or not to catch IOExceptions
-
setCatchIntermediateIOExceptions
public void setCatchIntermediateIOExceptions(boolean catchIntermediateIOExceptions)
The PDFBox parser will throw an IOException if there is a problem with a stream. If this is set to true, Tika's PDFParser will catch these exceptions and try to parse the rest of the document. After the parse is completed, Tika's PDFParser will throw the first caught exception.
- Parameters:
catchIntermediateIOExceptions
-
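The "catch, keep going, rethrow the first exception at the end" behavior described above follows a common pattern, sketched here with the standard library only. This is not Tika's code: parsePage, parseAll, and the string "pages" are hypothetical stand-ins for the per-stream work.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: not Tika's implementation. Page "parsing" is simulated with strings.
class CatchIntermediateSketch {

    // Hypothetical per-page step: a page whose content is "BAD" has a broken stream.
    static String parsePage(String page) throws IOException {
        if ("BAD".equals(page)) {
            throw new IOException("problem with stream");
        }
        return page.toUpperCase();
    }

    // Parse every page, remember only the first IOException, and rethrow it at the end.
    static void parseAll(List<String> pages, List<String> out, boolean catchIntermediate)
            throws IOException {
        IOException first = null;
        for (String page : pages) {
            try {
                out.add(parsePage(page));
            } catch (IOException e) {
                if (!catchIntermediate) {
                    throw e; // fail fast: later pages are never reached
                }
                if (first == null) {
                    first = e;
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        try {
            parseAll(Arrays.asList("a", "BAD", "c"), out, true);
        } catch (IOException e) {
            System.out.println("rethrown after parse: " + e.getMessage());
        }
        System.out.println(out); // content from the pages before and after the bad one
    }
}
```

The caller still sees an exception either way; the difference is how much content has already been emitted by the time it arrives.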
-
getOcrStrategy
public PDFParserConfig.OCR_STRATEGY getOcrStrategy()
- Returns:
- strategy to use for OCR
-
getOcrStrategyAuto
public PDFParserConfig.OCRStrategyAuto getOcrStrategyAuto()
- Returns:
- ocr auto strategy to use when ocr_strategy = Auto
-
setOcrStrategy
public void setOcrStrategy(PDFParserConfig.OCR_STRATEGY ocrStrategy)
Which strategy to use for OCR
- Parameters:
ocrStrategy
-
-
setOcrStrategyAuto
public void setOcrStrategyAuto(String ocrStrategyAuto)
-
setOcrStrategy
public void setOcrStrategy(String ocrStrategyString)
Which strategy to use for OCR
- Parameters:
ocrStrategyString
-
-
getOcrRenderingStrategy
public PDFParserConfig.OCR_RENDERING_STRATEGY getOcrRenderingStrategy()
-
setOcrRenderingStrategy
public void setOcrRenderingStrategy(String ocrRenderingStrategyString)
-
setOcrRenderingStrategy
public void setOcrRenderingStrategy(PDFParserConfig.OCR_RENDERING_STRATEGY ocrRenderingStrategy)
When rendering the page for OCR, do you want to include the rendering of the electronic text, ALL, or do you only want to run OCR on the images and vector graphics (NO_TEXT)?
- Parameters:
ocrRenderingStrategy
-
-
getOcrImageFormatName
public String getOcrImageFormatName()
String representation of the image format used to render the page image for OCR (examples: png, tiff, jpeg)
- Returns:
-
setOcrImageFormatName
public void setOcrImageFormatName(String ocrImageFormatName)
- Parameters:
ocrImageFormatName
- name of image format used to render page image
- See Also:
getOcrImageFormatName()
-
getOcrImageType
public org.apache.pdfbox.rendering.ImageType getOcrImageType()
Image type used to render the page image for OCR.
- Returns:
- image type
- See Also:
setOcrImageType(ImageType)
-
setOcrImageType
public void setOcrImageType(org.apache.pdfbox.rendering.ImageType ocrImageType)
Image type used to render the page image for OCR.
- Parameters:
ocrImageType
-
-
setOcrImageType
public void setOcrImageType(String ocrImageTypeString)
Image type used to render the page image for OCR.
- See Also:
setOcrImageType(ImageType)
-
getOcrDPI
public int getOcrDPI()
Dots per inch used to render the page image for OCR.
- Returns:
- dots per inch
-
setOcrDPI
public void setOcrDPI(int ocrDPI)
Dots per inch used to render the page image for OCR. This does not apply to all image formats.
- Parameters:
ocrDPI
-
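As a rough guide to what a given DPI means for the rendered page image, the arithmetic can be sketched as follows. This assumes the usual 72 PDF user-space units per inch and simple rounding; the renderer's own rounding may differ, and the 300 DPI value is only an example, not a documented default. The class and method names are hypothetical.

```java
// Illustrative arithmetic only. Assumes 72 PDF user-space units per inch and simple rounding.
class OcrDpiSketch {

    static final float POINTS_PER_INCH = 72f;

    // Pixel length of a page side, given its length in points and the render DPI.
    static int pixels(float points, int dpi) {
        return Math.round(points * dpi / POINTS_PER_INCH);
    }

    public static void main(String[] args) {
        // A US Letter page is 612 x 792 points (8.5 x 11 inches).
        int width = pixels(612f, 300);
        int height = pixels(792f, 300);
        System.out.println(width + " x " + height); // prints 2550 x 3300
    }
}
```

Doubling the DPI quadruples the pixel count, so higher values trade memory and OCR time for detail.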
-
getOcrImageQuality
public float getOcrImageQuality()
Image quality used to render the page image for OCR. This does not apply to all image formats.
- Returns:
-
setOcrImageQuality
public void setOcrImageQuality(float ocrImageQuality)
Image quality used to render the page image for OCR. This does not apply to all image formats
-
isExtractActions
public boolean isExtractActions()
- Returns:
- whether or not to extract PDActions
- See Also:
setExtractActions(boolean)
-
setExtractActions
public void setExtractActions(boolean v)
Whether or not to extract PDActions from the file. Most Action types are handled inline; javascript macros are processed as embedded documents.
- Parameters:
v
-
-
getMaxMainMemoryBytes
public long getMaxMainMemoryBytes()
The maximum amount of memory to use when loading a pdf into a PDDocument. Additional buffering is done using a temp file. The default is 512MB.
- Returns:
-
setMaxMainMemoryBytes
public void setMaxMainMemoryBytes(long maxMainMemoryBytes)
-
isSetKCMS
public boolean isSetKCMS()
-
setSetKCMS
public void setSetKCMS(boolean setKCMS)
Whether to call System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider"). KCMS is the unmaintained, legacy provider and is far faster than the newer replacement. However, there are stability and security risks with using the unmaintained legacy provider.
Note, of course, that this is not thread safe. If the value is false in your first thread, and the second thread changes this to true, the system property in the first thread will now be true.
Default is false.
- Parameters:
setKCMS
- whether or not to set KCMS
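The thread-safety note follows from system properties being shared by the whole JVM rather than held per thread. The sketch below demonstrates that with a made-up property name (example.sketch.flag) instead of sun.java2d.cmm, so it has no effect on color management; the class and method names are hypothetical.

```java
// Illustrative only: uses a made-up property name rather than sun.java2d.cmm.
class GlobalPropertySketch {

    static final String KEY = "example.sketch.flag"; // hypothetical property name

    // Set the property from a second thread, then report what the calling thread reads.
    static String readAfterOtherThreadWrites() {
        System.clearProperty(KEY);
        Thread writer = new Thread(() -> System.setProperty(KEY, "true"));
        writer.start();
        try {
            writer.join(); // wait until the other thread has written the property
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return System.getProperty(KEY);
    }

    public static void main(String[] args) {
        // The calling thread never set the property, yet it observes the other thread's value.
        System.out.println(readAfterOtherThreadWrites()); // prints true
    }
}
```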
-
isDetectAngles
public boolean isDetectAngles()
-
setDetectAngles
public void setDetectAngles(boolean detectAngles)
-
cloneAndUpdate
public PDFParserConfig cloneAndUpdate(PDFParserConfig updates) throws TikaException
- Throws:
TikaException
-
setRenderer
public void setRenderer(Renderer renderer)
-
getRenderer
public Renderer getRenderer()
-
setImageStrategy
public void setImageStrategy(String imageStrategy)
-
setImageStrategy
public void setImageStrategy(PDFParserConfig.IMAGE_STRATEGY imageStrategy)
-
setImageGraphicsEngineFactory
public void setImageGraphicsEngineFactory(ImageGraphicsEngineFactory imageGraphicsEngineFactory)
EXPERT: Customize the class that handles inline images within a PDF page.
- Parameters:
imageGraphicsEngineFactory
-
-
getImageGraphicsEngineFactory
public ImageGraphicsEngineFactory getImageGraphicsEngineFactory()
-
getImageStrategy
public PDFParserConfig.IMAGE_STRATEGY getImageStrategy()
-
-