public class PDFParserConfig extends Object implements Serializable
Constructor and Description |
---|
PDFParserConfig() |
PDFParserConfig(InputStream is): Loads properties from InputStream and then tries to close InputStream. |
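The load-then-close behaviour of the InputStream constructor can be pictured with a small, self-contained sketch. It is not Tika's implementation: the class `ConfigLoadingSketch`, the property key names, and the handling of read failures are assumptions made for illustration. Only the two defaults are taken from this page.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Illustrative stand-in only; this is NOT Tika's PDFParserConfig.
final class ConfigLoadingSketch {
    boolean enableAutoSpace = true;       // default stated on this page
    boolean extractInlineImages = false;  // default stated on this page

    // Loads flags from a java.util.Properties stream, then tries to close the stream.
    static ConfigLoadingSketch load(InputStream is) {
        ConfigLoadingSketch cfg = new ConfigLoadingSketch();
        Properties props = new Properties();
        try {
            props.load(is);
        } catch (IOException e) {
            // Assumption: a read failure leaves the defaults in place.
        } finally {
            try {
                is.close();
            } catch (IOException ignored) {
                // "Tries to close": a failed close is not treated as fatal here.
            }
        }
        // Assumed key names: the setter names without the "set" prefix.
        cfg.enableAutoSpace = Boolean.parseBoolean(
                props.getProperty("enableAutoSpace", String.valueOf(cfg.enableAutoSpace)));
        cfg.extractInlineImages = Boolean.parseBoolean(
                props.getProperty("extractInlineImages", String.valueOf(cfg.extractInlineImages)));
        return cfg;
    }
}
```

In this sketch a key that is absent from the stream keeps its default, so a properties file only needs to list the flags it changes.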
Modifier and Type | Method and Description |
---|---|
void | configure(org.apache.tika.parser.pdf.PDF2XHTML pdf2XHTML): Configures the given pdf2XHTML. |
boolean | equals(Object obj) |
AccessChecker | getAccessChecker() |
Float | getAverageCharTolerance() |
boolean | getEnableAutoSpace() |
boolean | getExtractAcroFormContent() |
boolean | getExtractAnnotationText() |
boolean | getExtractInlineImages() |
boolean | getExtractUniqueInlineImagesOnly() |
boolean | getSortByPosition() |
Float | getSpacingTolerance() |
boolean | getSuppressDuplicateOverlappingText() |
boolean | getUseNonSequentialParser() |
int | hashCode() |
void | setAccessChecker(AccessChecker accessChecker) |
void | setAverageCharTolerance(Float averageCharTolerance): See PDFTextStripper.setAverageCharTolerance(float). |
void | setEnableAutoSpace(boolean enableAutoSpace): If true (the default), the parser should estimate where spaces should be inserted between words. |
void | setExtractAcroFormContent(boolean extractAcroFormContent): If true (the default), extract content from AcroForms at the end of the document. |
void | setExtractAnnotationText(boolean extractAnnotationText): If true (the default), text in annotations will be extracted. |
void | setExtractInlineImages(boolean extractInlineImages): If true, extract inline embedded OBXImages. |
void | setExtractUniqueInlineImagesOnly(boolean extractUniqueInlineImagesOnly): Multiple pages within a PDF file might refer to the same underlying image. |
void | setSortByPosition(boolean sortByPosition): If true, sort text tokens by their x/y position before extracting text. |
void | setSpacingTolerance(Float spacingTolerance): See PDFTextStripper.setSpacingTolerance(float). |
void | setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText): If true, the parser should try to remove duplicated text over the same region. |
void | setUseNonSequentialParser(boolean useNonSequentialParser): If true, uses PDFBox's non-sequential parser. |
String | toString() |
public PDFParserConfig()
public PDFParserConfig(InputStream is)
Parameters: is
public void configure(org.apache.tika.parser.pdf.PDF2XHTML pdf2XHTML)
Parameters: pdf2XHTML
public boolean getExtractAcroFormContent()
See also: setExtractAcroFormContent(boolean)
public void setExtractAcroFormContent(boolean extractAcroFormContent)
Parameters: extractAcroFormContent
public boolean getExtractInlineImages()
See also: setExtractInlineImages(boolean)
public void setExtractInlineImages(boolean extractInlineImages)
If true, extract inline embedded OBXImages. Set to true with caution. The default is false.
See also: setExtractUniqueInlineImagesOnly(boolean)
Parameters: extractInlineImages
public boolean getExtractUniqueInlineImagesOnly()
public void setExtractUniqueInlineImagesOnly(boolean extractUniqueInlineImagesOnly)
Multiple pages within a PDF file might refer to the same underlying image. If extractUniqueInlineImagesOnly is set to false, the parser will call the EmbeddedExtractor each time the image appears on a page. This might be desired for some use cases. However, to avoid duplication of extracted images, set this to true. The default is true.
Note that uniqueness is determined only by the underlying PDF COSObject id, not by file hash or similar equality metric. If the PDF actually contains multiple copies of the same image -- all with different object ids -- then all images will be extracted.
For this parameter to have any effect, extractInlineImages must be set to true.
Because of TIKA-1742 -- to avoid infinite recursion -- no matter the setting of this parameter, the extractor will only pull out one copy of each image per page. This parameter tries to capture uniqueness across the entire document.
Parameters: extractUniqueInlineImagesOnly
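The interaction between the per-page rule and the document-wide uniqueness flag can be sketched as follows. This is a simplified model, not Tika's code: the class `UniqueImageSketch`, the representation of pages as lists of integer object ids, and the returned call log are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model of the rules described above; NOT Tika's implementation.
final class UniqueImageSketch {
    // Each inner list holds the (hypothetical) object ids of the images drawn on one page.
    static List<String> imagesHandedToExtractor(List<List<Integer>> pages, boolean uniqueOnly) {
        List<String> calls = new ArrayList<>();
        Set<Integer> seenInDocument = new HashSet<>();
        for (int p = 0; p < pages.size(); p++) {
            Set<Integer> seenOnPage = new HashSet<>();
            for (int id : pages.get(p)) {
                // Regardless of the flag, at most one copy of an image per page.
                if (!seenOnPage.add(id)) {
                    continue;
                }
                // With the flag on, uniqueness is tracked across the whole document.
                if (uniqueOnly && !seenInDocument.add(id)) {
                    continue;
                }
                calls.add("page " + (p + 1) + ": object " + id);
            }
        }
        return calls;
    }
}
```

Because only ids are compared, two byte-identical images stored under different ids are both reported in either mode, which matches the note on COSObject ids above.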
public boolean getEnableAutoSpace()
See also: setEnableAutoSpace(boolean)
public void setEnableAutoSpace(boolean enableAutoSpace)
public boolean getSuppressDuplicateOverlappingText()
public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText)
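The method summary describes this flag as removing duplicated text over the same region. One way to picture that idea is the sketch below, which drops a glyph when an identical glyph has already been kept within a small distance. The `Glyph` type, the `tolerance` parameter, and the comparison rule are assumptions for illustration; the actual logic lives in PDFBox and may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only; NOT the algorithm used by Tika or PDFBox.
final class DuplicateTextSketch {
    // Hypothetical glyph: its text and the coordinates where it is drawn.
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Keeps the first occurrence of each glyph and drops later glyphs with the
    // same text whose position is within `tolerance` on both axes.
    static List<Glyph> suppressDuplicates(List<Glyph> glyphs, float tolerance) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            for (Glyph k : kept) {
                if (k.text.equals(g.text)
                        && Math.abs(k.x - g.x) <= tolerance
                        && Math.abs(k.y - g.y) <= tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(g);
            }
        }
        return kept;
    }
}
```

The nested loop is quadratic in the number of glyphs, which is acceptable for a sketch but hints at why such a feature could add cost on large pages.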
public boolean getExtractAnnotationText()
See also: setExtractAnnotationText(boolean)
public void setExtractAnnotationText(boolean extractAnnotationText)
public boolean getSortByPosition()
See also: setSortByPosition(boolean)
public void setSortByPosition(boolean sortByPosition)
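The method summary says this flag sorts text tokens by their x/y position before extracting text. The sketch below shows why that can change the output when a PDF stores text in a different order than it appears on the page. The `Token` type and the convention that smaller y means higher on the page are assumptions for illustration; this is not the PDFBox implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only; NOT the sorting used by Tika or PDFBox.
final class SortByPositionSketch {
    // Hypothetical text token with the coordinates of its origin.
    static final class Token {
        final String text;
        final float x;
        final float y;

        Token(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Joins token texts with spaces, either in stream order or sorted
    // by y first (assumed top to bottom) and then by x (left to right).
    static String extract(List<Token> streamOrder, boolean sortByPosition) {
        List<Token> tokens = new ArrayList<>(streamOrder);
        if (sortByPosition) {
            tokens.sort((a, b) -> a.y != b.y
                    ? Float.compare(a.y, b.y)
                    : Float.compare(a.x, b.x));
        }
        StringBuilder sb = new StringBuilder();
        for (Token t : tokens) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(t.text);
        }
        return sb.toString();
    }
}
```

With the flag off, the sketch returns tokens in stream order, which is the order the document's author (or generating tool) wrote them.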
public boolean getUseNonSequentialParser()
See also: setUseNonSequentialParser(boolean)
public void setUseNonSequentialParser(boolean useNonSequentialParser)
Parameters: useNonSequentialParser
public Float getAverageCharTolerance()
See also: setAverageCharTolerance(Float)
public void setAverageCharTolerance(Float averageCharTolerance)
See PDFTextStripper.setAverageCharTolerance(float)
public Float getSpacingTolerance()
See also: setSpacingTolerance(Float)
public void setSpacingTolerance(Float spacingTolerance)
See PDFTextStripper.setSpacingTolerance(float)
public AccessChecker getAccessChecker()
public void setAccessChecker(AccessChecker accessChecker)
Copyright © 2007–2015 The Apache Software Foundation. All rights reserved.