public class PDFParserConfig extends Object implements Serializable
Constructor and Description |
---|
PDFParserConfig() |
PDFParserConfig(InputStream is)
Loads properties from InputStream and then tries to close InputStream.
|
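The InputStream constructor above is described as loading properties and then trying to close the stream. The following is a minimal sketch of that load-then-close pattern using only `java.util.Properties`. It is not Tika's implementation: the class `LoadAndCloseSketch`, its helper methods, and the property keys `sortByPosition` and `enableAutoSpace` are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical stand-in; not part of Tika.
final class LoadAndCloseSketch {

    /** Loads key=value pairs from the stream, then tries to close it. */
    static Properties loadAndClose(InputStream is) {
        Properties props = new Properties();
        if (is == null) {
            return props; // nothing to load; callers fall back to defaults
        }
        try {
            props.load(is);
        } catch (IOException e) {
            // Assumption: a read failure leaves the defaults in place.
        } finally {
            try {
                is.close();
            } catch (IOException ignored) {
                // "Tries to close": a failed close is not fatal in this sketch.
            }
        }
        return props;
    }

    /** Reads a boolean setting, using the default when the key is absent. */
    static boolean getBoolean(Properties props, String key, boolean defaultValue) {
        String value = props.getProperty(key);
        return value == null ? defaultValue : Boolean.parseBoolean(value.trim());
    }

    public static void main(String[] args) {
        // Properties.load(InputStream) reads ISO-8859-1 encoded text.
        byte[] data = "sortByPosition=true\n".getBytes(StandardCharsets.ISO_8859_1);
        Properties props = loadAndClose(new ByteArrayInputStream(data));
        System.out.println(getBoolean(props, "sortByPosition", false));
        System.out.println(getBoolean(props, "enableAutoSpace", true));
    }
}
```

Keys that are missing from the stream keep their defaults, which is why the second lookup in `main` falls back to `true`.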
Modifier and Type | Method and Description |
---|---|
void |
configure(org.apache.tika.parser.pdf.PDF2XHTML pdf2XHTML)
Configures the given pdf2XHTML.
|
boolean |
equals(Object obj) |
Float |
getAverageCharTolerance() |
boolean |
getEnableAutoSpace() |
boolean |
getExtractAcroFormContent() |
boolean |
getExtractAnnotationText() |
boolean |
getExtractInlineImages() |
boolean |
getExtractUniqueInlineImagesOnly() |
boolean |
getSortByPosition() |
Float |
getSpacingTolerance() |
boolean |
getSuppressDuplicateOverlappingText() |
boolean |
getUseNonSequentialParser() |
int |
hashCode() |
void |
setAverageCharTolerance(Float averageCharTolerance)
See PDFTextStripper.setAverageCharTolerance(float) |
void |
setEnableAutoSpace(boolean enableAutoSpace)
If true (the default), the parser should estimate
where spaces should be inserted between words.
|
void |
setExtractAcroFormContent(boolean extractAcroFormContent)
If true (the default), extract content from AcroForms
at the end of the document.
|
void |
setExtractAnnotationText(boolean extractAnnotationText)
If true (the default), text in annotations will be
extracted.
|
void |
setExtractInlineImages(boolean extractInlineImages)
If true, extract inline embedded OBXImages.
|
void |
setExtractUniqueInlineImagesOnly(boolean extractUniqueInlineImagesOnly)
Multiple pages within a PDF file might refer to the same underlying image.
|
void |
setSortByPosition(boolean sortByPosition)
If true, sort text tokens by their x/y position
before extracting text.
|
void |
setSpacingTolerance(Float spacingTolerance)
See PDFTextStripper.setSpacingTolerance(float) |
void |
setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText)
If true, the parser should try to remove duplicated
text over the same region.
|
void |
setUseNonSequentialParser(boolean useNonSequentialParser)
If true, uses PDFBox's non-sequential parser.
|
String |
toString() |
public PDFParserConfig()
public PDFParserConfig(InputStream is)
public void configure(org.apache.tika.parser.pdf.PDF2XHTML pdf2XHTML)
public void setExtractAcroFormContent(boolean extractAcroFormContent)
public boolean getExtractAcroFormContent()
See also: setExtractAcroFormContent(boolean)
public void setExtractInlineImages(boolean extractInlineImages)
Set to true with caution. The default is false.
See also: setExtractUniqueInlineImagesOnly(boolean)
public boolean getExtractInlineImages()
See also: setExtractInlineImages(boolean)
public void setExtractUniqueInlineImagesOnly(boolean extractUniqueInlineImagesOnly)
Multiple pages within a PDF file might refer to the same underlying image. If extractUniqueInlineImagesOnly is set to false, the parser will call the EmbeddedExtractor each time the image appears on a page. This might be desired for some use cases. However, to avoid duplication of extracted images, set this to true. The default is true.
Note that uniqueness is determined only by the underlying PDF COSObject id, not by file hash or a similar equality metric. If the PDF actually contains multiple copies of the same image, all with different object ids, then all images will be extracted.
For this parameter to have any effect, extractInlineImages must be set to true.
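The uniqueness rule described above (keyed on the object id, not on the image content) can be sketched as follows. This is an illustration only, not Tika's code: `UniqueInlineImageSketch`, `ImageRef`, and `imagesToExtract` are hypothetical names, and the `contentHash` field exists only to show that it is ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in; not part of Tika or PDFBox.
final class UniqueInlineImageSketch {

    /** Stand-in for an image reference found on a page. */
    static final class ImageRef {
        final long objectId;      // plays the role of the PDF object id
        final String contentHash; // deliberately unused for de-duplication

        ImageRef(long objectId, String contentHash) {
            this.objectId = objectId;
            this.contentHash = contentHash;
        }
    }

    /** Returns the references that would be handed on for extraction. */
    static List<ImageRef> imagesToExtract(List<List<ImageRef>> pages, boolean uniqueOnly) {
        List<ImageRef> out = new ArrayList<>();
        Set<Long> seenIds = new HashSet<>();
        for (List<ImageRef> page : pages) {
            for (ImageRef ref : page) {
                // Set.add returns false when the id was already seen.
                if (uniqueOnly && !seenIds.add(ref.objectId)) {
                    continue;
                }
                out.add(ref);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        ImageRef logo = new ImageRef(1L, "abc");
        ImageRef logoCopy = new ImageRef(2L, "abc"); // same bytes, different id
        List<List<ImageRef>> pages = Arrays.asList(
                Arrays.asList(logo),
                Arrays.asList(logo, logoCopy));
        System.out.println(imagesToExtract(pages, true).size());
        System.out.println(imagesToExtract(pages, false).size());
    }
}
```

In this sketch, the copy with a different id is still extracted when `uniqueOnly` is true, which mirrors the note that identical images stored under different object ids are not collapsed.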
public boolean getExtractUniqueInlineImagesOnly()
public boolean getEnableAutoSpace()
See also: setEnableAutoSpace(boolean)
public void setEnableAutoSpace(boolean enableAutoSpace)
public boolean getSuppressDuplicateOverlappingText()
public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText)
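The method summary describes this setting as trying to remove duplicated text over the same region. One way to picture that idea is the sketch below, which drops a glyph when an identical glyph has already been kept at nearly the same position. This is an assumption-laden illustration, not the algorithm PDFBox or Tika uses: `OverlapSuppressionSketch`, `Glyph`, `suppress`, and the `tolerance` parameter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in; not part of Tika or PDFBox.
final class OverlapSuppressionSketch {

    /** Stand-in for a positioned piece of text. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Keeps the first glyph of each group of same-text glyphs within the tolerance. */
    static List<Glyph> suppress(List<Glyph> glyphs, float tolerance) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            for (Glyph k : kept) {
                if (k.text.equals(g.text)
                        && Math.abs(k.x - g.x) <= tolerance
                        && Math.abs(k.y - g.y) <= tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(g);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("A", 10.0f, 10.0f),
                new Glyph("A", 10.2f, 10.1f),  // drawn again, slightly offset
                new Glyph("A", 40.0f, 10.0f)); // a genuinely separate "A"
        System.out.println(suppress(glyphs, 0.5f).size());
    }
}
```

The quadratic scan keeps the sketch short; a real implementation would likely index glyphs by position.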
public boolean getExtractAnnotationText()
See also: setExtractAnnotationText(boolean)
public void setExtractAnnotationText(boolean extractAnnotationText)
public boolean getSortByPosition()
See also: setSortByPosition(boolean)
public void setSortByPosition(boolean sortByPosition)
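The method summary describes this setting as sorting text tokens by their x/y position before text is extracted. The sketch below shows the general idea with a simple comparator. It assumes that y grows downward and that a plain "y first, then x" ordering is good enough, which is a simplification and not necessarily how PDFBox orders text; `SortByPositionSketch`, `Token`, and `join` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in; not part of Tika or PDFBox.
final class SortByPositionSketch {

    /** Stand-in for a positioned text token. */
    static final class Token {
        final String text;
        final float x;
        final float y; // assumption: measured from the top of the page

        Token(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins token text, optionally ordering tokens top-to-bottom, then left-to-right. */
    static String join(List<Token> tokens, boolean sortByPosition) {
        List<Token> ordered = new ArrayList<>(tokens);
        if (sortByPosition) {
            ordered.sort(Comparator
                    .comparingDouble((Token t) -> t.y)
                    .thenComparingDouble(t -> t.x));
        }
        StringBuilder sb = new StringBuilder();
        for (Token t : ordered) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(t.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Tokens listed in content-stream order, which need not match reading order.
        List<Token> tokens = Arrays.asList(
                new Token("world", 50f, 10f),
                new Token("Next", 10f, 30f),
                new Token("Hello", 10f, 10f));
        System.out.println(join(tokens, false));
        System.out.println(join(tokens, true));
    }
}
```

Without sorting, the tokens come out in the order they were stored; with sorting, they follow the assumed reading order.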
public boolean getUseNonSequentialParser()
See also: setUseNonSequentialParser(boolean)
public void setUseNonSequentialParser(boolean useNonSequentialParser)
The default is false (use the traditional parser).
public Float getAverageCharTolerance()
See also: setAverageCharTolerance(Float)
public void setAverageCharTolerance(Float averageCharTolerance)
See also: PDFTextStripper.setAverageCharTolerance(float)
public Float getSpacingTolerance()
See also: setSpacingTolerance(Float)
public void setSpacingTolerance(Float spacingTolerance)
See also: PDFTextStripper.setSpacingTolerance(float)
Copyright © 2007-2015 The Apache Software Foundation. All Rights Reserved.