PDFParserConfig (Apache Tika 1.13 API)

java.lang.Object
- org.apache.tika.parser.pdf.PDFParserConfig

All Implemented Interfaces:

Serializable
```
public class PDFParserConfig
extends Object
implements Serializable
```
Config for PDFParser.
This allows parameters to be set programmatically:
1. Calls to PDFParser, e.g. parser.getPDFParserConfig().setEnableAutoSpace() (as before)
2. Constructor of PDFParser
3. Passing to PDFParser through a ParseContext: context.set(PDFParserConfig.class, config);
Parameters can also be set by modifying the PDFParserConfig.properties file. In trunk it lives in tika-parsers/src/main/resources/org/apache/tika/parser/pdf
In tika-app-x.x.jar or tika-parsers-x.x.jar it lives in org/apache/tika/parser/pdf
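As a minimal sketch of option 3 above, the following builds a config with the setters documented on this page and registers it on a ParseContext. It assumes tika-core and tika-parsers 1.13 are on the classpath; the class name PdfConfigUsage and its helper methods are hypothetical and exist only for illustration.

```java
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;

final class PdfConfigUsage {

    // Builds a config with two non-default settings (both default to false).
    static PDFParserConfig buildConfig() {
        PDFParserConfig config = new PDFParserConfig();
        config.setSortByPosition(true);
        config.setExtractInlineImages(true);
        return config;
    }

    // Registers the config on a ParseContext, keyed by its class.
    static ParseContext inContext(PDFParserConfig config) {
        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, config);
        return context;
    }

    public static void main(String[] args) {
        ParseContext context = inContext(buildConfig());
        // The context is what would be passed along with a parse call.
        System.out.println(context.get(PDFParserConfig.class));
    }
}
```

A config passed this way applies to a single parse call, which can be convenient when different documents need different settings.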
See Also:

Serialized Form

Constructor Summary

Constructors
Constructor and Description
`PDFParserConfig()`
`PDFParserConfig(InputStream is)` Loads properties from InputStream and then tries to close InputStream.

Method Summary

Modifier and Type	Method and Description
`void`	`configure(org.apache.tika.parser.pdf.PDF2XHTML pdf2XHTML)` Configures the given pdf2XHTML.
`boolean`	`equals(Object obj)`
`AccessChecker`	`getAccessChecker()`
`Float`	`getAverageCharTolerance()`
`boolean`	`getEnableAutoSpace()`
`boolean`	`getExtractAcroFormContent()`
`boolean`	`getExtractAnnotationText()`
`boolean`	`getExtractInlineImages()`
`boolean`	`getExtractUniqueInlineImagesOnly()`
`boolean`	`getIfXFAExtractOnlyXFA()`
`boolean`	`getSortByPosition()`
`Float`	`getSpacingTolerance()`
`boolean`	`getSuppressDuplicateOverlappingText()`
`int`	`hashCode()`
`boolean`	`isCatchIntermediateIOExceptions()` See `setCatchIntermediateIOExceptions(boolean)`
`void`	`setAccessChecker(AccessChecker accessChecker)`
`void`	`setAverageCharTolerance(Float averageCharTolerance)` See `PDFTextStripper.setAverageCharTolerance(float)`
`void`	`setCatchIntermediateIOExceptions(boolean catchIntermediateIOExceptions)` The PDFBox parser will throw an IOException if there is a problem with a stream.
`void`	`setEnableAutoSpace(boolean enableAutoSpace)` If true (the default), the parser should estimate where spaces should be inserted between words.
`void`	`setExtractAcroFormContent(boolean extractAcroFormContent)` If true (the default), extract content from AcroForms at the end of the document.
`void`	`setExtractAnnotationText(boolean extractAnnotationText)` If true (the default), text in annotations will be extracted.
`void`	`setExtractInlineImages(boolean extractInlineImages)` If true, extract inline embedded XObject images.
`void`	`setExtractUniqueInlineImagesOnly(boolean extractUniqueInlineImagesOnly)` Multiple pages within a PDF file might refer to the same underlying image.
`void`	`setIfXFAExtractOnlyXFA(boolean ifXFAExtractOnlyXFA)` If false (the default), extract content from the full PDF as well as the XFA form.
`void`	`setSortByPosition(boolean sortByPosition)` If true, sort text tokens by their x/y position before extracting text.
`void`	`setSpacingTolerance(Float spacingTolerance)` See `PDFTextStripper.setSpacingTolerance(float)`
`void`	`setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText)` If true, the parser should try to remove duplicated text over the same region.
`String`	`toString()`

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

- Constructor Detail
  - PDFParserConfig
```
public PDFParserConfig()
```
  - PDFParserConfig
```
public PDFParserConfig(InputStream is)
```
    Loads properties from InputStream and then tries to close InputStream. If there is an IOException, the exception is silently swallowed and the default settings are used.
    
    Parameters:
    
    is -
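    The load-then-close behavior described above can be pictured with a small standard-library sketch. This is not Tika's implementation: the class LenientPropertiesLoader, its methods, and the key names sortByPosition and enableAutoSpace are assumptions chosen to echo the setter names on this page.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class LenientPropertiesLoader {

    // Reads key/value pairs from the stream. Returns an empty Properties
    // object (meaning "use the defaults") if the stream is null or unreadable.
    static Properties load(InputStream is) {
        Properties props = new Properties();
        if (is == null) {
            return props;
        }
        try {
            props.load(is);
        } catch (IOException e) {
            // Deliberately ignored: fall back to the defaults.
            props.clear();
        } finally {
            try {
                is.close();
            } catch (IOException e) {
                // "Tries to close": a failure here is ignored as well.
            }
        }
        return props;
    }

    // Looks up a boolean setting, falling back to the supplied default.
    static boolean getBoolean(Properties props, String key, boolean defaultValue) {
        String value = props.getProperty(key);
        return value == null ? defaultValue : Boolean.parseBoolean(value.trim());
    }

    public static void main(String[] args) {
        InputStream is = new ByteArrayInputStream(
                "sortByPosition=true\n".getBytes(StandardCharsets.ISO_8859_1));
        Properties props = load(is);
        System.out.println(getBoolean(props, "sortByPosition", false)); // true
        System.out.println(getBoolean(props, "enableAutoSpace", true)); // true (default)
    }
}
```

    Because a failed read is swallowed, a caller cannot tell a missing or unreadable stream from an empty one; both yield the defaults.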
- Method Detail
  - configure
```
public void configure(org.apache.tika.parser.pdf.PDF2XHTML pdf2XHTML)
```
    Configures the given pdf2XHTML.
    
    Parameters:
    
    pdf2XHTML -
  - getExtractAcroFormContent
```
public boolean getExtractAcroFormContent()
```
    See Also:
    
    setExtractAcroFormContent(boolean)
  - setExtractAcroFormContent
```
public void setExtractAcroFormContent(boolean extractAcroFormContent)
```
    If true (the default), extract content from AcroForms at the end of the document. If an XFA form is found, try to process that; otherwise, process the AcroForm.
    
    Parameters:
    
    extractAcroFormContent -
  - getIfXFAExtractOnlyXFA
```
public boolean getIfXFAExtractOnlyXFA()
```
    Returns:
    
    how to handle XFA data if it exists
    
    See Also:
    
    setIfXFAExtractOnlyXFA(boolean)
  - setIfXFAExtractOnlyXFA
```
public void setIfXFAExtractOnlyXFA(boolean ifXFAExtractOnlyXFA)
```
    If false (the default), extract content from the full PDF as well as the XFA form. This will likely lead to some duplicative content.
    
    Parameters:
    
    ifXFAExtractOnlyXFA -
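    One way to read the interaction between extractAcroFormContent and ifXFAExtractOnlyXFA is as a small decision table. The sketch below encodes that reading only; it is an interpretation of the descriptions on this page, not the parser's code. The Step enum and the plan method are hypothetical, and it assumes the document contains a form.

```java
import java.util.ArrayList;
import java.util.List;

final class FormExtractionPlanSketch {

    enum Step { PAGE_TEXT, XFA_FORM, ACRO_FORM }

    // Returns the extraction steps implied by the two flags, in order.
    static List<Step> plan(boolean hasXfa,
                           boolean extractAcroFormContent,
                           boolean ifXFAExtractOnlyXFA) {
        List<Step> steps = new ArrayList<>();
        if (hasXfa && ifXFAExtractOnlyXFA) {
            // Only the XFA form is read; the page text is skipped.
            steps.add(Step.XFA_FORM);
            return steps;
        }
        steps.add(Step.PAGE_TEXT);
        if (extractAcroFormContent) {
            // Form content comes at the end: XFA if present, else the AcroForm.
            steps.add(hasXfa ? Step.XFA_FORM : Step.ACRO_FORM);
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(plan(true, true, false)); // [PAGE_TEXT, XFA_FORM]
        System.out.println(plan(true, true, true));  // [XFA_FORM]
    }
}
```

    Under this reading, the default settings visit both the page text and the XFA form, which is where the duplicated content mentioned above would come from.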
  - getExtractInlineImages
```
public boolean getExtractInlineImages()
```
    See Also:
    
    setExtractInlineImages(boolean)
  - setExtractInlineImages
```
public void setExtractInlineImages(boolean extractInlineImages)
```
    If true, extract inline embedded XObject images. Beware: some PDF documents of modest size (~4 MB) can contain thousands of embedded images totaling more than 2.5 GB. Also, at least as of PDFBox 1.8.5, there can be surprisingly large memory consumption and/or out-of-memory errors. Set to true with caution.
    The default is false.
    See also: setExtractUniqueInlineImagesOnly(boolean)
    
    Parameters:
    
    extractInlineImages -
  - getExtractUniqueInlineImagesOnly
```
public boolean getExtractUniqueInlineImagesOnly()
```
    See Also:
    
    setExtractUniqueInlineImagesOnly(boolean)
  - setExtractUniqueInlineImagesOnly
```
public void setExtractUniqueInlineImagesOnly(boolean extractUniqueInlineImagesOnly)
```
    Multiple pages within a PDF file might refer to the same underlying image. If extractUniqueInlineImagesOnly is set to false, the parser will call the EmbeddedExtractor each time the image appears on a page. This might be desired for some use cases. However, to avoid duplication of extracted images, set this to true. The default is true.
    Note that uniqueness is determined only by the underlying PDF COSObject id, not by file hash or similar equality metric. If the PDF actually contains multiple copies of the same image -- all with different object ids -- then all images will be extracted.
    For this parameter to have any effect, extractInlineImages must be set to true.
    Because of TIKA-1742 -- to avoid infinite recursion -- no matter the setting of this parameter, the extractor will only pull out one copy of each image per page. This parameter tries to capture uniqueness across the entire document.
    
    Parameters:
    
    extractUniqueInlineImagesOnly -
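    The two levels of de-duplication described above (once per page always, once per document only when the flag is true) can be sketched with plain collections. The class UniqueImageSketch and the string ids are hypothetical stand-ins for PDF object ids; this is an illustration, not the extractor's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class UniqueImageSketch {

    // pages: each inner list holds the object ids of the images drawn on that page.
    // Returns the ids that would be handed to an extractor, in order.
    static List<String> idsToExtract(List<List<String>> pages, boolean uniqueOnly) {
        Set<String> seenInDocument = new HashSet<>();
        List<String> extracted = new ArrayList<>();
        for (List<String> page : pages) {
            // At most one copy per page, whatever the setting.
            Set<String> seenOnPage = new HashSet<>();
            for (String id : page) {
                if (!seenOnPage.add(id)) {
                    continue;
                }
                // Document-wide uniqueness applies only when requested.
                if (uniqueOnly && !seenInDocument.add(id)) {
                    continue;
                }
                extracted.add(id);
            }
        }
        return extracted;
    }

    public static void main(String[] args) {
        List<List<String>> pages = Arrays.asList(
                Arrays.asList("12 0", "12 0", "15 0"),
                Arrays.asList("12 0"));
        System.out.println(idsToExtract(pages, true));  // [12 0, 15 0]
        System.out.println(idsToExtract(pages, false)); // [12 0, 15 0, 12 0]
    }
}
```

    Since the comparison is on ids alone, two byte-identical images stored under different ids would both be extracted in this sketch, matching the caveat above.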
  - getEnableAutoSpace
```
public boolean getEnableAutoSpace()
```
    See Also:
    
    setEnableAutoSpace(boolean)
  - setEnableAutoSpace
```
public void setEnableAutoSpace(boolean enableAutoSpace)
```
    If true (the default), the parser should estimate where spaces should be inserted between words. For many PDFs this is necessary as they do not include explicit whitespace characters.
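    To illustrate what estimating spaces can mean, here is a deliberately simplified sketch that inserts a space whenever the horizontal gap between two glyphs exceeds a fraction of the previous glyph's width. The Glyph class, the join method, and the gapFactor parameter are all assumptions; the actual heuristic used by the underlying text stripper is more involved.

```java
import java.util.Arrays;
import java.util.List;

final class AutoSpaceSketch {

    // A positioned character: left edge x and advance width, in page units.
    static final class Glyph {
        final char c;
        final float x;
        final float width;

        Glyph(char c, float x, float width) {
            this.c = c;
            this.x = x;
            this.width = width;
        }
    }

    // Joins glyphs on one line, adding a space where the gap is "large".
    static String join(List<Glyph> glyphs, float gapFactor) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                if (gap > gapFactor * prev.width) {
                    sb.append(' ');
                }
            }
            sb.append(g.c);
            prev = g;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = Arrays.asList(
                new Glyph('h', 0f, 5f), new Glyph('i', 5f, 5f),
                new Glyph('t', 14f, 5f), new Glyph('o', 19f, 5f));
        System.out.println(join(line, 0.5f)); // hi to
    }
}
```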
  - getSuppressDuplicateOverlappingText
```
public boolean getSuppressDuplicateOverlappingText()
```
    See Also:
    
    setSuppressDuplicateOverlappingText(boolean)
  - setSuppressDuplicateOverlappingText
```
public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText)
```
    If true, the parser should try to remove duplicated text over the same region. This is needed for some PDFs that achieve bolding by re-writing the same text in the same area. Note that this can slow down extraction substantially (PDFBOX-956) and sometimes remove characters that were not in fact duplicated (PDFBOX-1155). By default this is disabled.
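    The idea behind overlap suppression, and its failure mode, can be shown with a naive sketch: a glyph is dropped if the same character was already kept within a small distance. The Glyph class, the withoutOverlaps method, and the tolerance parameter are hypothetical; this is not how PDFBox implements the check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OverlapSuppressionSketch {

    static final class Glyph {
        final char c;
        final float x;
        final float y;

        Glyph(char c, float x, float y) {
            this.c = c;
            this.x = x;
            this.y = y;
        }
    }

    // Keeps a glyph only if no identical character was kept within `tolerance`.
    static String withoutOverlaps(List<Glyph> glyphs, float tolerance) {
        List<Glyph> kept = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            // Naive scan: every new glyph is compared against all kept glyphs.
            for (Glyph k : kept) {
                if (k.c == g.c
                        && Math.abs(k.x - g.x) <= tolerance
                        && Math.abs(k.y - g.y) <= tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(g);
                sb.append(g.c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Boo" drawn with a fake-bold 'B' (written twice, slightly offset).
        List<Glyph> glyphs = Arrays.asList(
                new Glyph('B', 0f, 0f), new Glyph('B', 0.3f, 0f),
                new Glyph('o', 6f, 0f), new Glyph('o', 12f, 0f));
        System.out.println(withoutOverlaps(glyphs, 0.5f)); // Boo
        System.out.println(withoutOverlaps(glyphs, 8f));   // Bo (a real 'o' is lost)
    }
}
```

    The second call shows how an overly generous tolerance removes a character that was never duplicated, which is the kind of error the text above warns about.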
  - getExtractAnnotationText
```
public boolean getExtractAnnotationText()
```
    See Also:
    
    setExtractAnnotationText(boolean)
  - setExtractAnnotationText
```
public void setExtractAnnotationText(boolean extractAnnotationText)
```
    If true (the default), text in annotations will be extracted.
  - getSortByPosition
```
public boolean getSortByPosition()
```
    See Also:
    
    setSortByPosition(boolean)
  - setSortByPosition
```
public void setSortByPosition(boolean sortByPosition)
```
    If true, sort text tokens by their x/y position before extracting text. This may be necessary for some PDFs (if the text tokens are not rendered "in order"), while for other PDFs it can produce the wrong result (for example if there are 2 columns, the text will be interleaved). Default is false.
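    The two-column problem mentioned above is easy to reproduce in a toy model. The sketch below sorts hypothetical tokens top to bottom and then left to right, assuming y grows downward; the Token class and readInPositionOrder are illustrative names, not part of Tika or PDFBox.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class SortByPositionSketch {

    static final class Token {
        final String text;
        final float x;
        final float y;

        Token(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Orders tokens by y, then by x, and joins their text with spaces.
    static String readInPositionOrder(List<Token> tokens) {
        List<Token> sorted = new ArrayList<>(tokens);
        sorted.sort(Comparator.comparingDouble((Token t) -> t.y)
                .thenComparingDouble(t -> t.x));
        StringBuilder sb = new StringBuilder();
        for (Token t : sorted) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(t.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two columns, stored in reading order: A1 A2 (left), then B1 B2 (right).
        List<Token> tokens = Arrays.asList(
                new Token("A1", 0f, 0f), new Token("A2", 0f, 10f),
                new Token("B1", 100f, 0f), new Token("B2", 100f, 10f));
        System.out.println(readInPositionOrder(tokens)); // A1 B1 A2 B2
    }
}
```

    The stored order was already correct here, so sorting by position interleaves the columns; for a document whose tokens are stored out of order, the same sort would repair the text instead.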
  - getAverageCharTolerance
```
public Float getAverageCharTolerance()
```
    See Also:
    
    setAverageCharTolerance(Float)
  - setAverageCharTolerance
```
public void setAverageCharTolerance(Float averageCharTolerance)
```
    See PDFTextStripper.setAverageCharTolerance(float)
  - getSpacingTolerance
```
public Float getSpacingTolerance()
```
    See Also:
    
    setSpacingTolerance(Float)
  - setSpacingTolerance
```
public void setSpacingTolerance(Float spacingTolerance)
```
    See PDFTextStripper.setSpacingTolerance(float)
  - getAccessChecker
```
public AccessChecker getAccessChecker()
```
  - setAccessChecker
```
public void setAccessChecker(AccessChecker accessChecker)
```
  - isCatchIntermediateIOExceptions
```
public boolean isCatchIntermediateIOExceptions()
```
    See setCatchIntermediateIOExceptions(boolean)
    
    Returns:
    
    whether or not to catch IOExceptions
  - setCatchIntermediateIOExceptions
```
public void setCatchIntermediateIOExceptions(boolean catchIntermediateIOExceptions)
```
    The PDFBox parser will throw an IOException if there is a problem with a stream. If this is set to true, Tika's PDFParser will catch these exceptions and try to parse the rest of the document. After the parse is completed, Tika's PDFParser will throw the first caught exception.
    
    Parameters:
    
    catchIntermediateIOExceptions -
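    The catch-continue-rethrow pattern described above can be sketched without any PDF machinery. The PageTask interface and processAll method are hypothetical; they stand in for whatever units of work the parser performs and are not Tika's API.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class DeferredIOExceptionSketch {

    // A hypothetical unit of work that may fail, e.g. processing one page.
    interface PageTask {
        String run() throws IOException;
    }

    // Runs every task, appending results to `out`. With catchIntermediate set,
    // failures are remembered and the first one is rethrown after the loop.
    static void processAll(List<PageTask> tasks, boolean catchIntermediate, List<String> out)
            throws IOException {
        IOException first = null;
        for (PageTask task : tasks) {
            try {
                out.add(task.run());
            } catch (IOException e) {
                if (!catchIntermediate) {
                    throw e; // fail fast: later tasks never run
                }
                if (first == null) {
                    first = e;
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }

    public static void main(String[] args) {
        List<PageTask> tasks = new ArrayList<>();
        tasks.add(() -> "page 1");
        tasks.add(() -> { throw new IOException("bad stream"); });
        tasks.add(() -> "page 3");

        List<String> out = new ArrayList<>();
        try {
            processAll(tasks, true, out);
        } catch (IOException e) {
            System.out.println("deferred failure: " + e.getMessage());
        }
        System.out.println(out); // [page 1, page 3]
    }
}
```

    The caller still sees an exception, so code that wants the partial content has to read it from the handler (here, the out list) inside the catch block.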
  - hashCode
```
public int hashCode()
```
    Overrides:
    
    hashCode in class Object
  - equals
```
public boolean equals(Object obj)
```
    Overrides:
    
    equals in class Object
  - toString
```
public String toString()
```
    Overrides:
    
    toString in class Object
