public class OfficeParserConfig extends Object implements Serializable
Constructor and Description |
---|
OfficeParserConfig() |
Modifier and Type | Method and Description |
---|---|
String | getDateFormatOverride() |
boolean | isConcatenatePhoneticRuns() |
boolean | isExtractAllAlternativesFromMSG() |
boolean | isExtractMacros() |
boolean | isIncludeDeletedContent() |
boolean | isIncludeHeadersAndFooters() |
boolean | isIncludeMissingRows() |
boolean | isIncludeMoveFromContent() |
boolean | isIncludeShapeBasedContent() |
boolean | isIncludeSlideMasterContent() |
boolean | isIncludeSlideNotes() |
boolean | isUseSAXDocxExtractor() |
boolean | isUseSAXPptxExtractor() |
void | setConcatenatePhoneticRuns(boolean concatenatePhoneticRuns) Microsoft Excel files can sometimes contain phonetic (furigana) strings. |
void | setDateOverrideFormat(String format) A user may wish to override the date formats in xls and xlsx files. |
void | setExtractAllAlternativesFromMSG(boolean extractAllAlternativesFromMSG) Some .msg files can contain body content in html, rtf and/or text. |
void | setExtractMacros(boolean extractMacros) Sets whether or not MSOffice parsers should extract macros. |
void | setIncludeDeletedContent(boolean includeDeletedContent) Sets whether or not the parser should include deleted content. |
void | setIncludeHeadersAndFooters(boolean includeHeadersAndFooters) Whether or not to include headers and footers. |
void | setIncludeMissingRows(boolean includeMissingRows) For table-like formats, and tables within other formats, should missing rows in sparse tables be output where detected? The default is to only output rows defined within the file, which avoids lots of blank lines but means layout isn't preserved. |
void | setIncludeMoveFromContent(boolean includeMoveFromContent) With track changes on, when a section is moved, the content is stored in both the "moveFrom" section and the "moveTo" section. |
void | setIncludeShapeBasedContent(boolean includeShapeBasedContent) In Excel and Word, there can be text stored within drawing shapes. |
void | setIncludeSlideMasterContent(boolean includeSlideMasterContent) Whether or not to include contents from any of the three types of masters -- slide, notes, handout -- in a .ppt or ppt[xm] file. |
void | setIncludeSlideNotes(boolean includeSlideNotes) Whether or not to process slide notes content. |
void | setUseSAXDocxExtractor(boolean useSAXDocxExtractor) Use the experimental SAX-based streaming DOCX parser? If set to false, the classic parser will be used; if true, the new experimental parser will be used. |
void | setUseSAXPptxExtractor(boolean useSAXPptxExtractor) Use the experimental SAX-based streaming PPTX parser? If set to false, the classic parser will be used; if true, the new experimental parser will be used. |
public boolean isExtractMacros()
public void setExtractMacros(boolean extractMacros)
Sets whether or not MSOffice parsers should extract macros. Default: false.
Parameters: extractMacros
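public boolean isIncludeDeletedContent()
public void setIncludeDeletedContent(boolean includeDeletedContent)
Sets whether or not the parser should include deleted content. This has only been implemented in the streaming DOCX parser (SXWPFWordExtractorDecorator) so far.
Parameters: includeDeletedContent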
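public boolean isIncludeMoveFromContent()
public void setIncludeMoveFromContent(boolean includeMoveFromContent)
With track changes on, when a section is moved, the content is stored in both the "moveFrom" section and the "moveTo" section. To include the content at its original (moveFrom) location as well as at its new (moveTo) location, set this to true.
Default: false
This has only been implemented in the streaming DOCX parser (SXWPFWordExtractorDecorator) so far.
Parameters: includeMoveFromContent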
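The effect of this flag can be pictured with a toy model that uses only the JDK. This is not Tika's implementation: the class `MoveFromDemo`, the `Run` type, and the sample text are all hypothetical, and the sketch only assumes what the description above states, namely that moved text exists once in a moveFrom section and once in a moveTo section.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MoveFromDemo {
    enum Kind { NORMAL, MOVE_FROM, MOVE_TO }

    // Hypothetical stand-in for a run of text in a tracked-changes document.
    static final class Run {
        final Kind kind;
        final String text;
        Run(Kind kind, String text) { this.kind = kind; this.text = text; }
    }

    // Collects run text; moveFrom runs are kept only when the flag is set.
    static List<String> extract(List<Run> runs, boolean includeMoveFromContent) {
        List<String> out = new ArrayList<>();
        for (Run r : runs) {
            if (r.kind == Kind.MOVE_FROM && !includeMoveFromContent) {
                continue; // skip the copy left at the original location
            }
            out.add(r.text);
        }
        return out;
    }

    // A paragraph "Moved." that was dragged from the top to the bottom.
    static List<Run> sample() {
        return Arrays.asList(
                new Run(Kind.MOVE_FROM, "Moved."),
                new Run(Kind.NORMAL, "Intro."),
                new Run(Kind.NORMAL, "Body."),
                new Run(Kind.MOVE_TO, "Moved."));
    }

    public static void main(String[] args) {
        System.out.println(extract(sample(), false)); // [Intro., Body., Moved.]
        System.out.println(extract(sample(), true));  // [Moved., Intro., Body., Moved.]
    }
}
```

With the flag off the moved paragraph appears once, at its new location; with it on, the text is duplicated.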
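public boolean isIncludeShapeBasedContent()
public void setIncludeShapeBasedContent(boolean includeShapeBasedContent)
In Excel and Word, there can be text stored within drawing shapes. To skip processing these shapes for text, set this to false.
Default: true
Parameters: includeShapeBasedContent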
public boolean isIncludeHeadersAndFooters()
public void setIncludeHeadersAndFooters(boolean includeHeadersAndFooters)
Whether or not to include headers and footers. Default: true.
Parameters: includeHeadersAndFooters
public boolean isUseSAXDocxExtractor()
public void setUseSAXDocxExtractor(boolean useSAXDocxExtractor)
Use the experimental SAX-based streaming DOCX parser? If set to false, the classic parser will be used; if true, the new experimental parser will be used.
Default: false (classic DOM parser)
Parameters: useSAXDocxExtractor
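The difference between a DOM parser and a SAX-based streaming parser can be illustrated with the JDK's own XML APIs. This is only a sketch over a made-up XML fragment: it does not call Tika or reproduce either of its DOCX extractors, and the class name `DomVsSaxDemo`, the element names, and the sample input are assumptions. The DOM version builds the whole tree before reading it, while the SAX version collects text from callbacks as the input streams past.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class DomVsSaxDemo {
    // Made-up sample; real DOCX XML is namespaced and far richer.
    static final String XML = "<doc><p><t>Hello</t></p><p><t>World</t></p></doc>";

    static InputStream stream(String xml) {
        return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
    }

    // DOM: the full document tree is built in memory, then queried.
    static String extractWithDom(String xml) {
        try {
            NodeList nodes = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(stream(xml)).getElementsByTagName("t");
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                sb.append(nodes.item(i).getTextContent()).append('\n');
            }
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // SAX: text is gathered from events; no tree is ever materialized.
    static String extractWithSax(String xml) {
        StringBuilder sb = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inText;
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                inText = "t".equals(qName);
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    sb.append(ch, start, length);
                }
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                if ("t".equals(qName)) {
                    sb.append('\n');
                    inText = false;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(stream(xml), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Both approaches yield the same text for this input.
        System.out.println(extractWithDom(XML).equals(extractWithSax(XML))); // true
    }
}
```

The streaming style typically needs less memory on large inputs, at the cost of having to track state by hand in the handler.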
public boolean isUseSAXPptxExtractor()
public void setUseSAXPptxExtractor(boolean useSAXPptxExtractor)
Use the experimental SAX-based streaming PPTX parser? If set to false, the classic parser will be used; if true, the new experimental parser will be used.
Default: false (classic DOM parser)
Parameters: useSAXPptxExtractor
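public boolean isConcatenatePhoneticRuns()
public void setConcatenatePhoneticRuns(boolean concatenatePhoneticRuns)
Microsoft Excel files can sometimes contain phonetic (furigana) strings. This sets whether or not the parser concatenates the phonetic runs to the original text.
This is currently only supported by the xls and xlsx parsers (not the xlsb parser), and the default is true.
Parameters: concatenatePhoneticRuns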
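public boolean isExtractAllAlternativesFromMSG()
public void setExtractAllAlternativesFromMSG(boolean extractAllAlternativesFromMSG)
Some .msg files can contain body content in html, rtf and/or text.
Parameters: extractAllAlternativesFromMSG - whether or not to extract all alternative parts
public boolean isIncludeMissingRows()
public void setIncludeMissingRows(boolean includeMissingRows)
For table-like formats, and tables within other formats, should missing rows in sparse tables be output where detected? The default is to only output rows defined within the file, which avoids lots of blank lines but means layout isn't preserved.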
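To make the trade-off behind setIncludeMissingRows concrete, here is a toy model that uses only the JDK. It is not Tika's implementation: the class `SparseRowsDemo`, its `render` method, and the choice of an empty string to stand for a missing row are assumptions made for illustration. A sparse sheet defines only rows 0 and 3; with the flag off only those rows are output, and with the flag on the gaps are filled so that row positions are preserved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class SparseRowsDemo {
    // Renders rows keyed by row index; optionally emits blank lines for gaps.
    static List<String> render(SortedMap<Integer, String> rows, boolean includeMissingRows) {
        List<String> out = new ArrayList<>();
        int expected = 0; // index of the next row we would expect to see
        for (Map.Entry<Integer, String> e : rows.entrySet()) {
            if (includeMissingRows) {
                // Fill every undefined row before this one with a blank line.
                for (; expected < e.getKey(); expected++) {
                    out.add("");
                }
            }
            out.add(e.getValue());
            expected = e.getKey() + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        SortedMap<Integer, String> rows = new TreeMap<>();
        rows.put(0, "header");
        rows.put(3, "total");
        System.out.println(render(rows, false)); // [header, total]
        System.out.println(render(rows, true));  // [header, , , total]
    }
}
```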
public boolean isIncludeSlideNotes()
public void setIncludeSlideNotes(boolean includeSlideNotes)
Whether or not to process slide notes content. If set to false, the parser will skip the text content and all embedded objects from the slide notes in ppt and ppt[xm]. The default is true.
Parameters: includeSlideNotes - whether or not to process slide notes
public boolean isIncludeSlideMasterContent()
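public void setIncludeSlideMasterContent(boolean includeSlideMasterContent)
Whether or not to include contents from any of the three types of masters (slide, notes, handout) in a .ppt or ppt[xm] file. If set to false, the parser will not extract text or embedded objects from any of the masters.
Parameters: includeSlideMasterContent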
public String getDateFormatOverride()
public void setDateOverrideFormat(String format)
A user may wish to override the date formats in xls and xlsx files.
Note: these formats are "Excel formats", not Java's SimpleDateFormat patterns.
Parameters: format
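The distinction between Excel formats and SimpleDateFormat patterns matters because the same letters can mean different things. As a small JDK-only illustration (no Tika calls are involved, and the class name `DatePatternDemo` is made up for this sketch): in SimpleDateFormat, uppercase `MM` is the month and lowercase `mm` is the minute, whereas an Excel-style format such as `yyyy-mm-dd` uses `mm` for the month in a date context. A pattern written with one syntax in mind may therefore not behave as expected in the other.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

class DatePatternDemo {
    // Formats 2022-03-09 14:05 (UTC) with the given SimpleDateFormat pattern.
    static String format(String pattern) {
        TimeZone utc = TimeZone.getTimeZone("UTC");
        Calendar cal = new GregorianCalendar(utc, Locale.ROOT);
        cal.clear();
        cal.set(2022, Calendar.MARCH, 9, 14, 5, 0);
        SimpleDateFormat sdf = new SimpleDateFormat(pattern, Locale.ROOT);
        sdf.setTimeZone(utc);
        return sdf.format(cal.getTime());
    }

    public static void main(String[] args) {
        // In SimpleDateFormat, "MM" is the month ...
        System.out.println(format("yyyy-MM-dd")); // 2022-03-09
        // ... while "mm" is the minute, so the middle field is 05, not 03.
        System.out.println(format("yyyy-mm-dd")); // 2022-05-09
    }
}
```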
Copyright © 2007–2022 The Apache Software Foundation. All rights reserved.