Package org.apache.tika.parser.microsoft
Class OfficeParserConfig
java.lang.Object
org.apache.tika.parser.microsoft.OfficeParserConfig
- All Implemented Interfaces:
Serializable
-
Constructor Summary
Constructors -
Method Summary
Modifier and Type, Method, Description
String getDateFormatOverride()
int getMaxOverride()
int getRtfEmbeddedMaxBytesInKb() - Maximum size (in KB) per embedded object/pict when extracting from RTF within MSG files.
boolean isConcatenatePhoneticRuns()
boolean isExtractMacros()
boolean isIncludeDeletedContent()
boolean isIncludeGlossary()
boolean isIncludeHeadersAndFooters()
boolean isIncludeMissingRows()
boolean isIncludeMoveFromContent()
boolean isIncludeShapeBasedContent()
boolean isIncludeSlideMasterContent()
boolean isIncludeSlideNotes()
boolean isPreferAlternateContentChoice() - In OOXML, mc:AlternateContent wraps mc:Choice (newer/richer rendering, e.g. DrawingML text boxes) and mc:Fallback (degraded VML for older consumers).
boolean isWriteSelectHeadersInBody() - The default changed to false in 4.x.
void setConcatenatePhoneticRuns(boolean concatenatePhoneticRuns) - Microsoft Excel files can sometimes contain phonetic (furigana) strings.
void setDateOverrideFormat(String format) - A user may wish to override the date formats in xls and xlsx files.
void setExtractMacros(boolean extractMacros) - Sets whether or not MSOffice parsers should extract macros.
void setIncludeDeletedContent(boolean includeDeletedContent) - Sets whether or not the parser should include deleted content.
void setIncludeGlossary(boolean includeGlossary) - Whether or not to include the glossary (building blocks / AutoText) document from docx files.
void setIncludeHeadersAndFooters(boolean includeHeadersAndFooters) - Whether or not to include headers and footers.
void setIncludeMissingRows(boolean includeMissingRows) - For table-like formats, and tables within other formats, should missing rows in sparse tables be output where detected?
void setIncludeMoveFromContent(boolean includeMoveFromContent) - With track changes on, when a section is moved, the content is stored in both the "moveFrom" section and in the "moveTo" section.
void setIncludeShapeBasedContent(boolean includeShapeBasedContent) - In Excel and Word, there can be text stored within drawing shapes.
void setIncludeSlideMasterContent(boolean includeSlideMasterContent) - Whether or not to include contents from any of the three types of masters (slide, notes, handout) in a .ppt or ppt[xm] file.
void setIncludeSlideNotes(boolean includeSlideNotes) - Whether or not to process slide notes content.
void setMaxOverride(int maxOverride)
void setPreferAlternateContentChoice(boolean preferAlternateContentChoice)
void setRtfEmbeddedMaxBytesInKb(int rtfEmbeddedMaxBytesInKb)
void setWriteSelectHeadersInBody(boolean writeSelectHeadersInBody)
-
Constructor Details
-
OfficeParserConfig
public OfficeParserConfig()
-
-
Method Details
-
isExtractMacros
public boolean isExtractMacros()
- Returns:
- whether or not to extract macros
-
setExtractMacros
public void setExtractMacros(boolean extractMacros)
Sets whether or not MSOffice parsers should extract macros. As of Tika 1.15, the default is false.
- Parameters:
extractMacros - whether or not to extract macros
-
isIncludeDeletedContent
public boolean isIncludeDeletedContent() -
setIncludeDeletedContent
public void setIncludeDeletedContent(boolean includeDeletedContent)
Sets whether or not the parser should include deleted content. This has only been implemented in the streaming docx parser (SXWPFWordExtractorDecorator) so far.
- Parameters:
includeDeletedContent - whether or not to include deleted content
-
isIncludeMoveFromContent
public boolean isIncludeMoveFromContent() -
setIncludeMoveFromContent
public void setIncludeMoveFromContent(boolean includeMoveFromContent)
With track changes on, when a section is moved, the content is stored in both the "moveFrom" section and the "moveTo" section. To include the section both in its original location (moveFrom) and in its new location (moveTo), set this to true. Default: false. This has only been implemented in the streaming docx parser (SXWPFWordExtractorDecorator) so far.
- Parameters:
includeMoveFromContent - whether or not to include the moveFrom copy of moved content
-
isIncludeShapeBasedContent
public boolean isIncludeShapeBasedContent() -
setIncludeShapeBasedContent
public void setIncludeShapeBasedContent(boolean includeShapeBasedContent)
In Excel and Word, text can be stored within drawing shapes. (In PowerPoint, everything is in a shape.) To skip searching these shapes for text, set this to false. Default: true.
- Parameters:
includeShapeBasedContent - whether or not to extract text from drawing shapes
-
isPreferAlternateContentChoice
public boolean isPreferAlternateContentChoice()
In OOXML, mc:AlternateContent wraps mc:Choice (newer/richer rendering, e.g. DrawingML text boxes) and mc:Fallback (degraded VML for older consumers). When true (default), the SAX parser processes the Choice branch and skips Fallback. When false, it processes Fallback and skips Choice (legacy behavior prior to Tika 4.x).
For text extraction, Choice typically contains equal or more content than Fallback.
Default: true
- Returns:
- whether to prefer mc:Choice over mc:Fallback
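The Choice/Fallback selection described above can be illustrated with a small stand-alone SAX handler. This is a simplified sketch, not Tika's actual implementation: the class name, the `extractText` helper, and the sample XML are all assumptions made for illustration. It only shows the idea of skipping one branch of mc:AlternateContent while collecting text from the other.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; not part of Tika.
class AlternateContentSketch {
    static final String MC_NS =
            "http://schemas.openxmlformats.org/markup-compatibility/2006";

    // Made-up sample fragment with one Choice and one Fallback branch.
    static final String SAMPLE =
            "<w:p xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'"
            + " xmlns:mc='" + MC_NS + "'>"
            + "<mc:AlternateContent>"
            + "<mc:Choice Requires='wps'><w:t>rich text box</w:t></mc:Choice>"
            + "<mc:Fallback><w:t>legacy text box</w:t></mc:Fallback>"
            + "</mc:AlternateContent></w:p>";

    // Collects character data, skipping Fallback when preferChoice is true
    // and skipping Choice when it is false.
    static String extractText(String xml, boolean preferChoice) {
        final String skipped = preferChoice ? "Fallback" : "Choice";
        final StringBuilder out = new StringBuilder();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        private int skipDepth = 0; // > 0 while inside the skipped branch

                        @Override
                        public void startElement(String uri, String localName,
                                                 String qName, Attributes atts) {
                            if (skipDepth > 0
                                    || (MC_NS.equals(uri) && skipped.equals(localName))) {
                                skipDepth++;
                            }
                        }

                        @Override
                        public void endElement(String uri, String localName, String qName) {
                            if (skipDepth > 0) {
                                skipDepth--;
                            }
                        }

                        @Override
                        public void characters(char[] ch, int start, int length) {
                            if (skipDepth == 0) {
                                out.append(ch, start, length);
                            }
                        }
                    });
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText(SAMPLE, true));  // prints "rich text box"
        System.out.println(extractText(SAMPLE, false)); // prints "legacy text box"
    }
}
```

With `preferChoice` set to true the sketch keeps only the Choice text, which mirrors the documented default; with false it keeps only the Fallback text, mirroring the legacy behavior.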
-
setPreferAlternateContentChoice
public void setPreferAlternateContentChoice(boolean preferAlternateContentChoice)
- Parameters:
preferAlternateContentChoice - whether to prefer mc:Choice over mc:Fallback
-
isConcatenatePhoneticRuns
public boolean isConcatenatePhoneticRuns() -
setConcatenatePhoneticRuns
public void setConcatenatePhoneticRuns(boolean concatenatePhoneticRuns)
Microsoft Excel files can sometimes contain phonetic (furigana) strings. See PHONETIC. This sets whether or not the parser will concatenate the phonetic runs to the original text.
This is currently only supported by the xls and xlsx parsers (not the xlsb parser), and the default is true.
- Parameters:
concatenatePhoneticRuns - whether or not to concatenate phonetic runs to the original text
-
isIncludeGlossary
public boolean isIncludeGlossary() -
setIncludeGlossary
public void setIncludeGlossary(boolean includeGlossary)
Whether or not to include the glossary (building blocks / AutoText) document from docx files. The glossary can contain template content, such as form field placeholders, that may duplicate content already present in the main body. Default: true.
- Parameters:
includeGlossary - whether or not to include glossary content
-
isIncludeMissingRows
public boolean isIncludeMissingRows() -
setIncludeMissingRows
public void setIncludeMissingRows(boolean includeMissingRows)
For table-like formats, and tables within other formats, should missing rows in sparse tables be output where detected? The default is to output only the rows defined within the file, which avoids lots of blank lines but means the layout isn't preserved.
-
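The trade-off between blank lines and preserved layout can be sketched without any Tika classes. The example below is an illustration only: `SparseRowsSketch`, `render`, and the sample rows are hypothetical, and the padding logic is an assumption about what "outputting missing rows" means, not Tika's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not part of Tika.
class SparseRowsSketch {
    // rows maps a zero-based row index to that row's text; gaps are "missing" rows.
    static List<String> render(TreeMap<Integer, String> rows, boolean includeMissingRows) {
        List<String> out = new ArrayList<>();
        int expected = 0; // index of the next row we expect to see
        for (Map.Entry<Integer, String> row : rows.entrySet()) {
            if (includeMissingRows) {
                // Emit one empty row for every index skipped in the file.
                while (expected < row.getKey()) {
                    out.add("");
                    expected++;
                }
            }
            out.add(row.getValue());
            expected = row.getKey() + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> rows = new TreeMap<>();
        rows.put(0, "header");
        rows.put(1, "first");
        rows.put(4, "last"); // rows 2 and 3 are not defined in the file
        System.out.println(render(rows, false)); // prints [header, first, last]
        System.out.println(render(rows, true));  // prints [header, first, , , last]
    }
}
```

Without padding, "last" appears directly after "first"; with padding, it stays on its original fifth row at the cost of two blank lines.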
isIncludeSlideNotes
public boolean isIncludeSlideNotes() -
setIncludeSlideNotes
public void setIncludeSlideNotes(boolean includeSlideNotes)
Whether or not to process slide notes content. If set to false, the parser will skip the text content and all embedded objects from the slide notes in ppt and ppt[xm]. The default is true.
- Parameters:
includeSlideNotes - whether or not to process slide notes
- Since:
- 1.19.1
-
isIncludeSlideMasterContent
public boolean isIncludeSlideMasterContent()
- Returns:
- whether or not to process content in slide masters
- Since:
- 1.19.1
-
setIncludeSlideMasterContent
public void setIncludeSlideMasterContent(boolean includeSlideMasterContent)
Whether or not to include contents from any of the three types of masters (slide, notes, handout) in a .ppt or ppt[xm] file. If set to false, the parser will not extract text or embedded objects from any of the masters.
- Parameters:
includeSlideMasterContent - whether or not to process content in slide masters
- Since:
- 1.19.1
-
getDateFormatOverride
public String getDateFormatOverride()
-
setDateOverrideFormat
public void setDateOverrideFormat(String format)
A user may wish to override the date formats in xls and xlsx files. For example, a user might prefer 'yyyy-mm-dd' to 'mm/dd/yy'.
Note: these formats are "Excel formats", not Java's SimpleDateFormat patterns.
- Parameters:
format - the Excel date format to apply
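The distinction between Excel formats and SimpleDateFormat patterns matters because the same letters mean different things. In Excel-style codes, lowercase mm in a date context denotes the month, while in SimpleDateFormat lowercase mm is always minutes and the month is uppercase MM. The sketch below uses only the JDK to show what happens if an Excel-style string is mistaken for a Java pattern; the class and method names are hypothetical.

```java
import java.text.SimpleDateFormat;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical illustration; not part of Tika.
class DatePatternPitfall {
    // 15 March 2024, 10:42 UTC
    static final Date SAMPLE =
            Date.from(LocalDateTime.of(2024, 3, 15, 10, 42).toInstant(ZoneOffset.UTC));

    // Formats SAMPLE by interpreting the pattern as a SimpleDateFormat pattern.
    static String formatWithJavaPattern(String pattern) {
        SimpleDateFormat sdf = new SimpleDateFormat(pattern, Locale.ROOT);
        sdf.setTimeZone(TimeZone.getTimeZone("UTC"));
        return sdf.format(SAMPLE);
    }

    public static void main(String[] args) {
        // Excel-style string read as a Java pattern: "mm" becomes minutes.
        System.out.println(formatWithJavaPattern("yyyy-mm-dd")); // prints 2024-42-15
        // The Java equivalent needs uppercase "MM" for the month.
        System.out.println(formatWithJavaPattern("yyyy-MM-dd")); // prints 2024-03-15
    }
}
```

The value passed to setDateOverrideFormat should therefore keep the Excel spelling, such as 'yyyy-mm-dd', rather than being converted to a Java pattern.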
-
setMaxOverride
public void setMaxOverride(int maxOverride) -
getMaxOverride
public int getMaxOverride() -
isWriteSelectHeadersInBody
public boolean isWriteSelectHeadersInBody()
The default changed to false in 4.x. For legacy 3.x behavior, set this to true.
-
setWriteSelectHeadersInBody
public void setWriteSelectHeadersInBody(boolean writeSelectHeadersInBody) -
getRtfEmbeddedMaxBytesInKb
public int getRtfEmbeddedMaxBytesInKb()
Maximum size (in KB) per embedded object/pict when extracting from RTF within MSG files. Data is streamed to disk, so the default is 2 GB. Set to -1 for unlimited.
-
setRtfEmbeddedMaxBytesInKb
public void setRtfEmbeddedMaxBytesInKb(int rtfEmbeddedMaxBytesInKb)
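Because the limit is an int expressed in KB, converting it to bytes needs long arithmetic: 2 GB in bytes does not fit in an int. The following sketch shows the conversion under two stated assumptions: 1 KB is taken as 1024 bytes, and any negative value is treated as unlimited (the documentation above only names -1). The helper `toByteLimit` is hypothetical and is not a Tika method.

```java
// Hypothetical illustration; not part of Tika.
class RtfEmbeddedLimitSketch {
    // Converts a KB limit to bytes; negative values are treated as "unlimited"
    // (assumption: the documentation only specifies -1).
    static long toByteLimit(int maxKb) {
        if (maxKb < 0) {
            return Long.MAX_VALUE;
        }
        return maxKb * 1024L; // long multiplication avoids int overflow
    }

    public static void main(String[] args) {
        int twoGbInKb = 2 * 1024 * 1024; // 2 GB expressed in KB (assuming 1 KB = 1024 bytes)
        System.out.println(twoGbInKb);              // prints 2097152
        System.out.println(toByteLimit(twoGbInKb)); // prints 2147483648
        System.out.println(twoGbInKb * 1024);       // int overflow: prints -2147483648
        System.out.println(toByteLimit(-1) == Long.MAX_VALUE); // prints true
    }
}
```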
-