Package org.apache.tika.parser.microsoft
Class AbstractOfficeParser
java.lang.Object
org.apache.tika.parser.microsoft.AbstractOfficeParser
- All Implemented Interfaces:
Serializable, Parser
- Direct Known Subclasses:
OfficeParser, OOXMLParser, Word2006MLParser
Intermediate layer to set OfficeParserConfig uniformly.
-
Constructor Summary
Constructors
AbstractOfficeParser()
-
Method Summary
Modifier and Type, Method, Description
- void configure(ParseContext parseContext): Checks to see if the user has specified an OfficeParserConfig.
- int getByteArrayMaxOverride()
- boolean isConcatenatePhoneticRuns()
- boolean isExtractAllAlternativesFromMSG()
- boolean isExtractMacros()
- boolean isIncludeDeletedContent()
- boolean isIncludeHeadersAndFooters()
- boolean isIncludeMoveFromContent()
- boolean isIncludeShapeBasedContent()
- boolean isUseSAXDocxExtractor()
- boolean isUseSAXPptxExtractor()
- boolean isWriteSelectHeadersInBody()
- void setByteArrayMaxOverride(int maxOverride): WARNING: this sets a static variable in POI.
- void setConcatenatePhoneticRuns(boolean concatenatePhoneticRuns)
- void setDateFormatOverride(String format)
- void setExtractAllAlternativesFromMSG(boolean extractAllAlternativesFromMSG): Some .msg files can contain body content in html, rtf and/or text.
- void setExtractMacros(boolean extractMacros)
- void setIncludeDeletedContent(boolean includeDeletedConent)
- void setIncludeHeadersAndFooters(boolean includeHeadersAndFooters)
- void setIncludeMoveFromContent(boolean includeMoveFromContent)
- void setIncludeShapeBasedContent(boolean includeShapeBasedContent)
- void setUseSAXDocxExtractor(boolean useSAXDocxExtractor)
- void setUseSAXPptxExtractor(boolean useSAXPptxExtractor)
- void setWriteSelectHeadersInBody(boolean val): If set to true, this will write the to/from/cc into the body content.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.tika.parser.Parser
getSupportedTypes, parse
-
Constructor Details
-
AbstractOfficeParser
public AbstractOfficeParser()
-
-
Method Details
-
configure
Checks to see if the user has specified an OfficeParserConfig. If so, no changes are made; if not, one is added to the context.
- Parameters:
parseContext -
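The "use the caller's config if present, otherwise add one" behavior described for configure can be sketched as below. This is a minimal illustration, not Tika's implementation: SketchConfig, SketchContext and ConfigureSketch are hypothetical stand-ins written with the standard library only, and the class-keyed map is an assumption about how such a context might be modeled.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a parser configuration object (not a Tika class).
class SketchConfig {
    boolean extractMacros; // false by default in this sketch
}

// Hypothetical stand-in for a parse context: a map keyed by class.
class SketchContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    <T> T get(Class<T> key) {
        // Class.cast returns null when the entry is absent.
        return key.cast(entries.get(key));
    }
}

class ConfigureSketch {
    // Parser-level default, used only when the caller supplied no config.
    private final SketchConfig defaultConfig = new SketchConfig();

    void configure(SketchContext context) {
        if (context.get(SketchConfig.class) != null) {
            return; // a user-specified config is present: make no changes
        }
        context.set(SketchConfig.class, defaultConfig);
    }

    SketchConfig defaults() {
        return defaultConfig;
    }
}
```

Under this model, a config placed in the context by the caller always takes precedence over values set on the parser itself.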
-
isIncludeDeletedContent
public boolean isIncludeDeletedContent()
-
setIncludeDeletedContent
-
isIncludeMoveFromContent
public boolean isIncludeMoveFromContent()
-
setIncludeMoveFromContent
-
isUseSAXDocxExtractor
public boolean isUseSAXDocxExtractor()
-
setUseSAXDocxExtractor
-
isExtractMacros
public boolean isExtractMacros()
- Returns:
whether or not to extract macros
-
setExtractMacros
-
setIncludeShapeBasedContent
-
isIncludeShapeBasedContent
public boolean isIncludeShapeBasedContent()
-
setUseSAXPptxExtractor
-
isUseSAXPptxExtractor
public boolean isUseSAXPptxExtractor()
-
setConcatenatePhoneticRuns
-
isConcatenatePhoneticRuns
public boolean isConcatenatePhoneticRuns()
-
isExtractAllAlternativesFromMSG
public boolean isExtractAllAlternativesFromMSG()
-
setExtractAllAlternativesFromMSG
Some .msg files can contain body content in html, rtf and/or text. The default behavior is to pick the first non-null value and include only that. If you'd like to extract all non-null body content, which is likely duplicative, set this value to true.
- Parameters:
extractAllAlternativesFromMSG - whether or not to extract all alternative parts from msg files
- Since:
1.17
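The difference between the default "first non-null body" behavior and extracting all alternatives can be illustrated with a small standard-library sketch. MsgBodySketch and selectBodies are hypothetical names, and the html, rtf, text ordering of the candidate list is an assumption taken from the order in which the description mentions them, not from the parser's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Tika.
class MsgBodySketch {

    // candidates: body alternatives in preference order; null means "not present".
    static List<String> selectBodies(List<String> candidates, boolean extractAll) {
        List<String> selected = new ArrayList<>();
        for (String body : candidates) {
            if (body == null) {
                continue; // skip absent alternatives
            }
            selected.add(body);
            if (!extractAll) {
                break; // default: keep only the first non-null body
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Assumed order: html (absent here), rtf, text.
        List<String> bodies = Arrays.asList(null, "rtf body", "text body");
        System.out.println(selectBodies(bodies, false));
        System.out.println(selectBodies(bodies, true));
    }
}
```

With extractAll set to true, the result contains every non-null alternative, so the same message text may appear more than once in different encodings.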
-
setByteArrayMaxOverride
WARNING: this sets a static variable in POI. This allows users to override POI's protection against the allocation of overly large byte arrays. Use carefully, and please open issues on POI's Bugzilla to bump values for specific records. If the value is <= 0, it is ignored.
- Parameters:
maxOverride -
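Because the override is held in a static variable, it affects every parser in the same JVM rather than a single instance. The sketch below models only that semantics, including the rule that values <= 0 are ignored. ByteArrayLimitSketch and its members are hypothetical, and the initial value of -1 is an assumption meaning "no override set"; none of this is POI's or Tika's code.

```java
// Hypothetical model of a process-wide override; not POI's or Tika's code.
class ByteArrayLimitSketch {

    // Stand-in for the static limit held by the underlying library.
    // -1 is an assumed sentinel meaning "no override set".
    private static int globalMaxOverride = -1;

    static void applyOverride(int maxOverride) {
        if (maxOverride <= 0) {
            return; // values <= 0 are ignored, leaving the current state untouched
        }
        globalMaxOverride = maxOverride; // visible to all users of the class
    }

    static int currentOverride() {
        return globalMaxOverride;
    }
}
```

In this model, a later call with a non-positive value does not reset an earlier override, which is one reason to treat the setting as global configuration applied once at startup.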
-
getByteArrayMaxOverride
public int getByteArrayMaxOverride()
-
setDateFormatOverride
-
getDateFormatOverride
-
setWriteSelectHeadersInBody
If set to true, this will write the to/from/cc into the body content.
- Parameters:
val -
-
isWriteSelectHeadersInBody
public boolean isWriteSelectHeadersInBody()
-