Package org.apache.tika.parser.microsoft
Class AbstractOfficeParser
java.lang.Object
org.apache.tika.parser.microsoft.AbstractOfficeParser
- All Implemented Interfaces:
Serializable,Parser
- Direct Known Subclasses:
OfficeParser,OOXMLParser,Word2006MLParser
Intermediate layer to set OfficeParserConfig uniformly.
- See Also:
-
Constructor Summary
Constructors -
Method Summary
Modifier and Type    Method    Description
void    configure(ParseContext parseContext)    Checks to see if the user has specified an OfficeParserConfig.
int    getByteArrayMaxOverride()
String    getDateFormatOverride()
boolean    isConcatenatePhoneticRuns()
boolean    isExtractAllAlternativesFromMSG()
boolean    isExtractMacros()
boolean    isIncludeDeletedContent()
boolean    isIncludeMoveFromContent()
boolean    isIncludeShapeBasedContent()
boolean    isUseSAXDocxExtractor()
boolean    isUseSAXPptxExtractor()
void    setByteArrayMaxOverride(int maxOverride)    WARNING: this sets a static variable in POI.
void    setConcatenatePhoneticRuns(boolean concatenatePhoneticRuns)
void    setDateFormatOverride(String format)
void    setExtractAllAlternativesFromMSG(boolean extractAllAlternativesFromMSG)    Some .msg files can contain body content in html, rtf and/or text.
void    setExtractMacros(boolean extractMacros)
void    setIncludeDeletedContent(boolean includeDeletedConent)
void    setIncludeHeadersAndFooters(boolean includeHeadersAndFooters)
void    setIncludeMoveFromContent(boolean includeMoveFromContent)
void    setIncludeShapeBasedContent(boolean includeShapeBasedContent)
void    setUseSAXDocxExtractor(boolean useSAXDocxExtractor)
void    setUseSAXPptxExtractor(boolean useSAXPptxExtractor)
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.tika.parser.Parser
getSupportedTypes, parse
-
Constructor Details
-
AbstractOfficeParser
public AbstractOfficeParser()
-
-
Method Details
-
configure
Checks to see if the user has specified an OfficeParserConfig. If so, no changes are made; if not, one is added to the context.
- Parameters:
parseContext -
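The contract described for configure (leave a user-supplied config alone, otherwise add one to the context) can be illustrated with a minimal sketch. SketchConfig, SketchContext and ConfigureSketch are hypothetical stand-ins written for this example; they are not the Tika classes and make no claim about how Tika implements the method.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a config object; NOT org.apache.tika's OfficeParserConfig.
class SketchConfig {
    boolean extractMacros = false;
}

// Hypothetical stand-in for a parse context: one object stored per class key.
class SketchContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    <T> T get(Class<T> key) {
        // Class.cast returns null when no entry is stored for the key.
        return key.cast(entries.get(key));
    }
}

class ConfigureSketch {
    // Parser-level defaults, used only when the caller supplied no config.
    private final SketchConfig defaultConfig = new SketchConfig();

    void configure(SketchContext context) {
        if (context.get(SketchConfig.class) == null) {
            // Nothing specified by the user: add the parser's own config.
            context.set(SketchConfig.class, defaultConfig);
        }
        // Otherwise the user-supplied config is left untouched.
    }
}
```

Under this reading, a config placed in the context by the caller takes precedence over values set through the parser's own setters.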
-
isIncludeDeletedContent
public boolean isIncludeDeletedContent()
- Returns:
- See Also:
-
setIncludeDeletedContent
-
isIncludeMoveFromContent
public boolean isIncludeMoveFromContent()
- Returns:
- See Also:
-
setIncludeMoveFromContent
-
isUseSAXDocxExtractor
public boolean isUseSAXDocxExtractor()
- Returns:
- See Also:
-
setUseSAXDocxExtractor
-
isExtractMacros
public boolean isExtractMacros()
- Returns:
- whether or not to extract macros
- See Also:
-
setExtractMacros
-
setIncludeShapeBasedContent
-
isIncludeShapeBasedContent
public boolean isIncludeShapeBasedContent()
-
setUseSAXPptxExtractor
-
isUseSAXPptxExtractor
public boolean isUseSAXPptxExtractor()
-
setConcatenatePhoneticRuns
-
isConcatenatePhoneticRuns
public boolean isConcatenatePhoneticRuns()
-
isExtractAllAlternativesFromMSG
public boolean isExtractAllAlternativesFromMSG()
-
setExtractAllAlternativesFromMSG
Some .msg files can contain body content in HTML, RTF and/or plain text. The default behavior is to pick the first non-null value and include only that. To extract all non-null body content, which is likely duplicative, set this value to true.
- Parameters:
extractAllAlternativesFromMSG - whether or not to extract all alternative parts from msg files
- Since:
- 1.17
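The selection rule described for setExtractAllAlternativesFromMSG can be sketched as follows. MsgBodySketch and selectBodies are hypothetical names introduced for illustration, and the order in which the alternative bodies are passed in is an assumption; the documentation above does not say which format is checked first.

```java
import java.util.ArrayList;
import java.util.List;

class MsgBodySketch {
    // Hypothetical helper, not part of Tika.
    // extractAll == false: keep only the first non-null body.
    // extractAll == true:  keep every non-null body, in the order given.
    static List<String> selectBodies(boolean extractAll, String... bodies) {
        List<String> selected = new ArrayList<>();
        for (String body : bodies) {
            if (body == null) {
                continue; // this alternative is absent from the message
            }
            selected.add(body);
            if (!extractAll) {
                break; // default behavior: stop at the first non-null value
            }
        }
        return selected;
    }
}
```

With the flag set to true, the same message text typically appears once per stored format, which is why the result is described as likely duplicative.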
-
setByteArrayMaxOverride
WARNING: this sets a static variable in POI. This allows users to override POI's protection against the allocation of overly large byte arrays. Use carefully, and please open issues on POI's Bugzilla to bump values for specific records. If the value is <= 0, it is ignored.
- Parameters:
maxOverride -
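Two points in the warning above are easy to miss: the override is process-wide because it lives in a static variable, and non-positive values are ignored. A minimal sketch of those semantics follows. ByteArrayLimitSketch and its static field are hypothetical stand-ins, not POI or Tika code, and the sentinel value -1 for "no override" is an assumption of this example.

```java
class ByteArrayLimitSketch {
    // Stand-in for the static variable in POI; shared by every parser in the JVM.
    // -1 is this sketch's own marker for "no override set".
    static int globalMaxOverride = -1;

    // Hypothetical helper mirroring the documented rule for maxOverride.
    static void applyOverride(int maxOverride) {
        if (maxOverride <= 0) {
            return; // values <= 0 are ignored, the current limit stays in place
        }
        globalMaxOverride = maxOverride;
    }
}
```

Because the value is static, setting it on one parser instance would affect every other parser running in the same JVM, which is the reason for the warning.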
-
getByteArrayMaxOverride
public int getByteArrayMaxOverride()
-
setDateFormatOverride
-
getDateFormatOverride
-