Package org.apache.tika.parser.microsoft
Class AbstractOfficeParser
- java.lang.Object
-
- org.apache.tika.parser.AbstractParser
-
- org.apache.tika.parser.microsoft.AbstractOfficeParser
-
- All Implemented Interfaces:
Serializable, Parser
- Direct Known Subclasses:
OfficeParser, OOXMLParser, Word2006MLParser
public abstract class AbstractOfficeParser extends AbstractParser
Intermediate layer to set OfficeParserConfig uniformly.
- See Also:
- Serialized Form
-
-
Constructor Summary
Constructor              Description
AbstractOfficeParser()
-
Method Summary
Modifier and Type  Method  Description
void     configure(ParseContext parseContext)  Checks to see if the user has specified an OfficeParserConfig.
int      getByteArrayMaxOverride()
String   getDateFormatOverride()
boolean  isConcatenatePhoneticRuns()
boolean  isExtractAllAlternativesFromMSG()
boolean  isExtractMacros()
boolean  isIncludeDeletedContent()
boolean  isIncludeHeadersAndFooters()
boolean  isIncludeMoveFromContent()
boolean  isIncludeShapeBasedContent()
boolean  isUseSAXDocxExtractor()
boolean  isUseSAXPptxExtractor()
void     setByteArrayMaxOverride(int maxOverride)  WARNING: this sets a static variable in POI.
void     setConcatenatePhoneticRuns(boolean concatenatePhoneticRuns)
void     setDateFormatOverride(String format)
void     setExtractAllAlternativesFromMSG(boolean extractAllAlternativesFromMSG)  Some .msg files can contain body content in HTML, RTF and/or plain text.
void     setExtractMacros(boolean extractMacros)
void     setIncludeDeletedContent(boolean includeDeletedConent)
void     setIncludeHeadersAndFooters(boolean includeHeadersAndFooters)
void     setIncludeMoveFromContent(boolean includeMoveFromContent)
void     setIncludeShapeBasedContent(boolean includeShapeBasedContent)
void     setUseSAXDocxExtractor(boolean useSAXDocxExtractor)
void     setUseSAXPptxExtractor(boolean useSAXPptxExtractor)
-
Methods inherited from class org.apache.tika.parser.AbstractParser
parse
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.tika.parser.Parser
getSupportedTypes, parse
-
-
-
-
Method Detail
-
configure
public void configure(ParseContext parseContext)
Checks to see if the user has specified an OfficeParserConfig. If so, no changes are made; if not, one is added to the context.
- Parameters:
parseContext -
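The "add one only if absent" behavior described for configure can be sketched without any Tika dependency. This is a minimal illustration, not Tika's implementation: SketchContext and SketchConfig are hypothetical stand-ins for ParseContext and OfficeParserConfig, and passing the parser's default config as a second argument is an assumption made to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigureSketch {

    /** Hypothetical stand-in for OfficeParserConfig; only one setting is modeled. */
    static class SketchConfig {
        final boolean extractMacros;

        SketchConfig(boolean extractMacros) {
            this.extractMacros = extractMacros;
        }
    }

    /** Hypothetical stand-in for ParseContext: a map keyed by class. */
    static class SketchContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            return key.cast(entries.get(key)); // null when nothing is registered
        }
    }

    /** Adds parserDefault only when the caller has not supplied a config. */
    static void configure(SketchContext context, SketchConfig parserDefault) {
        if (context.get(SketchConfig.class) == null) {
            context.set(SketchConfig.class, parserDefault);
        }
    }

    public static void main(String[] args) {
        SketchConfig parserDefault = new SketchConfig(false);

        // Empty context: the parser's default is added.
        SketchContext empty = new SketchContext();
        configure(empty, parserDefault);
        System.out.println(empty.get(SketchConfig.class) == parserDefault); // true

        // Context with a user-supplied config: left unchanged.
        SketchContext preset = new SketchContext();
        SketchConfig userConfig = new SketchConfig(true);
        preset.set(SketchConfig.class, userConfig);
        configure(preset, parserDefault);
        System.out.println(preset.get(SketchConfig.class) == userConfig); // true
    }
}
```

The practical consequence is that a config placed in the context by the caller takes precedence over settings made through the parser's own setters.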
-
isIncludeDeletedContent
public boolean isIncludeDeletedContent()
- Returns:
- See Also:
OfficeParserConfig.isIncludeDeletedContent()
-
setIncludeDeletedContent
@Field public void setIncludeDeletedContent(boolean includeDeletedConent)
-
isIncludeMoveFromContent
public boolean isIncludeMoveFromContent()
- Returns:
- See Also:
OfficeParserConfig.isIncludeMoveFromContent()
-
setIncludeMoveFromContent
@Field public void setIncludeMoveFromContent(boolean includeMoveFromContent)
-
isUseSAXDocxExtractor
public boolean isUseSAXDocxExtractor()
- Returns:
- See Also:
OfficeParserConfig.isUseSAXDocxExtractor()
-
setUseSAXDocxExtractor
@Field public void setUseSAXDocxExtractor(boolean useSAXDocxExtractor)
-
isExtractMacros
public boolean isExtractMacros()
- Returns:
- whether or not to extract macros
- See Also:
OfficeParserConfig.isExtractMacros()
-
setExtractMacros
@Field public void setExtractMacros(boolean extractMacros)
-
setIncludeShapeBasedContent
@Field public void setIncludeShapeBasedContent(boolean includeShapeBasedContent)
-
isIncludeShapeBasedContent
public boolean isIncludeShapeBasedContent()
-
setUseSAXPptxExtractor
@Field public void setUseSAXPptxExtractor(boolean useSAXPptxExtractor)
-
isUseSAXPptxExtractor
public boolean isUseSAXPptxExtractor()
-
setConcatenatePhoneticRuns
@Field public void setConcatenatePhoneticRuns(boolean concatenatePhoneticRuns)
-
isConcatenatePhoneticRuns
public boolean isConcatenatePhoneticRuns()
-
isExtractAllAlternativesFromMSG
public boolean isExtractAllAlternativesFromMSG()
-
setExtractAllAlternativesFromMSG
@Field public void setExtractAllAlternativesFromMSG(boolean extractAllAlternativesFromMSG)
Some .msg files can contain body content in HTML, RTF and/or plain text. The default behavior is to pick the first non-null value and include only that. To extract all non-null body content, which is likely duplicative, set this value to true.
- Parameters:
extractAllAlternativesFromMSG - whether or not to extract all alternative parts from .msg files
- Since:
- 1.17
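The selection rule behind extractAllAlternativesFromMSG (first non-null body by default, every non-null body when the flag is true) can be sketched as below. This is an illustration under stated assumptions, not the parser's code: selectBodies is a hypothetical helper, and the candidate order is simply whatever order the caller passes in.

```java
import java.util.ArrayList;
import java.util.List;

public class MsgBodySketch {

    /**
     * Hypothetical helper: returns the first non-null candidate, or all
     * non-null candidates when extractAll is true.
     */
    static List<String> selectBodies(boolean extractAll, String... candidates) {
        List<String> selected = new ArrayList<>();
        for (String body : candidates) {
            if (body == null) {
                continue; // this alternative is absent from the message
            }
            selected.add(body);
            if (!extractAll) {
                break; // default: keep only the first non-null alternative
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Assumed candidate order for illustration: HTML, RTF, plain text.
        System.out.println(selectBodies(false, null, "rtf body", "text body"));
        System.out.println(selectBodies(true, null, "rtf body", "text body"));
    }
}
```

With the flag enabled, the same message text can appear several times in the output, once per body format, so downstream consumers may need to deduplicate.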
-
setByteArrayMaxOverride
@Field public void setByteArrayMaxOverride(int maxOverride)
WARNING: this sets a static variable in POI. This allows users to override POI's protection against the allocation of overly large byte arrays. Use carefully, and please open issues on POI's Bugzilla to bump values for specific records. If the value is <= 0, it is ignored.
- Parameters:
maxOverride -
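The two points in the warning above (the setting is static, and values <= 0 are ignored) can be illustrated with a small stand-alone sketch. Everything here is hypothetical: ByteArrayLimitSketch and its sharedMaxOverride field only model a process-wide limit and are not POI or Tika code, and the getter reads the shared value purely for demonstration.

```java
public class ByteArrayLimitSketch {

    // Hypothetical stand-in for the static limit held by POI; 0 means "no override".
    private static int sharedMaxOverride = 0;

    /** Instance setter that writes shared static state, ignoring values <= 0. */
    public void setByteArrayMaxOverride(int maxOverride) {
        if (maxOverride > 0) {
            sharedMaxOverride = maxOverride;
        }
    }

    /** Reads the shared value (for illustration only). */
    public int getByteArrayMaxOverride() {
        return sharedMaxOverride;
    }

    public static void main(String[] args) {
        ByteArrayLimitSketch first = new ByteArrayLimitSketch();
        ByteArrayLimitSketch second = new ByteArrayLimitSketch();

        // Setting the value through one instance is visible through the other.
        first.setByteArrayMaxOverride(500_000_000);
        System.out.println(second.getByteArrayMaxOverride());

        // A non-positive value is ignored, so the earlier override remains.
        second.setByteArrayMaxOverride(-1);
        System.out.println(first.getByteArrayMaxOverride());
    }
}
```

Because the state is shared, an override set for one parser instance affects every parse in the same JVM, which is why the setting should be chosen once and deliberately.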
-
getByteArrayMaxOverride
public int getByteArrayMaxOverride()
-
getDateFormatOverride
public String getDateFormatOverride()
-
setIncludeHeadersAndFooters
@Field public void setIncludeHeadersAndFooters(boolean includeHeadersAndFooters)
-
isIncludeHeadersAndFooters
public boolean isIncludeHeadersAndFooters()
-
-