Class AbstractOfficeParser

java.lang.Object
org.apache.tika.parser.AbstractParser
org.apache.tika.parser.microsoft.AbstractOfficeParser
All Implemented Interfaces:
Serializable, Parser
Direct Known Subclasses:
OfficeParser, OOXMLParser, Word2006MLParser

public abstract class AbstractOfficeParser extends AbstractParser
Intermediate layer to set OfficeParserConfig uniformly.
  • Constructor Details

    • AbstractOfficeParser

      public AbstractOfficeParser()
  • Method Details

    • configure

      public void configure(ParseContext parseContext)
      Checks to see if the user has specified an OfficeParserConfig. If so, no changes are made; if not, one is added to the context.
      Parameters:
      parseContext - the parse context to check for an existing OfficeParserConfig
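The check described for configure can be sketched as follows. This is a minimal, self-contained illustration and not Tika's source: SketchOfficeConfig, SketchContext and ConfigureSketch are hypothetical stand-ins for OfficeParserConfig, ParseContext and the parser, kept free of any Tika dependency so the snippet runs on its own.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for OfficeParserConfig (assumption, not Tika's class).
class SketchOfficeConfig {
    final boolean extractMacros;

    SketchOfficeConfig(boolean extractMacros) {
        this.extractMacros = extractMacros;
    }
}

// Hypothetical stand-in for ParseContext: a map keyed by class.
class SketchContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    <T> T get(Class<T> key) {
        return key.cast(entries.get(key)); // null when nothing was registered
    }
}

class ConfigureSketch {
    // The parser's own config, used only when the caller supplied none.
    private final SketchOfficeConfig defaultConfig = new SketchOfficeConfig(false);

    void configure(SketchContext context) {
        // A config supplied by the user is left untouched.
        if (context.get(SketchOfficeConfig.class) == null) {
            context.set(SketchOfficeConfig.class, defaultConfig);
        }
    }

    public static void main(String[] args) {
        SketchContext context = new SketchContext();
        SketchOfficeConfig userConfig = new SketchOfficeConfig(true);
        context.set(SketchOfficeConfig.class, userConfig);

        new ConfigureSketch().configure(context);

        // The user's config is still the one registered in the context.
        System.out.println(context.get(SketchOfficeConfig.class) == userConfig);
    }
}
```

Under this reading, a caller who wants per-parse settings registers a config in the context before parsing, and the values set on the parser act only as a fallback.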
    • isIncludeDeletedContent

      public boolean isIncludeDeletedContent()
    • setIncludeDeletedContent

      @Field public void setIncludeDeletedContent(boolean includeDeletedConent)
    • isIncludeMoveFromContent

      public boolean isIncludeMoveFromContent()
    • setIncludeMoveFromContent

      @Field public void setIncludeMoveFromContent(boolean includeMoveFromContent)
    • isUseSAXDocxExtractor

      public boolean isUseSAXDocxExtractor()
    • setUseSAXDocxExtractor

      @Field public void setUseSAXDocxExtractor(boolean useSAXDocxExtractor)
    • isExtractMacros

      public boolean isExtractMacros()
      Returns:
      whether or not to extract macros
    • setExtractMacros

      @Field public void setExtractMacros(boolean extractMacros)
    • setIncludeShapeBasedContent

      @Field public void setIncludeShapeBasedContent(boolean includeShapeBasedContent)
    • isIncludeShapeBasedContent

      public boolean isIncludeShapeBasedContent()
    • setUseSAXPptxExtractor

      @Field public void setUseSAXPptxExtractor(boolean useSAXPptxExtractor)
    • isUseSAXPptxExtractor

      public boolean isUseSAXPptxExtractor()
    • setConcatenatePhoneticRuns

      @Field public void setConcatenatePhoneticRuns(boolean concatenatePhoneticRuns)
    • isConcatenatePhoneticRuns

      public boolean isConcatenatePhoneticRuns()
    • isExtractAllAlternativesFromMSG

      public boolean isExtractAllAlternativesFromMSG()
    • setExtractAllAlternativesFromMSG

      @Field public void setExtractAllAlternativesFromMSG(boolean extractAllAlternativesFromMSG)
      Some .msg files contain body content in HTML, RTF and/or plain text. By default, the parser picks the first non-null body and includes only that one. To extract every non-null body, which is likely to duplicate content, set this value to true.
      Parameters:
      extractAllAlternativesFromMSG - whether or not to extract all alternative parts from msg files
      Since:
      1.17
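The difference between the default and the "all alternatives" behavior can be illustrated with a small sketch. MsgBodySketch and selectBodies are hypothetical names, and the HTML, RTF, text order is an assumption taken from the order in the description above; the actual .msg parser may check the bodies in a different order.

```java
import java.util.ArrayList;
import java.util.List;

class MsgBodySketch {
    // Hypothetical helper: returns the bodies that would be extracted.
    static List<String> selectBodies(String html, String rtf, String text,
                                     boolean extractAllAlternatives) {
        List<String> selected = new ArrayList<>();
        for (String body : new String[] {html, rtf, text}) {
            if (body == null) {
                continue; // this alternative is absent from the message
            }
            selected.add(body);
            if (!extractAllAlternatives) {
                break; // default: stop at the first non-null body
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // A message with no HTML body but both RTF and plain-text bodies.
        System.out.println(selectBodies(null, "rtf body", "text body", false));
        System.out.println(selectBodies(null, "rtf body", "text body", true));
    }
}
```

With the flag off, only the RTF body is kept in this example; with it on, both the RTF and plain-text bodies are returned, which is where the duplication mentioned above comes from.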
    • setByteArrayMaxOverride

      @Field public void setByteArrayMaxOverride(int maxOverride)
      WARNING: this sets a static variable in POI. It lets users override POI's protection against allocating overly large byte arrays. Use it carefully, and please open issues on POI's Bugzilla to raise the limits for specific records. If the value is <= 0, it is ignored.
      Parameters:
      maxOverride - the maximum byte array length to allow; ignored if <= 0
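Because the override lives in a static variable, setting it on one parser affects every parser in the same JVM. The sketch below illustrates that consequence and the "<= 0 is ignored" rule. ByteArrayLimitSketch and sharedMaxOverride are hypothetical stand-ins for the parser and for the static inside POI, and the initial value of -1 is an assumption.

```java
class ByteArrayLimitSketch {
    // Hypothetical stand-in for the static limit held inside POI.
    static int sharedMaxOverride = -1;

    // Per-parser record of the last accepted value (initial -1 is an assumption).
    private int maxOverride = -1;

    void setByteArrayMaxOverride(int maxOverride) {
        if (maxOverride <= 0) {
            return; // values <= 0 are ignored
        }
        sharedMaxOverride = maxOverride; // process-wide effect
        this.maxOverride = maxOverride;
    }

    int getByteArrayMaxOverride() {
        return maxOverride;
    }

    public static void main(String[] args) {
        ByteArrayLimitSketch first = new ByteArrayLimitSketch();
        ByteArrayLimitSketch second = new ByteArrayLimitSketch();

        first.setByteArrayMaxOverride(500_000_000);
        second.setByteArrayMaxOverride(0); // ignored, nothing changes

        // The shared limit reflects the first parser's call, even though
        // the second parser never accepted a value of its own.
        System.out.println(sharedMaxOverride);
        System.out.println(second.getByteArrayMaxOverride());
    }
}
```

The practical consequence is that the override is best set once, at startup, rather than per document.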
    • getByteArrayMaxOverride

      public int getByteArrayMaxOverride()
    • setDateFormatOverride

      @Field public void setDateFormatOverride(String format)
    • getDateFormatOverride

      public String getDateFormatOverride()
    • setIncludeHeadersAndFooters

      @Field public void setIncludeHeadersAndFooters(boolean includeHeadersAndFooters)
    • isIncludeHeadersAndFooters

      public boolean isIncludeHeadersAndFooters()