Class OfficeParserConfig

java.lang.Object
org.apache.tika.parser.microsoft.OfficeParserConfig
All Implemented Interfaces:
Serializable

public class OfficeParserConfig extends Object implements Serializable
  • Constructor Details

    • OfficeParserConfig

      public OfficeParserConfig()
  • Method Details

    • isExtractMacros

      public boolean isExtractMacros()
      Returns:
      whether or not to extract macros
    • setExtractMacros

      public void setExtractMacros(boolean extractMacros)
      Sets whether or not MSOffice parsers should extract macros. As of Tika 1.15, the default is false.
      Parameters:
      extractMacros - whether or not to extract macros
    • isIncludeDeletedContent

      public boolean isIncludeDeletedContent()
    • setIncludeDeletedContent

      public void setIncludeDeletedContent(boolean includeDeletedContent)
      Sets whether or not the parser should include deleted content.

      So far, this has only been implemented in the streaming docx parser (SXWPFWordExtractorDecorator).

      Parameters:
      includeDeletedContent - whether or not to include deleted content
    • isIncludeMoveFromContent

      public boolean isIncludeMoveFromContent()
    • setIncludeMoveFromContent

      public void setIncludeMoveFromContent(boolean includeMoveFromContent)
      With track changes on, when a section is moved, the content is stored in both the "moveFrom" section and in the "moveTo" section.

      If you'd like to include the section both in its original location (moveFrom) and in its new location (moveTo), set this to true.

      Default: false

      So far, this has only been implemented in the streaming docx parser (SXWPFWordExtractorDecorator).

      Parameters:
      includeMoveFromContent - whether or not to also include moved content at its original (moveFrom) location
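The effect of the two track-changes flags above can be pictured with a small stand-alone model. This is only a sketch: `TrackedRunFilter`, `Kind`, and `Run` are hypothetical names invented for illustration and are not part of Tika, and the real parser works on the docx XML rather than on a list of runs.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of tracked-change text runs; not a Tika class. */
class TrackedRunFilter {
    enum Kind { NORMAL, DELETED, MOVE_FROM, MOVE_TO }

    /** A piece of text tagged with the kind of revision markup that wraps it. */
    static final class Run {
        final Kind kind;
        final String text;

        Run(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    /** Concatenates the runs that survive the two include flags. */
    static String extract(List<Run> runs, boolean includeDeleted, boolean includeMoveFrom) {
        StringBuilder out = new StringBuilder();
        for (Run run : runs) {
            boolean keep;
            switch (run.kind) {
                case DELETED:
                    keep = includeDeleted;
                    break;
                case MOVE_FROM:
                    keep = includeMoveFrom;
                    break;
                default:
                    // NORMAL and MOVE_TO text is always part of the output.
                    keep = true;
            }
            if (keep) {
                out.append(run.text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
                new Run(Kind.MOVE_FROM, "A "),
                new Run(Kind.NORMAL, "B "),
                new Run(Kind.MOVE_TO, "A "));
        System.out.println(extract(runs, false, false));
        System.out.println(extract(runs, false, true));
    }
}
```

With `includeMoveFrom` set to false the moved text appears once, at its new location; with true it appears at both locations.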
    • isIncludeShapeBasedContent

      public boolean isIncludeShapeBasedContent()
    • setIncludeShapeBasedContent

      public void setIncludeShapeBasedContent(boolean includeShapeBasedContent)
      In Excel and Word, there can be text stored within drawing shapes. (In PowerPoint, everything is in a shape.)

      If you'd like to skip processing these shapes when looking for text, set this to false.

      Default: true

      Parameters:
      includeShapeBasedContent - whether or not to process text stored within drawing shapes
    • isIncludeHeadersAndFooters

      public boolean isIncludeHeadersAndFooters()
    • setIncludeHeadersAndFooters

      public void setIncludeHeadersAndFooters(boolean includeHeadersAndFooters)
      Whether or not to include headers and footers.

      This only operates on headers and footers in Word and Excel, not on master slide content in PowerPoint.

      Default: true

      Parameters:
      includeHeadersAndFooters - whether or not to include headers and footers
    • isPreferAlternateContentChoice

      public boolean isPreferAlternateContentChoice()
      In OOXML, mc:AlternateContent wraps mc:Choice (newer/richer rendering, e.g. DrawingML text boxes) and mc:Fallback (degraded VML for older consumers). When true (default), the SAX parser processes the Choice branch and skips Fallback. When false, it processes Fallback and skips Choice (legacy behavior prior to Tika 4.x).

      For text extraction, Choice typically contains equal or more content than Fallback.

      Default: true

      Returns:
      whether to prefer mc:Choice over mc:Fallback
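The Choice/Fallback selection described above can be sketched with a plain JDK SAX handler. This is an illustration under stated assumptions, not Tika's implementation: `AlternateContentSketch` and `extractText` are hypothetical names, and the handler simply skips whichever branch is not preferred while collecting all other character data.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch of branch selection inside mc:AlternateContent; not a Tika class. */
class AlternateContentSketch {
    static final String MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006";

    /** Returns the character data of the document, skipping the branch that is not preferred. */
    static String extractText(String xml, boolean preferChoice) {
        final StringBuilder text = new StringBuilder();
        // The branch to drop: Fallback when Choice is preferred, otherwise Choice.
        final String skippedBranch = preferChoice ? "Fallback" : "Choice";

        DefaultHandler handler = new DefaultHandler() {
            // Greater than zero while inside the skipped branch; counts nested elements.
            private int skipDepth = 0;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (skipDepth > 0) {
                    skipDepth++;
                } else if (MC_NS.equals(uri) && skippedBranch.equals(localName)) {
                    skipDepth = 1;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (skipDepth > 0) {
                    skipDepth--;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (skipDepth == 0) {
                    text.append(ch, start, length);
                }
            }
        };

        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String xml = "<r xmlns:mc=\"" + MC_NS + "\"><mc:AlternateContent>"
                + "<mc:Choice Requires=\"wps\"><t>new</t></mc:Choice>"
                + "<mc:Fallback><t>old</t></mc:Fallback>"
                + "</mc:AlternateContent></r>";
        System.out.println(extractText(xml, true));
        System.out.println(extractText(xml, false));
    }
}
```

The depth counter matters because both branches usually contain nested markup; a simple boolean flag would stop skipping at the first nested end tag.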
    • setPreferAlternateContentChoice

      public void setPreferAlternateContentChoice(boolean preferAlternateContentChoice)
      Parameters:
      preferAlternateContentChoice - whether to prefer mc:Choice over mc:Fallback
    • isConcatenatePhoneticRuns

      public boolean isConcatenatePhoneticRuns()
    • setConcatenatePhoneticRuns

      public void setConcatenatePhoneticRuns(boolean concatenatePhoneticRuns)
      Microsoft Excel files can sometimes contain phonetic (furigana) strings. See PHONETIC. This sets whether or not the parser will concatenate the phonetic runs to the original text.

      This is currently only supported by the xls and xlsx parsers (not the xlsb parser), and the default is true.

      Parameters:
      concatenatePhoneticRuns - whether or not to concatenate phonetic runs to the original text
    • isIncludeGlossary

      public boolean isIncludeGlossary()
    • setIncludeGlossary

      public void setIncludeGlossary(boolean includeGlossary)
      Whether or not to include the glossary (building blocks / AutoText) document from docx files. The glossary can contain template content such as form field placeholders that may duplicate content already present in the main body.

      Default: true

      Parameters:
      includeGlossary - whether or not to include glossary content
    • isIncludeMissingRows

      public boolean isIncludeMissingRows()
    • setIncludeMissingRows

      public void setIncludeMissingRows(boolean includeMissingRows)
      For table-like formats, and for tables within other formats, sets whether missing rows in sparse tables should be output where detected. The default is to output only the rows defined within the file, which avoids lots of blank lines but means the layout isn't preserved.
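The trade-off between compact output and preserved layout can be sketched as follows. `SparseRowsSketch` and `render` are hypothetical names, and the zero-based row index and the empty string used for a missing row are assumptions made for illustration; they do not describe how Tika actually emits rows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical sketch of sparse-row output; not a Tika class. */
class SparseRowsSketch {
    /**
     * Renders rows keyed by zero-based row index. When includeMissingRows is true,
     * an empty line is emitted for every index that has no defined row before the
     * last defined one.
     */
    static List<String> render(SortedMap<Integer, String> definedRows, boolean includeMissingRows) {
        List<String> lines = new ArrayList<>();
        int nextExpected = 0;
        for (Map.Entry<Integer, String> row : definedRows.entrySet()) {
            if (includeMissingRows) {
                while (nextExpected < row.getKey()) {
                    lines.add(""); // placeholder for a row the file does not define
                    nextExpected++;
                }
            }
            lines.add(row.getValue());
            nextExpected = row.getKey() + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        SortedMap<Integer, String> rows = new TreeMap<>();
        rows.put(0, "header");
        rows.put(3, "total");
        System.out.println(render(rows, false));
        System.out.println(render(rows, true));
    }
}
```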
    • isIncludeSlideNotes

      public boolean isIncludeSlideNotes()
    • setIncludeSlideNotes

      public void setIncludeSlideNotes(boolean includeSlideNotes)
      Whether or not to process slide notes content. If set to false, the parser will skip the text content and all embedded objects from the slide notes in ppt and ppt[xm]. The default is true.
      Parameters:
      includeSlideNotes - whether or not to process slide notes
      Since:
      1.19.1
    • isIncludeSlideMasterContent

      public boolean isIncludeSlideMasterContent()
      Returns:
      whether or not to process content in slide masters
      Since:
      1.19.1
    • setIncludeSlideMasterContent

      public void setIncludeSlideMasterContent(boolean includeSlideMasterContent)
      Whether or not to include contents from any of the three types of masters -- slide, notes, handout -- in a .ppt or ppt[xm] file. If set to false, the parser will not extract text or embedded objects from any of the masters.
      Parameters:
      includeSlideMasterContent - whether or not to process content in slide masters
      Since:
      1.19.1
    • getDateFormatOverride

      public String getDateFormatOverride()
    • setDateOverrideFormat

      public void setDateOverrideFormat(String format)
      A user may wish to override the date formats in xls and xlsx files. For example, a user might prefer 'yyyy-mm-dd' to 'mm/dd/yy'.

      Note: these formats are Excel formats, not Java SimpleDateFormat patterns.

      Parameters:
      format - the Excel date format to use in place of the formats stored in the file
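The distinction between Excel formats and SimpleDateFormat patterns matters because the same letters mean different things. In SimpleDateFormat, lowercase `mm` is minutes and uppercase `MM` is the month, whereas the Excel-style example above uses `mm` for the month. The sketch below uses only the JDK; `DatePatternSketch` is a hypothetical name and nothing here calls Tika.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical sketch: what happens if an Excel-style format is read as a Java pattern. */
class DatePatternSketch {
    private static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    /** Formats the date with java.text.SimpleDateFormat in UTC. */
    static String formatWithJava(String pattern, Date date) {
        SimpleDateFormat fmt = new SimpleDateFormat(pattern, Locale.ROOT);
        fmt.setTimeZone(UTC);
        return fmt.format(date);
    }

    /** 15 March 2024, 10:07:00 UTC. */
    static Date sampleDate() {
        Calendar cal = new GregorianCalendar(UTC, Locale.ROOT);
        cal.clear();
        cal.set(2024, Calendar.MARCH, 15, 10, 7, 0);
        return cal.getTime();
    }

    public static void main(String[] args) {
        // Java reads "mm" as minutes, so the month slot shows 07 rather than 03.
        System.out.println(formatWithJava("yyyy-mm-dd", sampleDate()));
        // The Java spelling of the intended pattern uses "MM" for the month.
        System.out.println(formatWithJava("yyyy-MM-dd", sampleDate()));
    }
}
```

For this sample date the first call yields `2024-07-15` and the second `2024-03-15`, which is why a string passed to setDateOverrideFormat should follow Excel's conventions rather than Java's.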
    • setMaxOverride

      public void setMaxOverride(int maxOverride)
    • getMaxOverride

      public int getMaxOverride()
    • isWriteSelectHeadersInBody

      public boolean isWriteSelectHeadersInBody()
      The default changed to false in 4.x. For legacy 3.x behavior, set this to true.
    • setWriteSelectHeadersInBody

      public void setWriteSelectHeadersInBody(boolean writeSelectHeadersInBody)
    • getRtfEmbeddedMaxBytesInKb

      public int getRtfEmbeddedMaxBytesInKb()
      Maximum size, in KB, of each embedded object/pict when extracting from RTF within MSG files. Data is streamed to disk, so the default is 2 GB. Set to -1 for unlimited.
    • setRtfEmbeddedMaxBytesInKb

      public void setRtfEmbeddedMaxBytesInKb(int rtfEmbeddedMaxBytesInKb)
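Because the limit is an int expressed in KB, code that turns it into a byte count needs 64-bit arithmetic: 2 GB is 2,097,152 KB, and multiplying that by 1024 in int arithmetic overflows. The sketch below shows one way a caller might do the conversion. `RtfLimitSketch` and `toByteLimit` are hypothetical names, and both the use of `Long.MAX_VALUE` for "unlimited" and the rejection of other negative values are assumptions, not Tika behavior.

```java
/** Hypothetical sketch of converting a KB limit to bytes; not a Tika class. */
class RtfLimitSketch {
    /** Stand-in for "no limit" (an assumption made for this sketch). */
    static final long UNLIMITED = Long.MAX_VALUE;

    /** Converts a limit in KB to bytes, treating -1 as unlimited. */
    static long toByteLimit(int maxKb) {
        if (maxKb == -1) {
            return UNLIMITED;
        }
        if (maxKb < 0) {
            throw new IllegalArgumentException("limit must be -1 or non-negative: " + maxKb);
        }
        // 1024L forces long multiplication and avoids int overflow.
        return maxKb * 1024L;
    }

    public static void main(String[] args) {
        int twoGbInKb = 2 * 1024 * 1024; // 2,097,152 KB
        System.out.println(toByteLimit(twoGbInKb));
        System.out.println(toByteLimit(-1) == UNLIMITED);
    }
}
```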