Class RTFEncapsulatedHTMLExtractor

java.lang.Object
org.apache.tika.parser.microsoft.msg.RTFEncapsulatedHTMLExtractor

public class RTFEncapsulatedHTMLExtractor extends Object
Extracts the original HTML from an RTF document that contains encapsulated HTML (as indicated by the \fromhtml1 control word).

The encapsulated HTML format stores HTML in two places:

  1. {\*\htmltag<N> ...} groups: contain the HTML markup (tags, style blocks, etc.)
  2. Text between htmltag groups: contains the actual text content, except where it is enclosed by \htmlrtf ... \htmlrtf0, which marks RTF-only rendering hints that are not part of the HTML
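
To make the second rule concrete, the following is a minimal sketch of dropping \htmlrtf ... \htmlrtf0 regions from inter-tag text. It is not this class's implementation: the class name HtmlRtfSkipSketch and the method stripHtmlRtf are hypothetical, and the regular expression is a simplification that assumes the region is properly terminated and ignores RTF group nesting.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class HtmlRtfSkipSketch {

    // Matches "\htmlrtf" (but not "\htmlrtf0"), everything up to the next
    // "\htmlrtf0", and the single space that may delimit that control word.
    private static final Pattern HTMLRTF_REGION =
            Pattern.compile("\\\\htmlrtf(?!0).*?\\\\htmlrtf0 ?", Pattern.DOTALL);

    // Removes RTF-only regions so that only HTML-relevant text remains.
    static String stripHtmlRtf(String rtfFragment) {
        return HTMLRTF_REGION.matcher(rtfFragment).replaceAll("");
    }

    public static void main(String[] args) {
        String fragment = "{\\*\\htmltag64 <p>}\\htmlrtf \\par\\htmlrtf0 Hello";
        System.out.println(stripHtmlRtf(fragment));
    }
}
```

A real parser would track the \htmlrtf state while tokenizing control words instead of using a regular expression, since the toggle can span group boundaries.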

Per the MS-OXRTFEX specification, \'xx hex escapes in inter-tag text are decoded using the code page of the currently selected font (\fN). The font-to-charset mapping is built from the RTF font table's \fcharsetN declarations. Inside {\*\htmltag} groups, the document's default code page (\ansicpgN) is used.
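
The decoding rule above can be sketched as follows. This is an illustrative example under stated assumptions, not the code used by this class: the class HexEscapeSketch and its methods are hypothetical, the \fcharsetN table is a small subset of the commonly documented values, and the charset names are assumed to be available in the running JDK. Consecutive \'xx escapes are collected into one byte run before decoding so that multi-byte code pages would decode correctly.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.Map;

// Hypothetical helper, for illustration only.
class HexEscapeSketch {

    // Partial \fcharsetN -> charset name table (illustrative subset).
    static final Map<Integer, String> FCHARSET_TO_CHARSET = Map.of(
            0, "windows-1252",    // ANSI
            161, "windows-1253",  // Greek
            204, "windows-1251",  // Cyrillic
            238, "windows-1250"); // Eastern European

    // Resolves a font's \fcharsetN value, falling back to the document
    // default (for example the \ansicpgN code page) when it is unknown.
    static Charset charsetForFcharset(int fcharset, Charset fallback) {
        String name = FCHARSET_TO_CHARSET.get(fcharset);
        return name == null ? fallback : Charset.forName(name);
    }

    // Decodes \'xx escapes in the text using the given charset; all other
    // characters are copied through unchanged.
    static String decodeHexEscapes(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream pending = new ByteArrayOutputStream();
        int i = 0;
        while (i < text.length()) {
            if (i + 4 <= text.length()
                    && text.charAt(i) == '\\' && text.charAt(i + 1) == '\'') {
                int hi = Character.digit(text.charAt(i + 2), 16);
                int lo = Character.digit(text.charAt(i + 3), 16);
                if (hi >= 0 && lo >= 0) {
                    pending.write((hi << 4) | lo);
                    i += 4;
                    continue;
                }
            }
            flush(pending, out, charset);
            out.append(text.charAt(i));
            i++;
        }
        flush(pending, out, charset);
        return out.toString();
    }

    // Decodes any buffered escape bytes as one run and appends the result.
    private static void flush(ByteArrayOutputStream pending,
                              StringBuilder out, Charset charset) {
        if (pending.size() > 0) {
            out.append(new String(pending.toByteArray(), charset));
            pending.reset();
        }
    }
}
```

With this sketch, the same escape \'e9 decodes to a different character depending on whether the current font maps to windows-1252 or windows-1251, which is why the font table has to be consulted for inter-tag text.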

  • Constructor Details

    • RTFEncapsulatedHTMLExtractor

      public RTFEncapsulatedHTMLExtractor()

  • Method Details

    • extract

      public static String extract(byte[] rtfBytes)
      Extracts the HTML content from an encapsulated-HTML RTF document.
      Parameters:
      rtfBytes - the decompressed RTF bytes
      Returns:
      the extracted HTML string, or null if the RTF does not contain encapsulated HTML
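
Because extract returns null when the RTF carries no encapsulated HTML, callers need to handle that case explicitly. The sketch below shows one possible way to do so; the class ExtractUsageSketch and the method htmlBody are hypothetical and not part of Tika. The extractor is passed in as a function so the example stays self-contained; in real code the method reference RTFEncapsulatedHTMLExtractor::extract would be supplied, assuming Tika is on the classpath.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical caller-side helper, for illustration only.
class ExtractUsageSketch {

    // Wraps the nullable result of an extract-style function in an Optional.
    // Intended call (assumption: Tika on the classpath):
    //   htmlBody(rtfBytes, RTFEncapsulatedHTMLExtractor::extract)
    static Optional<String> htmlBody(byte[] rtfBytes,
                                     Function<byte[], String> extractor) {
        return Optional.ofNullable(extractor.apply(rtfBytes));
    }

    public static void main(String[] args) {
        // Stub standing in for the real extractor in this sketch.
        Function<byte[], String> stub = bytes -> null;
        String body = htmlBody(new byte[0], stub)
                .orElse("(no encapsulated HTML; fall back to the RTF body)");
        System.out.println(body);
    }
}
```

Returning an Optional is only one design choice; a plain null check at the call site works equally well.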