Class RTFEncapsulatedHTMLExtractor
java.lang.Object
org.apache.tika.parser.microsoft.msg.RTFEncapsulatedHTMLExtractor
Extracts the original HTML from an RTF document that contains encapsulated HTML
(as indicated by the
\fromhtml1 control word).
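As a rough illustration of the \fromhtml1 check, the sketch below scans the raw RTF bytes for the control word. The class and method names (FromHtmlCheckSketch, looksLikeEncapsulatedHtml) are hypothetical and are not part of Tika. The search is a heuristic: it does not parse RTF groups, so it could also match an escaped backslash followed by the same letters in body text.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper, not Tika's implementation.
final class FromHtmlCheckSketch {

    // Match \fromhtml1 only when it is not the prefix of a longer
    // control word or a longer numeric parameter (e.g. \fromhtml10).
    private static final Pattern FROM_HTML =
            Pattern.compile("\\\\fromhtml1(?![0-9a-zA-Z])");

    public static boolean looksLikeEncapsulatedHtml(byte[] rtfBytes) {
        // Control words are plain ASCII, so a byte-preserving decode
        // (ISO-8859-1 maps each byte to one char) is enough for searching.
        String text = new String(rtfBytes, StandardCharsets.ISO_8859_1);
        return FROM_HTML.matcher(text).find();
    }

    public static void main(String[] args) {
        byte[] sample = "{\\rtf1\\ansi\\ansicpg1252\\fromhtml1 \\deff0}"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeEncapsulatedHtml(sample));
    }
}
```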
The encapsulated HTML format stores HTML in two places:
- {\*\htmltag<N> ...} groups: contain the HTML markup (tags, style blocks, etc.)
- Text between htmltag groups: contains the actual text content, provided it is NOT
  wrapped in \htmlrtf ... \htmlrtf0 (which marks RTF-only rendering hints)
Per the MS-OXRTFEX specification, \'xx hex escapes in inter-tag text are decoded
using the code page of the currently selected font (\fN). The font-to-charset mapping
is built from the RTF font table's \fcharsetN declarations. Inside
{\*\htmltag} groups, the document's default code page (\ansicpgN) is used.
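The decoding rule above can be sketched as follows, assuming the caller already knows the \fcharsetN value of the current font. HexEscapeSketch and its methods are hypothetical names, not Tika API, and the charset table lists only three common \fcharset values. Treating \fcharset0 as windows-1252 is also an assumption of this sketch. Consecutive escapes are buffered and decoded together so that multi-byte code pages would still decode correctly.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

// Hypothetical helper, not Tika's implementation.
final class HexEscapeSketch {

    // Partial \fcharsetN -> charset table; unknown values use the fallback
    // (for example, the charset derived from \ansicpgN).
    public static Charset charsetForFcharset(int fcharset, Charset fallback) {
        switch (fcharset) {
            case 0:   return Charset.forName("windows-1252"); // ANSI (assumed)
            case 161: return Charset.forName("windows-1253"); // Greek
            case 204: return Charset.forName("windows-1251"); // Russian
            default:  return fallback;
        }
    }

    // Decodes \'xx escapes with the given charset; other chars pass through.
    public static String decodeHexEscapes(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream pending = new ByteArrayOutputStream();
        int i = 0;
        while (i < text.length()) {
            if (isHexEscape(text, i)) {
                // Collect the raw byte; decoding waits until the run ends.
                pending.write(Integer.parseInt(text.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                flush(pending, out, charset);
                out.append(text.charAt(i));
                i++;
            }
        }
        flush(pending, out, charset);
        return out.toString();
    }

    private static boolean isHexEscape(String s, int i) {
        return i + 3 < s.length()
                && s.charAt(i) == '\\' && s.charAt(i + 1) == '\''
                && Character.digit(s.charAt(i + 2), 16) >= 0
                && Character.digit(s.charAt(i + 3), 16) >= 0;
    }

    private static void flush(ByteArrayOutputStream pending, StringBuilder out, Charset cs) {
        if (pending.size() > 0) {
            out.append(new String(pending.toByteArray(), cs));
            pending.reset();
        }
    }

    public static void main(String[] args) {
        Charset fallback = Charset.forName("windows-1252");
        System.out.println(decodeHexEscapes("caf\\'e9", charsetForFcharset(0, fallback)));
    }
}
```

The same byte decodes differently depending on the selected font: \'e9 is U+00E9 under windows-1252 but U+0439 under windows-1251, which is why the font-to-charset mapping matters.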
Constructor Details
RTFEncapsulatedHTMLExtractor
public RTFEncapsulatedHTMLExtractor()
Method Details
extract
Extracts the HTML content from an encapsulated-HTML RTF document.
Parameters:
rtfBytes - the decompressed RTF bytes
Returns:
the extracted HTML string, or null if the RTF does not contain encapsulated HTML