org.apache.tika.parser.microsoft.msg.RTFEncapsulatedHTMLExtractor

public class RTFEncapsulatedHTMLExtractor extends Object

Extracts the original HTML from an RTF document that contains encapsulated HTML (as indicated by the \fromhtml1 control word).

The encapsulated HTML format stores HTML in two places:

- {\*\htmltag<N> ...} groups: contain the HTML markup (tags, style blocks, etc.)
- Text between htmltag groups: contains the actual text content, provided it is NOT wrapped in \htmlrtf ... \htmlrtf0 (which marks RTF-only rendering hints)
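
As a rough illustration of the \htmlrtf rule above, the following sketch drops every span between \htmlrtf and \htmlrtf0 from a fragment of inter-tag text. It is not the extractor's actual implementation: the class name HtmlRtfSketch and the method dropRtfOnly are hypothetical, and the regular expression assumes the markers are not nested inside groups and that no escaped backslashes precede them.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of Tika.
final class HtmlRtfSketch {
    // Matches "\htmlrtf" (but not "\htmlrtf0"), everything up to the next
    // "\htmlrtf0", and the single delimiter space that may follow it.
    private static final Pattern RTF_ONLY =
            Pattern.compile("\\\\htmlrtf(?!0).*?\\\\htmlrtf0 ?", Pattern.DOTALL);

    // Removes RTF-only rendering hints so that only HTML text content remains.
    static String dropRtfOnly(String rtfFragment) {
        return RTF_ONLY.matcher(rtfFragment).replaceAll("");
    }
}
```

A real parser would track these markers while tokenizing control words rather than with a regular expression; the sketch only shows which text is kept and which is discarded.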

Per the MS-OXRTFEX specification, \'xx hex escapes in inter-tag text are decoded using the code page of the currently selected font (\fN). The font-to-charset mapping is built from the RTF font table's \fcharsetN declarations. Inside {\*\htmltag} groups, the document's default code page (\ansicpgN) is used.
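
The decoding rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the class's real code: HexEscapeSketch, decodeHexEscapes, and the small \fcharsetN table are hypothetical, the table covers only four charsets, and the caller is assumed to already know which \fcharsetN applies to the current font. Consecutive \'xx escapes are collected into one byte run before decoding so that multi-byte code pages would also decode correctly.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of Tika.
final class HexEscapeSketch {
    // Illustrative subset of \fcharsetN values mapped to Java charset names.
    static final Map<Integer, String> FCHARSET_TO_CHARSET = Map.of(
            0, "windows-1252",    // ANSI
            161, "windows-1253",  // Greek
            204, "windows-1251",  // Cyrillic
            238, "windows-1250"); // Eastern European

    // Decodes \'xx escapes in rtfText using the charset selected by fcharset.
    // Unknown fcharset values fall back to windows-1252 (an assumption of this sketch).
    static String decodeHexEscapes(String rtfText, int fcharset) {
        Charset cs = Charset.forName(
                FCHARSET_TO_CHARSET.getOrDefault(fcharset, "windows-1252"));
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream pending = new ByteArrayOutputStream();
        int i = 0;
        while (i < rtfText.length()) {
            if (rtfText.startsWith("\\'", i) && i + 4 <= rtfText.length()) {
                // Two hex digits follow the \' prefix; buffer the raw byte.
                pending.write(Integer.parseInt(rtfText.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                flush(pending, cs, out);
                out.append(rtfText.charAt(i));
                i++;
            }
        }
        flush(pending, cs, out);
        return out.toString();
    }

    // Decodes any buffered bytes with the given charset and appends the result.
    private static void flush(ByteArrayOutputStream pending, Charset cs, StringBuilder out) {
        if (pending.size() > 0) {
            out.append(new String(pending.toByteArray(), cs));
            pending.reset();
        }
    }
}
```

For example, the bytes \'cf\'f0 decode to two Cyrillic letters under \fcharset204, while under \fcharset0 the same bytes would decode to Latin-1 range characters, which is why the currently selected font matters.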

Constructor Summary

Constructors

Constructor                       Description
RTFEncapsulatedHTMLExtractor()

Method Summary

Modifier and Type   Method                     Description
static String       extract(byte[] rtfBytes)   Extracts the HTML content from an encapsulated-HTML RTF document.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- RTFEncapsulatedHTMLExtractor
  
  public RTFEncapsulatedHTMLExtractor()
Method Details
- extract
  
  public static String extract(byte[] rtfBytes)
  
  Extracts the HTML content from an encapsulated-HTML RTF document.
  
  Parameters:
  
  rtfBytes - the decompressed RTF bytes
  
  Returns:
  
  the extracted HTML string, or null if the RTF does not contain encapsulated HTML
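
Because extract returns null when the RTF carries no encapsulated HTML, callers need a null check. The sketch below shows one way to wrap that contract in an Optional. To stay runnable without Tika on the classpath, it takes the extractor as a Function; ExtractUsageSketch and htmlBody are hypothetical names, and the stub in main stands in for RTFEncapsulatedHTMLExtractor::extract.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical caller-side helper, for illustration only; not part of Tika.
final class ExtractUsageSketch {
    // Wraps any byte[] -> String extractor whose contract is
    // "null when there is no encapsulated HTML" in an Optional.
    static Optional<String> htmlBody(byte[] rtfBytes, Function<byte[], String> extractor) {
        return Optional.ofNullable(extractor.apply(rtfBytes));
    }

    public static void main(String[] args) {
        byte[] rtf = "{\\rtf1\\ansi plain text}".getBytes(StandardCharsets.US_ASCII);
        // With Tika available, pass RTFEncapsulatedHTMLExtractor::extract
        // here instead of this stub, which always reports "no HTML".
        Function<byte[], String> stub = bytes -> null;
        System.out.println(htmlBody(rtf, stub).orElse("(no encapsulated HTML)"));
    }
}
```

Note that the parameter is documented as the decompressed RTF bytes, so any decompression of the message body is assumed to happen before this call.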
