Package org.apache.tika.parser.html
Interface HtmlMapper
All Known Implementing Classes:
DefaultHtmlMapper, IdentityHtmlMapper
public interface HtmlMapper
HTML mapper used to make incoming HTML documents easier to handle by Tika clients. The HtmlParser looks up an optional HTML mapper from the parse context and uses it to map parsed HTML to "safe" XHTML. A client that wants to customize this mapping can place a custom HtmlMapper instance into the parse context.
Since:
Apache Tika 0.6
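As a rough illustration of such a customization, the sketch below wraps an existing mapper and lets a few additional elements through. It is not taken from Tika itself: the `HtmlMapper` declared here is a local stand-in that only copies the three documented method signatures so the example compiles without Tika on the classpath, and `ExtraElementsMapper` and its table of extra elements are hypothetical.

```java
import java.util.Locale;
import java.util.Map;

// Local stand-in mirroring the three documented method signatures.
// With Tika available, import org.apache.tika.parser.html.HtmlMapper instead.
interface HtmlMapper {
    String mapSafeElement(String name);
    boolean isDiscardElement(String name);
    String mapSafeAttribute(String elementName, String attributeName);
}

// Hypothetical mapper: delegates to another mapper, but additionally
// treats a few extra elements as safe.
class ExtraElementsMapper implements HtmlMapper {
    // Assumed extras: upper-case HTML name -> lower-case XHTML name.
    private static final Map<String, String> EXTRA = Map.of(
            "ARTICLE", "article",
            "SECTION", "section",
            "NAV", "nav");

    private final HtmlMapper delegate;

    ExtraElementsMapper(HtmlMapper delegate) {
        this.delegate = delegate;
    }

    @Override
    public String mapSafeElement(String name) {
        // Check the extra table first, then fall back to the wrapped mapper.
        String extra = EXTRA.get(name.toUpperCase(Locale.ROOT));
        return extra != null ? extra : delegate.mapSafeElement(name);
    }

    @Override
    public boolean isDiscardElement(String name) {
        return delegate.isDiscardElement(name);
    }

    @Override
    public String mapSafeAttribute(String elementName, String attributeName) {
        return delegate.mapSafeAttribute(elementName, attributeName);
    }
}
```

With the real interface, an instance like this would typically be registered through `ParseContext.set(HtmlMapper.class, mapper)` before the context is passed to the parser.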
Method Summary
boolean isDiscardElement(String name)
    Checks whether all content within the given HTML element should be discarded instead of including it in the parse output.
String mapSafeAttribute(String elementName, String attributeName)
    Maps "safe" HTML attribute names to semantic XHTML equivalents.
String mapSafeElement(String name)
    Maps "safe" HTML element names to semantic XHTML equivalents.
Method Detail
mapSafeElement
String mapSafeElement(String name)
Maps "safe" HTML element names to semantic XHTML equivalents. If the given element is unknown or deemed unsafe for inclusion in the parse output, then this method returns null and the element will be ignored, but the content inside it is still processed. See the isDiscardElement(String) method for a way to discard the entire contents of an element.
Parameters:
name - HTML element name (upper case)
Returns:
XHTML element name (lower case), or null if the element is unsafe
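The contract above (upper-case name in, lower-case name or null out) can be sketched with a plain lookup table. The class name `SafeElementTable` and the entries in the table are illustrative assumptions, not the defaults of any Tika implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical table-driven version of the element mapping.
class SafeElementTable {
    // Assumed entries: upper-case HTML name -> lower-case XHTML name.
    private static final Map<String, String> SAFE = new HashMap<>();
    static {
        SAFE.put("P", "p");
        SAFE.put("H1", "h1");
        SAFE.put("B", "b");
        SAFE.put("STRONG", "b"); // several HTML names may share one XHTML name
        SAFE.put("EM", "i");
    }

    // Returns the XHTML name, or null when the element is unknown or unsafe.
    static String mapSafeElement(String name) {
        return SAFE.get(name);
    }
}
```

Because the lookup returns null for anything outside the table, unknown elements fall into the "ignored, content still processed" case without extra code.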
isDiscardElement
boolean isDiscardElement(String name)
Checks whether all content within the given HTML element should be discarded instead of including it in the parse output.
Parameters:
name - HTML element name (upper case)
Returns:
true if content inside the named element should be ignored, false otherwise
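The difference between an ignored element (mapSafeElement returns null, content kept) and a discarded element (isDiscardElement returns true, content dropped) may be easier to see in a toy filter. This is only a sketch of the documented behaviour, not how HtmlParser is implemented: the token format, the class `DiscardVersusIgnore`, and both tables are assumptions, and the input is assumed to be well nested.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy filter over a flat token list: "<NAME>", "</NAME>", or plain text.
class DiscardVersusIgnore {
    // Assumed tables, for illustration only.
    static final Map<String, String> SAFE = Map.of("P", "p", "B", "b");
    static final Set<String> DISCARD = Set.of("SCRIPT", "STYLE");

    static String mapSafeElement(String name) {
        return SAFE.get(name);
    }

    static boolean isDiscardElement(String name) {
        return DISCARD.contains(name);
    }

    static String filter(List<String> tokens) {
        StringBuilder out = new StringBuilder();
        int discardDepth = 0; // > 0 while inside a discarded element
        for (String t : tokens) {
            if (t.startsWith("</")) {
                String name = t.substring(2, t.length() - 1);
                if (isDiscardElement(name)) {
                    discardDepth--;
                } else if (discardDepth == 0 && mapSafeElement(name) != null) {
                    out.append("</").append(mapSafeElement(name)).append('>');
                }
            } else if (t.startsWith("<")) {
                String name = t.substring(1, t.length() - 1);
                if (isDiscardElement(name)) {
                    discardDepth++;
                } else if (discardDepth == 0 && mapSafeElement(name) != null) {
                    out.append('<').append(mapSafeElement(name)).append('>');
                }
                // Unmapped, non-discarded tags are dropped; their content is kept.
            } else if (discardDepth == 0) {
                out.append(t);
            }
        }
        return out.toString();
    }
}
```

For the tokens `<P>`, `a`, `<FONT>`, `b`, `</FONT>`, `<SCRIPT>`, `x()`, `</SCRIPT>`, `</P>`, the filter keeps the text inside the unmapped FONT element but drops everything inside SCRIPT, producing `<p>ab</p>`.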
mapSafeAttribute
String mapSafeAttribute(String elementName, String attributeName)
Maps "safe" HTML attribute names to semantic XHTML equivalents. If the given attribute is unknown or deemed unsafe for inclusion in the parse output, then this method returns null and the attribute will be ignored. This method assumes that the element name is valid and normalised.
Parameters:
elementName - HTML element name (lower case)
attributeName - HTML attribute name (lower case)
Returns:
XHTML attribute name (lower case), or null if the attribute is unsafe
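One way to picture the attribute mapping is a per-element whitelist, as in the sketch below. The class `SafeAttributeTable` and the whitelist contents are hypothetical, and both arguments are assumed to arrive already lower-cased, as the description above states.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical per-element attribute whitelist.
class SafeAttributeTable {
    // Assumed whitelist: element name -> attribute names allowed on it.
    private static final Map<String, Set<String>> SAFE = Map.of(
            "a", Set.of("href", "rel", "name"),
            "img", Set.of("src", "alt"));

    // Returns the attribute name unchanged when it is whitelisted for the
    // element, or null when it is unknown or unsafe.
    static String mapSafeAttribute(String elementName, String attributeName) {
        Set<String> allowed = SAFE.get(elementName);
        if (allowed != null && allowed.contains(attributeName)) {
            return attributeName;
        }
        return null;
    }
}
```

Keying the whitelist by element name means an attribute such as `href` can be accepted on `a` while still being rejected on elements where it has no meaning.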