Package org.apache.tika.parser.html
Class DefaultHtmlMapper
java.lang.Object
org.apache.tika.parser.html.DefaultHtmlMapper
- All Implemented Interfaces:
HtmlMapper
The default HTML mapping rules in Tika.
- Since:
- Apache Tika 0.6
-
Field Summary
-
Constructor Summary
-
Method Summary
Modifier and Type, Method, Description
- boolean isDiscardElement(String name): Checks whether all content within the given HTML element should be discarded instead of including it in the parse output.
- String mapSafeAttribute(String elementName, String attributeName): Normalizes an attribute name.
- String mapSafeElement(String name): Maps "safe" HTML element names to semantic XHTML equivalents.
-
Field Details
-
INSTANCE
- Since:
- Apache Tika 0.8
-
Constructor Details
-
DefaultHtmlMapper
public DefaultHtmlMapper()
-
Method Details
-
mapSafeElement
Description copied from interface: HtmlMapper
Maps "safe" HTML element names to semantic XHTML equivalents. If the given element is unknown or deemed unsafe for inclusion in the parse output, this method returns null and the element is ignored, although the content inside it is still processed. See the HtmlMapper.isDiscardElement(String) method for a way to discard the entire contents of an element.
- Specified by:
- mapSafeElement in interface HtmlMapper
- Parameters:
- name - HTML element name (upper case)
- Returns:
- XHTML element name (lower case), or null if the element is unsafe
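The contract above (upper-case name in, lower-case XHTML name or null out) can be sketched with a small lookup table. This is a minimal illustration, not Tika's implementation: the class ElementMappingSketch, its three table entries, and the render helper are all hypothetical, and the block deliberately avoids a Tika dependency so it runs on its own.

```java
import java.util.Locale;
import java.util.Map;

/** Illustrative stand-in for an element mapper; not Tika's actual mapping table. */
final class ElementMappingSketch {

    // Hypothetical table: upper-case HTML names mapped to lower-case XHTML names.
    private static final Map<String, String> SAFE_ELEMENTS = Map.of(
            "P", "p",
            "H1", "h1",
            "B", "b");

    /** Returns the XHTML name for a known element, or null for anything else. */
    static String mapSafeElement(String name) {
        return SAFE_ELEMENTS.get(name);
    }

    /** Shows how a caller might react: a null result drops the tag but keeps the text. */
    static String render(String htmlName, String text) {
        String mapped = mapSafeElement(htmlName.toUpperCase(Locale.ROOT));
        return mapped == null ? text : "<" + mapped + ">" + text + "</" + mapped + ">";
    }

    public static void main(String[] args) {
        System.out.println(render("p", "hello"));    // prints <p>hello</p>
        System.out.println(render("font", "hello")); // prints hello
    }
}
```

The second call shows the documented behaviour for an unmapped element: the tag disappears from the output while the enclosed text survives.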
-
mapSafeAttribute
Normalizes an attribute name. Assumes that the element name is valid and normalized.
- Specified by:
- mapSafeAttribute in interface HtmlMapper
- Parameters:
- elementName - HTML element name (lower case)
- attributeName - HTML attribute name (lower case)
- Returns:
- XHTML attribute name (lower case), or null if the element is unsafe
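One way to read this contract is as a per-element allow-list: an attribute passes through only when it is listed for the given element. The sketch below assumes that reading. The class AttributeMappingSketch and the element and attribute names in its table are hypothetical and are not taken from Tika.

```java
import java.util.Map;
import java.util.Set;

/** Illustrative per-element attribute allow-list; the entries are made up for the example. */
final class AttributeMappingSketch {

    // Hypothetical allow-list keyed by lower-case element name.
    private static final Map<String, Set<String>> SAFE_ATTRIBUTES = Map.of(
            "a", Set.of("href", "name", "title"),
            "img", Set.of("src", "alt"));

    /** Returns the attribute name if it is allowed on the element, otherwise null. */
    static String mapSafeAttribute(String elementName, String attributeName) {
        Set<String> allowed = SAFE_ATTRIBUTES.get(elementName);
        return (allowed != null && allowed.contains(attributeName)) ? attributeName : null;
    }

    public static void main(String[] args) {
        System.out.println(mapSafeAttribute("a", "href"));    // prints href
        System.out.println(mapSafeAttribute("a", "onclick")); // prints null
    }
}
```

Both arguments are expected in lower case, matching the parameter descriptions above. An unknown element yields null for every attribute.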
-
isDiscardElement
Description copied from interface: HtmlMapper
Checks whether all content within the given HTML element should be discarded instead of including it in the parse output.
- Specified by:
- isDiscardElement in interface HtmlMapper
- Parameters:
- name - HTML element name (upper case)
- Returns:
- true if content inside the named element should be ignored, false otherwise
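The difference between an ignored element (tag dropped, content kept) and a discarded one (tag and content dropped) is easiest to see from the caller's side. The following is a rough sketch under stated assumptions: the class DiscardSketch, its flat token format, and the choice of STYLE and SCRIPT as discarded elements are all hypothetical, and a real parser would consume SAX events rather than strings.

```java
import java.util.List;
import java.util.Set;

/** Illustrative consumer of an isDiscardElement-style check over a flat token list. */
final class DiscardSketch {

    // Hypothetical set of elements whose entire content is dropped.
    private static final Set<String> DISCARD = Set.of("STYLE", "SCRIPT");

    static boolean isDiscardElement(String name) {
        return DISCARD.contains(name);
    }

    /** Collects text tokens that are not nested inside any discarded element. */
    static String visibleText(List<String> tokens) {
        StringBuilder out = new StringBuilder();
        int discardDepth = 0; // number of discarded elements currently open
        for (String token : tokens) {
            if (token.startsWith("</")) {
                String name = token.substring(2, token.length() - 1);
                if (isDiscardElement(name) && discardDepth > 0) {
                    discardDepth--;
                }
            } else if (token.startsWith("<")) {
                String name = token.substring(1, token.length() - 1);
                if (isDiscardElement(name)) {
                    discardDepth++;
                }
            } else if (discardDepth == 0) {
                out.append(token); // plain text outside any discarded element
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = List.of(
                "<FONT>", "Hi ", "</FONT>", "<SCRIPT>", "alert(1)", "</SCRIPT>", "there");
        System.out.println(visibleText(tokens)); // prints Hi there
    }
}
```

FONT is not in the discard set, so its text is kept even though the tag itself is not emitted, while everything between the SCRIPT tags is skipped.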
-