Package org.apache.tika.sax
Class SafeContentHandler
- java.lang.Object
  - org.xml.sax.helpers.DefaultHandler
    - org.apache.tika.sax.ContentHandlerDecorator
      - org.apache.tika.sax.SafeContentHandler
- All Implemented Interfaces:
ContentHandler, DTDHandler, EntityResolver, ErrorHandler
- Direct Known Subclasses:
XHTMLContentHandler, XMPContentHandler
public class SafeContentHandler extends ContentHandlerDecorator
Content handler decorator that makes sure that the character events (characters(char[], int, int) or ignorableWhitespace(char[], int, int)) passed to the decorated content handler contain only valid XML characters. All invalid characters are replaced with the Unicode replacement character U+FFFD (though a subclass may change this by overriding the writeReplacement(Output) method).
The XML standard defines the following Unicode character ranges as valid XML characters:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Note that currently this class only detects those invalid characters whose UTF-16 representation fits a single char. Also, this class does not ensure that the UTF-16 encoding of incoming characters is correct.
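The range check above can be sketched in plain Java. This is a minimal, JDK-only illustration and not the Tika source: the class name XmlChars and both method names are hypothetical. The sketch iterates by code point, so a lone surrogate is also replaced; that is a choice made here for simplicity, not a statement about how SafeContentHandler treats malformed UTF-16.

```java
/** Hypothetical helper illustrating the XML 1.0 Char ranges listed above. */
final class XmlChars {
    private XmlChars() {}

    /** Returns true if the code point falls in one of the valid XML character ranges. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Copies the text, substituting U+FFFD for every code point outside those ranges. */
    static String sanitize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            out.appendCodePoint(isValidXmlChar(cp) ? cp : 0xFFFD);
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return out.toString();
    }
}
```

Tab, line feed and carriage return are the only characters below #x20 that pass the check; #xFFFE and #xFFFF are rejected even though they fit in a single char.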
-
-
Nested Class Summary
Modifier and Type | Class | Description
protected static interface | SafeContentHandler.Output | Internal interface that allows both character and ignorable whitespace content to be filtered the same way.
-
Constructor Summary
Constructor | Description
SafeContentHandler(ContentHandler handler) |
-
Method Summary
Modifier and Type | Method | Description
void | characters(char[] ch, int start, int length) |
void | endDocument() |
void | endElement(String uri, String localName, String name) |
void | ignorableWhitespace(char[] ch, int start, int length) |
protected boolean | isInvalid(int ch) | Checks whether the given Unicode character is an invalid XML character and should be replaced for output.
void | startElement(String uri, String localName, String name, Attributes atts) |
protected void | writeReplacement(SafeContentHandler.Output output) | Outputs the replacement for an invalid character.
-
Methods inherited from class org.apache.tika.sax.ContentHandlerDecorator
endPrefixMapping, handleException, processingInstruction, setContentHandler, setDocumentLocator, skippedEntity, startDocument, startPrefixMapping, toString
-
Methods inherited from class org.xml.sax.helpers.DefaultHandler
error, fatalError, notationDecl, resolveEntity, unparsedEntityDecl, warning
-
-
-
-
Constructor Detail
-
SafeContentHandler
public SafeContentHandler(ContentHandler handler)
-
-
Method Detail
-
isInvalid
protected boolean isInvalid(int ch)
Checks whether the given Unicode character is an invalid XML character and should be replaced for output. Subclasses can override this method to use an alternative definition of which characters should be replaced in the XML output. The default definition from the XML specification is:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
- Parameters:
ch - character
- Returns:
true if the character should be replaced, false otherwise
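As an illustration of an "alternative definition", the sketch below is a stricter predicate with the same int-to-boolean shape as isInvalid(int). It is JDK-only so it can be compiled without Tika on the classpath; the class name StrictXmlCharPolicy is hypothetical. In addition to everything outside the Char production, it flags the control ranges [#x7F-#x84] and [#x86-#x9F], which the XML 1.0 specification notes as discouraged. Whether that policy suits a given application is an assumption of this example.

```java
/** Hypothetical stricter policy; the body could serve as an isInvalid(int) override. */
final class StrictXmlCharPolicy {
    private StrictXmlCharPolicy() {}

    /** Returns true if the code point should be replaced under the stricter policy. */
    static boolean isInvalid(int ch) {
        // Default rule: anything outside the XML 1.0 Char production.
        boolean outsideCharProduction = !(ch == 0x9 || ch == 0xA || ch == 0xD
                || (ch >= 0x20 && ch <= 0xD7FF)
                || (ch >= 0xE000 && ch <= 0xFFFD)
                || (ch >= 0x10000 && ch <= 0x10FFFF));
        // Extra rule of this sketch: DEL and C1 controls, except NEL (#x85).
        boolean discouragedControl = (ch >= 0x7F && ch <= 0x84)
                || (ch >= 0x86 && ch <= 0x9F);
        return outsideCharProduction || discouragedControl;
    }
}
```

In a real subclass, the same expression would go in a protected boolean isInvalid(int ch) method that overrides the one documented here.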
-
writeReplacement
protected void writeReplacement(SafeContentHandler.Output output) throws SAXException
Outputs the replacement for an invalid character. Subclasses can override this method to use a custom replacement.
- Parameters:
output - where the replacement is written to
- Throws:
SAXException - if the replacement could not be written
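To show what a custom replacement might look like, here is a JDK-only sketch of the filter-and-replace pattern. Everything in it is hypothetical: the Sink interface only stands in for SafeContentHandler.Output (whose members are not listed on this page), it omits the SAXException that the real method declares, and the filtering loop is an assumption about the general approach rather than Tika's code. The replacement chosen here is a question mark instead of U+FFFD.

```java
/** Hypothetical stand-alone model of filtering text with a custom replacement. */
final class ReplacementSketch {
    private ReplacementSketch() {}

    /** Stand-in for an output target that accepts runs of characters. */
    interface Sink {
        void write(char[] ch, int start, int length);
    }

    private static final char[] QUESTION_MARK = {'?'};

    /** Custom replacement: writes '?' for each invalid character. */
    static void writeReplacement(Sink sink) {
        sink.write(QUESTION_MARK, 0, QUESTION_MARK.length);
    }

    /** Default-style validity rule based on the XML 1.0 Char production. */
    static boolean isInvalid(int cp) {
        return !(cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF));
    }

    /** Forwards valid runs unchanged and calls writeReplacement for invalid characters. */
    static String filter(String text) {
        StringBuilder out = new StringBuilder();
        Sink sink = (ch, start, length) -> out.append(ch, start, length);
        char[] chars = text.toCharArray();
        int runStart = 0; // start of the pending run of valid characters
        int i = 0;
        while (i < chars.length) {
            int cp = Character.codePointAt(chars, i);
            int width = Character.charCount(cp);
            if (isInvalid(cp)) {
                if (i > runStart) {
                    sink.write(chars, runStart, i - runStart); // flush valid run
                }
                writeReplacement(sink);
                runStart = i + width;
            }
            i += width;
        }
        if (chars.length > runStart) {
            sink.write(chars, runStart, chars.length - runStart); // flush the tail
        }
        return out.toString();
    }
}
```

Writing valid characters in runs rather than one at a time keeps the number of calls to the sink small when the input is mostly clean.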
-
startElement
public void startElement(String uri, String localName, String name, Attributes atts) throws SAXException
- Specified by:
startElement in interface ContentHandler
- Overrides:
startElement in class ContentHandlerDecorator
- Throws:
SAXException
-
endElement
public void endElement(String uri, String localName, String name) throws SAXException
- Specified by:
endElement in interface ContentHandler
- Overrides:
endElement in class ContentHandlerDecorator
- Throws:
SAXException
-
endDocument
public void endDocument() throws SAXException
- Specified by:
endDocument in interface ContentHandler
- Overrides:
endDocument in class ContentHandlerDecorator
- Throws:
SAXException
-
characters
public void characters(char[] ch, int start, int length) throws SAXException
- Specified by:
characters in interface ContentHandler
- Overrides:
characters in class ContentHandlerDecorator
- Throws:
SAXException
-
ignorableWhitespace
public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException
- Specified by:
ignorableWhitespace in interface ContentHandler
- Overrides:
ignorableWhitespace in class ContentHandlerDecorator
- Throws:
SAXException
-
-