org.apache.tika.sax
Class SafeContentHandler
java.lang.Object
org.xml.sax.helpers.DefaultHandler
org.apache.tika.sax.ContentHandlerDecorator
org.apache.tika.sax.SafeContentHandler
- All Implemented Interfaces:
- ContentHandler, DTDHandler, EntityResolver, ErrorHandler
- Direct Known Subclasses:
- XHTMLContentHandler, XMPContentHandler
public class SafeContentHandler
extends ContentHandlerDecorator
Content handler decorator that makes sure that the character events
(characters(char[], int, int) or ignorableWhitespace(char[], int, int))
passed to the decorated content handler contain only valid XML characters.
All invalid characters are replaced with spaces.
The XML standard defines the following Unicode character ranges as
valid XML characters:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Note that this class currently detects only those invalid characters whose
UTF-16 representation fits in a single char. It also does not verify that
the UTF-16 encoding of incoming characters is correct.
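The replacement behavior described above can be illustrated with a minimal sketch that uses only the JDK's SAX classes. It is not Tika's implementation: the names SafeCharsSketch, SpaceReplacingHandler, isInvalidChar and sanitize are hypothetical, and the sketch assumes that surrogate halves are passed through unchecked, in line with the single-char limitation noted above.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Illustrative sketch only; all names here are hypothetical, not Tika API. */
final class SafeCharsSketch {

    /** Single-char check: surrogate halves are passed through unverified. */
    static boolean isInvalidChar(char c) {
        if (c < 0x20) {
            return c != 0x9 && c != 0xA && c != 0xD;
        }
        return c == 0xFFFE || c == 0xFFFF;
    }

    /** Decorator that forwards character events with invalid chars replaced by spaces. */
    static final class SpaceReplacingHandler extends DefaultHandler {
        private final ContentHandler delegate;

        SpaceReplacingHandler(ContentHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            // Copy the slice so the caller's buffer is left unmodified.
            char[] copy = new char[length];
            for (int i = 0; i < length; i++) {
                char c = ch[start + i];
                copy[i] = isInvalidChar(c) ? ' ' : c;
            }
            delegate.characters(copy, 0, length);
        }
    }

    /** Runs the text through the decorator and returns what the delegate received. */
    static String sanitize(String text) {
        final StringBuilder out = new StringBuilder();
        ContentHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        ContentHandler safe = new SpaceReplacingHandler(collector);
        char[] chars = text.toCharArray();
        try {
            safe.characters(chars, 0, chars.length);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The NUL character is not a valid XML character and becomes a space.
        System.out.println(sanitize("bad" + (char) 0x0 + "char"));
    }
}
```

With Tika on the classpath, the equivalent wiring would presumably be new SafeContentHandler(handler), using the constructor listed below.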
Nested Class Summary
protected static interface SafeContentHandler.Output
Internal interface that allows both character and
ignorable whitespace content to be filtered the same way.
Methods inherited from class org.xml.sax.helpers.DefaultHandler
error, fatalError, notationDecl, resolveEntity, unparsedEntityDecl, warning
SafeContentHandler
public SafeContentHandler(ContentHandler handler)
isInvalid
protected boolean isInvalid(int ch)
- Checks whether the given Unicode character is an invalid XML character
and should be replaced for output. Subclasses can override this method
to use an alternative definition of which characters should be replaced
in the XML output. The default definition from the XML specification is:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
- Parameters:
ch - character
- Returns:
true if the character should be replaced, false otherwise
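As an illustration of the Char production quoted above, the following standalone sketch tests a code point against each range. The class name XmlCharCheck and method name isInvalidXmlChar are hypothetical. Unlike the class documented here, the sketch takes a full code point, so it also flags surrogate values and anything beyond #x10FFFF.

```java
/** Illustrative sketch only; not Tika's implementation of isInvalid. */
final class XmlCharCheck {

    private XmlCharCheck() {
    }

    /** Returns true if the code point is not matched by the XML Char production. */
    static boolean isInvalidXmlChar(int ch) {
        if (ch < 0x20) {
            // Only tab, line feed and carriage return are allowed below #x20.
            return ch != 0x9 && ch != 0xA && ch != 0xD;
        }
        if (ch <= 0xD7FF) {
            return false;
        }
        if (ch < 0xE000) {
            return true; // surrogate range #xD800-#xDFFF
        }
        if (ch <= 0xFFFD) {
            return false;
        }
        if (ch < 0x10000) {
            return true; // #xFFFE and #xFFFF
        }
        return ch > 0x10FFFF;
    }

    public static void main(String[] args) {
        System.out.println(isInvalidXmlChar(0x0B)); // true: vertical tab is not a valid XML character
        System.out.println(isInvalidXmlChar('A'));  // false
    }
}
```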
writeReplacement
protected void writeReplacement(SafeContentHandler.Output output)
throws SAXException
- Outputs the replacement for an invalid character. Subclasses can
override this method to use a custom replacement.
- Parameters:
output - where the replacement is written to
- Throws:
SAXException - if the replacement could not be written
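The override hook can be sketched with JDK-only stand-ins. Everything in this block is hypothetical: the local Output interface assumes a single write(char[], int, int) method, which this page does not show, and it drops the SAXException declaration for brevity. The Filter class only mimics the documented default (a space) so that a subclass can swap in U+FFFD instead.

```java
/** Illustrative sketch only; the types below are stand-ins, not Tika classes. */
final class ReplacementHookSketch {

    /** Hypothetical stand-in for SafeContentHandler.Output; its shape is assumed. */
    interface Output {
        void write(char[] ch, int start, int length);
    }

    /** Mimics the documented default: invalid characters become a space. */
    static class Filter {
        /** Simplified check covering only control characters below #x20. */
        protected boolean isInvalid(int ch) {
            return ch < 0x20 && ch != 0x9 && ch != 0xA && ch != 0xD;
        }

        protected void writeReplacement(Output output) {
            output.write(new char[] {' '}, 0, 1);
        }

        final void filter(char[] ch, int start, int length, Output output) {
            int end = start + length;
            int runStart = start;
            for (int i = start; i < end; i++) {
                if (isInvalid(ch[i])) {
                    // Flush the valid run seen so far, then emit the replacement.
                    if (i > runStart) {
                        output.write(ch, runStart, i - runStart);
                    }
                    writeReplacement(output);
                    runStart = i + 1;
                }
            }
            if (end > runStart) {
                output.write(ch, runStart, end - runStart);
            }
        }
    }

    /** Subclass that writes the Unicode replacement character instead of a space. */
    static class MarkerFilter extends Filter {
        private static final char[] MARKER = {(char) 0xFFFD};

        @Override
        protected void writeReplacement(Output output) {
            output.write(MARKER, 0, MARKER.length);
        }
    }

    /** Runs the text through the given filter and collects the output. */
    static String run(Filter filter, String text) {
        StringBuilder sb = new StringBuilder();
        char[] chars = text.toCharArray();
        filter.filter(chars, 0, chars.length, (ch, s, n) -> sb.append(ch, s, n));
        return sb.toString();
    }
}
```

Routing the replacement through one overridable method lets the same filtering loop serve both character and ignorable whitespace events, with the subclass deciding only what is emitted in place of an invalid character.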
startElement
public void startElement(String uri,
String localName,
String name,
Attributes atts)
throws SAXException
- Specified by:
startElement in interface ContentHandler
- Overrides:
startElement in class ContentHandlerDecorator
- Throws:
SAXException
endElement
public void endElement(String uri,
String localName,
String name)
throws SAXException
- Specified by:
endElement in interface ContentHandler
- Overrides:
endElement in class ContentHandlerDecorator
- Throws:
SAXException
endDocument
public void endDocument()
throws SAXException
- Specified by:
endDocument in interface ContentHandler
- Overrides:
endDocument in class ContentHandlerDecorator
- Throws:
SAXException
characters
public void characters(char[] ch,
int start,
int length)
throws SAXException
- Specified by:
characters in interface ContentHandler
- Overrides:
characters in class ContentHandlerDecorator
- Throws:
SAXException
ignorableWhitespace
public void ignorableWhitespace(char[] ch,
int start,
int length)
throws SAXException
- Specified by:
ignorableWhitespace in interface ContentHandler
- Overrides:
ignorableWhitespace in class ContentHandlerDecorator
- Throws:
SAXException
Copyright © 2007-2011 The Apache Software Foundation. All Rights Reserved.