org.apache.tika.sax
Class SafeContentHandler
java.lang.Object
org.xml.sax.helpers.DefaultHandler
org.apache.tika.sax.ContentHandlerDecorator
org.apache.tika.sax.SafeContentHandler
- All Implemented Interfaces:
- org.xml.sax.ContentHandler, org.xml.sax.DTDHandler, org.xml.sax.EntityResolver, org.xml.sax.ErrorHandler
- Direct Known Subclasses:
- XHTMLContentHandler
public class SafeContentHandler
- extends ContentHandlerDecorator
Content handler decorator that makes sure that the character events
(characters(char[], int, int) or ignorableWhitespace(char[], int, int))
passed to the decorated content handler contain only valid XML characters.
All invalid characters are replaced with spaces.
The XML standard defines the following Unicode character ranges as
valid XML characters:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Note that currently this class only detects those invalid characters whose
UTF-16 representation fits a single char. Also, this class does not ensure
that the UTF-16 encoding of incoming characters is correct.
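The single-char limitation described above can be illustrated with a small standalone check. This is a minimal sketch, not the Tika implementation: the class XmlCodeUnits and the method isInvalidUnit are hypothetical names, and letting surrogate code units pass is an assumption made to match the note that only invalid characters fitting a single char are detected.

```java
// Hypothetical helper, not part of Tika: classifies a single UTF-16 code unit
// against the XML character ranges listed above.
final class XmlCodeUnits {
    private XmlCodeUnits() {}

    /** Returns true if the code unit on its own is not a valid XML character. */
    static boolean isInvalidUnit(char ch) {
        if (ch < 0x20) {
            // Below #x20 only #x9 (tab), #xA (line feed) and #xD (carriage return) are valid.
            return ch != 0x09 && ch != 0x0A && ch != 0x0D;
        }
        // Assumption: surrogates (#xD800-#xDFFF) are let through, because a
        // well-formed pair encodes a character in [#x10000-#x10FFFF].
        // Whether the pairing is actually well formed is not checked here.
        // #xFFFE and #xFFFF fall outside [#xE000-#xFFFD] and are rejected.
        return ch >= 0xFFFE;
    }
}
```

With this definition a NUL or #xFFFF code unit is reported as invalid, while a tab, an ordinary letter, or a lone surrogate is not.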
Nested Class Summary
protected static interface SafeContentHandler.Output
Internal interface that allows both character and
ignorable whitespace content to be filtered the same way.
Method Summary
void characters(char[] ch, int start, int length)
void ignorableWhitespace(char[] ch, int start, int length)
protected boolean isInvalid(char ch)
Checks whether the given character (more accurately a UTF-16 code unit)
is an invalid XML character and should be replaced for output.
protected void writeReplacement(SafeContentHandler.Output output)
Outputs the replacement for an invalid character.
Methods inherited from class org.apache.tika.sax.ContentHandlerDecorator
endDocument, endElement, endPrefixMapping, handleException, processingInstruction, setDocumentLocator, skippedEntity, startDocument, startElement, startPrefixMapping, toString
Methods inherited from class org.xml.sax.helpers.DefaultHandler
error, fatalError, notationDecl, resolveEntity, unparsedEntityDecl, warning
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
SafeContentHandler
public SafeContentHandler(org.xml.sax.ContentHandler handler)
isInvalid
protected boolean isInvalid(char ch)
- Checks whether the given character (more accurately a UTF-16 code unit)
is an invalid XML character and should be replaced for output.
Subclasses can override this method to use an alternative definition
of which characters should be replaced in the XML output.
- Parameters:
ch - the character (UTF-16 code unit) to check
- Returns:
true if the character should be replaced, false otherwise
writeReplacement
protected void writeReplacement(SafeContentHandler.Output output)
throws org.xml.sax.SAXException
- Outputs the replacement for an invalid character. Subclasses can
override this method to use a custom replacement.
- Parameters:
output - where the replacement is written to
- Throws:
org.xml.sax.SAXException - if the replacement could not be written
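The interplay between isInvalid and writeReplacement can be sketched without Tika on the classpath. The following is a standalone imitation built only on the JDK SAX classes, not the Tika source: SpaceReplacingHandler, CollectingHandler, SafeFilterDemo and the replacement() hook are hypothetical names, and the validity rule is an assumption that leaves surrogate code units untouched.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the decorator: forwards valid runs of characters
// unchanged and substitutes a replacement for each invalid code unit.
class SpaceReplacingHandler extends DefaultHandler {
    private final ContentHandler delegate;

    SpaceReplacingHandler(ContentHandler delegate) {
        this.delegate = delegate;
    }

    // Plays the role of isInvalid(char); subclasses may redefine the rule.
    protected boolean isInvalid(char ch) {
        if (ch < 0x20) {
            return ch != 0x09 && ch != 0x0A && ch != 0x0D;
        }
        return ch >= 0xFFFE; // assumption: surrogates are passed through
    }

    // Plays the role of writeReplacement; the default replacement is a space.
    protected char[] replacement() {
        return new char[] {' '};
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        int end = start + length;
        int runStart = start;
        for (int i = start; i < end; i++) {
            if (isInvalid(ch[i])) {
                if (i > runStart) {
                    delegate.characters(ch, runStart, i - runStart); // flush the valid run
                }
                char[] r = replacement();
                delegate.characters(r, 0, r.length);
                runStart = i + 1;
            }
        }
        if (end > runStart) {
            delegate.characters(ch, runStart, end - runStart); // flush the tail
        }
    }
}

// Collects character events so the filtered text can be inspected.
class CollectingHandler extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }
}

// Convenience wrapper that runs a string through the filter.
final class SafeFilterDemo {
    private SafeFilterDemo() {}

    static String filter(String input) {
        CollectingHandler sink = new CollectingHandler();
        char[] chars = input.toCharArray();
        try {
            new SpaceReplacingHandler(sink).characters(chars, 0, chars.length);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return sink.text.toString();
    }
}
```

In this sketch a subclass that overrides replacement() to return, say, a question mark changes the output without touching the scanning loop, which is the same separation of concerns the two protected methods above provide.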
characters
public void characters(char[] ch, int start, int length)
throws org.xml.sax.SAXException
- Specified by:
characters in interface org.xml.sax.ContentHandler
- Overrides:
characters in class ContentHandlerDecorator
- Throws:
org.xml.sax.SAXException
ignorableWhitespace
public void ignorableWhitespace(char[] ch, int start, int length)
throws org.xml.sax.SAXException
- Specified by:
ignorableWhitespace in interface org.xml.sax.ContentHandler
- Overrides:
ignorableWhitespace in class ContentHandlerDecorator
- Throws:
org.xml.sax.SAXException
Copyright © 2007-2010 The Apache Software Foundation. All Rights Reserved.