org.apache.tika.sax
Class SafeContentHandler

java.lang.Object
  extended by org.xml.sax.helpers.DefaultHandler
      extended by org.apache.tika.sax.ContentHandlerDecorator
          extended by org.apache.tika.sax.SafeContentHandler
All Implemented Interfaces:
org.xml.sax.ContentHandler, org.xml.sax.DTDHandler, org.xml.sax.EntityResolver, org.xml.sax.ErrorHandler
Direct Known Subclasses:
XHTMLContentHandler

public class SafeContentHandler
extends ContentHandlerDecorator

Content handler decorator that ensures that the character events (characters(char[], int, int) or ignorableWhitespace(char[], int, int)) passed to the decorated content handler contain only valid XML characters. Each invalid character is replaced with a space.

The XML 1.0 specification defines the following Unicode character ranges as valid XML characters:

 #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
 

Note that this class currently detects only those invalid characters whose UTF-16 representation fits in a single char. In addition, this class does not verify that the UTF-16 encoding of the incoming characters is correct.
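
As a rough illustration of the ranges listed above, the following standalone sketch checks a code point against them and replaces anything outside them with a space. It is not the Tika implementation: the class and method names (XmlCharRanges, isValidXmlCodePoint, replaceInvalid) are hypothetical, and unlike the class documented here it iterates by code point rather than by UTF-16 code unit.

```java
// Sketch only: a standalone check of the documented ranges, not Tika's code.
// All names in this block are hypothetical.
class XmlCharRanges {

    /** Returns true if the code point lies in one of the valid XML character ranges. */
    static boolean isValidXmlCodePoint(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the text in which every invalid code point is replaced with a space. */
    static String replaceInvalid(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isValidXmlCodePoint(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(' ');
            }
            // Advance by one or two chars depending on the code point width.
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Because String.codePointAt returns an unpaired surrogate as its own value, this sketch also replaces unpaired surrogates, which goes beyond what the note above says about this class.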


Nested Class Summary
protected static interface SafeContentHandler.Output
          Internal interface that allows both character and ignorable whitespace content to be filtered the same way.
 
Constructor Summary
SafeContentHandler(org.xml.sax.ContentHandler handler)
          Creates a handler that filters character events before passing them to the given content handler.
 
Method Summary
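 void characters(char[] ch, int start, int length)
          Passes character data to the decorated handler, replacing invalid XML characters.
 void ignorableWhitespace(char[] ch, int start, int length)
          Passes ignorable whitespace to the decorated handler, replacing invalid XML characters.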
protected  boolean isInvalid(char ch)
          Checks whether the given character (more accurately a UTF-16 code unit) is an invalid XML character and should be replaced for output.
protected  void writeReplacement(SafeContentHandler.Output output)
          Outputs the replacement for an invalid character.
 
Methods inherited from class org.apache.tika.sax.ContentHandlerDecorator
endDocument, endElement, endPrefixMapping, handleException, processingInstruction, setDocumentLocator, skippedEntity, startDocument, startElement, startPrefixMapping, toString
 
Methods inherited from class org.xml.sax.helpers.DefaultHandler
error, fatalError, notationDecl, resolveEntity, unparsedEntityDecl, warning
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

SafeContentHandler

public SafeContentHandler(org.xml.sax.ContentHandler handler)
Parameters:
handler - the content handler to decorate

Method Detail

isInvalid

protected boolean isInvalid(char ch)
Checks whether the given character (more accurately a UTF-16 code unit) is an invalid XML character and should be replaced for output. Subclasses can override this method to use an alternative definition of which characters should be replaced in the XML output.

Parameters:
ch - the character (UTF-16 code unit) to check
Returns:
true if the character should be replaced, false otherwise
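
Since the parameter is a UTF-16 code unit rather than a full code point, a check written against a single char sees a supplementary character as two separate surrogate values. The sketch below illustrates that with standard java.lang.Character methods only; the class and method names (CodeUnitDemo, codeUnits, seenAsSurrogatePair) are hypothetical and are not part of Tika.

```java
// Sketch only: shows how code points map to UTF-16 code units (Java chars).
// All names in this block are hypothetical.
class CodeUnitDemo {

    /** Returns the number of UTF-16 code units needed to encode one code point. */
    static int codeUnits(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    /** Returns true if a per-char check would see this code point as a high and a low surrogate. */
    static boolean seenAsSurrogatePair(int codePoint) {
        char[] units = Character.toChars(codePoint);
        return units.length == 2
                && Character.isHighSurrogate(units[0])
                && Character.isLowSurrogate(units[1]);
    }
}
```

An override that wants to treat surrogates specially would therefore have to decide on each half in isolation, because it receives only one char at a time.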

writeReplacement

protected void writeReplacement(SafeContentHandler.Output output)
                         throws org.xml.sax.SAXException
Outputs the replacement for an invalid character. Subclasses can override this method to use a custom replacement.

Parameters:
output - where the replacement is written to
Throws:
org.xml.sax.SAXException - if the replacement could not be written
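
To make the interplay of the two hooks concrete, here is a minimal sketch of a filtering decorator built only on the JDK's org.xml.sax classes. It is an assumption-laden stand-in, not Tika's source: the class ReplacingHandlerSketch, its nested Output interface with a write method, the filter helper, and the specific isInvalid definition are all hypothetical and chosen for illustration.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Sketch only: a JDK-only stand-in for the filtering idea, not Tika's source.
class ReplacingHandlerSketch extends DefaultHandler {

    /** Hypothetical counterpart of SafeContentHandler.Output. */
    interface Output {
        void write(char[] ch, int start, int length) throws SAXException;
    }

    private final ContentHandler delegate;

    ReplacingHandlerSketch(ContentHandler delegate) {
        this.delegate = delegate;
    }

    /** Assumed definition: controls other than tab, LF and CR, plus U+FFFE and U+FFFF. */
    protected boolean isInvalid(char ch) {
        if (ch < 0x20) {
            return ch != 0x9 && ch != 0xA && ch != 0xD;
        }
        return ch >= 0xFFFE;
    }

    /** Writes a single space; a subclass could write something else. */
    protected void writeReplacement(Output output) throws SAXException {
        output.write(new char[] {' '}, 0, 1);
    }

    /** Forwards runs of valid chars unchanged and substitutes each invalid char. */
    private void filter(char[] ch, int start, int length, Output output)
            throws SAXException {
        int end = start + length;
        int runStart = start;
        for (int i = start; i < end; i++) {
            if (isInvalid(ch[i])) {
                if (i > runStart) {
                    output.write(ch, runStart, i - runStart);
                }
                writeReplacement(output);
                runStart = i + 1;
            }
        }
        if (end > runStart) {
            output.write(ch, runStart, end - runStart);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        filter(ch, start, length, delegate::characters);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length)
            throws SAXException {
        filter(ch, start, length, delegate::ignorableWhitespace);
    }
}
```

Routing both event types through one Output-style callback is what lets a single writeReplacement override apply to character data and ignorable whitespace alike. In this sketch the input array is never modified; valid runs are forwarded by offset and only the replacement is written from a fresh array.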

characters

public void characters(char[] ch,
                       int start,
                       int length)
                throws org.xml.sax.SAXException
Specified by:
characters in interface org.xml.sax.ContentHandler
Overrides:
characters in class ContentHandlerDecorator
Throws:
org.xml.sax.SAXException

ignorableWhitespace

public void ignorableWhitespace(char[] ch,
                                int start,
                                int length)
                         throws org.xml.sax.SAXException
Specified by:
ignorableWhitespace in interface org.xml.sax.ContentHandler
Overrides:
ignorableWhitespace in class ContentHandlerDecorator
Throws:
org.xml.sax.SAXException


Copyright © 2007-2010 The Apache Software Foundation. All Rights Reserved.