org.apache.tika.sax
Class SafeContentHandler

java.lang.Object
  extended by org.xml.sax.helpers.DefaultHandler
      extended by org.apache.tika.sax.ContentHandlerDecorator
          extended by org.apache.tika.sax.SafeContentHandler
All Implemented Interfaces:
ContentHandler, DTDHandler, EntityResolver, ErrorHandler
Direct Known Subclasses:
XHTMLContentHandler, XMPContentHandler

public class SafeContentHandler
extends ContentHandlerDecorator

Content handler decorator that ensures the character events (characters(char[], int, int) and ignorableWhitespace(char[], int, int)) passed to the decorated content handler contain only valid XML characters. Each invalid character is replaced with a space.

The XML standard defines the following Unicode character ranges as valid XML characters:

 #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
 

Note that this class currently detects only those invalid characters whose UTF-16 representation fits in a single char. It also does not verify that the UTF-16 encoding of the incoming characters is well-formed.
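
The filtering described above can be sketched with JDK classes only. This is an illustrative stand-in, not the Tika source: the class name SpaceReplacingHandler and the helpers isInvalidBmpChar and filter are hypothetical, and the sketch assumes that surrogate chars are passed through unchecked, in line with the note on UTF-16 above.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for the decorator described above; not the Tika source. */
class SpaceReplacingHandler extends DefaultHandler {
    private final ContentHandler delegate;

    SpaceReplacingHandler(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Work on a copy so the caller's buffer is left untouched.
        char[] copy = new char[length];
        for (int i = 0; i < length; i++) {
            char c = ch[start + i];
            copy[i] = isInvalidBmpChar(c) ? ' ' : c;
        }
        delegate.characters(copy, 0, length);
    }

    /**
     * Single-char check only: disallowed control characters and U+FFFE/U+FFFF.
     * Assumption: surrogates are not inspected, so malformed UTF-16 passes through.
     */
    static boolean isInvalidBmpChar(char c) {
        if (c < 0x20) {
            return c != 0x9 && c != 0xA && c != 0xD;
        }
        return c == 0xFFFE || c == 0xFFFF;
    }

    /** Convenience for trying the sketch: returns the text the delegate received. */
    static String filter(String text) {
        StringBuilder received = new StringBuilder();
        DefaultHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                received.append(ch, start, length);
            }
        };
        try {
            new SpaceReplacingHandler(collector)
                    .characters(text.toCharArray(), 0, text.length());
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return received.toString();
    }
}
```

Calling SpaceReplacingHandler.filter on a string that contains a control character such as U+0001 yields the same string with a space in that position, while tabs and line feeds are left unchanged.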


Nested Class Summary
protected static interface SafeContentHandler.Output
          Internal interface that allows both character and ignorable whitespace content to be filtered the same way.
 
Constructor Summary
SafeContentHandler(ContentHandler handler)
           
 
Method Summary
 void characters(char[] ch, int start, int length)
           
 void endDocument()
           
 void endElement(String uri, String localName, String name)
           
 void ignorableWhitespace(char[] ch, int start, int length)
           
protected  boolean isInvalid(int ch)
          Checks whether the given Unicode character is an invalid XML character and should be replaced for output.
 void startElement(String uri, String localName, String name, Attributes atts)
           
protected  void writeReplacement(SafeContentHandler.Output output)
          Outputs the replacement for an invalid character.
 
Methods inherited from class org.apache.tika.sax.ContentHandlerDecorator
endPrefixMapping, handleException, processingInstruction, setContentHandler, setDocumentLocator, skippedEntity, startDocument, startPrefixMapping, toString
 
Methods inherited from class org.xml.sax.helpers.DefaultHandler
error, fatalError, notationDecl, resolveEntity, unparsedEntityDecl, warning
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

SafeContentHandler

public SafeContentHandler(ContentHandler handler)

Method Detail

isInvalid

protected boolean isInvalid(int ch)
Checks whether the given Unicode character is an invalid XML character and should be replaced for output. Subclasses can override this method to use an alternative definition of which characters should be replaced in the XML output. By default, a character is invalid if it falls outside the Char production of the XML specification:
 Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
 

Parameters:
ch - the Unicode character to check
Returns:
true if the character should be replaced, false otherwise
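
As a rough illustration of an alternative definition, the predicates below are written as standalone static methods so they can be run without Tika. The class and method names are hypothetical. The default rule follows the Char production quoted above; the stricter rule additionally flags the control ranges #x7F-#x84 and #x86-#x9F, which the XML 1.0 specification lists as discouraged. In a subclass, logic of this kind would presumably go into an overridden isInvalid(int).

```java
/** Hypothetical helper class; the rules are sketches, not the Tika implementation. */
class InvalidCharRules {

    /** Default rule: true for anything outside the XML Char production. */
    static boolean isInvalidDefault(int ch) {
        if (ch < 0x20) {
            // Only tab, line feed and carriage return are allowed below #x20.
            return ch != 0x9 && ch != 0xA && ch != 0xD;
        }
        if (ch >= 0xD800 && ch <= 0xDFFF) {
            return true; // surrogate code points are excluded from Char
        }
        if (ch == 0xFFFE || ch == 0xFFFF) {
            return true; // the gap between #xFFFD and #x10000
        }
        return ch > 0x10FFFF;
    }

    /** Stricter rule: also replaces the discouraged control characters. */
    static boolean isInvalidStrict(int ch) {
        boolean discouragedControl =
                (ch >= 0x7F && ch <= 0x84) || (ch >= 0x86 && ch <= 0x9F);
        return discouragedControl || isInvalidDefault(ch);
    }
}
```

With these rules, U+0080 is accepted by the default predicate but flagged by the stricter one, while U+0085 (NEL) is accepted by both.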

writeReplacement

protected void writeReplacement(SafeContentHandler.Output output)
                         throws SAXException
Outputs the replacement for an invalid character. Subclasses can override this method to use a custom replacement.

Parameters:
output - where the replacement is written to
Throws:
SAXException - if the replacement could not be written
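
A custom replacement might look like the sketch below, which writes U+FFFD (the Unicode replacement character, itself a valid XML character) instead of a space. The members of SafeContentHandler.Output are not shown on this page, so the nested Output interface here is a hypothetical stand-in that assumes a single write(char[], int, int) method; the class name ReplacementSketch and the demo helper are likewise illustrative.

```java
import org.xml.sax.SAXException;

/** Hypothetical sketch of a custom replacement; not the Tika source. */
class ReplacementSketch {

    /** Assumed shape of the output abstraction; the real interface may differ. */
    interface Output {
        void write(char[] ch, int start, int length) throws SAXException;
    }

    private static final char[] REPLACEMENT = {'\uFFFD'};

    /** Writes U+FFFD in place of an invalid character. */
    static void writeReplacement(Output output) throws SAXException {
        output.write(REPLACEMENT, 0, REPLACEMENT.length);
    }

    /** Convenience for trying the sketch: "a", then the replacement, then "b". */
    static String demo() {
        StringBuilder sink = new StringBuilder();
        Output output = (ch, start, length) -> sink.append(ch, start, length);
        try {
            sink.append("a");
            writeReplacement(output);
            sink.append("b");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return sink.toString();
    }
}
```

Routing the replacement through the output abstraction, rather than calling characters directly, lets the same logic serve both characters and ignorableWhitespace events.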

startElement

public void startElement(String uri,
                         String localName,
                         String name,
                         Attributes atts)
                  throws SAXException
Specified by:
startElement in interface ContentHandler
Overrides:
startElement in class ContentHandlerDecorator
Throws:
SAXException

endElement

public void endElement(String uri,
                       String localName,
                       String name)
                throws SAXException
Specified by:
endElement in interface ContentHandler
Overrides:
endElement in class ContentHandlerDecorator
Throws:
SAXException

endDocument

public void endDocument()
                 throws SAXException
Specified by:
endDocument in interface ContentHandler
Overrides:
endDocument in class ContentHandlerDecorator
Throws:
SAXException

characters

public void characters(char[] ch,
                       int start,
                       int length)
                throws SAXException
Specified by:
characters in interface ContentHandler
Overrides:
characters in class ContentHandlerDecorator
Throws:
SAXException

ignorableWhitespace

public void ignorableWhitespace(char[] ch,
                                int start,
                                int length)
                         throws SAXException
Specified by:
ignorableWhitespace in interface ContentHandler
Overrides:
ignorableWhitespace in class ContentHandlerDecorator
Throws:
SAXException


Copyright © 2007-2011 The Apache Software Foundation. All Rights Reserved.