Package org.apache.tika.sax
Class PhoneExtractingContentHandler
java.lang.Object
  org.xml.sax.helpers.DefaultHandler
    org.apache.tika.sax.ContentHandlerDecorator
      org.apache.tika.sax.PhoneExtractingContentHandler
All Implemented Interfaces:
ContentHandler, DTDHandler, EntityResolver, ErrorHandler
public class PhoneExtractingContentHandler extends ContentHandlerDecorator
Class used to extract phone numbers while parsing.
Every time a document is parsed in Tika, the content is split into SAX events, and those events are handled by a ContentHandler. You can think of each event as marking a tag in an HTML file. Once parsing is finished, you can call handler.toString(), for example, to get the text contents of the file. The file's metadata, on the other hand, is added to the Metadata object passed in during the parse() call. In short, the Parser class sends metadata to the Metadata object and content to the ContentHandler.
This class is an example of how to combine a ContentHandler and a Metadata. As content is passed to the handler, it is checked against a textual pattern for a phone number, and each phone number found is added to the metadata under the key "phonenumbers". So, if you used this ContentHandler when you parsed a document and then called metadata.getValues("phonenumbers"), you would get an array of Strings containing the phone numbers found in the document. Please see the PhoneExtractingContentHandlerTest for an example of how to use this class.
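The flow described above can be sketched with the JDK's own SAX classes. This is a minimal illustration, not Tika's implementation: the class name PhoneSketch, the NNN-NNN-NNNN pattern, and the Map that stands in for Tika's Metadata are all assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class PhoneSketch {
    // Hypothetical pattern: matches only the NNN-NNN-NNNN form.
    static final Pattern PHONE = Pattern.compile("\\b\\d{3}-\\d{3}-\\d{4}\\b");

    // Collects text from SAX events; records matches in a side map
    // (an assumed stand-in for Tika's Metadata) when the document ends.
    static final class Handler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final Map<String, List<String>> metadata;

        Handler(Map<String, List<String>> metadata) {
            this.metadata = metadata;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endDocument() {
            Matcher m = PHONE.matcher(text);
            while (m.find()) {
                metadata.computeIfAbsent("phonenumbers", k -> new ArrayList<>())
                        .add(m.group());
            }
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    // Parses an XML string with the JDK SAX parser and returns the side map.
    static Map<String, List<String>> extract(String xml) {
        Map<String, List<String>> metadata = new HashMap<>();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new Handler(metadata));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return metadata;
    }

    public static void main(String[] args) {
        String xml = "<p>Call 555-123-4567 or <b>555-987-6543</b>.</p>";
        System.out.println(extract(xml).get("phonenumbers"));
    }
}
```

The content (text) and the extracted values (side map) travel separately here, which mirrors the ContentHandler and Metadata split described above.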
Constructor Summary
Constructors
protected PhoneExtractingContentHandler()
Creates a decorator that by default forwards incoming SAX events to a dummy content handler that simply ignores all the events.
public PhoneExtractingContentHandler(ContentHandler handler, Metadata metadata)
Creates a decorator for the given SAX event handler and Metadata object.
Method Summary
void characters(char[] ch, int start, int length)
The characters method is called whenever a Parser wants to pass raw...
void endDocument()
This method is called whenever the Parser is done parsing the file.
Methods inherited from class org.apache.tika.sax.ContentHandlerDecorator
endElement, endPrefixMapping, handleException, ignorableWhitespace, processingInstruction, setContentHandler, setDocumentLocator, skippedEntity, startDocument, startElement, startPrefixMapping, toString
Methods inherited from class org.xml.sax.helpers.DefaultHandler
error, fatalError, notationDecl, resolveEntity, unparsedEntityDecl, warning
Constructor Detail
PhoneExtractingContentHandler
public PhoneExtractingContentHandler(ContentHandler handler, Metadata metadata)
Creates a decorator for the given SAX event handler and Metadata object.
Parameters:
handler - SAX event handler to be decorated
metadata - Metadata object to which extracted phone numbers are added
PhoneExtractingContentHandler
protected PhoneExtractingContentHandler()
Creates a decorator that by default forwards incoming SAX events to a dummy content handler that simply ignores all the events. Subclasses should use the ContentHandlerDecorator.setContentHandler(ContentHandler) method to switch to a more usable underlying content handler. Also creates a dummy Metadata object to store phone numbers in.
Method Detail
characters
public void characters(char[] ch, int start, int length) throws SAXException
The characters method is called whenever a Parser wants to pass raw... characters to the ContentHandler. However, a phone number is sometimes split across separate calls to characters, depending on the specific Parser used. So, we simply append all characters to a StringBuilder and analyze it once the document is finished.
Specified by:
characters in interface ContentHandler
Overrides:
characters in class ContentHandlerDecorator
Throws:
SAXException
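To see why buffering matters, the sketch below compares matching each chunk on its own with matching the accumulated text. It is an illustration only: the class SplitChunkDemo, its helper methods, and the NNN-NNN-NNNN pattern are hypothetical and are not taken from Tika's source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SplitChunkDemo {
    // Hypothetical pattern: matches only the NNN-NNN-NNNN form.
    static final Pattern PHONE = Pattern.compile("\\d{3}-\\d{3}-\\d{4}");

    // Returns every match of PHONE in the given text.
    static List<String> find(CharSequence text) {
        List<String> found = new ArrayList<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Naive approach: scan each chunk as it arrives, as if matching
    // inside every characters(...) call.
    static List<String> perChunk(String... chunks) {
        List<String> found = new ArrayList<>();
        for (String chunk : chunks) {
            found.addAll(find(chunk));
        }
        return found;
    }

    // Buffered approach: append every chunk to a StringBuilder and
    // scan once at the end, as if matching in endDocument().
    static List<String> buffered(String... chunks) {
        StringBuilder buffer = new StringBuilder();
        for (String chunk : chunks) {
            buffer.append(chunk);
        }
        return find(buffer);
    }

    public static void main(String[] args) {
        // The number is cut in the middle, as a parser might deliver it.
        String[] chunks = {"Call 555-12", "3-4567 today"};
        System.out.println("per chunk: " + perChunk(chunks));
        System.out.println("buffered:  " + buffered(chunks));
    }
}
```

With the number cut across two chunks, the per-chunk scan finds nothing, while the buffered scan recovers the full number.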
endDocument
public void endDocument() throws SAXException
This method is called whenever the Parser is done parsing the file. So, we check the output for any phone numbers.
Specified by:
endDocument in interface ContentHandler
Overrides:
endDocument in class ContentHandlerDecorator
Throws:
SAXException