Package org.apache.tika.sax
Class StandardsExtractingContentHandler
- java.lang.Object
  - org.xml.sax.helpers.DefaultHandler
    - org.apache.tika.sax.ContentHandlerDecorator
      - org.apache.tika.sax.StandardsExtractingContentHandler
- All Implemented Interfaces:
ContentHandler, DTDHandler, EntityResolver, ErrorHandler
public class StandardsExtractingContentHandler extends ContentHandlerDecorator
StandardsExtractingContentHandler is a content handler that extracts standard references while parsing. This handler relies on complex regular expressions, which can be slow on some types of input data.
-
-
Field Summary
Fields
- static String STANDARD_REFERENCES
-
Constructor Summary
Constructors
- protected StandardsExtractingContentHandler(): Creates a decorator that by default forwards incoming SAX events to a dummy content handler that simply ignores all the events.
- StandardsExtractingContentHandler(ContentHandler handler, Metadata metadata): Creates a decorator for the given SAX event handler and Metadata object.
-
Method Summary
Methods
- void characters(char[] ch, int start, int length): The characters method is called whenever a Parser wants to pass raw characters to the ContentHandler.
- void endDocument(): This method is called whenever the Parser is done parsing the file.
- double getThreshold(): Gets the threshold to be used for selecting the standard references found within the text based on their score.
- void setMaxBufferLength(int maxBufferLength): The number of characters to store in memory for checking for standards.
- void setThreshold(double score): Sets the score to be used as threshold.
Methods inherited from class org.apache.tika.sax.ContentHandlerDecorator
endElement, endPrefixMapping, error, fatalError, handleException, ignorableWhitespace, processingInstruction, setContentHandler, setDocumentLocator, skippedEntity, startDocument, startElement, startPrefixMapping, toString, warning
-
Methods inherited from class org.xml.sax.helpers.DefaultHandler
notationDecl, resolveEntity, unparsedEntityDecl
-
Field Detail
-
STANDARD_REFERENCES
public static final String STANDARD_REFERENCES
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
StandardsExtractingContentHandler
public StandardsExtractingContentHandler(ContentHandler handler, Metadata metadata)
Creates a decorator for the given SAX event handler and Metadata object.
- Parameters:
handler - SAX event handler to be decorated.
metadata - Metadata object.
-
StandardsExtractingContentHandler
protected StandardsExtractingContentHandler()
Creates a decorator that by default forwards incoming SAX events to a dummy content handler that simply ignores all the events. Subclasses should use the ContentHandlerDecorator.setContentHandler(ContentHandler)
method to switch to a more usable underlying content handler. Also creates a dummy Metadata object to store standard references in.
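
The forwarding behavior described above can be illustrated with JDK classes alone. The following is a minimal sketch, not Tika's ContentHandlerDecorator: the class name ForwardingHandlerSketch and the methods setDelegate and demo are hypothetical, and only the characters event is forwarded.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a decorator; NOT Tika's ContentHandlerDecorator.
class ForwardingHandlerSketch extends DefaultHandler {
    // Starts out as a do-nothing handler, mirroring the "dummy" default.
    private ContentHandler delegate = new DefaultHandler();

    // Hypothetical counterpart of setContentHandler(ContentHandler).
    void setDelegate(ContentHandler handler) {
        this.delegate = handler;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        delegate.characters(ch, start, length);
    }

    // Sends the same text twice: once to the dummy delegate, once to a collector.
    static String demo(String text) {
        StringBuilder seen = new StringBuilder();
        ForwardingHandlerSketch decorator = new ForwardingHandlerSketch();
        char[] chars = text.toCharArray();
        try {
            decorator.characters(chars, 0, chars.length); // ignored by the dummy
            decorator.setDelegate(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    seen.append(ch, start, length);
                }
            });
            decorator.characters(chars, 0, chars.length); // reaches the collector
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return seen.toString();
    }
}
```

In this sketch, events sent before a real delegate is installed are dropped, which is why a subclass using the no-argument constructor would normally switch handlers before parsing starts.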
-
-
Method Detail
-
getThreshold
public double getThreshold()
Gets the threshold to be used for selecting the standard references found within the text based on their score.
- Returns:
- the threshold to be used for selecting the standard references found within the text based on their score.
-
setThreshold
public void setThreshold(double score)
Sets the score to be used as threshold.
- Parameters:
score - the score to be used as threshold.
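
To make the role of the threshold concrete, here is a minimal sketch of score-based selection. It is an illustration only: the Candidate type, the select method, and the sample scores are hypothetical, and treating the comparison as inclusive (score >= threshold) is an assumption that this page does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

class ThresholdSketch {
    // Hypothetical candidate: matched text plus a confidence score.
    static final class Candidate {
        final String text;
        final double score;

        Candidate(String text, double score) {
            this.text = text;
            this.score = score;
        }
    }

    // Keeps the candidates whose score reaches the threshold, in input order.
    static List<String> select(List<Candidate> candidates, double threshold) {
        List<String> kept = new ArrayList<>();
        for (Candidate c : candidates) {
            if (c.score >= threshold) { // inclusive comparison is an assumption
                kept.add(c.text);
            }
        }
        return kept;
    }
}
```

Raising the threshold trades recall for precision: fewer candidates survive, but the ones that do carry higher scores.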
-
characters
public void characters(char[] ch, int start, int length) throws SAXException
The characters method is called whenever a Parser wants to pass raw characters to the ContentHandler. However, standard references are often split across different calls to characters, depending on the specific Parser used. Therefore, we simply add all characters to a StringBuilder and analyze it once the document is finished.
- Specified by:
characters in interface ContentHandler
- Overrides:
characters in class ContentHandlerDecorator
- Throws:
SAXException
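
The buffer-then-scan approach can be sketched with a plain SAX DefaultHandler. This is an illustration under stated assumptions, not the handler's actual code: the class BufferingSketch, the helper scanChunks, and the deliberately simplified REFERENCE pattern are all hypothetical, and the real regular expressions are more complex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.helpers.DefaultHandler;

class BufferingSketch extends DefaultHandler {
    // Simplified, hypothetical pattern; it only stands in for the real expressions.
    private static final Pattern REFERENCE =
            Pattern.compile("\\b(?:ISO|IEEE|ANSI)\\s+\\d+\\b");

    private final StringBuilder buffer = new StringBuilder();
    final List<String> found = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        // No matching here: a reference may straddle two calls.
        buffer.append(ch, start, length);
    }

    @Override
    public void endDocument() {
        // The whole text is available now, so split references are intact.
        Matcher m = REFERENCE.matcher(buffer);
        while (m.find()) {
            found.add(m.group());
        }
    }

    // Feeds each chunk as a separate characters() call, then ends the document.
    static List<String> scanChunks(String... chunks) {
        BufferingSketch handler = new BufferingSketch();
        for (String chunk : chunks) {
            char[] chars = chunk.toCharArray();
            handler.characters(chars, 0, chars.length);
        }
        handler.endDocument();
        return handler.found;
    }
}
```

Calling scanChunks("See IS", "O 9001 for details") finds "ISO 9001" even though neither chunk contains the full reference, which is the case that matching inside characters would miss.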
-
endDocument
public void endDocument() throws SAXException
This method is called whenever the Parser is done parsing the file. So, we check the output for any standard references.
- Specified by:
endDocument in interface ContentHandler
- Overrides:
endDocument in class ContentHandlerDecorator
- Throws:
SAXException
-
setMaxBufferLength
public void setMaxBufferLength(int maxBufferLength)
The number of characters to store in memory for checking for standards. If this is unbounded, the complex regular expressions can take a long time to process some types of data. Only increase this limit with great caution.
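
One way to bound such a buffer is to truncate each append to the remaining capacity. The sketch below assumes that strategy; BoundedBufferSketch and its methods are hypothetical, and how the real handler treats overflow or special values of maxBufferLength is not stated on this page.

```java
class BoundedBufferSketch {
    private final StringBuilder buffer = new StringBuilder();
    private final int maxBufferLength;

    BoundedBufferSketch(int maxBufferLength) {
        this.maxBufferLength = maxBufferLength;
    }

    // Appends at most as many characters as still fit; the rest is dropped.
    void append(char[] ch, int start, int length) {
        int remaining = maxBufferLength - buffer.length();
        if (remaining <= 0) {
            return; // buffer is full, ignore further input
        }
        buffer.append(ch, start, Math.min(length, remaining));
    }

    String contents() {
        return buffer.toString();
    }

    // Feeds the chunks in order and returns what the bounded buffer retained.
    static String fill(int maxBufferLength, String... chunks) {
        BoundedBufferSketch bounded = new BoundedBufferSketch(maxBufferLength);
        for (String chunk : chunks) {
            char[] chars = chunk.toCharArray();
            bounded.append(chars, 0, chars.length);
        }
        return bounded.contents();
    }
}
```

With a cap in place, the regular expressions only ever run over a bounded amount of text, at the cost of missing references that appear after the cutoff.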
-
-