Package org.apache.tika.sax
Class StandardsExtractingContentHandler
java.lang.Object
org.xml.sax.helpers.DefaultHandler
org.apache.tika.sax.ContentHandlerDecorator
org.apache.tika.sax.StandardsExtractingContentHandler
- All Implemented Interfaces:
ContentHandler, DTDHandler, EntityResolver, ErrorHandler
StandardsExtractingContentHandler is a Content Handler used to extract
standard references while parsing.
This handler relies on complex regular expressions which can be slow on some types of input data.
-
Field Summary
-
Constructor Summary
protected StandardsExtractingContentHandler()
    Creates a decorator that by default forwards incoming SAX events to a dummy content handler that simply ignores all the events.
StandardsExtractingContentHandler(ContentHandler handler, Metadata metadata)
    Creates a decorator for the given SAX event handler and Metadata object.
-
Method Summary
void characters(char[] ch, int start, int length)
    The characters method is called whenever a Parser wants to pass raw characters to the ContentHandler.
void endDocument()
    This method is called whenever the Parser is done parsing the file.
double getThreshold()
    Gets the threshold to be used for selecting the standard references found within the text based on their score.
void setMaxBufferLength(int maxBufferLength)
    The number of characters to store in memory for checking for standards.
void setThreshold(double score)
    Sets the score to be used as threshold.
Methods inherited from class org.apache.tika.sax.ContentHandlerDecorator
endElement, endPrefixMapping, error, fatalError, handleException, ignorableWhitespace, processingInstruction, setContentHandler, setDocumentLocator, skippedEntity, startDocument, startElement, startPrefixMapping, toString, warning
Methods inherited from class org.xml.sax.helpers.DefaultHandler
notationDecl, resolveEntity, unparsedEntityDecl
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface org.xml.sax.ContentHandler
declaration
-
Field Details
-
STANDARD_REFERENCES
-
Constructor Details
-
StandardsExtractingContentHandler
public StandardsExtractingContentHandler(ContentHandler handler, Metadata metadata)
Creates a decorator for the given SAX event handler and Metadata object.
- Parameters:
handler - SAX event handler to be decorated.
metadata - Metadata object.
-
StandardsExtractingContentHandler
protected StandardsExtractingContentHandler()
Creates a decorator that by default forwards incoming SAX events to a dummy content handler that simply ignores all the events. Subclasses should use the ContentHandlerDecorator.setContentHandler(ContentHandler) method to switch to a more usable underlying content handler. Also creates a dummy Metadata object to store standard references in.
-
-
Method Details
-
getThreshold
public double getThreshold()
Gets the threshold to be used for selecting the standard references found within the text based on their score.
- Returns:
the threshold to be used for selecting the standard references found within the text based on their score.
-
setThreshold
public void setThreshold(double score)
Sets the score to be used as threshold.
- Parameters:
score - the score to be used as threshold.
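As a rough illustration of how a score threshold selects references, the sketch below filters scored candidates against a cut-off. ScoredReference and ThresholdFilter are hypothetical names invented for this example and are not part of Tika; the inclusive comparison (>=) is also an assumption, since this page does not say how the handler compares scores.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a scored match; not a Tika class. */
class ScoredReference {
    final String text;
    final double score;

    ScoredReference(String text, double score) {
        this.text = text;
        this.score = score;
    }
}

/** Hypothetical helper showing threshold-based selection. */
class ThresholdFilter {
    /**
     * Keeps only the candidates whose score reaches the threshold.
     * Assumption: the comparison is inclusive (>=); the real handler may differ.
     */
    static List<String> select(List<ScoredReference> candidates, double threshold) {
        List<String> kept = new ArrayList<>();
        for (ScoredReference candidate : candidates) {
            if (candidate.score >= threshold) {
                kept.add(candidate.text);
            }
        }
        return kept;
    }
}
```

Raising the threshold in this sketch trades recall for precision: fewer candidates pass, but those that do carry higher scores.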
-
characters
public void characters(char[] ch, int start, int length) throws SAXException
The characters method is called whenever a Parser wants to pass raw characters to the ContentHandler. However, standard references are often split across different calls to characters, depending on the specific Parser used. Therefore, we simply add all characters to a StringBuilder and analyze it once the document is finished.
- Specified by:
characters in interface ContentHandler
- Overrides:
characters in class ContentHandlerDecorator
- Throws:
SAXException
-
endDocument
public void endDocument() throws SAXException
This method is called whenever the Parser is done parsing the file. So, we check the output for any standard references.
- Specified by:
endDocument in interface ContentHandler
- Overrides:
endDocument in class ContentHandlerDecorator
- Throws:
SAXException
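The buffer-then-scan approach described for characters and endDocument can be sketched with a plain SAX DefaultHandler. This is a minimal illustration under stated assumptions, not the Tika implementation: BufferingReferenceHandler and its getReferences method are hypothetical, and the pattern is a deliberately simple stand-in for the complex regular expressions the real handler relies on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical illustration only; not the Tika implementation. */
class BufferingReferenceHandler extends DefaultHandler {
    // Deliberately simple pattern, assumed for this sketch.
    private static final Pattern REFERENCE =
            Pattern.compile("\\b(?:ISO|IEEE|ANSI) ?\\d+\\b");

    private final StringBuilder buffer = new StringBuilder();
    private final List<String> references = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        // A reference may be split across calls, so only collect text here.
        buffer.append(ch, start, length);
    }

    @Override
    public void endDocument() {
        // The whole text is available now, so scan it once.
        Matcher matcher = REFERENCE.matcher(buffer);
        while (matcher.find()) {
            references.add(matcher.group());
        }
    }

    /** Returns the references found after endDocument has run. */
    List<String> getReferences() {
        return references;
    }
}
```

If a parser delivers "Complies with IS" and "O 9001" in two separate characters calls, neither chunk matches on its own, but the joined buffer does once endDocument runs.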
-
setMaxBufferLength
public void setMaxBufferLength(int maxBufferLength)
The number of characters to store in memory for checking for standards. If this is unbounded, the complex regular expressions can take a long time to process some types of data. Only increase this limit with great caution.
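One way to picture a cap on buffered characters is the sketch below, which stops appending once the limit is reached. BoundedTextBuffer is a hypothetical class written for this example; how the real handler treats overflow, or a zero or negative limit, is not stated on this page, so dropping the excess and storing nothing for non-positive limits are assumptions of the sketch.

```java
/** Hypothetical illustration of a capped text buffer; not the Tika implementation. */
class BoundedTextBuffer {
    private final StringBuilder buffer = new StringBuilder();
    private int maxBufferLength;

    BoundedTextBuffer(int maxBufferLength) {
        this.maxBufferLength = maxBufferLength;
    }

    void setMaxBufferLength(int maxBufferLength) {
        this.maxBufferLength = maxBufferLength;
    }

    /** Appends as much of the input as still fits and drops the rest (assumed behavior). */
    void append(char[] ch, int start, int length) {
        int remaining = maxBufferLength - buffer.length();
        if (remaining <= 0) {
            return; // Buffer is full, or the limit is non-positive.
        }
        buffer.append(ch, start, Math.min(length, remaining));
    }

    String contents() {
        return buffer.toString();
    }
}
```

Capping the buffer bounds the amount of text the regular expressions ever see, which is why a larger limit can make processing noticeably slower on some inputs.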
-