Package org.apache.tika.utils
Class XMLReaderUtils
java.lang.Object
org.apache.tika.utils.XMLReaderUtils
- All Implemented Interfaces:
Serializable
Utility functions for reading XML.
-
Field Summary
Fields
static final int DEFAULT_MAX_ENTITY_EXPANSIONS
static final int DEFAULT_NUM_REUSES
static final int DEFAULT_POOL_SIZE - Default size for the pool of SAX parsers and the pool of DOM builders
-
Constructor Summary
Constructors
XMLReaderUtils()
-
Method Summary
Modifier and Type, Method - Description
static Document buildDOM(InputStream is) - Builds a Document with a DocumentBuilder from the pool
static Document buildDOM(InputStream is, ParseContext context) - This checks context for a user-specified DocumentBuilder.
static Document buildDOM(Reader reader, ParseContext context) - This checks context for a user-specified DocumentBuilder.
static Document buildDOM(String uriString) - Builds a Document with a DocumentBuilder from the pool
static Document buildDOM(Path path) - Builds a Document with a DocumentBuilder from the pool
static String getAttrValue(String localName, Attributes atts)
static DocumentBuilder getDocumentBuilder() - Returns the DOM builder specified in this parsing context.
static DocumentBuilder getDocumentBuilder(ParseContext context) - Returns the DOM builder specified in this parsing context.
static DocumentBuilderFactory getDocumentBuilderFactory() - Returns the DOM builder factory specified in this parsing context.
static int getMaxEntityExpansions()
static int getMaxNumReuses() - Get the maximum number of times a SAXParser or DOMBuilder may be reused.
static int getPoolSize()
static SAXParser getSAXParser() - Returns the SAX parser specified in this parsing context.
static SAXParserFactory getSAXParserFactory() - Returns the SAX parser factory specified in this parsing context.
static Transformer getTransformer() - Returns a new transformer
static Transformer getTransformer(ParseContext context) - Returns the transformer specified in this parsing context.
static XMLInputFactory getXMLInputFactory() - Returns the StAX input factory specified in this parsing context.
static XMLInputFactory getXMLInputFactory(ParseContext context) - Returns the StAX input factory specified in this parsing context.
static XMLReader getXMLReader() - Returns the XMLReader specified in this parsing context.
static void parseSAX(InputStream is, ContentHandler contentHandler, ParseContext context) - This checks context for a user-specified SAXParser.
static void parseSAX(Reader reader, ContentHandler contentHandler, ParseContext context) - This checks context for a user-specified SAXParser.
static void setMaxEntityExpansions(int maxEntityExpansions) - Set the maximum number of entity expansions allowable in SAX/DOM/StAX parsing.
static void setMaxNumReuses(int maxNumReuses)
static void setPoolSize(int poolSize) - Set the pool size for cached XML parsers.
-
Field Details
-
DEFAULT_POOL_SIZE
public static final int DEFAULT_POOL_SIZE
Default size for the pool of SAX parsers and the pool of DOM builders
-
DEFAULT_MAX_ENTITY_EXPANSIONS
public static final int DEFAULT_MAX_ENTITY_EXPANSIONS
-
DEFAULT_NUM_REUSES
public static final int DEFAULT_NUM_REUSES
-
-
Constructor Details
-
XMLReaderUtils
public XMLReaderUtils()
-
-
Method Details
-
getXMLReader
Returns the XMLReader specified in this parsing context. If a reader is not explicitly specified, then one is created using the specified or the default SAX parser.
- Returns:
- XMLReader
- Throws:
TikaException
- Since:
- Apache Tika 1.13
-
getSAXParser
Returns the SAX parser specified in this parsing context. If a parser is not explicitly specified, then one is created using the specified or the default SAX parser factory.
If you call reset() on the parser, make sure to set the SecurityManager again afterwards, because xerces2 clears it on reset().
- Returns:
- SAX parser
- Throws:
TikaException - if a SAX parser could not be created
- Since:
- Apache Tika 0.8
-
getSAXParserFactory
Returns the SAX parser factory specified in this parsing context. If a factory is not explicitly specified, then a default factory instance is created and returned. The default factory instance is configured to be namespace-aware, not validating, and to use secure XML processing.
- Returns:
- SAX parser factory
- Since:
- Apache Tika 0.8
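As a rough illustration of the default configuration described for getSAXParserFactory (namespace-aware, not validating, secure XML processing), the sketch below builds a comparable factory with plain JAXP. It is not Tika's implementation: the class SecureSaxFactorySketch and the method newFactory are hypothetical names, and Tika may apply further hardening that is not shown here.

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration only; not part of Tika.
final class SecureSaxFactorySketch {

    // Builds a factory with the three properties named in the description above.
    static SAXParserFactory newFactory() throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(false);
        // Standard JAXP switch for secure XML processing.
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        return factory;
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = newFactory();
        System.out.println("namespace-aware: " + factory.isNamespaceAware());
        System.out.println("validating: " + factory.isValidating());
        System.out.println("secure processing: "
                + factory.getFeature(XMLConstants.FEATURE_SECURE_PROCESSING));
    }
}
```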
-
getDocumentBuilderFactory
Returns the DOM builder factory specified in this parsing context. If a factory is not explicitly specified, then a default factory instance is created and returned. The default factory instance is configured to be namespace-aware and to apply reasonable security features.
- Returns:
- DOM parser factory
- Since:
- Apache Tika 1.13
-
getDocumentBuilder
Returns the DOM builder specified in this parsing context. If a builder is not explicitly specified, then a builder instance is created and returned. The builder instance is configured to apply an IGNORING_SAX_ENTITY_RESOLVER, and it sets the ErrorHandler to null.
- Returns:
- DOM Builder
- Throws:
TikaException
- Since:
- Apache Tika 1.13
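To make the idea of an entity resolver that ignores external entities more concrete, here is a minimal sketch using only JAXP. It assumes that "ignoring" means resolving every external entity to empty content; the class IgnoringResolverSketch, its IGNORING_RESOLVER field, and the example URL are hypothetical and are not taken from Tika's source.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of Tika.
final class IgnoringResolverSketch {

    // Assumption: every external entity is answered with empty content,
    // so the parser never fetches the referenced resource.
    static final EntityResolver IGNORING_RESOLVER =
            (publicId, systemId) -> new InputSource(new StringReader(""));

    static DocumentBuilder newBuilder() throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver(IGNORING_RESOLVER);
        // null asks the underlying implementation to use its own default error handling.
        builder.setErrorHandler(null);
        return builder;
    }

    // Parses the XML and returns the text content of the root element.
    static String rootText(String xml) throws Exception {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        Document doc = newBuilder().parse(new ByteArrayInputStream(bytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // The DTD URL is a placeholder; the resolver answers before any network access.
        String xml = "<?xml version=\"1.0\"?>"
                + "<!DOCTYPE root SYSTEM \"http://example.invalid/missing.dtd\">"
                + "<root>hi</root>";
        System.out.println(rootText(xml));
    }
}
```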
-
getXMLInputFactory
Returns the StAX input factory specified in this parsing context. If a factory is not explicitly specified, then a default factory instance is created and returned. The default factory instance is configured to be namespace-aware and to apply reasonable security using the IGNORING_STAX_ENTITY_RESOLVER.
- Returns:
- StAX input factory
- Since:
- Apache Tika 1.13
-
getTransformer
Returns a new transformer. The transformer instance is configured to use secure XML processing.
- Returns:
- Transformer
- Throws:
TikaException - when the transformer cannot be created
- Since:
- Apache Tika 1.17
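For readers unfamiliar with secure XML processing on the transformer side, the following sketch shows the standard JAXP feature being enabled before a transformer is created. It is an illustration under the assumption that this feature is what the description refers to; SecureTransformerSketch and its methods are hypothetical names, not Tika code.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.XMLConstants;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper for illustration only; not part of Tika.
final class SecureTransformerSketch {

    // Creates an identity transformer from a factory with secure processing enabled.
    static Transformer newTransformer() throws TransformerException {
        TransformerFactory factory = TransformerFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        return factory.newTransformer();
    }

    // Runs the identity transform over the given XML and returns the serialized result.
    static String identity(String xml) throws TransformerException {
        StringWriter out = new StringWriter();
        newTransformer().transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(identity("<a>hi</a>"));
    }
}
```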
-
buildDOM
public static Document buildDOM(InputStream is, ParseContext context) throws TikaException, IOException, SAXException
This checks context for a user-specified DocumentBuilder. If one is not found, this reuses a DocumentBuilder from the pool.
- Parameters:
is - InputStream to parse
context - context to use
- Returns:
- a document
- Throws:
TikaException
IOException
SAXException
- Since:
- Apache Tika 1.19
-
buildDOM
public static Document buildDOM(Reader reader, ParseContext context) throws TikaException, IOException, SAXException
This checks context for a user-specified DocumentBuilder. If one is not found, this reuses a DocumentBuilder from the pool.
- Parameters:
reader - reader (character stream) to parse
context - context to use
- Returns:
- a document
- Throws:
TikaException
IOException
SAXException
- Since:
- Apache Tika 2.5
-
buildDOM
Builds a Document with a DocumentBuilder from the pool.
- Parameters:
path - path to parse
- Returns:
- a document
- Throws:
TikaException
IOException
SAXException
- Since:
- Apache Tika 1.19.1
-
buildDOM
Builds a Document with a DocumentBuilder from the pool.
- Parameters:
uriString - uriString to process
- Returns:
- a document
- Throws:
TikaException
IOException
SAXException
- Since:
- Apache Tika 1.19.1
-
buildDOM
Builds a Document with a DocumentBuilder from the pool.
- Returns:
- a document
- Throws:
TikaException
IOException
SAXException
- Since:
- Apache Tika 1.19.1
-
parseSAX
public static void parseSAX(InputStream is, ContentHandler contentHandler, ParseContext context) throws TikaException, IOException, SAXException
This checks context for a user-specified SAXParser. If one is not found, this reuses a SAXParser from the pool.
- Parameters:
is - InputStream to parse
contentHandler - handler to use; this wraps the content handler in an OfflineContentHandler as an extra layer of defense against external entity vulnerabilities
context - context to use
- Throws:
TikaException
IOException
SAXException
- Since:
- Apache Tika 1.19
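The phrase "reuses a SAXParser from the pool" can be pictured with a generic pooling sketch like the one below. This is an assumption about the general technique, not a description of Tika's internals: SaxParserPoolSketch and its methods are hypothetical, the pool here requires a size of at least 1, and the OfflineContentHandler wrapping mentioned above is not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical fixed-size parser pool for illustration only; not part of Tika.
final class SaxParserPoolSketch {

    private final BlockingQueue<SAXParser> pool;

    // poolSize must be at least 1 in this sketch.
    SaxParserPoolSketch(int poolSize) throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        pool = new ArrayBlockingQueue<>(poolSize);
        for (int i = 0; i < poolSize; i++) {
            pool.add(factory.newSAXParser());
        }
    }

    // Borrows a parser, parses, then resets the parser and returns it to the pool.
    void parse(InputStream is, DefaultHandler handler)
            throws IOException, SAXException, InterruptedException {
        SAXParser parser = pool.take(); // blocks while all parsers are in use
        try {
            parser.parse(is, handler);
        } finally {
            parser.reset(); // supported by the JDK's built-in parser
            pool.add(parser);
        }
    }

    // Counts start-element events in the given XML using a pooled parser.
    static int countElements(SaxParserPoolSketch pool, String xml) throws Exception {
        int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                count[0]++;
            }
        };
        pool.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        SaxParserPoolSketch pool = new SaxParserPoolSketch(1);
        // Both calls share the single pooled parser.
        System.out.println(countElements(pool, "<a><b/><c/></a>"));
        System.out.println(countElements(pool, "<x/>"));
    }
}
```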
-
parseSAX
public static void parseSAX(Reader reader, ContentHandler contentHandler, ParseContext context) throws TikaException, IOException, SAXException
This checks context for a user-specified SAXParser. If one is not found, this reuses a SAXParser from the pool.
- Parameters:
reader - reader (character stream) to parse
contentHandler - handler to use; this wraps the content handler in an OfflineContentHandler as an extra layer of defense against external entity vulnerabilities
context - context to use
- Throws:
TikaException
IOException
SAXException
- Since:
- Apache Tika 2.5
-
getMaxNumReuses
public static int getMaxNumReuses()
Get the maximum number of times a SAXParser or DOMBuilder may be reused.
-
setMaxNumReuses
public static void setMaxNumReuses(int maxNumReuses) -
getPoolSize
public static int getPoolSize() -
setPoolSize
Set the pool size for cached XML parsers. As a side effect, this locks the pool and rebuilds it from scratch with the most recent settings, such as MAX_ENTITY_EXPANSIONS.
As of Tika 3.2.1, if a value of 0 is passed in, no SAXParsers or DOMBuilders will be pooled, and a new parser/builder will be built for each parse.
- Parameters:
poolSize
- Throws:
TikaException
- Since:
- Apache Tika 1.19
-
getMaxEntityExpansions
public static int getMaxEntityExpansions() -
setMaxEntityExpansions
public static void setMaxEntityExpansions(int maxEntityExpansions)
Set the maximum number of entity expansions allowable in SAX/DOM/StAX parsing.
NOTE: A value less than or equal to zero indicates no limit. This will override the system property JAXP_ENTITY_EXPANSION_LIMIT_KEY and the DEFAULT_MAX_ENTITY_EXPANSIONS value for allowable entity expansions.
NOTE: To apply this setting to the pooled SAX and DOM parsers, the client must call setPoolSize(int), which rebuilds the pool.
- Parameters:
maxEntityExpansions - maximum number of allowable entity expansions
- Since:
- Apache Tika 1.19
-
getAttrValue
- Parameters:
localName
atts
- Returns:
- attribute value with that local name or null if not found
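A lookup with this contract (match on local name, null when absent) can be sketched as a simple scan over the SAX Attributes. The code below is an assumption about how such a lookup might work, not a copy of Tika's method; AttrValueSketch and attrValueByLocalName are hypothetical names, and the namespace URI is a placeholder.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper for illustration only; not part of Tika.
final class AttrValueSketch {

    // Returns the value of the first attribute whose local name matches, or null if none does.
    static String attrValueByLocalName(String localName, Attributes atts) {
        for (int i = 0; i < atts.getLength(); i++) {
            if (localName.equals(atts.getLocalName(i))) {
                return atts.getValue(i);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        AttributesImpl atts = new AttributesImpl();
        // Arguments: namespace URI, local name, qualified name, type, value.
        atts.addAttribute("http://example.com/ns", "id", "ex:id", "CDATA", "42");
        System.out.println(attrValueByLocalName("id", atts));
        System.out.println(attrValueByLocalName("missing", atts));
    }
}
```

Note that matching on the local name ignores the namespace, so attributes with the same local name in different namespaces are not distinguished by this sketch.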
-
getDocumentBuilder
Returns the DOM builder specified in this parsing context. If a builder is not explicitly specified, then a builder instance is created and returned. The builder instance is configured to apply an IGNORING_SAX_ENTITY_RESOLVER, and it sets the ErrorHandler to null. Consider using buildDOM(InputStream, ParseContext) instead for more efficient reuse of document builders.
- Returns:
- DOM Builder
- Throws:
TikaException
-
getXMLInputFactory
Returns the StAX input factory specified in this parsing context. If a factory is not explicitly specified, then a default factory instance is created and returned. The default factory instance is configured to be namespace-aware and to apply reasonable security using the IGNORING_STAX_ENTITY_RESOLVER.
- Returns:
- StAX input factory
-
getTransformer
Returns the transformer specified in this parsing context. If a transformer is not explicitly specified, then a default transformer instance is created and returned. The default transformer instance is configured to use secure XML processing.
- Returns:
- Transformer
- Throws:
TikaException - when the transformer cannot be created
-