Class XMLReaderUtils

java.lang.Object
org.apache.tika.utils.XMLReaderUtils
All Implemented Interfaces:
Serializable

public class XMLReaderUtils extends Object implements Serializable
Utility functions for reading XML.
  • Field Details

    • DEFAULT_POOL_SIZE

      public static final int DEFAULT_POOL_SIZE
      Default size for the pool of SAX Parsers and the pool of DOM builders
    • DEFAULT_MAX_ENTITY_EXPANSIONS

      public static final int DEFAULT_MAX_ENTITY_EXPANSIONS
  • Constructor Details

    • XMLReaderUtils

      public XMLReaderUtils()
  • Method Details

    • getXMLReader

      public static XMLReader getXMLReader() throws TikaException
      Returns the XMLReader specified in this parsing context. If a reader is not explicitly specified, then one is created using the specified or the default SAX parser.
      Returns:
      XMLReader
      Throws:
      TikaException
      Since:
      Apache Tika 1.13
    • getSAXParser

      public static SAXParser getSAXParser() throws TikaException
      Returns the SAX parser specified in this parsing context. If a parser is not explicitly specified, then one is created using the specified or the default SAX parser factory.

      If you call reset() on the parser, make sure to restore the SecurityManager afterwards, because Xerces2 clears it on reset().

      Returns:
      SAX parser
      Throws:
      TikaException - if a SAX parser could not be created
      Since:
      Apache Tika 0.8
    • getSAXParserFactory

      public static SAXParserFactory getSAXParserFactory()
      Returns the SAX parser factory specified in this parsing context. If a factory is not explicitly specified, then a default factory instance is created and returned. The default factory instance is configured to be namespace-aware, not validating, and to use secure XML processing.
      Returns:
      SAX parser factory
      Since:
      Apache Tika 0.8
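
      The three settings named for the default factory (namespace-aware, not validating, secure XML processing) can be illustrated with plain JAXP. This is a minimal sketch, not Tika's implementation; the class and method names are hypothetical, and the real default factory may apply further features.

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;

// Hypothetical helper class; it only mirrors the settings the description lists.
final class SecureSaxFactorySketch {

    static SAXParserFactory newFactory()
            throws ParserConfigurationException, SAXNotRecognizedException, SAXNotSupportedException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);   // namespace-aware
        factory.setValidating(false);      // not validating
        // Ask the JAXP implementation to enable its secure-processing limits.
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        return factory;
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = newFactory();
        System.out.println("namespace-aware: " + factory.isNamespaceAware());
        System.out.println("validating: " + factory.isValidating());
        System.out.println("secure processing: "
                + factory.getFeature(XMLConstants.FEATURE_SECURE_PROCESSING));
    }
}
```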
    • getDocumentBuilderFactory

      public static DocumentBuilderFactory getDocumentBuilderFactory()
      Returns the DOM builder factory specified in this parsing context. If a factory is not explicitly specified, then a default factory instance is created and returned. The default factory instance is configured to be namespace-aware and to apply reasonable security features.
      Returns:
      DOM parser factory
      Since:
      Apache Tika 1.13
    • getDocumentBuilder

      public static DocumentBuilder getDocumentBuilder() throws TikaException
      Returns the DOM builder specified in this parsing context. If a builder is not explicitly specified, then a builder instance is created and returned. The builder instance is configured to apply an IGNORING_SAX_ENTITY_RESOLVER, and it sets the ErrorHandler to null.
      Returns:
      DOM Builder
      Throws:
      TikaException
      Since:
      Apache Tika 1.13
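
      To show what an entity resolver that ignores external entities does, here is a sketch using plain JAXP. It assumes that IGNORING_SAX_ENTITY_RESOLVER behaves like a resolver that answers every lookup with empty content; IGNORE_ALL and the class name below are hypothetical stand-ins, not Tika code.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class IgnoringResolverDomSketch {

    // Assumed behavior: resolve every external entity to an empty document,
    // so the parser never opens the referenced file or URL.
    static final EntityResolver IGNORE_ALL =
            (publicId, systemId) -> new InputSource(new StringReader(""));

    static DocumentBuilder newBuilder() throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver(IGNORE_ALL);
        builder.setErrorHandler(null); // null selects the implementation's default handling
        return builder;
    }

    static Document parse(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        return newBuilder().parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // The DTD path does not exist; the resolver supplies empty content instead.
        String xml = "<!DOCTYPE root SYSTEM \"file:///no/such/file.dtd\"><root><child/></root>";
        System.out.println(parse(xml).getDocumentElement().getNodeName());
    }
}
```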
    • getXMLInputFactory

      public static XMLInputFactory getXMLInputFactory()
      Returns the StAX input factory specified in this parsing context. If a factory is not explicitly specified, then a default factory instance is created and returned. The default factory instance is configured to be namespace-aware and to apply reasonable security using the IGNORING_STAX_ENTITY_RESOLVER.
      Returns:
      StAX input factory
      Since:
      Apache Tika 1.13
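
      A comparable StAX configuration can be sketched with the standard javax.xml.stream API. The resolver below is an assumption about how an "ignoring" resolver might behave (it returns an empty stream for every entity); it is not the actual IGNORING_STAX_ENTITY_RESOLVER, and the class name is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLResolver;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class IgnoringResolverStaxSketch {

    // Assumed behavior: every entity lookup yields an empty stream.
    static final XMLResolver IGNORE_ALL =
            (publicId, systemId, baseUri, namespace) -> new ByteArrayInputStream(new byte[0]);

    static XMLInputFactory newFactory() {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.TRUE);
        factory.setProperty(XMLInputFactory.IS_VALIDATING, Boolean.FALSE);
        factory.setXMLResolver(IGNORE_ALL);
        return factory;
    }

    // Counts START_ELEMENT events to show the factory in use.
    static int countElements(String xml) throws XMLStreamException {
        XMLStreamReader reader = newFactory().createXMLStreamReader(new StringReader(xml));
        int count = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countElements("<a><b/><c/></a>"));
    }
}
```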
    • getTransformer

      public static Transformer getTransformer() throws TikaException
      Returns a new transformer.

      The transformer instance is configured to use secure XML processing.

      Returns:
      Transformer
      Throws:
      TikaException - when the transformer cannot be created
      Since:
      Apache Tika 1.17
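
      For illustration, a transformer with secure processing enabled can be created with standard JAXP as follows. This sketch only assumes that "secure XML processing" corresponds to XMLConstants.FEATURE_SECURE_PROCESSING; the class and method names are hypothetical, and Tika may configure more than this.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.XMLConstants;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class SecureTransformerSketch {

    static Transformer newTransformer() throws TransformerConfigurationException {
        TransformerFactory factory = TransformerFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        return factory.newTransformer(); // identity transformer
    }

    // Runs the identity transform over a string and returns the serialized result.
    static String roundTrip(String xml) throws TransformerException {
        StringWriter out = new StringWriter();
        newTransformer().transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("<a>hi</a>"));
    }
}
```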
    • buildDOM

      public static Document buildDOM(InputStream is, ParseContext context) throws TikaException, IOException, SAXException
      This checks context for a user specified DocumentBuilder. If one is not found, this reuses a DocumentBuilder from the pool.
      Parameters:
      is - InputStream to parse
      context - context to use
      Returns:
      a document
      Throws:
      TikaException
      IOException
      SAXException
      Since:
      Apache Tika 1.19
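
      The "user-specified builder first, otherwise reuse one from the pool" pattern might look roughly like the sketch below. Everything here is hypothetical: the class, the nullable userBuilder parameter standing in for a ParseContext lookup, and the use of an ArrayBlockingQueue as the pool. Tika's actual pool is likely more involved.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

final class DomBuilderPoolSketch {
    private final BlockingQueue<DocumentBuilder> pool;

    DomBuilderPoolSketch(int size) throws ParserConfigurationException {
        pool = new ArrayBlockingQueue<>(size);
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        for (int i = 0; i < size; i++) {
            pool.add(factory.newDocumentBuilder());
        }
    }

    // userBuilder is a stand-in for "the builder found in the context"; null means none.
    Document buildDom(InputStream is, DocumentBuilder userBuilder)
            throws SAXException, IOException, InterruptedException {
        if (userBuilder != null) {
            return userBuilder.parse(is);
        }
        DocumentBuilder pooled = pool.take(); // blocks until a builder is free
        try {
            return pooled.parse(is);
        } finally {
            pooled.reset();     // clear per-parse state before reuse
            pool.offer(pooled); // always fits: we took one slot above
        }
    }

    int available() {
        return pool.size();
    }

    public static void main(String[] args) throws Exception {
        DomBuilderPoolSketch sketch = new DomBuilderPoolSketch(2);
        InputStream is = new ByteArrayInputStream("<doc/>".getBytes(StandardCharsets.UTF_8));
        System.out.println(sketch.buildDom(is, null).getDocumentElement().getNodeName());
    }
}
```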
    • buildDOM

      public static Document buildDOM(Reader reader, ParseContext context) throws TikaException, IOException, SAXException
      This checks context for a user specified DocumentBuilder. If one is not found, this reuses a DocumentBuilder from the pool.
      Parameters:
      reader - reader (character stream) to parse
      context - context to use
      Returns:
      a document
      Throws:
      TikaException
      IOException
      SAXException
      Since:
      Apache Tika 2.5
    • buildDOM

      public static Document buildDOM(Path path) throws TikaException, IOException, SAXException
      Builds a Document with a DocumentBuilder from the pool
      Parameters:
      path - path to parse
      Returns:
      a document
      Throws:
      TikaException
      IOException
      SAXException
      Since:
      Apache Tika 1.19.1
    • buildDOM

      public static Document buildDOM(String uriString) throws TikaException, IOException, SAXException
      Builds a Document with a DocumentBuilder from the pool
      Parameters:
      uriString - uriString to process
      Returns:
      a document
      Throws:
      TikaException
      IOException
      SAXException
      Since:
      Apache Tika 1.19.1
    • buildDOM

      public static Document buildDOM(InputStream is) throws TikaException, IOException, SAXException
      Builds a Document with a DocumentBuilder from the pool
      Returns:
      a document
      Throws:
      TikaException
      IOException
      SAXException
      Since:
      Apache Tika 1.19.1
    • parseSAX

      public static void parseSAX(InputStream is, ContentHandler contentHandler, ParseContext context) throws TikaException, IOException, SAXException
      This checks context for a user specified SAXParser. If one is not found, this reuses a SAXParser from the pool.
      Parameters:
      is - InputStream to parse
      contentHandler - handler to use; it is wrapped in an OfflineContentHandler as an extra layer of defense against external entity vulnerabilities
      context - context to use
      Throws:
      TikaException
      IOException
      SAXException
      Since:
      Apache Tika 1.19
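
      The idea of wrapping the caller's handler so that external entities are never fetched can be sketched with plain SAX. OfflineHandler below is a hypothetical stand-in written for this example, not Tika's OfflineContentHandler, and it forwards only a few callbacks for brevity.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class OfflineSaxSketch {

    // Forwards content events to the delegate and resolves every entity to empty content.
    static final class OfflineHandler extends DefaultHandler {
        private final ContentHandler delegate;

        OfflineHandler(ContentHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            return new InputSource(new StringReader(""));
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            delegate.startElement(uri, localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            delegate.endElement(uri, localName, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            delegate.characters(ch, start, length);
        }
    }

    static void parse(InputStream is, ContentHandler handler)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();
        parser.parse(is, new OfflineHandler(handler));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE a SYSTEM \"file:///no/such.dtd\"><a><b/><b/></a>";
        parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                System.out.println("start: " + localName);
            }
        });
    }
}
```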
    • parseSAX

      public static void parseSAX(Reader reader, ContentHandler contentHandler, ParseContext context) throws TikaException, IOException, SAXException
      This checks context for a user specified SAXParser. If one is not found, this reuses a SAXParser from the pool.
      Parameters:
      reader - reader (character stream) to parse
      contentHandler - handler to use; it is wrapped in an OfflineContentHandler as an extra layer of defense against external entity vulnerabilities
      context - context to use
      Throws:
      TikaException
      IOException
      SAXException
      Since:
      Apache Tika 2.5
    • getPoolSize

      public static int getPoolSize()
    • setPoolSize

      public static void setPoolSize(int poolSize) throws TikaException
      Sets the pool size for cached XML parsers. As a side effect, this locks the pool and rebuilds it from scratch with the most recent settings, such as MAX_ENTITY_EXPANSIONS.
      Parameters:
      poolSize - number of parsers to keep in the pool
      Throws:
      TikaException
      Since:
      Apache Tika 1.19
    • getMaxEntityExpansions

      public static int getMaxEntityExpansions()
    • setMaxEntityExpansions

      public static void setMaxEntityExpansions(int maxEntityExpansions)
      Sets the maximum number of entity expansions allowable in SAX/DOM/StAX parsing. NOTE: A value less than or equal to zero indicates no limit. This overrides the system property JAXP_ENTITY_EXPANSION_LIMIT_KEY and the DEFAULT_MAX_ENTITY_EXPANSIONS value for allowable entity expansions.

      NOTE: Pooled parsers do not pick up this setting on their own. The client must call setPoolSize(int) to rebuild the SAX and DOM parsers with it.

      Parameters:
      maxEntityExpansions - maximum number of allowable entity expansions
      Since:
      Apache Tika 1.19
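
      The interaction between the two setters (a new limit only reaches pooled parsers after the pool is rebuilt) can be modeled with a small, purely illustrative class. Nothing below is Tika code: PooledParser, the lock choice, and the numeric values are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class RebuildOnResizeSketch {

    // Stand-in for a pooled parser; it remembers the limit it was built with.
    static final class PooledParser {
        final int entityLimit;

        PooledParser(int entityLimit) {
            this.entityLimit = entityLimit;
        }
    }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private volatile int maxEntityExpansions;
    private List<PooledParser> pool = new ArrayList<>();

    RebuildOnResizeSketch(int poolSize, int maxEntityExpansions) {
        this.maxEntityExpansions = maxEntityExpansions;
        setPoolSize(poolSize);
    }

    // Only records the new value; existing pooled parsers are untouched.
    void setMaxEntityExpansions(int value) {
        maxEntityExpansions = value;
    }

    // Locks the pool and rebuilds every parser with the current settings.
    void setPoolSize(int poolSize) {
        lock.writeLock().lock();
        try {
            List<PooledParser> fresh = new ArrayList<>(poolSize);
            for (int i = 0; i < poolSize; i++) {
                fresh.add(new PooledParser(maxEntityExpansions));
            }
            pool = fresh;
        } finally {
            lock.writeLock().unlock();
        }
    }

    int getPoolSize() {
        lock.readLock().lock();
        try {
            return pool.size();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Limit carried by the first pooled parser (the pool must be non-empty).
    int pooledLimit() {
        lock.readLock().lock();
        try {
            return pool.get(0).entityLimit;
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        RebuildOnResizeSketch sketch = new RebuildOnResizeSketch(2, 20); // 20 is an arbitrary value
        sketch.setMaxEntityExpansions(5);
        System.out.println("before rebuild: " + sketch.pooledLimit());
        sketch.setPoolSize(sketch.getPoolSize());
        System.out.println("after rebuild: " + sketch.pooledLimit());
    }
}
```

      In this model, calling setPoolSize with the current size is enough to apply a changed limit, which matches the call sequence the note above asks clients to follow.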
    • getAttrValue

      public static String getAttrValue(String localName, Attributes atts)
      Parameters:
      localName - local name of the attribute to look up
      atts - attributes to search
      Returns:
      attribute value with that local name or null if not found
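
      A lookup by local name over SAX Attributes could be written as below. This is a sketch of the documented behavior (value for the matching local name, or null), not the library's source; the class and method names are hypothetical.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

final class AttrLookupSketch {

    // Returns the value of the first attribute whose local name matches, or null.
    static String attrValue(String localName, Attributes atts) {
        for (int i = 0; i < atts.getLength(); i++) {
            if (localName.equals(atts.getLocalName(i))) {
                return atts.getValue(i);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        AttributesImpl atts = new AttributesImpl();
        // Namespace URI, local name, qualified name, type, value.
        atts.addAttribute("urn:example", "id", "ex:id", "CDATA", "42");
        System.out.println(attrValue("id", atts));      // prints 42
        System.out.println(attrValue("missing", atts)); // prints null
    }
}
```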
    • getDocumentBuilder

      public static DocumentBuilder getDocumentBuilder(ParseContext context) throws TikaException
      Returns the DOM builder specified in this parsing context. If a builder is not explicitly specified, then a builder instance is created and returned. The builder instance is configured to apply an IGNORING_SAX_ENTITY_RESOLVER, and it sets the ErrorHandler to null. Consider using buildDOM(InputStream, ParseContext) instead for more efficient reuse of document builders.
      Returns:
      DOM Builder
      Throws:
      TikaException
    • getXMLInputFactory

      public static XMLInputFactory getXMLInputFactory(ParseContext context)
      Returns the StAX input factory specified in this parsing context. If a factory is not explicitly specified, then a default factory instance is created and returned. The default factory instance is configured to be namespace-aware and to apply reasonable security using the IGNORING_STAX_ENTITY_RESOLVER.
      Returns:
      StAX input factory
    • getTransformer

      public static Transformer getTransformer(ParseContext context) throws TikaException
      Returns the transformer specified in this parsing context.

      If a transformer is not explicitly specified, then a default transformer instance is created and returned. The default transformer instance is configured to use secure XML processing.

      Returns:
      Transformer
      Throws:
      TikaException - when the transformer cannot be created