Class ParsingExample


  • public class ParsingExample
    extends Object
    • Constructor Detail

      • ParsingExample

        public ParsingExample()
    • Method Detail

      • parseToStringExample

        public String parseToStringExample()
                                    throws IOException,
                                           SAXException,
                                           TikaException
        Example of how to use Tika's parseToString method to parse the content of a file, and return any text found.

        Note: Tika.parseToString() will extract content from the outer container document and any embedded/attached documents.
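
        A minimal sketch of the call described above, assuming tika-core and a parser bundle are on the classpath. The class name ParseToStringSketch, the helper extractText and the in-memory sample input are hypothetical; the example method itself reads a file from the project's resources.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class ParseToStringSketch {

    // Hypothetical helper: extracts any text Tika can find in the given bytes.
    public static String extractText(byte[] document) throws IOException, TikaException {
        // The Tika facade detects the type and picks a parser internally.
        Tika tika = new Tika();
        try (InputStream stream = new ByteArrayInputStream(document)) {
            return tika.parseToString(stream);
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a plain-text sample stands in for a real file.
        byte[] sample = "Hello from a plain-text file".getBytes(StandardCharsets.UTF_8);
        System.out.println(extractText(sample));
    }
}
```

        The exact text returned depends on which parsers are available at runtime, so no expected output is shown here.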

        Returns:
        The content of a file.
        Throws:
        IOException
        SAXException
        TikaException
      • parseExample

        public String parseExample()
                            throws IOException,
                                   SAXException,
                                   TikaException
        Example of how to use Tika to parse a file when you do not know its file type ahead of time.

        AutoDetectParser attempts to discover the file's type automatically, then calls the Parser built for that file type.

        The input is the stream to be parsed by the Parser. In this example, the stream is opened on a file from the resources folder of this project.

        Handlers are used to get the exact information you want out of the host of information gathered by Parsers. The body content handler, intuitively, extracts everything that would go between HTML body tags.

        The Metadata object will be filled by the Parser with the metadata it discovers about the file being parsed.
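
        The three pieces described above (parser, handler, metadata) might be wired together roughly as follows. This is a sketch, not the example's actual source: the class name AutoDetectSketch, the helper parseBody and the in-memory input are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class AutoDetectSketch {

    // Hypothetical helper: returns the body text and fills the supplied Metadata.
    public static String parseBody(byte[] document, Metadata metadata)
            throws IOException, SAXException, TikaException {
        // Detects the file type, then delegates to the matching Parser.
        AutoDetectParser parser = new AutoDetectParser();
        // Collects only what would appear between HTML body tags.
        BodyContentHandler handler = new BodyContentHandler();
        try (InputStream stream = new ByteArrayInputStream(document)) {
            parser.parse(stream, handler, metadata);
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a plain-text sample stands in for a file from the resources folder.
        byte[] sample = "Some body text".getBytes(StandardCharsets.UTF_8);
        Metadata metadata = new Metadata();
        System.out.println(parseBody(sample, metadata));
        // Print whatever metadata the parser discovered.
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}
```

        Passing the Metadata object in from the caller keeps both results available: the returned string holds the content, and the same Metadata instance holds what was discovered about the file.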

        Note: This example will extract content from the outer document and all embedded documents. However, if you choose to pass your own ParseContext, make sure to set a Parser on it; otherwise embedded content will not be parsed.
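
        If a ParseContext is supplied, registering a Parser on it could look like the sketch below. The class name ParseContextSketch and the helper parseWithContext are hypothetical, and reusing the same AutoDetectParser for embedded documents is one reasonable choice rather than the only one.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class ParseContextSketch {

    // Hypothetical helper: parses with an explicit ParseContext.
    public static String parseWithContext(byte[] document)
            throws IOException, SAXException, TikaException {
        Parser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        // Register the parser to use for embedded documents; without this,
        // embedded content is skipped when a custom context is passed.
        context.set(Parser.class, parser);
        BodyContentHandler handler = new BodyContentHandler();
        try (InputStream stream = new ByteArrayInputStream(document)) {
            parser.parse(stream, handler, new Metadata(), context);
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] sample = "Outer document text".getBytes(StandardCharsets.UTF_8);
        System.out.println(parseWithContext(sample));
    }
}
```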

        Returns:
        The content of a file.
        Throws:
        IOException
        SAXException
        TikaException
      • recursiveParserWrapperExample

        public List<Metadata> recursiveParserWrapperExample()
                                                     throws IOException,
                                                            SAXException,
                                                            TikaException
        For documents that may contain embedded documents, it can be helpful to create a list of Metadata objects: one for the container document and one for each embedded document. This allows easy access to both the extracted content and the metadata of each embedded document. Note that many document formats can contain embedded documents, including traditional container formats (zip, tar and others) as well as common office document formats such as MSWord, MSExcel, MSPowerPoint, RTF, PDF, MSG and several others.

        The "content" format is determined by the ContentHandlerFactory, and the content is stored in each Metadata object under the property org.apache.tika.parser.RecursiveParserWrapper#TIKA_CONTENT.

        The drawback to the RecursiveParserWrapper is that it caches metadata and contents in memory. This should not be used on files whose contents are too big to be handled in memory.
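
        The following sketch shows one way the wrapper might be used. It assumes an older Tika 1.x release in which RecursiveParserWrapper takes a ContentHandlerFactory in its constructor and exposes getMetadata() and TIKA_CONTENT, as the description above implies; later releases changed this API. The class name RecursiveSketch and the helper parseRecursively are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.apache.tika.sax.ContentHandlerFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class RecursiveSketch {

    // Hypothetical helper: one Metadata per container or embedded document.
    public static List<Metadata> parseRecursively(byte[] document)
            throws IOException, SAXException, TikaException {
        Parser parser = new AutoDetectParser();
        // TEXT selects plain-text content; -1 disables the write limit.
        ContentHandlerFactory factory = new BasicContentHandlerFactory(
                BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1);
        // Assumption: pre-1.19 constructor that accepts the factory directly.
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
        try (InputStream stream = new ByteArrayInputStream(document)) {
            // The handler passed here is ignored in favor of the factory's handlers.
            wrapper.parse(stream, new DefaultHandler(), new Metadata(), new ParseContext());
        }
        // Everything is held in memory until this list is returned.
        return wrapper.getMetadata();
    }

    public static void main(String[] args) throws Exception {
        byte[] sample = "Container text".getBytes(StandardCharsets.UTF_8);
        for (Metadata metadata : parseRecursively(sample)) {
            // The extracted content sits alongside the ordinary metadata.
            System.out.println(metadata.get(RecursiveParserWrapper.TIKA_CONTENT));
        }
    }
}
```

        Because the wrapper accumulates every Metadata object before returning, the memory caveat above applies to the whole list, not just to the container document.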

        Returns:
        a list of Metadata objects, one for the container file and one for each embedded file
        Throws:
        IOException
        SAXException
        TikaException
      • serializedRecursiveParserWrapperExample

        public String serializedRecursiveParserWrapperExample()
                                                       throws IOException,
                                                              SAXException,
                                                              TikaException
        Tika includes a simple JSON serializer for a list of Metadata objects, JsonMetadataList. That class also includes a deserializer to convert from JSON back to a List of Metadata objects.

        This functionality is also available in tika-app's GUI, and with the -J option on tika-app's command line. For tika-server users, the "rmeta" service returns this format.
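
        A round trip through JsonMetadataList might look like the sketch below, assuming the tika-serialization module is on the classpath. The class name JsonMetadataListSketch, both helper methods and the "title" key are hypothetical. The helpers declare throws Exception because the checked exceptions declared by toJson and fromJson differ between Tika versions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.serialization.JsonMetadataList;

public class JsonMetadataListSketch {

    // Hypothetical helper: serializes a list of Metadata objects to a JSON string.
    public static String toJson(List<Metadata> metadataList) throws Exception {
        StringWriter writer = new StringWriter();
        JsonMetadataList.toJson(metadataList, writer);
        return writer.toString();
    }

    // Hypothetical helper: converts JSON back into a list of Metadata objects.
    public static List<Metadata> fromJson(String json) throws Exception {
        return JsonMetadataList.fromJson(new StringReader(json));
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a hand-built Metadata stands in for real parser output.
        Metadata container = new Metadata();
        container.set("title", "example");
        List<Metadata> original = new ArrayList<>();
        original.add(container);

        String json = toJson(original);
        System.out.println(json);

        List<Metadata> restored = fromJson(json);
        System.out.println(restored.get(0).get("title"));
    }
}
```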

        Returns:
        a JSON representation of a list of Metadata objects
        Throws:
        IOException
        SAXException
        TikaException