public class ParsingExample extends Object
This example shows how to extract content from the outer document and all embedded documents.
Example of how to use Tika to parse a file when you do not know its file type ahead of time.
Example of how to use Tika's parseToString method to parse the content of a file, and return any text found.
For documents that may contain embedded documents, it might be helpful to create a list of metadata objects, one for the container document and one for each embedded document.
We include a simple JSON serializer for a list of metadata with
public String parseToStringExample() throws IOException, SAXException, TikaException
Note: Tika.parseToString() will extract content from the outer container document and any embedded/attached documents.
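As a minimal sketch of the parseToString approach described above, the following assumes tika-core and tika-parsers are on the classpath. The class name ParseToStringSketch and the helper extractText are hypothetical and not part of ParsingExample; the in-memory sample stands in for a real file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical class name, used only for illustration.
public class ParseToStringSketch {

    // Hands the stream to the Tika facade and returns whatever text it extracts.
    static String extractText(InputStream stream) throws IOException, TikaException {
        Tika tika = new Tika();
        return tika.parseToString(stream);
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a short plain-text sample stands in for a real document.
        byte[] sample = "Hello from a plain-text document.".getBytes(StandardCharsets.UTF_8);
        try (InputStream stream = new ByteArrayInputStream(sample)) {
            System.out.println(extractText(stream));
        }
    }
}
```

This is the shortest route to text, but it returns only a String. When metadata or finer control over the output is needed, the parser-based methods on this page are likely a better fit.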
public String parseExample() throws IOException, SAXException, TikaException
AutoDetectParser attempts to discover the file's type automatically, then calls the Parser built for that file type.
The stream to be parsed by the Parser. In this case, we get a file from the resources folder of this project.
Handlers select the information you want from everything a Parser gathers. The body content handler extracts everything that would go between HTML body tags.
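To illustrate the handler idea without depending on Tika, here is a simplified sketch using only the JDK's SAX classes. BodyTextHandler is a hypothetical stand-in and not Tika's BodyContentHandler; it keeps only the character data that appears inside a body element of an XHTML-like input.

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name, used only for illustration.
public class BodyTextSketch {

    // Simplified stand-in for a body content handler: collects text inside <body>.
    static class BodyTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private int bodyDepth = 0; // 0 means we are outside <body>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (bodyDepth > 0 || "body".equals(qName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (bodyDepth > 0) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (bodyDepth > 0) {
                text.append(ch, start, length);
            }
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    // Parses the given markup and returns only the text found inside <body>.
    static String bodyText(String xhtml) throws Exception {
        BodyTextHandler handler = new BodyTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head><title>Ignored</title></head>"
                + "<body><p>Kept text</p></body></html>";
        System.out.println(bodyText(xhtml)); // prints "Kept text"
    }
}
```

The title text is dropped because it arrives before the body element opens. The same filtering principle applies when a parser streams SAX events into a handler.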
The Metadata object will be filled by the Parser with metadata discovered about the file being parsed.
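The four pieces described above (parser, stream, handler, metadata) can be wired together roughly as follows. This is a sketch that assumes tika-core and tika-parsers are on the classpath; the class name AutoDetectSketch, the helper parseInto, and the in-memory sample are hypothetical, whereas the original example reads a file from the project's resources folder.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical class name, used only for illustration.
public class AutoDetectSketch {

    // Parses the stream with type detection; text goes to the handler,
    // discovered metadata is returned to the caller.
    static Metadata parseInto(InputStream stream, BodyContentHandler handler)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        parser.parse(stream, handler, metadata);
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a plain-text sample stands in for a file of unknown type.
        byte[] sample = "Hello, Tika!".getBytes(StandardCharsets.UTF_8);
        BodyContentHandler handler = new BodyContentHandler();
        try (InputStream stream = new ByteArrayInputStream(sample)) {
            Metadata metadata = parseInto(stream, handler);
            for (String name : metadata.names()) {
                System.out.println(name + ": " + metadata.get(name));
            }
        }
        System.out.println(handler.toString());
    }
}
```

Printing every metadata name is a quick way to see what the selected parser discovered, since the set of keys varies by file type.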
Note: This example will extract content from the outer document and all
embedded documents. However, if you choose to use a
make sure to set a
Parser or else embedded content will not be
public String parseNoEmbeddedExample() throws IOException, SAXException, TikaException
ParseContext that does not contain a
public String parseEmbeddedExample() throws IOException, SAXException, TikaException
Parser in the
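The difference between the two methods above appears to come down to whether the ParseContext carries a Parser for embedded documents. The sketch below shows both variants behind one hypothetical helper, parse, with a boolean switch. It assumes tika-core and tika-parsers are on the classpath, and the class name EmbeddedToggleSketch is invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical class name, used only for illustration.
public class EmbeddedToggleSketch {

    // Parses the stream; registers a Parser in the context only when
    // embedded documents should be parsed as well.
    static String parse(InputStream stream, boolean includeEmbedded)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        ParseContext context = new ParseContext();
        if (includeEmbedded) {
            // The same parser is reused for any embedded documents.
            context.set(Parser.class, parser);
        }
        parser.parse(stream, handler, new Metadata(), context);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: this sample has no attachments, so both calls should
        // yield the same text; a container file would show the difference.
        byte[] sample = "No attachments here.".getBytes(StandardCharsets.UTF_8);
        try (InputStream stream = new ByteArrayInputStream(sample)) {
            System.out.println(parse(stream, true));
        }
        try (InputStream stream = new ByteArrayInputStream(sample)) {
            System.out.println(parse(stream, false));
        }
    }
}
```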
public List<Metadata> recursiveParserWrapperExample() throws IOException, SAXException, TikaException
The "content" format is determined by the ContentHandlerFactory, and
the content is stored in
The drawback to the RecursiveParserWrapper is that it caches metadata and contents in memory. This should not be used on files whose contents are too big to be handled in memory.
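A rough sketch of the RecursiveParserWrapper usage described above follows. It assumes the Tika 1.x API contemporary with this page (a constructor taking a Parser and a ContentHandlerFactory, and a getMetadata() accessor); later Tika releases changed this class, so treat the calls as version-specific. The class name RecursiveSketch and the helper parseAll are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name; assumes the Tika 1.x RecursiveParserWrapper API.
public class RecursiveSketch {

    // Returns one Metadata object per document: the container first,
    // then one for each embedded document.
    static List<Metadata> parseAll(InputStream stream)
            throws IOException, SAXException, TikaException {
        Parser parser = new AutoDetectParser();
        // TEXT selects plain-text content; -1 disables the write limit.
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser,
                new BasicContentHandlerFactory(
                        BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));
        wrapper.parse(stream, new DefaultHandler(), new Metadata(), new ParseContext());
        return wrapper.getMetadata();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a plain-text sample, so only the container entry is expected.
        byte[] sample = "Container only.".getBytes(StandardCharsets.UTF_8);
        try (InputStream stream = new ByteArrayInputStream(sample)) {
            for (Metadata metadata : parseAll(stream)) {
                System.out.println(metadata.get(RecursiveParserWrapper.TIKA_CONTENT));
            }
        }
    }
}
```

Because every Metadata object and its content stay in the returned list, memory use grows with the number and size of embedded documents, which is the drawback noted above.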
public String serializedRecursiveParserWrapperExample() throws IOException, SAXException, TikaException
That class also includes a deserializer to convert from JSON
back to a List
This functionality is also available in tika-app's GUI, and with the -J option on tika-app's command line. For tika-server users, the "rmeta" service returns this format.
Copyright © 2007–2015 The Apache Software Foundation. All rights reserved.