public class ParsingExample extends Object
| Constructor and Description |
|---|
| ParsingExample() |
| Modifier and Type | Method and Description |
|---|---|
| List<Path> | extractEmbeddedDocumentsExample(Path outputPath) |
| String | parseEmbeddedExample(): This example shows how to extract content from the outer document and all embedded documents. |
| String | parseExample(): Example of how to use Tika to parse a file when you do not know its file type ahead of time. |
| String | parseNoEmbeddedExample(): If you don't want content from embedded documents, send in a ParseContext that contains an EmptyParser. |
| String | parseToStringExample(): Example of how to use Tika's parseToString method to parse the content of a file and return any text found. |
| List<Metadata> | recursiveParserWrapperExample(): For documents that may contain embedded documents, it might be helpful to create a list of metadata objects, one for the container document and one for each embedded document. |
| String | serializedRecursiveParserWrapperExample(): We include a simple JSON serializer for a list of metadata with JsonMetadataList. |
public String parseToStringExample() throws IOException, SAXException, TikaException
Note: Tika.parseToString() will extract content from the outer container document and any embedded/attached documents.
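The call this method demonstrates can be sketched roughly as follows. This is a minimal illustration, not the body of parseToStringExample(): the class name `ParseToStringSketch`, the helper `extractText`, and the in-memory sample input are all hypothetical, and the sketch assumes tika-core plus a parser module that handles plain text are on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class ParseToStringSketch {

    // Hypothetical helper: returns whatever text Tika finds in the given bytes.
    static String extractText(byte[] document) throws IOException, TikaException {
        // The Tika facade detects the type and picks a parser internally.
        Tika tika = new Tika();
        try (InputStream stream = new ByteArrayInputStream(document)) {
            return tika.parseToString(stream);
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a small in-memory text document stands in for a real file.
        byte[] sample = "Hello from a plain-text document".getBytes(StandardCharsets.UTF_8);
        System.out.println(extractText(sample));
    }
}
```

The facade hides the parser, handler, and metadata objects, which makes it convenient when only the text is needed.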
Throws: IOException, SAXException, TikaException
public String parseExample() throws IOException, SAXException, TikaException
AutoDetectParser attempts to discover the file's type automatically, then calls the exact Parser built for that file type.
The stream to be parsed by the Parser. In this case, we get a file from the resources folder of this project.
Handlers are used to get the exact information you want out of the host of information gathered by Parsers. The body content handler, intuitively, extracts everything that would go between HTML body tags.
The Metadata object will be filled by the Parser with metadata discovered about the file being parsed.
Note: This example will extract content from the outer document and all
embedded documents. However, if you choose to use a ParseContext,
make sure to set a Parser or else embedded content will not be
parsed.
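The four pieces described above (parser, stream, handler, metadata) fit together roughly as in the sketch below. It is an illustration under stated assumptions rather than the actual body of parseExample(): the class name `AutoDetectSketch`, the helper `parseBody`, and the in-memory input are hypothetical, and a parser module for plain text is assumed to be on the classpath. It also shows the point made in the note: because a ParseContext is passed, a Parser is registered in it so that embedded documents are still parsed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class AutoDetectSketch {

    // Hypothetical helper: parses the bytes, fills `metadata`, returns the body text.
    static String parseBody(byte[] document, Metadata metadata)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();       // detects the type, then delegates
        BodyContentHandler handler = new BodyContentHandler();  // collects the body text
        ParseContext context = new ParseContext();
        // Register a Parser in the context so embedded documents are parsed too.
        context.set(Parser.class, parser);
        try (InputStream stream = new ByteArrayInputStream(document)) {
            parser.parse(stream, handler, metadata, context);
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: an in-memory text document stands in for a resource file.
        byte[] sample = "Hello from Tika".getBytes(StandardCharsets.UTF_8);
        Metadata metadata = new Metadata();
        System.out.println(parseBody(sample, metadata));
        System.out.println(metadata.get("Content-Type"));
    }
}
```

Passing the Metadata object in from the caller keeps both results available: the returned string holds the body text, and the same Metadata instance holds whatever the parser discovered.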
Throws: IOException, SAXException, TikaException
public String parseNoEmbeddedExample() throws IOException, SAXException, TikaException
If you don't want content from embedded documents, send in a ParseContext that contains an EmptyParser.
Throws: IOException, SAXException, TikaException
public String parseEmbeddedExample() throws IOException, SAXException, TikaException
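This example shows how to extract content from the outer document and all embedded documents. The key is to set a Parser in the ParseContext.
Throws: IOException, SAXException, TikaException
public List<Metadata> recursiveParserWrapperExample() throws IOException, SAXException, TikaException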
The "content" format is determined by the ContentHandlerFactory, and
the content is stored in org.apache.tika.parser.RecursiveParserWrapper#TIKA_CONTENT
The drawback to the RecursiveParserWrapper is that it caches metadata and contents in memory. This should not be used on files whose contents are too big to be handled in memory.
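A rough sketch of collecting one Metadata object per document follows. It assumes the Tika 2.x API, where a RecursiveParserWrapperHandler gathers the results and the content key is exposed as `TikaCoreProperties.TIKA_CONTENT`; older releases differ. The class name `RecursiveWrapperSketch`, the helper `parseAll`, and the sample input are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.apache.tika.sax.ContentHandlerFactory;
import org.apache.tika.sax.RecursiveParserWrapperHandler;
import org.xml.sax.SAXException;

public class RecursiveWrapperSketch {

    // Hypothetical helper: one Metadata per document, container first.
    static List<Metadata> parseAll(byte[] document)
            throws IOException, SAXException, TikaException {
        // The factory decides the "content" format; TEXT is chosen here as an assumption.
        // A write limit of -1 means no limit, so everything is held in memory.
        ContentHandlerFactory factory = new BasicContentHandlerFactory(
                BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1);
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(new AutoDetectParser());
        RecursiveParserWrapperHandler handler = new RecursiveParserWrapperHandler(factory);
        try (InputStream stream = new ByteArrayInputStream(document)) {
            wrapper.parse(stream, handler, new Metadata(), new ParseContext());
        }
        return handler.getMetadataList();
    }

    public static void main(String[] args) throws Exception {
        byte[] sample = "Hello from Tika".getBytes(StandardCharsets.UTF_8);
        for (Metadata metadata : parseAll(sample)) {
            // Each entry carries its own extracted content alongside its metadata.
            System.out.println(metadata.get("Content-Type"));
            System.out.println(metadata.get(TikaCoreProperties.TIKA_CONTENT));
        }
    }
}
```

For a document with no attachments the list holds a single entry for the container itself. Because every entry keeps its content in memory, a bounded write limit may be the safer choice for large inputs.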
Throws: IOException, SAXException, TikaException
public String serializedRecursiveParserWrapperExample() throws IOException, SAXException, TikaException
We include a simple JSON serializer for a list of metadata with JsonMetadataList. That class also includes a deserializer to convert from JSON back to a List<Metadata>.
This functionality is also available in tika-app's GUI, and with the -J option on tika-app's command line. For tika-server users, there is the "rmeta" service that will return this format.
Throws: IOException, SAXException, TikaException
public List<Path> extractEmbeddedDocumentsExample(Path outputPath) throws IOException, SAXException, TikaException
Parameters: outputPath - output directory in which to place files
Throws: IOException, SAXException, TikaException
Copyright © 2007–2023 The Apache Software Foundation. All rights reserved.