public class ParsingExample extends Object
Constructor and Description |
---|
ParsingExample() |
Modifier and Type | Method and Description |
---|---|
List<Path> | extractEmbeddedDocumentsExample(Path outputPath) |
String | parseEmbeddedExample() This example shows how to extract content from the outer document and all embedded documents. |
String | parseExample() Example of how to use Tika to parse a file when you do not know its file type ahead of time. |
String | parseNoEmbeddedExample() If you don't want content from embedded documents, send in a ParseContext that contains an EmptyParser. |
String | parseToStringExample() Example of how to use Tika's parseToString method to parse the content of a file and return any text found. |
List<Metadata> | recursiveParserWrapperExample() For documents that may contain embedded documents, it might be helpful to create a list of metadata objects, one for the container document and one for each embedded document. |
String | serializedRecursiveParserWrapperExample() We include a simple JSON serializer for a list of metadata with JsonMetadataList. |
public String parseToStringExample() throws IOException, SAXException, TikaException
Note: Tika.parseToString() will extract content from the outer container document and any embedded/attached documents.
IOException
SAXException
TikaException
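A minimal sketch of the parseToString approach, assuming tika-core and tika-parsers are on the classpath. The class name `ParseToStringSketch` and the helper `extractText` are hypothetical, and the in-memory sample stands in for a real file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class ParseToStringSketch {

    // Hypothetical helper: returns whatever text Tika finds in the stream,
    // including text from embedded or attached documents.
    static String extractText(InputStream stream) throws IOException, TikaException {
        Tika tika = new Tika();
        return tika.parseToString(stream);
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a small plain-text sample instead of a resource file.
        byte[] sample = "hello tika".getBytes(StandardCharsets.UTF_8);
        System.out.println(extractText(new ByteArrayInputStream(sample)));
    }
}
```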
public String parseExample() throws IOException, SAXException, TikaException
AutoDetectParser attempts to discover the file's type automatically, then calls the exact Parser built for that file type.
The stream to be parsed by the Parser. In this case, we get a file from the resources folder of this project.
Handlers are used to get the exact information you want out of the host of information gathered by Parsers. The body content handler, intuitively, extracts everything that would go between HTML body tags.
The Metadata object will be filled by the Parser with Metadata discovered about the file being parsed.
Note: This example will extract content from the outer document and all embedded documents. However, if you choose to use a ParseContext, make sure to set a Parser or else embedded content will not be parsed.
IOException
SAXException
TikaException
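The pieces described above (parser, stream, handler, metadata) might fit together as in the following sketch. It assumes tika-core and tika-parsers are on the classpath. The class name `AutoDetectSketch` and the helper `parse` are hypothetical, and an in-memory sample replaces the resource file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class AutoDetectSketch {

    // Hypothetical helper: parses a stream of unknown type and returns the body text.
    // The caller's Metadata object is filled in by the parser as a side effect.
    static String parse(InputStream stream, Metadata metadata)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        // No ParseContext is passed, so embedded documents are parsed as well.
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();
        byte[] sample = "hello tika".getBytes(StandardCharsets.UTF_8);
        System.out.println(parse(new ByteArrayInputStream(sample), metadata));
        // Print every metadata key the parser discovered.
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}
```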
public String parseNoEmbeddedExample() throws IOException, SAXException, TikaException
If you don't want content from embedded documents, send in a ParseContext that contains an EmptyParser.
IOException
SAXException
TikaException
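As a rough illustration, registering an EmptyParser in the ParseContext could look like this. The sketch assumes tika-core and tika-parsers are on the classpath. `NoEmbeddedSketch` and `parseOuterOnly` are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.EmptyParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class NoEmbeddedSketch {

    // Hypothetical helper: returns text from the container document only.
    static String parseOuterOnly(InputStream stream)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        ParseContext context = new ParseContext();
        // Embedded documents are handed to the Parser registered in the context;
        // an EmptyParser produces no content for them.
        context.set(Parser.class, new EmptyParser());
        parser.parse(stream, handler, new Metadata(), context);
        return handler.toString();
    }
}
```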
public String parseEmbeddedExample() throws IOException, SAXException, TikaException
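This example shows how to extract content from the outer document and all embedded documents. The key is to set a Parser in the ParseContext.
IOException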
SAXException
TikaException
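A sketch of the opposite case, where the ParseContext carries a real Parser so that embedded documents are parsed too. It assumes tika-core and tika-parsers are on the classpath. `EmbeddedSketch` and `parseWithEmbedded` are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class EmbeddedSketch {

    // Hypothetical helper: returns text from the container and its embedded documents.
    static String parseWithEmbedded(InputStream stream)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        ParseContext context = new ParseContext();
        // Register the same auto-detecting parser for embedded documents.
        context.set(Parser.class, parser);
        parser.parse(stream, handler, new Metadata(), context);
        return handler.toString();
    }
}
```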
public List<Metadata> recursiveParserWrapperExample() throws IOException, SAXException, TikaException
The "content" format is determined by the ContentHandlerFactory, and the content is stored in RecursiveParserWrapper.TIKA_CONTENT.
The drawback to the RecursiveParserWrapper is that it caches metadata and contents in memory. This should not be used on files whose contents are too big to be handled in memory.
IOException
SAXException
TikaException
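One possible shape for this, assuming a Tika release that provides RecursiveParserWrapperHandler (1.19 or later, as far as I know) and tika-parsers on the classpath. `RecursiveSketch` and `parseRecursively` are hypothetical names, and the choice of a plain-text handler type is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.apache.tika.sax.RecursiveParserWrapperHandler;
import org.xml.sax.SAXException;

public class RecursiveSketch {

    // Hypothetical helper: one Metadata object for the container, then one per embedded document.
    static List<Metadata> parseRecursively(InputStream stream)
            throws IOException, SAXException, TikaException {
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(new AutoDetectParser());
        // The factory decides the "content" format; -1 means no write limit.
        RecursiveParserWrapperHandler handler = new RecursiveParserWrapperHandler(
                new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));
        wrapper.parse(stream, handler, new Metadata(), new ParseContext());
        // Everything is held in memory, so this is not suitable for very large files.
        return handler.getMetadataList();
    }

    public static void main(String[] args) throws Exception {
        byte[] sample = "hello tika".getBytes(StandardCharsets.UTF_8);
        List<Metadata> list = parseRecursively(new ByteArrayInputStream(sample));
        System.out.println("metadata objects: " + list.size());
    }
}
```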
public String serializedRecursiveParserWrapperExample() throws IOException, SAXException, TikaException
We include a simple JSON serializer for a list of metadata with JsonMetadataList. That class also includes a deserializer to convert from JSON back to a List<Metadata>.
This functionality is also available in tika-app's GUI, and with the -J option on tika-app's command line. For tika-server users, there is the "rmeta" service that will return this format.
IOException
SAXException
TikaException
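The round trip through JsonMetadataList might be sketched as follows. This assumes the tika-serialization module is on the classpath and that `toJson(List<Metadata>, Writer)` and `fromJson(Reader)` are available as static methods. The declared exception types differ between releases, so the hypothetical helpers simply declare `Exception`.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Collections;
import java.util.List;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.serialization.JsonMetadataList;

public class JsonMetadataListSketch {

    // Hypothetical helper: serializes a list of Metadata objects to a JSON string.
    static String toJson(List<Metadata> list) throws Exception {
        StringWriter writer = new StringWriter();
        JsonMetadataList.toJson(list, writer);
        return writer.toString();
    }

    // Hypothetical helper: converts the JSON back into Metadata objects.
    static List<Metadata> fromJson(String json) throws Exception {
        return JsonMetadataList.fromJson(new StringReader(json));
    }

    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();
        metadata.set("title", "sample"); // illustrative key and value
        String json = toJson(Collections.singletonList(metadata));
        System.out.println(json);
        System.out.println(fromJson(json).size());
    }
}
```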
public List<Path> extractEmbeddedDocumentsExample(Path outputPath) throws IOException, SAXException, TikaException
outputPath - output directory to place files
IOException
SAXException
TikaException
Copyright © 2007–2020 The Apache Software Foundation. All rights reserved.