Class ParsingExample
-
Constructor Summary
-
Method Summary
Method and Description
extractEmbeddedDocumentsExample(Path outputPath)
parseEmbeddedExample: This example shows how to extract content from the outer document and all embedded documents.
parseExample: Example of how to use Tika to parse a file when you do not know its file type ahead of time.
parseNoEmbeddedExample: If you don't want content from embedded documents, send in a ParseContext that contains an EmptyParser.
parseToStringExample: Example of how to use Tika's parseToString method to parse the content of a file, and return any text found.
recursiveParserWrapperExample(): For documents that may contain embedded documents, it might be helpful to create a list of metadata objects, one for the container document and one for each embedded document.
serializedRecursiveParserWrapperExample(): We include a simple JSON serializer for a list of metadata with JsonMetadataList.
-
Constructor Details
-
ParsingExample
public ParsingExample()
-
-
Method Details
-
parseToStringExample
Example of how to use Tika's parseToString method to parse the content of a file, and return any text found.
Note: Tika.parseToString() will extract content from the outer container document and any embedded/attached documents.
- Returns:
- The content of a file.
- Throws:
IOException
SAXException
TikaException
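
As a minimal sketch of the call described above: the class name `ParseToStringSketch`, the helper `extractText`, and the in-memory sample text are hypothetical and only stand in for a real file. It assumes tika-core plus a parser bundle (for example tika-parsers-standard) on the classpath; without any parsers installed the returned text may be empty.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical illustration class; not part of Tika itself.
final class ParseToStringSketch {

    // Detects the type of the stream and returns whatever text Tika extracts from it.
    static String extractText(InputStream stream) throws IOException, TikaException {
        Tika tika = new Tika();
        return tika.parseToString(stream);
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a short plain-text payload standing in for a real document.
        byte[] sample = "Hello from a plain-text document.".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(sample)) {
            System.out.println(extractText(in));
        }
    }
}
```

The facade is convenient for quick extraction; the handler-based examples later in this page give more control over what is collected.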
-
parseExample
Example of how to use Tika to parse a file when you do not know its file type ahead of time. AutoDetectParser attempts to discover the file's type automatically, then calls the exact Parser built for that file type.
The stream to be parsed by the Parser. In this case, we get a file from the resources folder of this project.
Handlers are used to get the exact information you want out of the host of information gathered by Parsers. The body content handler, intuitively, extracts everything that would go between HTML body tags.
The Metadata object will be filled by the Parser with Metadata discovered about the file being parsed.
Note: This example will extract content from the outer document and all embedded documents. However, if you choose to use a ParseContext, make sure to set a Parser in it, or else embedded content will not be parsed.
- Returns:
- The content of a file.
- Throws:
IOException
SAXException
TikaException
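
The three pieces described above (stream, handler, metadata) can be sketched as follows. The class name `AutoDetectSketch` and the helper `extractText` are hypothetical; the sketch assumes a parser bundle is on the classpath and uses the default BodyContentHandler, which limits how much text it collects.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical illustration class; not part of Tika itself.
final class AutoDetectSketch {

    static String extractText(InputStream stream)
            throws IOException, SAXException, TikaException {
        // Detects the file type, then delegates to the matching parser.
        AutoDetectParser parser = new AutoDetectParser();
        // Collects the text that would sit between HTML body tags.
        BodyContentHandler handler = new BodyContentHandler();
        // Filled in by the parser with whatever metadata it discovers.
        Metadata metadata = new Metadata();

        // No ParseContext is passed here, matching the note above.
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }
}
```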
-
parseNoEmbeddedExample
If you don't want content from embedded documents, send in a ParseContext that contains an EmptyParser.
- Returns:
- The content of a file.
- Throws:
IOException
SAXException
TikaException
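
A minimal sketch of that setup, assuming the same AutoDetectParser and BodyContentHandler as in the other examples; the class name `NoEmbeddedSketch` and the helper `extractOuterText` are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.EmptyParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical illustration class; not part of Tika itself.
final class NoEmbeddedSketch {

    static String extractOuterText(InputStream stream)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();

        // Registering EmptyParser as the Parser for embedded documents
        // means attachments are handed to a parser that emits nothing.
        ParseContext context = new ParseContext();
        context.set(Parser.class, EmptyParser.INSTANCE);

        parser.parse(stream, handler, metadata, context);
        return handler.toString();
    }
}
```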
-
parseEmbeddedExample
This example shows how to extract content from the outer document and all embedded documents. The key is to specify a Parser in the ParseContext.
- Returns:
- content, including from embedded documents
- Throws:
IOException
SAXException
TikaException
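
One way to do this, sketched under the assumption that the same AutoDetectParser should handle both the container and its attachments; the class name `EmbeddedSketch` and the helper `extractAllText` are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical illustration class; not part of Tika itself.
final class EmbeddedSketch {

    static String extractAllText(InputStream stream)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();

        // The Parser registered in the context is the one used for
        // embedded documents; here it is the outer parser itself.
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);

        parser.parse(stream, handler, metadata, context);
        return handler.toString();
    }
}
```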
-
recursiveParserWrapperExample
public List<Metadata> recursiveParserWrapperExample() throws IOException, SAXException, TikaException
For documents that may contain embedded documents, it might be helpful to create a list of metadata objects, one for the container document and one for each embedded document. This allows easy access to both the extracted content and the metadata of each embedded document. Note that many document formats can contain embedded documents, including traditional container formats (zip, tar and others) but also common office document formats, including MSWord, MSExcel, MSPowerPoint, RTF, PDF, MSG and several others.
The "content" format is determined by the ContentHandlerFactory, and the content is stored in org.apache.tika.parser.RecursiveParserWrapper#TIKA_CONTENT.
The drawback to the RecursiveParserWrapper is that it caches metadata and contents in memory. This should not be used on files whose contents are too big to be handled in memory.
- Returns:
- a list of Metadata objects, one for the container file and one for each embedded file
- Throws:
IOException
SAXException
TikaException
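
A rough sketch of this approach, assuming a Tika version in which RecursiveParserWrapper is driven through a RecursiveParserWrapperHandler (1.19 or later). The class name `RecursiveSketch`, both helper methods, and the choice of a TEXT handler with no write limit are illustrative assumptions. The literal key "X-TIKA:content" is likewise an assumption about how the content field is named; prefer the constant your Tika version provides.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.apache.tika.sax.ContentHandlerFactory;
import org.apache.tika.sax.RecursiveParserWrapperHandler;
import org.xml.sax.SAXException;

// Hypothetical illustration class; not part of Tika itself.
final class RecursiveSketch {

    static List<Metadata> parseToMetadataList(InputStream stream)
            throws IOException, SAXException, TikaException {
        // The factory decides the format of each document's content;
        // -1 disables the write limit, so everything is held in memory.
        ContentHandlerFactory factory = new BasicContentHandlerFactory(
                BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1);
        RecursiveParserWrapperHandler handler = new RecursiveParserWrapperHandler(factory);

        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(new AutoDetectParser());
        wrapper.parse(stream, handler, new Metadata(), new ParseContext());

        // One Metadata object per document: the container and each embedded file.
        return handler.getMetadataList();
    }

    static void printContents(List<Metadata> metadataList) {
        for (Metadata m : metadataList) {
            // Assumed key name for the extracted content field.
            System.out.println(m.get("X-TIKA:content"));
        }
    }
}
```

Because every document's content ends up in the returned list, a bounded write limit in the factory is one way to keep memory use in check for large inputs.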
-
serializedRecursiveParserWrapperExample
public String serializedRecursiveParserWrapperExample() throws IOException, SAXException, TikaException
We include a simple JSON serializer for a list of metadata with JsonMetadataList. That class also includes a deserializer to convert from JSON back to a List. This functionality is also available in tika-app's GUI, and with the -J option on tika-app's commandline. For tika-server users, there is the "rmeta" service that will return this format.
- Returns:
- a JSON representation of a list of Metadata objects
- Throws:
IOException
SAXException
TikaException
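
The serialization step might look like the sketch below. It assumes the tika-serialization module is on the classpath and that JsonMetadataList lives in the package shown, which is an assumption that holds for some Tika versions and may need adjusting for others. The class name `MetadataJsonSketch` and the hand-built sample entry are hypothetical.

```java
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;

import org.apache.tika.metadata.Metadata;
// Assumption: package name as found in some Tika releases; adjust to your version.
import org.apache.tika.metadata.serialization.JsonMetadataList;

// Hypothetical illustration class; not part of Tika itself.
final class MetadataJsonSketch {

    // Declared as "throws Exception" because the checked exception thrown by
    // toJson differs between Tika versions.
    static String toJson(List<Metadata> metadataList) throws Exception {
        StringWriter writer = new StringWriter();
        JsonMetadataList.toJson(metadataList, writer);
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a hand-built entry standing in for real parse output.
        Metadata sample = new Metadata();
        sample.set("title", "sample");

        List<Metadata> list = new ArrayList<>();
        list.add(sample);
        System.out.println(toJson(list));
    }
}
```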
-
extractEmbeddedDocumentsExample
public List<Path> extractEmbeddedDocumentsExample(Path outputPath) throws IOException, SAXException, TikaException
- Parameters:
outputPath - output directory in which to place the extracted files
- Returns:
- list of files created
- Throws:
IOException
SAXException
TikaException
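
The page does not describe how this method is implemented. One possible approach, shown here only as a sketch and not as the method's actual code, is to register a custom EmbeddedDocumentExtractor in the ParseContext and copy each embedded stream to the output directory. The class names `EmbeddedFileSaver` and `SavingExtractor`, the `embedded-N.bin` naming scheme, and the decision not to recurse into nested attachments are all assumptions.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

import org.apache.tika.exception.TikaException;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Hypothetical illustration class; not part of Tika itself.
final class EmbeddedFileSaver {

    // Writes every embedded document it is offered to a numbered file.
    static final class SavingExtractor implements EmbeddedDocumentExtractor {
        private final Path outputDir;
        private final List<Path> written = new ArrayList<>();

        SavingExtractor(Path outputDir) {
            this.outputDir = outputDir;
        }

        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true;
        }

        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler,
                                  Metadata metadata, boolean outputHtml)
                throws SAXException, IOException {
            // Generated names avoid trusting file names stored in the document.
            Path target = outputDir.resolve("embedded-" + written.size() + ".bin");
            Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
            written.add(target);
        }
    }

    static List<Path> extract(InputStream stream, Path outputDir)
            throws IOException, SAXException, TikaException {
        Files.createDirectories(outputDir);
        SavingExtractor extractor = new SavingExtractor(outputDir);

        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);
        context.set(EmbeddedDocumentExtractor.class, extractor);

        // -1 disables the handler's write limit for the container text.
        parser.parse(stream, new BodyContentHandler(-1), new Metadata(), context);
        return extractor.written;
    }
}
```

This sketch saves only the first level of attachments, since the extractor copies each stream instead of parsing it further.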
-