Class ParsingExample
- java.lang.Object
  - org.apache.tika.example.ParsingExample

public class ParsingExample extends Object

Constructor Summary
- ParsingExample()

Method Summary
- List<Path> extractEmbeddedDocumentsExample(Path outputPath)
- String parseEmbeddedExample(): This example shows how to extract content from the outer document and all embedded documents.
- String parseExample(): Example of how to use Tika to parse a file when you do not know its file type ahead of time.
- String parseNoEmbeddedExample(): If you don't want content from embedded documents, send in a ParseContext that contains an EmptyParser.
- String parseToStringExample(): Example of how to use Tika's parseToString method to parse the content of a file and return any text found.
- List<Metadata> recursiveParserWrapperExample(): For documents that may contain embedded documents, it might be helpful to create a list of metadata objects, one for the container document and one for each embedded document.
- String serializedRecursiveParserWrapperExample(): We include a simple JSON serializer for a list of metadata with JsonMetadataList.

Method Detail
-
parseToStringExample
public String parseToStringExample() throws IOException, SAXException, TikaException
Example of how to use Tika's parseToString method to parse the content of a file and return any text found.
Note: Tika.parseToString() extracts content from the outer container document and from any embedded or attached documents.
- Returns:
- The content of a file.
- Throws:
IOException
SAXException
TikaException
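
As a rough illustration of the parseToString approach described above, the following sketch wraps the Tika facade in a small helper. The class name ParseToStringSketch, the helper extractText and the in-memory sample are hypothetical and not part of ParsingExample; the sketch also assumes that tika-core and the standard parser modules are on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical class, not part of ParsingExample.
public class ParseToStringSketch {

    // Detects the type of the stream and returns the text that Tika extracts from it.
    public static String extractText(InputStream stream) throws IOException, TikaException {
        Tika tika = new Tika();
        return tika.parseToString(stream);
    }

    public static void main(String[] args) throws Exception {
        // An in-memory plain-text document stands in for a real file here.
        byte[] sample = "Hello from a plain text file".getBytes(StandardCharsets.UTF_8);
        try (InputStream stream = new ByteArrayInputStream(sample)) {
            System.out.println(extractText(stream));
        }
    }
}
```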
-
parseExample
public String parseExample() throws IOException, SAXException, TikaException
Example of how to use Tika to parse a file when you do not know its file type ahead of time. AutoDetectParser attempts to detect the file's type automatically, then calls the Parser built for that file type.
The stream is the input to be parsed by the Parser. In this case, we get a file from the resources folder of this project.
Handlers are used to get the exact information you want out of the host of information gathered by Parsers. The body content handler extracts everything that would go between HTML body tags.
The Metadata object will be filled by the Parser with the metadata it discovers about the file being parsed.
Note: This example extracts content from the outer document and all embedded documents. However, if you choose to use a ParseContext, make sure to set a Parser in it, or else embedded content will not be parsed.
- Returns:
- The content of a file.
- Throws:
IOException
SAXException
TikaException
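
The stream, handler and metadata roles described above can be sketched as follows. This is a minimal illustration rather than the body of parseExample: the class AutoDetectSketch and its parse helper are hypothetical names, and the sample input is an in-memory string instead of a project resource.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical class, not part of ParsingExample.
public class AutoDetectSketch {

    // Parses a stream of unknown type; the caller's Metadata object is filled as a side effect.
    public static String parse(InputStream stream, Metadata metadata)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        // Collects the text that would appear between the body tags of the XHTML output.
        BodyContentHandler handler = new BodyContentHandler();
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();
        byte[] sample = "Some plain text".getBytes(StandardCharsets.UTF_8);
        try (InputStream stream = new ByteArrayInputStream(sample)) {
            System.out.println(parse(stream, metadata));
        }
        // The detected media type is one of the values the parser records.
        System.out.println(metadata.get("Content-Type"));
    }
}
```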
-
parseNoEmbeddedExample
public String parseNoEmbeddedExample() throws IOException, SAXException, TikaException
If you don't want content from embedded documents, send in a ParseContext that contains an EmptyParser.
- Returns:
- The content of a file.
- Throws:
IOException
SAXException
TikaException
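
One way to set this up is sketched below, assuming the EmptyParser is registered in the ParseContext under Parser.class so that it is the parser used for embedded documents. The class NoEmbeddedSketch and its helper are hypothetical names, not the actual method body.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.EmptyParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical class, not part of ParsingExample.
public class NoEmbeddedSketch {

    // Returns the text of the container document only.
    public static String parseOuterOnly(InputStream stream)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();

        // Embedded documents are handed to the Parser found in the context;
        // an EmptyParser produces no content for them.
        ParseContext context = new ParseContext();
        context.set(Parser.class, new EmptyParser());

        parser.parse(stream, handler, new Metadata(), context);
        return handler.toString();
    }
}
```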
-
parseEmbeddedExample
public String parseEmbeddedExample() throws IOException, SAXException, TikaException
This example shows how to extract content from the outer document and all embedded documents. The key is to specify a Parser in the ParseContext.
- Returns:
- The content of a file, including content from embedded documents.
- Throws:
IOException
SAXException
TikaException
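
A minimal sketch of specifying a Parser in the ParseContext follows. It assumes the same AutoDetectParser is reused for embedded documents; the class EmbeddedSketch and its helper are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical class, not part of ParsingExample.
public class EmbeddedSketch {

    // Returns the text of the container document and of its embedded documents.
    public static String parseWithEmbedded(InputStream stream)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();

        // Registering the parser in the context makes it available for embedded documents.
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);

        parser.parse(stream, handler, new Metadata(), context);
        return handler.toString();
    }
}
```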
-
recursiveParserWrapperExample
public List<Metadata> recursiveParserWrapperExample() throws IOException, SAXException, TikaException
For documents that may contain embedded documents, it might be helpful to create a list of metadata objects, one for the container document and one for each embedded document. This allows easy access to both the extracted content and the metadata of each embedded document. Note that many document formats can contain embedded documents, including traditional container formats (zip, tar and others) but also common office document formats, including MSWord, MSExcel, MSPowerPoint, RTF, PDF, MSG and several others.
The "content" format is determined by the ContentHandlerFactory, and the content is stored in
org.apache.tika.parser.RecursiveParserWrapper#TIKA_CONTENT
The drawback of the RecursiveParserWrapper is that it caches metadata and contents in memory. It should not be used on files whose contents are too big to be handled in memory.
- Returns:
- a list of Metadata objects, one for the container file and one for each embedded file
- Throws:
IOException
SAXException
TikaException
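
The wrapper described above might be used roughly as follows. This sketch assumes a Tika release that provides RecursiveParserWrapperHandler (1.19 or later) and picks plain-text content with no write limit; the class RecursiveSketch and its helper are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.apache.tika.sax.RecursiveParserWrapperHandler;
import org.xml.sax.SAXException;

// Hypothetical class, not part of ParsingExample.
public class RecursiveSketch {

    // Returns one Metadata object for the container and one per embedded document.
    public static List<Metadata> parseRecursively(InputStream stream)
            throws IOException, SAXException, TikaException {
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(new AutoDetectParser());

        // The factory decides the content format; TEXT with -1 means plain text, no write limit.
        RecursiveParserWrapperHandler handler = new RecursiveParserWrapperHandler(
                new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));

        wrapper.parse(stream, handler, new Metadata(), new ParseContext());
        return handler.getMetadataList();
    }
}
```

Because every Metadata object and its content stay in memory until the list is returned, the unlimited write limit chosen here is only reasonable for small inputs.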
-
serializedRecursiveParserWrapperExample
public String serializedRecursiveParserWrapperExample() throws IOException, SAXException, TikaException
We include a simple JSON serializer for a list of metadata with JsonMetadataList. That class also includes a deserializer to convert from JSON back to a List. This functionality is also available in tika-app's GUI, and with the -J option on tika-app's command line. For tika-server users, there is the "rmeta" service that will return this format.
- Returns:
- a JSON representation of a list of Metadata objects
- Throws:
IOException
SAXException
TikaException
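
Serialization with JsonMetadataList can be sketched as below, assuming the tika-serialization module is on the classpath. The class JsonSketch, its helper and the hand-built Metadata entry are hypothetical and stand in for a list produced by the RecursiveParserWrapper.

```java
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.serialization.JsonMetadataList;

// Hypothetical class, not part of ParsingExample.
public class JsonSketch {

    // Writes the list as JSON into a string.
    // Declared as "throws Exception" because the checked exception type
    // of toJson differs between Tika versions.
    public static String toJson(List<Metadata> metadataList) throws Exception {
        StringWriter writer = new StringWriter();
        JsonMetadataList.toJson(metadataList, writer);
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        // A hand-built entry used only to show the output shape.
        Metadata metadata = new Metadata();
        metadata.set("title", "container");

        List<Metadata> list = new ArrayList<>();
        list.add(metadata);
        System.out.println(toJson(list));
    }
}
```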
-
extractEmbeddedDocumentsExample
public List<Path> extractEmbeddedDocumentsExample(Path outputPath) throws IOException, SAXException, TikaException
- Parameters:
outputPath - output directory in which to place files
- Returns:
- list of files created
- Throws:
IOException
SAXException
TikaException
-
-