ParsingExample (Apache Tika 1.13 API)

java.lang.Object
- org.apache.tika.example.ParsingExample

public class ParsingExample
extends Object

Constructor Summary

Constructors
Constructor and Description

ParsingExample()

Method Summary

Modifier and Type	Method and Description
`List<Path>`	`extractEmbeddedDocumentsExample(Path outputPath)`
`String`	`parseEmbeddedExample()` This example shows how to extract content from the outer document and all embedded documents.
`String`	`parseExample()` Example of how to use Tika to parse a file when you do not know its file type ahead of time.
`String`	`parseNoEmbeddedExample()` If you don't want content from embedded documents, send in a `ParseContext` that does not contain a `Parser`.
`String`	`parseToStringExample()` Example of how to use Tika's parseToString method to parse the content of a file, and return any text found.
`List<Metadata>`	`recursiveParserWrapperExample()` For documents that may contain embedded documents, it might be helpful to create a list of metadata objects, one for the container document and one for each embedded document.
`String`	`serializedRecursiveParserWrapperExample()` We include a simple JSON serializer for a list of metadata with `JsonMetadataList`.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - ParsingExample
```
public ParsingExample()
```
- Method Detail
  - parseToStringExample
```
public String parseToStringExample()
                            throws IOException,
                                   SAXException,
                                   TikaException
```
    Example of how to use Tika's parseToString method to parse the content of a file, and return any text found.
    Note: Tika.parseToString() will extract content from the outer container document and any embedded/attached documents.
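
    A minimal sketch of the call described above, assuming tika-core and tika-parsers are on the classpath. The class name `ParseToStringSketch`, the helper `extractText`, and the in-memory sample text are hypothetical stand-ins for the resource file that the real example reads.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

class ParseToStringSketch {

    // Hypothetical helper: returns all text that the Tika facade finds in the bytes.
    static String extractText(byte[] data) throws IOException, TikaException {
        Tika tika = new Tika();
        try (InputStream stream = new ByteArrayInputStream(data)) {
            // Detects the type, picks a parser, and returns the text as a String.
            return tika.parseToString(stream);
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumption: plain-text sample data instead of a file from the project resources.
        byte[] sample = "Hello from Tika".getBytes(StandardCharsets.UTF_8);
        System.out.println(extractText(sample).trim());
    }
}
```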
    
    Returns:
    
    The content of a file.
    
    Throws:
    
    IOException
    
    SAXException
    
    TikaException
  - parseExample
```
public String parseExample()
                    throws IOException,
                           SAXException,
                           TikaException
```
    Example of how to use Tika to parse a file when you do not know its file type ahead of time.
    AutoDetectParser attempts to discover the file's type automatically, then calls the Parser built for that file type.
    The stream is the input to be parsed by the Parser. In this case, it is a file from the resources folder of this project.
    Handlers select the information you want from everything the Parsers gather. The BodyContentHandler extracts everything that would go between HTML body tags.
    The Metadata object will be filled by the Parser with Metadata discovered about the file being parsed.
    Note: This example will extract content from the outer document and all embedded documents. However, if you choose to use a ParseContext, make sure to set a Parser or else embedded content will not be parsed.
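
    The pieces described above (parser, stream, handler, metadata) fit together roughly as follows. This is a sketch, assuming tika-core and tika-parsers are on the classpath. The class name `AutoDetectSketch`, the helper `parse`, and the in-memory sample are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

class AutoDetectSketch {

    // Hypothetical helper: parses bytes of unknown type and returns the body text.
    // The caller supplies the Metadata object so it can inspect what the Parser found.
    static String parse(byte[] data, Metadata metadata)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        try (InputStream stream = new ByteArrayInputStream(data)) {
            // The three-argument overload creates its own ParseContext.
            parser.parse(stream, handler, metadata);
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: plain-text sample data instead of a project resource file.
        Metadata metadata = new Metadata();
        String body = parse("Hello from Tika".getBytes(StandardCharsets.UTF_8), metadata);
        System.out.println(body.trim());
        System.out.println(metadata.get(Metadata.CONTENT_TYPE));
    }
}
```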
    
    Returns:
    
    The content of a file.
    
    Throws:
    
    IOException
    
    SAXException
    
    TikaException
  - parseNoEmbeddedExample
```
public String parseNoEmbeddedExample()
                              throws IOException,
                                     SAXException,
                                     TikaException
```
    If you don't want content from embedded documents, send in a ParseContext that does not contain a Parser.
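
    A sketch of that idea, assuming tika-core and tika-parsers are on the classpath. The class name `NoEmbeddedSketch`, both helpers, and the tiny in-memory zip are hypothetical. The zip stands in for any container document with embedded content.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

class NoEmbeddedSketch {

    // Hypothetical helper: builds a zip holding one text entry, to act as a container document.
    static byte[] zipWithTextEntry(String name, String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.write(text.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Parses only the outer document: the ParseContext is left empty, so no Parser
    // is registered for the embedded entries.
    static String parseOuterOnly(byte[] data)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        try (InputStream stream = new ByteArrayInputStream(data)) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] container = zipWithTextEntry("inner.txt", "embedded text");
        // The text of inner.txt should be absent from the printed output.
        System.out.println(parseOuterOnly(container));
    }
}
```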
    
    Returns:
    
    The content of a file.
    
    Throws:
    
    IOException
    
    SAXException
    
    TikaException
  - parseEmbeddedExample
```
public String parseEmbeddedExample()
                            throws IOException,
                                   SAXException,
                                   TikaException
```
    This example shows how to extract content from the outer document and all embedded documents. The key is to specify a Parser in the ParseContext.
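
    A sketch of registering the Parser in the ParseContext, assuming tika-core and tika-parsers are on the classpath. The class name `EmbeddedSketch`, both helpers, and the in-memory zip used as a container document are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

class EmbeddedSketch {

    // Hypothetical helper: builds a zip holding one text entry, to act as a container document.
    static byte[] zipWithTextEntry(String name, String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.write(text.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Parses the outer document and its embedded documents.
    static String parseWithEmbedded(byte[] data)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        // The key step: register the Parser that should handle embedded documents.
        context.set(Parser.class, parser);
        try (InputStream stream = new ByteArrayInputStream(data)) {
            parser.parse(stream, handler, metadata, context);
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] container = zipWithTextEntry("inner.txt", "embedded text");
        System.out.println(parseWithEmbedded(container));
    }
}
```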
    
    Returns:
    
    content, including from embedded documents
    
    Throws:
    
    IOException
    
    SAXException
    
    TikaException
  - recursiveParserWrapperExample
```
public List<Metadata> recursiveParserWrapperExample()
                                             throws IOException,
                                                    SAXException,
                                                    TikaException
```
    For documents that may contain embedded documents, it might be helpful to create a list of metadata objects, one for the container document and one for each embedded document. This allows easy access to both the extracted content and the metadata of each embedded document. Note that many document formats can contain embedded documents, including traditional container formats (zip, tar and others) but also common office document formats, including MSWord, MSExcel, MSPowerPoint, RTF, PDF, MSG and several others.
    The "content" format is determined by the ContentHandlerFactory, and the content is stored in RecursiveParserWrapper.TIKA_CONTENT.
    The drawback to the RecursiveParserWrapper is that it caches metadata and contents in memory. This should not be used on files whose contents are too big to be handled in memory.
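
    A sketch of the wrapper as it exists in the 1.13 API, assuming tika-core and tika-parsers are on the classpath. The class name `RecursiveWrapperSketch`, both helpers, and the in-memory zip are hypothetical. Plain text is chosen as the content format here only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class RecursiveWrapperSketch {

    // Hypothetical helper: builds a zip holding one text entry, to act as a container document.
    static byte[] zipWithTextEntry(String name, String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.write(text.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Returns one Metadata object for the container and one per embedded document.
    static List<Metadata> parseRecursively(byte[] data)
            throws IOException, SAXException, TikaException {
        // The factory decides the content format; -1 means no write limit.
        BasicContentHandlerFactory factory = new BasicContentHandlerFactory(
                BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1);
        RecursiveParserWrapper wrapper =
                new RecursiveParserWrapper(new AutoDetectParser(), factory);
        try (InputStream stream = new ByteArrayInputStream(data)) {
            // The handler passed here is ignored by the wrapper; content is
            // collected per document and stored in each Metadata object.
            wrapper.parse(stream, new DefaultHandler(), new Metadata(), new ParseContext());
        }
        return wrapper.getMetadata();
    }

    public static void main(String[] args) throws Exception {
        for (Metadata m : parseRecursively(zipWithTextEntry("inner.txt", "embedded text"))) {
            System.out.println(m.get(Metadata.RESOURCE_NAME_KEY) + ": "
                    + m.get(RecursiveParserWrapper.TIKA_CONTENT));
        }
    }
}
```

    Because everything is held in memory until `getMetadata()` returns, the sketch inherits the drawback noted above.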
    
    Returns:
    
    a list of metadata objects, one each for the container file and each embedded file
    
    Throws:
    
    IOException
    
    SAXException
    
    TikaException
  - serializedRecursiveParserWrapperExample
```
public String serializedRecursiveParserWrapperExample()
                                               throws IOException,
                                                      SAXException,
                                                      TikaException
```
    We include a simple JSON serializer for a list of metadata with JsonMetadataList. That class also includes a deserializer to convert from JSON back to a List.
    This functionality is also available in tika-app's GUI, and with the -J option on tika-app's commandline. For tika-server users, there is the "rmeta" service that will return this format.
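
    A sketch of the serializer and deserializer round trip, assuming tika-core and tika-serialization (with its Gson dependency) are on the classpath. The class name `JsonMetadataSketch`, its helpers, and the hand-built Metadata entry are hypothetical. In the real example the list comes from the RecursiveParserWrapper.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.serialization.JsonMetadataList;

class JsonMetadataSketch {

    // Serializes a list of Metadata objects to a JSON string.
    static String toJson(List<Metadata> metadataList) throws TikaException {
        StringWriter writer = new StringWriter();
        JsonMetadataList.toJson(metadataList, writer);
        return writer.toString();
    }

    // Converts the JSON string back into a list of Metadata objects.
    static List<Metadata> fromJson(String json) throws TikaException {
        return JsonMetadataList.fromJson(new StringReader(json));
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a hand-built entry instead of real parser output.
        Metadata metadata = new Metadata();
        metadata.set("title", "sample document");
        List<Metadata> list = new ArrayList<>();
        list.add(metadata);

        String json = toJson(list);
        System.out.println(json);
        System.out.println(fromJson(json).get(0).get("title"));
    }
}
```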
    
    Returns:
    
    a JSON representation of a list of Metadata objects
    
    Throws:
    
    IOException
    
    SAXException
    
    TikaException
  - extractEmbeddedDocumentsExample
```
public List<Path> extractEmbeddedDocumentsExample(Path outputPath)
                                           throws IOException,
                                                  SAXException,
                                                  TikaException
```
    Parameters:
    
    outputPath - output directory in which to place files
    
    Returns:
    
    list of files created
    
    Throws:
    
    IOException
    
    SAXException
    
    TikaException
