Tika (Apache Tika 1.1 API)

org.apache.tika
Class Tika

java.lang.Object
  org.apache.tika.Tika

public class Tika
extends Object

Facade class for accessing Tika functionality. This class hides much of the underlying complexity of the lower level Tika classes and provides simple methods for many common parsing and type detection operations.

Since:: Apache Tika 0.5
See Also:: Parser, Detector

Constructor Summary
`Tika()` Creates a Tika facade using the default configuration.
`Tika(Detector detector)` Creates a Tika facade using the given detector instance and the default parser configuration.
`Tika(Detector detector, Parser parser)` Creates a Tika facade using the given detector and parser instances.
`Tika(TikaConfig config)` Creates a Tika facade using the given configuration.

Method Summary
`String`	`detect(byte[] prefix)` Detects the media type of the given document.
`String`	`detect(byte[] prefix, String name)` Detects the media type of the given document.
`String`	`detect(File file)` Detects the media type of the given file.
`String`	`detect(InputStream stream)` Detects the media type of the given document.
`String`	`detect(InputStream stream, Metadata metadata)` Detects the media type of the given document.
`String`	`detect(InputStream stream, String name)` Detects the media type of the given document.
`String`	`detect(String name)` Detects the media type of a document with the given file name.
`String`	`detect(URL url)` Detects the media type of the resource at the given URL.
`Detector`	`getDetector()` Returns the detector instance used by this facade.
`int`	`getMaxStringLength()` Returns the maximum length of strings returned by the parseToString methods.
`Parser`	`getParser()` Returns the parser instance used by this facade.
`Reader`	`parse(File file)` Parses the given file and returns the extracted text content.
`Reader`	`parse(InputStream stream)` Parses the given document and returns the extracted text content.
`Reader`	`parse(InputStream stream, Metadata metadata)` Parses the given document and returns the extracted text content.
`Reader`	`parse(URL url)` Parses the resource at the given URL and returns the extracted text content.
`String`	`parseToString(File file)` Parses the given file and returns the extracted text content.
`String`	`parseToString(InputStream stream)` Parses the given document and returns the extracted text content.
`String`	`parseToString(InputStream stream, Metadata metadata)` Parses the given document and returns the extracted text content.
`String`	`parseToString(URL url)` Parses the resource at the given URL and returns the extracted text content.
`void`	`setMaxStringLength(int maxStringLength)` Sets the maximum length of strings returned by the parseToString methods.
`String`	`toString()`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait`

Constructor Detail

Tika

public Tika(Detector detector,
            Parser parser)

Creates a Tika facade using the given detector and parser instances.

Parameters:: detector - type detector; parser - document parser
Since:: Apache Tika 0.8

Tika

public Tika(TikaConfig config)

Creates a Tika facade using the given configuration.

Parameters:: config - Tika configuration

Tika

public Tika()

Creates a Tika facade using the default configuration.

Tika

public Tika(Detector detector)

Creates a Tika facade using the given detector instance and the default parser configuration.

Parameters:: detector - type detector
Since:: Apache Tika 0.8

Method Detail

detect

public String detect(InputStream stream,
                     Metadata metadata)
              throws IOException

Detects the media type of the given document. The type detection is based on the content of the given document stream and any given document metadata. The document stream can be null, in which case only the given document metadata is used for type detection.

If the document stream supports the mark feature, then the stream is marked and reset to the original position before this method returns. Only a limited number of bytes are read from the stream.

The given document stream is not closed by this method.

Unlike in the parse(InputStream, Metadata) method, the given document metadata is not modified by this method.

Parameters:: stream - the document stream, or null; metadata - document metadata
Returns:: detected media type
Throws:: IOException - if the stream can not be read
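
The mark-and-reset behaviour described above can be illustrated with plain JDK streams. The following is a minimal sketch, not Tika's implementation: the class `PeekSketch`, the method `peek`, and the `limit` parameter are hypothetical names chosen for this example. It reads a bounded prefix, restores the stream position when the stream supports marking, and leaves the stream open.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: peeks at a stream prefix.
final class PeekSketch {

    /** Reads up to limit bytes, then rewinds the stream if it supports mark/reset. */
    static byte[] peek(InputStream stream, int limit) throws IOException {
        boolean canReset = stream.markSupported();
        if (canReset) {
            stream.mark(limit); // remember the current position
        }
        try {
            byte[] buffer = new byte[limit];
            int total = 0;
            while (total < limit) {
                int n = stream.read(buffer, total, limit - total);
                if (n == -1) {
                    break; // end of stream reached before the limit
                }
                total += n;
            }
            return Arrays.copyOf(buffer, total);
        } finally {
            if (canReset) {
                stream.reset(); // restore the original position; the stream stays open
            }
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
                "%PDF-1.4 example".getBytes(StandardCharsets.US_ASCII));
        byte[] prefix = peek(in, 4);
        System.out.println(new String(prefix, StandardCharsets.US_ASCII));
        // After the peek, reading resumes at the first byte of the document.
        System.out.println((char) in.read());
    }
}
```

A caller that passes a stream without mark support (for example a raw socket stream) would lose the peeked bytes in this sketch, so wrapping such a stream in a `java.io.BufferedInputStream` first is a common precaution.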

detect

public String detect(InputStream stream,
                     String name)
              throws IOException

Detects the media type of the given document. The type detection is based on the content of the given document stream and the name of the document.

If the document stream supports the mark feature, then the stream is marked and reset to the original position before this method returns. Only a limited number of bytes are read from the stream.

The given document stream is not closed by this method.

Parameters:: stream - the document stream; name - document name
Returns:: detected media type
Throws:: IOException - if the stream can not be read
Since:: Apache Tika 0.9

detect

public String detect(InputStream stream)
              throws IOException

Detects the media type of the given document. The type detection is based on the content of the given document stream.

If the document stream supports the mark feature, then the stream is marked and reset to the original position before this method returns. Only a limited number of bytes are read from the stream.

The given document stream is not closed by this method.

Parameters:: stream - the document stream
Returns:: detected media type
Throws:: IOException - if the stream can not be read

detect

public String detect(byte[] prefix,
                     String name)

Detects the media type of the given document. The type detection is based on the first few bytes of a document and the document name.

For best results at least a few kilobytes of the document data are needed. See also the other detect() methods for better alternatives when you have more than just the document prefix available for type detection.

Parameters:: prefix - first few bytes of the document; name - document name
Returns:: detected media type
Since:: Apache Tika 0.9

detect

public String detect(byte[] prefix)

Detects the media type of the given document. The type detection is based on the first few bytes of a document.

Parameters:: prefix - first few bytes of the document
Returns:: detected media type
Since:: Apache Tika 0.9
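
To show what detection from the first few bytes can look like, here is a small sketch that compares a prefix against three well-known file signatures (PDF, PNG, ZIP). It is illustrative only: `PrefixSketch` and its methods are hypothetical, the signature table is far smaller than the one a real detector uses, and the `application/octet-stream` fallback is this sketch's own choice for unrecognised data.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: matches a prefix against a few signatures.
final class PrefixSketch {

    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G'};
    private static final byte[] ZIP = {'P', 'K', 3, 4};

    /** Returns true if data begins with all bytes of magic. */
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Guesses a media type from the document prefix alone. */
    static String detect(byte[] prefix) {
        if (startsWith(prefix, PDF)) {
            return "application/pdf";
        }
        if (startsWith(prefix, PNG)) {
            return "image/png";
        }
        if (startsWith(prefix, ZIP)) {
            return "application/zip";
        }
        return "application/octet-stream"; // fallback chosen for this sketch
    }

    public static void main(String[] args) {
        byte[] prefix = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(prefix));
    }
}
```

Many container formats share the ZIP signature, which is one reason a longer prefix or a document name tends to give better results than a handful of bytes.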

detect

public String detect(File file)
              throws IOException

Detects the media type of the given file. The type detection is based on the document content and on the file name extension, if it is a known one.

Use the detect(String) method when you want to detect the type of the document without actually accessing the file.

Parameters:: file - the file
Returns:: detected media type
Throws:: IOException - if the file can not be read

detect

public String detect(URL url)
              throws IOException

Detects the media type of the resource at the given URL. The type detection is based on the document content and on the file name extension included in the URL, if it is a known one.

Use the detect(String) method when you want to detect the type of the document without actually accessing the URL.

Parameters:: url - the URL of the resource
Returns:: detected media type
Throws:: IOException - if the resource can not be read

detect

public String detect(String name)

Detects the media type of a document with the given file name. The type detection is based on known file name extensions.

The given name can also be a URL or a full file path. In such cases only the file name part of the string is used for type detection.

Parameters:: name - the file name of the document
Returns:: detected media type
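
The name-only detection described above can be sketched as follows. This is an assumption-laden illustration rather than Tika's algorithm: `NameSketch`, its three-entry extension table, the stripping of `?query` and `#fragment` parts, and the `application/octet-stream` fallback are all choices made for this example.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Tika: maps a file name extension to a media type.
final class NameSketch {

    // Tiny illustrative table; a real detector knows far more extensions.
    private static final Map<String, String> TYPES = Map.of(
            "pdf", "application/pdf",
            "txt", "text/plain",
            "html", "text/html");

    /** Extracts the file name part of a plain name, a file path, or a URL. */
    static String fileNamePart(String name) {
        int end = name.length();
        int query = name.indexOf('?');
        if (query >= 0) {
            end = Math.min(end, query);
        }
        int fragment = name.indexOf('#');
        if (fragment >= 0) {
            end = Math.min(end, fragment);
        }
        String path = name.substring(0, end);
        int slash = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        return path.substring(slash + 1);
    }

    /** Looks up the media type for the extension of the file name part. */
    static String detect(String name) {
        String file = fileNamePart(name);
        int dot = file.lastIndexOf('.');
        String extension = dot < 0 ? "" : file.substring(dot + 1).toLowerCase(Locale.ROOT);
        return TYPES.getOrDefault(extension, "application/octet-stream");
    }

    public static void main(String[] args) {
        System.out.println(detect("http://example.com/docs/report.PDF?download=1"));
        System.out.println(detect("C:\\tmp\\notes.txt"));
    }
}
```

Because no file or URL is opened, a lookup like this is cheap, but it trusts the name entirely and cannot notice a mislabelled document.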

parse

public Reader parse(InputStream stream,
                    Metadata metadata)
             throws IOException

Parses the given document and returns the extracted text content. Input metadata like a file name or a content type hint can be passed in the given metadata instance. Metadata information extracted from the document is returned in that same metadata instance.

The returned reader will be responsible for closing the given stream. The stream and any associated resources will be closed at or before the time when the Reader.close() method is called.

Parameters:: stream - the document to be parsed; metadata - document metadata
Returns:: extracted text content
Throws:: IOException - if the document can not be read or parsed
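
The ownership rule for the returned reader (closing the reader also releases the stream) mirrors how the JDK's own `InputStreamReader` treats the stream it wraps. The sketch below demonstrates that pattern with JDK classes only. `ReaderOwnershipSketch`, `TrackingStream`, `open`, and `readAll` are hypothetical names, and `open` merely stands in for a method that hands back a reader owning its input.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo, not part of Tika: a reader that owns its underlying stream.
final class ReaderOwnershipSketch {

    /** Stream wrapper that records whether close() has been called. */
    static final class TrackingStream extends FilterInputStream {
        boolean closed;

        TrackingStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Stand-in for a parse-style method: the returned reader takes over the stream. */
    static Reader open(InputStream stream) {
        return new InputStreamReader(stream, StandardCharsets.UTF_8);
    }

    /** Drains the reader into a string. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            text.append(buffer, 0, n);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        TrackingStream stream = new TrackingStream(
                new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8)));
        // try-with-resources closes the reader, which in turn closes the stream.
        try (Reader reader = open(stream)) {
            System.out.println(readAll(reader));
        }
        System.out.println("stream closed: " + stream.closed);
    }
}
```

With this pattern the caller only needs to close the reader, typically through try-with-resources, and does not close the original stream separately.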

parse

public Reader parse(InputStream stream)
             throws IOException

Parses the given document and returns the extracted text content.

The returned reader will be responsible for closing the given stream. The stream and any associated resources will be closed at or before the time when the Reader.close() method is called.

Parameters:: stream - the document to be parsed
Returns:: extracted text content
Throws:: IOException - if the document can not be read or parsed

parse

public Reader parse(File file)
             throws IOException

Parses the given file and returns the extracted text content.

Parameters:: file - the file to be parsed
Returns:: extracted text content
Throws:: IOException - if the file can not be read or parsed

parse

public Reader parse(URL url)
             throws IOException

Parses the resource at the given URL and returns the extracted text content.

Parameters:: url - the URL of the resource to be parsed
Returns:: extracted text content
Throws:: IOException - if the resource can not be read or parsed

parseToString

public String parseToString(InputStream stream,
                            Metadata metadata)
                     throws IOException,
                            TikaException

Parses the given document and returns the extracted text content. The given input stream is closed by this method.

To avoid unpredictable excess memory use, the returned string contains at most the first getMaxStringLength() characters extracted from the input document. Use the setMaxStringLength(int) method to adjust this limit.

NOTE: Unlike most other Tika methods that take an InputStream, this method will close the given stream for you as a convenience. With other methods you are still responsible for closing the stream or a wrapper instance returned by Tika.

Parameters:: stream - the document to be parsed; metadata - document metadata
Returns:: extracted text content
Throws:: IOException - if the document can not be read; TikaException - if the document can not be parsed

parseToString

public String parseToString(InputStream stream)
                     throws IOException,
                            TikaException

Parses the given document and returns the extracted text content. The given input stream is closed by this method.

Parameters:: stream - the document to be parsed
Returns:: extracted text content
Throws:: IOException - if the document can not be read; TikaException - if the document can not be parsed

parseToString

public String parseToString(File file)
                     throws IOException,
                            TikaException

Parses the given file and returns the extracted text content.

Parameters:: file - the file to be parsed
Returns:: extracted text content
Throws:: IOException - if the file can not be read; TikaException - if the file can not be parsed

parseToString

public String parseToString(URL url)
                     throws IOException,
                            TikaException

Parses the resource at the given URL and returns the extracted text content.

Parameters:: url - the URL of the resource to be parsed
Returns:: extracted text content
Throws:: IOException - if the resource can not be read; TikaException - if the resource can not be parsed

getMaxStringLength

public int getMaxStringLength()

Returns the maximum length of strings returned by the parseToString methods.

Returns:: maximum string length, or -1 if the limit has been disabled
Since:: Apache Tika 0.7

setMaxStringLength

public void setMaxStringLength(int maxStringLength)

Sets the maximum length of strings returned by the parseToString methods.

Parameters:: maxStringLength - maximum string length, or -1 to disable this limit
Since:: Apache Tika 0.7
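
The effect of the maximum string length can be pictured as a bounded read: extraction stops once the limit is reached, so memory use stays predictable. The sketch below shows that idea over a plain `Reader`. It is not Tika's implementation. `MaxLengthSketch` and `readLimited` are hypothetical names, and treating every negative value (not only -1) as "no limit" is a simplification made for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper, not part of Tika: reads at most a fixed number of characters.
final class MaxLengthSketch {

    /**
     * Reads text from the reader, stopping after maxStringLength characters.
     * A negative limit disables the cap in this sketch.
     */
    static String readLimited(Reader reader, int maxStringLength) throws IOException {
        StringBuilder out = new StringBuilder();
        char[] buffer = new char[1024];
        while (maxStringLength < 0 || out.length() < maxStringLength) {
            // Never request more characters than the remaining budget allows.
            int want = maxStringLength < 0
                    ? buffer.length
                    : Math.min(buffer.length, maxStringLength - out.length());
            int n = reader.read(buffer, 0, want);
            if (n == -1) {
                break; // end of input
            }
            out.append(buffer, 0, n);
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readLimited(new StringReader("abcdefghij"), 4));
        System.out.println(readLimited(new StringReader("abcdefghij"), -1));
    }
}
```

Stopping the read early, rather than truncating a fully built string afterwards, is what keeps the memory bound meaningful for very large documents.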

getParser

public Parser getParser()

Returns the parser instance used by this facade.

Returns:: parser instance
Since:: Apache Tika 0.10

getDetector

public Detector getDetector()

Returns the detector instance used by this facade.

Returns:: detector instance
Since:: Apache Tika 0.10

toString

public String toString()

Overrides:: toString in class Object
