org.apache.tika.Tika

public class Tika extends Object

Facade class for accessing Tika functionality. This class hides much of the underlying complexity of the lower level Tika classes and provides simple methods for many common parsing and type detection operations.

Since:

Apache Tika 0.5

Constructor Summary

Constructors

Constructor

Description

Tika()

Creates a Tika facade using the default configuration.

Tika(TikaConfig config)

Creates a Tika facade using the given configuration.

Tika(Detector detector)

Creates a Tika facade using the given detector instance, the default parser configuration, and the default Translator.

Tika(Detector detector, Parser parser)

Creates a Tika facade using the given detector and parser instances, but the default Translator.

Tika(Detector detector, Parser parser, Translator translator)

Creates a Tika facade using the given detector, parser, and translator instances.

Method Summary

Modifier and Type

Method

Description

String

detect(byte[] prefix)

Detects the media type of the given document.

String

detect(byte[] prefix, String name)

Detects the media type of the given document.

String

detect(File file)

Detects the media type of the given file.

String

detect(InputStream stream)

Detects the media type of the given document.

String

detect(InputStream stream, String name)

Detects the media type of the given document.

String

detect(InputStream stream, Metadata metadata)

Detects the media type of the given document.

String

detect(String name)

Detects the media type of a document with the given file name.

String

detect(URL url)

Detects the media type of the resource at the given URL.

String

detect(Path path)

Detects the media type of the file at the given path.

Detector

getDetector()

Returns the detector instance used by this facade.

int

getMaxStringLength()

Returns the maximum length of strings returned by the parseToString methods.

Parser

getParser()

Returns the parser instance used by this facade.

Translator

getTranslator()

Returns the translator instance used by this facade.

Reader

parse(File file)

Parses the given file and returns the extracted text content.

Reader

parse(File file, Metadata metadata)

Parses the given file and returns the extracted text content.

Reader

parse(InputStream stream)

Parses the given document and returns the extracted text content.

Reader

parse(InputStream stream, Metadata metadata)

Parses the given document and returns the extracted text content.

Reader

parse(URL url)

Parses the resource at the given URL and returns the extracted text content.

Reader

parse(Path path)

Parses the file at the given path and returns the extracted text content.

Reader

parse(Path path, Metadata metadata)

Parses the file at the given path and returns the extracted text content.

String

parseToString(File file)

Parses the given file and returns the extracted text content.

String

parseToString(InputStream stream)

Parses the given document and returns the extracted text content.

String

parseToString(InputStream stream, Metadata metadata)

Parses the given document and returns the extracted text content.

String

parseToString(InputStream stream, Metadata metadata, int maxLength)

Parses the given document and returns the extracted text content.

String

parseToString(URL url)

Parses the resource at the given URL and returns the extracted text content.

String

parseToString(Path path)

Parses the file at the given path and returns the extracted text content.

void

setMaxStringLength(int maxStringLength)

Sets the maximum length of strings returned by the parseToString methods.

String

toString()

String

translate(String text, String targetLanguage)

Translates the given text to the target language, attempting to auto-detect the source language.

String

translate(String text, String sourceLanguage, String targetLanguage)

Translates the given text from the source language to the target language.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Constructor Details
- Tika
  
  public Tika(Detector detector, Parser parser)
  
  Creates a Tika facade using the given detector and parser instances, but the default Translator.
  
  Parameters:
  
  detector - type detector
  
  parser - document parser
  
  Since:
  
  Apache Tika 0.8
- Tika
  
  public Tika(Detector detector, Parser parser, Translator translator)
  
  Creates a Tika facade using the given detector, parser, and translator instances.
  
  Parameters:
  
  detector - type detector
  
  parser - document parser
  
  translator - text translator
  
  Since:
  
  Apache Tika 1.6
- Tika
  
  public Tika(TikaConfig config)
  
  Creates a Tika facade using the given configuration.
  
  Parameters:
  
  config - Tika configuration
- Tika
  
  public Tika()
  
  Creates a Tika facade using the default configuration.
- Tika
  
  public Tika(Detector detector)
  
  Creates a Tika facade using the given detector instance, the default parser configuration, and the default Translator.
  
  Parameters:
  
  detector - type detector
  
  Since:
  
  Apache Tika 0.8
Method Details
- detect
  
  public String detect(InputStream stream, Metadata metadata) throws IOException
  
  Detects the media type of the given document. The type detection is based on the content of the given document stream and any given document metadata. The document stream can be null, in which case only the given document metadata is used for type detection.
  If the document stream supports the mark feature, then the stream is marked and reset to the original position before this method returns. Only a limited number of bytes are read from the stream.
  The given document stream is not closed by this method.
  Unlike in the parse(InputStream, Metadata) method, the given document metadata is not modified by this method.
  
  Parameters:
  
  stream - the document stream, or null
  
  metadata - document metadata
  
  Returns:
  
  detected media type
  
  Throws:
  
  IOException - if the stream can not be read
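
The mark-and-reset behaviour described for this method can be illustrated with plain `java.io` classes. The following is a minimal sketch, not Tika's implementation: `PeekSketch`, `peek`, and the 8-byte limit are hypothetical names and values chosen only for illustration. It reads a bounded prefix from a stream that supports mark, then resets the stream so the caller still sees the document from its original position.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustration only: hypothetical helper, not part of Tika.
final class PeekSketch {

    // Reads up to `limit` bytes, then resets the stream to where it started.
    static byte[] peek(InputStream stream, int limit) throws IOException {
        if (!stream.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        stream.mark(limit);
        try {
            byte[] buffer = new byte[limit];
            int total = 0;
            while (total < limit) {
                int n = stream.read(buffer, total, limit - total);
                if (n == -1) {
                    break; // end of stream before the limit was reached
                }
                total += n;
            }
            return Arrays.copyOf(buffer, total);
        } finally {
            stream.reset(); // the stream is repositioned, not closed
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.7 example".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        byte[] prefix = peek(in, 8);
        System.out.println(new String(prefix, StandardCharsets.US_ASCII));
        System.out.println((char) in.read()); // first byte again after reset
    }
}
```

A stream that does not support mark can be wrapped in a `BufferedInputStream` first, as `main` does here.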
- detect
  
  public String detect(InputStream stream, String name) throws IOException
  
  Detects the media type of the given document. The type detection is based on the content of the given document stream and the name of the document.
  If the document stream supports the mark feature, then the stream is marked and reset to the original position before this method returns. Only a limited number of bytes are read from the stream.
  The given document stream is not closed by this method.
  
  Parameters:
  
  stream - the document stream
  
  name - document name
  
  Returns:
  
  detected media type
  
  Throws:
  
  IOException - if the stream can not be read
  
  Since:
  
  Apache Tika 0.9
- detect
  
  public String detect(InputStream stream) throws IOException
  
  Detects the media type of the given document. The type detection is based on the content of the given document stream.
  If the document stream supports the mark feature, then the stream is marked and reset to the original position before this method returns. Only a limited number of bytes are read from the stream.
  The given document stream is not closed by this method.
  
  Parameters:
  
  stream - the document stream
  
  Returns:
  
  detected media type
  
  Throws:
  
  IOException - if the stream can not be read
- detect
  
  public String detect(byte[] prefix, String name)
  
  Detects the media type of the given document. The type detection is based on the first few bytes of a document and the document name.
  For best results at least a few kilobytes of the document data are needed. See also the other detect() methods for better alternatives when you have more than just the document prefix available for type detection.
  
  Parameters:
  
  prefix - first few bytes of the document
  
  name - document name
  
  Returns:
  
  detected media type
  
  Since:
  
  Apache Tika 0.9
- detect
  
  public String detect(byte[] prefix)
  
  Detects the media type of the given document. The type detection is based on the first few bytes of a document.
  For best results at least a few kilobytes of the document data are needed. See also the other detect() methods for better alternatives when you have more than just the document prefix available for type detection.
  
  Parameters:
  
  prefix - first few bytes of the document
  
  Returns:
  
  detected media type
  
  Since:
  
  Apache Tika 0.9
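
To make "detection based on the first few bytes" concrete, here is a toy sketch of signature matching on a document prefix. It is an assumption-laden simplification and not how Tika's detectors are implemented: `MagicSketch`, `startsWith`, and `guessType` are hypothetical names, and only two well-known signatures (PDF and PNG) are checked, with `application/octet-stream` as the fallback.

```java
// Illustration only: hypothetical helper, not part of Tika.
final class MagicSketch {

    // Well-known leading signatures of PDF and PNG files.
    private static final byte[] PDF = {'%', 'P', 'D', 'F', '-'};
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};

    // True if `data` begins with all bytes of `magic`.
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns a media type for the prefix, or a generic fallback.
    static String guessType(byte[] prefix) {
        if (startsWith(prefix, PDF)) {
            return "application/pdf";
        }
        if (startsWith(prefix, PNG)) {
            return "image/png";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] prefix = {'%', 'P', 'D', 'F', '-', '1', '.', '4'};
        System.out.println(guessType(prefix));
    }
}
```

Real formats often need more than a handful of leading bytes to be told apart, which is why the method asks for at least a few kilobytes of data.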
- detect
  
  public String detect(Path path) throws IOException
  
  Detects the media type of the file at the given path. The type detection is based on the document content and a potential known file extension.
  Use the detect(String) method when you want to detect the type of the document without actually accessing the file.
  
  Parameters:
  
  path - the path of the file
  
  Returns:
  
  detected media type
  
  Throws:
  
  IOException - if the file can not be read
- detect
  
  public String detect(File file) throws IOException
  
  Detects the media type of the given file. The type detection is based on the document content and a potential known file extension.
  Use the detect(String) method when you want to detect the type of the document without actually accessing the file.
  
  Parameters:
  
  file - the file
  
  Returns:
  
  detected media type
  
  Throws:
  
  IOException - if the file can not be read
  
  See Also:
  
  detect(Path)
- detect
  
  public String detect(URL url) throws IOException
  
  Detects the media type of the resource at the given URL. The type detection is based on the document content and a potential known file extension included in the URL.
  Use the detect(String) method when you want to detect the type of the document without actually accessing the URL.
  
  Parameters:
  
  url - the URL of the resource
  
  Returns:
  
  detected media type
  
  Throws:
  
  IOException - if the resource can not be read
- detect
  
  public String detect(String name)
  
  Detects the media type of a document with the given file name. The type detection is based on known file name extensions.
  The given name can also be a URL or a full file path. In such cases only the file name part of the string is used for type detection.
  
  Parameters:
  
  name - the file name of the document
  
  Returns:
  
  detected media type
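
The statement that only the file name part of a URL or path is used can be sketched as follows. This is a simplified illustration under stated assumptions, not Tika's actual name handling: `NameSketch`, `fileNamePart`, and `extension` are hypothetical names, and the sketch simply drops any query string or fragment and keeps the text after the last `/` or `\`.

```java
import java.util.Locale;

// Illustration only: hypothetical helper, not part of Tika.
final class NameSketch {

    // Strips a query string or fragment, then keeps the last path segment.
    static String fileNamePart(String name) {
        String s = name;
        int cut = s.indexOf('?');
        if (cut >= 0) {
            s = s.substring(0, cut);
        }
        cut = s.indexOf('#');
        if (cut >= 0) {
            s = s.substring(0, cut);
        }
        int slash = Math.max(s.lastIndexOf('/'), s.lastIndexOf('\\'));
        return s.substring(slash + 1);
    }

    // Lower-cased extension of the file name part, or "" if there is none.
    static String extension(String name) {
        String file = fileNamePart(name);
        int dot = file.lastIndexOf('.');
        return dot > 0 ? file.substring(dot + 1).toLowerCase(Locale.ROOT) : "";
    }

    public static void main(String[] args) {
        System.out.println(fileNamePart("https://example.com/docs/report.pdf?download=1"));
        System.out.println(extension("C:\\data\\notes.TXT"));
    }
}
```

Because nothing is opened or downloaded, a lookup of this kind is cheap, but it can only be as reliable as the file extension itself.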
- translate
  
  public String translate(String text, String sourceLanguage, String targetLanguage)
  
  Translates the given text from the source language to the target language.
  
  Parameters:
  
  text - The text to translate.
  
  sourceLanguage - The input text language (for example, "hi").
  
  targetLanguage - The desired output language (for example, "fr").
  
  Returns:
  
  The translated text. If translation is unavailable (client keys not set), the input text is returned unchanged.
  
  See Also:
  
  Translator
- translate
  
  public String translate(String text, String targetLanguage)
  
  Translates the given text to the target language, attempting to auto-detect the source language.
  
  Parameters:
  
  text - The text to translate.
  
  targetLanguage - The desired output language (for example, "en").
  
  Returns:
  
  The translated text. If translation is unavailable (client keys not set), the input text is returned unchanged.
  
  See Also:
  
  Translator
- parse
  
  public Reader parse(InputStream stream, Metadata metadata) throws IOException
  
  Parses the given document and returns the extracted text content. Input metadata like a file name or a content type hint can be passed in the given metadata instance. Metadata information extracted from the document is returned in that same metadata instance.
  The returned reader will be responsible for closing the given stream. The stream and any associated resources will be closed at or before the time when the Reader.close() method is called.
  
  Parameters:
  
  stream - the document to be parsed
  
  metadata - where document's metadata will be populated
  
  Returns:
  
  extracted text content
  
  Throws:
  
  IOException - if the document can not be read or parsed
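
The ownership rule above (close the returned reader, and the stream is closed with it) follows a pattern that plain `java.io` classes also use. The sketch below demonstrates that pattern with an `InputStreamReader`; it does not use Tika's own reader, and `CloseSketch`, `TrackingStream`, and `readerClosesStream` are hypothetical names introduced only to observe the close call.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Illustration only: hypothetical helper, not part of Tika.
final class CloseSketch {

    // Wraps a stream and records whether close() was called on it.
    static final class TrackingStream extends FilterInputStream {
        boolean closed;

        TrackingStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    // Closes only the reader and reports whether the stream was closed too.
    static boolean readerClosesStream() throws IOException {
        byte[] data = "text".getBytes(StandardCharsets.UTF_8);
        TrackingStream stream = new TrackingStream(new ByteArrayInputStream(data));
        try (Reader reader = new InputStreamReader(stream, StandardCharsets.UTF_8)) {
            reader.read(); // consume something, then let try-with-resources close
        }
        return stream.closed;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readerClosesStream());
    }
}
```

With this method the same idea applies: putting the returned reader in a try-with-resources block is enough, and the stream does not need a separate close call.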
- parse
  
  public Reader parse(InputStream stream) throws IOException
  
  Parses the given document and returns the extracted text content.
  The returned reader will be responsible for closing the given stream. The stream and any associated resources will be closed at or before the time when the Reader.close() method is called.
  
  Parameters:
  
  stream - the document to be parsed
  
  Returns:
  
  extracted text content
  
  Throws:
  
  IOException - if the document can not be read or parsed
- parse
  
  public Reader parse(Path path, Metadata metadata) throws IOException
  
  Parses the file at the given path and returns the extracted text content.
  Metadata information extracted from the document is returned in the supplied metadata instance.
  
  Parameters:
  
  path - the path of the file to be parsed
  
  metadata - where document's metadata will be populated
  
  Returns:
  
  extracted text content
  
  Throws:
  
  IOException - if the file can not be read or parsed
- parse
  
  public Reader parse(Path path) throws IOException
  
  Parses the file at the given path and returns the extracted text content.
  
  Parameters:
  
  path - the path of the file to be parsed
  
  Returns:
  
  extracted text content
  
  Throws:
  
  IOException - if the file can not be read or parsed
- parse
  
  public Reader parse(File file, Metadata metadata) throws IOException
  
  Parses the given file and returns the extracted text content.
  Metadata information extracted from the document is returned in the supplied metadata instance.
  
  Parameters:
  
  file - the file to be parsed
  
  metadata - where document's metadata will be populated
  
  Returns:
  
  extracted text content
  
  Throws:
  
  IOException - if the file can not be read or parsed
  
  See Also:
  
  parse(Path)
- parse
  
  public Reader parse(File file) throws IOException
  
  Parses the given file and returns the extracted text content.
  
  Parameters:
  
  file - the file to be parsed
  
  Returns:
  
  extracted text content
  
  Throws:
  
  IOException - if the file can not be read or parsed
  
  See Also:
  
  parse(Path)
- parse
  
  public Reader parse(URL url) throws IOException
  
  Parses the resource at the given URL and returns the extracted text content.
  
  Parameters:
  
  url - the URL of the resource to be parsed
  
  Returns:
  
  extracted text content
  
  Throws:
  
  IOException - if the resource can not be read or parsed
- parseToString
  
  public String parseToString(InputStream stream, Metadata metadata) throws IOException, TikaException
  
  Parses the given document and returns the extracted text content. The given input stream is closed by this method.
  To avoid unpredictable excess memory use, the returned string contains at most the first getMaxStringLength() characters extracted from the input document. Use the setMaxStringLength(int) method to adjust this limit.
  NOTE: Unlike most other Tika methods that take an InputStream, this method will close the given stream for you as a convenience. With other methods you are still responsible for closing the stream or a wrapper instance returned by Tika.
  
  Parameters:
  
  stream - the document to be parsed
  
  metadata - document metadata
  
  Returns:
  
  extracted text content
  
  Throws:
  
  IOException - if the document can not be read
  
  TikaException - if the document can not be parsed
- parseToString
  
  public String parseToString(InputStream stream, Metadata metadata, int maxLength) throws IOException, TikaException
  
  Parses the given document and returns the extracted text content. The given input stream is closed by this method. This method lets you control the maxStringLength per call.
  To avoid unpredictable excess memory use, the returned string contains at most the first maxLength characters extracted from the input document.
  NOTE: Unlike most other Tika methods that take an InputStream, this method will close the given stream for you as a convenience. With other methods you are still responsible for closing the stream or a wrapper instance returned by Tika.
  
  Parameters:
  
  stream - the document to be parsed
  
  metadata - document metadata
  
  maxLength - maximum length of the returned string
  
  Returns:
  
  extracted text content
  
  Throws:
  
  IOException - if the document can not be read
  
  TikaException - if the document can not be parsed
- parseToString
  
  public String parseToString(InputStream stream) throws IOException, TikaException
  
  Parses the given document and returns the extracted text content. The given input stream is closed by this method.
  To avoid unpredictable excess memory use, the returned string contains at most the first getMaxStringLength() characters extracted from the input document. Use the setMaxStringLength(int) method to adjust this limit.
  NOTE: Unlike most other Tika methods that take an InputStream, this method will close the given stream for you as a convenience. With other methods you are still responsible for closing the stream or a wrapper instance returned by Tika.
  
  Parameters:
  
  stream - the document to be parsed
  
  Returns:
  
  extracted text content
  
  Throws:
  
  IOException - if the document can not be read
  
  TikaException - if the document can not be parsed
- parseToString
  
  public String parseToString(Path path) throws IOException, TikaException
  
  Parses the file at the given path and returns the extracted text content.
  To avoid unpredictable excess memory use, the returned string contains at most the first getMaxStringLength() characters extracted from the input document. Use the setMaxStringLength(int) method to adjust this limit.
  
  Parameters:
  
  path - the path of the file to be parsed
  
  Returns:
  
  extracted text content
  
  Throws:
  
  IOException - if the file can not be read
  
  TikaException - if the file can not be parsed
- parseToString
  
  public String parseToString(File file) throws IOException, TikaException
  
  Parses the given file and returns the extracted text content.
  To avoid unpredictable excess memory use, the returned string contains at most the first getMaxStringLength() characters extracted from the input document. Use the setMaxStringLength(int) method to adjust this limit.
  
  Parameters:
  
  file - the file to be parsed
  
  Returns:
  
  extracted text content
  
  Throws:
  
  IOException - if the file can not be read
  
  TikaException - if the file can not be parsed
  
  See Also:
  
  parseToString(Path)
- parseToString
  
  public String parseToString(URL url) throws IOException, TikaException
  
  Parses the resource at the given URL and returns the extracted text content.
  To avoid unpredictable excess memory use, the returned string contains at most the first getMaxStringLength() characters extracted from the input document. Use the setMaxStringLength(int) method to adjust this limit.
  
  Parameters:
  
  url - the URL of the resource to be parsed
  
  Returns:
  
  extracted text content
  
  Throws:
  
  IOException - if the resource can not be read
  
  TikaException - if the resource can not be parsed
- getMaxStringLength
  
  public int getMaxStringLength()
  
  Returns the maximum length of strings returned by the parseToString methods.
  
  Returns:
  
  maximum string length, or -1 if the limit has been disabled
  
  Since:
  
  Apache Tika 0.7
- setMaxStringLength
  
  public void setMaxStringLength(int maxStringLength)
  
  Sets the maximum length of strings returned by the parseToString methods.
  
  Parameters:
  
  maxStringLength - maximum string length, or -1 to disable this limit
  
  Since:
  
  Apache Tika 0.7
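
The effect of a maximum string length, including the `-1` value that disables it, can be sketched with a bounded read from a `Reader`. This is an illustrative assumption about the general idea (stop collecting text once the limit is reached), not Tika's implementation; `LimitSketch` and `readUpTo` are hypothetical names.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Illustration only: hypothetical helper, not part of Tika.
final class LimitSketch {

    // Collects at most maxLength characters; a negative maxLength means no limit.
    static String readUpTo(Reader reader, int maxLength) throws IOException {
        StringBuilder out = new StringBuilder();
        char[] buffer = new char[1024];
        while (maxLength < 0 || out.length() < maxLength) {
            int want = maxLength < 0
                    ? buffer.length
                    : Math.min(buffer.length, maxLength - out.length());
            int n = reader.read(buffer, 0, want);
            if (n == -1) {
                break; // input exhausted before the limit
            }
            out.append(buffer, 0, n);
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readUpTo(new StringReader("hello world"), 5));
        System.out.println(readUpTo(new StringReader("hello world"), -1));
    }
}
```

Stopping the read at the limit, rather than truncating a fully built string afterwards, is what keeps memory use bounded for very large inputs.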
- getParser
  
  public Parser getParser()
  
  Returns the parser instance used by this facade.
  
  Returns:
  
  parser instance
  
  Since:
  
  Apache Tika 0.10
- getDetector
  
  public Detector getDetector()
  
  Returns the detector instance used by this facade.
  
  Returns:
  
  detector instance
  
  Since:
  
  Apache Tika 0.10
- getTranslator
  
  public Translator getTranslator()
  
  Returns the translator instance used by this facade.
  
  Returns:
  
  translator instance
  
  Since:
  
  Apache Tika 1.6
- toString
  
  public String toString()
  
  Overrides:
  
  toString in class Object
