Class Tika
-
Constructor Summary
Constructor - Description
Tika() - Creates a Tika facade using the default configuration.
Tika(TikaConfig config) - Creates a Tika facade using the given configuration.
Tika(Detector detector) - Creates a Tika facade using the given detector instance, the default parser configuration, and the default Translator.
Tika(Detector detector, Parser parser) - Creates a Tika facade using the given detector and parser instances, but the default Translator.
Tika(Detector detector, Parser parser, Translator translator) - Creates a Tika facade using the given detector, parser, and translator instances.
-
Method Summary
Modifier and Type, Method - Description
String detect(byte[] prefix) - Detects the media type of the given document.
String detect(byte[] prefix, String name) - Detects the media type of the given document.
String detect(File file) - Detects the media type of the given file.
String detect(InputStream stream) - Detects the media type of the given document.
String detect(InputStream stream, String name) - Detects the media type of the given document.
String detect(InputStream stream, Metadata metadata) - Detects the media type of the given document.
String detect(String name) - Detects the media type of a document with the given file name.
String detect(URL url) - Detects the media type of the resource at the given URL.
String detect(Path path) - Detects the media type of the file at the given path.
Detector getDetector() - Returns the detector instance used by this facade.
int getMaxStringLength() - Returns the maximum length of strings returned by the parseToString methods.
Parser getParser() - Returns the parser instance used by this facade.
Translator getTranslator() - Returns the translator instance used by this facade.
Reader parse(File file) - Parses the given file and returns the extracted text content.
Reader parse(File file, Metadata metadata) - Parses the given file and returns the extracted text content.
Reader parse(InputStream stream) - Parses the given document and returns the extracted text content.
Reader parse(InputStream stream, Metadata metadata) - Parses the given document and returns the extracted text content.
Reader parse(URL url) - Parses the resource at the given URL and returns the extracted text content.
Reader parse(Path path) - Parses the file at the given path and returns the extracted text content.
Reader parse(Path path, Metadata metadata) - Parses the file at the given path and returns the extracted text content.
String parseToString(File file) - Parses the given file and returns the extracted text content.
String parseToString(InputStream stream) - Parses the given document and returns the extracted text content.
String parseToString(InputStream stream, Metadata metadata) - Parses the given document and returns the extracted text content.
String parseToString(InputStream stream, Metadata metadata, int maxLength) - Parses the given document and returns the extracted text content.
String parseToString(URL url) - Parses the resource at the given URL and returns the extracted text content.
String parseToString(Path path) - Parses the file at the given path and returns the extracted text content.
void setMaxStringLength(int maxStringLength) - Sets the maximum length of strings returned by the parseToString methods.
String toString()
String translate(String text, String targetLanguage) - Translates the given text to the given language, attempting to auto-detect the source language.
String translate(String text, String sourceLanguage, String targetLanguage) - Translates the given text from the given source language to the given target language.
-
Constructor Details
-
Tika
public Tika(Detector detector, Parser parser)
Creates a Tika facade using the given detector and parser instances, but the default Translator.
Parameters:
detector - type detector
parser - document parser
Since:
Apache Tika 0.8
-
Tika
public Tika(Detector detector, Parser parser, Translator translator)
Creates a Tika facade using the given detector, parser, and translator instances.
Parameters:
detector - type detector
parser - document parser
translator - text translator
Since:
Apache Tika 1.6
-
Tika
public Tika(TikaConfig config)
Creates a Tika facade using the given configuration.
Parameters:
config - Tika configuration
-
Tika
public Tika()
Creates a Tika facade using the default configuration.
-
Tika
public Tika(Detector detector)
Creates a Tika facade using the given detector instance, the default parser configuration, and the default Translator.
Parameters:
detector - type detector
Since:
Apache Tika 0.8
-
-
Method Details
-
detect
public String detect(InputStream stream, Metadata metadata) throws IOException
Detects the media type of the given document. The type detection is based on the content of the given document stream and any given document metadata. The document stream can be null, in which case only the given document metadata is used for type detection.
If the document stream supports the mark feature, then the stream is marked and reset to the original position before this method returns. Only a limited number of bytes are read from the stream.
The given document stream is not closed by this method.
Unlike in the parse(InputStream, Metadata) method, the given document metadata is not modified by this method.
Parameters:
stream - the document stream, or null
metadata - document metadata
Returns:
detected media type
Throws:
IOException - if the stream cannot be read
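The mark-and-reset behavior described above can be illustrated with plain JDK streams. The following is a minimal sketch, not Tika's implementation: the class `MarkResetPeek` and its methods are hypothetical names, and the only assumption is that the stream supports `mark`. It reads a bounded number of bytes, rewinds, and leaves the stream open for the caller.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper; illustrates the mark/reset contract only.
final class MarkResetPeek {

    /** Reads at most limit bytes, then rewinds the stream to where it started. */
    static byte[] peek(InputStream in, int limit) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream does not support mark/reset");
        }
        try {
            in.mark(limit); // remember the current position
            byte[] buffer = new byte[limit];
            int total = 0;
            int n;
            while (total < limit && (n = in.read(buffer, total, limit - total)) != -1) {
                total += n;
            }
            in.reset(); // rewind; the stream is deliberately left open
            return Arrays.copyOf(buffer, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Peeks at the first limit bytes of text, then reads the whole stream again. */
    static String peekThenReadAll(String text, int limit) {
        InputStream in = new ByteArrayInputStream(text.getBytes(StandardCharsets.US_ASCII));
        String head = new String(peek(in, limit), StandardCharsets.US_ASCII);
        try {
            String all = new String(in.readAllBytes(), StandardCharsets.US_ASCII);
            return head + "|" + all;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the stream is reset, a caller can hand the same stream to a parser afterwards and still see the document from its first byte.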
-
detect
public String detect(InputStream stream, String name) throws IOException
Detects the media type of the given document. The type detection is based on the content of the given document stream and the name of the document.
If the document stream supports the mark feature, then the stream is marked and reset to the original position before this method returns. Only a limited number of bytes are read from the stream.
The given document stream is not closed by this method.
Parameters:
stream - the document stream
name - document name
Returns:
detected media type
Throws:
IOException - if the stream cannot be read
Since:
Apache Tika 0.9
-
detect
public String detect(InputStream stream) throws IOException
Detects the media type of the given document. The type detection is based on the content of the given document stream.
If the document stream supports the mark feature, then the stream is marked and reset to the original position before this method returns. Only a limited number of bytes are read from the stream.
The given document stream is not closed by this method.
Parameters:
stream - the document stream
Returns:
detected media type
Throws:
IOException - if the stream cannot be read
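The reset guarantee above is stated only for streams that support the mark feature. One common way for a caller to make sure a stream qualifies is to wrap it in a `BufferedInputStream` first. The sketch below uses only JDK classes; `MarkSupport` and its methods are hypothetical names introduced for illustration.

```java
import java.io.BufferedInputStream;
import java.io.InputStream;

// Hypothetical helper; shows how a caller can obtain a mark-capable stream.
final class MarkSupport {

    /** Returns the stream itself if it supports mark/reset, otherwise a buffered wrapper. */
    static InputStream withMarkSupport(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** A stream without mark support, used only for demonstration. */
    static InputStream unmarkable() {
        return new InputStream() {
            @Override
            public int read() {
                return -1; // always at end of stream
            }
        };
    }
}
```

The caller should keep using the returned wrapper afterwards, since bytes buffered by the wrapper are no longer available from the underlying stream.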
-
detect
public String detect(byte[] prefix, String name)
Detects the media type of the given document. The type detection is based on the first few bytes of a document and the document name.
For best results, at least a few kilobytes of the document data are needed. See also the other detect() methods for better alternatives when you have more than just the document prefix available for type detection.
Parameters:
prefix - first few bytes of the document
name - document name
Returns:
detected media type
Since:
Apache Tika 0.9
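To make the idea of combining a byte prefix with a document name concrete, here is a toy sketch. It is not how Tika detects types: the class `ToyPrefixDetector`, the single PDF signature, and the two extensions it knows are assumptions chosen for illustration. It checks the leading bytes first and falls back to the name.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Toy illustration only; a real detector knows far more signatures and rules.
final class ToyPrefixDetector {

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Prefers evidence from the content prefix, then falls back to the file name. */
    static String detect(byte[] prefix, String name) {
        if (startsWith(prefix, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (name != null) {
            String lower = name.toLowerCase(Locale.ROOT);
            if (lower.endsWith(".pdf")) {
                return "application/pdf";
            }
            if (lower.endsWith(".txt")) {
                return "text/plain";
            }
        }
        return "application/octet-stream"; // generic fallback for unknown data
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The sketch also shows why a longer prefix helps: a signature can only match if the prefix is at least as long as the signature itself.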
-
detect
public String detect(byte[] prefix)
Detects the media type of the given document. The type detection is based on the first few bytes of a document.
For best results, at least a few kilobytes of the document data are needed. See also the other detect() methods for better alternatives when you have more than just the document prefix available for type detection.
Parameters:
prefix - first few bytes of the document
Returns:
detected media type
Since:
Apache Tika 0.9
-
detect
public String detect(Path path) throws IOException
Detects the media type of the file at the given path. The type detection is based on the document content and a potential known file extension.
Use the detect(String) method when you want to detect the type of the document without actually accessing the file.
Parameters:
path - the path of the file
Returns:
detected media type
Throws:
IOException - if the file cannot be read
-
detect
public String detect(File file) throws IOException
Detects the media type of the given file. The type detection is based on the document content and a potential known file extension.
Use the detect(String) method when you want to detect the type of the document without actually accessing the file.
Parameters:
file - the file
Returns:
detected media type
Throws:
IOException - if the file cannot be read
-
detect
public String detect(URL url) throws IOException
Detects the media type of the resource at the given URL. The type detection is based on the document content and a potential known file extension included in the URL.
Use the detect(String) method when you want to detect the type of the document without actually accessing the URL.
Parameters:
url - the URL of the resource
Returns:
detected media type
Throws:
IOException - if the resource cannot be read
-
detect
public String detect(String name)
Detects the media type of a document with the given file name. The type detection is based on known file name extensions.
The given name can also be a URL or a full file path. In such cases only the file name part of the string is used for type detection.
Parameters:
name - the file name of the document
Returns:
detected media type
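The statement that only the file name part of a URL or full path is used can be pictured with a small string helper. This is a sketch under assumptions, not Tika's code: `FileNamePart` is a hypothetical class, and treating `?` and `#` as the start of a URL query or fragment is a simplification that Tika's actual rules may not share.

```java
// Hypothetical helper; Tika's own name handling may differ in detail.
final class FileNamePart {

    /** Returns the text after the last path separator, ignoring any URL query or fragment. */
    static String of(String name) {
        String s = name;
        int query = s.indexOf('?');
        if (query >= 0) {
            s = s.substring(0, query); // drop "?key=value"
        }
        int fragment = s.indexOf('#');
        if (fragment >= 0) {
            s = s.substring(0, fragment); // drop "#section"
        }
        // Accept both URL/Unix and Windows separators.
        int slash = Math.max(s.lastIndexOf('/'), s.lastIndexOf('\\'));
        return s.substring(slash + 1);
    }
}
```

With this helper, a URL, a Windows path, and a bare name all reduce to the same kind of input, for example `report.pdf`, whose extension can then drive name-based detection.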
-
translate
public String translate(String text, String sourceLanguage, String targetLanguage)
Translates the given text from the given source language to the given target language.
Parameters:
text - the text to translate
sourceLanguage - the input text language (for example, "hi")
targetLanguage - the desired output language (for example, "fr")
Returns:
the translated text. If translation is unavailable (client keys not set), the same text is returned unchanged.
-
translate
public String translate(String text, String targetLanguage)
Translates the given text to the given language, attempting to auto-detect the source language.
Parameters:
text - the text to translate
targetLanguage - the desired output language (for example, "en")
Returns:
the translated text. If translation is unavailable (client keys not set), the same text is returned unchanged.
-
parse
public Reader parse(InputStream stream, Metadata metadata) throws IOException
Parses the given document and returns the extracted text content. Input metadata like a file name or a content type hint can be passed in the given metadata instance. Metadata information extracted from the document is returned in that same metadata instance.
The returned reader is responsible for closing the given stream. The stream and any associated resources will be closed at or before the time when the Reader.close() method is called.
Parameters:
stream - the document to be parsed
metadata - where the document's metadata will be populated
Returns:
extracted text content
Throws:
IOException - if the document cannot be read or parsed
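The ownership rule above, where closing the returned reader also closes the stream, follows the same pattern as the JDK's own `InputStreamReader`. The sketch below demonstrates that pattern with JDK classes only; `ReaderOwnsStream` and `TrackingStream` are hypothetical names, and no Tika class is involved.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo of "the reader owns the stream"; not Tika code.
final class ReaderOwnsStream {

    /** Stream wrapper that records whether close() has been called. */
    static final class TrackingStream extends FilterInputStream {
        boolean closed;

        TrackingStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Reads all text through a Reader, closes only the Reader, and reports whether the stream was closed too. */
    static boolean streamClosedAfterReaderClose(String text) {
        TrackingStream stream = new TrackingStream(
                new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        try (Reader reader = new InputStreamReader(stream, StandardCharsets.UTF_8)) {
            char[] buffer = new char[256];
            while (reader.read(buffer) != -1) {
                // consume the text; a real caller would process it here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return stream.closed;
    }
}
```

In practice this means a try-with-resources block around the returned reader is enough; the caller does not need a second block for the input stream.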
-
parse
public Reader parse(InputStream stream) throws IOException
Parses the given document and returns the extracted text content.
The returned reader is responsible for closing the given stream. The stream and any associated resources will be closed at or before the time when the Reader.close() method is called.
Parameters:
stream - the document to be parsed
Returns:
extracted text content
Throws:
IOException - if the document cannot be read or parsed
-
parse
public Reader parse(Path path, Metadata metadata) throws IOException
Parses the file at the given path and returns the extracted text content.
Metadata information extracted from the document is returned in the supplied metadata instance.
Parameters:
path - the path of the file to be parsed
metadata - where the document's metadata will be populated
Returns:
extracted text content
Throws:
IOException - if the file cannot be read or parsed
-
parse
public Reader parse(Path path) throws IOException
Parses the file at the given path and returns the extracted text content.
Parameters:
path - the path of the file to be parsed
Returns:
extracted text content
Throws:
IOException - if the file cannot be read or parsed
-
parse
public Reader parse(File file, Metadata metadata) throws IOException
Parses the given file and returns the extracted text content.
Metadata information extracted from the document is returned in the supplied metadata instance.
Parameters:
file - the file to be parsed
metadata - where the document's metadata will be populated
Returns:
extracted text content
Throws:
IOException - if the file cannot be read or parsed
-
parse
public Reader parse(File file) throws IOException
Parses the given file and returns the extracted text content.
Parameters:
file - the file to be parsed
Returns:
extracted text content
Throws:
IOException - if the file cannot be read or parsed
-
parse
public Reader parse(URL url) throws IOException
Parses the resource at the given URL and returns the extracted text content.
Parameters:
url - the URL of the resource to be parsed
Returns:
extracted text content
Throws:
IOException - if the resource cannot be read or parsed
-
parseToString
public String parseToString(InputStream stream, Metadata metadata) throws IOException, TikaException
Parses the given document and returns the extracted text content. The given input stream is closed by this method.
To avoid unpredictable excess memory use, the returned string contains at most the first getMaxStringLength() characters extracted from the input document. Use the setMaxStringLength(int) method to adjust this limit.
NOTE: Unlike most other Tika methods that take an InputStream, this method will close the given stream for you as a convenience. With other methods you are still responsible for closing the stream or a wrapper instance returned by Tika.
Parameters:
stream - the document to be parsed
metadata - document metadata
Returns:
extracted text content
Throws:
IOException - if the document cannot be read
TikaException - if the document cannot be parsed
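The note above describes a method that takes ownership of the stream it is given and closes it before returning. A minimal sketch of that convention, using only JDK classes and hypothetical names (`ClosingConsumer`, `TrackingStream`, `countBytesAndClose`), could look like this; it is an illustration of the pattern, not Tika's implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical demo of a method that closes the stream it receives.
final class ClosingConsumer {

    /** Stream wrapper that records whether close() has been called. */
    static final class TrackingStream extends FilterInputStream {
        boolean closed;

        TrackingStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Reads the stream to the end and closes it, even if reading fails. */
    static int countBytesAndClose(InputStream stream) {
        try (InputStream in = stream) { // try-with-resources closes the caller's stream
            int count = 0;
            byte[] buffer = new byte[512];
            int n;
            while ((n = in.read(buffer)) != -1) {
                count += n;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reports whether the stream handed to countBytesAndClose was closed by it. */
    static boolean closedAfterCall(byte[] data) {
        TrackingStream stream = new TrackingStream(new ByteArrayInputStream(data));
        countBytesAndClose(stream);
        return stream.closed;
    }
}
```

A caller of such a method should not reuse the stream afterwards, and does not need its own try-with-resources block for it.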
-
parseToString
public String parseToString(InputStream stream, Metadata metadata, int maxLength) throws IOException, TikaException
Parses the given document and returns the extracted text content. The given input stream is closed by this method. This method lets you control the maximum string length per call.
To avoid unpredictable excess memory use, the returned string contains at most the first maxLength characters extracted from the input document.
NOTE: Unlike most other Tika methods that take an InputStream, this method will close the given stream for you as a convenience. With other methods you are still responsible for closing the stream or a wrapper instance returned by Tika.
Parameters:
stream - the document to be parsed
metadata - document metadata
maxLength - maximum length of the returned string
Returns:
extracted text content
Throws:
IOException - if the document cannot be read
TikaException - if the document cannot be parsed
-
parseToString
public String parseToString(InputStream stream) throws IOException, TikaException
Parses the given document and returns the extracted text content. The given input stream is closed by this method.
To avoid unpredictable excess memory use, the returned string contains at most the first getMaxStringLength() characters extracted from the input document. Use the setMaxStringLength(int) method to adjust this limit.
NOTE: Unlike most other Tika methods that take an InputStream, this method will close the given stream for you as a convenience. With other methods you are still responsible for closing the stream or a wrapper instance returned by Tika.
Parameters:
stream - the document to be parsed
Returns:
extracted text content
Throws:
IOException - if the document cannot be read
TikaException - if the document cannot be parsed
-
parseToString
public String parseToString(Path path) throws IOException, TikaException
Parses the file at the given path and returns the extracted text content.
To avoid unpredictable excess memory use, the returned string contains at most the first getMaxStringLength() characters extracted from the input document. Use the setMaxStringLength(int) method to adjust this limit.
Parameters:
path - the path of the file to be parsed
Returns:
extracted text content
Throws:
IOException - if the file cannot be read
TikaException - if the file cannot be parsed
-
parseToString
public String parseToString(File file) throws IOException, TikaException
Parses the given file and returns the extracted text content.
To avoid unpredictable excess memory use, the returned string contains at most the first getMaxStringLength() characters extracted from the input document. Use the setMaxStringLength(int) method to adjust this limit.
Parameters:
file - the file to be parsed
Returns:
extracted text content
Throws:
IOException - if the file cannot be read
TikaException - if the file cannot be parsed
-
parseToString
public String parseToString(URL url) throws IOException, TikaException
Parses the resource at the given URL and returns the extracted text content.
To avoid unpredictable excess memory use, the returned string contains at most the first getMaxStringLength() characters extracted from the input document. Use the setMaxStringLength(int) method to adjust this limit.
Parameters:
url - the URL of the resource to be parsed
Returns:
extracted text content
Throws:
IOException - if the resource cannot be read
TikaException - if the resource cannot be parsed
-
getMaxStringLength
public int getMaxStringLength()
Returns the maximum length of strings returned by the parseToString methods.
Returns:
maximum string length, or -1 if the limit has been disabled
Since:
Apache Tika 0.7
-
setMaxStringLength
public void setMaxStringLength(int maxStringLength)
Sets the maximum length of strings returned by the parseToString methods.
Parameters:
maxStringLength - maximum string length, or -1 to disable this limit
Since:
Apache Tika 0.7
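The length limit and its -1 "disabled" value can be illustrated with a bounded read from a `Reader`. This is a minimal sketch using only JDK classes, not the way Tika enforces the limit internally; `BoundedText`, `readUpTo`, and `truncate` are hypothetical names.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper illustrating a maximum string length with -1 meaning "no limit".
final class BoundedText {

    /** Collects at most maxLength characters from the reader; a negative value disables the limit. */
    static String readUpTo(Reader reader, int maxLength) {
        StringBuilder out = new StringBuilder();
        char[] buffer = new char[1024];
        try {
            while (maxLength < 0 || out.length() < maxLength) {
                // Never ask for more characters than the remaining budget allows.
                int want = maxLength < 0
                        ? buffer.length
                        : Math.min(buffer.length, maxLength - out.length());
                int n = reader.read(buffer, 0, want);
                if (n == -1) {
                    break; // end of input reached before the limit
                }
                out.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    /** Convenience wrapper for in-memory text. */
    static String truncate(String text, int maxLength) {
        return readUpTo(new StringReader(text), maxLength);
    }
}
```

Stopping the read once the budget is spent, rather than trimming a fully built string, is what keeps memory use bounded for very large documents.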
-
getParser
public Parser getParser()
Returns the parser instance used by this facade.
Returns:
parser instance
Since:
Apache Tika 0.10
-
getDetector
public Detector getDetector()
Returns the detector instance used by this facade.
Returns:
detector instance
Since:
Apache Tika 0.10
-
getTranslator
public Translator getTranslator()
Returns the translator instance used by this facade.
Returns:
translator instance
Since:
Apache Tika 1.6
-
toString
-