Package org.apache.tika.detect
Class TextDetector
java.lang.Object
org.apache.tika.detect.TextDetector
- All Implemented Interfaces:
Serializable, Detector
Content type detection of plain text documents. This detector looks at the
beginning of the document input stream and considers the document to be
a text document if no ASCII (ISO-Latin-1, UTF-8, etc.) control bytes are
found. As a special case some control bytes (up to 2% of all characters)
are also allowed in a text document if it also contains no or just a few
(less than 10%) characters above the 7-bit ASCII range.
Note that text documents with a character encoding like UTF-16 are better
detected with MagicDetector and an appropriate magic byte pattern.
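The rule described above can be sketched in plain Java without any Tika dependency. This is a minimal illustration and not the library's implementation: the class and method names (TextHeuristicSketch, looksLikeText, guessType) are hypothetical, the set of low bytes treated as harmless (tab, line feed, carriage return, form feed, escape) is an assumption, and so is the choice to treat empty or null input as non-text.

```java
class TextHeuristicSketch {

    // Assumption: these low bytes count as ordinary text, not as control bytes.
    static boolean isSafeControl(int b) {
        return b == '\t' || b == '\n' || b == '\r' || b == 0x0C || b == 0x1B;
    }

    // Applies the documented rule to a prefix of the document:
    // text if there are no control bytes at all, or if control bytes make up
    // at most 2% of the prefix and bytes above 7-bit ASCII make up less than 10%.
    static boolean looksLikeText(byte[] prefix) {
        if (prefix == null || prefix.length == 0) {
            return false; // assumption: nothing to inspect means "not text"
        }
        int control = 0;
        int high = 0;
        for (byte raw : prefix) {
            int b = raw & 0xFF; // treat the byte as unsigned
            if (b < 0x20 && !isSafeControl(b)) {
                control++;
            } else if (b >= 0x80) {
                high++;
            }
        }
        if (control == 0) {
            return true;
        }
        int total = prefix.length;
        return control * 100 <= total * 2 && high * 100 < total * 10;
    }

    // Maps the decision onto the two media type strings named in this page.
    static String guessType(byte[] prefix) {
        return looksLikeText(prefix) ? "text/plain" : "application/octet-stream";
    }
}
```

Integer arithmetic is used for the percentage checks so that the thresholds do not depend on floating-point rounding.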
- Since:
- Apache Tika 0.3
Constructor Summary
TextDetector()
- Constructs a TextDetector which will look at the default number of bytes from the beginning of the document.
TextDetector(int bytesToTest)
- Constructs a TextDetector which will look at a given number of bytes from the beginning of the document.
Method Summary
detect(InputStream input, Metadata metadata)
- Looks at the beginning of the document input stream to determine whether the document is text or not.
Constructor Details
TextDetector
public TextDetector()
- Constructs a TextDetector which will look at the default number of bytes from the beginning of the document.
TextDetector
public TextDetector(int bytesToTest)
- Constructs a TextDetector which will look at a given number of bytes from the beginning of the document.
Method Details
detect
Looks at the beginning of the document input stream to determine whether the document is text or not.
- Specified by:
detect in interface Detector
- Parameters:
input
- document input stream, or null
metadata
- ignored
- Returns:
- "text/plain" if the input stream suggests a text document, "application/octet-stream" otherwise
- Throws:
IOException
- if the document input stream could not be read
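Looking only at the beginning of a stream, as the method description says, can be sketched with the standard mark and reset calls of java.io.InputStream. This is a hedged illustration, not Tika code: the class and method names (PrefixPeekSketch, peek) are hypothetical, and it assumes the caller wants the stream left at its original position afterwards, which this page does not state. A null stream is mapped to an empty prefix here purely for the sake of the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class PrefixPeekSketch {

    // Reads up to bytesToTest bytes from the start of the stream and then
    // rewinds it, so later consumers still see the whole document.
    static byte[] peek(InputStream input, int bytesToTest) throws IOException {
        if (input == null) {
            return new byte[0]; // assumption: null means there is nothing to inspect
        }
        if (!input.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        input.mark(bytesToTest);
        try {
            byte[] buffer = new byte[bytesToTest];
            int total = 0;
            // A single read may return fewer bytes than requested, so loop.
            while (total < bytesToTest) {
                int n = input.read(buffer, total, bytesToTest - total);
                if (n == -1) {
                    break; // end of stream before the limit was reached
                }
                total += n;
            }
            return Arrays.copyOf(buffer, total);
        } finally {
            input.reset(); // restore the position recorded by mark()
        }
    }
}
```

A stream that does not support marks, such as a raw FileInputStream, can be wrapped in a java.io.BufferedInputStream before being passed to a helper like this.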