Package org.apache.tika.detect
Class TextDetector
- java.lang.Object
  - org.apache.tika.detect.TextDetector
- All Implemented Interfaces: Serializable, Detector
public class TextDetector extends Object implements Detector
Content type detection of plain text documents. This detector looks at the beginning of the document input stream and considers the document to be a text document if no ASCII (ISO-Latin-1, UTF-8, etc.) control bytes are found. As a special case, some control bytes (up to 2% of all characters) are also allowed in a text document if it contains no, or only a few (less than 10%), characters above the 7-bit ASCII range.
Note that text documents with a character encoding like UTF-16 are better detected with MagicDetector and an appropriate magic byte pattern.
- Since: Apache Tika 0.3
- See Also: Serialized Form
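The rule described above can be approximated in plain Java. The following is a minimal sketch, not Tika's actual implementation: the class name TextHeuristicSketch, the method names, the exact set of bytes treated as control bytes, and the handling of empty input are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not part of Apache Tika. */
final class TextHeuristicSketch {

    static final String TEXT_PLAIN = "text/plain";
    static final String OCTET_STREAM = "application/octet-stream";

    /**
     * Assumption: control bytes are 0x00-0x1F and 0x7F, except tab, line feed,
     * form feed, carriage return and escape. The real detector may differ.
     */
    static boolean isControl(int b) {
        if (b == '\t' || b == '\n' || b == '\r' || b == 0x0C || b == 0x1B) {
            return false;
        }
        return b < 0x20 || b == 0x7F;
    }

    /** Classifies the bytes taken from the beginning of a document. */
    static String classify(byte[] prefix) {
        if (prefix.length == 0) {
            return OCTET_STREAM; // assumption: no evidence of text
        }
        int control = 0;
        int high = 0;
        for (byte raw : prefix) {
            int b = raw & 0xFF; // treat the byte as unsigned
            if (isControl(b)) {
                control++;
            } else if (b >= 0x80) {
                high++; // above the 7-bit ASCII range
            }
        }
        if (control == 0) {
            return TEXT_PLAIN; // main rule: no control bytes at all
        }
        // Special case: up to 2% control bytes and less than 10% high bytes.
        boolean fewControl = control * 100 <= prefix.length * 2;
        boolean fewHigh = high * 100 < prefix.length * 10;
        return (fewControl && fewHigh) ? TEXT_PLAIN : OCTET_STREAM;
    }

    public static void main(String[] args) {
        byte[] sample = "Hello, world\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(classify(sample)); // prints "text/plain"
    }
}
```

The sketch uses integer arithmetic for the percentage checks so that the 2% bound is inclusive and the 10% bound is strict, matching the wording "up to" and "less than" in the description.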
Constructor Summary
- TextDetector(): Constructs a TextDetector which will look at the default number of bytes from the beginning of the document.
- TextDetector(int bytesToTest): Constructs a TextDetector which will look at a given number of bytes from the beginning of the document.
Method Summary
- MediaType detect(InputStream input, Metadata metadata): Looks at the beginning of the document input stream to determine whether the document is text or not.
Constructor Detail
TextDetector
public TextDetector()
Constructs a TextDetector which will look at the default number of bytes from the beginning of the document.
TextDetector
public TextDetector(int bytesToTest)
Constructs a TextDetector which will look at a given number of bytes from the beginning of the document.
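This page does not say how the stream position is handled while the first bytesToTest bytes are examined. One common approach, assumed here purely for illustration, is to use mark and reset so that the caller can still read the document from the start afterwards. The class PrefixPeekSketch and its peek method are hypothetical names, not Tika API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of Apache Tika. */
final class PrefixPeekSketch {

    /**
     * Returns up to bytesToTest bytes from the beginning of the stream and
     * then rewinds the stream to where it was before the call.
     */
    static byte[] peek(InputStream in, int bytesToTest) throws IOException {
        if (bytesToTest < 0) {
            throw new IllegalArgumentException("bytesToTest must not be negative");
        }
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(bytesToTest); // remember the current position
        try {
            byte[] buffer = new byte[bytesToTest];
            int total = 0;
            // A single read may return fewer bytes than requested, so loop.
            while (total < bytesToTest) {
                int n = in.read(buffer, total, bytesToTest - total);
                if (n == -1) {
                    break; // end of stream before the limit was reached
                }
                total += n;
            }
            return Arrays.copyOf(buffer, total);
        } finally {
            in.reset(); // rewind so the caller sees the full document
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
                "abcdefgh".getBytes(StandardCharsets.US_ASCII));
        byte[] prefix = peek(in, 3);
        System.out.println(new String(prefix, StandardCharsets.US_ASCII)); // prints "abc"
    }
}
```

Streams that do not support mark/reset can be wrapped in a java.io.BufferedInputStream before being passed to a helper like this.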
Method Detail
detect
public MediaType detect(InputStream input, Metadata metadata) throws IOException
Looks at the beginning of the document input stream to determine whether the document is text or not.
- Specified by: detect in interface Detector
- Parameters:
  input - document input stream, or null
  metadata - ignored
- Returns: "text/plain" if the input stream suggests a text document, "application/octet-stream" otherwise
- Throws: IOException - if the document input stream could not be read