org.apache.tika.detect
Class TextDetector
java.lang.Object
org.apache.tika.detect.TextDetector
- All Implemented Interfaces:
- java.io.Serializable, Detector
public class TextDetector
- extends java.lang.Object
- implements Detector
Content type detection of plain text documents. This detector looks at the
beginning of the document input stream and considers the document to be
text if it finds no control bytes, as defined by ASCII and shared by
ASCII-compatible encodings such as ISO-Latin-1 and UTF-8. As a special
case, some control bytes (up to 2% of all characters) are still allowed
in a text document if it contains no or only a few (less than 10%)
characters above the 7-bit ASCII range.
Note that text documents with a character encoding like UTF-16 are better
detected with MagicDetector and an appropriate magic byte pattern.
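The thresholds described above can be illustrated with a small standalone sketch. It is not the Tika implementation: the class name TextHeuristicSketch, the method names, the exact set of bytes treated as control bytes, and the handling of an empty prefix are all assumptions made for illustration. Only the 2% and 10% limits come from the description.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative sketch of the described heuristic; not Tika's own code. */
class TextHeuristicSketch {

    /**
     * Assumption: tab, line feed, form feed and carriage return are treated
     * as ordinary text; all other bytes below 0x20, plus 0x7F, are control bytes.
     */
    static boolean isControl(int b) {
        if (b == '\t' || b == '\n' || b == '\f' || b == '\r') {
            return false;
        }
        return b < 0x20 || b == 0x7F;
    }

    /** Applies the 2% / 10% rule to the bytes read from the start of a document. */
    static boolean looksLikeText(byte[] prefix) {
        if (prefix.length == 0) {
            return false; // assumption: an empty prefix gives no evidence of text
        }
        int control = 0;
        int high = 0;
        for (byte raw : prefix) {
            int b = raw & 0xFF; // read the byte as an unsigned value
            if (isControl(b)) {
                control++;
            } else if (b >= 0x80) {
                high++; // above the 7-bit ASCII range
            }
        }
        if (control == 0) {
            return true; // no control bytes at all: text
        }
        // Special case: up to 2% control bytes, and less than 10% high bytes.
        boolean fewControl = control * 100 <= prefix.length * 2;
        boolean fewHigh = high * 100 < prefix.length * 10;
        return fewControl && fewHigh;
    }

    public static void main(String[] args) {
        byte[] sample = "Hello, world!\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeText(sample));
    }
}
```

Integer arithmetic is used for the percentage checks so that the boundary cases (exactly 2% control bytes, exactly 10% high bytes) do not depend on floating-point rounding.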
- Since:
- Apache Tika 0.3
- See Also:
- Serialized Form
Method Summary
MediaType detect(java.io.InputStream input, Metadata metadata)
Looks at the beginning of the document input stream to determine
whether the document is text or not.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
TextDetector
public TextDetector()
detect
public MediaType detect(java.io.InputStream input,
Metadata metadata)
throws java.io.IOException
- Looks at the beginning of the document input stream to determine
whether the document is text or not.
- Specified by:
- detect in interface Detector
- Parameters:
- input - document input stream, or null
- metadata - ignored
- Returns:
- "text/plain" if the input stream suggests a text document,
"application/octet-stream" otherwise
- Throws:
- java.io.IOException - if the document input stream could not be read
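The contract of detect can be sketched in plain Java as follows. This is a simplified illustration rather than the Tika source: the class DetectContractSketch, the method detectType, and the 512-byte prefix length are hypothetical. It also assumes that a null stream yields "application/octet-stream" and that the stream supports mark/reset so the caller can still read the document afterwards. Only the strict rule (any control byte means binary) is applied here, not the 2% / 10% special case.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Illustrative sketch of the detect contract; not Tika's own code. */
class DetectContractSketch {

    /** Hypothetical number of leading bytes to inspect. */
    static final int PREFIX_LENGTH = 512;

    static String detectType(InputStream input) throws IOException {
        if (input == null) {
            return "application/octet-stream"; // assumption: null means unknown content
        }
        if (!input.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        input.mark(PREFIX_LENGTH);
        try {
            byte[] buffer = new byte[PREFIX_LENGTH];
            int total = 0;
            int n;
            // Read until the buffer is full or the stream ends.
            while (total < PREFIX_LENGTH
                    && (n = input.read(buffer, total, PREFIX_LENGTH - total)) != -1) {
                total += n;
            }
            if (total == 0) {
                return "application/octet-stream"; // assumption: empty input is not text
            }
            for (int i = 0; i < total; i++) {
                int b = buffer[i] & 0xFF;
                boolean whitespace = b == '\t' || b == '\n' || b == '\f' || b == '\r';
                if (!whitespace && (b < 0x20 || b == 0x7F)) {
                    return "application/octet-stream"; // control byte found
                }
            }
            return "text/plain";
        } finally {
            input.reset(); // leave the stream positioned at its start for the caller
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
                "plain text\n".getBytes(StandardCharsets.US_ASCII));
        System.out.println(detectType(in));
    }
}
```

Read failures propagate to the caller as java.io.IOException, matching the Throws clause above.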
Copyright © 2007-2011 The Apache Software Foundation. All Rights Reserved.