TextDetector (Apache Tika 0.10 API)

org.apache.tika.detect
Class TextDetector

java.lang.Object
  org.apache.tika.detect.TextDetector

All Implemented Interfaces: java.io.Serializable, Detector

public class TextDetector
extends java.lang.Object
implements Detector

Content type detection of plain text documents. This detector looks at the beginning of the document input stream and considers the document to be a text document if no ASCII (ISO-Latin-1, UTF-8, etc.) control bytes are found. As a special case, a small number of control bytes (up to 2% of all characters) is also allowed in a text document, provided that it contains no or only a few (less than 10%) characters above the 7-bit ASCII range.
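The thresholds described above can be illustrated with a minimal, standalone sketch. It is not the Tika implementation: the class name `TextHeuristicSketch`, the 512-byte prefix length, the exact set of bytes treated as control bytes, and the handling of empty and null input are all assumptions made for illustration. Only the 2% and 10% limits and the two media type strings come from this documentation.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; this is not the Tika source code.
final class TextHeuristicSketch {

    // Assumption: the documentation only says "the beginning" of the stream.
    static final int PREFIX_LENGTH = 512;

    // Assumption: tab, LF, FF and CR count as ordinary text; every other
    // byte below 0x20, and DEL (0x7F), counts as a control byte.
    static boolean isControl(int b) {
        if (b == '\t' || b == '\n' || b == '\f' || b == '\r') {
            return false;
        }
        return b < 0x20 || b == 0x7F;
    }

    // Applies the documented thresholds to the first `length` bytes.
    static boolean looksLikeText(byte[] prefix, int length) {
        if (length == 0) {
            return false; // assumption: an empty prefix is not treated as text
        }
        int control = 0;
        int highBit = 0;
        for (int i = 0; i < length; i++) {
            int b = prefix[i] & 0xFF;
            if (isControl(b)) {
                control++;
            } else if (b >= 0x80) {
                highBit++;
            }
        }
        if (control == 0) {
            return true; // no control bytes at all
        }
        // Special case: up to 2% control bytes, less than 10% bytes above 7-bit ASCII.
        return control * 100 <= 2 * length && highBit * 100 < 10 * length;
    }

    static String detect(InputStream input) throws IOException {
        if (input == null) {
            return "application/octet-stream"; // assumption for the null case
        }
        // Assumes input.markSupported() is true; reset() fails otherwise.
        input.mark(PREFIX_LENGTH);
        try {
            byte[] buffer = new byte[PREFIX_LENGTH];
            int n = 0;
            int read;
            while (n < buffer.length
                    && (read = input.read(buffer, n, buffer.length - n)) != -1) {
                n += read;
            }
            return looksLikeText(buffer, n) ? "text/plain" : "application/octet-stream";
        } finally {
            input.reset(); // leave the stream positioned where the caller had it
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] text = "Hello, world!\n".getBytes(StandardCharsets.US_ASCII);
        byte[] binary = {0x00, 0x01, 0x02, 'a'};
        System.out.println(detect(new ByteArrayInputStream(text)));   // text/plain
        System.out.println(detect(new ByteArrayInputStream(binary))); // application/octet-stream
    }
}
```

The sketch marks and resets the stream so that the inspected prefix stays available to the caller, which is why it assumes a stream that supports `mark`.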

Note that text documents with a character encoding like UTF-16 are better detected with MagicDetector and an appropriate magic byte pattern.

Since: Apache Tika 0.3
See Also: Serialized Form

Constructor Summary
`TextDetector()`

Method Summary
`MediaType`	`detect(java.io.InputStream input, Metadata metadata)` Looks at the beginning of the document input stream to determine whether the document is text or not.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

TextDetector

public TextDetector()

Method Detail

detect

public MediaType detect(java.io.InputStream input,
                        Metadata metadata)
                 throws java.io.IOException

Looks at the beginning of the document input stream to determine whether the document is text or not.

Specified by: detect in interface Detector

Parameters:
  input - document input stream, or null
  metadata - ignored
Returns: "text/plain" if the input stream suggests a text document, "application/octet-stream" otherwise
Throws: java.io.IOException - if the document input stream could not be read