java.lang.Object
  org.apache.tika.detect.TextStatistics
public class TextStatistics
Utility class for computing a histogram of the bytes seen in a stream.
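As a rough illustration of what "a histogram of the bytes seen in a stream" means, the sketch below keeps one counter per possible byte value. The class name `ByteHistogramSketch` is hypothetical and this is not the Tika implementation; only the method names mirror the ones documented on this page.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of a byte histogram; not the Tika source.
final class ByteHistogramSketch {
    private final int[] counts = new int[256]; // one slot per byte value
    private int total = 0;                     // number of bytes seen so far

    // Records buffer[offset] .. buffer[offset + length - 1].
    void addData(byte[] buffer, int offset, int length) {
        for (int i = offset; i < offset + length; i++) {
            counts[buffer[i] & 0xFF]++; // mask because Java bytes are signed
        }
        total += length;
    }

    // Total number of bytes recorded.
    int count() {
        return total;
    }

    // Number of occurrences of the byte value b (only the low 8 bits are used).
    int count(int b) {
        return counts[b & 0xFF];
    }

    public static void main(String[] args) {
        ByteHistogramSketch stats = new ByteHistogramSketch();
        byte[] data = "hello".getBytes(StandardCharsets.US_ASCII);
        stats.addData(data, 0, data.length);
        System.out.println(stats.count());    // 5
        System.out.println(stats.count('l')); // 2
    }
}
```

Masking with `0xFF` is the important detail: without it, bytes at or above 0x80 would become negative array indexes.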
Constructor Summary |
---|
TextStatistics() |
Method Summary | |
---|---|
void | addData(byte[] buffer, int offset, int length) |
int | count(): Returns the total number of bytes seen so far. |
int | count(int b): Returns the number of occurrences of the given byte. |
int | countControl(): Counts control characters (i.e. bytes in the binary data ranges listed in the method detail). |
int | countEightBit(): Counts eight-bit characters, i.e. bytes with their highest bit set. |
int | countSafeAscii(): Counts "safe" ASCII characters; see countControl(). |
boolean | isMostlyAscii(): Checks whether at least one byte was seen and that the bytes that were seen were mostly plain text. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public TextStatistics()
Method Detail |
---|
public void addData(byte[] buffer, int offset, int length)
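public boolean isMostlyAscii()
Checks whether at least one byte was seen and that the bytes that were seen were mostly plain text.
Returns: true if the seen bytes were mostly safe ASCII, false otherwise
public int count()
Returns the total number of bytes seen so far.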
public int count(int b)
Returns the number of occurrences of the given byte.
Parameters: b - byte
public int countControl()
This definition of control characters is based on section 4 of the "Content-Type Processing Model" Internet-draft (draft-abarth-mime-sniff-01).
+-------------------------+
| Binary data byte ranges |
+-------------------------+
| 0x00 -- 0x08            |
| 0x0B                    |
| 0x0E -- 0x1A            |
| 0x1C -- 0x1F            |
+-------------------------+
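The byte ranges listed above can be turned into a simple predicate. The following is a minimal sketch, not the Tika source; the class `ControlBytesSketch` and its helper names are hypothetical. The ranges leave out 0x09 (tab), 0x0A (line feed), 0x0C (form feed), 0x0D (carriage return) and 0x1B (escape), so those bytes are not counted as control characters.

```java
// Hypothetical illustration of the "binary data byte ranges"; not the Tika source.
final class ControlBytesSketch {
    // True if b (an unsigned value 0..255) falls in one of the four listed ranges.
    static boolean isControl(int b) {
        return (b >= 0x00 && b <= 0x08)
                || b == 0x0B
                || (b >= 0x0E && b <= 0x1A)
                || (b >= 0x1C && b <= 0x1F);
    }

    // Counts how many bytes of data fall in the listed ranges.
    static int countControl(byte[] data) {
        int n = 0;
        for (byte value : data) {
            if (isControl(value & 0xFF)) { // mask because Java bytes are signed
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // NUL, tab, LF, vertical tab, ESC, 'A'
        byte[] sample = {0x00, 0x09, 0x0A, 0x0B, 0x1B, 0x41};
        System.out.println(countControl(sample)); // 2 (0x00 and 0x0B)
    }
}
```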
public int countSafeAscii()
See Also: countControl()
public int countEightBit()
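To show how the three byte categories could combine into a "mostly ASCII" decision, here is a hedged sketch. It is not the Tika source: the class `MostlyAsciiSketch` is hypothetical, "safe ASCII" is assumed to mean any seven-bit byte outside the control ranges, and the two percentage thresholds are illustrative assumptions that this page does not state.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; definitions and thresholds are assumptions.
final class MostlyAsciiSketch {
    // Assumed thresholds, chosen only for illustration.
    static final int MAX_CONTROL_PERCENT = 2;
    static final int MIN_SAFE_PERCENT = 90;

    // Control ranges as listed under countControl().
    static boolean isControl(int b) {
        return (b >= 0x00 && b <= 0x08)
                || b == 0x0B
                || (b >= 0x0E && b <= 0x1A)
                || (b >= 0x1C && b <= 0x1F);
    }

    // Eight-bit: the highest bit of the byte is set.
    static boolean isEightBit(int b) {
        return b >= 0x80 && b <= 0xFF;
    }

    // Assumption: "safe" means seven-bit and not in a control range.
    static boolean isSafeAscii(int b) {
        return b >= 0x00 && b < 0x80 && !isControl(b);
    }

    // Requires at least one byte, few control bytes, and mostly safe ASCII.
    static boolean isMostlyAscii(byte[] data) {
        int control = 0;
        int safe = 0;
        for (byte value : data) {
            int b = value & 0xFF; // mask because Java bytes are signed
            if (isControl(b)) {
                control++;
            } else if (isSafeAscii(b)) {
                safe++;
            }
        }
        int total = data.length;
        return total > 0
                && control * 100 < total * MAX_CONTROL_PERCENT
                && safe * 100 > total * MIN_SAFE_PERCENT;
    }

    public static void main(String[] args) {
        byte[] text = "plain text\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(isMostlyAscii(text));                                 // true
        System.out.println(isMostlyAscii(new byte[] {0x00, 0x01, (byte) 0xFF})); // false
        System.out.println(isMostlyAscii(new byte[0]));                          // false
    }
}
```

The empty-input case returns false because the documented check requires that at least one byte was seen.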