Package org.apache.tika.detect
Class TextStatistics
java.lang.Object
org.apache.tika.detect.TextStatistics
Utility class for computing a histogram of the bytes seen in a stream.
- Since:
- Apache Tika 1.2

Constructor Summary
- TextStatistics()

Method Summary
- void addData(byte[] buffer, int offset, int length)
- int count(): Returns the total number of bytes seen so far.
- int count(int b): Returns the number of occurrences of the given byte.
- int countControl(): Counts control characters (i.e. < 0x20, excluding tab, CR, LF, page feed and escape).
- int countEightBit(): Counts eight bit characters, i.e. bytes with their highest bit set.
- int countSafeAscii(): Counts "safe" (i.e. seven-bit non-control) ASCII characters.
- boolean isMostlyAscii(): Checks whether at least one byte was seen and that the bytes that were seen were mostly plain text (i.e. < 2% control, > 90% ASCII range).
- boolean looksLikeUTF8(): Checks whether the observed byte stream looks like UTF-8 encoded text.

Constructor Details
TextStatistics
public TextStatistics()

Method Details
addData
public void addData(byte[] buffer, int offset, int length)

isMostlyAscii
public boolean isMostlyAscii()
Checks whether at least one byte was seen and that the bytes that were seen were mostly plain text (i.e. < 2% control, > 90% ASCII range).
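
As a rough illustration of the "< 2% control, > 90% ASCII range" test, the following self-contained sketch applies those two thresholds to a byte array. The class name `MostlyAsciiSketch`, the helper `isSafeControl`, and the choice to count tab, LF, page feed, CR and escape toward the ASCII share are assumptions made for illustration; this is not taken from the Tika source.

```java
final class MostlyAsciiSketch {

    // Assumption: tab, LF, page feed, CR and ESC are treated as harmless text bytes.
    static boolean isSafeControl(int b) {
        return b == '\t' || b == '\n' || b == '\f' || b == '\r' || b == 0x1B;
    }

    static boolean isMostlyAscii(byte[] data) {
        int total = data.length;
        int control = 0; // bytes below 0x20 that are not "safe" controls
        int ascii = 0;   // seven-bit bytes from 0x20 up, plus "safe" controls
        for (byte raw : data) {
            int b = raw & 0xFF; // treat the byte as unsigned
            if (b < 0x20) {
                if (isSafeControl(b)) {
                    ascii++;
                } else {
                    control++;
                }
            } else if (b < 0x80) {
                ascii++;
            }
        }
        // Integer arithmetic avoids rounding: control < 2% and ascii > 90% of total.
        return total > 0 && control * 100 < total * 2 && ascii * 100 > total * 90;
    }
}
```

An empty input is rejected by the `total > 0` condition, which mirrors the "at least one byte was seen" requirement in the description.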

looksLikeUTF8
public boolean looksLikeUTF8()
Checks whether the observed byte stream looks like UTF-8 encoded text.
- Returns:
- true if the seen bytes look like UTF-8, false otherwise
- Since:
- Apache Tika 1.3
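
The page does not say how the UTF-8 check is performed. One possible heuristic that works from a byte histogram alone compares the number of continuation bytes (0x80 to 0xBF) with the number implied by the observed lead bytes. The sketch below shows that idea; the class name `Utf8HistogramSketch`, its methods, and the exact acceptance rule are assumptions for illustration and are not confirmed to match Tika's behavior. Because it ignores byte order, it can accept some inputs that a strict UTF-8 decoder would reject.

```java
final class Utf8HistogramSketch {

    // Sums histogram entries for byte values in the half-open range [from, to).
    static int count(int[] histogram, int from, int to) {
        int sum = 0;
        for (int b = from; b < to; b++) {
            sum += histogram[b];
        }
        return sum;
    }

    static boolean looksLikeUtf8(byte[] data) {
        int[] histogram = new int[256];
        for (byte raw : data) {
            histogram[raw & 0xFF]++;
        }

        int twoByteLeads = count(histogram, 0xC0, 0xE0);   // 110xxxxx
        int threeByteLeads = count(histogram, 0xE0, 0xF0); // 1110xxxx
        int fourByteLeads = count(histogram, 0xF0, 0xF8);  // 11110xxx
        int continuation = count(histogram, 0x80, 0xC0);   // 10xxxxxx
        int invalid = count(histogram, 0xF8, 0x100);       // never valid in UTF-8

        // Each lead byte announces a fixed number of continuation bytes.
        int expected = twoByteLeads + 2 * threeByteLeads + 3 * fourByteLeads;
        return data.length > 0 && invalid == 0 && continuation == expected;
    }
}
```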
 

count
public int count()
Returns the total number of bytes seen so far.
- Returns:
- count of all bytes
 

count
public int count(int b)
Returns the number of occurrences of the given byte.
- Parameters:
- b - byte
- Returns:
- count of the given byte
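
To make the histogram idea concrete, here is a minimal standalone sketch of a class with the same three method shapes (`addData`, `count()`, `count(int)`). The class `ByteHistogramSketch` and its fields are hypothetical and written only to illustrate the concept; Tika's actual implementation may differ.

```java
final class ByteHistogramSketch {

    private final int[] counts = new int[256]; // one slot per possible byte value
    private int total = 0;

    // Adds length bytes from buffer, starting at offset, to the histogram.
    void addData(byte[] buffer, int offset, int length) {
        for (int i = 0; i < length; i++) {
            counts[buffer[offset + i] & 0xFF]++; // mask so the byte is treated as unsigned
            total++;
        }
    }

    // Total number of bytes added so far.
    int count() {
        return total;
    }

    // Number of occurrences of the given byte value.
    int count(int b) {
        return counts[b & 0xFF];
    }
}
```

Calling `addData` repeatedly accumulates counts across calls, so a stream can be fed in chunks as it is read.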
 

countControl
public int countControl()
Counts control characters (i.e. < 0x20, excluding tab, CR, LF, page feed and escape).
This definition of control characters is based on section 4 of the "Content-Type Processing Model" Internet-draft (draft-abarth-mime-sniff-01).

+-------------------------+
| Binary data byte ranges |
+-------------------------+
| 0x00 -- 0x08            |
| 0x0B                    |
| 0x0E -- 0x1A            |
| 0x1C -- 0x1F            |
+-------------------------+

- Returns:
- count of control characters
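
The byte ranges in the table translate directly into a predicate. The sketch below encodes them and counts matching bytes in an array; `ControlByteSketch` and its method names are hypothetical and only illustrate the ranges listed above.

```java
final class ControlByteSketch {

    // True for the "binary data" byte ranges listed in the table:
    // 0x00-0x08, 0x0B, 0x0E-0x1A and 0x1C-0x1F.
    static boolean isControl(int b) {
        return (b >= 0x00 && b <= 0x08)
                || b == 0x0B
                || (b >= 0x0E && b <= 0x1A)
                || (b >= 0x1C && b <= 0x1F);
    }

    static int countControl(byte[] data) {
        int n = 0;
        for (byte raw : data) {
            if (isControl(raw & 0xFF)) { // mask so the byte is treated as unsigned
                n++;
            }
        }
        return n;
    }
}
```

The gaps in the ranges correspond to the excluded bytes: tab (0x09), LF (0x0A), page feed (0x0C), CR (0x0D) and escape (0x1B).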
 

countSafeAscii
public int countSafeAscii()
Counts "safe" (i.e. seven-bit non-control) ASCII characters.
- Returns:
- count of safe ASCII characters
 

countEightBit
public int countEightBit()
Counts eight bit characters, i.e. bytes with their highest bit set.
- Returns:
- count of eight bit characters
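
"Highest bit set" means the byte value is 0x80 or above when read as unsigned. A minimal sketch of that test follows; the class `EightBitSketch` is hypothetical and not part of Tika.

```java
final class EightBitSketch {

    // Counts bytes whose most significant bit is set (unsigned value >= 0x80).
    static int countEightBit(byte[] data) {
        int n = 0;
        for (byte b : data) {
            if ((b & 0x80) != 0) {
                n++;
            }
        }
        return n;
    }
}
```

Pure ASCII input yields zero, while the same non-ASCII character can contribute a different count depending on the encoding used to produce the bytes.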
 
 
-