Package org.apache.tika.detect
Class TextStatistics
java.lang.Object
org.apache.tika.detect.TextStatistics
Utility class for computing a histogram of the bytes seen in a stream.
- Since:
- Apache Tika 1.2
Constructor Summary
Constructors
TextStatistics()
Method Summary
Modifier and Type | Method | Description
void | addData(byte[] buffer, int offset, int length) |
int | count() | Returns the total number of bytes seen so far.
int | count(int b) | Returns the number of occurrences of the given byte.
int | countControl() | Counts control characters (i.e. < 0x20, excluding tab, CR, LF, page feed and escape).
int | countEightBit() | Counts eight bit characters, i.e. bytes with their highest bit set.
int | countSafeAscii() | Counts "safe" (i.e. seven-bit non-control) ASCII characters.
boolean | isMostlyAscii() | Checks whether at least one byte was seen and that the bytes that were seen were mostly plain text (i.e. < 2% control, > 90% ASCII range).
boolean | looksLikeUTF8() | Checks whether the observed byte stream looks like UTF-8 encoded text.
Constructor Details
TextStatistics
public TextStatistics()
Method Details
addData
public void addData(byte[] buffer, int offset, int length)
isMostlyAscii
public boolean isMostlyAscii()
Checks whether at least one byte was seen and that the bytes that were seen were mostly plain text (i.e. < 2% control, > 90% ASCII range).
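The two thresholds can be illustrated with a small standalone sketch. It is not Tika's source: the class and method names are hypothetical, and it assumes that "control" means bytes below 0x20 other than tab, LF, page feed, CR and escape, and that "ASCII range" means every other seven-bit byte.

```java
// Hypothetical sketch of the "< 2% control, > 90% ASCII range" test.
// Names and the exact byte classification are assumptions, not Tika code.
final class MostlyAsciiSketch {

    // Assumed definition: below 0x20, excluding tab, LF, page feed, CR and escape.
    static boolean isControl(int b) {
        return b < 0x20 && b != '\t' && b != '\n' && b != '\f' && b != '\r' && b != 0x1b;
    }

    static boolean mostlyAscii(byte[] data) {
        int total = data.length;
        int control = 0;
        int ascii = 0;
        for (byte raw : data) {
            int b = raw & 0xff; // Java bytes are signed; mask to 0..255
            if (isControl(b)) {
                control++;
            } else if (b < 0x80) {
                ascii++;
            }
        }
        // Integer arithmetic avoids floating-point percentages.
        return total > 0
                && control * 100 < total * 2
                && ascii * 100 > total * 90;
    }
}
```

Under these assumptions an empty input is rejected, because at least one byte must have been seen, and an input with exactly 2% control bytes is rejected, because the comparison is strict.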
looksLikeUTF8
public boolean looksLikeUTF8()
Checks whether the observed byte stream looks like UTF-8 encoded text.
- Returns:
- true if the seen bytes look like UTF-8, false otherwise
- Since:
- Apache Tika 1.3
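Because the class only keeps a histogram, any UTF-8 check has to work from byte counts rather than byte order. The sketch below shows one simplified way such a check could look; it is an assumption for illustration, not the rule Tika actually applies, and all names are hypothetical. It compares the number of continuation bytes (0x80 to 0xBF) with the number implied by the lead bytes seen.

```java
// Hypothetical histogram-only UTF-8 heuristic; not Tika's implementation.
final class Utf8HistogramSketch {

    // Sums histogram buckets in the half-open range [from, to).
    static int countRange(int[] counts, int from, int to) {
        int sum = 0;
        for (int i = from; i < to; i++) {
            sum += counts[i];
        }
        return sum;
    }

    static boolean looksLikeUtf8(byte[] data) {
        int[] counts = new int[256];
        for (byte raw : data) {
            counts[raw & 0xff]++;
        }
        int continuation = countRange(counts, 0x80, 0xc0);
        // Lead bytes of 2-, 3- and 4-byte sequences need 1, 2 and 3 continuation bytes.
        int expected = countRange(counts, 0xc0, 0xe0)
                + 2 * countRange(counts, 0xe0, 0xf0)
                + 3 * countRange(counts, 0xf0, 0xf8);
        // Bytes 0xF8..0xFF never occur in UTF-8.
        boolean noInvalidBytes = countRange(counts, 0xf8, 0x100) == 0;
        return data.length > 0 && noInvalidBytes && continuation == expected;
    }
}
```

A count-based test like this ignores byte order and overlong encodings, so it can be fooled by crafted input; it is a plausibility check, not a validator.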
count
public int count()
Returns the total number of bytes seen so far.
- Returns:
- count of all bytes
count
public int count(int b)
Returns the number of occurrences of the given byte.
- Parameters:
- b - byte
- Returns:
- count of the given byte
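To make the histogram idea concrete, here is a minimal standalone sketch of how addData, count() and count(int b) could fit together. The class name and internals are assumptions for illustration; only the three method signatures mirror the ones documented above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a 256-bucket byte histogram; not Tika's source.
final class ByteHistogramSketch {
    private final int[] counts = new int[256]; // one bucket per unsigned byte value
    private int total;

    // Tallies `length` bytes of `buffer`, starting at `offset`.
    void addData(byte[] buffer, int offset, int length) {
        for (int i = offset; i < offset + length; i++) {
            counts[buffer[i] & 0xff]++; // mask because Java bytes are signed
            total++;
        }
    }

    // Total number of bytes tallied so far.
    int count() {
        return total;
    }

    // Number of occurrences of the byte value `b` (low eight bits only).
    int count(int b) {
        return counts[b & 0xff];
    }

    public static void main(String[] args) {
        ByteHistogramSketch h = new ByteHistogramSketch();
        byte[] data = "banana".getBytes(StandardCharsets.US_ASCII);
        h.addData(data, 0, data.length);
        System.out.println(h.count());    // prints 6
        System.out.println(h.count('a')); // prints 3
    }
}
```

Calling addData repeatedly with successive chunks of a stream accumulates into the same buckets, so the counts describe everything seen so far.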
countControl
public int countControl()
Counts control characters (i.e. < 0x20, excluding tab, CR, LF, page feed and escape).
This definition of control characters is based on section 4 of the "Content-Type Processing Model" Internet-draft (draft-abarth-mime-sniff-01).
+-------------------------+
| Binary data byte ranges |
+-------------------------+
| 0x00 -- 0x08            |
| 0x0B                    |
| 0x0E -- 0x1A            |
| 0x1C -- 0x1F            |
+-------------------------+
- Returns:
- count of control characters
- See Also:
-
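The binary data byte ranges listed for countControl translate directly into a range check. The following sketch follows that table literally; the class and method names are hypothetical, and it makes no claim about how Tika computes the value internally.

```java
// Hypothetical sketch: counting bytes that fall in the listed binary data ranges.
final class ControlCountSketch {

    // True for 0x00-0x08, 0x0B, 0x0E-0x1A and 0x1C-0x1F.
    static boolean isBinaryControl(int b) {
        return (b >= 0x00 && b <= 0x08)
                || b == 0x0b
                || (b >= 0x0e && b <= 0x1a)
                || (b >= 0x1c && b <= 0x1f);
    }

    static int countControl(byte[] data) {
        int n = 0;
        for (byte raw : data) {
            if (isBinaryControl(raw & 0xff)) { // mask to an unsigned value first
                n++;
            }
        }
        return n;
    }
}
```

Tab (0x09), LF (0x0A), page feed (0x0C), CR (0x0D) and escape (0x1B) fall in the gaps between the ranges, which matches the exclusions named in the description.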
countSafeAscii
public int countSafeAscii()
Counts "safe" (i.e. seven-bit non-control) ASCII characters.
- Returns:
- count of safe ASCII characters
- See Also:
-
countEightBit
public int countEightBit()
Counts eight bit characters, i.e. bytes with their highest bit set.
- Returns:
- count of eight bit characters
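Read together, the control, safe ASCII and eight-bit counters describe three disjoint classes of bytes. The sketch below shows that partition on a raw byte array. The names are hypothetical, and it assumes that every seven-bit byte that is not a control character counts as "safe", which is one reading of the descriptions above rather than a confirmed detail of Tika.

```java
// Hypothetical sketch: splitting bytes into control, safe ASCII and eight-bit classes.
final class ByteClassSketch {

    // Returns {control, safeAscii, eightBit}.
    static int[] classify(byte[] data) {
        int control = 0;
        int safeAscii = 0;
        int eightBit = 0;
        for (byte raw : data) {
            int b = raw & 0xff;
            if ((b & 0x80) != 0) {
                eightBit++; // highest bit set
            } else if (b < 0x20 && b != '\t' && b != '\n' && b != '\f'
                    && b != '\r' && b != 0x1b) {
                control++; // below 0x20, excluding tab, LF, page feed, CR, escape
            } else {
                safeAscii++; // assumed: every remaining seven-bit byte
            }
        }
        return new int[] {control, safeAscii, eightBit};
    }
}
```

Under that assumption the three counts always add up to the total number of bytes, which makes them convenient inputs for percentage-based checks.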