Package org.apache.tika.detect
Class TextStatistics
java.lang.Object
org.apache.tika.detect.TextStatistics
Utility class for computing a histogram of the bytes seen in a stream.
Since: Apache Tika 1.2
Constructor Summary
TextStatistics()
Method Summary
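Modifier and Type | Method | Description
void | addData(byte[] buffer, int offset, int length) |
int | count() | Returns the total number of bytes seen so far.
int | count(int b) | Returns the number of occurrences of the given byte.
int | countControl() | Counts control characters (i.e. < 0x20, excluding tab, CR, LF, page feed and escape).
int | countEightBit() | Counts eight bit characters, i.e. bytes with their highest bit set.
int | countSafeAscii() | Counts "safe" (i.e. seven-bit non-control) ASCII characters.
boolean | isMostlyAscii() | Checks whether at least one byte was seen and that the bytes that were seen were mostly plain text (i.e. < 2% control, > 90% ASCII range).
boolean | looksLikeUTF8() | Checks whether the observed byte stream looks like UTF-8 encoded text.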
Constructor Details
TextStatistics
public TextStatistics()
Method Details
addData
public void addData(byte[] buffer, int offset, int length)
isMostlyAscii
public boolean isMostlyAscii()
Checks whether at least one byte was seen and that the bytes that were seen were mostly plain text (i.e. < 2% control, > 90% ASCII range).
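The two thresholds above can be illustrated with a minimal, self-contained sketch. It is not Tika's source: the class name MostlyAsciiSketch is hypothetical, and it assumes that "ASCII range" means every seven-bit byte that is not a control character, including tab, CR, LF, page feed and escape.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for the documented check; not Tika's implementation. */
final class MostlyAsciiSketch {

    /** True for bytes below 0x20, excluding tab, LF, page feed, CR and escape. */
    static boolean isControl(int b) {
        return b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0C && b != 0x0D && b != 0x1B;
    }

    /** Assumed reading of the rule: fewer than 2% control bytes, more than 90% ASCII bytes. */
    static boolean isMostlyAscii(byte[] data) {
        int total = data.length;
        int control = 0;
        int ascii = 0;
        for (byte raw : data) {
            int b = raw & 0xFF; // treat the byte as unsigned
            if (isControl(b)) {
                control++;
            } else if (b < 0x80) {
                ascii++;
            }
        }
        // Integer arithmetic keeps the percentage comparison exact.
        return total > 0 && control * 100 < total * 2 && ascii * 100 > total * 90;
    }

    public static void main(String[] args) {
        byte[] text = "plain text\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(isMostlyAscii(text));
        System.out.println(isMostlyAscii(new byte[] {0, 1, 2, 3}));
    }
}
```

An empty input fails the check because at least one byte must have been seen.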
looksLikeUTF8
public boolean looksLikeUTF8()
Checks whether the observed byte stream looks like UTF-8 encoded text.
Returns: true if the seen bytes look like UTF-8, false otherwise
Since: Apache Tika 1.3
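This page does not say how the UTF-8 check is performed. As a hedged illustration, here is one way a byte histogram alone could support such a check: in UTF-8, lead bytes 0xC0-0xDF, 0xE0-0xEF and 0xF0-0xF7 announce one, two and three continuation bytes (0x80-0xBF), and bytes 0xF8-0xFF never occur. The class Utf8LookSketch and its exact-match rule are assumptions, not Tika's implementation.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical histogram-based UTF-8 heuristic; not Tika's implementation. */
final class Utf8LookSketch {

    /** Sums histogram entries for byte values in [from, to). */
    static int sum(int[] counts, int from, int to) {
        int n = 0;
        for (int i = from; i < to; i++) {
            n += counts[i];
        }
        return n;
    }

    static boolean looksLikeUtf8(byte[] data) {
        int[] counts = new int[256];
        for (byte raw : data) {
            counts[raw & 0xFF]++;
        }
        // Continuation bytes announced by the lead bytes that were seen.
        int expected = sum(counts, 0xC0, 0xE0)
                + 2 * sum(counts, 0xE0, 0xF0)
                + 3 * sum(counts, 0xF0, 0xF8);
        int continuation = sum(counts, 0x80, 0xC0);
        int invalid = sum(counts, 0xF8, 0x100);
        return data.length > 0 && invalid == 0 && continuation == expected;
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo";
        System.out.println(looksLikeUtf8(s.getBytes(StandardCharsets.UTF_8)));
        System.out.println(looksLikeUtf8(s.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

A histogram discards byte order, so a check of this kind is only a plausibility test. It cannot confirm that each continuation byte follows the right lead byte, and a stream cut off mid-character would fail the exact-match rule used here.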
count
public int count()
Returns the total number of bytes seen so far.
Returns: count of all bytes
count
public int count(int b)
Returns the number of occurrences of the given byte.
Parameters: b - byte
Returns: count of the given byte
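The relationship between addData, count() and count(int b) can be sketched with a plain 256-slot histogram. The class ByteHistogramSketch is a hypothetical stand-in that mirrors the documented signatures. It assumes that only the length bytes starting at offset are counted and that count(int b) uses the low eight bits of its argument.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical byte histogram mirroring the documented signatures; not Tika's source. */
final class ByteHistogramSketch {
    private final int[] counts = new int[256];
    private int total = 0;

    /** Adds buffer[offset] .. buffer[offset + length - 1] to the histogram. */
    void addData(byte[] buffer, int offset, int length) {
        for (int i = offset; i < offset + length; i++) {
            counts[buffer[i] & 0xFF]++; // index by the unsigned byte value
            total++;
        }
    }

    /** Total number of bytes added so far. */
    int count() {
        return total;
    }

    /** Number of occurrences of the given byte value. */
    int count(int b) {
        return counts[b & 0xFF];
    }

    public static void main(String[] args) {
        ByteHistogramSketch stats = new ByteHistogramSketch();
        byte[] buffer = "xabcabx".getBytes(StandardCharsets.US_ASCII);
        stats.addData(buffer, 1, 5); // only "abcab" is counted
        System.out.println(stats.count());    // prints 5
        System.out.println(stats.count('a')); // prints 2
    }
}
```

The same object can be fed repeatedly, for example once per chunk read from a stream, and the counts accumulate across calls.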
countControl
public int countControl()
Counts control characters (i.e. < 0x20, excluding tab, CR, LF, page feed and escape).
This definition of control characters is based on section 4 of the "Content-Type Processing Model" Internet-draft (draft-abarth-mime-sniff-01).
+-------------------------+
| Binary data byte ranges |
+-------------------------+
| 0x00 -- 0x08            |
| 0x0B                    |
| 0x0E -- 0x1A            |
| 0x1C -- 0x1F            |
+-------------------------+
Returns: count of control characters
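The byte ranges in the table translate directly into a predicate. The following sketch uses a hypothetical class, ControlRangeSketch, and counts over a plain byte array rather than Tika's internal histogram.

```java
/** Hypothetical transcription of the "binary data byte ranges" table; not Tika's source. */
final class ControlRangeSketch {

    /** True if b falls in one of the four ranges listed in the table. */
    static boolean isBinaryDataByte(int b) {
        return (b >= 0x00 && b <= 0x08)
                || b == 0x0B
                || (b >= 0x0E && b <= 0x1A)
                || (b >= 0x1C && b <= 0x1F);
    }

    /** Counts the bytes of data that fall in the table's ranges. */
    static int countControl(byte[] data) {
        int n = 0;
        for (byte raw : data) {
            if (isBinaryDataByte(raw & 0xFF)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // NUL, tab, LF, VT, FF, CR, ESC, US, 'A': only NUL, VT and US are in the table.
        byte[] sample = {0x00, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x1B, 0x1F, 0x41};
        System.out.println(countControl(sample)); // prints 3
    }
}
```

The gaps between the ranges are exactly the five excluded characters: tab (0x09), LF (0x0A), page feed (0x0C), CR (0x0D) and escape (0x1B).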
countSafeAscii
public int countSafeAscii()
Counts "safe" (i.e. seven-bit non-control) ASCII characters.
Returns: count of safe ASCII characters
countEightBit
public int countEightBit()
Counts eight bit characters, i.e. bytes with their highest bit set.
Returns: count of eight bit characters
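Taken together, the descriptions of countControl, countSafeAscii and countEightBit suggest that every byte falls into exactly one of three classes. The sketch below makes that reading explicit. ByteClassSketch and its Kind enum are hypothetical names, and the sketch assumes that the five excluded control characters count as "safe" ASCII.

```java
/** Hypothetical three-way byte classification; not Tika's implementation. */
final class ByteClassSketch {

    enum Kind { CONTROL, SAFE_ASCII, EIGHT_BIT }

    static Kind classify(int value) {
        int b = value & 0xFF;
        if ((b & 0x80) != 0) {
            return Kind.EIGHT_BIT; // highest bit set
        }
        boolean excluded = b == 0x09 || b == 0x0A || b == 0x0C || b == 0x0D || b == 0x1B;
        if (b < 0x20 && !excluded) {
            return Kind.CONTROL;
        }
        return Kind.SAFE_ASCII; // seven-bit and not a control character
    }

    /** Returns counts indexed by Kind.ordinal(). */
    static int[] tally(byte[] data) {
        int[] result = new int[Kind.values().length];
        for (byte raw : data) {
            result[classify(raw).ordinal()]++;
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        int[] t = tally(all);
        // One of each byte value: 27 control, 101 safe ASCII, 128 eight-bit.
        System.out.println(t[0] + " " + t[1] + " " + t[2]);
    }
}
```

Under this reading the three counts always add up to the total number of bytes seen.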