Class TextStatistics


  • public class TextStatistics
    extends Object
    Utility class for computing a histogram of the bytes seen in a stream.
    Since:
    Apache Tika 1.2
    • Constructor Detail

      • TextStatistics

        public TextStatistics()
    • Method Detail

      • addData

        public void addData(byte[] buffer,
                            int offset,
                            int length)
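        The class description says only that a histogram of bytes is computed, and addData carries no description of its own. As a minimal sketch, assuming the method tallies the bytes in buffer[offset .. offset + length - 1], such a histogram could be accumulated as follows. ByteHistogramSketch and its fields are hypothetical names used for illustration; this is not the Tika implementation.

```java
// Hypothetical stand-in for illustration; not the Tika implementation.
final class ByteHistogramSketch {
    private final int[] counts = new int[256]; // one slot per possible byte value
    private int total = 0;

    // Tallies the bytes in buffer[offset .. offset + length - 1].
    void addData(byte[] buffer, int offset, int length) {
        for (int i = offset; i < offset + length; i++) {
            counts[buffer[i] & 0xff]++; // mask because Java bytes are signed
            total++;
        }
    }

    // Total number of bytes tallied so far.
    int count() {
        return total;
    }

    // Number of occurrences of the byte value b (only the low 8 bits are used).
    int count(int b) {
        return counts[b & 0xff];
    }
}
```

        Masking with 0xff matters here: without it, a byte such as 0xE9 would be read as a negative index.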
      • isMostlyAscii

        public boolean isMostlyAscii()
        Checks whether at least one byte was seen and whether the bytes seen were mostly plain text (i.e. fewer than 2% control bytes and more than 90% in the ASCII range).
        Returns:
        true if the seen bytes were mostly safe ASCII, false otherwise
        See Also:
        TIKA-483, TIKA-688
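        The two thresholds can be read as integer comparisons on the counters. The sketch below assumes that "ASCII range" refers to the safe ASCII count and that both percentages are taken over the total byte count. MostlyAsciiSketch is a hypothetical name, and the code is not taken from Tika's source.

```java
// Hypothetical illustration of the documented thresholds; not the Tika implementation.
final class MostlyAsciiSketch {
    // True if at least one byte was seen, control bytes are below 2% of the
    // total, and safe ASCII bytes are above 90% of the total.
    static boolean isMostlyAscii(long total, long control, long safeAscii) {
        return total > 0
                && control * 100 < total * 2      // control / total < 2%
                && safeAscii * 100 > total * 90;  // safeAscii / total > 90%
    }
}
```

        Multiplying instead of dividing keeps the comparison exact and avoids a division by zero when no bytes were seen.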
      • looksLikeUTF8

        public boolean looksLikeUTF8()
        Checks whether the observed byte stream looks like UTF-8 encoded text.
        Returns:
        true if the seen bytes look like UTF-8, false otherwise
        Since:
        Apache Tika 1.3
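        The text above does not say how the check is made. One plausible histogram-only heuristic, shown here purely as an assumption, is to compare the number of UTF-8 continuation bytes (0x80 to 0xBF) with the number implied by the lead bytes seen. Utf8HistogramSketch and all of its methods are hypothetical.

```java
// Hypothetical heuristic for illustration; the documented method may work differently.
final class Utf8HistogramSketch {
    // Builds a 256-slot histogram of the given bytes.
    static int[] histogram(byte[] data) {
        int[] h = new int[256];
        for (byte b : data) {
            h[b & 0xff]++;
        }
        return h;
    }

    // Sums the histogram over the byte values from (inclusive) to to (exclusive).
    static int range(int[] h, int from, int to) {
        int sum = 0;
        for (int i = from; i < to; i++) {
            sum += h[i];
        }
        return sum;
    }

    static boolean looksLikeUtf8(int[] h) {
        int lead2 = range(h, 0xC0, 0xE0); // lead bytes of 2-byte sequences
        int lead3 = range(h, 0xE0, 0xF0); // lead bytes of 3-byte sequences
        int lead4 = range(h, 0xF0, 0xF8); // lead bytes of 4-byte sequences
        int expected = lead2 + 2 * lead3 + 3 * lead4;
        int continuation = range(h, 0x80, 0xC0);
        int neverValid = range(h, 0xF8, 0x100); // these values never occur in UTF-8
        int printable = range(h, 0x20, 0x80);
        return printable + lead2 + lead3 + lead4 > 0
                && neverValid == 0
                && continuation == expected;
    }
}
```

        A histogram carries no ordering information, so a check of this kind can only be a heuristic. The exact-match rule in this sketch would also reject a stream cut off in the middle of a multi-byte sequence, which a practical detector would probably want to tolerate.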
      • count

        public int count()
        Returns the total number of bytes seen so far.
        Returns:
        count of all bytes
      • count

        public int count(int b)
        Returns the number of occurrences of the given byte.
        Parameters:
        b - byte
        Returns:
        count of the given byte
      • countControl

        public int countControl()
        Counts control characters (i.e. bytes below 0x20, excluding tab, LF, form feed, CR and escape).

        This definition of control characters is based on section 4 of the "Content-Type Processing Model" Internet-draft (draft-abarth-mime-sniff-01).

         +-------------------------+
         | Binary data byte ranges |
         +-------------------------+
         | 0x00 -- 0x08            |
         | 0x0B                    |
         | 0x0E -- 0x1A            |
         | 0x1C -- 0x1F            |
         +-------------------------+
         
        Returns:
        count of control characters
        See Also:
        TIKA-154
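        The byte ranges in the table translate directly into a predicate. The following is a minimal sketch of that translation; ControlBytesSketch is a hypothetical helper, not part of Tika.

```java
// Hypothetical helper that encodes the "binary data byte ranges" table above.
final class ControlBytesSketch {
    // True if the low 8 bits of b fall in one of the four listed ranges.
    static boolean isControl(int b) {
        b &= 0xff;
        return b <= 0x08                     // 0x00 -- 0x08
                || b == 0x0B                 // 0x0B
                || (b >= 0x0E && b <= 0x1A)  // 0x0E -- 0x1A
                || (b >= 0x1C && b <= 0x1F); // 0x1C -- 0x1F
    }

    // Counts the bytes of data that the table classifies as control characters.
    static int countControl(byte[] data) {
        int n = 0;
        for (byte b : data) {
            if (isControl(b)) {
                n++;
            }
        }
        return n;
    }
}
```

        The five values below 0x20 that the ranges skip are 0x09 (tab), 0x0A (LF), 0x0C (form feed), 0x0D (CR) and 0x1B (escape), which leaves 27 control values in total.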
      • countSafeAscii

        public int countSafeAscii()
        Counts "safe" (i.e. seven-bit non-control) ASCII characters.
        Returns:
        count of safe ASCII characters
        See Also:
        countControl()
      • countEightBit

        public int countEightBit()
        Counts eight-bit characters, i.e. bytes with their highest bit set.
        Returns:
        count of eight-bit characters
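        Taken together, the definitions of countControl(), countSafeAscii() and countEightBit() appear to split every byte value into exactly one of three classes. The sketch below illustrates that reading; ByteClassSketch and its Kind enum are hypothetical, and treating every seven-bit value outside the control table (including 0x7F) as "safe" is an assumption drawn from the wording above.

```java
// Hypothetical classifier illustrating the three counters; not the Tika implementation.
final class ByteClassSketch {
    enum Kind { CONTROL, SAFE_ASCII, EIGHT_BIT }

    static Kind classify(int b) {
        b &= 0xff;
        if ((b & 0x80) != 0) {
            return Kind.EIGHT_BIT; // highest bit set
        }
        boolean control = b <= 0x08
                || b == 0x0B
                || (b >= 0x0E && b <= 0x1A)
                || (b >= 0x1C && b <= 0x1F);
        // Assumption: any seven-bit value outside the control table is "safe".
        return control ? Kind.CONTROL : Kind.SAFE_ASCII;
    }

    // Returns counts indexed by Kind.ordinal(): control, safe ASCII, eight-bit.
    static int[] tally(byte[] data) {
        int[] t = new int[Kind.values().length];
        for (byte b : data) {
            t[classify(b).ordinal()]++;
        }
        return t;
    }
}
```

        Under this reading the three counts always add up to the total number of bytes seen, which makes a convenient sanity check when combining the counters.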