Class TextStatistics

java.lang.Object
org.apache.tika.detect.TextStatistics

public class TextStatistics extends Object
Utility class for computing a histogram of the bytes seen in a stream.
Since:
Apache Tika 1.2
  • Constructor Summary

    Constructors
    Constructor
    Description
    TextStatistics()
     
  • Method Summary

    Modifier and Type
    Method
    Description
    void
    addData(byte[] buffer, int offset, int length)
     
    int
    count()
    Returns the total number of bytes seen so far.
    int
    count(int b)
    Returns the number of occurrences of the given byte.
    int
    countControl()
    Counts control characters (i.e. bytes < 0x20, excluding tab, LF, form feed, CR and escape).
    int
    countEightBit()
    Counts eight-bit characters, i.e. bytes with their highest bit set.
    int
    countSafeAscii()
    Counts "safe" (i.e. seven-bit non-control) ASCII characters.
    boolean
    isMostlyAscii()
    Checks whether at least one byte was seen and the bytes seen were mostly plain text (i.e. fewer than 2% control characters and more than 90% in the ASCII range).
    boolean
    looksLikeUTF8()
    Checks whether the observed byte stream looks like UTF-8 encoded text.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • TextStatistics

      public TextStatistics()
  • Method Details

    • addData

      public void addData(byte[] buffer, int offset, int length)
    • isMostlyAscii

      public boolean isMostlyAscii()
      Checks whether at least one byte was seen and the bytes seen were mostly plain text (i.e. fewer than 2% control characters and more than 90% in the ASCII range).
      Returns:
      true if the seen bytes were mostly safe ASCII, false otherwise
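The two thresholds for isMostlyAscii can be illustrated with a minimal, self-contained sketch. The class MostlyAsciiSketch and its three-argument helper are hypothetical and are not part of Tika; the sketch assumes the percentages are taken relative to the total number of bytes seen and that the "ASCII range" share is the safe ASCII count.

```java
public class MostlyAsciiSketch {

    /**
     * Hypothetical helper (not Tika's code): applies the documented
     * thresholds to three counts taken from a byte histogram.
     */
    public static boolean isMostlyAscii(int total, int control, int safeAscii) {
        // Cross-multiplication keeps the comparison in integer arithmetic:
        // control / total < 2%  and  safeAscii / total > 90%.
        return total > 0
                && (long) control * 100 < (long) total * 2
                && (long) safeAscii * 100 > (long) total * 90;
    }

    public static void main(String[] args) {
        System.out.println(isMostlyAscii(0, 0, 0));     // no bytes seen yet
        System.out.println(isMostlyAscii(100, 1, 95));  // 1% control, 95% safe ASCII
        System.out.println(isMostlyAscii(100, 2, 95));  // control share is not below 2%
        System.out.println(isMostlyAscii(100, 0, 90));  // safe share is not above 90%
    }
}
```

Both comparisons are strict in this sketch, so exactly 2% control bytes or exactly 90% safe ASCII does not qualify.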
    • looksLikeUTF8

      public boolean looksLikeUTF8()
      Checks whether the observed byte stream looks like UTF-8 encoded text.
      Returns:
      true if the seen bytes look like UTF-8, false otherwise
      Since:
      Apache Tika 1.3
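The page does not say how looksLikeUTF8 decides. As an illustration only, the following sketch shows one heuristic that works from a byte histogram alone: compare the number of continuation bytes (0x80 to 0xBF) with the number implied by the multi-byte lead bytes. The class Utf8HistogramSketch and its methods are hypothetical, and the exact-match rule is an assumption of this sketch, not a description of Tika's implementation.

```java
public class Utf8HistogramSketch {

    /** Sums histogram entries for byte values in [from, to). */
    static int sum(int[] counts, int from, int to) {
        int s = 0;
        for (int v = from; v < to; v++) {
            s += counts[v];
        }
        return s;
    }

    /** Hypothetical histogram-only heuristic; it ignores byte order. */
    public static boolean looksLikeUtf8(byte[] data) {
        int[] counts = new int[256];
        for (byte b : data) {
            counts[b & 0xFF]++;  // mask because Java bytes are signed
        }
        int lead2 = sum(counts, 0xC0, 0xE0);  // leads of 2-byte sequences
        int lead3 = sum(counts, 0xE0, 0xF0);  // leads of 3-byte sequences
        int lead4 = sum(counts, 0xF0, 0xF8);  // leads of 4-byte sequences
        int expected = lead2 + 2 * lead3 + 3 * lead4;
        int continuation = sum(counts, 0x80, 0xC0);
        // 0xF8..0xFF never occur in UTF-8, so any such byte rules it out.
        return data.length > 0
                && sum(counts, 0xF8, 0x100) == 0
                && continuation == expected;
    }

    public static void main(String[] args) {
        byte[] utf8 = {'h', (byte) 0xC3, (byte) 0xA9, 'o'};  // e-acute as UTF-8
        byte[] latin1 = {'h', (byte) 0xE9, 'o'};             // e-acute as ISO-8859-1
        System.out.println(looksLikeUtf8(utf8));
        System.out.println(looksLikeUtf8(latin1));
    }
}
```

Because a histogram discards ordering, a check of this kind can only reject input whose byte counts are inconsistent with UTF-8; it cannot prove the input is well-formed. A stream cut off in the middle of a multi-byte sequence would also fail the exact-match rule used here.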
    • count

      public int count()
      Returns the total number of bytes seen so far.
      Returns:
      count of all bytes
    • count

      public int count(int b)
      Returns the number of occurrences of the given byte.
      Parameters:
      b - byte
      Returns:
      count of the given byte
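To make the histogram idea behind addData, count() and count(int b) concrete, here is a minimal stand-alone sketch. ByteHistogramSketch is a hypothetical class written for illustration; it mirrors the three signatures documented above but is not Tika's source, and it assumes that count(int b) takes the unsigned value of the byte.

```java
import java.nio.charset.StandardCharsets;

public class ByteHistogramSketch {

    private final int[] counts = new int[256];  // one slot per byte value
    private int total = 0;

    /** Adds length bytes of buffer, starting at offset, to the histogram. */
    public void addData(byte[] buffer, int offset, int length) {
        for (int i = 0; i < length; i++) {
            // Java bytes are signed, so mask to get an index in 0..255.
            counts[buffer[offset + i] & 0xFF]++;
            total++;
        }
    }

    /** Total number of bytes added so far. */
    public int count() {
        return total;
    }

    /** Number of occurrences of the byte value b (only the low 8 bits are used). */
    public int count(int b) {
        return counts[b & 0xFF];
    }

    public static void main(String[] args) {
        ByteHistogramSketch h = new ByteHistogramSketch();
        byte[] data = "banana".getBytes(StandardCharsets.US_ASCII);
        h.addData(data, 0, data.length);
        System.out.println(h.count());     // 6
        System.out.println(h.count('a'));  // 3
    }
}
```

Since the histogram is cumulative, data can be added in several chunks as a stream is read, and the counts then describe everything seen so far.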
    • countControl

      public int countControl()
      Counts control characters (i.e. bytes < 0x20, excluding tab, LF, form feed, CR and escape).

      This definition of control characters is based on section 4 of the "Content-Type Processing Model" Internet-Draft (draft-abarth-mime-sniff-01).

       +-------------------------+
       | Binary data byte ranges |
       +-------------------------+
       | 0x00 -- 0x08            |
       | 0x0B                    |
       | 0x0E -- 0x1A            |
       | 0x1C -- 0x1F            |
       +-------------------------+
       
      Returns:
      count of control characters
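The byte ranges in the table translate directly into a predicate. The sketch below is illustrative only; ControlByteSketch and its methods are hypothetical names and are not part of Tika.

```java
public class ControlByteSketch {

    /** True if the low 8 bits of b fall in one of the tabulated ranges. */
    public static boolean isControl(int b) {
        int v = b & 0xFF;
        return v <= 0x08                     // 0x00 -- 0x08
                || v == 0x0B                 // 0x0B
                || (v >= 0x0E && v <= 0x1A)  // 0x0E -- 0x1A
                || (v >= 0x1C && v <= 0x1F); // 0x1C -- 0x1F
    }

    /** Counts the bytes in data for which isControl is true. */
    public static int countControl(byte[] data) {
        int n = 0;
        for (byte b : data) {
            if (isControl(b)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // NUL and 0x0B count; tab, LF, ESC and 'A' do not.
        byte[] sample = {0x00, 0x09, 0x0A, 0x0B, 0x1B, 'A'};
        System.out.println(countControl(sample));  // 2
    }
}
```

The gaps between the ranges are exactly the five excluded bytes: 0x09 (tab), 0x0A (LF), 0x0C (form feed), 0x0D (CR) and 0x1B (escape).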
    • countSafeAscii

      public int countSafeAscii()
      Counts "safe" (i.e. seven-bit non-control) ASCII characters.
      Returns:
      count of safe ASCII characters
    • countEightBit

      public int countEightBit()
      Counts eight-bit characters, i.e. bytes with their highest bit set.
      Returns:
      count of eight-bit characters
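The three counting methods suggest a partition of byte values into control, safe ASCII and eight-bit. The following sketch shows that reading; ByteClassSketch and its Kind enum are hypothetical and not part of Tika. It assumes that "safe" means every seven-bit value that is not a control character in the sense of countControl, which would include DEL (0x7F).

```java
public class ByteClassSketch {

    public enum Kind { CONTROL, SAFE_ASCII, EIGHT_BIT }

    /** Classifies one byte under the assumptions stated in the lead-in. */
    public static Kind classify(byte b) {
        int v = b & 0xFF;  // Java bytes are signed; a set high bit means b < 0
        if (v >= 0x80) {
            return Kind.EIGHT_BIT;
        }
        // Tab, LF, form feed, CR and escape are below 0x20 but not counted as control.
        boolean excluded = v == 0x09 || v == 0x0A || v == 0x0C || v == 0x0D || v == 0x1B;
        if (v < 0x20 && !excluded) {
            return Kind.CONTROL;
        }
        return Kind.SAFE_ASCII;
    }

    /** Returns counts indexed by Kind.ordinal(). */
    public static int[] tally(byte[] data) {
        int[] out = new int[Kind.values().length];
        for (byte b : data) {
            out[classify(b).ordinal()]++;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] sample = {0x00, 'A', '\n', (byte) 0xFF};
        int[] t = tally(sample);
        System.out.println("control=" + t[0] + " safe=" + t[1] + " eightBit=" + t[2]);
    }
}
```

Under this reading every byte falls into exactly one category, so the three counts add up to the total number of bytes seen.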