org.apache.tika.detect
Class TextStatistics

java.lang.Object
  extended by org.apache.tika.detect.TextStatistics

public class TextStatistics
extends Object

Utility class for computing a histogram of the bytes seen in a stream.

Since:
Apache Tika 1.2

Constructor Summary
TextStatistics()
           
 
Method Summary
 void addData(byte[] buffer, int offset, int length)
          Adds length bytes from buffer, starting at offset, to the histogram.
 int count()
          Returns the total number of bytes seen so far.
 int count(int b)
          Returns the number of occurrences of the given byte.
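 int countControl()
          Counts control characters (i.e. < 0x20, excluding tab, CR, LF, form feed and escape).
 int countEightBit()
          Counts eight-bit characters, i.e. bytes with their highest bit set.
 int countSafeAscii()
          Counts "safe" (i.e. seven-bit non-control) ASCII characters.
 boolean isMostlyAscii()
          Checks whether at least one byte was seen and whether the bytes seen were mostly plain text (i.e. < 2% control, > 90% ASCII range).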
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TextStatistics

public TextStatistics()
Method Detail

addData

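public void addData(byte[] buffer,
                    int offset,
                    int length)
Adds length bytes from buffer, starting at offset, to the histogram.

Parameters:
buffer - array holding the bytes to add
offset - index of the first byte to add
length - number of bytes to add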

isMostlyAscii

public boolean isMostlyAscii()
Checks whether at least one byte was seen and whether the bytes seen were mostly plain text (i.e. fewer than 2% control characters and more than 90% in the ASCII range).

Returns:
true if the seen bytes were mostly safe ASCII, false otherwise
See Also:
TIKA-483, TIKA-688
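
The two thresholds can be read as integer comparisons on the counters. The following is a minimal sketch, not Tika's source: the class name MostlyAsciiSketch and its parameters are hypothetical, and it assumes that both percentages are taken against the total byte count, with control and safeAscii standing for the values returned by countControl() and countSafeAscii().

```java
// Hypothetical helper illustrating the documented thresholds; not part of Tika.
class MostlyAsciiSketch {

    /**
     * @param total     total number of bytes seen
     * @param control   number of control bytes (assumed: countControl())
     * @param safeAscii number of safe ASCII bytes (assumed: countSafeAscii())
     */
    static boolean isMostlyAscii(int total, int control, int safeAscii) {
        // Cast to long so the multiplications cannot overflow for large counts.
        return total > 0
                && (long) control * 100 < (long) total * 2      // < 2% control
                && (long) safeAscii * 100 > (long) total * 90;  // > 90% ASCII range
    }

    public static void main(String[] args) {
        System.out.println(isMostlyAscii(100, 1, 95)); // true
        System.out.println(isMostlyAscii(100, 2, 95)); // false: exactly 2% control
        System.out.println(isMostlyAscii(0, 0, 0));    // false: no bytes seen
    }
}
```

Both comparisons are strict, so a stream with exactly 2% control bytes or exactly 90% safe ASCII bytes is rejected under this reading.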

count

public int count()
Returns the total number of bytes seen so far.

Returns:
count of all bytes

count

public int count(int b)
Returns the number of occurrences of the given byte.

Parameters:
b - the byte value to look up
Returns:
count of the given byte
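
A byte histogram of this kind can be sketched with a 256-entry counter array. The class below is a hypothetical stand-in (ByteHistogramSketch is not a Tika class) that mirrors the documented method signatures; it assumes that bytes are treated as unsigned values when used as array indexes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in mirroring the documented signatures; not Tika's source.
class ByteHistogramSketch {
    private final int[] counts = new int[256];
    private int total = 0;

    /** Adds length bytes of buffer, starting at offset, to the histogram. */
    void addData(byte[] buffer, int offset, int length) {
        for (int i = 0; i < length; i++) {
            // Java bytes are signed; mask to get an index in 0..255.
            counts[buffer[offset + i] & 0xff]++;
            total++;
        }
    }

    /** Total number of bytes added so far. */
    int count() {
        return total;
    }

    /** Number of occurrences of the given byte value. */
    int count(int b) {
        return counts[b & 0xff];
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes(StandardCharsets.US_ASCII);
        ByteHistogramSketch h = new ByteHistogramSketch();
        h.addData(data, 0, data.length);
        System.out.println(h.count());    // 5
        System.out.println(h.count('l')); // 2
    }
}
```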

countControl

public int countControl()
Counts control characters (i.e. bytes below 0x20, excluding tab, LF, form feed, CR and escape).

This definition of control characters is based on section 4 of the "Content-Type Processing Model" Internet-draft (draft-abarth-mime-sniff-01).

 +-------------------------+
 | Binary data byte ranges |
 +-------------------------+
 | 0x00 -- 0x08            |
 | 0x0B                    |
 | 0x0E -- 0x1A            |
 | 0x1C -- 0x1F            |
 +-------------------------+
 

Returns:
count of control characters
See Also:
TIKA-154
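
The byte ranges in the table above translate directly into a predicate on a single byte value. This is a minimal sketch with hypothetical names (ControlByteSketch, isControl), assuming the input is an unsigned byte value in the range 0 to 255.

```java
// Hypothetical predicate derived from the byte-range table; not Tika's source.
class ControlByteSketch {

    /** Returns true if b (0..255) falls in one of the binary data byte ranges. */
    static boolean isControl(int b) {
        return (b >= 0x00 && b <= 0x08)
                || b == 0x0B
                || (b >= 0x0E && b <= 0x1A)
                || (b >= 0x1C && b <= 0x1F);
    }

    public static void main(String[] args) {
        System.out.println(isControl(0x00)); // true  (NUL)
        System.out.println(isControl('\t')); // false (tab, 0x09)
        System.out.println(isControl(0x1B)); // false (escape)
        System.out.println(isControl('A'));  // false (printable ASCII)
    }
}
```

The gaps between the ranges are the five exempt bytes: 0x09 (tab), 0x0A (LF), 0x0C (form feed), 0x0D (CR) and 0x1B (escape).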

countSafeAscii

public int countSafeAscii()
Counts "safe" (i.e. seven-bit non-control) ASCII characters.

Returns:
count of safe ASCII characters
See Also:
countControl()

countEightBit

public int countEightBit()
Counts eight-bit characters, i.e. bytes with their highest bit set.

Returns:
count of eight-bit characters
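
Taken together, the definitions of countControl(), countSafeAscii() and countEightBit() suggest that every byte falls into exactly one of three classes. The sketch below illustrates that reading; the ByteClassSketch class and its Kind enum are hypothetical, and treating 0x7F as safe ASCII is an assumption that follows from the "< 0x20" definition of control characters rather than anything stated explicitly here.

```java
// Hypothetical classifier based on the documented definitions; not Tika's source.
class ByteClassSketch {
    enum Kind { CONTROL, SAFE_ASCII, EIGHT_BIT }

    static Kind classify(byte raw) {
        int b = raw & 0xff;          // view the signed Java byte as 0..255
        if ((b & 0x80) != 0) {
            return Kind.EIGHT_BIT;   // highest bit set
        }
        // Tab, LF, form feed, CR and escape are exempt from the control class.
        boolean exempt = b == 0x09 || b == 0x0A || b == 0x0C || b == 0x0D || b == 0x1B;
        if (b < 0x20 && !exempt) {
            return Kind.CONTROL;
        }
        return Kind.SAFE_ASCII;      // seven-bit, non-control
    }

    public static void main(String[] args) {
        byte[] data = { 'c', 'a', 'f', (byte) 0xE9, '\n', 0x00 };
        int[] tally = new int[Kind.values().length];
        for (byte b : data) {
            tally[classify(b).ordinal()]++;
        }
        // c, a, f and LF are safe ASCII; 0xE9 is eight-bit; NUL is control.
        System.out.println("control=" + tally[0] + " safe=" + tally[1] + " eightBit=" + tally[2]);
    }
}
```

Under this reading the three counts always add up to the total number of bytes seen.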


Copyright © 2007-2012 The Apache Software Foundation. All Rights Reserved.