org.apache.tika.parser.txt.CharsetMatch

All Implemented Interfaces: Comparable<CharsetMatch>

public class CharsetMatch extends Object implements Comparable<CharsetMatch>

This class represents a charset that has been identified by a CharsetDetector as a possible encoding for a set of input data. From an instance of this class, you can ask for a confidence level in the charset identification, or for a Java Reader or String that gives access to the original byte data in Unicode form.

Instances of this class are created only by CharsetDetectors.

Note: this class has a natural ordering that is inconsistent with equals. The natural ordering is based on the match confidence value.

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| int | compareTo(CharsetMatch other) | Compare to other CharsetMatch objects. |
| boolean | equals(Object o) | Compares this CharsetMatch to another based on confidence value. |
| int | getConfidence() | Get an indication of the confidence in the charset detected. |
| String | getLanguage() | Get the ISO code for the language of the detected charset. |
| String | getName() | Get the name of the detected charset. |
| String | getNormalizedName() | Strips suffixes such as _rtl and _ltr from charset names so that they can be used as a charset. |
| Reader | getReader() | Create a java.io.Reader for reading the Unicode character data corresponding to the original byte data supplied to the Charset detect operation. |
| String | getString() | Create a Java String from Unicode character data corresponding to the original byte data supplied to the Charset detect operation. |
| String | getString(int maxLength) | Create a Java String from Unicode character data corresponding to the original byte data supplied to the Charset detect operation. |
| int | hashCode() | Generates a hash code based on the confidence value. |
| String | toString() | |

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

Method Details
- getReader
  
  public Reader getReader()
  
  Create a java.io.Reader for reading the Unicode character data corresponding to the original byte data supplied to the Charset detect operation.
  CAUTION: if the source of the byte data was an InputStream, a Reader can be created for only one matching charset using this method. If more than one charset needs to be tried, the caller will need to reset the InputStream and create InputStreamReaders itself, based on the charset name.
  
  Returns:
  
  the Reader for the Unicode character data.
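
The caution above, about resetting the InputStream and creating an InputStreamReader per charset name, can be sketched with the standard library alone. This is a minimal illustration and not Tika code: `MultiCharsetReadSketch` and `decodeWithEach` are hypothetical names, and the sketch assumes the stream supports mark/reset and that `readLimit` covers all bytes that will be consumed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration; not part of Tika.
final class MultiCharsetReadSketch {

    // Decodes the same stream once per candidate charset name by marking the
    // stream up front and resetting it before every pass.
    static List<String> decodeWithEach(InputStream in, int readLimit, String... charsetNames) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        List<String> results = new ArrayList<>();
        in.mark(readLimit);
        try {
            for (String name : charsetNames) {
                in.reset(); // rewind to the marked position
                // The Reader is deliberately not closed: closing it would
                // also close the shared underlying stream.
                Reader reader = new InputStreamReader(in, name);
                StringBuilder text = new StringBuilder();
                char[] buffer = new char[1024];
                int n;
                while ((n = reader.read(buffer)) != -1) {
                    text.append(buffer, 0, n);
                }
                results.add(text.toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return results;
    }

    public static void main(String[] args) {
        byte[] data = {0x63, 0x61, 0x66, (byte) 0xE9}; // "café" encoded as ISO-8859-1
        InputStream in = new ByteArrayInputStream(data);
        for (String decoded : decodeWithEach(in, data.length, "ISO-8859-1", "UTF-8")) {
            System.out.println(decoded);
        }
    }
}
```

A stream that does not support mark/reset, such as a raw FileInputStream, can be wrapped in a java.io.BufferedInputStream first.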
- getString
  
  public String getString() throws IOException
  
  Create a Java String from Unicode character data corresponding to the original byte data supplied to the Charset detect operation.
  
  Returns:
  
  a String created from the converted input data.
  
  Throws:
  
  IOException
- getString
  
  public String getString(int maxLength) throws IOException
  
  Create a Java String from Unicode character data corresponding to the original byte data supplied to the Charset detect operation. The length of the returned string is limited to the specified size; the string will be truncated to this length if necessary. A limit value of zero or less is ignored and treated as no limit.
  
  Parameters:
  
  maxLength - The maximum length of the String to be created when the source of the data is an input stream, or -1 for unlimited length.
  
  Returns:
  
  a String created from the converted input data.
  
  Throws:
  
  IOException
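
The length rule described for getString(int maxLength), where a limit of zero or less means no limit, can be illustrated with a small stand-in. This is a sketch under stated assumptions and not Tika's implementation: `BoundedDecodeSketch.decode` is a hypothetical helper that decodes a complete byte array and then truncates the result.

```java
import java.nio.charset.Charset;

// Hypothetical helper class for illustration; not part of Tika.
final class BoundedDecodeSketch {

    // Decodes all bytes with the named charset, then applies the limit.
    // The limit counts UTF-16 chars, so a cut could split a surrogate pair.
    static String decode(byte[] data, String charsetName, int maxLength) {
        String text = new String(data, Charset.forName(charsetName));
        if (maxLength <= 0 || text.length() <= maxLength) {
            return text; // zero or negative is treated as "no limit"
        }
        return text.substring(0, maxLength);
    }
}
```

For example, `BoundedDecodeSketch.decode(bytes, "US-ASCII", 5)` keeps at most five characters, while passing 0 or -1 returns the whole decoded text.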
- getNormalizedName
  
  public String getNormalizedName()
  
  Strips suffixes such as _rtl and _ltr from charset names so that they can be used as a charset.
  
  Returns:
  
  the normalized charset name
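
The kind of suffix stripping described for getNormalizedName can be sketched as follows. This is a minimal illustration and not the method's actual implementation: `stripDirectionSuffix` is a hypothetical helper, it handles only the two suffixes named above, and detector names such as IBM424_rtl are used purely as sample inputs.

```java
// Hypothetical helper class for illustration; not part of Tika.
final class CharsetNameSketch {

    // Removes a trailing "_rtl" or "_ltr" direction marker, if present.
    static String stripDirectionSuffix(String detectedName) {
        if (detectedName.endsWith("_rtl") || detectedName.endsWith("_ltr")) {
            return detectedName.substring(0, detectedName.length() - 4);
        }
        return detectedName;
    }
}
```

Whether the stripped name is available on a given runtime is a separate question; `java.nio.charset.Charset.isSupported(name)` can be used to check before calling `Charset.forName(name)`.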
- getConfidence
  
  public int getConfidence()
  
  Get an indication of the confidence in the charset detected. Confidence values range from 0-100, with larger numbers indicating a better match of the input data to the characteristics of the charset.
  
  Returns:
  
  the confidence in the charset match
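
One common use of a 0-100 confidence value is to accept a detected charset only above some threshold and otherwise fall back to a default. The sketch below assumes the name and confidence have already been read from a match; `ConfidenceThresholdSketch.choose` and its parameters are hypothetical, and the threshold itself is an application choice rather than anything defined by this class.

```java
import java.nio.charset.Charset;

// Hypothetical helper class for illustration; not part of Tika.
final class ConfidenceThresholdSketch {

    // Returns the detected charset when the confidence reaches the threshold
    // and the runtime supports the name; otherwise returns the fallback.
    static Charset choose(String detectedName, int confidence, int minConfidence, Charset fallback) {
        if (confidence < 0 || confidence > 100) {
            throw new IllegalArgumentException("confidence must be within 0-100");
        }
        if (detectedName != null && confidence >= minConfidence && Charset.isSupported(detectedName)) {
            return Charset.forName(detectedName);
        }
        return fallback;
    }
}
```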
- getName
  
  public String getName()
  
  Get the name of the detected charset. The name will be one that can be used with other APIs on the platform that accept charset names. It is the "Canonical name" as defined by the class java.nio.charset.Charset; for charsets that are registered with the IANA charset registry, this is the MIME-preferred registered name.
  
  Returns:
  
  The name of the charset.
  
  See Also:
  
  Charset
  
  InputStreamReader
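
To make the notion of a canonical name concrete, the standard library can resolve an alias to the canonical name that java.nio.charset.Charset reports. This sketch uses only java.nio.charset.Charset; `CanonicalNameSketch` is a hypothetical wrapper, and it says nothing about which names a detector actually returns.

```java
import java.nio.charset.Charset;

// Hypothetical helper class for illustration; not part of Tika.
final class CanonicalNameSketch {

    // Resolves a charset name or alias to its canonical name.
    // Throws UnsupportedCharsetException if the runtime does not know the name.
    static String canonicalName(String nameOrAlias) {
        return Charset.forName(nameOrAlias).name();
    }
}
```

A name obtained this way can be passed to any API that takes a charset name, for example the `InputStreamReader(InputStream, String)` constructor.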
- getLanguage
  
  public String getLanguage()
  
  Get the ISO code for the language of the detected charset.
  
  Returns:
  
  The ISO code for the language or null if the language cannot be determined.
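
Because the language code may be null, callers need a null check before using it. A minimal sketch, assuming the code is a two-letter ISO 639 language code such as "ja"; `LanguageCodeSketch.displayLanguage` is a hypothetical helper and "unknown" is an arbitrary placeholder.

```java
import java.util.Locale;

// Hypothetical helper class for illustration; not part of Tika.
final class LanguageCodeSketch {

    // Maps an ISO language code to an English display name,
    // or to "unknown" when no language was determined.
    static String displayLanguage(String isoCode) {
        if (isoCode == null) {
            return "unknown";
        }
        return Locale.forLanguageTag(isoCode).getDisplayLanguage(Locale.ENGLISH);
    }
}
```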
- compareTo
  
  public int compareTo(CharsetMatch other)
  
  Compare to other CharsetMatch objects. Comparison is based on the match confidence value, which allows CharsetDetector.detectAll() to order its results.
  
  Specified by:
  
  compareTo in interface Comparable<CharsetMatch>
  
  Parameters:
  
  other - the CharsetMatch object to compare against.
  
  Returns:
  
  a negative integer, zero, or a positive integer as the confidence level of this CharsetMatch is less than, equal to, or greater than that of the argument.
  
  Throws:
  
  ClassCastException - if the argument is not a CharsetMatch.
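
The effect of a confidence-based natural ordering can be shown with a stand-in type. `Candidate` below is a hypothetical class that mirrors only the ordering rule described above; it is not CharsetMatch, whose instances are created only by a CharsetDetector.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for illustration; not part of Tika.
final class ConfidenceOrderingSketch {

    static final class Candidate implements Comparable<Candidate> {
        final String name;
        final int confidence;

        Candidate(String name, int confidence) {
            this.name = name;
            this.confidence = confidence;
        }

        // Natural ordering: lower confidence sorts first.
        @Override
        public int compareTo(Candidate other) {
            return Integer.compare(confidence, other.confidence);
        }
    }

    // Returns a copy sorted with the most confident candidate first.
    static List<Candidate> bestFirst(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Collections.reverseOrder());
        return sorted;
    }
}
```

Reversing the natural ordering puts the highest confidence at index 0, which is the usual shape for a ranked list of candidates.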
- equals
  
  public boolean equals(Object o)
  
  Compares this CharsetMatch to another based on confidence value.
  
  Overrides:
  
  equals in class Object
  
  Parameters:
  
  o - the CharsetMatch object to compare against
  
  Returns:
  
  true if equal
- hashCode
  
  public int hashCode()
  
  Generates a hash code based on the confidence value.
  
  Overrides:
  
  hashCode in class Object
  
  Returns:
  
  the hashCode
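
When equals and hashCode depend only on a confidence value, hash-based collections treat two different charsets with the same confidence as one element. The sketch below demonstrates this with a hypothetical stand-in, `Candidate`, under the assumption that equality uses nothing but the confidence, as the descriptions above state; it is not CharsetMatch itself.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for illustration; not part of Tika.
final class ConfidenceEqualitySketch {

    static final class Candidate {
        final String name;
        final int confidence;

        Candidate(String name, int confidence) {
            this.name = name;
            this.confidence = confidence;
        }

        // Equality ignores the name and looks only at the confidence.
        @Override
        public boolean equals(Object o) {
            return o instanceof Candidate && ((Candidate) o).confidence == confidence;
        }

        @Override
        public int hashCode() {
            return confidence;
        }
    }

    // Counts how many elements survive insertion into a HashSet.
    static int distinctCount(Candidate... candidates) {
        Set<Candidate> set = new HashSet<>(Arrays.asList(candidates));
        return set.size();
    }
}
```

Under that assumption, a List is the safer container for matches, since a Set or a Map key would silently drop candidates that share a confidence value.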
- toString
  
  public String toString()
  
  Overrides:
  
  toString in class Object
