Interface TikaCoreProperties


public interface TikaCoreProperties
Contains a core set of basic Tika metadata properties, which all parsers will attempt to supply (where the file format permits). These are all defined in terms of other standard namespaces.

Users of Tika who want consistent metadata across file formats can rely on these properties: where present, they carry the same semantic meaning regardless of file format. For example, whether one format calls a field Title, another Long-Title and another Long-Name, if they all mean the same thing as defined by DublinCore.TITLE, they will all be reported under that property.

For now, most of these properties are composites that include the deprecated non-prefixed String properties from the Metadata class. In Tika 2.0, most of these will revert to simple assignments.
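
As a rough illustration of the idea of consistent keys, the following sketch folds format-specific title names into one canonical key. It is not Tika code: the class TitleNormalizer, the alias set, and the canonical key string "dc:title" are assumptions made for the example, so check DublinCore.TITLE for the real property name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class TitleNormalizer {
    // Assumed canonical key, used here only for illustration.
    static final String CANONICAL_TITLE = "dc:title";

    // Format-specific names taken from the description above (hypothetical inputs).
    static final Set<String> TITLE_ALIASES =
            new HashSet<>(Arrays.asList("Title", "Long-Title", "Long-Name"));

    // Returns a copy of raw in which every title alias is stored under the
    // canonical key. If several aliases are present, the first one wins.
    static Map<String, String> normalize(Map<String, String> raw) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : raw.entrySet()) {
            String key = TITLE_ALIASES.contains(e.getKey()) ? CANONICAL_TITLE : e.getKey();
            out.putIfAbsent(key, e.getValue());
        }
        return out;
    }
}
```

In real code the lookup would go through the Metadata object and the TITLE property rather than a plain Map.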

Since:
Apache Tika 1.2
  • Field Details

    • NAMESPACE_PREFIX_DELIMITER

      static final String NAMESPACE_PREFIX_DELIMITER
      The common delimiter used between the namespace abbreviation and the property name.
    • TIKA_META_PREFIX

      static final String TIKA_META_PREFIX
      Use this to prefix metadata properties that store information about the parsing process. Users should be able to distinguish between metadata that was contained within the document and metadata about the parsing process.
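
      One way to act on this distinction is to split a flat key/value view of the metadata by key prefix. The sketch below is an assumption-laden illustration, not Tika code: MetadataPartition and its methods are hypothetical, and the prefix is passed in as a parameter so that the sketch does not assume its literal value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataPartition {
    // Returns only the entries whose key starts with the given prefix,
    // i.e. metadata about the parsing process.
    static Map<String, String> withPrefix(Map<String, String> all, String prefix) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : all.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    // Returns the remaining entries, i.e. metadata that came from the document.
    static Map<String, String> withoutPrefix(Map<String, String> all, String prefix) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : all.entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }
}
```

      A caller would pass TIKA_META_PREFIX as the prefix argument.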
    • EMBEDDED_DEPTH

      static final Property EMBEDDED_DEPTH
    • EMBEDDED_RESOURCE_PATH

      static final Property EMBEDDED_RESOURCE_PATH
      This tracks the embedded file paths based on the name of embedded files where available.

      This field should be treated with great care and should NOT be used to create a directory structure for writing out attachments, because there may be path collisions, illegal characters or other mayhem.

      For a more robust path, see EMBEDDED_ID_PATH.
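
      To make the warning concrete, the sketch below shows the kind of check a caller would need before trusting any name-based path. It only tests whether a path stays inside a chosen output directory; it does not address name collisions. SafeAttachmentPath is a hypothetical helper, not part of Tika.

```java
import java.nio.file.InvalidPathException;
import java.nio.file.Path;

public class SafeAttachmentPath {
    // Returns true only if untrustedPath, resolved against baseDir,
    // still points strictly inside baseDir after normalization.
    static boolean staysInside(Path baseDir, String untrustedPath) {
        Path base = baseDir.toAbsolutePath().normalize();
        Path candidate;
        try {
            // An absolute untrustedPath replaces base entirely here,
            // which the startsWith check below then rejects.
            candidate = base.resolve(untrustedPath).normalize();
        } catch (InvalidPathException e) {
            return false; // illegal characters for this file system
        }
        return candidate.startsWith(base) && !candidate.equals(base);
    }
}
```

      Even when this check passes, two embedded files can still map to the same path, which is one reason generated names are the safer choice for writing attachments.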

    • FINAL_EMBEDDED_RESOURCE_PATH

      static final Property FINAL_EMBEDDED_RESOURCE_PATH
      This is calculated in RecursiveParserWrapperHandler. It differs from EMBEDDED_RESOURCE_PATH in that it is calculated at the end of the full parse of a file. EMBEDDED_RESOURCE_PATH is calculated during the parse, and, for some parsers, an embedded file's name isn't known until after its child files have been parsed.

      Note that the unknown file count may differ from that in EMBEDDED_RESOURCE_PATH because there should be fewer unknown files when this is calculated. More simply, there is no connection between "embedded-1" in this field and "embedded-1" in EMBEDDED_RESOURCE_PATH.

      This field should be treated with great care and should NOT be used to create a directory structure for writing out attachments, because there may be path collisions, illegal characters or other mayhem.

      For a more robust path, see EMBEDDED_ID_PATH.

    • EMBEDDED_ID_PATH

      static final Property EMBEDDED_ID_PATH
      This tracks the embedded file paths based on the embedded file's EMBEDDED_ID.
    • EMBEDDED_ID

      static final Property EMBEDDED_ID
      This is a 1-indexed counter for embedded files, used by the RecursiveParserWrapper.
    • PARSE_TIME_MILLIS

      static final Property PARSE_TIME_MILLIS
    • TIKA_CONTENT_HANDLER

      static final Property TIKA_CONTENT_HANDLER
      Simple class name of the content handler
    • TIKA_CONTENT

      static final Property TIKA_CONTENT
    • TIKA_META_EXCEPTION_PREFIX

      static final String TIKA_META_EXCEPTION_PREFIX
      Use this to store parse exception information in the Metadata object.
    • TIKA_META_WARN_PREFIX

      static final String TIKA_META_WARN_PREFIX
      Use this to store warnings that happened during the parse.
    • CONTAINER_EXCEPTION

      static final Property CONTAINER_EXCEPTION
    • EMBEDDED_EXCEPTION

      static final Property EMBEDDED_EXCEPTION
    • EMBEDDED_BYTES_EXCEPTION

      static final Property EMBEDDED_BYTES_EXCEPTION
    • EMBEDDED_WARNING

      static final Property EMBEDDED_WARNING
    • WRITE_LIMIT_REACHED

      static final Property WRITE_LIMIT_REACHED
    • TIKA_META_EXCEPTION_WARNING

      static final Property TIKA_META_EXCEPTION_WARNING
      Use this to store exceptions caught during a parse that are non-fatal, e.g. if a parser is in lenient mode and more content can be extracted if we ignore an exception thrown by a dependency.
    • TRUNCATED_METADATA

      static final Property TRUNCATED_METADATA
      This means that metadata keys or metadata values were truncated. If there is an "include" filter, this should not be set if a field is not in the "include" set.
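
      The interaction between truncation and an "include" filter can be sketched as follows. This is a simplified illustration under stated assumptions, not Tika's implementation: TruncationSketch, Result and limit are hypothetical names, only values (not keys) are truncated, and a null include set stands for "no filter".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class TruncationSketch {
    // Holds the filtered metadata and whether any kept value was shortened.
    static final class Result {
        final Map<String, String> metadata;
        final boolean truncated;

        Result(Map<String, String> metadata, boolean truncated) {
            this.metadata = metadata;
            this.truncated = truncated;
        }
    }

    // include == null means there is no "include" filter.
    static Result limit(Map<String, String> raw, Set<String> include, int maxValueLength) {
        Map<String, String> kept = new LinkedHashMap<>();
        boolean truncated = false;
        for (Map.Entry<String, String> e : raw.entrySet()) {
            if (include != null && !include.contains(e.getKey())) {
                continue; // dropped by the filter, so it must not set the flag
            }
            String value = e.getValue();
            if (value.length() > maxValueLength) {
                value = value.substring(0, maxValueLength);
                truncated = true;
            }
            kept.put(e.getKey(), value);
        }
        return new Result(kept, truncated);
    }
}
```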
    • TIKA_META_EXCEPTION_EMBEDDED_STREAM

      static final Property TIKA_META_EXCEPTION_EMBEDDED_STREAM
      Use this to store exceptions caught while trying to read the stream of an embedded resource. Do not use this if there is a parse exception on the embedded resource.
    • TIKA_PARSED_BY

      static final Property TIKA_PARSED_BY
    • TIKA_PARSED_BY_FULL_SET

      static final Property TIKA_PARSED_BY_FULL_SET
      Use this to store a record of all parsers that touched a given file in the container file's metadata.
    • TIKA_DETECTED_LANGUAGE

      static final Property TIKA_DETECTED_LANGUAGE
    • TIKA_DETECTED_LANGUAGE_CONFIDENCE

      static final Property TIKA_DETECTED_LANGUAGE_CONFIDENCE
    • TIKA_DETECTED_LANGUAGE_CONFIDENCE_RAW

      static final Property TIKA_DETECTED_LANGUAGE_CONFIDENCE_RAW
    • RESOURCE_NAME_KEY

      static final String RESOURCE_NAME_KEY
    • PROTECTED

      static final String PROTECTED
    • EMBEDDED_RELATIONSHIP_ID

      static final String EMBEDDED_RELATIONSHIP_ID
    • EMBEDDED_STORAGE_CLASS_ID

      static final String EMBEDDED_STORAGE_CLASS_ID
    • EMBEDDED_RESOURCE_TYPE_KEY

      static final String EMBEDDED_RESOURCE_TYPE_KEY
    • ORIGINAL_RESOURCE_NAME

      static final Property ORIGINAL_RESOURCE_NAME
      Some file formats can store information about their original file name/location or about their attachment's original file name/location within the file.
    • SOURCE_PATH

      static final Property SOURCE_PATH
      This should be used to store the path (relative or full) of the source file, including the file name, e.g. doc/path/to/my_pdf.pdf

      This can also be used for a primary key within a database.

    • CONTENT_TYPE_HINT

      static final Property CONTENT_TYPE_HINT
      This is currently used to identify a Content-Type that may be included within a document, such as in HTML documents, or the value might come from outside the document. This information may be faulty and should be treated only as a hint.
    • CONTENT_TYPE_USER_OVERRIDE

      static final Property CONTENT_TYPE_USER_OVERRIDE
      This is used by users to override detection with the override detector.
    • CONTENT_TYPE_PARSER_OVERRIDE

      static final Property CONTENT_TYPE_PARSER_OVERRIDE
      This is used by parsers to override detection of embedded resources with the override detector.
    • FORMAT

      static final Property FORMAT
    • IDENTIFIER

      static final Property IDENTIFIER
    • CONTRIBUTOR

      static final Property CONTRIBUTOR
    • COVERAGE

      static final Property COVERAGE
    • CREATOR

      static final Property CREATOR
    • MODIFIER

      static final Property MODIFIER
    • CREATOR_TOOL

      static final Property CREATOR_TOOL
    • LANGUAGE

      static final Property LANGUAGE
    • PUBLISHER

      static final Property PUBLISHER
    • RELATION

      static final Property RELATION
    • RIGHTS

      static final Property RIGHTS
    • SOURCE

      static final Property SOURCE
    • TYPE

      static final Property TYPE
    • TITLE

      static final Property TITLE
    • DESCRIPTION

      static final Property DESCRIPTION
    • SUBJECT

      static final Property SUBJECT
      DublinCore.SUBJECT; should include both subject and keywords if a document format has both. See also Office.KEYWORDS and OfficeOpenXMLCore.SUBJECT.
    • CREATED

      static final Property CREATED
    • MODIFIED

      static final Property MODIFIED
    • METADATA_DATE

      static final Property METADATA_DATE
    • LATITUDE

      static final Property LATITUDE
    • LONGITUDE

      static final Property LONGITUDE
    • ALTITUDE

      static final Property ALTITUDE
    • RATING

      static final Property RATING
    • NUM_IMAGES

      static final Property NUM_IMAGES
      This is the number of images (as in a multi-frame gif) returned by Java's ImageReader.getNumImages(boolean). See the javadocs for known limitations.
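
      For reference, the underlying JDK call can be exercised directly with javax.imageio. The sketch below is a standalone example, not Tika's parser code; FrameCounter and countImages are hypothetical names, and returning -1 when no reader is found is a choice made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

public class FrameCounter {
    // Returns the number of images reported by the first matching ImageReader,
    // or -1 if no reader recognizes the data.
    static int countImages(byte[] imageBytes) throws IOException {
        try (ImageInputStream in =
                     ImageIO.createImageInputStream(new ByteArrayInputStream(imageBytes))) {
            if (in == null) {
                return -1;
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                return -1;
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                // allowSearch = true lets the reader scan the stream to count frames.
                return reader.getNumImages(true);
            } finally {
                reader.dispose();
            }
        }
    }
}
```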
    • COMMENTS

      static final Property COMMENTS
    • EMBEDDED_RESOURCE_TYPE

      static final Property EMBEDDED_RESOURCE_TYPE
      Embedded resource type property
    • HAS_SIGNATURE

      static final Property HAS_SIGNATURE
    • SIGNATURE_NAME

      static final Property SIGNATURE_NAME
    • SIGNATURE_DATE

      static final Property SIGNATURE_DATE
    • SIGNATURE_LOCATION

      static final Property SIGNATURE_LOCATION
    • SIGNATURE_REASON

      static final Property SIGNATURE_REASON
    • SIGNATURE_FILTER

      static final Property SIGNATURE_FILTER
    • SIGNATURE_CONTACT_INFO

      static final Property SIGNATURE_CONTACT_INFO
    • IS_ENCRYPTED

      static final Property IS_ENCRYPTED
    • DETECTED_ENCODING

      static final Property DETECTED_ENCODING
      When an EncodingDetector detects an encoding, the encoding should be stored in this field. This is different from HttpHeaders.CONTENT_ENCODING because that is what a parser chooses to use for processing a file. If an EncodingDetector returns "null", a parser may choose to use a default encoding. We want to differentiate between a parser using a default encoding and the output of an EncodingDetector.
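
      The distinction between "what the detector found" and "what the parser used" can be sketched with plain JDK types. This is an illustration only: EncodingChoice and choose are hypothetical, and the fallback logic is an assumption about how a parser might behave, not a description of any particular Tika parser.

```java
import java.nio.charset.Charset;

public class EncodingChoice {
    // What the detector reported; null if it could not decide.
    final Charset detected;
    // What the parser actually used to decode the content.
    final Charset used;

    private EncodingChoice(Charset detected, Charset used) {
        this.detected = detected;
        this.used = used;
    }

    // Falls back to parserDefault when the detector returned null, while
    // keeping the detector's (absent) answer separate from the charset used.
    static EncodingChoice choose(Charset detectorResult, Charset parserDefault) {
        Charset used = detectorResult != null ? detectorResult : parserDefault;
        return new EncodingChoice(detectorResult, used);
    }
}
```

      In this sketch, only a non-null detected value would be recorded under DETECTED_ENCODING, whereas used corresponds to the encoding the parser reports for processing.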
    • ENCODING_DETECTOR

      static final Property ENCODING_DETECTOR
      This should be the simple class name of the EncodingDetector whose detected encoding was used in the parse.
    • VERSION_COUNT

      static final Property VERSION_COUNT
      General metadata key for the count of non-final versions available within a file. This was added initially to support generalizing incremental updates in PDF.
    • VERSION_NUMBER

      static final Property VERSION_NUMBER
      General metadata key for the version number of a given file that contains earlier versions within it. This number is 0-indexed for the earliest version. The latest version does not have this metadata value. This was added initially to support generalizing incremental updates in PDF.
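
      The indexing convention described for VERSION_COUNT and VERSION_NUMBER can be written out as a small sketch. VersionIndexing and its methods are hypothetical, and "position" (0 for the earliest version in the file) is a name introduced only for this example.

```java
import java.util.OptionalInt;

public class VersionIndexing {
    // Count of non-final versions for a file holding totalVersions versions,
    // where totalVersions includes the latest (final) one.
    static int versionCount(int totalVersions) {
        if (totalVersions < 1) {
            throw new IllegalArgumentException("a file has at least one version");
        }
        return totalVersions - 1;
    }

    // Version number for the version at the given position (0 = earliest).
    // The latest version carries no version number, so the result is empty.
    static OptionalInt versionNumber(int position, int totalVersions) {
        if (position < 0 || position >= totalVersions) {
            throw new IllegalArgumentException("position out of range");
        }
        return position == totalVersions - 1
                ? OptionalInt.empty()
                : OptionalInt.of(position);
    }
}
```

      For a file with three versions, this gives a count of 2, version numbers 0 and 1 for the two earlier versions, and no version number for the latest one.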
    • PIPES_RESULT

      static final Property PIPES_RESULT