Package org.apache.tika.metadata
Interface TikaCoreProperties
public interface TikaCoreProperties
Contains a core set of basic Tika metadata properties, which all parsers
will attempt to supply (where the file format permits). These are all
defined in terms of other standard namespaces.
Users of Tika who want consistent metadata across file formats can rely
on these Properties: where present, they have the same semantic meaning
in every file format. For example, whether one file format calls a field
Title, another Long-Title and another Long-Name, if they all mean the same
thing as defined by DublinCore.TITLE, they will all be reported under
that property.
For now, most of these properties are composite ones that include the deprecated non-prefixed String properties from the Metadata class. In Tika 2.0, most of these will revert to simple assignments.
Since: Apache Tika 1.2
-
Nested Class Summary
Modifier and Type | Interface | Description
static enum | TikaCoreProperties.EmbeddedResourceType | A file might contain different types of embedded documents.
-
Field Summary
Modifier and Type | Field | Description
static final Property | ALTITUDE
static final Property | COMMENTS
static final Property | CONTAINER_EXCEPTION
static final Property | CONTENT_TYPE_HINT | This is currently used to identify Content-Type that may be included within a document, such as in HTML documents.
static final Property | CONTENT_TYPE_PARSER_OVERRIDE | This is used by parsers to override detection of embedded resources with the override detector.
static final Property | CONTENT_TYPE_USER_OVERRIDE | This is used by users to override detection with the override detector.
static final Property | CONTRIBUTOR
static final Property | COVERAGE
static final Property | CREATED
static final Property | CREATOR
static final Property | CREATOR_TOOL
static final Property | DESCRIPTION
static final Property | DETECTED_ENCODING | When an EncodingDetector detects an encoding, the encoding should be stored in this field.
static final Property | EMBEDDED_BYTES_EXCEPTION
static final Property | EMBEDDED_DEPTH
static final Property | EMBEDDED_EXCEPTION
static final Property | EMBEDDED_ID | This is a 1-indexed counter for embedded files, used by the RecursiveParserWrapper.
static final Property | EMBEDDED_ID_PATH | This tracks the embedded file paths based on the embedded file's EMBEDDED_ID.
static final String | EMBEDDED_RELATIONSHIP_ID
static final Property | EMBEDDED_RESOURCE_PATH | This tracks the embedded file paths based on the name of embedded files where available.
static final Property | EMBEDDED_RESOURCE_TYPE | Embedded resource type property.
static final String | EMBEDDED_RESOURCE_TYPE_KEY
static final String | EMBEDDED_STORAGE_CLASS_ID
static final Property | EMBEDDED_WARNING
static final Property | ENCODING_DETECTOR | This should be the simple class name of the EncodingDetector whose detected encoding was used in the parse.
static final Property | FINAL_EMBEDDED_RESOURCE_PATH | This is calculated in RecursiveParserWrapperHandler.
static final Property | FORMAT
static final Property | HAS_SIGNATURE
static final Property | IDENTIFIER
static final Property | IS_ENCRYPTED
static final Property | LANGUAGE
static final Property | LATITUDE
static final Property | LONGITUDE
static final Property | METADATA_DATE
static final Property | MODIFIED
static final Property | MODIFIER
static final String | NAMESPACE_PREFIX_DELIMITER | The common delimiter used between the namespace abbreviation and the property name.
static final Property | NUM_IMAGES | This is the number of images (as in a multi-frame gif) returned by Java's ImageReader.getNumImages(boolean).
static final Property | ORIGINAL_RESOURCE_NAME | Some file formats can store information about their original file name/location or about their attachment's original file name/location within the file.
static final Property | PARSE_TIME_MILLIS
static final Property | PIPES_RESULT
static final Property | PRINT_DATE
static final String | PROTECTED
static final Property | PUBLISHER
static final Property | RATING
static final Property | RELATION
static final String | RESOURCE_NAME_KEY
static final Property | RIGHTS
static final Property | SIGNATURE_CONTACT_INFO
static final Property | SIGNATURE_DATE
static final Property | SIGNATURE_FILTER
static final Property | SIGNATURE_LOCATION
static final Property | SIGNATURE_NAME
static final Property | SIGNATURE_REASON
static final Property | SOURCE
static final Property | SOURCE_PATH | This should be used to store the path (relative or full) of the source file, including the file name, e.g. doc/path/to/my_pdf.pdf
static final Property | SUBJECT | DublinCore.SUBJECT; should include both subject and keywords if a document format has both.
static final Property | TIKA_CONTENT
static final Property | TIKA_CONTENT_HANDLER | Simple class name of the content handler.
static final Property | TIKA_DETECTED_LANGUAGE
static final Property | TIKA_DETECTED_LANGUAGE_CONFIDENCE
static final Property | TIKA_DETECTED_LANGUAGE_CONFIDENCE_RAW
static final Property | TIKA_META_EXCEPTION_EMBEDDED_STREAM | Use this to store exceptions caught while trying to read the stream of an embedded resource.
static final String | TIKA_META_EXCEPTION_PREFIX | Use this to store parse exception information in the Metadata object.
static final Property | TIKA_META_EXCEPTION_WARNING | Use this to store exceptions caught during a parse that are non-fatal, e.g. if a parser is in lenient mode and more content can be extracted if we ignore an exception thrown by a dependency.
static final String | TIKA_META_PREFIX | Use this to prefix metadata properties that store information about the parsing process.
static final String | TIKA_META_WARN_PREFIX | Use this to store warnings that happened during the parse.
static final Property | TIKA_PARSED_BY
static final Property | TIKA_PARSED_BY_FULL_SET | Use this to store a record of all parsers that touched a given file in the container file's metadata.
static final Property | TITLE
static final Property | TRUNCATED_METADATA | This means that metadata keys or metadata values were truncated.
static final Property | TYPE
static final Property | VERSION_COUNT | General metadata key for the count of non-final versions available within a file.
static final Property | VERSION_NUMBER | General metadata key for the version number of a given file that contains earlier versions within it.
static final Property | WRITE_LIMIT_REACHED
-
Field Details
-
NAMESPACE_PREFIX_DELIMITER
The common delimiter used between the namespace abbreviation and the property name.
-
TIKA_META_PREFIX
Use this to prefix metadata properties that store information about the parsing process. Users should be able to distinguish between metadata that was contained within the document and metadata about the parsing process.
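The distinction between document metadata and parsing-process metadata can be sketched with plain strings. This is a minimal illustration rather than Tika code: the class `MetadataPartitioner`, its `select` method and the sample keys are hypothetical, a plain `Map` stands in for Tika's Metadata object, and the prefix value `"X-TIKA:"` is an assumption that should be checked against the constant in the Tika version in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataPartitioner {

    // Assumed value for illustration; prefer the TIKA_META_PREFIX constant in real code.
    public static final String ASSUMED_TIKA_META_PREFIX = "X-TIKA:";

    /**
     * Returns the entries whose key starts with the prefix (aboutParsing == true)
     * or the entries whose key does not start with it (aboutParsing == false).
     */
    public static Map<String, String> select(Map<String, String> metadata,
                                             String prefix, boolean aboutParsing) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            if (e.getKey().startsWith(prefix) == aboutParsing) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample keys and values are made up for the example.
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("dc:title", "Quarterly report");
        metadata.put("X-TIKA:Parsed-By", "SomeParser");

        System.out.println("From the document: " + select(metadata, ASSUMED_TIKA_META_PREFIX, false));
        System.out.println("About the parse:   " + select(metadata, ASSUMED_TIKA_META_PREFIX, true));
    }
}
```

Passing the prefix as a parameter keeps the sketch independent of the actual constant value.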
-
EMBEDDED_DEPTH
-
EMBEDDED_RESOURCE_PATH
This tracks the embedded file paths based on the name of embedded files where available. This field should be treated with great care and should NOT be used for creating a directory structure to write out attachments: there may be path collisions, illegal characters or other mayhem. For a more robust path, see EMBEDDED_ID_PATH.
-
FINAL_EMBEDDED_RESOURCE_PATH
This is calculated in RecursiveParserWrapperHandler. It differs from EMBEDDED_RESOURCE_PATH in that it is calculated at the end of the full parse of a file. EMBEDDED_RESOURCE_PATH is calculated during the parse, and, for some parsers, an embedded file's name isn't known until after its child files have been parsed. Note that the unknown file count may differ from that of EMBEDDED_RESOURCE_PATH because there should be fewer unknown files when this field is calculated. More simply, there is no connection between "embedded-1" in this field and "embedded-1" in EMBEDDED_RESOURCE_PATH. This field should be treated with great care and should NOT be used for creating a directory structure to write out attachments: there may be path collisions, illegal characters or other mayhem. For a more robust path, see EMBEDDED_ID_PATH.
-
EMBEDDED_ID_PATH
This tracks the embedded file paths based on the embedded file's EMBEDDED_ID.
-
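The warning against building directories from EMBEDDED_RESOURCE_PATH, and the pointer to EMBEDDED_ID_PATH, can be illustrated with a small sketch. It assumes, purely for illustration, that an ID-based path looks like `/1/3`; the class `EmbeddedOutputNames`, the `safeFileName` helper and the `embedded_…​.bin` naming scheme are hypothetical and not part of Tika.

```java
public class EmbeddedOutputNames {

    /**
     * Builds a flat, filesystem-safe file name from an ID-based path.
     * Every run of characters other than ASCII letters and digits becomes a
     * single underscore, so separators and ".." can never reach the file system.
     */
    public static String safeFileName(String embeddedIdPath) {
        String cleaned = embeddedIdPath.replaceAll("[^0-9A-Za-z]+", "_");
        // Trim underscores left over from leading or trailing separators.
        cleaned = cleaned.replaceAll("^_+|_+$", "");
        return "embedded_" + (cleaned.isEmpty() ? "root" : cleaned) + ".bin";
    }

    public static void main(String[] args) {
        // Assumed shape of an ID path: third child of the first embedded file.
        System.out.println(safeFileName("/1/3"));
        // A hostile, name-based path is flattened instead of being followed.
        System.out.println(safeFileName("../../etc/passwd"));
    }
}
```

Flattening to a single file name avoids creating directories from untrusted input. Uniqueness still depends on the IDs themselves being unique.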
EMBEDDED_ID
This is a 1-indexed counter for embedded files, used by the RecursiveParserWrapper.
-
PARSE_TIME_MILLIS
-
TIKA_CONTENT_HANDLER
Simple class name of the content handler.
-
TIKA_CONTENT
-
TIKA_META_EXCEPTION_PREFIX
Use this to store parse exception information in the Metadata object.
-
TIKA_META_WARN_PREFIX
Use this to store warnings that happened during the parse.
-
CONTAINER_EXCEPTION
-
EMBEDDED_EXCEPTION
-
EMBEDDED_BYTES_EXCEPTION
-
EMBEDDED_WARNING
-
WRITE_LIMIT_REACHED
-
TIKA_META_EXCEPTION_WARNING
Use this to store exceptions caught during a parse that are non-fatal, e.g. if a parser is in lenient mode and more content can be extracted if we ignore an exception thrown by a dependency.
-
TRUNCATED_METADATA
This means that metadata keys or metadata values were truncated. If there is an "include" filter, this should not be set for a field that is not in the "include" set.
-
TIKA_META_EXCEPTION_EMBEDDED_STREAM
Use this to store exceptions caught while trying to read the stream of an embedded resource. Do not use this if there is a parse exception on the embedded resource.
-
TIKA_PARSED_BY
-
TIKA_PARSED_BY_FULL_SET
Use this to store a record of all parsers that touched a given file in the container file's metadata.
-
TIKA_DETECTED_LANGUAGE
-
TIKA_DETECTED_LANGUAGE_CONFIDENCE
-
TIKA_DETECTED_LANGUAGE_CONFIDENCE_RAW
-
RESOURCE_NAME_KEY
-
PROTECTED
-
EMBEDDED_RELATIONSHIP_ID
-
EMBEDDED_STORAGE_CLASS_ID
-
EMBEDDED_RESOURCE_TYPE_KEY
-
ORIGINAL_RESOURCE_NAME
Some file formats can store information about their original file name/location or about their attachment's original file name/location within the file.
-
SOURCE_PATH
This should be used to store the path (relative or full) of the source file, including the file name, e.g. doc/path/to/my_pdf.pdf. This can also be used for a primary key within a database.
-
CONTENT_TYPE_HINT
This is currently used to identify a Content-Type that may be included within a document, such as in HTML documents, or the value might come from outside the document. This information may be faulty and should be treated only as a hint.
-
CONTENT_TYPE_USER_OVERRIDE
This is used by users to override detection with the override detector.
-
CONTENT_TYPE_PARSER_OVERRIDE
This is used by parsers to override detection of embedded resources with the override detector.
-
FORMAT
-
IDENTIFIER
-
CONTRIBUTOR
-
COVERAGE
-
CREATOR
-
MODIFIER
-
CREATOR_TOOL
-
LANGUAGE
-
PUBLISHER
-
RELATION
-
RIGHTS
-
SOURCE
-
TYPE
-
TITLE
-
DESCRIPTION
-
SUBJECT
DublinCore.SUBJECT; should include both subject and keywords if a document format has both. See also Office.KEYWORDS and OfficeOpenXMLCore.SUBJECT.
-
CREATED
-
MODIFIED
-
PRINT_DATE
-
METADATA_DATE
-
LATITUDE
-
LONGITUDE
-
ALTITUDE
-
RATING
-
NUM_IMAGES
This is the number of images (as in a multi-frame gif) returned by Java's ImageReader.getNumImages(boolean). See the javadocs for known limitations.
-
COMMENTS
-
EMBEDDED_RESOURCE_TYPE
Embedded resource type property.
-
HAS_SIGNATURE
-
SIGNATURE_NAME
-
SIGNATURE_DATE
-
SIGNATURE_LOCATION
-
SIGNATURE_REASON
-
SIGNATURE_FILTER
-
SIGNATURE_CONTACT_INFO
-
IS_ENCRYPTED
-
DETECTED_ENCODING
When an EncodingDetector detects an encoding, the encoding should be stored in this field. This is different from HttpHeaders.CONTENT_ENCODING because that is what a parser chooses to use for processing a file. If an EncodingDetector returns "null", a parser may choose to use a default encoding. We want to differentiate between a parser using a default encoding and the output of an EncodingDetector.
-
ENCODING_DETECTOR
This should be the simple class name of the EncodingDetector whose detected encoding was used in the parse.
-
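The difference between DETECTED_ENCODING, ENCODING_DETECTOR and the encoding a parser actually used can be made concrete with a small sketch. A plain `Map` stands in for the Metadata object, and the class `EncodingProvenance`, the `describe` method and all key strings in `main` are placeholders chosen for illustration, not confirmed Tika key names.

```java
import java.util.HashMap;
import java.util.Map;

public class EncodingProvenance {

    /**
     * Describes where the encoding used for a parse came from.
     * The caller supplies all three key names; nothing here is tied to Tika.
     */
    public static String describe(Map<String, String> metadata, String detectedKey,
                                  String detectorKey, String usedKey) {
        String detected = metadata.get(detectedKey);
        String used = metadata.get(usedKey);
        if (detected == null) {
            // No detector output was recorded, so the parser presumably fell back to a default.
            return "parser default: " + used;
        }
        return "detected " + detected + " by " + metadata.get(detectorKey)
                + ", parser used " + used;
    }

    public static void main(String[] args) {
        // Placeholder keys and values, made up for the example.
        Map<String, String> withDetection = new HashMap<>();
        withDetection.put("detected-encoding", "windows-1252");
        withDetection.put("encoding-detector", "SomeEncodingDetector");
        withDetection.put("content-encoding", "windows-1252");

        Map<String, String> withoutDetection = new HashMap<>();
        withoutDetection.put("content-encoding", "UTF-8");

        System.out.println(describe(withDetection, "detected-encoding",
                "encoding-detector", "content-encoding"));
        System.out.println(describe(withoutDetection, "detected-encoding",
                "encoding-detector", "content-encoding"));
    }
}
```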
VERSION_COUNT
General metadata key for the count of non-final versions available within a file. This was added initially to support generalizing incremental updates in PDF.
-
VERSION_NUMBER
General metadata key for the version number of a given file that contains earlier versions within it. This number is 0-indexed for the earliest version. The latest version does not have this metadata value. This was added initially to support generalizing incremental updates in PDF.
-
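The indexing rule for VERSION_NUMBER (0 for the earliest version, no value at all for the latest) is easy to get wrong when presenting versions to users. The sketch below encodes only that rule. The class `VersionLabels` and its `label` method are hypothetical, and the input is assumed to be the raw string value of the property, or null when the property is absent.

```java
public class VersionLabels {

    /**
     * Turns a raw VERSION_NUMBER value into a human-readable label.
     * A null value means the property is absent, which marks the latest version.
     */
    public static String label(String versionNumber) {
        if (versionNumber == null) {
            return "latest version";
        }
        int zeroBased = Integer.parseInt(versionNumber.trim());
        // Shift to a 1-based ordinal for display; 0 is the earliest version.
        return "earlier version " + (zeroBased + 1) + (zeroBased == 0 ? " (earliest)" : "");
    }

    public static void main(String[] args) {
        System.out.println(label("0"));
        System.out.println(label("2"));
        System.out.println(label(null));
    }
}
```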
PIPES_RESULT
-