Package org.apache.tika.metadata
Interface TikaCoreProperties
-
public interface TikaCoreProperties
Contains a core set of basic Tika metadata properties, which all parsers will attempt to supply (where the file format permits). These are all defined in terms of other standard namespaces.
Users of Tika who wish to have consistent metadata across file formats can make use of these Properties, knowing that where present they will have consistent semantic meaning between different file formats. (No matter if one file format calls it Title, another Long-Title and another Long-Name, if they all mean the same thing as defined by DublinCore.TITLE then they will all be present as such.)
For now, most of these properties are composite ones including the deprecated non-prefixed String properties from the Metadata class. In Tika 2.0, most of these will revert back to simple assignments.
- Since:
- Apache Tika 1.2
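The idea of one canonical property standing in for several format-specific names can be illustrated without Tika on the classpath. The following is a minimal sketch, not Tika's implementation: the class CanonicalTitleSketch and its alias table are hypothetical, the sample keys Title, Long-Title and Long-Name are borrowed from the description above, and "dc:title" is an assumed name for the canonical key.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class CanonicalTitleSketch {
    // Hypothetical format-specific keys that all mean "title".
    static final Set<String> TITLE_ALIASES = Set.of("Title", "Long-Title", "Long-Name");

    // Assumed canonical key, used here for illustration only.
    static final String CANONICAL_TITLE = "dc:title";

    /** Copies the map, renaming any title alias to the canonical key. */
    static Map<String, String> normalize(Map<String, String> raw) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : raw.entrySet()) {
            String key = TITLE_ALIASES.contains(e.getKey()) ? CANONICAL_TITLE : e.getKey();
            out.put(key, e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // The caller reads one key regardless of what the source format called it.
        Map<String, String> normalized = normalize(Map.of("Long-Title", "Annual Report"));
        System.out.println(normalized.get(CANONICAL_TITLE));
    }
}
```

A consumer then looks up a single key instead of probing every name a format might have used.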
-
Nested Class Summary
Nested Classes (Modifier and Type, Interface, Description)
static class TikaCoreProperties.EmbeddedResourceType: A file might contain different types of embedded documents.
-
Field Summary
Fields (Modifier and Type, Field, Description)
static Property ALTITUDE
static Property COMMENTS
static Property CONTAINER_EXCEPTION
static Property CONTENT_TYPE_HINT: This is currently used to identify Content-Type that may be included within a document, such as in HTML documents.
static Property CONTENT_TYPE_PARSER_OVERRIDE: This is used by parsers to override detection of embedded resources with the override detector.
static Property CONTENT_TYPE_USER_OVERRIDE: This is used by users to override detection with the override detector.
static Property CONTRIBUTOR
static Property COVERAGE
static Property CREATED
static Property CREATOR
static Property CREATOR_TOOL
static Property DESCRIPTION
static Property DETECTED_ENCODING: When an EncodingDetector detects an encoding, the encoding should be stored in this field.
static Property EMBEDDED_BYTES_EXCEPTION
static Property EMBEDDED_DEPTH
static Property EMBEDDED_EXCEPTION
static Property EMBEDDED_ID: This is a 1-indexed counter for embedded files, used by the RecursiveParserWrapper.
static Property EMBEDDED_ID_PATH: This tracks the embedded file paths based on the embedded file's EMBEDDED_ID.
static String EMBEDDED_RELATIONSHIP_ID
static Property EMBEDDED_RESOURCE_PATH: This tracks the embedded file paths based on the name of embedded files where available.
static Property EMBEDDED_RESOURCE_TYPE: Embedded resource type property
static String EMBEDDED_RESOURCE_TYPE_KEY
static String EMBEDDED_STORAGE_CLASS_ID
static Property EMBEDDED_WARNING
static Property ENCODING_DETECTOR: This should be the simple class name for the EncodingDetectors whose detected encoding was used in the parse.
static Property FINAL_EMBEDDED_RESOURCE_PATH: This is calculated in RecursiveParserWrapperHandler.
static Property FORMAT
static Property HAS_SIGNATURE
static Property IDENTIFIER
static Property IS_ENCRYPTED
static Property LANGUAGE
static Property LATITUDE
static Property LONGITUDE
static Property METADATA_DATE
static Property MODIFIED
static Property MODIFIER
static String NAMESPACE_PREFIX_DELIMITER: The common delimiter used between the namespace abbreviation and the property name.
static Property NUM_IMAGES: This is the number of images (as in a multi-frame gif) returned by Java's ImageReader.getNumImages(boolean).
static Property ORIGINAL_RESOURCE_NAME: Some file formats can store information about their original file name/location or about their attachment's original file name/location within the file.
static Property PARSE_TIME_MILLIS
static Property PIPES_RESULT
static Property PRINT_DATE
static String PROTECTED
static Property PUBLISHER
static Property RATING
static Property RELATION
static String RESOURCE_NAME_KEY
static Property RIGHTS
static Property SIGNATURE_CONTACT_INFO
static Property SIGNATURE_DATE
static Property SIGNATURE_FILTER
static Property SIGNATURE_LOCATION
static Property SIGNATURE_NAME
static Property SIGNATURE_REASON
static Property SOURCE
static Property SOURCE_PATH: This should be used to store the path (relative or full) of the source file, including the file name, e.g. doc/path/to/my_pdf.pdf
static Property SUBJECT: DublinCore.SUBJECT; should include both subject and keywords if a document format has both.
static Property TIKA_CONTENT
static Property TIKA_CONTENT_HANDLER: Simple class name of the content handler
static Property TIKA_DETECTED_LANGUAGE
static Property TIKA_DETECTED_LANGUAGE_CONFIDENCE
static Property TIKA_DETECTED_LANGUAGE_CONFIDENCE_RAW
static Property TIKA_META_EXCEPTION_EMBEDDED_STREAM: Use this to store exceptions caught while trying to read the stream of an embedded resource.
static String TIKA_META_EXCEPTION_PREFIX: Use this to store parse exception information in the Metadata object.
static Property TIKA_META_EXCEPTION_WARNING: Use this to store exceptions caught during a parse that are non-fatal, e.g. if a parser is in lenient mode and more content can be extracted if we ignore an exception thrown by a dependency.
static String TIKA_META_PREFIX: Use this to prefix metadata properties that store information about the parsing process.
static String TIKA_META_WARN_PREFIX: Use this to store warnings that happened during the parse.
static Property TIKA_PARSED_BY
static Property TIKA_PARSED_BY_FULL_SET: Use this to store a record of all parsers that touched a given file in the container file's metadata.
static Property TITLE
static Property TRUNCATED_METADATA: This means that metadata keys or metadata values were truncated.
static Property TYPE
static Property VERSION_COUNT: General metadata key for the count of non-final versions available within a file.
static Property VERSION_NUMBER: General metadata key for the version number of a given file that contains earlier versions within it.
static Property WRITE_LIMIT_REACHED
-
Field Detail
-
NAMESPACE_PREFIX_DELIMITER
static final String NAMESPACE_PREFIX_DELIMITER
The common delimiter used between the namespace abbreviation and the property name.
- See Also:
- Constant Field Values
-
TIKA_META_PREFIX
static final String TIKA_META_PREFIX
Use this to prefix metadata properties that store information about the parsing process. Users should be able to distinguish between metadata that was contained within the document and metadata about the parsing process.
- See Also:
- Constant Field Values
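One way a caller might use such a prefix is to split a flat key/value view of the metadata into "from the document" and "about the parse". This is a minimal sketch under stated assumptions: the class ParseMetadataSplitSketch is hypothetical, the prefix value "X-TIKA:" and both sample keys are assumed for illustration, and real code would read the prefix from TIKA_META_PREFIX rather than hard-coding it.

```java
import java.util.Map;
import java.util.TreeMap;

class ParseMetadataSplitSketch {
    // Assumed prefix value for illustration; prefer the TIKA_META_PREFIX constant.
    static final String ASSUMED_PREFIX = "X-TIKA:";

    /**
     * Returns the entries whose key starts with the prefix when aboutParse is true,
     * and the remaining entries when aboutParse is false.
     */
    static Map<String, String> select(Map<String, String> all, String prefix, boolean aboutParse) {
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, String> e : all.entrySet()) {
            if (e.getKey().startsWith(prefix) == aboutParse) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample keys are hypothetical.
        Map<String, String> all = Map.of(
                "dc:title", "Annual Report",
                "X-TIKA:example", "42");
        System.out.println("document: " + select(all, ASSUMED_PREFIX, false));
        System.out.println("parsing:  " + select(all, ASSUMED_PREFIX, true));
    }
}
```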
-
EMBEDDED_DEPTH
static final Property EMBEDDED_DEPTH
-
EMBEDDED_RESOURCE_PATH
static final Property EMBEDDED_RESOURCE_PATH
This tracks the embedded file paths based on the name of embedded files where available. This field should be treated with great care and should NOT be used for creating a directory structure to write out attachments because: there may be path collisions or illegal characters or other mayhem. For a more robust path, see EMBEDDED_ID_PATH.
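To make one of those hazards concrete, the sketch below checks whether an untrusted name-based path would escape an output directory once resolved. It is a hedged illustration only: the class EmbeddedPathCheckSketch, the method staysInside and the sample values are hypothetical and not part of Tika, and the check does nothing about name collisions, which is one reason an ID-based path is the more robust choice.

```java
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;

class EmbeddedPathCheckSketch {
    /**
     * Returns true only if resolving the untrusted path under baseDir
     * yields a location strictly inside baseDir after normalization.
     */
    static boolean staysInside(Path baseDir, String untrustedPath) {
        Path base = baseDir.toAbsolutePath().normalize();
        // Strip leading separators so the value cannot replace the base directory.
        String relative = untrustedPath.replaceFirst("^[/\\\\]+", "");
        try {
            Path resolved = base.resolve(relative).normalize();
            return resolved.startsWith(base) && !resolved.equals(base);
        } catch (InvalidPathException e) {
            // Characters that are illegal on this platform are treated as unsafe.
            return false;
        }
    }

    public static void main(String[] args) {
        Path out = Paths.get("out");
        // Hypothetical values of a name-based embedded path.
        System.out.println(staysInside(out, "/report.docx/image1.png"));
        System.out.println(staysInside(out, "/../../etc/passwd"));
    }
}
```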
-
FINAL_EMBEDDED_RESOURCE_PATH
static final Property FINAL_EMBEDDED_RESOURCE_PATH
This is calculated in RecursiveParserWrapperHandler. It differs from EMBEDDED_RESOURCE_PATH in that it is calculated at the end of the full parse of a file. EMBEDDED_RESOURCE_PATH is calculated during the parse, and, for some parsers, an embedded file's name isn't known until after its child files have been parsed. Note that the unknown file count may differ from that in EMBEDDED_RESOURCE_PATH because there should be fewer unknown files when this is calculated. More simply, there is no connection between "embedded-1" in this field and "embedded-1" in EMBEDDED_RESOURCE_PATH. This field should be treated with great care and should NOT be used for creating a directory structure to write out attachments because: there may be path collisions or illegal characters or other mayhem. For a more robust path, see EMBEDDED_ID_PATH.
-
EMBEDDED_ID_PATH
static final Property EMBEDDED_ID_PATH
This tracks the embedded file paths based on the embedded file's EMBEDDED_ID.
-
EMBEDDED_ID
static final Property EMBEDDED_ID
This is a 1-indexed counter for embedded files, used by the RecursiveParserWrapper.
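Because an ID-based path consists of counters rather than names taken from the file, it can be turned into a predictable output name. The sketch below assumes, purely for illustration, that the ID path is a slash-separated list of integers such as "/1/4"; that format, the class EmbeddedIdNameSketch and the "attachment" prefix are assumptions, not something this page specifies.

```java
class EmbeddedIdNameSketch {
    /**
     * Turns an assumed id path such as "/1/4" into a flat name such as "attachment-1-4".
     * Any non-numeric segment makes Integer.parseInt throw NumberFormatException.
     */
    static String flatName(String idPath) {
        StringBuilder name = new StringBuilder("attachment");
        for (String part : idPath.split("/")) {
            if (part.isEmpty()) {
                continue; // skip the empty segment before the leading slash
            }
            name.append('-').append(Integer.parseInt(part));
        }
        return name.toString();
    }

    public static void main(String[] args) {
        System.out.println(flatName("/1/4")); // prints "attachment-1-4"
    }
}
```

Rejecting anything that is not a number keeps separators and ".." segments out of the generated name.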
-
PARSE_TIME_MILLIS
static final Property PARSE_TIME_MILLIS
-
TIKA_CONTENT_HANDLER
static final Property TIKA_CONTENT_HANDLER
Simple class name of the content handler
-
TIKA_CONTENT
static final Property TIKA_CONTENT
-
TIKA_META_EXCEPTION_PREFIX
static final String TIKA_META_EXCEPTION_PREFIX
Use this to store parse exception information in the Metadata object.
- See Also:
- Constant Field Values
-
TIKA_META_WARN_PREFIX
static final String TIKA_META_WARN_PREFIX
Use this to store warnings that happened during the parse.
- See Also:
- Constant Field Values
-
CONTAINER_EXCEPTION
static final Property CONTAINER_EXCEPTION
-
EMBEDDED_EXCEPTION
static final Property EMBEDDED_EXCEPTION
-
EMBEDDED_BYTES_EXCEPTION
static final Property EMBEDDED_BYTES_EXCEPTION
-
EMBEDDED_WARNING
static final Property EMBEDDED_WARNING
-
WRITE_LIMIT_REACHED
static final Property WRITE_LIMIT_REACHED
-
TIKA_META_EXCEPTION_WARNING
static final Property TIKA_META_EXCEPTION_WARNING
Use this to store exceptions caught during a parse that are non-fatal, e.g. if a parser is in lenient mode and more content can be extracted if we ignore an exception thrown by a dependency.
-
TRUNCATED_METADATA
static final Property TRUNCATED_METADATA
This means that metadata keys or metadata values were truncated. If there is an "include" filter, this should not be set if a field is not in the "include" set.
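The interaction between truncation and an "include" filter can be sketched as follows. This is not Tika's filtering code: the class TruncationFlagSketch, the flag key "truncated" and the length limit are hypothetical, and only value truncation is modelled. The point is that a field outside the include set is dropped before any length check, so it never causes the flag to be set.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class TruncationFlagSketch {
    // Hypothetical key used to record that something was truncated.
    static final String TRUNCATED_FLAG = "truncated";

    static Map<String, String> filter(Map<String, String> in, Set<String> include, int maxValueLength) {
        Map<String, String> out = new LinkedHashMap<>();
        boolean truncated = false;
        for (Map.Entry<String, String> e : in.entrySet()) {
            if (!include.contains(e.getKey())) {
                continue; // excluded fields are dropped and never count as truncated
            }
            String value = e.getValue();
            if (value.length() > maxValueLength) {
                value = value.substring(0, maxValueLength);
                truncated = true;
            }
            out.put(e.getKey(), value);
        }
        if (truncated) {
            out.put(TRUNCATED_FLAG, "true");
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> in = new LinkedHashMap<>();
        in.put("title", "short");
        in.put("body", "a value that is longer than the limit");
        // "body" is too long but not included, so no flag is set.
        System.out.println(filter(in, Set.of("title"), 8).containsKey(TRUNCATED_FLAG));
    }
}
```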
-
TIKA_META_EXCEPTION_EMBEDDED_STREAM
static final Property TIKA_META_EXCEPTION_EMBEDDED_STREAM
Use this to store exceptions caught while trying to read the stream of an embedded resource. Do not use this if there is a parse exception on the embedded resource.
-
TIKA_PARSED_BY
static final Property TIKA_PARSED_BY
-
TIKA_PARSED_BY_FULL_SET
static final Property TIKA_PARSED_BY_FULL_SET
Use this to store a record of all parsers that touched a given file in the container file's metadata.
-
TIKA_DETECTED_LANGUAGE
static final Property TIKA_DETECTED_LANGUAGE
-
TIKA_DETECTED_LANGUAGE_CONFIDENCE
static final Property TIKA_DETECTED_LANGUAGE_CONFIDENCE
-
TIKA_DETECTED_LANGUAGE_CONFIDENCE_RAW
static final Property TIKA_DETECTED_LANGUAGE_CONFIDENCE_RAW
-
RESOURCE_NAME_KEY
static final String RESOURCE_NAME_KEY
- See Also:
- Constant Field Values
-
PROTECTED
static final String PROTECTED
- See Also:
- Constant Field Values
-
EMBEDDED_RELATIONSHIP_ID
static final String EMBEDDED_RELATIONSHIP_ID
- See Also:
- Constant Field Values
-
EMBEDDED_STORAGE_CLASS_ID
static final String EMBEDDED_STORAGE_CLASS_ID
- See Also:
- Constant Field Values
-
EMBEDDED_RESOURCE_TYPE_KEY
static final String EMBEDDED_RESOURCE_TYPE_KEY
- See Also:
- Constant Field Values
-
ORIGINAL_RESOURCE_NAME
static final Property ORIGINAL_RESOURCE_NAME
Some file formats can store information about their original file name/location or about their attachment's original file name/location within the file.
-
SOURCE_PATH
static final Property SOURCE_PATH
This should be used to store the path (relative or full) of the source file, including the file name, e.g. doc/path/to/my_pdf.pdf
This can also be used for a primary key within a database.
-
CONTENT_TYPE_HINT
static final Property CONTENT_TYPE_HINT
This is currently used to identify a Content-Type that may be included within a document, such as in HTML documents, or the value might come from outside the document. This information may be faulty and should be treated only as a hint.
-
CONTENT_TYPE_USER_OVERRIDE
static final Property CONTENT_TYPE_USER_OVERRIDE
This is used by users to override detection with the override detector.
-
CONTENT_TYPE_PARSER_OVERRIDE
static final Property CONTENT_TYPE_PARSER_OVERRIDE
This is used by parsers to override detection of embedded resources with the override detector.
-
FORMAT
static final Property FORMAT
- See Also:
DublinCore.FORMAT
-
IDENTIFIER
static final Property IDENTIFIER
- See Also:
DublinCore.IDENTIFIER
-
CONTRIBUTOR
static final Property CONTRIBUTOR
- See Also:
DublinCore.CONTRIBUTOR
-
COVERAGE
static final Property COVERAGE
- See Also:
DublinCore.COVERAGE
-
CREATOR
static final Property CREATOR
- See Also:
DublinCore.CREATOR
-
MODIFIER
static final Property MODIFIER
- See Also:
Office.LAST_AUTHOR
-
CREATOR_TOOL
static final Property CREATOR_TOOL
- See Also:
XMP.CREATOR_TOOL
-
LANGUAGE
static final Property LANGUAGE
- See Also:
DublinCore.LANGUAGE
-
PUBLISHER
static final Property PUBLISHER
- See Also:
DublinCore.PUBLISHER
-
RELATION
static final Property RELATION
- See Also:
DublinCore.RELATION
-
RIGHTS
static final Property RIGHTS
- See Also:
DublinCore.RIGHTS
-
SOURCE
static final Property SOURCE
- See Also:
DublinCore.SOURCE
-
TYPE
static final Property TYPE
- See Also:
DublinCore.TYPE
-
TITLE
static final Property TITLE
- See Also:
DublinCore.TITLE
-
DESCRIPTION
static final Property DESCRIPTION
- See Also:
DublinCore.DESCRIPTION
-
SUBJECT
static final Property SUBJECT
DublinCore.SUBJECT; should include both subject and keywords if a document format has both. See also Office.KEYWORDS and OfficeOpenXMLCore.SUBJECT.
-
CREATED
static final Property CREATED
- See Also:
DublinCore.DATE
-
MODIFIED
static final Property MODIFIED
- See Also:
DublinCore.MODIFIED, Office.SAVE_DATE
-
PRINT_DATE
static final Property PRINT_DATE
- See Also:
Office.PRINT_DATE
-
METADATA_DATE
static final Property METADATA_DATE
- See Also:
XMP.METADATA_DATE
-
LATITUDE
static final Property LATITUDE
- See Also:
Geographic.LATITUDE
-
LONGITUDE
static final Property LONGITUDE
- See Also:
Geographic.LONGITUDE
-
ALTITUDE
static final Property ALTITUDE
- See Also:
Geographic.ALTITUDE
-
RATING
static final Property RATING
- See Also:
XMP.RATING
-
NUM_IMAGES
static final Property NUM_IMAGES
This is the number of images (as in a multi-frame gif) returned by Java's ImageReader.getNumImages(boolean). See the javadocs for known limitations.
-
COMMENTS
static final Property COMMENTS
- See Also:
OfficeOpenXMLExtended.COMMENTS
-
EMBEDDED_RESOURCE_TYPE
static final Property EMBEDDED_RESOURCE_TYPE
Embedded resource type property
-
HAS_SIGNATURE
static final Property HAS_SIGNATURE
-
SIGNATURE_NAME
static final Property SIGNATURE_NAME
-
SIGNATURE_DATE
static final Property SIGNATURE_DATE
-
SIGNATURE_LOCATION
static final Property SIGNATURE_LOCATION
-
SIGNATURE_REASON
static final Property SIGNATURE_REASON
-
SIGNATURE_FILTER
static final Property SIGNATURE_FILTER
-
SIGNATURE_CONTACT_INFO
static final Property SIGNATURE_CONTACT_INFO
-
IS_ENCRYPTED
static final Property IS_ENCRYPTED
-
DETECTED_ENCODING
static final Property DETECTED_ENCODING
When an EncodingDetector detects an encoding, the encoding should be stored in this field. This is different from HttpHeaders.CONTENT_ENCODING because that is what a parser chooses to use for processing a file. If an EncodingDetector returns "null", a parser may choose to use a default encoding. We want to differentiate between a parser using a default encoding and the output of an EncodingDetector.
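The distinction between "what the detector reported" and "what the parser used" can be sketched with plain java.nio charsets. The class EncodingChoiceSketch and its Choice holder are hypothetical names, and the fallback logic is an illustration of the behaviour described above rather than any parser's actual code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingChoiceSketch {
    /** Holds the detector's answer (possibly null) next to the charset actually used. */
    static final class Choice {
        final Charset detected; // null when the detector had no answer
        final Charset used;

        Choice(Charset detected, Charset used) {
            this.detected = detected;
            this.used = used;
        }
    }

    /** Uses the detector result when present, otherwise the parser's default. */
    static Choice choose(Charset detectorResult, Charset parserDefault) {
        Charset used = detectorResult != null ? detectorResult : parserDefault;
        return new Choice(detectorResult, used);
    }

    public static void main(String[] args) {
        Choice fallback = choose(null, StandardCharsets.UTF_8);
        // Nothing was detected, yet a charset was still used for processing.
        System.out.println("detected=" + fallback.detected + " used=" + fallback.used);
    }
}
```

Keeping the two values apart lets a consumer tell a genuine detection from a silent default.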
-
ENCODING_DETECTOR
static final Property ENCODING_DETECTOR
This should be the simple class name for the EncodingDetectors whose detected encoding was used in the parse.
-
VERSION_COUNT
static final Property VERSION_COUNT
General metadata key for the count of non-final versions available within a file. This was added initially to support generalizing incremental updates in PDF.
-
VERSION_NUMBER
static final Property VERSION_NUMBER
General metadata key for the version number of a given file that contains earlier versions within it. This number is 0-indexed for the earliest version. The latest version does not have this metadata value. This was added initially to support generalizing incremental updates in PDF.
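The numbering rule can be worked through for a small case. The sketch below is an illustration only; the class VersionNumberingSketch and its method are hypothetical. For a file with a given count of non-final versions it lists the version number each version would carry, oldest first, with the latest version carrying none.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalInt;

class VersionNumberingSketch {
    /**
     * For a file holding nonFinalCount earlier versions plus the latest one,
     * returns the version number of each, oldest first. The last entry is empty
     * because the latest version has no version number.
     */
    static List<OptionalInt> versionNumbers(int nonFinalCount) {
        List<OptionalInt> out = new ArrayList<>();
        for (int i = 0; i < nonFinalCount; i++) {
            out.add(OptionalInt.of(i)); // 0 is the earliest version
        }
        out.add(OptionalInt.empty());
        return out;
    }

    public static void main(String[] args) {
        // Two non-final versions: numbers 0 and 1, then the unnumbered latest version.
        System.out.println(versionNumbers(2));
    }
}
```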
-
PIPES_RESULT
static final Property PIPES_RESULT
-