Embedded Document Metadata
When Tika parses container files (such as ZIP archives, emails, PDFs with attachments, or Microsoft Office documents), it extracts embedded documents recursively. Tika provides several metadata fields to help you understand and track the structure of these embedded resources.
Overview
Understanding embedded document metadata requires distinguishing between two fundamentally different types of information:
- Containment Structure (Tika-Generated) - Metadata that Tika generates to track how documents are nested within each other. This answers questions like: "Which file contained this attachment?" and "What is the nesting depth?"
- Container Metadata (From the File) - Metadata that comes from the container file itself, describing what the container knows about its contents. This answers questions like: "What path was this file stored at inside the archive?" and "What was the original filename?"
The distinction matters because containers often store embedded files in internal directory
structures that are independent of how deeply nested the embedding is. A ZIP file preserves
its original folder hierarchy; an OOXML document stores media in xl/media/ or ppt/media/;
a PST file organizes emails by folder. This internal organization is separate from the
question of containment.
Containment Structure (Tika-Generated)
These fields are generated by Tika during parsing to track the nesting relationships between
documents. They answer: "Which document contained this one?" All fields are defined in
TikaCoreProperties.
Nesting Identifiers
TikaCoreProperties.EMBEDDED_ID (X-TIKA:embedded_id) - A 1-indexed integer assigned by Tika to each embedded document during parsing. IDs are assigned in the order documents are encountered by the RecursiveParserWrapper. This ID uniquely identifies each embedded document within a single parse operation.
TikaCoreProperties.EMBEDDED_ID_PATH (X-TIKA:embedded_id_path) - A path showing the containment hierarchy using EMBEDDED_ID values. For example, /1/3 indicates that the file with EMBEDDED_ID=3 was contained within the file with EMBEDDED_ID=1. This is the most reliable field for tracking containment relationships.
This field is purely about which document contains which. It tells you nothing about folder structures or original paths within the containers themselves.
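As a rough illustration of reading an EMBEDDED_ID_PATH value, the sketch below derives the nesting depth and the immediate container's ID from a path string. The class EmbeddedIdPaths and its methods are hypothetical helpers, not part of Tika, and the code assumes the value is a slash-separated list of integer IDs, as in the /1/3 example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalInt;

/** Hypothetical helper for illustration; not part of Tika. */
final class EmbeddedIdPaths {

    private EmbeddedIdPaths() {}

    /** Splits "/1/3" into [1, 3]. A null or empty path yields an empty list. */
    static List<Integer> ids(String idPath) {
        List<Integer> ids = new ArrayList<>();
        if (idPath == null) {
            return ids;
        }
        for (String segment : idPath.split("/")) {
            if (!segment.isEmpty()) {
                ids.add(Integer.parseInt(segment));
            }
        }
        return ids;
    }

    /** Number of IDs in the path: 0 when no path is given, 1 for "/1", 2 for "/1/3". */
    static int depth(String idPath) {
        return ids(idPath).size();
    }

    /** ID of the immediate container, or empty when the path has fewer than two IDs. */
    static OptionalInt parentId(String idPath) {
        List<Integer> ids = ids(idPath);
        return ids.size() < 2
                ? OptionalInt.empty()
                : OptionalInt.of(ids.get(ids.size() - 2));
    }

    public static void main(String[] args) {
        String idPath = "/1/3"; // example value from the text above
        System.out.println("depth=" + depth(idPath) + ", parent=" + parentId(idPath));
    }
}
```

An empty parent here means the document sits directly inside the root document, which has no embedded ID of its own.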
Synthetic Paths
TikaCoreProperties.EMBEDDED_RESOURCE_PATH (X-TIKA:embedded_resource_path) - A synthetic path built by concatenating file names (from RESOURCE_NAME_KEY) at each nesting level. This provides a human-readable path through the containment hierarchy.
Do not use this field for creating directory structures to write out attachments. There may be path collisions, illegal characters, or zip slip vulnerabilities. Use EMBEDDED_ID_PATH for reliable containment tracking.
TikaCoreProperties.FINAL_EMBEDDED_RESOURCE_PATH (X-TIKA:final_embedded_resource_path) - Similar to EMBEDDED_RESOURCE_PATH, but calculated at the end of the full parse. For some parsers, an embedded file’s name isn’t known until after its child files have been parsed. This field may have fewer "unknown" file names than EMBEDDED_RESOURCE_PATH.
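One way to act on the warning about writing out attachments is to name output files after the embedded ID and take only a sanitized extension from the resource name. The sketch below is a minimal example under that assumption. SafeAttachmentNames, the attachment- prefix, and the extension rule are all hypothetical choices for illustration, not Tika behavior.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of Tika. */
final class SafeAttachmentNames {

    private SafeAttachmentNames() {}

    /** Keeps a short alphanumeric extension from the resource name, or returns "" if none. */
    static String safeExtension(String resourceName) {
        if (resourceName == null) {
            return "";
        }
        int dot = resourceName.lastIndexOf('.');
        if (dot < 0 || dot == resourceName.length() - 1) {
            return "";
        }
        String ext = resourceName.substring(dot + 1);
        // Reject anything containing separators or other unusual characters.
        return ext.matches("[A-Za-z0-9]{1,10}") ? "." + ext.toLowerCase(Locale.ROOT) : "";
    }

    /** Builds an output path from the embedded ID, which is unique within one parse. */
    static Path outputPath(Path outputDir, int embeddedId, String resourceName) {
        Path base = outputDir.toAbsolutePath().normalize();
        Path target = base.resolve("attachment-" + embeddedId + safeExtension(resourceName))
                .normalize();
        // Defensive check: the resolved path must stay inside the output directory.
        if (!target.startsWith(base)) {
            throw new IllegalArgumentException("Path escapes output directory: " + target);
        }
        return target;
    }

    public static void main(String[] args) {
        // A hostile resource name cannot influence the directory part of the result.
        System.out.println(outputPath(Paths.get("out"), 3, "../../etc/passwd.txt"));
    }
}
```

Because the ID is unique within a parse, two attachments with the same name do not collide, and the original names can still be recorded separately alongside the files.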
Resource Naming
TikaCoreProperties.RESOURCE_NAME_KEY (X-TIKA:resourceName) - The file name (not path) of the resource. Tika makes a best effort to determine a meaningful name from the container’s metadata. When no name is available, Tika falls back to synthetic names such as embedded-1.jpeg.
In Tika 4.x, this field contains only the file name. Use INTERNAL_PATH for the full path as stored in the container.
Container Metadata (From the File)
These fields contain metadata that is stored within the container file itself. This is
information the container preserves about its contents, independent of how Tika traverses
the nesting structure. All fields below are defined in TikaCoreProperties.
Internal Paths
TikaCoreProperties.INTERNAL_PATH (X-TIKA:internalPath) - The path (including file name) as literally stored within the container. This is what the container knows about where the file lives in its internal structure:
- In a ZIP: the entry path (e.g., reports/Q1/sales.xlsx)
- In a PST: the folder path plus message name (e.g., Inbox/Important/Meeting notes.msg)
- In an OOXML document: the part name (e.g., xl/media/image1.png)
This differs fundamentally from EMBEDDED_RESOURCE_PATH:
- INTERNAL_PATH is what the container stores about the file’s location within itself
- EMBEDDED_RESOURCE_PATH is what Tika synthesizes from the nesting structure
TikaCoreProperties.ORIGINAL_RESOURCE_NAME (X-TIKA:origResourceName) - For some file formats, the file path where the document was last saved on the creator’s system. For example, an .xlsx file named budget.xlsx may include a metadata property storing where it was last saved: C:\Users\Alice\budget.xlsx. This is not specific to embedded files; it is a property that certain file formats preserve about themselves.
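Since INTERNAL_PATH holds the container's own folder structure plus the file name, it can be split with plain string handling rather than file system APIs. The sketch below is an illustration only. InternalPaths is a hypothetical helper, not part of Tika, and treating both / and \ as separators is an assumption made to cover the example paths shown above.

```java
/** Hypothetical helper for illustration; not part of Tika. */
final class InternalPaths {

    private InternalPaths() {}

    /** Index of the last '/' or '\' in the path, or -1 if there is none. */
    private static int lastSeparator(String path) {
        return Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
    }

    /** File name portion, e.g. "sales.xlsx" for "reports/Q1/sales.xlsx". */
    static String fileName(String internalPath) {
        return internalPath.substring(lastSeparator(internalPath) + 1);
    }

    /** Folder portion, e.g. "reports/Q1" for "reports/Q1/sales.xlsx"; "" if there is none. */
    static String folder(String internalPath) {
        int sep = lastSeparator(internalPath);
        return sep < 0 ? "" : internalPath.substring(0, sep);
    }

    public static void main(String[] args) {
        String[] examples = {
            "reports/Q1/sales.xlsx",
            "Inbox/Important/Meeting notes.msg",
            "xl/media/image1.png"
        };
        for (String path : examples) {
            System.out.println(folder(path) + " | " + fileName(path));
        }
    }
}
```

The folder portion is meaningful only within the container that reported it, so it should not be joined with paths taken from a different nesting level.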
Microsoft-Specific Metadata
Microsoft Office formats use additional identifiers for embedded objects.
TikaCoreProperties.EMBEDDED_RELATIONSHIP_ID (X-TIKA:embeddedRelationshipId) - A Microsoft-specific identifier used internally to reference embedded objects within Office documents. This is the relationship ID from the Office Open XML or OLE structure.
Office.EMBEDDED_STORAGE_CLASS_ID (msoffice:embeddedStorageClassId) - A UUID that identifies the class of embedded object in Microsoft formats. While not exactly a MIME type, it provides similar information about what type of object is embedded. Defined in the Office metadata class.
Quick Reference
| Property | Metadata Key | Source |
|---|---|---|
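| EMBEDDED_ID | X-TIKA:embedded_id | Containment |
| EMBEDDED_ID_PATH | X-TIKA:embedded_id_path | Containment |
| EMBEDDED_RESOURCE_PATH | X-TIKA:embedded_resource_path | Containment |
| FINAL_EMBEDDED_RESOURCE_PATH | X-TIKA:final_embedded_resource_path | Containment |
| RESOURCE_NAME_KEY | X-TIKA:resourceName | Containment |
| INTERNAL_PATH | X-TIKA:internalPath | Container |
| ORIGINAL_RESOURCE_NAME | X-TIKA:origResourceName | Container |
| EMBEDDED_RELATIONSHIP_ID | X-TIKA:embeddedRelationshipId | Container (MS) |
| EMBEDDED_STORAGE_CLASS_ID | msoffice:embeddedStorageClassId | Container (MS) |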
Example: Understanding the Difference
Consider a ZIP file archive.zip containing reports/Q1/sales.xlsx, where the spreadsheet
itself contains an embedded image:
| Document | Field | Value |
|---|---|---|
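| Container (archive.zip) | EMBEDDED_ID | (not set - this is the root document) |
| | EMBEDDED_ID_PATH | (not set) |
| | INTERNAL_PATH | (not set) |
| Spreadsheet (sales.xlsx) | EMBEDDED_ID | 1 |
| | EMBEDDED_ID_PATH | /1 |
| | EMBEDDED_RESOURCE_PATH | /sales.xlsx |
| | RESOURCE_NAME_KEY | sales.xlsx |
| | INTERNAL_PATH | reports/Q1/sales.xlsx |
| Embedded image in spreadsheet | EMBEDDED_ID | 2 |
| | EMBEDDED_ID_PATH | /1/2 |
| | EMBEDDED_RESOURCE_PATH | /sales.xlsx/image1.png |
| | RESOURCE_NAME_KEY | image1.png |
| | INTERNAL_PATH | xl/media/image1.png |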
Key Observations
The table above illustrates the fundamental distinction between containment tracking and container metadata:
Containment structure (Tika-generated):
- EMBEDDED_ID_PATH /1/2 tells you that the image (ID=2) was found inside the spreadsheet (ID=1). It answers: "What contains what?"
- EMBEDDED_RESOURCE_PATH /sales.xlsx/image1.png is synthesized from file names at each nesting level. It provides a human-readable path through the containment hierarchy.
Container metadata (from the file):
- INTERNAL_PATH for the spreadsheet (reports/Q1/sales.xlsx) is what the ZIP file knows about where that entry was stored: its internal folder structure.
- INTERNAL_PATH for the image (xl/media/image1.png) is what the XLSX file knows about where that media file lives: its internal OOXML part name.
Notice that INTERNAL_PATH resets at each container boundary. The image’s internal path
doesn’t include reports/Q1/ because that path information belongs to the ZIP container,
not the XLSX container. Each container only knows about its own internal organization.
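To make the two kinds of information concrete, the sketch below prints an indented outline of the example documents, using the ID path for indentation and showing each document's internal path alongside. Plain maps stand in for Tika's per-document metadata. ContainmentOutline and its methods are hypothetical, and the ID path /1 for the spreadsheet and the resource names are inferred from the example rather than captured from a real parse.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical demo; plain maps stand in for Tika's per-document metadata. */
final class ContainmentOutline {

    static final String ID_PATH = "X-TIKA:embedded_id_path";
    static final String NAME = "X-TIKA:resourceName";
    static final String INTERNAL_PATH = "X-TIKA:internalPath";

    private ContainmentOutline() {}

    /** Builds one metadata record; null values are left unset, as for the root document. */
    static Map<String, String> doc(String idPath, String name, String internalPath) {
        Map<String, String> metadata = new LinkedHashMap<>();
        if (idPath != null) {
            metadata.put(ID_PATH, idPath);
        }
        metadata.put(NAME, name);
        if (internalPath != null) {
            metadata.put(INTERNAL_PATH, internalPath);
        }
        return metadata;
    }

    /** Depth is the number of IDs in the ID path; a record without one is the root. */
    static int depth(Map<String, String> metadata) {
        String idPath = metadata.get(ID_PATH);
        if (idPath == null) {
            return 0;
        }
        int depth = 0;
        for (String segment : idPath.split("/")) {
            if (!segment.isEmpty()) {
                depth++;
            }
        }
        return depth;
    }

    /** Renders records, assumed to be in parse order, one per line, indented by depth. */
    static String outline(List<Map<String, String>> docs) {
        StringBuilder sb = new StringBuilder();
        for (Map<String, String> d : docs) {
            sb.append("  ".repeat(depth(d)))
              .append(d.get(NAME))
              .append(" [internal path: ")
              .append(d.getOrDefault(INTERNAL_PATH, "(not set)"))
              .append("]\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(outline(List.of(
                doc(null, "archive.zip", null),
                doc("/1", "sales.xlsx", "reports/Q1/sales.xlsx"),
                doc("/1/2", "image1.png", "xl/media/image1.png"))));
    }
}
```

The indentation comes only from the ID path, while the bracketed value comes only from the container, so the image's internal path starts again at xl/ even though it sits two levels deep.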