Embedded Document Metadata

When Tika parses container files (such as ZIP archives, emails, PDFs with attachments, or Microsoft Office documents), it extracts embedded documents recursively. Tika provides several metadata fields to help you understand and track the structure of these embedded resources.

Overview

Understanding embedded document metadata requires distinguishing between two fundamentally different types of information:

  • Containment Structure (Tika-Generated) - Metadata that Tika generates to track how documents are nested within each other. This answers questions like: "Which file contained this attachment?" and "What is the nesting depth?"

  • Container Metadata (From the File) - Metadata that comes from the container file itself, describing what the container knows about its contents. This answers questions like: "What path was this file stored at inside the archive?" and "What was the original filename?"

The distinction matters because containers often store embedded files in internal directory structures that are independent of how deeply nested the embedding is. A ZIP file preserves its original folder hierarchy; an OOXML document stores media in xl/media/ or ppt/media/; a PST file organizes emails by folder. This internal organization is separate from the question of containment.

Containment Structure (Tika-Generated)

These fields are generated by Tika during parsing to track the nesting relationships between documents. They answer: "Which document contained this one?" All fields are defined in TikaCoreProperties.

Nesting Identifiers

TikaCoreProperties.EMBEDDED_ID (X-TIKA:embedded_id)

A 1-indexed integer assigned by Tika to each embedded document during parsing. IDs are assigned in the order documents are encountered by the RecursiveParserWrapper. This ID uniquely identifies each embedded document within a single parse operation.

TikaCoreProperties.EMBEDDED_ID_PATH (X-TIKA:embedded_id_path)

A path showing the containment hierarchy using EMBEDDED_ID values. For example, /1/3 indicates that the file with EMBEDDED_ID=3 was contained within the file with EMBEDDED_ID=1. This is the most reliable field for tracking containment relationships.

This is purely about which document contains which - it tells you nothing about folder structures or original paths within the containers themselves.
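As an illustration of how an ID path encodes containment, the following Java sketch splits a value such as /1/3 into its IDs and derives the nesting depth and the direct parent. It assumes the slash-separated integer format shown in the example above; the class and method names are hypothetical and not part of Tika.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalInt;

// Hypothetical helper for illustration; not a Tika class.
final class EmbeddedIdPaths {

    // Splits "/1/3" into [1, 3]. A null or empty path (the root document) gives an empty list.
    static List<Integer> parse(String idPath) {
        List<Integer> ids = new ArrayList<>();
        if (idPath == null || idPath.isEmpty()) {
            return ids;
        }
        for (String segment : idPath.split("/")) {
            if (!segment.isEmpty()) {
                ids.add(Integer.parseInt(segment));
            }
        }
        return ids;
    }

    // Nesting depth: 0 for the root, 1 for a direct attachment, and so on.
    static int depth(String idPath) {
        return parse(idPath).size();
    }

    // EMBEDDED_ID of the direct parent, or empty when the parent is the root document.
    static OptionalInt parentId(String idPath) {
        List<Integer> ids = parse(idPath);
        return ids.size() < 2 ? OptionalInt.empty() : OptionalInt.of(ids.get(ids.size() - 2));
    }

    public static void main(String[] args) {
        System.out.println(parse("/1/3"));    // [1, 3]
        System.out.println(depth("/1/3"));    // 2
        System.out.println(parentId("/1/3")); // OptionalInt[1]
        System.out.println(parentId("/1"));   // OptionalInt.empty
    }
}
```

A document whose parent is the root has no parent ID in this sketch, because the root document does not carry an EMBEDDED_ID.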

Synthetic Paths

TikaCoreProperties.EMBEDDED_RESOURCE_PATH (X-TIKA:embedded_resource_path)

A synthetic path built by concatenating file names (from RESOURCE_NAME_KEY) at each nesting level. This provides a human-readable path through the containment hierarchy.

Do not use this field for creating directory structures to write out attachments. There may be path collisions, illegal characters, or zip slip vulnerabilities. Use EMBEDDED_ID_PATH for reliable containment tracking.

TikaCoreProperties.FINAL_EMBEDDED_RESOURCE_PATH (X-TIKA:final_embedded_resource_path)

Similar to EMBEDDED_RESOURCE_PATH, but calculated at the end of the full parse. For some parsers, an embedded file’s name isn’t known until after its child files have been parsed. This field may have fewer "unknown" file names than EMBEDDED_RESOURCE_PATH.
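One way to follow the warning above is to name extracted attachments from EMBEDDED_ID_PATH rather than from either resource path. The Java sketch below is a hypothetical helper, not a Tika API: it accepts only slash-separated digits, flattens them into a file name, and verifies that the resolved path stays inside the output directory. The embedded- prefix and .bin extension are arbitrary choices for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not a Tika class.
final class AttachmentNames {

    // Builds a flat file name such as "embedded-1-3.bin" from an ID path such as "/1/3".
    // Only '/' and digits are accepted, so no container-supplied text reaches the file system.
    static String safeName(String idPath) {
        if (idPath == null || !idPath.matches("(/\\d+)+")) {
            throw new IllegalArgumentException("Unexpected ID path: " + idPath);
        }
        return "embedded" + idPath.replace('/', '-') + ".bin";
    }

    // Resolves the name under outputDir and double-checks that it stays inside that directory.
    static Path target(Path outputDir, String idPath) {
        Path base = outputDir.toAbsolutePath().normalize();
        Path resolved = base.resolve(safeName(idPath)).normalize();
        if (!resolved.startsWith(base)) {
            throw new IllegalArgumentException("Path escapes output directory: " + resolved);
        }
        return resolved;
    }

    public static void main(String[] args) {
        System.out.println(safeName("/1/3")); // embedded-1-3.bin
        System.out.println(target(Paths.get("out"), "/1/3").getFileName());
    }
}
```

Because IDs are unique within a single parse, names built this way do not collide within that parse. The human-readable name can still be recorded separately, for example in a sidecar index, without using it as a file system path.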

Resource Naming

TikaCoreProperties.RESOURCE_NAME_KEY (X-TIKA:resourceName)

The file name (not path) of the resource. Tika makes a best effort to determine a meaningful name from the container’s metadata. When unavailable, Tika falls back to synthetic names such as embedded-1.jpeg.

In Tika 4.x, this field contains only the file name. Use INTERNAL_PATH for the full path as stored in the container.

Container Metadata (From the File)

These fields contain metadata that is stored within the container file itself. This is information the container preserves about its contents, independent of how Tika traverses the nesting structure. Unless noted otherwise, the fields below are defined in TikaCoreProperties.

Internal Paths

TikaCoreProperties.INTERNAL_PATH (X-TIKA:internalPath)

The path (including file name) as literally stored within the container. This is what the container knows about where the file lives in its internal structure:

  • In a ZIP: the entry path (e.g., reports/Q1/sales.xlsx)

  • In a PST: the folder path plus message name (e.g., Inbox/Important/Meeting notes.msg)

  • In an OOXML document: the part name (e.g., xl/media/image1.png)

This differs fundamentally from EMBEDDED_RESOURCE_PATH:

  • INTERNAL_PATH is what the container stores about the file’s location within itself

  • EMBEDDED_RESOURCE_PATH is what Tika synthesizes from the nesting structure

TikaCoreProperties.ORIGINAL_RESOURCE_NAME (X-TIKA:origResourceName)

For some file formats, the file path where the document was last saved on the creator’s system. For example, an .xlsx file named budget.xlsx may include a metadata property storing where it was last saved: C:\Users\Alice\budget.xlsx. This is not specific to embedded files - it’s a property that certain file formats preserve about themselves.

Microsoft-Specific Metadata

Microsoft Office formats use additional identifiers for embedded objects.

TikaCoreProperties.EMBEDDED_RELATIONSHIP_ID (X-TIKA:embeddedRelationshipId)

A Microsoft-specific identifier used internally to reference embedded objects within Office documents. This is the relationship ID from the Office Open XML or OLE structure.

Office.EMBEDDED_STORAGE_CLASS_ID (msoffice:embeddedStorageClassId)

A UUID that identifies the class of embedded object in Microsoft formats. While not exactly a MIME type, it provides similar information about what type of object is embedded. Defined in the Office metadata class.

Quick Reference

| Property | Metadata Key | Source |
|---|---|---|
| EMBEDDED_ID | X-TIKA:embedded_id | Containment |
| EMBEDDED_ID_PATH | X-TIKA:embedded_id_path | Containment |
| EMBEDDED_RESOURCE_PATH | X-TIKA:embedded_resource_path | Containment |
| FINAL_EMBEDDED_RESOURCE_PATH | X-TIKA:final_embedded_resource_path | Containment |
| RESOURCE_NAME_KEY | X-TIKA:resourceName | Containment |
| INTERNAL_PATH | X-TIKA:internalPath | Container |
| ORIGINAL_RESOURCE_NAME | X-TIKA:origResourceName | Container |
| EMBEDDED_RELATIONSHIP_ID | X-TIKA:embeddedRelationshipId | Container (MS) |
| Office.EMBEDDED_STORAGE_CLASS_ID | msoffice:embeddedStorageClassId | Container (MS) |

Example: Understanding the Difference

Consider a ZIP file archive.zip containing reports/Q1/sales.xlsx, where the spreadsheet itself contains an embedded image:

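| Document | Field | Value |
|---|---|---|
| Container (archive.zip) | EMBEDDED_ID | (not set - this is the root document) |
| Container (archive.zip) | EMBEDDED_ID_PATH | (not set) |
| Container (archive.zip) | INTERNAL_PATH | (not set) |
| Container (archive.zip) | RESOURCE_NAME_KEY | archive.zip |
| Container (archive.zip) | EMBEDDED_RESOURCE_PATH | (not set) |
| Spreadsheet (sales.xlsx) | EMBEDDED_ID | 1 |
| Spreadsheet (sales.xlsx) | EMBEDDED_ID_PATH | /1 |
| Spreadsheet (sales.xlsx) | INTERNAL_PATH | reports/Q1/sales.xlsx (from ZIP entry) |
| Spreadsheet (sales.xlsx) | RESOURCE_NAME_KEY | sales.xlsx |
| Spreadsheet (sales.xlsx) | EMBEDDED_RESOURCE_PATH | /sales.xlsx |
| Embedded image in spreadsheet | EMBEDDED_ID | 2 |
| Embedded image in spreadsheet | EMBEDDED_ID_PATH | /1/2 (embedded in file with ID=1) |
| Embedded image in spreadsheet | INTERNAL_PATH | xl/media/image1.png (from XLSX structure) |
| Embedded image in spreadsheet | RESOURCE_NAME_KEY | image1.png |
| Embedded image in spreadsheet | EMBEDDED_RESOURCE_PATH | /sales.xlsx/image1.png |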

Key Observations

The table above illustrates the fundamental distinction between containment tracking and container metadata:

Containment structure (Tika-generated):

  • EMBEDDED_ID_PATH /1/2 tells you that the image (ID=2) was found inside the spreadsheet (ID=1). It answers: "What contains what?"

  • EMBEDDED_RESOURCE_PATH /sales.xlsx/image1.png is synthesized from file names at each nesting level. It provides a human-readable path through the containment hierarchy.

Container metadata (from the file):

  • INTERNAL_PATH for the spreadsheet (reports/Q1/sales.xlsx) is what the ZIP file knows about where that entry was stored - its internal folder structure.

  • INTERNAL_PATH for the image (xl/media/image1.png) is what the XLSX file knows about where that media file lives - its internal OOXML part name.

Notice that INTERNAL_PATH resets at each container boundary. The image’s internal path doesn’t include reports/Q1/ because that path information belongs to the ZIP container, not the XLSX container. Each container only knows about its own internal organization.
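The reset at each container boundary can be made concrete by collecting the INTERNAL_PATH of every ancestor along an EMBEDDED_ID_PATH. The Java sketch below assumes each document's metadata has already been copied into a plain Map keyed by the metadata key strings listed in this document. The class name, the map representation, and the (unknown) placeholder are assumptions for illustration, not Tika APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not a Tika class.
final class InternalPathChain {

    static final String ID_PATH = "X-TIKA:embedded_id_path";
    static final String INTERNAL_PATH = "X-TIKA:internalPath";

    // Returns the INTERNAL_PATH of every ancestor plus the document itself, outermost first.
    static List<String> chain(List<Map<String, String>> docs, String idPath) {
        // Index internal paths by ID path; the root document has no ID path and is skipped.
        Map<String, String> internalByIdPath = new HashMap<>();
        for (Map<String, String> doc : docs) {
            String key = doc.get(ID_PATH);
            if (key != null) {
                internalByIdPath.put(key, doc.get(INTERNAL_PATH));
            }
        }
        List<String> result = new ArrayList<>();
        // Walk the prefixes of the ID path: "/1", then "/1/2", and so on.
        int slash = idPath.indexOf('/', 1);
        while (true) {
            String prefix = slash < 0 ? idPath : idPath.substring(0, slash);
            String internal = internalByIdPath.get(prefix);
            result.add(internal == null ? "(unknown)" : internal);
            if (slash < 0) {
                break;
            }
            slash = idPath.indexOf('/', slash + 1);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = List.of(
                Map.of(ID_PATH, "/1", INTERNAL_PATH, "reports/Q1/sales.xlsx"),
                Map.of(ID_PATH, "/1/2", INTERNAL_PATH, "xl/media/image1.png"));
        // Prints: reports/Q1/sales.xlsx -> xl/media/image1.png
        System.out.println(String.join(" -> ", chain(docs, "/1/2")));
    }
}
```

For the image in the example, the chain has one entry per container: the ZIP entry path for the spreadsheet, then the OOXML part name for the image. Neither entry repeats the other container's folders.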