Metadata Changes in Tika 4.x

This document details the metadata key changes in Apache Tika 4.x.

Overview

Tika 4.x adds a namespace prefix to all "user-generated" metadata keys, meaning keys whose names are taken from the parsed document rather than defined by Tika. This is a security-focused change: without a prefix, user-controlled data could overwrite existing values in the Metadata object. The prefixes also make the origin of each key explicit.

Metadata Key Changes

Category                     Change                   Details
---------------------------  -----------------------  -------------------------------------------------------------
HTML custom metadata         Prefixed with html:      Custom metadata from HTML documents now uses the html: prefix
MAPI metadata                Prefix changed to mapi:  Microsoft MAPI properties now use the mapi: prefix
Resource name                Renamed                  resourceName changed to X-TIKA:resourceName
Unrecognized image metadata  Prefixed with img:       Unrecognized image metadata keys now use the img: prefix
Office metadata              Prefix changed           Changed from meta prefix to office prefix

Migration Steps

When upgrading to Tika 4.x, you will need to update any code that references metadata keys directly:

HTML Metadata

// Before (3.x)
String value = metadata.get("custom-key");

// After (4.x)
String value = metadata.get("html:custom-key");

MAPI Metadata

// Before (3.x)
String value = metadata.get("mapi:some-property");

// After (4.x): keys use the mapi: prefix; verify each specific key name against your 3.x output
String value = metadata.get("mapi:some-property");

Resource Name

// Before (3.x)
String name = metadata.get("resourceName");

// After (4.x)
String name = metadata.get("X-TIKA:resourceName");
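
Code that has to read metadata produced by both 3.x and 4.x (for example, previously stored extraction results) may want to check the new key first and fall back to the old one. The following is a minimal sketch of that idea; the class ResourceNames and its method resolve are hypothetical names introduced here, not part of Tika, and the lookup is passed in as a plain function so the sketch has no Tika dependency.

```java
import java.util.function.Function;

// Hypothetical helper, not part of Tika: resolves the resource name
// regardless of which Tika version wrote the metadata.
final class ResourceNames {
    // Key names as given in the migration notes above.
    static final String KEY_4X = "X-TIKA:resourceName";
    static final String KEY_3X = "resourceName";

    private ResourceNames() {}

    // 'lookup' is any key-to-value function that returns null for a missing key.
    static String resolve(Function<String, String> lookup) {
        String name = lookup.apply(KEY_4X);
        // Fall back to the 3.x key only when the 4.x key is absent.
        return name != null ? name : lookup.apply(KEY_3X);
    }
}
```

With a Metadata instance, this could be called as ResourceNames.resolve(metadata::get), since the snippets above already rely on metadata.get(String).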

Image Metadata

// Before (3.x)
String value = metadata.get("unknown-image-key");

// After (4.x)
String value = metadata.get("img:unknown-image-key");

Office Metadata

// Before (3.x)
String value = metadata.get("meta:some-property");

// After (4.x)
String value = metadata.get("office:some-property");
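
For metadata that was already extracted with 3.x and stored as key/value pairs, the two renames that depend only on the key itself could be applied mechanically. The sketch below assumes that every meta: key moved to office: with the rest of the name unchanged, which the notes above suggest but do not spell out. HTML, image, and MAPI keys are left untouched because a bare key does not reveal which parser produced it. MetadataKeyMigration is a hypothetical name, not a Tika class.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika.
final class MetadataKeyMigration {
    private MetadataKeyMigration() {}

    // Maps a single 3.x key to its 4.x name where the rename is unambiguous.
    static String toTika4Key(String key) {
        if (key.equals("resourceName")) {
            return "X-TIKA:resourceName";
        }
        if (key.startsWith("meta:")) {
            // Assumption: only the prefix changes; the remainder of the key is kept.
            return "office:" + key.substring("meta:".length());
        }
        // HTML, image, and MAPI keys need parser context, so they pass through as-is.
        return key;
    }

    // Returns a copy of 'stored' with renamed keys, preserving iteration order.
    static Map<String, String> migrate(Map<String, String> stored) {
        Map<String, String> out = new LinkedHashMap<>();
        stored.forEach((k, v) -> out.put(toTika4Key(k), v));
        return out;
    }
}
```

Re-parsing the original files with 4.x remains the more reliable route when they are still available, since it also picks up the html: and img: prefixes that this sketch cannot infer.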

Rationale

The namespacing of metadata keys provides several benefits:

  • Security: Prevents user-controlled content from overwriting internal metadata

  • Clarity: Makes it clear which parser or source generated a metadata key

  • Consistency: Provides a uniform approach to metadata naming across all parsers
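
The security point can be illustrated with a simplified model. In the sketch below a plain Map stands in for the Metadata object, and Content-Type is used only as an example of a value set by the parser; the class and method names are hypothetical. It is not a reproduction of Tika's internal behavior, only a demonstration of why a prefix removes the collision.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: a plain Map stands in for Tika's Metadata object.
final class PrefixCollisionDemo {
    private PrefixCollisionDemo() {}

    // Stores a document-supplied key under its own name, with no prefix.
    static Map<String, String> withoutPrefix(String userKey, String userValue) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("Content-Type", "text/html"); // value set by the parser
        metadata.put(userKey, userValue);          // may replace the parser's value
        return metadata;
    }

    // Stores the same key under the html: prefix, so it cannot collide.
    static Map<String, String> withPrefix(String userKey, String userValue) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("Content-Type", "text/html");
        metadata.put("html:" + userKey, userValue);
        return metadata;
    }
}
```

If a document supplies a custom key named Content-Type, the unprefixed variant ends up with the document's value under Content-Type, while the prefixed variant keeps text/html there and stores the document's value under html:Content-Type.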