Setting Limits

When processing untrusted documents, it’s important to set limits on resource consumption to prevent denial-of-service attacks and protect against malicious or pathological files. Tika provides several mechanisms for limiting resource usage during parsing.

Overview

Tika 4.x provides a unified configuration system for all limits through the parse-context section of the JSON configuration file. All limits are loaded into the ParseContext and flow through the parsing pipeline.

Complete Example

Here’s a comprehensive example showing all limit configurations together. This is the same configuration tested in AllLimitsTest.java:

{
  "parsers": ["default-parser"],
  "parse-context": {
    "embedded-limits": {
      "maxDepth": 10,
      "throwOnMaxDepth": false,
      "maxCount": 1000,
      "throwOnMaxCount": false
    },
    "output-limits": {
      "writeLimit": 100000,
      "throwOnWriteLimit": false,
      "maxXmlDepth": 100,
      "maxPackageEntryDepth": 10,
      "zipBombThreshold": 1000000,
      "zipBombRatio": 100
    },
    "timeout-limits": {
      "taskTimeoutMillis": 60000
    },
    "standard-metadata-limiter-factory": {
      "maxTotalBytes": 1048576,
      "maxFieldSize": 102400,
      "maxKeySize": 1024,
      "maxValuesPerField": 100
    }
  }
}

Configuration file: tika-serialization/src/test/resources/configs/all-limits-test.json

Loading Limits

Use TikaLoader.loadParseContext() to load all configured limits into a ParseContext:

TikaLoader loader = TikaLoader.load(configPath);
ParseContext context = loader.loadParseContext();

// Limits are now available from the context
EmbeddedLimits embeddedLimits = context.get(EmbeddedLimits.class);
OutputLimits outputLimits = context.get(OutputLimits.class);
TimeoutLimits timeoutLimits = context.get(TimeoutLimits.class);

See test: tika-serialization/src/test/java/org/apache/tika/config/AllLimitsTest.java

Embedded Document Limits

The EmbeddedLimits class controls how deeply nested and how many embedded documents are processed. This is critical for protecting against "zip bomb" style attacks where documents contain deeply nested or numerous embedded files.

Configuration Options

| Setting | Default | Description |
| --- | --- | --- |
| maxDepth | -1 (unlimited) | Maximum nesting depth for embedded documents. When reached, recursion stops but siblings at the current level continue to be processed. |
| throwOnMaxDepth | false | Whether to throw an EmbeddedLimitReachedException when maxDepth is reached. If false, processing continues and X-TIKA:maxDepthReached=true is set in metadata. |
| maxCount | -1 (unlimited) | Maximum total number of embedded documents to process. When reached, processing stops immediately. |
| throwOnMaxCount | false | Whether to throw an EmbeddedLimitReachedException when maxCount is reached. If false, processing continues and X-TIKA:maxEmbeddedCountReached=true is set. |

maxDepth Behavior

When the depth limit is reached, recursion stops but siblings at the current level continue to be processed. For example, with maxDepth=1:

container.zip (depth 0)
├── doc1.docx (depth 1) ✓ PARSED
│   ├── image1.png (depth 2) ✗ NOT PARSED (exceeds maxDepth)
│   └── embed.xlsx (depth 2) ✗ NOT PARSED (exceeds maxDepth)
├── doc2.pdf (depth 1) ✓ PARSED (sibling at same level)
└── doc3.txt (depth 1) ✓ PARSED (sibling at same level)
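The pruning rule above can be sketched in plain Java. This is an illustrative model of the documented behavior, not Tika's internals; `DepthLimitSketch`, `Doc`, and `parse` are hypothetical names:

```java
import java.util.List;

// Illustrative sketch (not Tika internals): depth-limited traversal in which
// recursion stops at maxDepth, but siblings at the current level still run.
public class DepthLimitSketch {
    record Doc(String name, List<Doc> embedded) {}

    static void parse(Doc doc, int depth, int maxDepth, List<String> parsed) {
        parsed.add(doc.name());      // the current document itself is parsed
        if (depth >= maxDepth) {
            return;                  // do not recurse past maxDepth
        }
        for (Doc child : doc.embedded()) {
            parse(child, depth + 1, maxDepth, parsed);  // siblings continue
        }
    }
}
```

With maxDepth=1 and the tree from the diagram, container.zip, doc1.docx, and doc2.pdf are parsed, while image1.png at depth 2 is skipped.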

JSON Configuration

{
  "parse-context": {
    "embedded-limits": {
      "maxDepth": 5,
      "throwOnMaxDepth": true,
      "maxCount": 100,
      "throwOnMaxCount": false
    }
  }
}

Configuration file: tika-serialization/src/test/resources/configs/embedded-limits-test.json

Java API

// Create with constructor
EmbeddedLimits limits = new EmbeddedLimits(10, true, 500, false);

// Or start from defaults and use setters
limits = new EmbeddedLimits();
limits.setMaxDepth(10);
limits.setThrowOnMaxDepth(true);
limits.setMaxCount(500);
limits.setThrowOnMaxCount(false);

// Add to ParseContext
context.set(EmbeddedLimits.class, limits);

// Helper method to get limits with defaults
limits = EmbeddedLimits.get(context); // Returns defaults if not set

See test: tika-serialization/src/test/java/org/apache/tika/config/EmbeddedLimitsTest.java

Output Limits

The OutputLimits class controls limits on parsing output including text extraction and protection against zip bombs.

Configuration Options

| Setting | Default | Description |
| --- | --- | --- |
| writeLimit | -1 (unlimited) | Maximum characters of text to extract. When reached, extraction stops. |
| throwOnWriteLimit | false | Whether to throw a WriteLimitReachedException when writeLimit is reached. |
| maxXmlDepth | 100 | Maximum XML element nesting depth. Protects against XML bomb attacks. |
| maxPackageEntryDepth | 10 | Maximum depth of nested package entries (e.g., zip within zip). |
| zipBombThreshold | 1,000,000 | Minimum decompressed size (in bytes) before zip bomb detection activates. |
| zipBombRatio | 100 | Maximum ratio of decompressed to compressed size before flagging as a zip bomb. |
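The interplay of zipBombThreshold and zipBombRatio can be sketched as follows. This is an illustrative model of the documented behavior, not Tika's actual detector; `ZipBombCheck` is a hypothetical name:

```java
// Illustrative sketch (not Tika's detector): small outputs are never flagged,
// and large outputs are flagged only when they expand suspiciously.
public class ZipBombCheck {
    static boolean isSuspicious(long compressedBytes, long decompressedBytes,
                                long threshold, long ratio) {
        if (decompressedBytes < threshold) {
            return false;   // below zipBombThreshold: detection is inactive
        }
        if (compressedBytes <= 0) {
            return true;    // pathological: content from no compressed input
        }
        return decompressedBytes / compressedBytes > ratio;  // zipBombRatio
    }
}
```

For example, 1 KB expanding to 2 MB exceeds a ratio of 100 and is flagged, while 50 KB expanding to 2 MB (ratio 40) is not.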

JSON Configuration

{
  "parse-context": {
    "output-limits": {
      "writeLimit": 50000,
      "throwOnWriteLimit": true,
      "maxXmlDepth": 50,
      "maxPackageEntryDepth": 5,
      "zipBombThreshold": 500000,
      "zipBombRatio": 50
    }
  }
}

Configuration file: tika-serialization/src/test/resources/configs/output-limits-test.json

Java API

OutputLimits limits = new OutputLimits(50000, true, 50, 5, 500000, 50);
context.set(OutputLimits.class, limits);

// Helper method: returns defaults if not set
limits = OutputLimits.get(context);

See test: tika-serialization/src/test/java/org/apache/tika/config/OutputLimitsTest.java

Timeout Limits

The TimeoutLimits class controls time-based limits for parsing operations.

Configuration Options

| Setting | Default | Description |
| --- | --- | --- |
| taskTimeoutMillis | 60000 (1 minute) | Maximum time in milliseconds for a parse operation to complete. |
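The general pattern behind a taskTimeoutMillis-style limit can be sketched with java.util.concurrent. Tika enforces its timeout internally; this sketch, with the hypothetical `TimeoutSketch` class, only illustrates the mechanism of bounding a blocking task:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch: run a task on a worker thread and give up after a
// configurable number of milliseconds, interrupting the worker on timeout.
public class TimeoutSketch {
    static String runWithTimeout(Callable<String> task, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> future = pool.submit(task);
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true);     // interrupt the hung task
                return "TIMED_OUT";
            } catch (InterruptedException | ExecutionException e) {
                throw new RuntimeException(e);
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```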

JSON Configuration

{
  "parse-context": {
    "timeout-limits": {
      "taskTimeoutMillis": 120000
    }
  }
}

Configuration file: tika-serialization/src/test/resources/configs/timeout-limits-test.json

Java API

TimeoutLimits limits = new TimeoutLimits(120000);
context.set(TimeoutLimits.class, limits);

// Helper method: returns defaults if not set
limits = TimeoutLimits.get(context);

See test: tika-serialization/src/test/java/org/apache/tika/config/TimeoutLimitsTest.java

Embedded Byte Extraction Limits

When extracting embedded document bytes using ParseMode.UNPACK, the UnpackConfig class provides safety limits on total bytes extracted. This protects against zip bombs and other malicious files that may expand to enormous sizes when unpacked.

Configuration Options

| Setting | Default | Description |
| --- | --- | --- |
| maxUnpackBytes | 10 GB | Maximum total bytes to extract from all embedded documents per file. Set to -1 for unlimited (not recommended for untrusted input). |

Behavior

When the byte limit is reached:

  • Extraction stops for remaining embedded documents

  • An exception is logged but processing continues

  • Already-extracted bytes are kept

  • The parse result status is PARSE_SUCCESS_WITH_EXCEPTION

JSON Configuration

{
  "parse-context": {
    "parseMode": "UNPACK",
    "unpack-config": {
      "maxUnpackBytes": 104857600
    }
  }
}

This limits extraction to 100 MB total.
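The stop-and-keep behavior described above can be modeled as a shared byte budget. This is an illustrative sketch, not UnpackConfig's internals; `UnpackBudget` and `tryExtract` are hypothetical names:

```java
// Illustrative sketch: embedded-document extractions draw from one shared
// budget; once the limit is reached, remaining documents are skipped but
// previously extracted bytes are kept.
public class UnpackBudget {
    private long remaining;
    private boolean stopped = false;

    UnpackBudget(long maxUnpackBytes) {
        this.remaining = maxUnpackBytes;
    }

    // Returns true if this document's bytes fit the budget and are kept.
    boolean tryExtract(long documentBytes) {
        if (stopped || documentBytes > remaining) {
            stopped = true;     // limit reached: skip this and later documents
            return false;
        }
        remaining -= documentBytes;
        return true;
    }
}
```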

Java API

UnpackConfig config = new UnpackConfig();
config.setMaxUnpackBytes(100 * 1024 * 1024); // 100 MB
config.setEmitter("my-emitter");
parseContext.set(UnpackConfig.class, config);
parseContext.set(ParseMode.class, ParseMode.UNPACK);

For more details on embedded byte extraction configuration, see Extracting Embedded Bytes.

See tests: tika-pipes/tika-pipes-integration-tests/src/test/java/org/apache/tika/pipes/core/UnpackModeTest.java

Metadata Limits

The MetadataWriteLimiter system allows you to constrain metadata size at write time, ensuring parsers cannot exceed your configured limits.

How It Works

When you configure a MetadataWriteLimiterFactory in the ParseContext, calling Metadata.newInstance(parseContext) creates a Metadata object with limits already applied. All subsequent writes to that metadata object are filtered through the limiter.

// Configure the factory
StandardMetadataLimiterFactory factory = new StandardMetadataLimiterFactory();
factory.setMaxTotalBytes(1024 * 1024);  // 1 MB total
factory.setMaxFieldSize(100 * 1024);     // 100 KB per field
factory.setMaxValuesPerField(100);       // Max 100 values per multi-valued field

// Add to ParseContext
ParseContext context = new ParseContext();
context.set(MetadataWriteLimiterFactory.class, factory);

// Create limited metadata - limits are enforced from the start
Metadata metadata = Metadata.newInstance(context);

Configuration Options

| Setting | Default | Description |
| --- | --- | --- |
| maxTotalBytes | 10 MB | Maximum total estimated size of all metadata in UTF-16 bytes. When exceeded, additional metadata is silently dropped and X-TIKA:WARN:truncated_metadata is set. |
| maxFieldSize | 100 KB | Maximum size of any single field's value(s) in UTF-16 bytes. Values exceeding this limit are truncated. |
| maxKeySize | 1024 | Maximum length of metadata key names in UTF-16 bytes. Keys exceeding this limit are truncated. |
| maxValuesPerField | 10 | Maximum number of values for multi-valued fields. Additional values are dropped. |
| includeFields | empty (all) | If non-empty, only these fields are stored (plus system fields like Content-Type). Use this to extract only the metadata you need. |
| excludeFields | empty (none) | These fields are never stored, regardless of other settings. |
| includeEmpty | false | Whether to store empty or null values. |
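The write-time filtering described under "How It Works" can be modeled in miniature. This is an illustrative sketch, not Tika's StandardMetadataLimiter; `MetadataLimitSketch` is a hypothetical name, and it counts characters rather than UTF-16 bytes for simplicity:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: limits are enforced as values are written, so a parser
// can never push stored metadata past the configured bounds.
public class MetadataLimitSketch {
    private final Map<String, List<String>> fields = new HashMap<>();
    private final int maxValuesPerField;
    private final int maxFieldChars;
    private boolean truncated = false;   // analogous to X-TIKA:WARN:truncated_metadata

    MetadataLimitSketch(int maxValuesPerField, int maxFieldChars) {
        this.maxValuesPerField = maxValuesPerField;
        this.maxFieldChars = maxFieldChars;
    }

    void add(String key, String value) {
        List<String> values = fields.computeIfAbsent(key, k -> new ArrayList<>());
        if (values.size() >= maxValuesPerField) {
            truncated = true;            // extra values are dropped
            return;
        }
        if (value.length() > maxFieldChars) {
            truncated = true;            // oversized values are truncated
            value = value.substring(0, maxFieldChars);
        }
        values.add(value);
    }

    List<String> get(String key) {
        return fields.getOrDefault(key, List.of());
    }

    boolean wasTruncated() {
        return truncated;
    }
}
```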

JSON Configuration

{
  "parsers": ["default-parser"],
  "parse-context": {
    "standard-metadata-limiter-factory": {
      "maxTotalBytes": 1048576,
      "maxFieldSize": 102400,
      "maxKeySize": 1024,
      "maxValuesPerField": 100,
      "includeFields": ["dc:title", "dc:creator", "dc:subject"],
      "excludeFields": ["pdf:unmappedUnicodeCharsPerPage"]
    }
  }
}

Always-Included Fields

Certain fields are critical for Tika’s operation and are always allowed, regardless of includeFields or size limits:

  • Content-Type - Required for parser selection

  • Content-Length, Content-Encoding, Content-Disposition

  • X-TIKA:content - The extracted text content

  • X-TIKA:Parsed-By - Parser chain information

  • X-TIKA:WARN:* - Warning metadata

  • Access permission fields

Detecting Truncation

When metadata is truncated due to limits, Tika sets the metadata field X-TIKA:WARN:truncated_metadata to true. You can check for this in your code:

if ("true".equals(metadata.get(TikaCoreProperties.TRUNCATED_METADATA))) {
    // Some metadata was dropped or truncated
    log.warn("Metadata was truncated for: " + resourceName);
}

Recommendations

  1. Always set limits when processing untrusted content

  2. Use includeFields to capture only the metadata you need

  3. Monitor for truncation by checking X-TIKA:WARN:truncated_metadata

  4. Combine with process isolation - limits protect against memory issues, but process isolation protects against crashes

  5. Test with adversarial files - use Tika’s MockParser to simulate extreme cases

See Also