Setting Limits
When processing untrusted documents, it’s important to set limits on resource consumption to prevent denial-of-service attacks and protect against malicious or pathological files. Tika provides several mechanisms for limiting resource usage during parsing.
Overview
Tika 4.x provides a unified configuration system for all limits through the parse-context
section of the JSON configuration file. All limits are loaded into the ParseContext and
flow through the parsing pipeline.
Complete Example
Here’s a comprehensive example showing all limit configurations together.
This is the same configuration tested in AllLimitsTest.java:
{
  "parsers": ["default-parser"],
  "parse-context": {
    "embedded-limits": {
      "maxDepth": 10,
      "throwOnMaxDepth": false,
      "maxCount": 1000,
      "throwOnMaxCount": false
    },
    "output-limits": {
      "writeLimit": 100000,
      "throwOnWriteLimit": false,
      "maxXmlDepth": 100,
      "maxPackageEntryDepth": 10,
      "zipBombThreshold": 1000000,
      "zipBombRatio": 100
    },
    "timeout-limits": {
      "taskTimeoutMillis": 60000
    },
    "standard-metadata-limiter-factory": {
      "maxTotalBytes": 1048576,
      "maxFieldSize": 102400,
      "maxKeySize": 1024,
      "maxValuesPerField": 100
    }
  }
}
Configuration file: tika-serialization/src/test/resources/configs/all-limits-test.json
Loading Limits
Use TikaLoader.loadParseContext() to load all configured limits into a ParseContext:
TikaLoader loader = TikaLoader.load(configPath);
ParseContext context = loader.loadParseContext();
// Limits are now available from the context
EmbeddedLimits embeddedLimits = context.get(EmbeddedLimits.class);
OutputLimits outputLimits = context.get(OutputLimits.class);
TimeoutLimits timeoutLimits = context.get(TimeoutLimits.class);
See test: tika-serialization/src/test/java/org/apache/tika/config/AllLimitsTest.java
Embedded Document Limits
The EmbeddedLimits class controls how deeply nested and how many embedded documents
are processed. This is critical for protecting against "zip bomb" style attacks where
documents contain deeply nested or numerous embedded files.
Configuration Options
| Setting | Default | Description |
|---|---|---|
| maxDepth | -1 (unlimited) | Maximum nesting depth for embedded documents. When reached, recursion stops, but siblings at the current level continue to be processed. |
| throwOnMaxDepth | false | Whether to throw an exception when maxDepth is reached, instead of silently stopping recursion. |
| maxCount | -1 (unlimited) | Maximum total number of embedded documents to process. When reached, processing stops immediately. |
| throwOnMaxCount | false | Whether to throw an exception when maxCount is reached, instead of silently stopping. |
maxDepth Behavior
When the depth limit is reached, recursion stops but siblings at the current level
continue to be processed. For example, with maxDepth=1:
container.zip (depth 0)
├── doc1.docx (depth 1) ✓ PARSED
│ ├── image1.png (depth 2) ✗ NOT PARSED (exceeds maxDepth)
│ └── embed.xlsx (depth 2) ✗ NOT PARSED (exceeds maxDepth)
├── doc2.pdf (depth 1) ✓ PARSED (sibling at same level)
└── doc3.txt (depth 1) ✓ PARSED (sibling at same level)
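The semantics above can be sketched in plain Java. This is an illustrative model of the behavior, not Tika's implementation; the Node class and parse method are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the maxDepth semantics: recursion stops past the limit,
// but siblings at the limit depth are still parsed.
public class DepthLimitSketch {
    static class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    // Collects the names of nodes that would be parsed with the given maxDepth.
    static List<String> parse(Node node, int depth, int maxDepth, List<String> parsed) {
        parsed.add(node.name);                  // this node is parsed
        if (maxDepth >= 0 && depth >= maxDepth) {
            return parsed;                      // do not recurse further
        }
        for (Node child : node.children) {      // siblings keep processing
            parse(child, depth + 1, maxDepth, parsed);
        }
        return parsed;
    }
}
```

With maxDepth=1, doc1.docx and doc2.pdf are collected but image1.png (depth 2) is not, matching the tree above.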
JSON Configuration
{
  "parse-context": {
    "embedded-limits": {
      "maxDepth": 5,
      "throwOnMaxDepth": true,
      "maxCount": 100,
      "throwOnMaxCount": false
    }
  }
}
Configuration file: tika-serialization/src/test/resources/configs/embedded-limits-test.json
Java API
// Create with constructor
EmbeddedLimits limits = new EmbeddedLimits(10, true, 500, false);

// Or create with the default constructor and use setters
EmbeddedLimits limits = new EmbeddedLimits();
limits.setMaxDepth(10);
limits.setThrowOnMaxDepth(true);
limits.setMaxCount(500);
limits.setThrowOnMaxCount(false);

// Add to ParseContext
context.set(EmbeddedLimits.class, limits);

// Helper method to get limits; returns defaults if not set
EmbeddedLimits configured = EmbeddedLimits.get(context);
See test: tika-serialization/src/test/java/org/apache/tika/config/EmbeddedLimitsTest.java
Output Limits
The OutputLimits class controls limits on parsing output including text extraction
and protection against zip bombs.
Configuration Options
| Setting | Default | Description |
|---|---|---|
| writeLimit | -1 (unlimited) | Maximum characters of text to extract. When reached, extraction stops. |
| throwOnWriteLimit | false | Whether to throw an exception when the write limit is reached, instead of silently stopping extraction. |
| maxXmlDepth | 100 | Maximum XML element nesting depth. Protects against XML bomb attacks. |
| maxPackageEntryDepth | 10 | Maximum depth of nested package entries (e.g., a zip within a zip). |
| zipBombThreshold | 1,000,000 | Minimum decompressed size (in bytes) before zip bomb detection activates. |
| zipBombRatio | 100 | Maximum ratio of decompressed to compressed size before flagging as a zip bomb. |
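The interaction between zipBombThreshold and zipBombRatio can be sketched as follows. This is a plain-Java illustration of the two-condition check described in the table, not Tika's internal detector:

```java
// Sketch of threshold + ratio zip bomb detection: a stream is only
// flagged once it is both large enough (threshold) and suspiciously
// compressible (ratio).
public class ZipBombCheck {
    static boolean isSuspicious(long decompressedBytes, long compressedBytes,
                                long threshold, long maxRatio) {
        if (decompressedBytes < threshold) {
            return false; // too small for detection to activate
        }
        if (compressedBytes <= 0) {
            return true;  // enormous output from no measurable input
        }
        // flag when the expansion ratio exceeds the configured maximum
        return decompressedBytes / compressedBytes > maxRatio;
    }
}
```

The threshold prevents false positives on small, highly compressible files (an empty spreadsheet, say), while the ratio catches files that expand far beyond their compressed size.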
Timeout Limits
The TimeoutLimits class controls time-based limits for parsing operations.
Configuration Options
| Setting | Default | Description |
|---|---|---|
| taskTimeoutMillis | 60000 (1 minute) | Maximum time in milliseconds for a parse operation to complete. |
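A taskTimeoutMillis-style limit is typically enforced by running the work on a separate thread and bounding the wait. A minimal sketch using java.util.concurrent (illustrative only, not Tika's internal mechanism):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutSketch {
    // Runs a task with a wall-clock budget; cancels it if the budget is exceeded.
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker thread
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Note that a parser stuck in native code or an uninterruptible loop may ignore the interrupt, which is why process isolation (see Recommendations below) is still advised for untrusted input.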
Embedded Byte Extraction Limits
When extracting embedded document bytes using ParseMode.UNPACK, the UnpackConfig class
provides safety limits on total bytes extracted. This protects against zip bombs and other
malicious files that may expand to enormous sizes when unpacked.
Configuration Options
| Setting | Default | Description |
|---|---|---|
| maxUnpackBytes | 10 GB | Maximum total bytes to extract from all embedded documents per file. Set to -1 for unlimited (not recommended for untrusted input). |
Behavior
When the byte limit is reached:
- Extraction stops for remaining embedded documents
- An exception is logged, but processing continues
- Already-extracted bytes are kept
- The parse result status is PARSE_SUCCESS_WITH_EXCEPTION
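That cumulative budget can be sketched as a counter checked before each embedded document is copied out. This is an illustrative model of the behavior listed above, not the UnpackConfig implementation:

```java
// Sketch of a cumulative extraction budget: once the limit is hit,
// remaining documents are skipped, but bytes already extracted are kept.
public class ByteBudget {
    private final long maxBytes;   // -1 means unlimited
    private long used;
    private boolean exceeded;

    ByteBudget(long maxBytes) { this.maxBytes = maxBytes; }

    /** Returns true if an embedded document of the given size may still be extracted. */
    boolean tryConsume(long size) {
        if (maxBytes >= 0 && used + size > maxBytes) {
            exceeded = true;   // skip this and remaining documents
            return false;
        }
        used += size;          // already-extracted bytes are kept
        return true;
    }

    /** Corresponds to reporting PARSE_SUCCESS_WITH_EXCEPTION rather than failing outright. */
    boolean limitExceeded() { return exceeded; }
}
```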
JSON Configuration
{
  "parse-context": {
    "parseMode": "UNPACK",
    "unpack-config": {
      "maxUnpackBytes": 104857600
    }
  }
}
This limits extraction to 100 MB total.
Java API
UnpackConfig config = new UnpackConfig();
config.setMaxUnpackBytes(100 * 1024 * 1024); // 100 MB
config.setEmitter("my-emitter");
parseContext.set(UnpackConfig.class, config);
parseContext.set(ParseMode.class, ParseMode.UNPACK);
For more details on embedded byte extraction configuration, see Extracting Embedded Bytes.
See tests: tika-pipes/tika-pipes-integration-tests/src/test/java/org/apache/tika/pipes/core/UnpackModeTest.java
Metadata Limits
The MetadataWriteLimiter system allows you to constrain metadata size at write time,
ensuring parsers cannot exceed your configured limits.
How It Works
When you configure a MetadataWriteLimiterFactory in the ParseContext, calling
Metadata.newInstance(parseContext) creates a Metadata object with limits already applied.
All subsequent writes to that metadata object are filtered through the limiter.
// Configure the factory
StandardMetadataLimiterFactory factory = new StandardMetadataLimiterFactory();
factory.setMaxTotalBytes(1024 * 1024); // 1 MB total
factory.setMaxFieldSize(100 * 1024); // 100 KB per field
factory.setMaxValuesPerField(100); // Max 100 values per multi-valued field
// Add to ParseContext
ParseContext context = new ParseContext();
context.set(MetadataWriteLimiterFactory.class, factory);
// Create limited metadata - limits are enforced from the start
Metadata metadata = Metadata.newInstance(context);
Configuration Options
| Setting | Default | Description |
|---|---|---|
| maxTotalBytes | 10 MB | Maximum total estimated size of all metadata in UTF-16 bytes. When exceeded, additional metadata is silently dropped and X-TIKA:WARN:truncated_metadata is set. |
| maxFieldSize | 100 KB | Maximum size of any single field's value(s) in UTF-16 bytes. Values exceeding this limit are truncated. |
| maxKeySize | 1024 | Maximum length of metadata key names in UTF-16 bytes. Keys exceeding this limit are truncated. |
| maxValuesPerField | 10 | Maximum number of values for multi-valued fields. Additional values are dropped. |
| includeFields | empty (all) | If non-empty, only these fields are stored (plus system fields like Content-Type). Use this to extract only the metadata you need. |
| excludeFields | empty (none) | These fields are never stored, regardless of other settings. |
| includeEmpty | false | Whether to store empty or null values. |
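The size limits above are expressed in UTF-16 bytes because Java strings are sequences of UTF-16 code units. A sketch of how such an estimate and truncation might work (illustrative helper names, not Tika API):

```java
// Sketch of UTF-16 size estimation and byte-bounded truncation,
// as used conceptually by maxFieldSize and maxKeySize.
public class Utf16Limit {
    /** Estimated UTF-16 size: each Java char is one 2-byte code unit. */
    static long utf16Bytes(String s) {
        return 2L * s.length();
    }

    /** Truncates a value so its UTF-16 estimate fits within maxBytes. */
    static String truncateToBytes(String value, long maxBytes) {
        long maxChars = maxBytes / 2;
        if (value.length() <= maxChars) {
            return value;
        }
        return value.substring(0, (int) maxChars);
    }
}
```

Note this byte count is an estimate of in-memory size, not the length of the string when serialized as UTF-8 JSON.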
JSON Configuration
{
  "parsers": ["default-parser"],
  "parse-context": {
    "standard-metadata-limiter-factory": {
      "maxTotalBytes": 1048576,
      "maxFieldSize": 102400,
      "maxKeySize": 1024,
      "maxValuesPerField": 100,
      "includeFields": ["dc:title", "dc:creator", "dc:subject"],
      "excludeFields": ["pdf:unmappedUnicodeCharsPerPage"]
    }
  }
}
Always-Included Fields
Certain fields are critical for Tika’s operation and are always allowed, regardless
of includeFields or size limits:
- Content-Type - required for parser selection
- Content-Length, Content-Encoding, Content-Disposition
- X-TIKA:content - the extracted text content
- X-TIKA:Parsed-By - parser chain information
- X-TIKA:WARN:* - warning metadata
- Access permission fields
Detecting Truncation
When metadata is truncated due to limits, Tika sets the metadata field
X-TIKA:WARN:truncated_metadata to true. You can check for this in your code:
if ("true".equals(metadata.get(TikaCoreProperties.TRUNCATED_METADATA))) {
    // Some metadata was dropped or truncated
    log.warn("Metadata was truncated for: " + resourceName);
}
Recommendations
- Always set limits when processing untrusted content
- Use includeFields to capture only the metadata you need
- Monitor for truncation by checking X-TIKA:WARN:truncated_metadata
- Combine with process isolation - limits protect against memory issues, but process isolation protects against crashes
- Test with adversarial files - use Tika's MockParser to simulate extreme cases
See Also
- Robustness - process isolation and fault tolerance
- Configuration - general Tika configuration
- Extracting Embedded Bytes - UnpackConfig for byte extraction