Spooling in Apache Tika
1. Background
1.1. What is Spooling?
Spooling refers to the process of writing an input stream to a temporary file on disk. This benefits certain file formats that can be processed more efficiently with random access to the underlying bytes during detection or parsing.
1.2. Why Some Formats Benefit from Random Access
Several file formats are processed more efficiently with random access than with streaming:
- OLE2 (Microsoft Office legacy formats): The POI library benefits from reading the file as a random-access structure to navigate the OLE2 container.
- ZIP-based formats: Container detection benefits from reading the ZIP central directory, which is located at the end of the file. Parsing also benefits from random access.
- Binary Property Lists (bplist): Apple’s binary plist format benefits from random access for efficient parsing.
- PDF: While detection works via magic bytes, parsing benefits from random access for the PDF cross-reference table.
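To illustrate why the ZIP case favors random access, here is a minimal sketch that seeks to the tail of a file and reads the entry count from the end-of-central-directory record. It uses only the Java standard library and is not Tika's detector; the class name ZipTailSketch and the method countEntries are hypothetical, and ZIP64 archives are not handled.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

public class ZipTailSketch {
    // The end-of-central-directory (EOCD) record is at least 22 bytes long
    // and may be followed by a comment of up to 65535 bytes.
    private static final int EOCD_MIN = 22;
    private static final int MAX_COMMENT = 65535;

    /** Returns the total entry count from the EOCD record, or -1 if none is found. */
    static int countEntries(Path zip) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(zip.toFile(), "r")) {
            long len = raf.length();
            int tailLen = (int) Math.min(len, EOCD_MIN + MAX_COMMENT);
            byte[] tail = new byte[tailLen];
            raf.seek(len - tailLen);          // jump straight to the tail
            raf.readFully(tail);
            // Scan backwards for the EOCD signature bytes "PK\5\6".
            for (int i = tailLen - EOCD_MIN; i >= 0; i--) {
                if (tail[i] == 0x50 && tail[i + 1] == 0x4b
                        && tail[i + 2] == 0x05 && tail[i + 3] == 0x06) {
                    // Total entry count: 2 bytes, little-endian, at offset 10.
                    return (tail[i + 10] & 0xff) | ((tail[i + 11] & 0xff) << 8);
                }
            }
            return -1;
        }
    }
}
```

With a plain InputStream, reaching the same bytes would require reading the entire file first, which is the cost that spooling to a file avoids.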
1.3. Architectural Decision: Decentralized Spooling
1.3.1. The Solution: Let Components Self-Spool
The current architecture follows a simple principle: each component that needs random access is responsible for obtaining it.
When a detector or parser needs random access, it calls:
Path path = TikaInputStream.get(inputStream).getPath();
// or
File file = TikaInputStream.get(inputStream).getFile();
TikaInputStream handles the spooling transparently based on how it was initialized:
- Initialized with Path: The file is used directly for random access. No spooling needed.
- Initialized with byte[]: The bytes are kept in memory. Spooling only on demand.
- Initialized with InputStream: When getPath() or getFile() is called, the stream is dynamically buffered to memory first, then spills to a temporary file after a threshold. The temporary file is automatically cleaned up when the stream is closed.
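The "buffer in memory, then spill to disk after a threshold" behavior described for the InputStream case can be sketched with the standard library alone. This is an illustration of the general technique, not TikaInputStream's implementation; the class SpillSketch, its Spooled holder, and the threshold parameter are all hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class SpillSketch {
    /** Holds either in-memory bytes or a temp-file path; exactly one is non-null. */
    static final class Spooled {
        final byte[] bytes;
        final Path file;

        Spooled(byte[] bytes, Path file) {
            this.bytes = bytes;
            this.file = file;
        }
    }

    /** Reads the stream fully, spilling to a temp file once more than threshold bytes are buffered. */
    static Spooled read(InputStream in, int threshold) throws IOException {
        ByteArrayOutputStream mem = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            mem.write(buf, 0, n);
            if (mem.size() > threshold) {
                // Threshold exceeded: flush the buffered bytes and the rest of the stream to disk.
                Path tmp = Files.createTempFile("spill-sketch-", ".tmp");
                try (OutputStream out = Files.newOutputStream(tmp)) {
                    mem.writeTo(out);
                    in.transferTo(out);
                }
                return new Spooled(null, tmp);
            }
        }
        // Small input: everything stayed in memory.
        return new Spooled(mem.toByteArray(), null);
    }
}
```

In this sketch the caller is responsible for deleting the temporary file when it is no longer needed.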
1.3.2. Benefits of Decentralized Spooling
- Efficiency: Spooling happens only when actually needed, not preemptively.
- Simplicity: No central configuration of "which types need spooling."
- Correctness: Each component knows its own requirements.
- Flexibility: New formats can be added without modifying central spooling logic.
1.4. TikaInputStream Backing Strategies
TikaInputStream uses configurable backing strategies that handle caching and temporary
file management. This means:
- Repeated calls to getFile() return the same temporary file (no re-spooling).
- The rewind() method efficiently resets the stream for re-reading.
- Memory-mapped and disk-backed strategies can be selected based on use case.
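The first point, that repeated requests for the file return the same temporary file, amounts to lazy spooling with a cached result. A minimal standard-library sketch of that idea follows; LazySpoolSketch is a hypothetical class written for illustration and does not reflect how TikaInputStream's backing strategies are actually implemented.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class LazySpoolSketch implements Closeable {
    private final InputStream source;
    private Path spooled; // null until the first getPath() call

    public LazySpoolSketch(InputStream source) {
        this.source = source;
    }

    /** Spools the stream to a temp file on first use; later calls reuse the same file. */
    public synchronized Path getPath() throws IOException {
        if (spooled == null) {
            Path tmp = Files.createTempFile("lazy-spool-", ".tmp");
            // createTempFile already created the file, so overwrite it.
            Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING);
            spooled = tmp;
        }
        return spooled;
    }

    /** Closes the source stream and removes the temp file, if one was created. */
    @Override
    public synchronized void close() throws IOException {
        try {
            source.close();
        } finally {
            if (spooled != null) {
                Files.deleteIfExists(spooled);
                spooled = null;
            }
        }
    }
}
```

Tying the temporary file's lifetime to close() is what lets callers rely on try-with-resources for cleanup.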
2. User Guide
2.1. Default Behavior
By default, Tika handles spooling automatically. You don’t need to configure anything for most use cases. When a detector or parser benefits from random access to a file, it will spool the input stream to a temporary file if necessary.
2.2. SpoolingStrategy for Fine-Grained Control
For advanced use cases, you can use SpoolingStrategy to control spooling behavior.
This is useful when you want to:
- Restrict which file types are allowed to spool (e.g., for performance reasons)
- Customize spooling behavior based on metadata or stream properties
2.2.1. Programmatic Configuration
import java.util.Set;
import org.apache.tika.io.SpoolingStrategy;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
// Create a custom spooling strategy
SpoolingStrategy strategy = new SpoolingStrategy();
strategy.setSpoolTypes(Set.of(
        MediaType.application("zip"),
        MediaType.application("pdf")
));
// Add to parse context
ParseContext context = new ParseContext();
context.set(SpoolingStrategy.class, strategy);
// Parse with the custom context (parser, inputStream, handler, and metadata are created elsewhere)
parser.parse(inputStream, handler, metadata, context);
2.2.2. SpoolingStrategy Methods
// Check if spooling should occur for a given type
boolean shouldSpool(TikaInputStream tis, Metadata metadata, MediaType mediaType)
// Configure which types should be spooled
void setSpoolTypes(Set<MediaType> types)
// Set the media type registry for specialization checking
void setMediaTypeRegistry(MediaTypeRegistry registry)
2.2.3. How Type Matching Works
The shouldSpool() method returns true if:
- The stream doesn’t already have a backing file (tis.hasFile() is false), AND
- The media type matches one of the configured spool types
Type matching considers:
- Exact matches (e.g., application/zip)
- Base type matches (e.g., application/zip matches application/zip; charset=utf-8)
- Specializations (e.g., application/vnd.oasis.opendocument.text is a specialization of application/zip)
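The decision rule above can be modeled in a few lines. The sketch below uses plain strings and a hypothetical one-entry supertype table in place of Tika's MediaType and MediaTypeRegistry, so it only approximates the described logic; SpoolMatchSketch and its methods are illustrative names, not part of Tika.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class SpoolMatchSketch {
    // Hypothetical supertype table standing in for a media type registry.
    static final Map<String, String> PARENT = Map.of(
            "application/vnd.oasis.opendocument.text", "application/zip");

    /** Strips parameters such as "; charset=utf-8" and normalizes case. */
    static String baseType(String type) {
        int semi = type.indexOf(';');
        String base = semi < 0 ? type : type.substring(0, semi);
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** True if the base type, or any of its supertypes, is a configured spool type. */
    static boolean matches(String type, Set<String> spoolTypes) {
        for (String t = baseType(type); t != null; t = PARENT.get(t)) {
            if (spoolTypes.contains(t)) {
                return true;
            }
        }
        return false;
    }

    /** Spool only when there is no backing file yet and the type matches. */
    static boolean shouldSpool(boolean hasFile, String type, Set<String> spoolTypes) {
        return !hasFile && matches(type, spoolTypes);
    }
}
```

For example, with spool types containing only application/zip, this sketch accepts application/zip, application/zip; charset=utf-8, and application/vnd.oasis.opendocument.text, and rejects any stream that already has a backing file.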
2.3. JSON Configuration
SpoolingStrategy can be configured via JSON in your tika-config.json file.
Place the configuration in the parse-context section:
{
  "parse-context": {
    "spooling-strategy": {
      "spoolTypes": [
        "application/zip",
        "application/x-tika-msoffice",
        "application/pdf"
      ]
    }
  }
}
Load the configuration using TikaLoader:
TikaLoader loader = TikaLoader.load(Path.of("tika-config.json"));
ParseContext context = loader.loadParseContext();
// SpoolingStrategy is automatically loaded into the ParseContext
2.4. Best Practices
- Let Tika handle it: For most applications, the default behavior is optimal. Don’t configure spooling unless you have a specific need.
- Use TikaInputStream with Path or byte[]: When you have a file, pass the Path directly to TikaInputStream.get(Path) rather than wrapping a FileInputStream. Similarly, pass byte[] directly rather than wrapping a ByteArrayInputStream. This allows TikaInputStream to use efficient backing strategies that avoid unnecessary copying or spooling:
  // Good: TikaInputStream knows it has a file, can use random access directly
  TikaInputStream tis = TikaInputStream.get(path);
  // Bad: TikaInputStream sees an opaque stream, may spool unnecessarily
  TikaInputStream tis = TikaInputStream.get(new FileInputStream(file));
  // Good: TikaInputStream knows it has bytes in memory
  TikaInputStream tis = TikaInputStream.get(bytes);
  // Bad: TikaInputStream sees an opaque stream
  TikaInputStream tis = TikaInputStream.get(new ByteArrayInputStream(bytes));
- Close streams properly: Use try-with-resources to ensure temporary files are cleaned up:
  try (TikaInputStream tis = TikaInputStream.get(inputStream)) {
      parser.parse(tis, handler, metadata, context);
  }
- Consider memory vs. disk tradeoffs: For very large files, spooling to disk may be needed. For small files processed in bulk, keeping data in memory may be faster. TikaInputStream backing strategies can be tuned for your workload.