Spooling in Apache Tika

1. Background

1.1. What is Spooling?

Spooling refers to the process of writing an input stream to a temporary file on disk. This benefits certain file formats that can be processed more efficiently with random access to the underlying bytes during detection or parsing.
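
As a rough illustration of the idea, the sketch below uses only the JDK to copy a stream to a temporary file and then read a byte range near the end without consuming the bytes before it. It is not Tika's implementation; the class and method names (SpoolSketch, spool, readAt) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

final class SpoolSketch {

    /** Copies the whole stream to a new temporary file and returns its path. */
    static Path spool(InputStream in) {
        try {
            Path tmp = Files.createTempFile("spool-sketch-", ".tmp");
            // createTempFile already created the file, so the copy must replace it.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads up to {@code length} bytes starting at {@code offset}. */
    static byte[] readAt(Path file, long offset, int length) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(length);
            long pos = offset;
            while (buf.hasRemaining()) {
                int n = ch.read(buf, pos); // positional read, no sequential scan
                if (n < 0) {
                    break; // reached end of file
                }
                pos += n;
            }
            return Arrays.copyOf(buf.array(), buf.position());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "header|body|trailer".getBytes(StandardCharsets.US_ASCII);
        Path tmp = spool(new ByteArrayInputStream(data));
        try {
            byte[] tail = readAt(tmp, Files.size(tmp) - 7, 7);
            System.out.println(new String(tail, StandardCharsets.US_ASCII)); // prints "trailer"
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The cost of the copy is paid once; after that, any offset in the file can be read directly.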

1.2. Why Some Formats Benefit from Random Access

Several file formats are processed more efficiently with random access than with sequential streaming:

  • OLE2 (Microsoft Office legacy formats): The POI library benefits from reading the file as a random-access structure to navigate the OLE2 container.

  • ZIP-based formats: Container detection benefits from reading the ZIP central directory, which is located at the end of the file. Parsing also benefits from random access.

  • Binary Property Lists (bplist): Apple’s binary plist format benefits from random access for efficient parsing.

  • PDF: Detection works via magic bytes, but parsing benefits from random access because the cross-reference table, found near the end of the file, locates objects by byte offset.
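
To make the ZIP case concrete, here is a minimal sketch (plain JDK, not Tika's detector) that seeks to the end of a file, finds the end-of-central-directory record, and reads the entry count stored there. The names ZipTailSketch, countEntries, and writeSampleZip are hypothetical, and the sketch assumes a plain ZIP without ZIP64 extensions.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class ZipTailSketch {

    private static final int EOCD_MIN_SIZE = 22;   // fixed part of the end-of-central-directory record
    private static final int MAX_COMMENT = 0xFFFF; // the trailing archive comment is at most 65535 bytes

    /** Returns the entry count stored in the EOCD record, or -1 if no record is found. */
    static int countEntries(Path zip) {
        try (RandomAccessFile raf = new RandomAccessFile(zip.toFile(), "r")) {
            long size = raf.length();
            if (size < EOCD_MIN_SIZE) {
                return -1;
            }
            int tailLen = (int) Math.min(size, EOCD_MIN_SIZE + MAX_COMMENT);
            byte[] tail = new byte[tailLen];
            raf.seek(size - tailLen); // jump to the end; the bulk of the file is never read
            raf.readFully(tail);
            // Scan backwards for the signature bytes 'P' 'K' 0x05 0x06.
            for (int i = tailLen - EOCD_MIN_SIZE; i >= 0; i--) {
                if (tail[i] == 0x50 && tail[i + 1] == 0x4B
                        && tail[i + 2] == 0x05 && tail[i + 3] == 0x06) {
                    // Total number of entries: 2 bytes, little-endian, at offset 10.
                    return (tail[i + 10] & 0xFF) | ((tail[i + 11] & 0xFF) << 8);
                }
            }
            return -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a small ZIP file with the given number of entries (use at least one). */
    static Path writeSampleZip(int entries) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
                for (int i = 0; i < entries; i++) {
                    zos.putNextEntry(new ZipEntry("file" + i + ".txt"));
                    zos.write(("content " + i).getBytes(StandardCharsets.UTF_8));
                    zos.closeEntry();
                }
            }
            Path tmp = Files.createTempFile("zip-tail-", ".zip");
            Files.write(tmp, bytes.toByteArray());
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With only a forward stream, the same information would require reading every entry first, which is why a file on disk helps here.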

1.3. Architectural Decision: Decentralized Spooling

1.3.1. The Solution: Let Components Self-Spool

The current architecture follows a simple principle: each component that needs random access is responsible for obtaining it.

When a detector or parser needs random access, it calls:

Path path = TikaInputStream.get(inputStream).getPath();
// or
File file = TikaInputStream.get(inputStream).getFile();

TikaInputStream handles the spooling transparently based on how it was initialized:

  • Initialized with Path: The file is used directly for random access. No spooling needed.

  • Initialized with byte[]: The bytes are kept in memory and are written to a temporary file only on demand.

  • Initialized with InputStream: When getPath() or getFile() is called, the stream is buffered in memory first and spills to a temporary file once a threshold is exceeded. The temporary file is automatically cleaned up when the stream is closed.
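
The "memory first, then spill" behavior can be sketched with the JDK alone. The following is an illustration under assumptions, not TikaInputStream's actual code: the names SpillSketch, Buffered, and buffer are hypothetical, and the threshold is a plain parameter chosen by the caller.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class SpillSketch {

    /** Result of buffering: exactly one of {@code bytes} or {@code file} is non-null. */
    static final class Buffered {
        final byte[] bytes;
        final Path file;

        Buffered(byte[] bytes, Path file) {
            this.bytes = bytes;
            this.file = file;
        }

        boolean onDisk() {
            return file != null;
        }
    }

    /** Buffers the stream in memory, switching to a temporary file past {@code threshold} bytes. */
    static Buffered buffer(InputStream in, int threshold) {
        try {
            ByteArrayOutputStream memory = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                if (memory.size() + n > threshold) {
                    // Threshold exceeded: move what we have to disk, then stream the rest.
                    Path tmp = Files.createTempFile("spill-sketch-", ".tmp");
                    try (OutputStream out = Files.newOutputStream(tmp)) {
                        memory.writeTo(out);
                        out.write(chunk, 0, n);
                        in.transferTo(out); // requires Java 9 or later
                    }
                    return new Buffered(null, tmp);
                }
                memory.write(chunk, 0, n);
            }
            return new Buffered(memory.toByteArray(), null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Buffered small = buffer(new ByteArrayInputStream(new byte[100]), 1024);
        Buffered big = buffer(new ByteArrayInputStream(new byte[5000]), 1024);
        System.out.println(small.onDisk() + " " + big.onDisk()); // prints "false true"
        big.file.toFile().delete();
    }
}
```

Small inputs never touch the disk, while large ones avoid holding the whole payload on the heap.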

1.3.2. Benefits of Decentralized Spooling

  1. Efficiency: Spooling happens only when actually needed, not preemptively.

  2. Simplicity: No central configuration of "which types need spooling."

  3. Correctness: Each component knows its own requirements.

  4. Flexibility: New formats can be added without modifying central spooling logic.

1.4. TikaInputStream Backing Strategies

TikaInputStream uses configurable backing strategies that handle caching and temporary file management. This means:

  • Repeated calls to getFile() return the same temporary file (no re-spooling).

  • The rewind() method efficiently resets the stream for re-reading.

  • Memory-mapped and disk-backed strategies can be selected based on use case.
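
The first point, that repeated requests reuse one temporary file, amounts to memoizing the spooled path and deleting it on close. A minimal JDK-only sketch of that pattern follows; CachedSpoolSketch is a hypothetical name and the class does not reflect how Tika's backing strategies are actually structured.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class CachedSpoolSketch implements AutoCloseable {

    private final InputStream source;
    private Path spooled; // null until a path is first requested

    CachedSpoolSketch(InputStream source) {
        this.source = source;
    }

    /** Spools on the first call; later calls return the same path without copying again. */
    Path getPath() {
        if (spooled == null) {
            try {
                Path tmp = Files.createTempFile("cached-spool-", ".tmp");
                Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING);
                spooled = tmp;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return spooled;
    }

    /** Closes the source stream and removes the temporary file, if one was created. */
    @Override
    public void close() {
        try {
            source.close();
            if (spooled != null) {
                Files.deleteIfExists(spooled);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the class implements AutoCloseable, wrapping it in try-with-resources is enough to guarantee cleanup.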

2. User Guide

2.1. Default Behavior

By default, Tika handles spooling automatically. You don’t need to configure anything for most use cases. When a detector or parser benefits from random access to a file, it will spool the input stream to a temporary file if necessary.

2.2. SpoolingStrategy for Fine-Grained Control

For advanced use cases, you can use SpoolingStrategy to control spooling behavior. This is useful when you want to:

  • Restrict which file types are allowed to spool (e.g., for performance reasons)

  • Customize spooling behavior based on metadata or stream properties

2.2.1. Programmatic Configuration

import java.util.Set;

import org.apache.tika.io.SpoolingStrategy;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;

// Create a custom spooling strategy that spools only ZIP and PDF
SpoolingStrategy strategy = new SpoolingStrategy();
strategy.setSpoolTypes(Set.of(
    MediaType.application("zip"),
    MediaType.application("pdf")
));

// Add to parse context
ParseContext context = new ParseContext();
context.set(SpoolingStrategy.class, strategy);

// Parse with the custom context
// (parser, inputStream, handler, and metadata are created elsewhere)
parser.parse(inputStream, handler, metadata, context);

2.2.2. SpoolingStrategy Methods

// Check if spooling should occur for a given type
boolean shouldSpool(TikaInputStream tis, Metadata metadata, MediaType mediaType)

// Configure which types should be spooled
void setSpoolTypes(Set<MediaType> types)

// Set the media type registry for specialization checking
void setMediaTypeRegistry(MediaTypeRegistry registry)

2.2.3. How Type Matching Works

The shouldSpool() method returns true if:

  1. The stream doesn’t already have a backing file (tis.hasFile() is false), AND

  2. The media type matches one of the configured spool types

Type matching considers:

  • Exact matches (e.g., application/zip)

  • Base type matches (e.g., application/zip matches application/zip; charset=utf-8)

  • Specializations (e.g., application/vnd.oasis.opendocument.text is a specialization of application/zip)
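
The matching rules above can be modeled in a few lines. This is a simplified sketch, not the SpoolingStrategy source: media types are plain strings, the SUPERTYPES map is a hypothetical stand-in for a media type registry, and hasFile stands in for tis.hasFile().

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class SpoolMatchSketch {

    // Hypothetical stand-in for a media type registry: child type -> parent type.
    static final Map<String, String> SUPERTYPES = Map.of(
            "application/vnd.oasis.opendocument.text", "application/zip");

    /** Drops parameters such as "; charset=utf-8" and normalizes case. */
    static String baseType(String mediaType) {
        int semi = mediaType.indexOf(';');
        String base = semi < 0 ? mediaType : mediaType.substring(0, semi);
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** True if the type, its base type, or any of its supertypes is a configured spool type. */
    static boolean matches(String mediaType, Set<String> spoolTypes) {
        String current = baseType(mediaType);
        while (current != null) {
            if (spoolTypes.contains(current)) {
                return true;
            }
            current = SUPERTYPES.get(current); // walk up the specialization chain
        }
        return false;
    }

    /** Spool only when there is no backing file yet and the type matches. */
    static boolean shouldSpool(boolean hasFile, String mediaType, Set<String> spoolTypes) {
        return !hasFile && matches(mediaType, spoolTypes);
    }

    public static void main(String[] args) {
        Set<String> types = Set.of("application/zip", "application/pdf");
        System.out.println(shouldSpool(false, "application/vnd.oasis.opendocument.text", types)); // prints "true"
        System.out.println(shouldSpool(true, "application/zip", types)); // prints "false"
    }
}
```

Checking for a backing file first keeps the common case cheap: a stream that already has a file never needs a type lookup.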

2.2.4. Default Spool Types

The default spool types are:

  • application/zip - ZIP archives and ZIP-based formats (OOXML, ODF, EPUB, etc.)

  • application/x-tika-msoffice - OLE2 Microsoft Office formats

  • application/x-bplist - Apple binary property lists

  • application/pdf - PDF documents

2.3. JSON Configuration

SpoolingStrategy can be configured via JSON in your tika-config.json file. Place the configuration in the parse-context section:

{
  "parse-context": {
    "spooling-strategy": {
      "spoolTypes": [
        "application/zip",
        "application/x-tika-msoffice",
        "application/pdf"
      ]
    }
  }
}

Load the configuration using TikaLoader:

TikaLoader loader = TikaLoader.load(Path.of("tika-config.json"));
ParseContext context = loader.loadParseContext();
// SpoolingStrategy is automatically loaded into the ParseContext

2.4. Best Practices

  1. Let Tika handle it: For most applications, the default behavior is optimal. Don’t configure spooling unless you have a specific need.

  2. Use TikaInputStream with Path or byte[]: When you have a file, pass the Path directly to TikaInputStream.get(Path) rather than wrapping a FileInputStream. Similarly, pass byte[] directly rather than wrapping a ByteArrayInputStream. This allows TikaInputStream to use efficient backing strategies that avoid unnecessary copying or spooling:

    // Good: TikaInputStream knows it has a file, can use random access directly
    TikaInputStream tis = TikaInputStream.get(path);
    
    // Bad: TikaInputStream sees an opaque stream, may spool unnecessarily
    TikaInputStream tis = TikaInputStream.get(new FileInputStream(file));
    
    // Good: TikaInputStream knows it has bytes in memory
    TikaInputStream tis = TikaInputStream.get(bytes);
    
    // Bad: TikaInputStream sees an opaque stream
    TikaInputStream tis = TikaInputStream.get(new ByteArrayInputStream(bytes));

  3. Close streams properly: Use try-with-resources to ensure temporary files are cleaned up:

    try (TikaInputStream tis = TikaInputStream.get(inputStream)) {
        parser.parse(tis, handler, metadata, context);
    }

  4. Consider memory vs. disk tradeoffs: For very large files, spooling to disk may be needed. For small files processed in bulk, keeping data in memory may be faster. TikaInputStream backing strategies can be tuned for your workload.