ZIP Detection and Salvaging
Apache Tika uses the DefaultZipContainerDetector to detect ZIP-based file formats including
plain ZIP archives, OOXML documents (docx, xlsx, pptx), ODF documents (odt, ods, odp),
EPUB, JAR files, and many others.
ZIP Salvaging
When a ZIP file is truncated or corrupted in a way that prevents it from being opened normally, Tika can attempt to "salvage" the file by:
- Streaming through the local file headers (one precedes each entry's data, starting at the beginning of the ZIP)
- Reconstructing a valid ZIP structure with a proper central directory
This allows Tika to extract content from partially downloaded files, truncated archives, or files with damaged central directories.
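The two steps above can be sketched with nothing but the JDK's java.util.zip classes. This is a minimal illustration of the general technique, not Tika's implementation: the class name ZipSalvageSketch and the method salvage are hypothetical, and the sketch assumes the damaged archive fits in memory. ZipInputStream reads entries sequentially from their local file headers without consulting the central directory, and closing a ZipOutputStream writes a fresh central directory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration only; not part of Tika.
class ZipSalvageSketch {

    /** Copies every entry that can be read completely into a new, valid ZIP. */
    static byte[] salvage(byte[] damaged) throws IOException {
        ByteArrayOutputStream rebuilt = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(rebuilt)) {
            try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(damaged))) {
                ZipEntry entry;
                // Walk the local file headers in order; the central directory is never read.
                while ((entry = in.getNextEntry()) != null) {
                    // Read the whole entry first so a truncated entry is not copied in part.
                    byte[] data = in.readAllBytes();
                    out.putNextEntry(new ZipEntry(entry.getName()));
                    out.write(data);
                    out.closeEntry();
                }
            } catch (IOException e) {
                // Truncated or corrupt entry: stop here and keep what was recovered so far.
            }
        } // Closing the output stream writes the new central directory.
        return rebuilt.toByteArray();
    }
}
```

Buffering each entry before writing it means a file that is cut off in the middle of an entry yields only the entries that precede the cut. A production implementation would stream to a temporary file rather than hold everything in memory.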
When Salvaging Occurs
Salvaging only happens when:
- The DefaultZipContainerDetector (or a detector that includes it) is used
- The detector is called with ParsingIntent.WILL_PARSE in the ParseContext (set automatically by AutoDetectParser)
- The ZIP file cannot be opened directly via ZipFile
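The last condition, a ZIP that cannot be opened directly, can be illustrated with a small check. This sketch uses the JDK's java.util.zip.ZipFile as a stand-in; which ZipFile class Tika's detector uses internally is not stated here, and the class and method names below are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.util.zip.ZipFile;

// Hypothetical helper for illustration only; not part of Tika.
class ZipOpenCheckSketch {

    /** Returns true if the JDK's ZipFile can open the file through its central directory. */
    static boolean opensAsZipFile(File file) {
        try (ZipFile zip = new ZipFile(file)) {
            return true;
        } catch (IOException e) {
            // ZipException (a subclass of IOException) is thrown when the
            // central directory is missing or damaged, e.g. after truncation.
            return false;
        }
    }
}
```

A file for which this check returns false is the kind of input that salvaging is meant to handle.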
Detecting Salvaged Files
When a file is salvaged, Tika sets the following metadata property:
zip:salvaged = true
You can check for this in your code:
import org.apache.tika.metadata.Zip;
boolean wasSalvaged = metadata.getBoolean(Zip.SALVAGED);
Direct Parser Usage
If you call parsers directly without going through AutoDetectParser, or if you
use a custom detector that doesn’t include DefaultZipContainerDetector, the salvaging
behavior will not apply. In these cases:
- Truncated or corrupted ZIP files may fail to parse
- You may get different results compared to using AutoDetectParser
- The parser will attempt to open the file directly, which may fail
For consistent behavior with truncated files, use AutoDetectParser:
// Recommended: Uses detection with salvaging
AutoDetectParser parser = new AutoDetectParser();
parser.parse(inputStream, handler, metadata, context);
Rather than calling parsers directly:
// Direct parser: No salvaging support
ZipParser parser = new ZipParser();
parser.parse(inputStream, handler, metadata, context); // May fail on truncated files
Other Detection Hints
The detector also sets additional metadata hints for parsers:
- zip:detectorZipFileOpened: true if the detector successfully opened the ZIP as a ZipFile. The ZipFile object is available via TikaInputStream.getOpenContainer() for reuse by parsers.
- zip:detectorDataDescriptorRequired: true if streaming detection required DATA_DESCRIPTOR support. This hint helps parsers choose the correct streaming mode.
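For background on the data descriptor hint: in the ZIP format, bit 3 of the general purpose bit flag in a local file header signals that the entry's CRC and sizes follow the data in a data descriptor instead of appearing in the header. The sketch below reads that flag for the first entry using only the JDK. The class and method names are hypothetical, and whether Tika's detector derives its hint in exactly this way is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration only; not part of Tika.
class DataDescriptorFlagSketch {

    private static final int LOCAL_HEADER_SIGNATURE = 0x04034b50; // bytes "PK\3\4", little-endian
    private static final int DATA_DESCRIPTOR_BIT = 0x08;          // bit 3 of the flag field

    /** Returns true if the first local file header declares a trailing data descriptor. */
    static boolean firstEntryUsesDataDescriptor(byte[] zipBytes) {
        if (zipBytes.length < 8) {
            return false; // too short to hold signature, version, and flags
        }
        ByteBuffer header = ByteBuffer.wrap(zipBytes).order(ByteOrder.LITTLE_ENDIAN);
        if (header.getInt(0) != LOCAL_HEADER_SIGNATURE) {
            return false; // does not start with a local file header
        }
        // Layout: signature (4 bytes), version needed (2 bytes), general purpose flags (2 bytes).
        int flags = header.getShort(6) & 0xFFFF;
        return (flags & DATA_DESCRIPTOR_BIT) != 0;
    }
}
```

Archives written in a single streaming pass commonly set this bit, because the sizes are not known when the header is written.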
Related Topics
- TikaInputStream and Spooling - How Tika handles file access
- Robustness - Handling failures when parsing untrusted content