Class DefaultZipContainerDetector

java.lang.Object
org.apache.tika.detect.zip.DefaultZipContainerDetector
All Implemented Interfaces:
Serializable, SelfConfiguring, Detector
Direct Known Subclasses:
StreamingZipContainerDetector

public class DefaultZipContainerDetector extends Object implements Detector
This class is designed to detect subtypes of zip-based file formats. For the sake of efficiency, it also detects archive and compressor formats via commons-compress.

As a first step, it uses commons-compress to detect any archive format that commons-compress supports. If a "zip" file is detected, the ZipContainerDetectors are run to try to identify a subtype.

If an archive format that is not a zip is detected, that mime type is returned.

Finally, if the file is not detected as an archive format, this runs the commons-compress compressor format detector.

For TikaInputStream, file-based detection is used (TikaInputStream handles spilling to disk automatically if needed).
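The detection order described above can be sketched with plain JDK types. This is only an illustration of the control flow, not the class's actual implementation: the class name DetectionOrderSketch, the three Function parameters, and the use of plain strings for media types are all hypothetical stand-ins. The fallback to application/zip when no subtype matches is an assumption; the fallback to application/octet-stream follows the Detector contract.

```java
import java.util.Optional;
import java.util.function.Function;

final class DetectionOrderSketch {

    static final String ZIP = "application/zip";
    static final String OCTET_STREAM = "application/octet-stream";

    /**
     * Hypothetical cascade: archive detection first, zip subtype detection
     * only for zips, compressor detection only if no archive was found.
     */
    static String detect(byte[] prefix,
                         Function<byte[], Optional<String>> archiveDetector,
                         Function<byte[], Optional<String>> zipSubtypeDetector,
                         Function<byte[], Optional<String>> compressorDetector) {
        Optional<String> archive = archiveDetector.apply(prefix);
        if (archive.isPresent()) {
            if (ZIP.equals(archive.get())) {
                // A zip was found: try to narrow it to a subtype.
                // Assumption: plain zip is returned if no subtype matches.
                return zipSubtypeDetector.apply(prefix).orElse(ZIP);
            }
            // A non-zip archive: return its type as-is.
            return archive.get();
        }
        // Not an archive: fall through to compressor formats.
        return compressorDetector.apply(prefix).orElse(OCTET_STREAM);
    }
}
```

Note that in this sketch the compressor detector is never consulted once an archive format has been recognized, which matches the ordering in the description.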

ZIP Salvaging

When a ZIP file cannot be opened directly (truncated or corrupted), and ParsingIntent.WILL_PARSE is present in the ParseContext, this detector will attempt to salvage the file using ZipSalvager. Salvaging reconstructs a valid ZIP structure from the local file headers.

When salvaging succeeds, Zip.SALVAGED is set to true in the metadata, and the salvaged ZipFile is stored in TikaInputStream.getOpenContainer() for reuse by parsers.

Note: If you use parsers directly without this detector (or without AutoDetectParser), salvaging will not occur and truncated files may fail to parse.
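The idea of rebuilding a ZIP from its local file headers can be illustrated with the JDK alone. The sketch below is not ZipSalvager and makes no claim about how it works internally; SalvageSketch and its methods are hypothetical names. It relies on java.util.zip.ZipInputStream, which reads entries sequentially from local file headers and does not need the central directory at the end of the file, so entries that precede the damage can still be read and written out to a fresh, valid ZIP.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class SalvageSketch {

    /** Reads entries in stream order; stops at the first entry that cannot be read completely. */
    static Map<String, byte[]> readRecoverableEntries(byte[] damagedZip) {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(damagedZip))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                // readAllBytes() stops at the end of the current entry.
                byte[] data = in.readAllBytes();
                entries.put(entry.getName(), data);
            }
        } catch (IOException e) {
            // Truncated or corrupt entry: keep whatever was fully read before it.
        }
        return entries;
    }

    /** Writes the given entries to a new in-memory ZIP with a fresh central directory. */
    static byte[] writeZip(Map<String, byte[]> entries) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, byte[]> e : entries.entrySet()) {
                out.putNextEntry(new ZipEntry(e.getKey()));
                out.write(e.getValue());
                out.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Hypothetical salvage step: recover what is readable, then rebuild a valid ZIP. */
    static byte[] salvage(byte[] damagedZip) {
        return writeZip(readRecoverableEntries(damagedZip));
    }
}
```

An entry that is cut off partway through is dropped entirely in this sketch; whether a real salvager keeps partial entry data is a separate design decision.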

See Also:
  • Field Details

  • Constructor Details

    • DefaultZipContainerDetector

      public DefaultZipContainerDetector()
    • DefaultZipContainerDetector

      public DefaultZipContainerDetector(ServiceLoader loader)
    • DefaultZipContainerDetector

      public DefaultZipContainerDetector(List<ZipContainerDetector> zipDetectors)
  • Method Details

    • detect

      public MediaType detect(TikaInputStream tis, Metadata metadata, ParseContext parseContext) throws IOException
      Description copied from interface: Detector
      Detects the content type of the given input document. Returns application/octet-stream if the type of the document cannot be detected.

      If the document input stream is not available, then the first argument may be null. Otherwise the detector may read bytes from the start of the stream to help in type detection. The detector is expected to mark the stream before reading any bytes from it, and to reset the stream before returning. The stream must not be closed by the detector.

      The given input metadata is only read, not modified, by the detector.

      Specified by:
      detect in interface Detector
      Parameters:
      tis - document input stream, or null
      metadata - input metadata for the document
      parseContext - the parse context
      Returns:
      detected media type, or application/octet-stream
      Throws:
      IOException - if the document input stream could not be read
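
The mark/reset contract described for detect can be illustrated with a plain java.io.InputStream. This is a minimal sketch of the contract only, not of this detector's logic: MarkResetSketch and startsWithZipSignature are hypothetical names, and checking for the ZIP local file header signature ("PK" followed by bytes 3 and 4) stands in for real type detection. The stream is marked before any bytes are read, reset before returning, and never closed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class MarkResetSketch {

    // Signature that starts a ZIP local file header.
    private static final byte[] ZIP_LOCAL_HEADER = {'P', 'K', 3, 4};

    /**
     * Peeks at the start of the stream and restores its position afterwards.
     * A null stream is treated as "not detected", mirroring the contract
     * that the stream argument may be null.
     */
    static boolean startsWithZipSignature(InputStream stream) {
        if (stream == null) {
            return false;
        }
        if (!stream.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        // Mark before reading anything.
        stream.mark(ZIP_LOCAL_HEADER.length);
        try {
            try {
                byte[] prefix = stream.readNBytes(ZIP_LOCAL_HEADER.length);
                return Arrays.equals(prefix, ZIP_LOCAL_HEADER);
            } finally {
                // Reset before returning; the stream is not closed here.
                stream.reset();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

After the call, the caller can read the stream from its original position, which is what allows a parser to consume the same stream once detection has finished.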