Class ZipParser

All Implemented Interfaces:
Serializable, SelfConfiguring, Parser

public class ZipParser extends AbstractArchiveParser
Parser for ZIP and JAR archives using file-based access for complete metadata extraction.

This parser handles:

  • Standard ZIP archives
  • JAR (Java Archive) files
  • Archive and entry comments
  • Unix permissions and file attributes
  • Charset detection for non-Unicode entry names
  • Encryption detection

This parser prefers file-based access (ZipFile) for complete metadata extraction, but falls back to streaming (ZipArchiveInputStream) for edge-case ZIPs that cannot be read as files (e.g., those with data descriptors that overlap the central directory).
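The file-first, streaming-fallback strategy described above can be sketched roughly as follows. This is an illustration only, not the parser's actual implementation: `ZipEntryLister` and its methods are hypothetical names, and the JDK's `java.util.zip.ZipFile` and `ZipInputStream` stand in for the Commons Compress `ZipFile` and `ZipArchiveInputStream` classes that the description refers to.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Tika. JDK classes are used as stand-ins
// for the Commons Compress ZipFile and ZipArchiveInputStream.
final class ZipEntryLister {

    private ZipEntryLister() {
    }

    /** Lists entry names, preferring file-based access and falling back to streaming. */
    static List<String> listEntryNames(Path archive) throws IOException {
        try {
            return listViaFile(archive);
        } catch (IOException e) {
            // The archive could not be opened as a file (for example, the central
            // directory is unreadable), so scan the local file headers sequentially.
            return listViaStream(archive);
        }
    }

    /** File-based access: reads the central directory. */
    static List<String> listViaFile(Path archive) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(archive.toFile())) {
            for (ZipEntry entry : Collections.list(zip.entries())) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    /** Streaming access: walks the entries in the order they appear in the file. */
    static List<String> listViaStream(Path archive) throws IOException {
        List<String> names = new ArrayList<>();
        try (InputStream in = Files.newInputStream(archive);
             ZipInputStream zin = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }
}
```

File-based access reads the central directory, which is where archive comments and Unix attributes are stored. That is why it is the preferred path, and why the streaming path may yield less metadata.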

Truncated and Corrupted Files

This parser does not perform ZIP salvaging directly. When used with AutoDetectParser, the DefaultZipContainerDetector handles salvaging of truncated/corrupted files and provides the prepared ZipFile via TikaInputStream.getOpenContainer().

Note: If you call this parser directly without going through the detector, truncated or corrupted ZIP files may fail to parse. For best results with untrusted content, use AutoDetectParser.

  • Field Details

    • ZIP_SPECIALIZATIONS

      public static final Set<MediaType> ZIP_SPECIALIZATIONS
      Set of media types that are specializations of ZIP (e.g., Office documents, EPUB, APK). Used to avoid overwriting more specific media types with generic "application/zip".
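The guard that such a set enables might look like the sketch below. It is a simplified illustration under stated assumptions: plain strings and a `Map` stand in for Tika's `MediaType` and `Metadata` classes, `ZipTypeGuard` is a hypothetical name, and the listed types are examples rather than the actual contents of `ZIP_SPECIALIZATIONS`.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical illustration, not Tika code. Strings stand in for MediaType
// and a Map stands in for Metadata.
final class ZipTypeGuard {

    static final String CONTENT_TYPE = "Content-Type";
    static final String GENERIC_ZIP = "application/zip";

    // Example values only; the real ZIP_SPECIALIZATIONS contents may differ.
    static final Set<String> SPECIALIZATIONS = Set.of(
            "application/epub+zip",
            "application/vnd.android.package-archive",
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document");

    private ZipTypeGuard() {
    }

    /** Records the generic ZIP type unless a more specific ZIP-based type is already set. */
    static void applyGenericType(Map<String, String> metadata) {
        String current = metadata.get(CONTENT_TYPE);
        if (current != null && SPECIALIZATIONS.contains(current)) {
            return; // keep the more specific type
        }
        metadata.put(CONTENT_TYPE, GENERIC_ZIP);
    }
}
```

With this check, a document already identified as EPUB keeps `application/epub+zip`, while an entry with no type (or an unrelated one) is labeled `application/zip`.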
  • Constructor Details

  • Method Details

    • getSupportedTypes

      public Set<MediaType> getSupportedTypes(ParseContext context)
      Description copied from interface: Parser
      Returns the set of media types supported by this parser when used with the given parse context.
      Parameters:
      context - parse context
      Returns:
      immutable set of media types
    • parse

      public void parse(TikaInputStream tis, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
      Description copied from interface: Parser
      Parses a document stream into a sequence of XHTML SAX events. Fills in related document metadata in the given metadata object.

      The given document stream is consumed but not closed by this method. The responsibility to close the stream remains on the caller.

      Information about the parsing context can be passed in the context parameter. See the parser implementations for the kinds of context information they expect.

      Parameters:
      tis - the document stream (input)
      handler - handler for the XHTML SAX events (output)
      metadata - document metadata (input and output)
      context - parse context
      Throws:
      IOException - if the document stream could not be read
      SAXException - if the SAX events could not be processed
      TikaException - if the document could not be parsed
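The stream-ownership contract stated for parse (the stream is consumed but not closed, and the caller closes it) can be illustrated with a generic sketch. Nothing here is Tika API: `StreamOwnershipDemo`, `consume`, and `parseAndClose` are hypothetical names, and `consume` merely stands in for a parse call.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration of the "consumed but not closed" contract.
final class StreamOwnershipDemo {

    private StreamOwnershipDemo() {
    }

    /** Stand-in for a parse call: reads the stream to the end but does not close it. */
    static long consume(InputStream in) throws IOException {
        long total = 0;
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n;
        }
        return total;
    }

    /** The caller owns the stream, so it closes it with try-with-resources. */
    static long parseAndClose(InputStream source) throws IOException {
        try (InputStream in = source) {
            return consume(in);
        }
    }
}
```

In real use, the caller would typically open the `TikaInputStream` in a try-with-resources block around the parse call, so that the stream is closed even when parsing throws.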