Class DefaultZipContainerDetector
- All Implemented Interfaces:
Serializable, SelfConfiguring, Detector
- Direct Known Subclasses:
StreamingZipContainerDetector
As a first step, this detector uses commons-compress to detect any archive format that the library supports. If a "zip" file is detected, the ZipContainerDetectors are run to try to identify a subtype.
If an archive format that is not a zip is detected, that mime type is returned.
Finally, if the file is not detected as an archive format, this runs commons-compress' compressor format detector.
For TikaInputStream, file-based detection is used (TikaInputStream
handles spilling to disk automatically if needed).
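The order of checks described above can be illustrated with a minimal, self-contained sketch. It does not call commons-compress or Tika; the class `DetectionOrderSketch`, its helper `hasMagic`, and the small set of hard-coded magic numbers are assumptions made for illustration, and the real detector supports many more formats and refines ZIP subtypes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the detection order; not the Tika implementation.
final class DetectionOrderSketch {

    static final byte[] ZIP = {'P', 'K', 3, 4};
    static final byte[] SEVEN_Z = {'7', 'z', (byte) 0xBC, (byte) 0xAF, 0x27, 0x1C};
    static final byte[] TAR_USTAR = "ustar".getBytes(StandardCharsets.US_ASCII);
    static final byte[] GZIP = {0x1F, (byte) 0x8B};
    static final byte[] BZIP2 = {'B', 'Z', 'h'};

    // True if data contains magic at the given offset.
    static boolean hasMagic(byte[] data, int offset, byte[] magic) {
        if (data.length < offset + magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[offset + i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    static String detect(byte[] prefix) {
        // Step 1: archive formats. A ZIP hit is where the real detector
        // would hand off to the ZipContainerDetectors to find a subtype.
        if (hasMagic(prefix, 0, ZIP)) {
            return "application/zip";
        }
        // Step 2: a non-ZIP archive is reported as-is.
        if (hasMagic(prefix, 0, SEVEN_Z)) {
            return "application/x-7z-compressed";
        }
        if (hasMagic(prefix, 257, TAR_USTAR)) {
            return "application/x-tar";
        }
        // Step 3: only when no archive matched, try compressor formats.
        if (hasMagic(prefix, 0, GZIP)) {
            return "application/gzip";
        }
        if (hasMagic(prefix, 0, BZIP2)) {
            return "application/x-bzip2";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        System.out.println(detect(new byte[] {'P', 'K', 3, 4, 0, 0})); // application/zip
        System.out.println(detect(new byte[] {0x1F, (byte) 0x8B, 8})); // application/gzip
        System.out.println(detect(new byte[] {1, 2, 3}));              // application/octet-stream
    }
}
```

The point of the sketch is the ordering only: archive checks come first, and compressor checks run only as a fallback.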
ZIP Salvaging
When a ZIP file cannot be opened directly (truncated or corrupted), and
ParsingIntent.WILL_PARSE is present in the ParseContext,
this detector will attempt to salvage the file using ZipSalvager.
Salvaging reconstructs a valid ZIP structure from the local file headers.
When salvaging succeeds, Zip.SALVAGED is set to true in the
metadata, and the salvaged ZipFile is stored in
TikaInputStream.getOpenContainer() for reuse by parsers.
Note: If you use parsers directly without this detector (or without
AutoDetectParser), salvaging will not occur
and truncated files may fail to parse.
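To show why local file headers are enough to recover entries from a truncated ZIP, here is a naive, self-contained sketch using only the JDK. It is not the ZipSalvager implementation; the class `LocalHeaderScanSketch` and its methods are hypothetical. It builds a small ZIP in memory, cuts it off before the central directory, and then recovers the entry names by scanning for local file header signatures.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration; a real salvager must also recover entry data.
final class LocalHeaderScanSketch {

    private static final int LOCAL_HEADER_SIG = 0x04034b50; // bytes "PK\3\4"
    private static final int CENTRAL_DIR_SIG = 0x02014b50;  // bytes "PK\1\2"
    private static final int FIXED_HEADER_LEN = 30;

    static int u16(byte[] b, int off) {
        return (b[off] & 0xFF) | ((b[off + 1] & 0xFF) << 8);
    }

    static int le32(byte[] b, int off) {
        return u16(b, off) | (u16(b, off + 2) << 16);
    }

    // Walks the buffer and collects the file name from each local header.
    // Naive: a signature-like byte run inside entry data would be misread.
    static List<String> entryNames(byte[] data) {
        List<String> names = new ArrayList<>();
        int pos = 0;
        while (pos + FIXED_HEADER_LEN <= data.length) {
            if (le32(data, pos) != LOCAL_HEADER_SIG) {
                pos++;
                continue;
            }
            int nameLen = u16(data, pos + 26);
            int extraLen = u16(data, pos + 28);
            int nameStart = pos + FIXED_HEADER_LEN;
            if (nameStart + nameLen > data.length) {
                break; // truncated in the middle of a header
            }
            names.add(new String(data, nameStart, nameLen, StandardCharsets.UTF_8));
            pos = nameStart + nameLen + extraLen;
        }
        return names;
    }

    // Builds a two-entry ZIP and drops everything from the central directory on.
    static byte[] truncatedSample() {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            for (String name : new String[] {"a.txt", "b.txt"}) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write("hello".getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] full = bos.toByteArray();
        for (int i = 0; i + 4 <= full.length; i++) {
            if (le32(full, i) == CENTRAL_DIR_SIG) {
                return Arrays.copyOf(full, i);
            }
        }
        throw new IllegalStateException("no central directory found");
    }

    public static void main(String[] args) {
        System.out.println(entryNames(truncatedSample())); // prints [a.txt, b.txt]
    }
}
```

Because each local header carries the entry name and lengths, the listing survives even though the central directory is gone.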
Field Summary
Fields -
Constructor Summary
Constructors -
Method Summary
Modifier and Type: MediaType
Method: detect(TikaInputStream tis, Metadata metadata, ParseContext parseContext)
Description: Detects the content type of the given input document.
-
Field Details
-
staticZipDetectors
-
-
Constructor Details
-
DefaultZipContainerDetector
public DefaultZipContainerDetector()
DefaultZipContainerDetector
-
DefaultZipContainerDetector
-
-
Method Details
-
detect
public MediaType detect(TikaInputStream tis, Metadata metadata, ParseContext parseContext) throws IOException
Description copied from interface: Detector
Detects the content type of the given input document. Returns application/octet-stream if the type of the document cannot be detected.
If the document input stream is not available, then the first argument may be null. Otherwise the detector may read bytes from the start of the stream to help in type detection. The detector is expected to mark the stream before reading any bytes from it, and to reset the stream before returning. The stream must not be closed by the detector.
The given input metadata is only read, not modified, by the detector.
- Specified by:
detect in interface Detector
- Parameters:
tis - document input stream, or null
metadata - input metadata for the document
parseContext - the parse context
- Returns:
detected media type, or application/octet-stream
- Throws:
IOException - if the document input stream could not be read
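The mark-and-reset contract described for detect can be sketched with a plain java.io.InputStream. This is only an illustration of the contract, not how this class reads a TikaInputStream; the class `MarkResetSketch` and its `peek` method are hypothetical names.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper showing mark, read, reset without closing the stream.
final class MarkResetSketch {

    // Returns up to max leading bytes and leaves the stream position unchanged.
    static byte[] peek(InputStream in, int max) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(max);
        try {
            byte[] buf = new byte[max];
            int n = in.readNBytes(buf, 0, max);
            return Arrays.copyOf(buf, n);
        } finally {
            in.reset(); // the caller sees the stream as if nothing was read
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream(new byte[] {'P', 'K', 3, 4, 'x'}));
        byte[] head = peek(in, 4);
        System.out.println(head.length);      // 4
        System.out.println((char) in.read()); // P
    }
}
```

After `peek` returns, the next read still yields the first byte, which is what lets a parser consume the same stream after detection.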
-