Apache Tika 2.4.0

The most notable changes in Tika 2.4.0 over the previous release are:

  • NOTE: To save on resources, we no longer include the deeplearning4j dependencies in the tika-dl jar. The dependencies for the tika-dl package must be provided by users. See https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-ml/tika-dl/pom.xml for the dependencies that must be provided at run-time (TIKA-3676).
  • NOTE: Added prefix "dwg-custom:" to DWG custom metadata properties (TIKA-3731).
  • Add initial, BETA-grade TLS encryption option for tika-server; configuration may change in future releases (TIKA-3719).
  • Allow specification of fetcherName and fetchKey via query parameters in request URI in tika-server (TIKA-3714).
  • Add basic parsers for WARC and WACZ in tika-parsers-standard (TIKA-3697).
  • Add MetadataWriteFilter capability to improve memory profile in Metadata objects (TIKA-3695).
  • Allow configurability of the ContentHandlerDecorator used by the AutoDetectParser (TIKA-3723).
  • Allow configurability of the EmbeddedDocumentExtractor used by the AutoDetectParser (TIKA-3711).
  • Add detection for Frictionless Data packages and WACZ (TIKA-3696).
  • Add detection for DGN files with gratitude and credit to Steven Frew's tika-dgn-detector (TIKA-3721).
  • Add parser for metadata from DGN 8 files via Dan Coldrick (TIKA-3721).
  • Add a fetcher and emitter for Azure blob storage (TIKA-3707).
  • Add detection for files encrypted by Microsoft's Rights Management Service (TIKA-3666).
  • Fix regression in 2.3.0 that caused too many embedded filenames to be written to the content (TIKA-3711).
  • tika-server now clones the forking process's environment variables into the forked process (TIKA-3715).
  • Add an optional /eval endpoint for tika-eval profile or compare capabilities in tika-server (TIKA-3689).
  • Add a Parsed-By-Full-Set metadata item to record all parsers that processed a file (TIKA-3716).
  • Add metadata filters for Optimaize and OpenNLP language detectors (TIKA-3717).
  • Upgrade to PDFBox 2.0.26 (TIKA-3726).
  • Upgrade deeplearning4j to 1.0.0-M2 (TIKA-3458 and PR#527).
  • Various dependency upgrades, including POI, dl4j, gson, jackson, twelvemonkeys, log4j2 and others (TIKA-3675 and many PRs from dependabot).
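
As an illustration of the tika-server query-parameter option listed above (TIKA-3714), the following minimal Java sketch builds a request URI that carries fetcherName and fetchKey. Only the two parameter names come from these notes. The base URL, the fetcher name "fs", the fetch key, and the class and method names are assumptions chosen for illustration, as is the choice to percent-encode the values.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: builds a tika-server request URI
// that passes fetcherName and fetchKey as query parameters.
final class FetcherRequestUri {

    // baseUrl is assumed to point at a tika-server endpoint,
    // e.g. http://localhost:9998/tika (an assumption, not taken from these notes).
    static URI build(String baseUrl, String fetcherName, String fetchKey) {
        // Percent-encode both values so that characters such as '/' survive in the query string.
        String query = "fetcherName=" + URLEncoder.encode(fetcherName, StandardCharsets.UTF_8)
                + "&fetchKey=" + URLEncoder.encode(fetchKey, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "?" + query);
    }

    public static void main(String[] args) {
        // "fs" and "docs/report.pdf" are placeholder values for illustration only.
        URI uri = build("http://localhost:9998/tika", "fs", "docs/report.pdf");
        System.out.println(uri);
        // prints http://localhost:9998/tika?fetcherName=fs&fetchKey=docs%2Freport.pdf
    }
}
```

The fetcher name would need to match a fetcher defined in the server's configuration. Consult the tika-server documentation for the endpoints and HTTP methods that accept these parameters.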

The following people have contributed to Tika 2.4.0 by submitting or commenting on the issues resolved in this release:

  • August Valera
  • beamliu
  • Dan Coldrick
  • Julien Massiera
  • Lewis John McGibbney
  • Nick Burch
  • PJ Fanning
  • Sam Stephens
  • Thierry Guérin
  • Tim Allison
  • Zac Jacobson

See https://s.apache.org/59u4j for more details on these contributions.