Apache Tika 2.5.0

The most notable changes in Tika 2.5.0 over the previous release are:

  • Improved extraction of PDF subset info for PDF/UA, PDF/VT, and PDF/X. NOTE: we no longer append PDF/A information, e.g. 'version="A-1b"', to the 'dc:format' value. Users must now get that information from the 'pdfa:PDFVersion' key or from 'pdfaid:conformance' and 'pdfaid:part' (TIKA-3844).
  • Avoid infinite loop in bookmark extraction from PDFs (TIKA-3832).
  • Update to slf4j 2.0.1 (TIKA-3842).
  • Added upsert option for the OpenSearch emitter (TIKA-3855).
  • Extract PDF signature information at the document level into the metadata (TIKA-3852).
  • Enable configuration of digests via AutoDetectParserConfig (TIKA-3853).
  • Use commons-io byte array streams via PJ Fanning (TIKA-3843).
  • Upgrade to PDFBox 2.0.27 (TIKA-3866).
  • Upgrade to JempBox 1.8.17 (TIKA-3856).
  • Add extraction of ODF version from ODF files (TIKA-3840).
  • tika-parser-html-commons (BoilerPipeHandler) is no longer a dependency of tika-parser-html-module. tika-app and tika-server-standard have added a dependency on tika-parser-html-commons. However, users who manage custom dependencies and who want the BoilerPipeHandler must now include the tika-parser-html-commons dependency themselves (TIKA-1484).
  • Add unrar as an optional parser (TIKA-3800).
  • Refactor FuzzingCLI to use PipesParser (TIKA-3799).
  • ServiceLoader's loadServiceProviders() now guarantees unique classes (TIKA-3797).
  • Fix bug that prevented setting of includeHeadersAndFooters for xls, xlsx, doc and docx via tika-config (TIKA-3796).
  • Fix bug that prevented specification of rendered image type via http header in the PDFParser (TIKA-3794).
  • Fix bug causing some Exif dates to be decoded incorrectly in time zones other than UTC (TIKA-3815).
  • Numerous dependency upgrades (TIKA-3795).

The following people have contributed to Tika 2.5.0 by submitting or commenting on the issues resolved in this release:

  • Aurélien Marocco
  • Ben Gilbert
  • Eduardas Kazakas
  • Eugen Caruntu
  • Giorgiana Ciobanu
  • Lakatos Gyula
  • Luís Filipe Nassif
  • Nicholas DiPiazza
  • PJ Fanning
  • Robin Schimpf
  • Tilman Hausherr
  • Tim Allison
  • Yurii

See https://s.apache.org/j2sms for more details on these contributions.