Apache Tika 2.3.0

The most notable changes in Tika 2.3.0 over the previous release are:

  • Upgrade to Apache POI 5.2.0. This is the first upgrade to POI 5.x and represents a major refactoring. Users will see significantly more log output from the POI parsers (TIKA-3164).
  • Upgrade to log4j2 2.17.1 (TIKA-3638).
  • Improve consistency in reporting package-entry divs across all parsers for embedded files (TIKA-3644). As a result, more text (embedded file names) is extracted from files with many embedded attachments.
  • Improve configuration of maps as params for parsers in TikaConfig (TIKA-3645).
  • Improve identification of iWorks 13 files and add parsing for thumbnails, some metadata and attachments (TIKA-3634). Skip handling of .iwa files, which are not yet supported.
  • Limit the default in-memory processing (maxMainMemoryBytes) in the PDFParser to 512 MB, as in the 1.x branch (TIKA-3642).
  • Add the IDML parser from the 1.x series to the 2.x series (TIKA-3188).
  • Extract annotation types and subtypes for PDFs into metadata (TIKA-3653).
  • Add metadata value for PDFs that contain 3D annotations (TIKA-3653).
  • Add parser for Translation Memory eXchange (TMX) files (TIKA-3660).
  • Add Bill of Materials (Maven BOM) for centralized module version management (TIKA-3667).

The following people have contributed to Tika 2.3.0 by submitting or commenting on the issues resolved in this release:

  • Bernhard Geisberger
  • Carina Antunes
  • Aman Mishra
  • Aravinth
  • Dave Meikle
  • Dmitrii Kriukov
  • Josh Burchard
  • Kaka Lee
  • Lewis John McGibbney
  • Sergen Bağ
  • Subhajit Das
  • Tim Allison

See https://s.apache.org/syxl5 for more details on these contributions.