Apache Tika 2.2.0

The most notable changes in Tika 2.2.0 over the previous release are:

  • Add support for OneNote files downloaded from O365 (TIKA-3446).
  • Fix logic bug in PipesServer that prevented concatenation of content from attachments (TIKA-3609).
  • Improve extraction of embedded files from MSOffice files created by non-Microsoft tools (TIKA-3526).
  • Add back the ability to ignore load errors in TikaConfig (TIKA-3575).
  • Make SecureContentHandler and other parameters configurable in AutoDetectParser programmatically and via tika-config.xml (TIKA-3594).
  • Fix default logging in tika-app in batch mode (TIKA-3589).
  • Fix bug that prevented specifying a config with the long --config= option in tika-app in batch mode (TIKA-3589).
  • Fix thread starvation after numerous restarts in PipesClient (TIKA-3588).
  • Fix race condition when starting multiple forked servers on multiple ports (TIKA-3586).
  • Add timeout per task to be configured via headers for tika-server's legacy endpoints /tika and /rmeta. Note that this timeout greater than taskTimeoutMillis (TIKA-3582).
  • Add metadata item for whether or not a PDF has a collection/is a Portfolio PDF (TIKA-3579).
  • Add detection of ESRI Layer files (TIKA-3570).
  • Add detection of JPEG XL, MARC, ICC profiles, NES-ROM file types (TIKA-3562 and TIKA-3563).
  • Remove duplicate "subject" metadata keys that were intended for backwards compatibility with 1.x only (TIKA-3564).
  • Fix Open Office mime types to be subclasses of application/zip and no longer require OPCPackageDetector-last ordering of zip detectors (TIKA-3556).
  • Improve robustness and features of the httpfetcher (TIKA-3543).
  • Add optional fetch ranges to FetchEmitTuple to allow range fetching from, e.g., http or s3 (TIKA-3542).

The following people have contributed to Tika 2.2.0 by submitting or commenting on the issues resolved in this release:

  • Abha
  • Andreas Hubold
  • August Valera
  • César Soto Valero
  • dataminer.accolade
  • David Brosius
  • Laura Delmaestro
  • Lewis John McGibbney
  • Luís Filipe Nassif
  • Robin Schimpf
  • Sebastian Nagel
  • Tim Allison

See https://s.apache.org/0pfp7 for more details on these contributions.