Apache Tika 2.1.0

The most notable changes in Tika 2.1.0 over the previous release are:

  • Improved packaging for tika-parsers-extended. Use the tika-parser-scientific-package and tika-parser-sqlite3-package artifacts if you want fat jars with dependencies. (TIKA-3510)
  • Tika app now writes UTF-8 when no encoding is specified; the legacy behavior was UTF-8 on Mac OS but the system default on other operating systems (TIKA-3515).
  • Change the default rendering strategy for PDFs from NO_TEXT to ALL (TIKA-3520).

Other changes:

  • Fixed bug that pointed to the wrong tessdata directory if the user specified a tesseract path but not a tessdata path (TIKA-3518).
  • Fixed bug in Icu4j's encoding detector where it would return non-standard names for charsets, e.g. IBM424_rtl is now returned as IBM424 (TIKA-3516).
  • Add a simple UrlFetcher in tika-core as a basic alternative to tika-fetcher-http (TIKA-3527).
  • Add tika-pipes support for Google Cloud Storage (TIKA-3524).
  • Fix markup ordering errors in xhtml output for ODT files (TIKA-2242).
  • Fix serialization of embedded docs in OpenSearch emitter and fix embedded documents not being indexed in some use-cases in the Solr emitter (TIKA-3490).
  • Add pipesClientId system property to PipesServer so that each forked process can log to its own logger (TIKA-3480).
  • Add DateNormalizingMetadataFilter to let users ensure that all dates emitted to Solr/OpenSearch are in UTC. Users can configure which timezone to assume in cases where the file format does not store a timezone (TIKA-3496).
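
The date normalization described in the TIKA-3496 item can be illustrated with a minimal sketch built only on java.time. This is not Tika's implementation: the class name DateNormalizerSketch and the method toUtc are hypothetical, and the sketch assumes dates arrive as ISO-8601 strings. A value that carries its own offset is converted directly; a value without one is interpreted in a caller-supplied default zone before conversion to UTC.

```java
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeParseException;

// Hypothetical illustration only; not the Tika DateNormalizingMetadataFilter.
class DateNormalizerSketch {

    // Returns the given ISO-8601 date-time as a UTC instant string.
    // If the value has no offset, it is assumed to be in defaultZone.
    static String toUtc(String value, ZoneId defaultZone) {
        try {
            // Value already carries an offset, e.g. 2021-08-01T12:00:00+02:00
            return OffsetDateTime.parse(value).toInstant().toString();
        } catch (DateTimeParseException e) {
            // No offset stored: apply the configured default zone
            return LocalDateTime.parse(value).atZone(defaultZone).toInstant().toString();
        }
    }

    public static void main(String[] args) {
        ZoneId assumed = ZoneId.of("America/New_York");
        System.out.println(toUtc("2021-08-01T12:00:00", assumed));
        System.out.println(toUtc("2021-08-01T12:00:00+02:00", assumed));
    }
}
```

The default zone only matters for values that lack an offset; values with an explicit offset are converted to UTC regardless of the configured zone.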

The following people have contributed to Tika 2.1.0 by submitting or commenting on the issues resolved in this release:

  • Aashish Chaudhary
  • Abha
  • Albert L.
  • Alessandro De Angelis
  • Ann Burgess
  • Bin Hawking
  • Chaitra Rajappa
  • Chris A. Mattmann
  • Chris Bryant
  • Daniel Bonniot de Ruisselet
  • Dave Meikle
  • David Eric Pugh
  • frank
  • Graham Charters
  • jefferyyuan
  • Jukka Zitting
  • Julian Reschke
  • Kenneth William Krugler
  • Konstantin Gribov
  • Lewis John McGibbney
  • Luís Filipe Nassif
  • Łukasz Ozimek
  • Madhav Sharan
  • Markus Jelsma
  • Michael McCandless
  • Nick Burch
  • Paul Ramirez
  • Peter Kronenberg
  • RameshKalidindi
  • Ravi
  • Ray Gauss II
  • Reinhard Pötz
  • Roberto Benedetti
  • Rupert Westenthaler
  • Sam H
  • Sebastian Nagel
  • Sergey Beryozkin
  • Shubhangi Raut
  • Thomas Mortagne
  • Tilman Hausherr
  • Tim Allison
  • Tyler Bui-Palsulich
  • Uwe Schindler
  • Yaniv Kunda

See https://s.apache.org/h8ik6 for more details on these contributions.