Apache Tika 2.1.0
The most notable changes in Tika 2.1.0 over the previous release are:
- Improved packaging for tika-parsers-extended. Use the tika-parser-scientific-package and tika-parser-sqlite3-package artifacts if you want fat jars with dependencies. (TIKA-3510)
- Tika app now writes UTF-8 when an encoding is not specified; the legacy behavior was UTF-8 on Mac OS but the system default on other operating systems (TIKA-3515).
- Change the default rendering strategy for PDFs from NO_TEXT to ALL (TIKA-3520).
Other changes:
- Fixed bug where Tika pointed to the wrong tessdata directory if the user specified a tesseract path but no tessdata path (TIKA-3518).
- Fixed bug in Icu4j's encoding detector where it would return non-standard names for charsets, e.g. IBM424_rtl is now returned as IBM424 (TIKA-3516).
- Add a simple UrlFetcher in tika-core as a basic alternative to tika-fetcher-http (TIKA-3527).
- Add tika-pipes support for Google Cloud Storage (TIKA-3524).
- Fix markup ordering errors in xhtml output for ODT files (TIKA-2242).
- Fix serialization of embedded docs in OpenSearch emitter and fix embedded documents not being indexed in some use-cases in the Solr emitter (TIKA-3490).
- Add pipesClientId system property to PipesServer so that each forked process can log to its own logger (TIKA-3480).
- Add DateNormalizingMetadataFilter to let users ensure that all dates emitted to Solr/OpenSearch are in UTC. Users can configure which timezone to assume in cases where the file format does not store a timezone (TIKA-3496).
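As a rough illustration of the date normalization described for TIKA-3496, the sketch below uses only java.time. It is not Tika's implementation or API: the class name DateNormalizationSketch, the method normalizeToUtc, and the sample zone are hypothetical, and the input is assumed to be an ISO-8601 date-time string.

```java
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical illustration only; this is NOT the DateNormalizingMetadataFilter source.
final class DateNormalizationSketch {

    // Converts an ISO-8601 date-time string to UTC.
    // defaultZoneId is assumed only when the value carries no offset of its own.
    static String normalizeToUtc(String value, String defaultZoneId) {
        OffsetDateTime parsed;
        try {
            // The value stores its own offset, e.g. 2021-08-01T12:00:00+02:00.
            parsed = OffsetDateTime.parse(value);
        } catch (DateTimeParseException e) {
            // No offset stored: interpret the local time in the configured default zone.
            parsed = LocalDateTime.parse(value)
                    .atZone(ZoneId.of(defaultZoneId))
                    .toOffsetDateTime();
        }
        return parsed.withOffsetSameInstant(ZoneOffset.UTC)
                .format(DateTimeFormatter.ISO_OFFSET_DATE_TIME);
    }

    public static void main(String[] args) {
        // The stored offset wins; the default zone is ignored.
        System.out.println(normalizeToUtc("2021-08-01T12:00:00+02:00", "America/New_York"));
        // No stored offset; the default zone is applied before converting to UTC.
        System.out.println(normalizeToUtc("2021-08-01T12:00:00", "America/New_York"));
    }
}
```

In this sketch the default zone matters only for values without an offset, which mirrors the configurable fallback mentioned in the note above. How the actual filter is configured is not shown here.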
The following people have contributed to Tika 2.1.0 by submitting or commenting on the issues resolved in this release:
- Aashish Chaudhary
- Abha
- Albert L.
- Alessandro De Angelis
- Ann Burgess
- Bin Hawking
- Chaitra Rajappa
- Chris A. Mattmann
- Chris Bryant
- Daniel Bonniot de Ruisselet
- Dave Meikle
- David Eric Pugh
- frank
- Graham Charters
- jefferyyuan
- Jukka Zitting
- Julian Reschke
- Kenneth William Krugler
- Konstantin Gribov
- Lewis John McGibbney
- Luís Filipe Nassif
- Łukasz Ozimek
- Madhav Sharan
- Markus Jelsma
- Michael McCandless
- Nick Burch
- Paul Ramirez
- Peter Kronenberg
- RameshKalidindi
- Ravi
- Ray Gauss II
- Reinhard Pötz
- Roberto Benedetti
- Rupert Westenthaler
- Sam H
- Sebastian Nagel
- Sergey Beryozkin
- Shubhangi Raut
- Thomas Mortagne
- Tilman Hausherr
- Tim Allison
- Tyler Bui-Palsulich
- Uwe Schindler
- Yaniv Kunda
See https://s.apache.org/h8ik6 for more details on these contributions.