Apache Tika 1.19

The most notable changes in Tika 1.19 over the previous release are:

  • Require Java 8 (TIKA-2679).
  • Enable building with Java 11 (TIKA-2668).
  • Add an option to make tika-server robust against infinite loops, OOMs, and memory leaks (TIKA-2725).
  • Allow configuration of the Tesseract parser via the standard tika-config.xml options (TIKA-2705).
  • Improve handling of empty cells across table-based formats (TIKA-2479).
  • Add a standards-compliant HTML encoding detector, contributed by Gerard Bouchar (TIKA-2673).
  • Improved XML parsing: the default number of entity expansions is now limited to 20. To raise this limit, add -Djdk.xml.entityExpansionLimit=XXX to your command line.
  • Mime magic improvements for Olympus RAW (TIKA-2658), interpreted server-side languages via HTTP (TIKA-2648), and MHTML (TIKA-2723).
  • Add an absolute timeout to ForkParser rather than testing for activity (TIKA-2656).
  • Make the RecursiveParserWrapper work with the ForkParser (TIKA-2655).
  • Allow the ForkParser to specify a directory containing tika-app.jar for use by the ForkServer. This allows users to keep most of the parser dependencies out of their code, and it makes it easy to add optional jars for Parser dependencies, such as the xerial sqlite jar (TIKA-2653).
  • Use a pool for SAXParsers and DOMBuilders rather than creating a new parser/builder for every parse. For better performance, call XMLReaderUtils.setPoolSize() with the number of threads you're using with Tika (TIKA-2645).
  • Add the RecursiveParserWrapperHandler to improve the RecursiveParserWrapper API slightly (TIKA-2644).
  • Upgraded to Commons-Compress 1.18 (TIKA-2707).
  • Upgraded to Apache POI 4.0.0 (TIKA-2552).
  • Upgraded to Apache PDFBox 2.0.11 (TIKA-2681).
  • Upgraded to deeplearning4j 1.0.0-beta2 (TIKA-2672).
  • Upgraded jmatio to 1.4 (TIKA-2667).
  • Upgraded Apache Lucene to 7.4.0 in tika-eval and tika-examples (TIKA-2695).
  • Upgraded junrar to 1.0.1 (TIKA-2664).
  • Numerous other upgrades (TIKA-2692).
  • Excluded Spring as a transitive dependency (TIKA-2721).
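
The effect of the entity expansion limit noted in the XML parsing item above can be illustrated with the JDK's own parser. The following is a minimal sketch, not Tika code: the class and method names (EntityLimitDemo, xmlWithEntityRefs, parses) are hypothetical, and it sets the property with System.setProperty only so that the example is self-contained. It assumes the JDK reads jdk.xml.entityExpansionLimit when a new parser factory is created. With Tika itself, pass the -D flag on the command line as described above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical demo class; it uses only the JDK's built-in XML parser.
public class EntityLimitDemo {

    // Build a small document that references one internal entity `refs` times.
    static String xmlWithEntityRefs(int refs) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\"?>\n");
        sb.append("<!DOCTYPE doc [<!ENTITY e \"x\">]>\n");
        sb.append("<doc>");
        for (int i = 0; i < refs; i++) {
            sb.append("&e;");
        }
        sb.append("</doc>");
        return sb.toString();
    }

    // Returns true if the document parses under the given limit,
    // false if the parser rejects it (for example, too many expansions).
    static boolean parses(String xml, String limit) {
        // Equivalent in spirit to -Djdk.xml.entityExpansionLimit=<limit>;
        // assumed to take effect for factories created after this call.
        System.setProperty("jdk.xml.entityExpansionLimit", limit);
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // The parser reports an exceeded limit as a fatal parse error.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = xmlWithEntityRefs(50);
        System.out.println("limit 20:  " + parses(xml, "20"));
        System.out.println("limit 100: " + parses(xml, "100"));
    }
}
```

A document with 50 entity references is used so that it clearly exceeds a limit of 20 and clearly fits within a limit of 100. The JDK may also print a fatal-error line to standard error when the limit is exceeded.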

The following people have contributed to Tika 1.19 by submitting or commenting on the issues resolved in this release:

  • Abhijit Rajwade
  • Adam Rauch
  • Andreas Meier
  • Annie Didier
  • Celpan Valeria
  • Chris A. Mattmann
  • Gerard Bouchar
  • Hans Brende
  • Karanjeet Singh
  • Karl Wright
  • Ken Krugler
  • Konstantin Gribov
  • Lewis John McGibbney
  • Sebastian Nagel
  • Slava G
  • Thorsten Schäfer
  • Tim Allison
  • Vincent van Donselaar
  • Yuriy Koval
  • Yury Kats

See https://s.apache.org/dG8B for more details on these contributions.