Apache Tika 1.16

The most notable changes in Tika 1.16 over the previous release are:

  • Exclude jj2000 from edu.ucar grib to avoid potential license conflicts with ASL 2.0.
  • Add age recognition using an ensemble model that combines linear regression and Apache OpenNLP Maximum Entropy. Tika can now detect age from text (TIKA-1988).
  • Add Tika Deep Learning support for the VGG16 model for Very Deep Convolutional Networks for Large-Scale Image Recognition. Now Tika supports both Inception v3/v4 and VGG16 based image recognition (TIKA-2298).
  • Extract macros from PPT (TIKA-2089).
  • Extract absolute path for last saved location when available in .xlsx and .xlsb (TIKA-2335).
  • Rename SentimentParser to SentimentAnalysisParser to prevent conflict with dependency (TIKA-2368).
  • tika-app now extracts inline images in PDFs by default, and it includes a warning to users that this is not the default behavior elsewhere in Tika (TIKA-2374).
  • Allow configurability of warnings for problems during parser initialization (TIKA-2389).
  • Update to Jackcess 2.1.8 (TIKA-2380).
  • Upgrade to POI 3.17-beta1 (TIKA-2336).
  • Remove non-ASL-2.0-compatible org.json (TIKA-1804).
  • Allow extraction of script elements in HTML as embedded "MACRO". Users must turn this on via TikaConfig (TIKA-2391).
  • Allow users to turn off extraction of headers and footers from .doc, .docx, .xls, .xlsx, .xlsb (TIKA-2362).
  • Extract text from charts in .docx, .pptx, .xlsx and .xlsb (TIKA-2254).
  • Extract text from diagrams in .docx, .pptx, .xlsx and .xlsb (TIKA-1945).
  • Fix bug in tika-server that led to an attempt to close the input stream twice (TIKA-2384).
  • Enable base32 encoding of digests and enable BouncyCastle implementations of digest algorithms (TIKA-2386).
  • Change the canonical MIME type of WAVE audio to audio/vnd.wave, the version defined in RFC 2361; the older audio/x-wav remains as an alias.
  • Upgrade "provided" xerial to 3.19.3 (TIKA-2412).
  • Upgrade Gson to 2.8.1 (TIKA-2414).
  • Upgrade mime4j to 0.8.1 (TIKA-2413).
  • Mime magic improvements for GraphViz (TIKA-2422), HTML files which claim to be XML but aren't quite valid XML (TIKA-2419), and QuickTime / MP4 (TIKA-2418).
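
To make the base32 digest item (TIKA-2386) concrete, the following is a minimal sketch of what base32 encoding of a digest looks like, using only the Java standard library. It is not Tika's implementation and does not use any Tika API: the class name Base32DigestSketch and the methods base32 and base32Digest are hypothetical names chosen for illustration, and the encoder follows the RFC 4648 base32 alphabet with '=' padding as an assumption about the output format.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration class; not part of Tika.
public class Base32DigestSketch {

    // RFC 4648 base32 alphabet (assumed output format).
    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";

    // Encode bytes as base32, padding with '=' to a multiple of 8 characters.
    public static String base32(byte[] data) {
        StringBuilder out = new StringBuilder();
        int buffer = 0; // holds bits not yet emitted
        int bits = 0;   // number of valid bits in buffer
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xFF);
            bits += 8;
            // Emit one character for every complete 5-bit group.
            while (bits >= 5) {
                out.append(ALPHABET.charAt((buffer >> (bits - 5)) & 0x1F));
                bits -= 5;
            }
            // Keep only the leftover bits so the buffer never overflows.
            buffer &= (1 << bits) - 1;
        }
        // Left-align any remaining bits into a final 5-bit group.
        if (bits > 0) {
            out.append(ALPHABET.charAt((buffer << (5 - bits)) & 0x1F));
        }
        while (out.length() % 8 != 0) {
            out.append('=');
        }
        return out.toString();
    }

    // Compute a digest with the given JCA algorithm name and base32-encode it.
    public static String base32Digest(String algorithm, byte[] data) {
        try {
            return base32(MessageDigest.getInstance(algorithm).digest(data));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("Unknown digest algorithm: " + algorithm, e);
        }
    }

    public static void main(String[] args) {
        byte[] content = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println("SHA-1 (base32): " + base32Digest("SHA-1", content));
    }
}
```

A 20-byte SHA-1 digest encodes to exactly 32 base32 characters with no padding, compared with 40 characters in hexadecimal.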

The following people have contributed to Tika 1.16 by submitting or commenting on the issues resolved in this release:

  • Adam Estrada
  • Alessandro Scaldaferro
  • Avtar Singh
  • Bob Paulin
  • Chris A. Mattmann
  • Chris Bamford
  • Christopher Creutzig
  • Claus Ibsen
  • Dave Kincaid
  • gil cattaneo
  • Jorge Spinsanti
  • Nick Burch
  • Nick C
  • Sebastian Nagel
  • Seva Alekseyev
  • Steve Reynolds
  • Tim Allison

See https://s.apache.org/Lpem for more details on these contributions.