Apache Tika 1.16
The most notable changes in Tika 1.16 over the previous release are:
- Exclude jj2000 from edu.ucar grib to avoid potential license conflicts with ASL 2.0.
- Add age recognition using an ensemble model for linear regression and Apache OpenNLP Maximum Entropy. Tika can now detect age from text (TIKA-1988).
- Add Tika Deep Learning support for the VGG16 model for Very Deep Convolutional Networks for Large-Scale Image Recognition. Tika now supports both Inception v3/v4 and VGG16-based image recognition (TIKA-2298).
- Extract macros from PPT (TIKA-2089).
- Extract absolute path for last saved location when available in .xlsx and .xlsb (TIKA-2335).
- Rename SentimentParser to SentimentAnalysisParser to prevent conflict with dependency (TIKA-2368).
- tika-app now extracts inline images in PDFs by default, and it includes a warning to users that this is not the default behavior elsewhere in Tika (TIKA-2374).
- Allow configurability of warnings for problems during parser initialization (TIKA-2389).
- Update to Jackcess 2.1.8 (TIKA-2380).
- Upgrade to POI 3.17-beta1 (TIKA-2336).
- Remove non-ASL-2.0-compatible org.json (TIKA-1804).
- Allow extraction of script elements in HTML as embedded "MACRO". Users must turn this on via TikaConfig (TIKA-2391).
- Allow users to turn off extraction of headers and footers from .doc, .docx, .xls, .xlsx, .xlsb (TIKA-2362).
- Extract text from charts in .docx, .pptx, .xlsx and .xlsb (TIKA-2254).
- Extract text from diagrams in .docx, .pptx, .xlsx and .xlsb (TIKA-1945).
- Fix bug in tika-server that led to an attempt to close the input stream twice (TIKA-2384).
- Enable base32 encoding of digests and enable BouncyCastle implementations of digest algorithms (TIKA-2386).
- Change the canonical MIME type of WAVE audio to audio/vnd.wave, the version defined in RFC 2361; the older audio/x-wav remains as an alias.
- Upgrade "provided" xerial to 3.19.3 (TIKA-2412).
- Upgrade Gson to 2.8.1 (TIKA-2414).
- Upgrade mime4j to 0.8.1 (TIKA-2413).
- MIME magic improvements for GraphViz (TIKA-2422), HTML files which claim to be XML but aren't quite valid XML (TIKA-2419), and QuickTime / MP4 (TIKA-2418).
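
As a rough illustration of what base32 encoding of a digest (TIKA-2386, above) means in practice, the following sketch hashes some bytes with the JDK's MessageDigest and encodes the result with the RFC 4648 base32 alphabet. It uses only the Java standard library and is not Tika's implementation: the class name Base32DigestSketch and both method names are hypothetical, and whether Tika emits '=' padding is not stated in these notes, so the padding here is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration only; this is NOT Tika's digester code.
public class Base32DigestSketch {

    // RFC 4648 base32 alphabet (upper-case letters, then digits 2-7).
    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";

    // Encode bytes as RFC 4648 base32, padded with '=' to a multiple of 8 chars.
    static String base32(byte[] data) {
        StringBuilder out = new StringBuilder();
        int buffer = 0; // holds bits not yet emitted
        int bits = 0;   // number of valid bits in buffer
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xFF);
            bits += 8;
            while (bits >= 5) {
                out.append(ALPHABET.charAt((buffer >> (bits - 5)) & 0x1F));
                bits -= 5;
            }
            buffer &= (1 << bits) - 1; // keep only the leftover bits
        }
        if (bits > 0) {
            // Left-align the final partial group into 5 bits.
            out.append(ALPHABET.charAt((buffer << (5 - bits)) & 0x1F));
        }
        while (out.length() % 8 != 0) {
            out.append('=');
        }
        return out.toString();
    }

    // Hash the data with the named JDK algorithm and base32-encode the digest.
    static String digestBase32(String algorithm, byte[] data) {
        try {
            return base32(MessageDigest.getInstance(algorithm).digest(data));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("Unsupported algorithm: " + algorithm, e);
        }
    }

    public static void main(String[] args) {
        byte[] content = "example content".getBytes(StandardCharsets.UTF_8);
        System.out.println("SHA-1 (base32): " + digestBase32("SHA-1", content));
    }
}
```

A SHA-1 digest is 20 bytes (160 bits), so its base32 form is exactly 32 characters with no padding, compared with 40 characters in hexadecimal.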
The following people have contributed to Tika 1.16 by submitting or commenting on the issues resolved in this release:
- Adam Estrada
- Alessandro Scaldaferro
- Avtar Singh
- Bob Paulin
- Chris A. Mattmann
- Chris Bamford
- Christopher Creutzig
- Claus Ibsen
- Dave Kincaid
- gil cattaneo
- Jorge Spinsanti
- Nick Burch
- Nick C
- Sebastian Nagel
- Seva Alekseyev
- Steve Reynolds
- Tim Allison
See https://s.apache.org/Lpem for more details on these contributions.