Apache Tika 1.14

The most notable changes in Tika 1.14 over the previous release are:

  • Extract all headers from MSG/RFC822 (TIKA-2122).
  • 9.1 (TIKA-2113).
  • Extract PDF DocInfo metadata into separate keys to prevent overwriting by XMP metadata (TIKA-2057).
  • Re-enable fileUrl for tika-server (TIKA-2081). If you choose to use this feature, beware of the security vulnerabilities! See: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-3271
  • Add Tesseract's hOCR output format as an option, via Eric Pugh (TIKA-2093).
  • Extract macros from MSOffice files (TIKA-2069).
  • Maintain passed-in mime in TXTParser (TIKA-2047).
  • Upgrade to POI 3.15 (TIKA-2013).
  • 0.3 (TIKA-2051).
  • Fix hyperlinks with formatting in DOC and DOCX (TIKA-1255 and TIKA-2078).
  • Tika is now integrated with the Tensorflow library from Google and can use its Inception v3 image classification model to identify objects in images (TIKA-1993).
  • Parser configuration is now type-safe and parameters for parsers can have assigned types (TIKA-1508, TIKA-1986).
  • Prevent OOM/permanent hang on some corrupt CHM files (TIKA-2040).
  • Upgrade ICU4J charset detection components to fix multithreading bug (TIKA-2041).
  • 1.4 (TIKA-2039).
  • Maintain more significant digits in cells of "General" format in XLS and XLSX (TIKA-2025).
  • Avoid mark/reset issues when extracting or detecting embedded resources in RFC822 emails (TIKA-2037).
  • Improve accuracy of Tesseract for better extraction of numeric and alphanumeric text from images (TIKA-2021, TIKA-2031).
  • Improve extraction of embedded documents from PPT, PPTX and XLSX (TIKA-2026).
  • Add parser for applefile (AppleSingle) (TIKA-2022).
  • Add mime types, mime magic and/or globs for:
      • Endnote Import File (TIKA-2011)
      • DJVU files (TIKA-2009)
      • MS Owner File (TIKA-2008)
      • Windows Media Metafile (TIKA-2004)
      • iCal and vCalendar (TIKA-2006)
      • MBOX (TIKA-2042)
      • Stata DTA (TIKA-2064)
  • Add configurable maximum threshold for number of events extracted from the XMP Media Management Schema in JempboxExtractor (TIKA-1999).
  • Integrate TesseractOCR with full page image rendering for PDFs (TIKA-1994).
  • Add mime detection (via Nick C) and parser for DBF files (TIKA-1513).
  • Add mime detection and parsers for MSOffice 2003 XML Word and Excel formats (TIKA-1958).
  • Extract hyperlinks from PPT, PPTX, XLSX (TIKA-1454).

The following people have contributed to Tika 1.14 by submitting or commenting on the issues resolved in this release:

  • Aeham Abushwashi
  • Alan Hunter
  • Alexander Kazakov
  • Chris A. Mattmann
  • Chris Knott
  • Egbert
  • Eli Trucco
  • Eric Pugh
  • Jean Coudon
  • Jeff Swindle
  • John Dougrez-Lewis
  • John Haynes
  • Joseph Naegele
  • Josh Cummings
  • Ken Krugler
  • Kukushkin Alexander
  • Lewis John McGibbney
  • Luis Filipe Nassif
  • Matthias Pigulla
  • Nam-Quang Tran
  • Nilay Chheda
  • Philipp Steinkrueger
  • Sara Miller
  • Sebastian Iturra
  • Thamme Gowda
  • Tilman Hausherr
  • Tim Allison
  • Tim Barrett
  • Vjeran Marcinko
  • Yahav Amsalem
  • Zarana Parekh

See https://s.apache.org/TRWa for more details on these contributions.