Apache Tika

Apache Tika 1.14

The most notable changes in Tika 1.14 over the previous release are:

Extract all headers from MSG/RFC822 (TIKA-2122).
9.1 (TIKA-2113).
Extract PDF DocInfo metadata into separate keys to prevent overwriting by XMP metadata (TIKA-2057).
Re-enable fileUrl for tika-server (TIKA-2081). If you choose to use this feature, beware of the security vulnerabilities! See: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-3271
Add Tesseract's hOCR output format as an option, via Eric Pugh (TIKA-2093).
Extract macros from MSOffice files (TIKA-2069).
Maintain passed-in mime in TXTParser (TIKA-2047).
Upgrade to POI 3.15 (TIKA-2013).
0.3 (TIKA-2051).
Fix hyperlinks with formatting in DOC and DOCX (TIKA-1255 and TIKA-2078).
Tika is now integrated with Google's TensorFlow library and can use its Inception v3 image classification model to identify objects in images (TIKA-1993).
Parser configuration is now type-safe, and parameters for parsers can have assigned types (TIKA-1508, TIKA-1986).
Prevent OOM/permanent hang on some corrupt CHM files (TIKA-2040).
Upgrade ICU4J charset detection components to fix multithreading bug (TIKA-2041).
1.4 (TIKA-2039).
Maintain more significant digits in cells of "General" format in XLS and XLSX (TIKA-2025).
Avoid mark/reset issues when extracting or detecting embedded resources in RFC822 emails (TIKA-2037).
Improve accuracy of Tesseract for better extraction of numeric and alphanumeric text from images (TIKA-2021, TIKA-2031).
Improve extraction of embedded documents from PPT, PPTX and XLSX (TIKA-2026).
Add parser for applefile (AppleSingle) (TIKA-2022).
Add mime types, mime magic and/or globs for:
Endnote Import File (TIKA-2011)
DJVU files (TIKA-2009)
MS Owner File (TIKA-2008)
Windows Media Metafile (TIKA-2004)
iCal and vCalendar (TIKA-2006)
MBOX (TIKA-2042)
Stata DTA (TIKA-2064)
Add configurable maximum threshold for number of events extracted from the XMP Media Management Schema in JempboxExtractor (TIKA-1999).
Integrate TesseractOCR with full page image rendering for PDFs (TIKA-1994).
Add mime detection (via Nick C) and parser for DBF files (TIKA-1513).
Add mime detection and parsers for MSOffice 2003 XML Word and Excel formats (TIKA-1958).
Extract hyperlinks from PPT, PPTX, XLSX (TIKA-1454).
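
The DocInfo change listed above (TIKA-2057) can be pictured with a small sketch. This is not Tika code: plain maps stand in for Tika's metadata object, the class and method names are hypothetical, and the "pdf:docinfo:" prefix and "dc:title" key are assumptions used for illustration, so check the metadata keys your Tika version actually emits. The sketch only shows why writing DocInfo values under their own keys keeps them from being replaced when XMP supplies a value for the same field.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a plain Map stands in for Tika's metadata object.
class DocInfoKeysSketch {

    // Assumed prefix for DocInfo-specific keys (hypothetical for this sketch).
    static final String DOC_INFO_PREFIX = "pdf:docinfo:";

    // Simplified model of a shared key: both sources write to "dc:title",
    // so the later (XMP) write replaces the DocInfo value.
    static Map<String, String> sharedKey(String docInfoTitle, String xmpTitle) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("dc:title", docInfoTitle);
        metadata.put("dc:title", xmpTitle);
        return metadata;
    }

    // Simplified model of separate keys: the DocInfo value gets its own
    // prefixed key, so both values remain available to the caller.
    static Map<String, String> separateKeys(String docInfoTitle, String xmpTitle) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put(DOC_INFO_PREFIX + "title", docInfoTitle);
        metadata.put("dc:title", xmpTitle);
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(sharedKey("Draft report", "Final report"));
        System.out.println(separateKeys("Draft report", "Final report"));
    }
}
```

With separate keys, a caller that cares specifically about the DocInfo dictionary can read the prefixed entry, while code that only wants a general title can keep reading the common key.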

The following people have contributed to Tika 1.14 by submitting or commenting on the issues resolved in this release:

Aeham Abushwashi
Alan Hunter
Alexander Kazakov
Chris A. Mattmann
Chris Knott
Egbert
Eli Trucco
Eric Pugh
Jean Coudon
Jeff Swindle
John Dougrez-Lewis
John Haynes
Joseph Naegele
Josh Cummings
Ken Krugler
Kukushkin Alexander
Lewis John McGibbney
Luis Filipe Nassif
Matthias Pigulla
Nam-Quang Tran
Nilay Chheda
Philipp Steinkrueger
Sara Miller
Sebastian Iturra
Thamme Gowda
Tilman Hausherr
Tim Allison
Tim Barrett
Vjeran Marcinko
Yahav Amsalem
Zarana Parekh
