Apache Tika

Apache Tika 1.17

The most notable changes in Tika 1.17 over the previous release are:

This will be the last version that supports Java 7. The next version will require Java 8.
Fix thread-safety in ChmExtractor (TIKA-2519).
Upgrade cxf to 3.0.16 (TIKA-2516).
Allow users to configure maxMainMemoryBytes for PDFs via shrike (PR-213).
Extract underline and strikethrough in docx (TIKA-2347 and TIKA-2512).
Cache TikaConfig in EmbeddedDocumentUtil for better performance in documents with a large number of attachments (TIKA-2511).
Extract media files from ooxml (TIKA-2510).
Standardize the way the Image and Video captioning dockers and extraction work (TIKA-2400, Github-208).
Upgrade to xmpcore 5.1.3 (TIKA-2034).
Upgrade to metadata-extractor 2.10.1 (TIKA-2486).
Upgrade to OpenNLP 1.8.3 (TIKA-2502).
Upgrade to Jackson 2.9.2 (TIKA-2501).
Catch potential NPE in getting InputStream for attachments in PST file (TIKA-2488).
Upgrade to PDFBox 2.0.8 (TIKA-2489).
Allow configuration of markLimit in EncodingDetectors via tika-config.xml (TIKA-2485).
RFC822Parser now selects the best alternative for multipart/alternative body components. This aligns with the behavior of the OutlookParser (TIKA-2478). Users can select legacy behavior via the "extractAllAlternatives" parameter in the RFC822 parser definition in tika-config.xml.
Narrow mime detection for ms-owner files and add detection for .nls files (TIKA-2469).
Fix bug in CharsetDetector that led to different detected charsets depending on whether the user called setText with a byte[] or an InputStream via Sean Story (TIKA-2475).
Remove JAXB for easier use with Java 9 via Robert Munteanu (TIKA-2466).
Upgrade to POI 3.17 (TIKA-2429).
Enable extraction of standard references from text (TIKA-2449).
Load external custom mimetypes XML from system property tika.custom-mimetypes (TIKA-2460).
Extract number of tiffs in a multi-page tiff (TIKA-2451).
Fix detection of emails extracted from mbox (TIKA-2456).
Add OverrideDetector and allow PSTParser to specify body content type as text or html -- to avoid incorrect auto-detection of rfc/mbox, etc. (TIKA-2454).
AutoDetectParser throws ZeroByteFileException for zero-byte files after detection on the file extension (TIKA-2450).
Extract phonetic runs in docx with experimental SAX parser (TIKA-2448).
Extract phonetic runs from xls and allow users to turn off extraction of phonetic runs in both xls and xlsx (TIKA-2440).
OOXML locale should be set by POI's LocaleUtil not Locale.getDefault(). Fix unit tests to be robust against different locales in OOXML and ExcelParser (TIKA-2438).
Tika now supports automatic image captioning, which combines computer vision and natural language processing to generate a readable caption for an image (TIKA-2262, TIKA-2355, TIKA-2402, Gh-198, Gh-196, Gh-189).
Add TestCorruptedFiles to allow devs to test parsers against corrupted input files (TIKA-2430).
Correct Mimetype definition for Windows batch files (CMD and BAT), which are the same (TIKA-2445).
PSDParser memory use improvements (TIKA-2447).
Add underline extraction from Word documents (doc/docx) via Stuart Hendren as well as strikethrough extraction in docx (TIKA-2347, Github-173).
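
One of the changes above lets Tika load external custom mimetypes XML from the system property tika.custom-mimetypes (TIKA-2460). The following is a minimal Java sketch of setting that property from code. The class name, the method name, and the file name custom-mimetypes.xml are hypothetical and not part of Tika. The sketch does not call Tika itself, and it assumes the property has to be set before Tika first loads its mimetype definitions.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of Tika.
class CustomMimeTypesProperty {

    // Property name as given in the release note for TIKA-2460.
    static final String PROPERTY = "tika.custom-mimetypes";

    // Sets the property to the absolute path of the given XML file
    // and returns the path that was stored.
    static String configure(Path customMimeTypesXml) {
        String absolutePath = customMimeTypesXml.toAbsolutePath().toString();
        System.setProperty(PROPERTY, absolutePath);
        return absolutePath;
    }

    public static void main(String[] args) {
        // "custom-mimetypes.xml" is a placeholder file name (assumption).
        String path = configure(Paths.get("custom-mimetypes.xml"));
        System.out.println(PROPERTY + " = " + path);
    }
}
```

The same value can be supplied without code through the standard JVM flag, for example -Dtika.custom-mimetypes=/path/to/custom-mimetypes.xml, where the path is again a placeholder.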

The following people have contributed to Tika 1.17 by submitting or commenting on the issues resolved in this release:
Aashish Chaudhary
Abhijit Rajwade
Advokat
Albert L.
Alessandro De Angelis
Aman R Mathur
Ann Burgess
Bin Hawking
Bob Paulin
Chris A. Mattmann
Chris Bryant
Chris Wilson
Daniel Bonniot de Ruisselet
Dave Meikle
Dillon Welch
Dustin Spicuzza
Eamonn Saunders
frank
Giuseppe Totaro
Jan Burkhardt
jefferyyuan
Julian Reschke
Karl Buchta
Karl Richter
Ken Krugler
Konstantin Gribov
Lewis John McGibbney
Luis Filipe Nassif
Łukasz Ozimek
Madhav Sharan
Markus Jelsma
Matthew Caruana Galizia
Michael McCandless
Mike Cantrell
Nick Burch
Paul Ramirez
Peter Weiss
RameshKalidindi
Ravi
Ray Gauss II
Reinhard Schwab
Robert Letzler
Robert Munteanu
Roberto Benedetti
Rupert Westenthaler
Sam H
Sergey Beryozkin
Sergey Tsalkov
Stefano Fornari
Stuart Hendren
Takahiro Ochi
Thamme Gowda
Thejan Wijesinghe
Thomas Mortagne
Tilman Hausherr
Tim Allison
Tyler Palsulich
TzeKai Lee
Uwe Schindler
Yaniv Kunda

See https://s.apache.org/bX5z for more details on these contributions.
