Apache Tika 1.17
The most notable changes in Tika 1.17 over the previous release are:
- This will be the last version that supports Java 7. The next version will require Java 8.
- Fix thread-safety in ChmExtractor (TIKA-2519).
- Upgrade cxf to 3.0.16 (TIKA-2516).
- Allow users to configure maxMainMemoryBytes for PDFs via shrike (PR-213).
- Extract underline and strikethrough in docx (TIKA-2347 and TIKA-2512).
- Cache TikaConfig in EmbeddedDocumentUtil for better performance in documents with a large number of attachments (TIKA-2511).
- Extract media files from ooxml (TIKA-2510).
- Standardize the way the Image and Video captioning dockers and extraction work (TIKA-2400, Github-208).
- Upgrade to xmpcore 5.1.3 (TIKA-2034).
- Upgrade to metadata-extractor 2.10.1 (TIKA-2486).
- Upgrade to OpenNLP 1.8.3 (TIKA-2502).
- Upgrade to Jackson 2.9.2 (TIKA-2501).
- Catch potential NPE in getting InputStream for attachments in PST file (TIKA-2488).
- Upgrade to PDFBox 2.0.8 (TIKA-2489).
- Allow configuration of markLimit in EncodingDetectors via tika-config.xml (TIKA-2485).
- RFC822Parser now selects the best alternative for multipart/alternative body components. This aligns with the behavior of the OutlookParser (TIKA-2478). Users can select legacy behavior via the "extractAllAlternatives" parameter in the RFC822 parser definition in tika-config.xml.
- Narrow mime detection for ms-owner files and add detection for .nls files (TIKA-2469).
- Fix bug in CharsetDetector that led to different detected charsets depending on whether the user called setText with a byte[] or an InputStream, via Sean Story (TIKA-2475).
- Remove JAXB for easier use with Java 9 via Robert Munteanu (TIKA-2466).
- Upgrade to POI 3.17 (TIKA-2429).
- Enable extraction of standard references from text (TIKA-2449).
- Load external custom mimetypes XML from system property tika.custom-mimetypes (TIKA-2460).
- Extract number of tiffs in a multi-page tiff (TIKA-2451).
- Fix detection of emails extracted from mbox (TIKA-2456).
- Add OverrideDetector and allow PSTParser to specify body content type as text or html, to avoid incorrect auto-detection of rfc/mbox, etc. (TIKA-2454).
- AutoDetectParser throws ZeroByteFileException for zero-byte files after detection on the file extension (TIKA-2450).
- Extract phonetic runs in docx with experimental SAX parser (TIKA-2448).
- Extract phonetic runs from xls and allow users to turn off extraction of phonetic runs in both xls and xlsx (TIKA-2440).
- OOXML locale should be set by POI's LocaleUtil not Locale.getDefault(). Fix unit tests to be robust against different locales in OOXML and ExcelParser (TIKA-2438).
- Tika now supports automatic image captioning, which combines Computer Vision and Natural Language Processing to generate a readable caption for an image (TIKA-2262, TIKA-2355, TIKA-2402, Gh-198, Gh-196, Gh-189).
- Add TestCorruptedFiles to allow devs to test parsers against corrupted input files (TIKA-2430).
- Correct the mimetype definition for Windows batch files (CMD and BAT), which are the same (TIKA-2445).
- PSDParser memory use improvements (TIKA-2447).
- Add underline extraction from Word documents (doc/docx) via Stuart Hendren, as well as strikethrough extraction in docx (TIKA-2347, Github-173).
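As an illustration of the tika.custom-mimetypes entry (TIKA-2460) above, the following is a minimal sketch that uses only the JDK. It assumes the property value is a filesystem path to an XML file in Tika's mime-info format and that the property has to be set before Tika first loads its mime types; neither detail is stated in these notes. The class name, the application/x-example type, and the *.example glob are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class CustomMimeTypesSetup {

    // System property named in the TIKA-2460 entry.
    static final String CUSTOM_MIMETYPES_PROPERTY = "tika.custom-mimetypes";

    // Hypothetical mime type and glob, for illustration only.
    static final String CUSTOM_MIMETYPES_XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<mime-info>\n"
        + "  <mime-type type=\"application/x-example\">\n"
        + "    <glob pattern=\"*.example\"/>\n"
        + "  </mime-type>\n"
        + "</mime-info>\n";

    // Writes the definitions to a temporary file and points the
    // system property at that file's absolute path.
    static Path installCustomMimeTypes() throws IOException {
        Path file = Files.createTempFile("custom-mimetypes", ".xml");
        Files.write(file, CUSTOM_MIMETYPES_XML.getBytes(StandardCharsets.UTF_8));
        System.setProperty(CUSTOM_MIMETYPES_PROPERTY,
                file.toAbsolutePath().toString());
        return file;
    }

    public static void main(String[] args) throws IOException {
        Path file = installCustomMimeTypes();
        // Assumption: any Tika classes are first used only after this point.
        System.out.println(CUSTOM_MIMETYPES_PROPERTY + "=" + file);
    }
}
```

The same effect can presumably be had without code by passing -Dtika.custom-mimetypes=<path> on the JVM command line.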
The following people have contributed to Tika 1.17 by submitting or commenting on the issues resolved in this release:
- Aashish Chaudhary
- Abhijit Rajwade
- Advokat
- Albert L.
- Alessandro De Angelis
- Aman R Mathur
- Ann Burgess
- Bin Hawking
- Bob Paulin
- Chris A. Mattmann
- Chris Bryant
- Chris Wilson
- Daniel Bonniot de Ruisselet
- Dave Meikle
- Dillon Welch
- Dustin Spicuzza
- Eamonn Saunders
- frank
- Giuseppe Totaro
- Jan Burkhardt
- jefferyyuan
- Julian Reschke
- Karl Buchta
- Karl Richter
- Ken Krugler
- Konstantin Gribov
- Lewis John McGibbney
- Luis Filipe Nassif
- Łukasz Ozimek
- Madhav Sharan
- Markus Jelsma
- Matthew Caruana Galizia
- Michael McCandless
- Mike Cantrell
- Nick Burch
- Paul Ramirez
- Peter Weiss
- RameshKalidindi
- Ravi
- Ray Gauss II
- Reinhard Schwab
- Robert Letzler
- Robert Munteanu
- Roberto Benedetti
- Rupert Westenthaler
- Sam H
- Sergey Beryozkin
- Sergey Tsalkov
- Stefano Fornari
- Stuart Hendren
- Takahiro Ochi
- Thamme Gowda
- Thejan Wijesinghe
- Thomas Mortagne
- Tilman Hausherr
- Tim Allison
- Tyler Palsulich
- TzeKai Lee
- Uwe Schindler
- Yaniv Kunda
See https://s.apache.org/bX5z for more details on these contributions.