Apache Tika 1.17

The most notable changes in Tika 1.17 over the previous release are:

  • This will be the last version that supports Java 7. The next version will require Java 8.
  • Fix thread-safety in ChmExtractor (TIKA-2519).
  • Upgrade cxf to 3.0.16 (TIKA-2516).
  • Allow users to configure maxMainMemoryBytes for PDFs via shrike (PR-213).
  • Extract underline and strikethrough in docx (TIKA-2347 and TIKA-2512).
  • Cache TikaConfig in EmbeddedDocumentUtil for better performance in documents with a large number of attachments (TIKA-2511).
  • Extract media files from ooxml (TIKA-2510).
  • Standardize the way the Image and Video captioning dockers and extraction work (TIKA-2400, Github-208).
  • Upgrade to xmpcore 5.1.3 (TIKA-2034).
  • Upgrade to metadata-extractor 2.10.1 (TIKA-2486).
  • Upgrade to OpenNLP 1.8.3 (TIKA-2502).
  • Upgrade to Jackson 2.9.2 (TIKA-2501).
  • Catch potential NPE in getting InputStream for attachments in PST file (TIKA-2488).
  • Upgrade to PDFBox 2.0.8 (TIKA-2489).
  • Allow configuration of markLimit in EncodingDetectors via tika-config.xml (TIKA-2485).
  • RFC822Parser now selects the best alternative for multipart/alternative body components. This aligns with the behavior of the OutlookParser (TIKA-2478). Users can select legacy behavior via the "extractAllAlternatives" parameter in the RFC822 parser definition in tika-config.xml.
  • Narrow mime detection for ms-owner files and add detection for .nls files (TIKA-2469).
  • Fix bug in CharsetDetector that led to different detected charsets depending on whether the user called setText with a byte[] or an InputStream, via Sean Story (TIKA-2475).
  • Remove JAXB for easier use with Java 9 via Robert Munteanu (TIKA-2466).
  • Upgrade to POI 3.17 (TIKA-2429).
  • Enable extraction of standard references from text (TIKA-2449).
  • Load external custom mimetypes XML from system property tika.custom-mimetypes (TIKA-2460).
  • Extract number of tiffs in a multi-page tiff (TIKA-2451).
  • Fix detection of emails extracted from mbox (TIKA-2456).
  • Add OverrideDetector and allow PSTParser to specify body content type as text or html -- to avoid incorrect auto-detection of rfc/mbox, etc. (TIKA-2454).
  • AutoDetectParser throws ZeroByteFileException for zero-byte files after detection on the file extension (TIKA-2450).
  • Extract phonetic runs in docx with experimental SAX parser (TIKA-2448).
  • Extract phonetic runs from xls and allow users to turn off extraction of phonetic runs in both xls and xlsx (TIKA-2440).
  • OOXML locale should be set by POI's LocaleUtil not Locale.getDefault(). Fix unit tests to be robust against different locales in OOXML and ExcelParser (TIKA-2438).
  • Tika now supports automatic image captioning, which combines Computer Vision and Natural Language Processing to generate a readable caption for an image (TIKA-2262, TIKA-2355, TIKA-2402, Gh-198, Gh-196, Gh-189).
  • Add TestCorruptedFiles to allow devs to test parsers against corrupted input files (TIKA-2430).
  • Correct the mime type definition for Windows batch files (CMD and BAT), which are the same format (TIKA-2445).
  • PSDParser memory use improvements (TIKA-2447).
  • Add underline extraction from Word documents (doc/docx) via Stuart Hendren, as well as strikethrough extraction in docx (TIKA-2347, Github-173).
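
Two of the changes above are driven by tika-config.xml: the markLimit setting for EncodingDetectors (TIKA-2485) and the "extractAllAlternatives" parameter for the RFC822 parser (TIKA-2478). The fragment below is a minimal sketch of how such a file might look, not an authoritative reference. The parameter names come from the notes above; the fully qualified class names, the type attributes ("int", "bool"), the value 64000, and the exclusion of RFC822Parser from DefaultParser are assumptions that should be checked against the Tika configuration documentation for your version.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <!-- Assumed class name; markLimit caps how many bytes the detector may read. -->
  <encodingDetectors>
    <encodingDetector class="org.apache.tika.parser.html.HtmlEncodingDetector">
      <params>
        <!-- 64000 is an illustrative value, not a recommended default. -->
        <param name="markLimit" type="int">64000</param>
      </params>
    </encodingDetector>
  </encodingDetectors>
  <parsers>
    <!-- Keep the default parsers, but configure RFC822Parser separately. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.mail.RFC822Parser"/>
    </parser>
    <parser class="org.apache.tika.parser.mail.RFC822Parser">
      <params>
        <!-- true restores the legacy behavior of extracting every alternative. -->
        <param name="extractAllAlternatives" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

Note that listing encoding detectors explicitly may replace the default set rather than extend it, so a real configuration would likely name every detector it still needs.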

The following people have contributed to Tika 1.17 by submitting or commenting on the issues resolved in this release:

  • Aashish Chaudhary
  • Abhijit Rajwade
  • Advokat
  • Albert L.
  • Alessandro De Angelis
  • Aman R Mathur
  • Ann Burgess
  • Bin Hawking
  • Bob Paulin
  • Chris A. Mattmann
  • Chris Bryant
  • Chris Wilson
  • Daniel Bonniot de Ruisselet
  • Dave Meikle
  • Dillon Welch
  • Dustin Spicuzza
  • Eamonn Saunders
  • frank
  • Giuseppe Totaro
  • Jan Burkhardt
  • jefferyyuan
  • Julian Reschke
  • Karl Buchta
  • Karl Richter
  • Ken Krugler
  • Konstantin Gribov
  • Lewis John McGibbney
  • Luis Filipe Nassif
  • Łukasz Ozimek
  • Madhav Sharan
  • Markus Jelsma
  • Matthew Caruana Galizia
  • Michael McCandless
  • Mike Cantrell
  • Nick Burch
  • Paul Ramirez
  • Peter Weiss
  • RameshKalidindi
  • Ravi
  • Ray Gauss II
  • Reinhard Schwab
  • Robert Letzler
  • Robert Munteanu
  • Roberto Benedetti
  • Rupert Westenthaler
  • Sam H
  • Sergey Beryozkin
  • Sergey Tsalkov
  • Stefano Fornari
  • Stuart Hendren
  • Takahiro Ochi
  • Thamme Gowda
  • Thejan Wijesinghe
  • Thomas Mortagne
  • Tilman Hausherr
  • Tim Allison
  • Tyler Palsulich
  • TzeKai Lee
  • Uwe Schindler
  • Yaniv Kunda

See https://s.apache.org/bX5z for more details on these contributions.