Apache Tika 1.18

The most notable changes in Tika 1.18 over the previous release are:

  • Upgrade to Jackson 2.9.5 (TIKA-2634).
  • Add support for brotli (TIKA-2621).
  • Upgrade PDFBox to 2.0.9 and include new jbig2-imageio from org.apache.pdfbox (TIKA-2579 and TIKA-2607).
  • Support for TIFF images in PDF files (TIKA-2338).
  • Detection of fully encrypted 7z files (TIKA-2568).
  • Various new mimes and typo fixes in tika-mimetypes.xml via Andreas Meier (TIKA-2527).
  • Revert to listenForAllRecords=false in ExcelExtractor via Grigoriy Alekseev (TIKA-2590).
  • Add workaround to identify TIFFs that might confuse commons-compress's tar detection via Daniel Schmidt (TIKA-2591).
  • Ignore non-IANA supported charsets in HTML meta-headers during charset detection in HTMLEncodingDetector via Andreas Meier (TIKA-2592).
  • Add detection and parsing of zstd (if user provides com.github.luben:zstd-jni) via Andreas Meier (TIKA-2576).
  • Allow for RFC822 detection for files starting with "dkim-" and/or "x-" via Andreas Meier (TIKA-2578 and TIKA-2587).
  • Extract xlsx files embedded in OLE objects within PPT and PPTX via Brian McColgan (TIKA-2588).
  • Extract files embedded in HTML and javascript inside HTML that are stored in the Data URI scheme (TIKA-2563).
  • Extract text from grouped text boxes in PPT (TIKA-2569).
  • Extract language metadata item from PDF files via Matt Sheppard (TIKA-2559).
  • In RFC822 messages with multipart/mixed, treat the first text element as the main body of the email, not as an attachment (TIKA-2547).
  • Swap out com.tdunning:json for com.github.openjson:openjson to avoid jar conflicts (TIKA-2556).
  • No longer hardcode HtmlParser for XML files in tika-server (TIKA-2551).
  • Require Java 8 (TIKA-2553).
  • Add a parser for XPS (TIKA-2524).
  • Add MIME magic for Dolby Digital AC3 and EAC3 files.
  • Fix bug where TesseractOCRParser ignored the configured ImageMagickPath, and set the rotation script to ignore Python warnings (TIKA-2509).
  • Upgrade geo-apis to 3.0.1 (TIKA-2535).
  • Added local Docker image build using dockerfile-maven-plugin to allow images to be built from source (TIKA-1518).
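
As an illustration of the Data URI scheme mentioned for TIKA-2563, the following is a minimal sketch of how a "data:" URI can be split into a media type and decoded bytes. It is not Tika's implementation: the class DataUriSketch, the nested Decoded type, and the helper percentDecode are hypothetical names introduced here, and the default media type is taken from RFC 2397 rather than from the Tika source.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical sketch, not part of Tika: decodes one RFC 2397 data URI.
class DataUriSketch {

    /** Media type and decoded payload of a single data URI. */
    static final class Decoded {
        final String mediaType;
        final byte[] bytes;

        Decoded(String mediaType, byte[] bytes) {
            this.mediaType = mediaType;
            this.bytes = bytes;
        }
    }

    /** Splits "data:[<mediatype>][;base64],<data>" and decodes the data part. */
    static Decoded decode(String uri) {
        if (!uri.startsWith("data:")) {
            throw new IllegalArgumentException("not a data URI");
        }
        int comma = uri.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("data URI has no comma separator");
        }
        String header = uri.substring("data:".length(), comma);
        String payload = uri.substring(comma + 1);

        boolean base64 = header.endsWith(";base64");
        String mediaType = base64
                ? header.substring(0, header.length() - ";base64".length())
                : header;
        if (mediaType.isEmpty()) {
            // Default media type named in RFC 2397.
            mediaType = "text/plain;charset=US-ASCII";
        }
        byte[] bytes = base64
                ? Base64.getDecoder().decode(payload)
                : percentDecode(payload);
        return new Decoded(mediaType, bytes);
    }

    /** Replaces each %XX escape with the byte it encodes; other bytes pass through. */
    static byte[] percentDecode(String s) {
        byte[] in = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < in.length; i++) {
            if (in[i] == '%' && i + 2 < in.length) {
                int hi = Character.digit((char) in[i + 1], 16);
                int lo = Character.digit((char) in[i + 2], 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo);
                    i += 2;
                    continue;
                }
            }
            out.write(in[i]);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Decoded d = decode("data:text/plain;base64,SGVsbG8=");
        System.out.println(d.mediaType + " -> "
                + new String(d.bytes, StandardCharsets.UTF_8));
    }
}
```

An extractor built on this idea would hand the decoded bytes and media type to whatever component processes embedded documents; how Tika itself does this is not shown here.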

The following people have contributed to Tika 1.18 by submitting or commenting on the issues resolved in this release:

  • Andreas Meier
  • Andrei Rebegea
  • Anto
  • Asela
  • Brian McColgan
  • Daniel Schmidt
  • Dave Meikle
  • David Pilato
  • Ewan Mellor
  • Grigoriy Alekseev
  • Guillaume Smet
  • Julian Reschke
  • Konstantin Gribov
  • Luis Filipe Nassif
  • Manolo Caracuel
  • Marc Prudhommeaux
  • Matt Sheppard
  • Nick Burch
  • Nicolas Belisle
  • Nik Everett
  • Ohad R
  • Peter Davies
  • Richard A
  • Richard Jones
  • Sasha Goodman
  • Stefan Sveen
  • Tim Allison

See https://s.apache.org/CJNU for more details on these contributions.