Apache Tika 1.22

The most notable changes in Tika 1.22 over the previous release are:

  • NOTE: Known regression: PDFBOX-4587 -- PDF passwords with codepoints between 0xF000 and 0xF0000 will cause an exception.
  • Add parser for HWP v5 files via SooMyung Lee (soomyung) and JinSup Kim (ddoleye) (TIKA-2909).
  • Fix order of closing streams to avoid "Failed to close temporary resource" exception in TesseractOCRParser (TIKA-2908).
  • Improve AutoDetectReader performance by caching the encoding detector (TIKA-1568).
  • Prevent RTFParser from outputting illegal tag combinations (TIKA-2889).
  • Fix RereadableInputStream to release all resources (TIKA-2903).
  • Implement custom language identifier in the tika-eval module based on OpenNLP's language detector; add 18 languages and add common word lists for all 121 languages (TIKA-2790).
  • Fix NPE in MimeTypesReader.releaseParser() via Eamonn Saunders (TIKA-2896).
  • Fix RTFParser to extract more content (TIKA-2883).
  • Add clientSubmitTime to the metadata extracted from PST files (TIKA-2898).
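
For the known regression noted above (PDFBOX-4587), callers may want to check ahead of time whether a PDF password contains a codepoint in the affected range. The following is a minimal sketch using only the Java standard library; the class name PasswordCodePointCheck and the method hasAffectedCodePoint are hypothetical, and treating both bounds as inclusive is an assumption, since the note does not say whether the endpoints are affected.

```java
public final class PasswordCodePointCheck {
    // Bounds come from the release note; inclusive endpoints are an assumption.
    private static final int LOWER = 0xF000;
    private static final int UPPER = 0xF0000;

    private PasswordCodePointCheck() {}

    /** Returns true if any codepoint of the password lies in [LOWER, UPPER]. */
    public static boolean hasAffectedCodePoint(String password) {
        // codePoints() merges valid surrogate pairs into single codepoints,
        // so supplementary characters are compared by their full value.
        return password.codePoints().anyMatch(cp -> cp >= LOWER && cp <= UPPER);
    }

    public static void main(String[] args) {
        System.out.println(hasAffectedCodePoint("secret"));   // prints "false"
        System.out.println(hasAffectedCodePoint("pw\uF001")); // prints "true"
    }
}
```

A check like this only flags passwords that may trigger the exception; it does not work around the regression itself.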

The following people have contributed to Tika 1.22 by submitting or commenting on the issues resolved in this release:

  • Andrzej Bialecki
  • Eamonn Saunders
  • Kevin Ng
  • Luis Filipe Nassif
  • Marichi Gupta
  • Mike Cantrell
  • Pandurang
  • Paul Woods
  • Peter Fassev
  • Richard Lehane
  • Rohit Sureshrao Shelhalkar
  • Sebb
  • T Craig
  • T. Schmidt
  • Tim Allison
  • ddoleye
  • mungeol heo
  • soomyung

See https://s.apache.org/zpngc for more details on these contributions.