Apache Tika 1.23

The most notable changes in Tika 1.23 over the previous release are:

  • NOTE: The PDFParser now relies on OCRDPI to render page images when users configure OCR on rendered page images. This increases the size of the rendered images (TIKA-2624).
  • NOTE: tika-server no longer returns 415 for file types for which there is no parser.
  • NOTE: tika-server's /rmeta endpoint now returns 200 when there is a parse exception, to align its behavior with tika-app in batch mode. The stacktrace is stored as a metadata value.
  • Fix bug in AUTO OCR strategy in the PDFParser (TIKA-3002).
  • Fix incorrect height and width metadata extraction from JPEG images (TIKA-2630).
  • Upgrade to POI 4.1.1 (TIKA-2851).
  • Upgrade to PDFBox 2.0.17 (TIKA-2951).
  • Ensure that the PDFParser respects custom configuration of Tesseract from tika-config.xml via Eric Pugh (TIKA-2970).
  • Add parser for XLIFF v1.2 files (TIKA-2975).
  • Add mime type detection support for WebAssembly (TIKA-2894), HEIF / HEIC images (TIKA-2942), Digilite FDF (TIKA-2988); and xml-root detection for XFDF (TIKA-2990) and XDP (TIKA-2989).
  • Add an XLZ Parser (TIKA-2976).
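
Because /rmeta now returns 200 even when parsing fails, clients that previously relied on the HTTP status may need to inspect the returned metadata instead. The following is a minimal sketch of such a check, not part of Tika itself. It assumes the response body is the usual JSON list of metadata objects and that exception entries use keys starting with "X-TIKA:EXCEPTION:". That prefix, the helper name find_parse_exceptions, and the sample payload are assumptions made for illustration; verify the actual key names against the output of your server.

```python
import json

# Assumption: exception-related metadata keys share this prefix.
# Check the real key names in your tika-server's /rmeta output.
EXCEPTION_KEY_PREFIX = "X-TIKA:EXCEPTION:"


def find_parse_exceptions(rmeta_body):
    """Return (document index, key, value) for each exception entry.

    rmeta_body is the JSON text of an /rmeta response: a list with one
    metadata object per document (container first, then embedded files).
    """
    documents = json.loads(rmeta_body)
    found = []
    for index, metadata in enumerate(documents):
        for key, value in metadata.items():
            if key.startswith(EXCEPTION_KEY_PREFIX):
                found.append((index, key, value))
    return found


# Hypothetical response body, made up for this example.
sample = json.dumps([
    {
        "Content-Type": "application/pdf",
        "X-TIKA:EXCEPTION:runtime": "java.io.IOException: example",
    },
    {"Content-Type": "image/jpeg"},
])

for index, key, value in find_parse_exceptions(sample):
    print("document %d: %s = %s" % (index, key, value))
```

A client could treat a non-empty result as a failed or partial parse, even though the HTTP status is 200.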

The following people have contributed to Tika 1.23 by submitting or commenting on the issues resolved in this release:

  • Christian Ribeaud
  • Chris Z
  • Dan Becker
  • Dave Meikle
  • David Eric Pugh
  • Ewan Mellor
  • Felix Sonntag
  • Feng Jiao Jiang
  • Fredrik Söderström
  • Kim Ju Young
  • Kyle DuPont
  • Luís Filipe Nassif
  • Luke Butters
  • Pascal Essiembre
  • Peng Cheng
  • Roman Ivanov
  • Sergey Beryozkin
  • Tilman Hausherr
  • Tim Allison
  • Yahav Amsalem

See https://s.apache.org/asrx3 for more details on these contributions.