Apache Tika 1.23
The most notable changes in Tika 1.23 over the previous release are:
- NOTE: The PDFParser now relies on OCRDPI to render page images when users configure OCR on rendered page images. This will have the effect of increasing rendered image size (TIKA-2624).
- NOTE: tika-server no longer returns 415 for file types for which there is no parser.
- NOTE: tika-server's /rmeta endpoint now returns 200 if there is a parse exception to align its behavior with tika-app in batch mode. The stacktrace is stored as a metadata value.
- Fix bug in AUTO OCR strategy in the PDFParser (TIKA-3002).
- Fix incorrect height and width metadata extraction from JPEG images (TIKA-2630).
- Upgrade to POI 4.1.1 (TIKA-2851).
- Upgrade to PDFBox 2.0.17 (TIKA-2951).
- Ensure that the PDFParser respects custom configuration of Tesseract from tika-config.xml via Eric Pugh (TIKA-2970).
- Add parser for XLIFF v1.2 files (TIKA-2975).
- Add mime type detection support for WebAssembly (TIKA-2894), HEIF / HEIC images (TIKA-2942), and Digilite FDF (TIKA-2988); and xml-root detection for XFDF (TIKA-2990) and XDP (TIKA-2989).
- Add an XLZ Parser (TIKA-2976).
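
Because /rmeta now returns 200 even when parsing fails, clients that relied on the HTTP status to detect parse errors may need to inspect the returned metadata instead. The following is a minimal sketch of such a check, not a definitive implementation. It assumes the response body is the JSON list of metadata objects that /rmeta returns, and that the stacktrace is stored under a key starting with "X-TIKA:EXCEPTION:". That prefix, the function name and the sample body are assumptions for illustration; check the exact key against your own server's output.

```python
import json

# Assumed key prefix: the release note only says the stacktrace is stored
# as a metadata value, so confirm the real key name on your server.
EXCEPTION_KEY_PREFIX = "X-TIKA:EXCEPTION:"


def find_parse_exceptions(rmeta_body, key_prefix=EXCEPTION_KEY_PREFIX):
    """Return (index, key, value) for each metadata entry that looks like a stacktrace.

    rmeta_body is the raw JSON text of an /rmeta response: a list with one
    metadata object per document (the container first, then any embedded files).
    """
    documents = json.loads(rmeta_body)
    hits = []
    for index, metadata in enumerate(documents):
        for key, value in metadata.items():
            if key.startswith(key_prefix):
                hits.append((index, key, value))
    return hits


# Hypothetical response body, made up for this example.
sample = json.dumps([
    {
        "Content-Type": "application/pdf",
        "X-TIKA:EXCEPTION:runtime": "java.io.IOException: broken file",
    },
    {"Content-Type": "image/jpeg"},
])

for index, key, value in find_parse_exceptions(sample):
    print("document", index, "failed:", value.splitlines()[0])
```

A caller would pass the body of a 200 response to find_parse_exceptions and treat a non-empty result as a parse failure.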
The following people have contributed to Tika 1.23 by submitting or commenting on the issues resolved in this release:
- Christian Ribeaud
- Chris Z
- Dan Becker
- Dave Meikle
- David Eric Pugh
- Ewan Mellor
- Felix Sonntag
- Feng Jiao Jiang
- Fredrik Söderström
- Kim Ju Young
- Kyle DuPont
- Luís Filipe Nassif
- Luke Butters
- Pascal Essiembre
- Peng Cheng
- Roman Ivanov
- Sergey Beryozkin
- Tilman Hausherr
- Tim Allison
- Yahav Amsalem
See https://s.apache.org/asrx3 for more details on these contributions.