Apache Tika 1.4
The most notable changes in Tika 1.4 over the previous release are:
- Removed a test HTML file with a poorly chosen GPL text in it (TIKA-1129).
- Improvements to tika-server to allow it to produce text/html and text/xml content (TIKA-1126, TIKA-1127).
- Improvements were made to the Compressor Parser to handle gzipped files that require the decompressConcatenated option set to true (TIKA-1096).
- Fixed a typographic error that was preventing detection of awk files (TIKA-1081).
- Added a new endpoint to Tika's JAX-RS REST server that only detects the media type, based on a small portion of the submitted document (TIKA-1047).
- RTF: Ordered and unordered lists are now extracted (TIKA-1062).
- MP3: Audio duration is now extracted (TIKA-991).
- Java .class files: upgraded from ASM 3.1 to ASM 4.1 for parsing the Java bytecodes (TIKA-1053).
- Mime Types: Definitions extended to optionally include Link (URL) and UTI, along with details for several common formats (TIKA-1012, TIKA-1083).
- Exceptions when parsing OLE10 embedded documents, when parsing summary information from Office documents, and when saving embedded documents in TikaCLI are now logged instead of aborting extraction (TIKA-1074).
- MS Word: line tabular character is now replaced with newline (TIKA-1128).
- XML: ElementMetadataHandlers can now optionally accept duplicate and empty values (TIKA-1133).
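For readers unfamiliar with the concatenated gzip case mentioned under TIKA-1096, the sketch below shows what such a file looks like: two independently compressed gzip members appended back to back, which a reader has to keep decoding past the first member's trailer. It is an illustration only and does not use Tika or its Compressor Parser. It relies solely on the JDK's java.util.zip classes, and the class and method names (ConcatenatedGzipDemo, gzip, concat, gunzipAll) are hypothetical names chosen for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative only: builds a concatenated gzip stream in memory and reads it back.
public class ConcatenatedGzipDemo {

    // Compress a string into one complete gzip member (header, data, trailer).
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    // Append one byte array to another, as `cat a.gz b.gz > both.gz` would.
    static byte[] concat(byte[] first, byte[] second) {
        byte[] joined = new byte[first.length + second.length];
        System.arraycopy(first, 0, joined, 0, first.length);
        System.arraycopy(second, 0, joined, first.length, second.length);
        return joined;
    }

    // Decompress until end of stream, so that every member is consumed.
    static String gunzipAll(byte[] data) throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                result.write(chunk, 0, n);
            }
        }
        return new String(result.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = concat(gzip("first member\n"), gzip("second member\n"));
        System.out.print(gunzipAll(data));
    }
}
```

A reader that stops after the first member would return only the first part of the text, which is the kind of truncation a concatenation-aware setting is meant to avoid.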
The following people have contributed to Tika 1.4 by submitting or commenting on the issues resolved in this release:
- Axel Dörfler
- Bernhard Berger
- Chris A. Mattmann
- Dave Meikle
- David Morana
- Giuseppe Totaro
- Gregory Chanan
- Jérémie Lesage
- Jukka Zitting
- Konstantin Privezentsev
- Lee Graber
- Lewis John McGibbney
- Marco Quaranta
- Markus Jelsma
- Michael McCandless
- Nick Burch
- Raimund Merkert
- Ray Gauss II
- Ryan McKinley
- T. Schmidt
- Vincent Massol
See http://s.apache.org/JPY for more details on these contributions.