Apache Tika 1.4

The most notable changes in Tika 1.4 over the previous release are:

  • Removed a test HTML file that contained poorly chosen GPL text (TIKA-1129).
  • Improved tika-server so that it can produce text/html and text/xml content (TIKA-1126, TIKA-1127).
  • Improved the Compressor Parser to handle gzipped files that require the decompressConcatenated option to be set to true (TIKA-1096).
  • Fixed a typographical error that prevented detection of awk files (TIKA-1081).
  • Added a new endpoint to Tika's JAX-RS REST server that detects only the media type, based on a small portion of the submitted document (TIKA-1047).
  • RTF: Ordered and unordered lists are now extracted (TIKA-1062).
  • MP3: Audio duration is now extracted (TIKA-991).
  • Java .class files: upgraded from ASM 3.1 to ASM 4.1 for parsing Java bytecode (TIKA-1053).
  • Mime Types: Definitions extended to optionally include Link (URL) and UTI, along with details for several common formats (TIKA-1012, TIKA-1083).
  • Exceptions when parsing OLE10 embedded documents, when parsing summary information from Office documents, and when saving embedded documents in TikaCLI are now logged instead of aborting extraction (TIKA-1074).
  • MS Word: the line tabular character is now replaced with a newline (TIKA-1128).
  • XML: ElementMetadataHandlers can now optionally accept duplicate and empty values (TIKA-1133).
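
As background for the TIKA-1096 item above: a gzip file may consist of several complete gzip members written back to back, and a reader that stops after the first member silently drops the rest. The sketch below is not Tika code. It uses only the JDK's java.util.zip classes to build such a file in memory and read all of it back. The class and method names (ConcatenatedGzipDemo, gzipMember, concat, decompressAll) are hypothetical and chosen for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; not part of Tika or Commons Compress.
public class ConcatenatedGzipDemo {

    // Compress a string into one complete gzip member (header, data, trailer).
    public static byte[] gzipMember(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Append one byte array to another, as "cat a.gz b.gz" would on disk.
    public static byte[] concat(byte[] first, byte[] second) {
        byte[] joined = new byte[first.length + second.length];
        System.arraycopy(first, 0, joined, 0, first.length);
        System.arraycopy(second, 0, joined, first.length, second.length);
        return joined;
    }

    // Read every gzip member in the input and return the combined text.
    public static String decompressAll(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] twoMembers = concat(gzipMember("hello "), gzipMember("world"));
        System.out.println(decompressAll(twoMembers));
    }
}
```

The JDK's GZIPInputStream continues into following members on its own. In Apache Commons Compress, the GzipCompressorInputStream constructor takes a boolean decompressConcatenated argument that enables the same behavior, which is the option the release note refers to.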

    The following people have contributed to Tika 1.4 by submitting or commenting on the issues resolved in this release:

    • Axel Dörfler
    • Bernhard Berger
    • Chris A. Mattmann
    • Dave Meikle
    • David Morana
    • Giuseppe Totaro
    • Gregory Chanan
    • Jérémie Lesage
    • Jukka Zitting
    • Konstantin Privezentsev
    • Lee Graber
    • Lewis John McGibbney
    • Marco Quaranta
    • Markus Jelsma
    • Michael McCandless
    • Nick Burch
    • Raimund Merkert
    • Ray Gauss II
    • Ryan McKinley
    • T. Schmidt
    • Vincent Massol

    See http://s.apache.org/JPY for more details on these contributions.