Apache Tika 1.6

The most notable changes in Tika 1.6 over the previous release are:

  • Parse output now indicates which Parser was actually used (TIKA-674).
  • Use the forbidden-apis Maven plugin to check for unsafe Java operations (TIKA-1387).
  • Created an ExternalTranslator class to interface with command line Translators (TIKA-1385).
  • Created a MosesTranslator as a subclass of ExternalTranslator that calls the Moses Decoder machine translation program (TIKA-1385).
  • Created the tika-example module, which will hold examples of how to use the main Tika interfaces (TIKA-1390).
  • Upgraded to Commons Compress 1.8.1 (TIKA-1275).
  • Upgraded to POI 3.11-beta1 (TIKA-1380).
  • Tika now extracts SDTCell content from tables in .docx files (TIKA-1317).
  • Tika now supports detection of the Persian/Farsi language (TIKA-1337).
  • The Tika Detector interface is now exposed through the JAX-RS server (TIKA-1335, TIKA-1336).
  • Tika now has support for parsing binary Matlab files as part of our larger effort to increase the number of scientific data formats supported (TIKA-1327).
  • The Tika Server URLs for the unpacker resources have been changed to bring them under a common prefix. The mapping is /unpacker/id -> /unpack/id and /all/id -> /unpack/all/id (TIKA-1324).
  • Added a module and a core Tika interface for translating text between languages, along with a default implementation that calls Microsoft's translate service (TIKA-1319).
  • Added a Translator implementation that calls Lingo24's Premium Machine Translation API (TIKA-1381).
  • Made RTFParser's list handling slightly more robust against corrupt list metadata (TIKA-1305).
  • Fixed a bug in the CLI JSON output (TIKA-1291, TIKA-1310).
  • Added the ability to turn off image extraction from PDFs. Users must now explicitly turn on image extraction via the PDFParserConfig (TIKA-1294).
  • Upgraded to PDFBox 1.8.6 (TIKA-1290, TIKA-1231, TIKA-1233, TIKA-1352).
  • Added Zip container detection for the DWFX and XPS formats, which are OPC based (TIKA-1204, TIKA-1221).
  • Added a user-facing welcome page to the Tika Server, which says what the server is and gives a very brief summary of what is available (TIKA-1269).
  • Added Tika Server endpoints to list the available mime types, Parsers and Detectors, similar to the --list-foo methods on the Tika CLI App (TIKA-1270).
  • Improvements to NetCDF and HDF parsing to mimic the output of ncdump and extract text dimensions and spatial and variable information from scientific data files (TIKA-1265).
  • Tika now extracts attachments from RTF files (TIKA-1010).
  • Added support for the Outlook Personal Folders File Format, *.pst (TIKA-623).
  • Added mime entries for additional Ogg based formats (TIKA-1259).
  • Updated the Ogg Vorbis plugin to v0.4, which adds detection for a wider range of Ogg formats, and parsers for more Ogg Audio ones (TIKA-1113).
  • PDF: Images in PDF documents can now be extracted as embedded resources (TIKA-1268).
  • Fixed RuntimeException thrown for certain Word Documents (TIKA-1251).
  • CLI: TikaCLI now has another option: --list-parser-details-apt, which outputs the list of supported parsers in APT format. This is used to generate the list on the formats page (TIKA-411).
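
For clients that still hold Tika Server unpacker paths in the old form, the URL change noted above (TIKA-1324) can be sketched as a simple prefix rewrite. This is only an illustration of the stated mapping, not code from Tika itself: the class name UnpackerUrlMigration and the method migrate are hypothetical, and the sketch assumes the path is given without host or query string.

```java
// Hypothetical helper, not part of Tika: rewrites pre-1.6 unpacker paths
// to the common /unpack prefix described in TIKA-1324.
final class UnpackerUrlMigration {

    private UnpackerUrlMigration() {
    }

    // Maps /unpacker/id -> /unpack/id and /all/id -> /unpack/all/id.
    // Any other path is returned unchanged.
    static String migrate(String path) {
        final String oldUnpacker = "/unpacker/";
        final String oldAll = "/all/";
        if (path.startsWith(oldUnpacker)) {
            return "/unpack/" + path.substring(oldUnpacker.length());
        }
        if (path.startsWith(oldAll)) {
            return "/unpack/all/" + path.substring(oldAll.length());
        }
        return path;
    }

    public static void main(String[] args) {
        System.out.println(migrate("/unpacker/doc1")); // prints /unpack/doc1
        System.out.println(migrate("/all/doc1"));      // prints /unpack/all/doc1
        System.out.println(migrate("/tika"));          // prints /tika
    }
}
```

Paths that already use the new prefix, such as /unpack/doc1, match neither old prefix and pass through untouched, so the rewrite can safely be applied more than once.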

    The following people have contributed to Tika 1.6 by submitting or commenting on the issues resolved in this release:

    • Alexander Chow
    • Amit Gupta
    • Andreas
    • Andreas Hubold
    • Andrzej Bialecki
    • Ann Burgess
    • Avi
    • Boris Naguet
    • Chris A. Mattmann
    • Chris Bamford
    • Christian Reuschling
    • Cservenak, Tamas
    • Damiano
    • Dave Meikle
    • Erik Hetzner
    • Fabian Lange
    • Hassan Akram
    • Hong-Thai Nguyen
    • Jonathan Evans
    • Jukka Zitting
    • Kaijian Xu
    • Ken Krugler
    • Konstantin Gribov
    • Lewis John McGibbney
    • Luis Filipe Nassif
    • Marco Quaranta
    • Martin Kalcher
    • Matthias Krueger
    • Matthieu Neamar
    • Nick Burch
    • Nicolas Gavalda
    • Omid Pourhadi
    • Pradeep Singh
    • Ray Gauss II
    • Sasa Milenkovic
    • Sebastian Nagel
    • Sergey Beryozkin
    • Steffen
    • Steve R
    • Tim Allison
    • Tran Nam Quang
    • Tyler Palsulich
    • Vladimir Glina

    See http://s.apache.org/ojn for more details on these contributions.