Apache Tika 1.6
The most notable changes in Tika 1.6 over the previous release are:
- Parse output now indicates which Parser was actually used (TIKA-674).
- Use the forbidden-apis Maven plugin to check for unsafe Java operations (TIKA-1387).
- Created an ExternalTranslator class to interface with command line Translators (TIKA-1385).
- Created a MosesTranslator as a subclass of ExternalTranslator that calls the Moses Decoder machine translation program (TIKA-1385).
- Created the tika-example module. It will have examples of how to use the main Tika interfaces (TIKA-1390).
- Upgraded to Commons Compress 1.8.1 (TIKA-1275).
- Upgraded to POI 3.11-beta1 (TIKA-1380).
- Tika now extracts SDTCell content from tables in .docx files (TIKA-1317).
- Tika now supports detection of the Persian/Farsi language (TIKA-1337).
- The Tika Detector interface is now exposed through the JAX-RS server (TIKA-1335, TIKA-1336).
- Tika now has support for parsing binary Matlab files as part of our larger effort to increase the number of scientific data formats supported (TIKA-1327).
- The Tika Server URLs for the unpacker resources have been changed, to bring them under a common prefix. The mapping is /unpacker/id -> /unpack/id and /all/id -> /unpack/all/id (TIKA-1324).
- Added a module and core Tika interface for translating text between languages, and added a default implementation that calls Microsoft's translate service (TIKA-1319).
- Added a Translator implementation that calls Lingo24's Premium Machine Translation API (TIKA-1381).
- Made RTFParser's list handling slightly more robust against corrupt list metadata (TIKA-1305).
- Fixed a bug in the CLI JSON output (TIKA-1291, TIKA-1310).
- Added the ability to turn off image extraction from PDFs. Users must now turn on this capability via the PDFParserConfig (TIKA-1294).
- Upgraded to PDFBox 1.8.6 (TIKA-1290, TIKA-1231, TIKA-1233, TIKA-1352).
- Added Zip container detection for the DWFX and XPS formats, which are OPC based (TIKA-1204, TIKA-1221).
- Added a user-facing welcome page to the Tika Server, which says what it is and gives a very brief summary of what is available (TIKA-1269).
- Added Tika Server endpoints to list the available mime types, Parsers and Detectors, similar to the --list-foo methods on the Tika CLI App (TIKA-1270).
- Improvements to NetCDF and HDF parsing to mimic the output of ncdump and extract text dimensions and spatial and variable information from scientific data files (TIKA-1265).
- Extract attachments from RTF files (TIKA-1010).
- Support Outlook Personal Folders File Format *.pst (TIKA-623).
- Added mime entries for additional Ogg based formats (TIKA-1259).
- Updated the Ogg Vorbis plugin to v0.4, which adds detection for a wider range of Ogg formats, and parsers for more Ogg Audio ones (TIKA-1113).
- PDF: Images in PDF documents can now be extracted as embedded resources (TIKA-1268).
- Fixed RuntimeException thrown for certain Word Documents (TIKA-1251).
- CLI: TikaCLI now has another option: --list-parser-details-apt, which outputs the list of supported parsers in APT format. This is used to generate the list on the formats page (TIKA-411).
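
Clients that still build the old Tika Server unpacker URLs may want to rewrite them according to the mapping given for TIKA-1324 above. The following is a minimal sketch of such a rewrite, not code shipped with Tika: the class name UnpackerPathMigration and the method migrate are hypothetical, and it assumes that "id" in the mapping stands for whatever path remainder follows the old prefix.

```java
/**
 * Hypothetical client-side helper (not part of Tika) that rewrites
 * pre-1.6 Tika Server unpacker paths to the 1.6 layout:
 *   /unpacker/id -> /unpack/id
 *   /all/id      -> /unpack/all/id
 */
final class UnpackerPathMigration {
    private static final String OLD_UNPACKER = "/unpacker/";
    private static final String OLD_ALL = "/all/";

    private UnpackerPathMigration() {
        // static helper only, no instances
    }

    /** Returns the 1.6 path for an old unpacker path; other paths are returned unchanged. */
    static String migrate(String path) {
        if (path.startsWith(OLD_UNPACKER)) {
            // Assumption: everything after the old prefix is the "id" part.
            return "/unpack/" + path.substring(OLD_UNPACKER.length());
        }
        if (path.startsWith(OLD_ALL)) {
            return "/unpack/all/" + path.substring(OLD_ALL.length());
        }
        return path;
    }

    public static void main(String[] args) {
        System.out.println(migrate("/unpacker/doc1"));
        System.out.println(migrate("/all/doc1"));
    }
}
```

Paths that already use the new /unpack prefix match neither old prefix and pass through untouched, so applying the rewrite twice is harmless.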
The following people have contributed to Tika 1.6 by submitting or commenting on the issues resolved in this release:
- Alexander Chow
- Amit Gupta
- Andreas
- Andreas Hubold
- Andrzej Bialecki
- Ann Burgess
- Avi
- Boris Naguet
- Chris A. Mattmann
- Chris Bamford
- Christian Reuschling
- Cservenak, Tamas
- Damiano
- Dave Meikle
- Erik Hetzner
- Fabian Lange
- Hassan Akram
- Hong-Thai Nguyen
- Jonathan Evans
- Jukka Zitting
- Kaijian Xu
- Ken Krugler
- Konstantin Gribov
- Lewis John McGibbney
- Luis Filipe Nassif
- Marco Quaranta
- Martin Kalcher
- Matthias Krueger
- Matthieu Neamar
- Nick Burch
- Nicolas Gavalda
- Omid Pourhadi
- Pradeep Singh
- Ray Gauss II
- Sasa Milenkovic
- Sebastian Nagel
- Sergey Beryozkin
- Steffen
- Steve R
- Tim Allison
- Tran Nam Quang
- Tyler Palsulich
- Vladimir Glina
See http://s.apache.org/ojn for more details on these contributions.