Apache Tika 1.7
The most notable changes in Tika 1.7 over the previous release are:
- Fixed resource leak in OutlookPSTParser that caused TikaException when invoked via AutoDetectParser on Windows (TIKA-1506).
- HTML tags are properly stripped from content by FeedParser (TIKA-1500).
- Tika Server support for selecting a single metadata key; wrapped MetadataEP into MetadataResource (TIKA-1499).
- Tika Server support for JSON and XMP views of metadata (TIKA-1497).
- Tika Parent uses dependency management to keep dependencies shared by different modules at the same version (TIKA-1384).
- Upgraded slf4j to version 1.7.7 (TIKA-1496).
- Tika Server support for RecursiveParserWrapper's JSON output (endpoint=rmeta), equivalent to the -J option added to tika-app in TIKA-1451 (TIKA-1498).
- Tika Server support for providing the password for files on a per-request basis through the Password HTTP header (TIKA-1494).
- Simple support for the BPG (Better Portable Graphics) image format (TIKA-1491, TIKA-1495).
- Prevent exceptions from being thrown for some malformed mp3 files (TIKA-1218).
- Reformat pom.xml files to use two spaces per indent (TIKA-1475).
- Fix slf4j logger warning on Tika Server startup (TIKA-1472).
- Tika CLI and GUI now have an option to view the JSON rendering of RecursiveParserWrapper output (TIKA-1451).
- Tika now integrates the Geospatial Data Abstraction Library (GDAL) for parsing hundreds of geospatial formats (TIKA-605, TIKA-1503).
- ExternalParsers can now use regular expressions to specify dynamic keys (TIKA-1441).
- Thread safety issues in ImageMetadataExtractor were resolved (TIKA-1369).
- The ForkParser service is now registered in Activator (TIKA-1354).
- The Rome Library was upgraded to version 1.5 (TIKA-1435).
- Add markup for files embedded in PDFs (TIKA-1427).
- Extract files embedded in annotations in PDFs (TIKA-1433).
- Upgrade to PDFBox 1.8.8 (TIKA-1419, TIKA-1442).
- Add RecursiveParserWrapper (aka Jukka's and Nick's RecursiveMetadataParser) (TIKA-1329).
- Add example for how to dump TikaConfig to XML (TIKA-1418).
- Allow users to specify a tika config file for tika-app (TIKA-1426).
- PackageParser includes the last-modified date from the archive in the metadata, when handling embedded entries (TIKA-1246).
- Created a new Tesseract OCR Parser to extract text from images. Requires installation of Tesseract before use (TIKA-93).
- Basic parser for older Excel formats, such as Excel 4, 5 and 95, which extracts simple text, plus metadata for Excel 5 and 95 (TIKA-1490).
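As an illustration of the Tika Server items above (TIKA-1498 and TIKA-1494), the following is a minimal sketch of a client that sends a document to the rmeta endpoint and passes a password in the Password header. It uses only the JDK HTTP client (Java 11 or later). The class name, the helper method, and the base URL http://localhost:9998 are assumptions made for this example, not part of Tika; adjust them to match your server.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical example class; not part of the Tika distribution.
public class TikaServerRmetaExample {

    // Builds a PUT request for the rmeta endpoint of a Tika Server.
    // The password may be null for unprotected files; otherwise it is
    // sent in the Password header, per request.
    public static HttpRequest buildRmetaRequest(String baseUrl, byte[] document, String password) {
        HttpRequest.Builder builder = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/rmeta"))
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document));
        if (password != null) {
            builder.header("Password", password);
        }
        return builder.build();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length < 1) {
            System.err.println("usage: TikaServerRmetaExample <file> [password]");
            return;
        }
        byte[] document = Files.readAllBytes(Paths.get(args[0]));
        String password = args.length > 1 ? args[1] : null;

        // Assumed server location; change it if your server listens elsewhere.
        HttpRequest request = buildRmetaRequest("http://localhost:9998", document, password);
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The body is expected to be the JSON rendering of the
        // RecursiveParserWrapper output for the document and its embedded files.
        System.out.println(response.body());
    }
}
```

Building the request in a separate method keeps the header logic easy to check without a running server; the main method only adds file reading and the network call.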
The following people have contributed to Tika 1.7 by submitting or commenting on the issues resolved in this release:
- Aimee Dev
- Alexander Chow
- Amit Gupta
- Andreas
- Andreas Hubold
- Andrzej Bialecki
- Ann Burgess
- Avi
- Boris Naguet
- Chetan Laddha
- Chris A. Mattmann
- Chris Bamford
- Christian Reuschling
- Cservenak, Tamas
- Damiano
- Dave Meikle
- Erik Hetzner
- Fabian Lange
- Hassan Akram
- Hong-Thai Nguyen
- James Baker
- Jonathan Evans
- Jukka Zitting
- Kaijian Xu
- Ken Krugler
- Konstantin Gribov
- Lewis John McGibbney
- Luis Filipe Nassif
- Marco Quaranta
- Martin Kalcher
- Matthias Krueger
- Matthieu Neamar
- Nick Burch
- Nicolas Gavalda
- Omid Pourhadi
- Pradeep Singh
- Ray Gauss II
- Sasa Milenkovic
- Sebastian Nagel
- Sergey Beryozkin
- Steffen
- Steve R
- Tadeu Alves
- Tim Allison
- Tran Nam Quang
- Tyler Palsulich
- Vladimir Glina
- William Palmer
See http://s.apache.org/a8m for more details on these contributions.