Apache Tika 1.7

The most notable changes in Tika 1.7 over the previous release are:

  • Fixed resource leak in OutlookPSTParser that caused TikaException when invoked via AutoDetectParser on Windows (TIKA-1506).
  • HTML tags are properly stripped from content by FeedParser (TIKA-1500).
  • Tika Server support for selecting a single metadata key; wrapped MetadataEP into MetadataResource (TIKA-1499).
  • Tika Server support for JSON and XMP views of metadata (TIKA-1497).
  • Tika Parent uses dependency management to keep dependencies shared by different modules at the same version (TIKA-1384).
  • Upgraded slf4j to version 1.7.7 (TIKA-1496).
  • Tika Server support for RecursiveParserWrapper's JSON output (endpoint=rmeta), equivalent to the -J option that TIKA-1451 added to tika-app (TIKA-1498).
  • Tika Server support for providing the password for files on a per-request basis through the Password HTTP header (TIKA-1494).
  • Simple support for the BPG (Better Portable Graphics) image format (TIKA-1491, TIKA-1495).
  • Prevent exceptions from being thrown for some malformed MP3 files (TIKA-1218).
  • Reformat pom.xml files to use two spaces per indent (TIKA-1475).
  • Fix warning of slf4j logger on Tika Server startup (TIKA-1472).
  • Tika CLI and GUI now have option to view JSON rendering of output of RecursiveParserWrapper (TIKA-1451).
  • Tika now integrates the Geospatial Data Abstraction Library (GDAL) for parsing hundreds of geospatial formats (TIKA-605, TIKA-1503).
  • ExternalParsers can now use regular expressions to specify dynamic keys (TIKA-1441).
  • Thread safety issues in ImageMetadataExtractor were resolved (TIKA-1369).
  • The ForkParser service is now registered in Activator (TIKA-1354).
  • The Rome Library was upgraded to version 1.5 (TIKA-1435).
  • Add markup for files embedded in PDFs (TIKA-1427).
  • Extract files embedded in annotations in PDFs (TIKA-1433).
  • Upgrade to PDFBox 1.8.8 (TIKA-1419, TIKA-1442).
  • Add RecursiveParserWrapper (aka Jukka's and Nick's RecursiveMetadataParser) (TIKA-1329).
  • Add example for how to dump TikaConfig to XML (TIKA-1418).
  • Allow users to specify a tika config file for tika-app (TIKA-1426).
  • PackageParser includes the last-modified date from the archive in the metadata, when handling embedded entries (TIKA-1246).
  • Created a new Tesseract OCR Parser to extract text from images. Requires installation of Tesseract before use (TIKA-93).
  • Basic parser for older Excel formats, such as Excel 4, 5 and 95, which can extract simple text from all of them and metadata from Excel 5 and 95 (TIKA-1490).
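
The Tika Server items above (the rmeta endpoint from TIKA-1498 and the Password header from TIKA-1494) can be combined in a single request. The following is a minimal sketch using the JDK's java.net.http client (Java 11+), not code from the Tika project. The class and method names are hypothetical, and the base URL http://localhost:9998 and the use of PUT for the upload are assumptions about a default local server; check the Tika Server documentation for your version.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class TikaServerRequestSketch {

    // Hypothetical helper: builds (but does not send) a request that uploads
    // a document to the rmeta endpoint and supplies its password per request.
    static HttpRequest buildRmetaRequest(URI serverBase, byte[] document, String password) {
        return HttpRequest.newBuilder()
                // "rmeta" is the endpoint name given in TIKA-1498.
                .uri(serverBase.resolve("/rmeta"))
                // "Password" is the header name given in TIKA-1494.
                .header("Password", password)
                // Assumption: the server accepts the document body via PUT.
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) {
        // Assumed default address of a locally running Tika Server.
        URI base = URI.create("http://localhost:9998");
        // Placeholder bytes; a real call would read an encrypted file instead.
        byte[] document = new byte[] {0x25, 0x50, 0x44, 0x46};

        HttpRequest request = buildRmetaRequest(base, document, "secret");
        System.out.println(request.method() + " " + request.uri());
        System.out.println("Password header set: "
                + request.headers().firstValue("Password").isPresent());
    }
}
```

To actually send the request, pass it to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) while a Tika Server instance is running. Per TIKA-1498, the response should be the JSON output of RecursiveParserWrapper.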

The following people have contributed to Tika 1.7 by submitting or commenting on the issues resolved in this release:

  • Aimee Dev
  • Alexander Chow
  • Amit Gupta
  • Andreas
  • Andreas Hubold
  • Andrzej Bialecki
  • Ann Burgess
  • Avi
  • Boris Naguet
  • Chetan Laddha
  • Chris A. Mattmann
  • Chris Bamford
  • Christian Reuschling
  • Cservenak, Tamas
  • Damiano
  • Dave Meikle
  • Erik Hetzner
  • Fabian Lange
  • Hassan Akram
  • Hong-Thai Nguyen
  • James Baker
  • Jonathan Evans
  • Jukka Zitting
  • Kaijian Xu
  • Ken Krugler
  • Konstantin Gribov
  • Lewis John McGibbney
  • Luis Filipe Nassif
  • Marco Quaranta
  • Martin Kalcher
  • Matthias Krueger
  • Matthieu Neamar
  • Nick Burch
  • Nicolas Gavalda
  • Omid Pourhadi
  • Pradeep Singh
  • Ray Gauss II
  • Sasa Milenkovic
  • Sebastian Nagel
  • Sergey Beryozkin
  • Steffen
  • Steve R
  • Tadeu Alves
  • Tim Allison
  • Tran Nam Quang
  • Tyler Palsulich
  • Vladimir Glina
  • William Palmer

See http://s.apache.org/a8m for more details on these contributions.