Apache Tika 2.8.0

The most notable changes in Tika 2.8.0 over the previous release are:

  • Enable counting and/or parsing of incremental updates in PDFs. This is an experimental feature and may change in later releases (TIKA-4017).
  • Fixed bug that prevented the loading of CompositeExternalParser in tika-app and tika-server-standard. This parser will call exiftool and ffmpeg if those are installed, as was the behavior in Tika 1.x. Exclude org.apache.tika.parser.external.CompositeExternalParser if you do not want this behavior (TIKA-4022).
  • Geotopic parser moved back to org.apache.tika.parser.geo (TIKA-4009).
  • Removed the shading of tika-parsers-standard-module (TIKA-4038).
  • Enable optional extraction of file system metadata in FileSystemFetcher (TIKA-4035).
  • Allow pretty printing in FileSystemEmitter (TIKA-4034).
  • Add detection of, and a new MIME type ("application/illustrator+ps") for, older PostScript-based Adobe Illustrator files (TIKA-3971).
  • Add magic detection for Canon raw file types: crw, cr2 and cr3 (TIKA-3991).
  • Add detection for ONIX message files (TIKA-4011).
  • Add detection and a parser for ActiveMime files (TIKA-3987).
  • Add extraction of rendition layout value and version from EPUB files (TIKA-4013).
  • Improve embedded file extraction from PDFs (TIKA-4012).
  • Improve metadata extraction from WARCs (TIKA-4018).
  • Update to PDFBox 2.0.28 (TIKA-4016).
  • Users may now avoid the ZeroByteFileException via a setting on the AutoDetectParserConfig (TIKA-3976).
  • Fix bug in closing "a" elements in the presence of "b" elements in RTF files (TIKA-3972).
  • Improve extraction of embedded file names in .docx (TIKA-3968).
  • Normalize author, title, subject and description to their Dublin Core properties in the HTMLParser (TIKA-3963).
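
For the CompositeExternalParser change (TIKA-4022), the exclusion is typically expressed in a tika-config.xml file. The following is a minimal sketch, assuming the standard parser-exclude mechanism of DefaultParser; it is not taken from the release itself, and any existing configuration you have would need to be merged with it:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Load the default set of parsers, but leave out the parser
         that calls exiftool and ffmpeg when they are installed. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.external.CompositeExternalParser"/>
    </parser>
  </parsers>
</properties>
```

With tika-app, a file like this can usually be supplied through the --config option, for example --config=tika-config.xml (the file name here is only an example).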

The following people have contributed to Tika 2.8.0 by submitting or commenting on the issues resolved in this release:

  • Amit Pandey
  • Chris Mattmann
  • Gregory Lepore
  • Josh Burchard
  • Tayseer Sabha
  • Thomas Ledoux
  • Tilman Hausherr
  • Tim Allison

See https://s.apache.org/sigxx for more details on these contributions.