Apache Tika 1.11

The most notable changes in Tika 1.11 over the previous release are:

  • Java 7 API support was added: java.nio.file.Path is now accepted as a method argument in Tika, ParsingReader, TikaFileTypeDetector, and Tika Config (TIKA-1745, TIKA-1746, TIKA-1751).
  • MIME support was added for WebVTT (the Web Video Text Tracks format) files (TIKA-1772).
  • MIME magic was improved to ensure emails are detected as message/rfc822 (TIKA-1771).
  • Upgrade to Jackcess Encrypt 2.1.1 to avoid binary incompatibility with Bouncy Castle (TIKA-1736).
  • Make div and other markup more consistent between PPT and PPTX (TIKA-1755).
  • Parse multiple authors from MSOffice's semi-colon delimited author field (TIKA-1765).
  • Include CTAKESConfig.properties within tika-parsers resources by default (TIKA-1741).
  • Prevent infinite recursion when processing inline images in PDF files by limiting extraction of duplicate images within the same page (TIKA-1742).
  • Upgrade to POI 3.13-final (via Andreas Beeker) (TIKA-1707).
  • Upgraded tika-batch to use Path throughout (TIKA-1747 and TIKA-1754).
  • Upgraded to Path in TikaInputStream (via Yaniv Kunda) (TIKA-1744).
  • Changed the default content handler type for "/rmeta" in tika-server to "xml" to align with the "-J" option in tika-app. Clients can now specify handler types via PathParam (TIKA-1716).
  • The fantastic GROBID (GeneRation Of BIbliographic Data) machine learning library for extracting bibliographic data from PDF files is now integrated as a Tika parser (TIKA-1699, TIKA-1712).
  • The ability to specify the Tesseract Config Path was added to the OCR Parser (TIKA-1703).
  • Upgraded to ASM 5.0.4 (TIKA-1705).
  • Corrected the Tika Config XML detector definition for explicit loading of MimeTypes (TIKA-1708).
  • In Tika Parsers, Batch, Server, App and Examples, use Apache Commons IO instead of inlined ex-Commons classes, and use the Java 7 StandardCharsets definitions (TIKA-1710).
  • Upgraded to Commons Compress 1.10, which enables support for zlib-compressed archives (TIKA-1718).
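
As a rough illustration of the Path-based style mentioned in the first item above, the sketch below uses only the JDK's own Files.probeContentType(Path). It assumes, without this having been verified here, that when tika-core is on the classpath its TikaFileTypeDetector is picked up through the standard java.nio.file.spi.FileTypeDetector service mechanism; without Tika present, the JDK's default detection is used instead. The class name PathDetectSketch and the helper probeOrDefault are hypothetical names made up for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch class; not part of Tika.
class PathDetectSketch {

    // Asks the installed FileTypeDetector implementations for a MIME type.
    // Returns the fallback when no detector recognizes the file.
    static String probeOrDefault(Path path, String fallback) throws IOException {
        String type = Files.probeContentType(path);
        return type != null ? type : fallback;
    }

    public static void main(String[] args) throws IOException {
        // Create a small temporary file that starts with the WebVTT signature line.
        Path sample = Files.createTempFile("tika-sketch", ".vtt");
        try {
            Files.write(sample, "WEBVTT\n\n".getBytes(StandardCharsets.UTF_8));
            // The printed type depends on which detectors are available at runtime.
            System.out.println(probeOrDefault(sample, "application/octet-stream"));
        } finally {
            Files.deleteIfExists(sample);
        }
    }
}
```

The fallback value keeps callers from handling null, since probeContentType returns null when no detector can determine a type.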

The following people have contributed to Tika 1.11 by submitting or commenting on the issues resolved in this release:

  • Alexander Widera
  • Bob Paulin
  • Chris A. Mattmann
  • Christian Wolfe
  • Jeremy B. Merrill
  • Jukka Zitting
  • Justin Palmer
  • Konstantin Gribov
  • Lewis John McGibbney
  • Nick Burch
  • Sujen Shah
  • Tim Allison
  • Yaniv Kunda

See http://s.apache.org/fSj for more details on these contributions.