Apache Tika 1.11
The most notable changes in Tika 1.11 over the previous release are:
- Java 7 API support was added: java.nio.file.Path is now accepted as a method argument in Tika, ParsingReader, TikaFileTypeDetector, and Tika Config (TIKA-1745, TIKA-1746, TIKA-1751).
- MIME support was added for WebVTT (Web Video Text Tracks Format) files (TIKA-1772).
- MIME magic was improved to ensure emails are detected as message/rfc822 (TIKA-1771).
- Upgrade to Jackcess Encrypt 2.1.1 to avoid binary incompatibility with Bouncy Castle (TIKA-1736).
- Make div and other markup more consistent between PPT and PPTX (TIKA-1755).
- Parse multiple authors from MSOffice's semi-colon delimited author field (TIKA-1765).
- Include CTAKESConfig.properties within tika-parsers resources by default (TIKA-1741).
- Prevent infinite recursion when processing inline images in PDF files by limiting extraction of duplicate images within the same page (TIKA-1742).
- Upgrade to POI 3.13-final (via Andreas Beeker) (TIKA-1707).
- Upgraded tika-batch to use Path throughout (TIKA-1747 and TIKA-1754).
- Upgraded to Path in TikaInputStream (via Yaniv Kunda) (TIKA-1744).
- Changed the default content handler type for "/rmeta" in tika-server to "xml" to align with the "-J" option in tika-app. Clients can now specify handler types via PathParam (TIKA-1716).
- The fantastic GROBID (GeneRation Of BIbliographic Data), which applies machine learning to PDF files, is now integrated as a Tika parser (TIKA-1699, TIKA-1712).
- The ability to specify the Tesseract Config Path was added to the OCR Parser (TIKA-1703).
- Upgraded to ASM 5.0.4 (TIKA-1705).
- Corrected the explicit loading of MimeTypes in the Tika Config XML detector definition (TIKA-1708).
- Tika Parsers, Batch, Server, App and Examples now use Apache Commons IO instead of inlined ex-Commons classes, and the Java 7 StandardCharsets definitions (TIKA-1710).
- Upgraded to Commons Compress 1.10, which enables support for zlib-compressed archives (TIKA-1718).
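
To make the multiple-author change (TIKA-1765) concrete, here is a minimal sketch of splitting a semi-colon delimited author field into individual names. It uses only the Java standard library; the class AuthorFieldSketch and the method splitAuthors are hypothetical names chosen for illustration and are not Tika's actual implementation or API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not part of Tika.
class AuthorFieldSketch {

    // Splits a semi-colon delimited author field into trimmed, non-empty names.
    static List<String> splitAuthors(String field) {
        List<String> authors = new ArrayList<>();
        if (field == null) {
            return authors; // no author field present
        }
        for (String part : field.split(";")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                authors.add(name); // skip blanks such as a trailing ";"
            }
        }
        return authors;
    }

    public static void main(String[] args) {
        System.out.println(splitAuthors("Jane Doe; John Smith;"));
    }
}
```

Blank entries are dropped so that a trailing or doubled delimiter does not produce an empty author value.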
The following people have contributed to Tika 1.11 by submitting or commenting on the issues resolved in this release:
- Alexander Widera
- Bob Paulin
- Chris A. Mattmann
- Christian Wolfe
- Jeremy B. Merrill
- Jukka Zitting
- Justin Palmer
- Konstantin Gribov
- Lewis John McGibbney
- Nick Burch
- Sujen Shah
- Tim Allison
- Yaniv Kunda
See http://s.apache.org/fSj for more details on these contributions.