Apache Tika

Apache Tika 1.15

The most notable changes in Tika 1.15 over the previous release are:

Tika now has a module for Deep Learning powered by the DL4J toolkit. The initial included model is for InceptionV3, so using this module, natively in Java, Tika can use deep learning for metadata/text extraction from images using the power of the Inception model (Github-165).
A new parser for sentiment analysis, using a categorical model (multi-class: angry, sad, neutral, like, love) and a binary model (positive/negative), was added leveraging the USC data science work (TIKA-2016).
Tika now has the ability to automatically detect objects in videos, using OpenCV and Tensorflow (TIKA-2322).
Change default behavior to parse embedded documents even if the user forgets to specify a Parser.class in the ParseContext (TIKA-2096). Users who wish to parse only the container document should set an EmptyParser as the Parser.class in the ParseContext.
Change default behavior of Office Parsers to _not_ extract macros. Users who want macros need to call setExtractMacros(true) (TIKA-2302).
Added tika-eval module (TIKA-1332).
Unified logging across Tika: SLF4J as logging API, Apache Log4j as implementation with JCL and JUL bridges in standalone tools like tika-app, tika-batch and tika-server (TIKA-2245).
Add parser for XLSB files (TIKA-1195).
Add parsers for EMF/WMF files (TIKA-2246/TIKA-2247).
Add parsers for WordPerfect and QuattroPro (.qpw) files. Contributed by Pascal Essiembre (TIKA-1946 and TIKA-2228).
Add experimental SAX parser for .pptx files. To select this parser, set useSAXPptxExtractor(true) on OfficeParserConfig (TIKA-2210).
Add experimental SAX parser for .docx files. To select this parser, set useSAXDocxExtractor(true) on OfficeParserConfig (TIKA-1321, TIKA-2191).
Add mime detection and parser for Word 2006ML format (TIKA-2179).
Bug fix for WordPerfect via Pascal Essiembre (TIKA-2352).
Added "text-main" equivalent option to tika-server via /tika/main (TIKA-2343).
Enabled configuration of the EncodingDetector used by parsers that extend AbstractEncodingDetectorParser (TIKA-2273).
Prevent easily preventable OOMs for both detection and parsing of some compression formats (TIKA-2330).
Extract images and thumbnails from ODT via Sam Bayer (TIKA-2295).
Fix potential NPE in FeedParser via Julien Nioche (TIKA-2269).
Official mime types for BMP, EMF and WMF have been registered with IANA, so switch to these (image/bmp, image/emf, image/wmf) (TIKA-2250).
Be more parsimonious with BufferedInputStreams via Josh Hight (TIKA-2244).
Enable handling of hyphenated language codes in TesseractOCRParser via Graham Russell (TIKA-2231).
Improve style tags in ODT (TIKA-2242).
Add container detection for embedded MSEquation files (TIKA-2238).
Add parsing of JBIG2 and extraction of JBIG2 from PDFs when required dependencies are added to class path by user. Contributed by Pascal Essiembre (TIKA-2232).
Mime magic for the OneNote family (.one / .onetoc / .onepkg), no parser (TIKA-2224).
Add configurability of "preserve-interword-spacing" to TesseractOCRParser (TIKA-2190).
Upgrade PDFBox to 2.0.6 and JempBox 1.8.13 (TIKA-2361).
Refactor MockParser to consolidate service loading and mime types into tika-core/src/test (TIKA-2195).
Enabled extraction of embedded objects from headers, footers, footnotes, endnotes and comments in legacy .docx parser (TIKA-2192).
Allow extraction of PDActions (including Javascript) from PDFs (TIKA-2090). This is turned off by default. Users must setExtractActions(true) on the PDFParserConfig.
Change default behavior in experimental .docx parser to ignore deleted text to align with .doc (TIKA-2187).
Upgrade to Apache POI 3.16 (TIKA-2116, TIKA-2181, TIKA-2329).
Allow configuration of timeout for ForkParser (TIKA-2170).
Add extraction of .jpx inline images from PDFs when required dependencies are added by user to class path (TIKA-2175).
Add .jpx, .jp2, .ppm to formats handled by Tesseract (TIKA-2174).
Upgrade "provided" Sqlite to 3.16.1 (TIKA-2334).
Upgrade CXF version to 3.0.12 (TIKA-2292).
Add Lingo24 Language Detector (TIKA-2297).
Further mime magic for WebVTT (TIKA-1772).
Extend support for increased PSM options up to 13 for modern versions of Tesseract (TIKA-2357).
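
Several of the items above change defaults or add opt-in switches (TIKA-2096, TIKA-2302, TIKA-2210, TIKA-2191, TIKA-2090). A minimal sketch of how those switches might be combined in one ParseContext follows. It assumes tika-parsers 1.15 is on the class path and that the setter names follow the usual JavaBean pattern (setUseSAXDocxExtractor, setUseSAXPptxExtractor); the class name Tika115OptionsSketch and the containerOnly flag are made up for illustration.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.EmptyParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.OfficeParserConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

/** Hypothetical example class; requires tika-parsers on the class path. */
final class Tika115OptionsSketch {

    static ParseContext buildContext(boolean containerOnly) {
        OfficeParserConfig office = new OfficeParserConfig();
        office.setExtractMacros(true);        // macros are no longer extracted by default
        office.setUseSAXDocxExtractor(true);  // experimental SAX .docx parser
        office.setUseSAXPptxExtractor(true);  // experimental SAX .pptx parser

        PDFParserConfig pdf = new PDFParserConfig();
        pdf.setExtractActions(true);          // PDActions are off by default

        ParseContext context = new ParseContext();
        context.set(OfficeParserConfig.class, office);
        context.set(PDFParserConfig.class, pdf);
        if (containerOnly) {
            // Embedded documents are parsed by default; an EmptyParser opts out.
            context.set(Parser.class, EmptyParser.INSTANCE);
        }
        return context;
    }

    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(in, handler, new Metadata(), buildContext(false));
        }
        System.out.println(handler.toString());
    }
}
```

Passing true for containerOnly restricts parsing to the container document, which was the effective behavior before this release when no Parser.class was set.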
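
The /tika/main endpoint added to tika-server (TIKA-2343) can be exercised with nothing more than the JDK. The sketch below is illustrative only: it assumes a tika-server instance listening on its default port 9998, and it assumes the endpoint accepts a PUT request with the document as the body and returns plain text in UTF-8, as the other /tika endpoints do. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical client sketch for the tika-server /tika/main endpoint. */
final class TikaServerMainSketch {

    /** Builds the endpoint URL, tolerating a trailing slash on the base URL. */
    static String endpoint(String baseUrl) {
        String trimmed = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        return trimmed + "/tika/main";
    }

    /** Sends the document and returns the response body as text (assumed UTF-8). */
    static String extractMainText(String baseUrl, byte[] document) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(endpoint(baseUrl)).openConnection();
        conn.setRequestMethod("PUT");                    // assumption: PUT, as for /tika
        conn.setDoOutput(true);
        conn.setRequestProperty("Accept", "text/plain"); // assumption: plain-text response
        try (OutputStream out = conn.getOutputStream()) {
            out.write(document);
        }
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] document = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(extractMainText("http://localhost:9998", document));
    }
}
```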
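
The three Tesseract-related items (hyphenated language codes, "preserve-interword-spacing", and PSM values up to 13) all end up as arguments on the tesseract command line. The following sketch shows one way such arguments could be validated and assembled. It is not TesseractOCRParser's implementation: the class name, the validation pattern, and the example language code deu-frak are assumptions, and the --psm spelling is the one used by recent Tesseract releases (older releases spell it -psm).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Tika: assembles a tesseract command line. */
final class TesseractCommandSketch {

    // Letters, digits, underscores and hyphens; '+' joins several languages.
    private static final Pattern LANGUAGE =
            Pattern.compile("[A-Za-z0-9_-]+(\\+[A-Za-z0-9_-]+)*");

    static final int MIN_PSM = 0;
    static final int MAX_PSM = 13; // upper bound named in TIKA-2357

    static List<String> build(String input, String outputBase, String language,
                              int psm, boolean preserveInterwordSpacing) {
        if (!LANGUAGE.matcher(language).matches()) {
            throw new IllegalArgumentException("Invalid language code: " + language);
        }
        if (psm < MIN_PSM || psm > MAX_PSM) {
            throw new IllegalArgumentException("PSM must be between 0 and 13: " + psm);
        }
        List<String> command = new ArrayList<>();
        command.add("tesseract");
        command.add(input);
        command.add(outputBase);
        command.add("-l");
        command.add(language);
        command.add("--psm");
        command.add(Integer.toString(psm));
        command.add("-c");
        command.add("preserve_interword_spaces=" + (preserveInterwordSpacing ? "1" : "0"));
        return command;
    }

    public static void main(String[] args) {
        // Prints the assembled command; nothing is executed.
        System.out.println(String.join(" ", build("scan.png", "scan", "deu-frak", 13, true)));
    }
}
```

Rejecting out-of-range values before the process is launched keeps the failure close to the configuration mistake instead of surfacing it as a Tesseract error message.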

The following people have contributed to Tika 1.15 by submitting or commenting on the issues resolved in this release:

Adam Carroll
Aeham Abushwashi
Anastasija Mensikova
Bipul Kumar
Chris A. Mattmann
Dave Meikle
David Pilato
Fabio
Frederic Ronny
Jan Van Raemdonck
Jasper Hafkenscheid
Jorge Spinsanti
Joshua Hight
Julian
Julien Nioche
Ken Krugler
Kevin Oberlag
Konstantin Gribov
Laszlo Marai
Lewis John McGibbney
Luis Filipe Nassif
Madhav Sharan
Matthew Caruana Galizia
Michal Hlavac
Mike Liu
Nick Burch
Nick C
Nino Skopac
Panagiotis Mpailis
Pascal Essiembre
Peter Weiss
Robin Schimpf
Sean Story
senthil
Sergey Beryozkin
Seva Alekseyev
Thamme Gowda
Thomas Galla
Tim Allison
Tim Kingsbury

See https://s.apache.org/XowY for more details on these contributions.
