Apache Tika 1.15

The most notable changes in Tika 1.15 over the previous release are:

  • Tika now has a module for deep learning powered by the DL4J toolkit. The initial included model is for InceptionV3, and so using this module, natively in Java, Tika can use deep learning for metadata/text extraction from images using the power of the Inception model (Github-165).
  • A new parser for sentiment analysis using categorical (multi-class: angry, sad, neutral, like, love) and binary (positive/negative) classification was added, leveraging the USC data science work (TIKA-2016).
  • Tika now has the ability to automatically detect objects in videos, using OpenCV and Tensorflow (TIKA-2322).
  • Change default behavior to parse embedded documents even if the user forgets to specify a Parser.class in the ParseContext (TIKA-2096). Users who wish to parse only the container document should set an EmptyParser as the Parser.class in the ParseContext.
  • Change default behavior of Office Parsers to _not_ extract macros. Users who need macros must call setExtractMacros(true) (TIKA-2302).
  • Added tika-eval module (TIKA-1332).
  • Unified logging across Tika: SLF4J as logging API, Apache Log4j as implementation with JCL and JUL bridges in standalone tools like tika-app, tika-batch and tika-server (TIKA-2245).
  • Add parser for XLSB files (TIKA-1195).
  • Add parsers for EMF/WMF files (TIKA-2246/TIKA-2247).
  • Add parsers for WordPerfect and QuattroPro (.qpw) files. Contributed by Pascal Essiembre (TIKA-1946 and TIKA-2228).
  • Add experimental SAX parser for .pptx files. To select this parser, set useSAXPptxExtractor(true) on OfficeParserConfig (TIKA-2210).
  • Add experimental SAX parser for .docx files. To select this parser, set useSAXDocxExtractor(true) on OfficeParserConfig (TIKA-1321, TIKA-2191).
  • Add mime detection and parser for Word 2006ML format (TIKA-2179).
  • Bug fix for WordPerfect via Pascal Essiembre (TIKA-2352).
  • Added "text-main" equivalent option to tika-server via /tika/main (TIKA-2343).
  • Enabled configuration of the EncodingDetector used by parsers that extend AbstractEncodingDetectorParser (TIKA-2273).
  • Prevent easily preventable OOMs for both detection and parsing of some compression formats (TIKA-2330).
  • Extract images and thumbnails from ODT via Sam Bayer (TIKA-2295).
  • Fix potential NPE in FeedParser via Julien Nioche (TIKA-2269).
  • Official mime types for BMP, EMF and WMF have been registered with IANA, so switch to these (image/bmp, image/emf, image/wmf) (TIKA-2250).
  • Be more parsimonious with BufferedInputStreams via Josh Hight (TIKA-2244).
  • Enable handling of hyphenated language codes in TesseractOCRParser via Graham Russell (TIKA-2231).
  • Improve style tags in ODT (TIKA-2242).
  • Add container detection for embedded MSEquation files (TIKA-2238).
  • Add parsing of JBIG2 and extraction of JBIG2 from PDFs when required dependencies are added to class path by user. Contributed by Pascal Essiembre (TIKA-2232).
  • Add mime magic for the OneNote family (.one / .onetoc / .onepkg); no parser is included (TIKA-2224).
  • Add configurability of "preserve-interword-spacing" to TesseractOCRParser (TIKA-2190).
  • Upgrade PDFBox to 2.0.6 and JempBox to 1.8.13 (TIKA-2361).
  • Refactor MockParser to consolidate service loading and mime types into tika-core/src/test (TIKA-2195).
  • Enabled extraction of embedded objects from headers, footers, footnotes, endnotes and comments in legacy .docx parser (TIKA-2192).
  • Allow extraction of PDActions (including Javascript) from PDFs (TIKA-2090). This is turned off by default. Users must call setExtractActions(true) on the PDFParserConfig to enable it.
  • Change default behavior in experimental .docx parser to ignore deleted text to align with .doc (TIKA-2187).
  • Upgrade to Apache POI 3.16 (TIKA-2116, TIKA-2181, TIKA-2329).
  • Allow configuration of timeout for ForkParser (TIKA-2170).
  • Add extraction of .jpx inline images from PDFs when required dependencies are added by user to class path (TIKA-2175).
  • Add .jpx, .jp2, .ppm to formats handled by Tesseract (TIKA-2174).
  • Upgrade "provided" Sqlite to 3.16.1 (TIKA-2334).
  • Upgrade CXF version to 3.0.12 (TIKA-2292).
  • Add Lingo24 Language Detector (TIKA-2297).
  • Further mime magic for WebVTT (TIKA-1772).
  • Extend support for increased PSM options up to 13 for modern versions of Tesseract (TIKA-2357).
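
As an illustration of the new /tika/main endpoint in tika-server (TIKA-2343), the following is a minimal client sketch using only the JDK. Several details are assumptions rather than statements from these release notes: that the endpoint accepts the document via HTTP PUT like the other /tika resources, that the server is reachable at a base URL such as http://localhost:9998, and that the response body is UTF-8 plain text. The class and method names (TikaMainClient, mainEndpoint, extractMainText) are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class; not part of Tika itself.
final class TikaMainClient {

    // Builds the /tika/main URI from a server base URL, tolerating one trailing slash.
    static URI mainEndpoint(String baseUrl) {
        String trimmed = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        return URI.create(trimmed + "/tika/main");
    }

    // Sends the document bytes with HTTP PUT (assumed method) and returns the
    // response body decoded as UTF-8 (assumed encoding).
    static String extractMainText(String baseUrl, byte[] document) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) mainEndpoint(baseUrl).toURL().openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Accept", "text/plain");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(document);
        }
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("tika-server returned HTTP " + status);
            }
            try (InputStream in = conn.getInputStream()) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                return new String(buf.toByteArray(), StandardCharsets.UTF_8);
            }
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: TikaMainClient <server-url> <file>");
            return;
        }
        byte[] document = Files.readAllBytes(Paths.get(args[1]));
        System.out.println(extractMainText(args[0], document));
    }
}
```

With a tika-server instance running, this could be invoked as `java TikaMainClient http://localhost:9998 page.html`, where page.html is any local file; adjust the base URL to match the actual deployment.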

The following people have contributed to Tika 1.15 by submitting or commenting on the issues resolved in this release:

  • Adam Carroll
  • Aeham Abushwashi
  • Anastasija Mensikova
  • Bipul Kumar
  • Chris A. Mattmann
  • Dave Meikle
  • David Pilato
  • Fabio
  • Frederic Ronny
  • Jan Van Raemdonck
  • Jasper Hafkenscheid
  • Jorge Spinsanti
  • Joshua Hight
  • Julian
  • Julien Nioche
  • Ken Krugler
  • Kevin Oberlag
  • Konstantin Gribov
  • Laszlo Marai
  • Lewis John McGibbney
  • Luis Filipe Nassif
  • Madhav Sharan
  • Matthew Caruana Galizia
  • Michal Hlavac
  • Mike Liu
  • Nick Burch
  • Nick C
  • Nino Skopac
  • Panagiotis Mpailis
  • Pascal Essiembre
  • Peter Weiss
  • Robin Schimpf
  • Sean Story
  • senthil
  • Sergey Beryozkin
  • Seva Alekseyev
  • Thamme Gowda
  • Thomas Galla
  • Tim Allison
  • Tim Kingsbury

See https://s.apache.org/XowY for more details on these contributions.