Apache Tika 1.15
The most notable changes in Tika 1.15 over the previous release are:
- Tika now has a module for deep learning powered by the DL4J toolkit. The initial included model is InceptionV3, so with this module Tika can use deep learning, natively in Java, for metadata/text extraction from images using the power of the Inception model (Github-165).
- A new parser for sentiment analysis using categorical (multi-class: angry, sad, neutral, like, love) and binary (positive/negative) models was added, leveraging the USC Data Science work (TIKA-2016).
- Tika now has the ability to automatically detect objects in videos, using OpenCV and TensorFlow (TIKA-2322).
- Change default behavior to parse embedded documents even if the user forgets to specify a Parser.class in the ParseContext (TIKA-2096). Users who wish to parse only the container document should set an EmptyParser as the Parser.class in the ParseContext.
- Change default behavior of Office Parsers to _not_ extract macros. Users who want macros extracted need to call setExtractMacros(true) (TIKA-2302).
- Added tika-eval module (TIKA-1332).
- Unified logging across Tika: SLF4J as logging API, Apache Log4j as implementation with JCL and JUL bridges in standalone tools like tika-app, tika-batch and tika-server (TIKA-2245).
- Add parser for XLSB files (TIKA-1195).
- Add parsers for EMF/WMF files (TIKA-2246/TIKA-2247).
- Add parsers for WordPerfect and QuattroPro (.qpw) files. Contributed by Pascal Essiembre (TIKA-1946 and TIKA-2228).
- Add experimental SAX parser for .pptx files. To select this parser, set useSAXPptxExtractor(true) on OfficeParserConfig (TIKA-2210).
- Add experimental SAX parser for .docx files. To select this parser, set useSAXDocxExtractor(true) on OfficeParserConfig (TIKA-1321, TIKA-2191).
- Add mime detection and parser for Word 2006ML format (TIKA-2179).
- Bug fix for WordPerfect via Pascal Essiembre (TIKA-2352).
- Added "text-main" equivalent option to tika-server via /tika/main (TIKA-2343).
- Enabled configuration of the EncodingDetector used by parsers that extend AbstractEncodingDetectorParser (TIKA-2273).
- Prevent easily avoidable OOMs for both detection and parsing of some compression formats (TIKA-2330).
- Extract images and thumbnails from ODT via Sam Bayer (TIKA-2295).
- Fix potential NPE in FeedParser via Julien Nioche (TIKA-2269).
- Official mime types for BMP, EMF and WMF have been registered with IANA, so switch to these (image/bmp, image/emf, image/wmf) (TIKA-2250).
- Be more parsimonious with BufferedInputStreams via Josh Hight (TIKA-2244).
- Enable handling of hyphenated language codes in TesseractOCRParser via Graham Russell (TIKA-2231).
- Improve style tags in ODT (TIKA-2242).
- Add container detection for embedded MSEquation files (TIKA-2238).
- Add parsing of JBIG2 and extraction of JBIG2 from PDFs when required dependencies are added to class path by user. Contributed by Pascal Essiembre (TIKA-2232).
- Mime magic for the OneNote family (.one / .onetoc / .onepkg), no parser (TIKA-2224).
- Add configurability of "preserve-interword-spacing" to TesseractOCRParser (TIKA-2190).
- Upgrade PDFBox to 2.0.6 and JempBox 1.8.13 (TIKA-2361).
- Refactor MockParser to consolidate service loading and mime types into tika-core/src/test (TIKA-2195).
- Enabled extraction of embedded objects from headers, footers, footnotes, endnotes and comments in legacy .docx parser (TIKA-2192).
- Allow extraction of PDActions (including Javascript) from PDFs (TIKA-2090). This is turned off by default. Users must setExtractActions(true) on the PDFParserConfig.
- Change default behavior in experimental .docx parser to ignore deleted text to align with .doc (TIKA-2187).
- Upgrade to Apache POI 3.16 (TIKA-2116, TIKA-2181, TIKA-2329).
- Allow configuration of timeout for ForkParser (TIKA-2170).
- Add extraction of .jpx inline images from PDFs when required dependencies are added by user to class path (TIKA-2175).
- Add .jpx, .jp2, .ppm to formats handled by Tesseract (TIKA-2174).
- Upgrade "provided" Sqlite to 3.16.1 (TIKA-2334).
- Upgrade CXF version to 3.0.12 (TIKA-2292).
- Add Lingo24 Language Detector (TIKA-2297).
- Further mime magic for WebVTT (TIKA-1772).
- Extend support for increased PSM options up to 13 for modern versions of Tesseract (TIKA-2357).
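
Several of the changes above alter defaults or add opt-in switches that are set through the ParseContext. The following is a minimal sketch of how those settings might be wired together, assuming tika-core and tika-parsers 1.15 are on the class path. The class name ParseContextSketch and its two helper methods are hypothetical, and the setter names setUseSAXDocxExtractor and setUseSAXPptxExtractor are an assumption about how the useSAXDocxExtractor/useSAXPptxExtractor options mentioned above are exposed on OfficeParserConfig.

```java
import org.apache.tika.parser.EmptyParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.OfficeParserConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;

// Hypothetical helper class; not part of Tika itself.
public class ParseContextSketch {

    // TIKA-2096: embedded documents are now parsed by default.
    // Registering an EmptyParser as Parser.class restricts parsing
    // to the container document.
    static ParseContext containerOnly() {
        ParseContext context = new ParseContext();
        context.set(Parser.class, EmptyParser.INSTANCE);
        return context;
    }

    // Opt back in to features that are off by default in 1.15.
    static ParseContext withOptIns() {
        ParseContext context = new ParseContext();

        OfficeParserConfig office = new OfficeParserConfig();
        office.setExtractMacros(true);        // TIKA-2302
        office.setUseSAXDocxExtractor(true);  // TIKA-1321, TIKA-2191 (assumed setter name)
        office.setUseSAXPptxExtractor(true);  // TIKA-2210 (assumed setter name)
        context.set(OfficeParserConfig.class, office);

        PDFParserConfig pdf = new PDFParserConfig();
        pdf.setExtractActions(true);          // TIKA-2090
        context.set(PDFParserConfig.class, pdf);

        return context;
    }

    public static void main(String[] args) {
        ParseContext container = containerOnly();
        ParseContext optIns = withOptIns();
        System.out.println("EmptyParser registered: "
                + (container.get(Parser.class) == EmptyParser.INSTANCE));
        System.out.println("OfficeParserConfig registered: "
                + (optIns.get(OfficeParserConfig.class) != null));
        System.out.println("PDFParserConfig registered: "
                + (optIns.get(PDFParserConfig.class) != null));
    }
}
```

Either context would then be passed as the last argument to a parser's parse(stream, handler, metadata, context) call.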
The following people have contributed to Tika 1.15 by submitting or commenting on the issues resolved in this release:
- Adam Carroll
- Aeham Abushwashi
- Anastasija Mensikova
- Bipul Kumar
- Chris A. Mattmann
- Dave Meikle
- David Pilato
- Fabio
- Frederic Ronny
- Jan Van Raemdonck
- Jasper Hafkenscheid
- Jorge Spinsanti
- Joshua Hight
- Julian
- Julien Nioche
- Ken Krugler
- Kevin Oberlag
- Konstantin Gribov
- Laszlo Marai
- Lewis John McGibbney
- Luis Filipe Nassif
- Madhav Sharan
- Matthew Caruana Galizia
- Michal Hlavac
- Mike Liu
- Nick Burch
- Nick C
- Nino Skopac
- Panagiotis Mpailis
- Pascal Essiembre
- Peter Weiss
- Robin Schimpf
- Sean Story
- senthil
- Sergey Beryozkin
- Seva Alekseyev
- Thamme Gowda
- Thomas Galla
- Tim Allison
- Tim Kingsbury
See https://s.apache.org/XowY for more details on these contributions.