Apache Tika 1.12

The most notable changes in Tika 1.12 over the previous release are:

  • Support for iFrames and element link extraction is provided in the link Content Handler (TIKA-1835).
  • Slide notes are now linked to the slide XHTML in the PPT output (TIKA-1840).
  • JSON tests in Tika server were updated to remove impossible casts (Github-73).
  • Fix bug in GeoTopicParser where NER is reused instead of instantiated with each request (TIKA-1834).
  • Downgrade Rome dependency to 0.9 to avoid nasty NPE (TIKA-1820, TIKA-1516).
  • The NamedEntityParser was enhanced to generate text content in addition to metadata (TIKA-1815, TIKA-1816).
  • A significant speed-up is made to the GeoTopicParser by using the new REST server capabilities from Lucene GeoGazetteer (TIKA-1803).
  • A parser to compute motion properties in videos, e.g., Histogram of Oriented Gradients and Histogram of Optical Flows using the Pooled Time Series algorithm, was added (TIKA-1798).
  • Provide NamedEntityParser which exposes Named Entity Recognition from OpenNLP and Stanford NER providers (TIKA-1787, Github-61, Github-62).
  • Allow XHTMLContentHandler to pass attributes of the html element, via Markus Jelsma (TIKA-1782).
  • Fix regression with spacing in PPT via Andreas Beeker (TIKA-1777).
  • Tika Facade parse methods for Path and File added which take a Metadata object, to mirror the existing InputStream one (Github-60).

    The following people have contributed to Tika 1.12 by submitting or commenting on the issues resolved in this release:

    • Bob Paulin
    • Chris A. Mattmann
    • Ken Krugler
    • Lewis John McGibbney
    • Madhav Sharan
    • Markus Jelsma
    • Nick Burch
    • Roberto Benedetti
    • Thamme Gowda N
    • Tim Allison
    • Vjeran Marcinko
    • Yueheng He

    See https://s.apache.org/wDlx for more details on these contributions.