Apache Tika 3.0.0-BETA

The most notable changes in Tika 3.0.0-BETA over the previous release are:

  BREAKING CHANGES
  • Require Java 11 (TIKA-4128).
  • The boilerpipe handler has been moved to the tika-handler-boiler-pipe package (TIKA-4138).
  • HTML parsing has been migrated from TagSoup to JSoup. If you have a custom configuration for the HTMLParser, you'll need to change it to o.a.t.p.html.JSoupParser (TIKA-1599).
  • Removed xerces2 as a dependency (TIKA-4135).
  • Tika will look for "custom-mimetypes.xml" directly on the classpath, NOT under "/org/apache/tika/mime/" (TIKA-4147).

  Other Changes/Updates
  • Upgrade to PDFBox 3.0.1 (TIKA-3347).
  • Deprecated AbstractParser for removal in 4.x (TIKA-4132).
  • Fix bug in DateUtils that stripped timezone information from incoming Calendar objects (TIKA-4126).
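
For the "custom-mimetypes.xml" change (TIKA-4147), a small check along the following lines may help confirm where the file sits on an application's classpath before upgrading. This is only a sketch using the standard Java ClassLoader API, not Tika's own lookup code; the class name CustomMimeTypesCheck and the report() helper are hypothetical names chosen for illustration.

```java
import java.net.URL;

public class CustomMimeTypesCheck {
    // Resource names taken from the release note; ClassLoader paths carry no leading slash.
    static final String NEW_LOCATION = "custom-mimetypes.xml";
    static final String OLD_LOCATION = "org/apache/tika/mime/custom-mimetypes.xml";

    /** Hypothetical helper: reports where, if anywhere, the file is visible to the loader. */
    static String report(ClassLoader loader) {
        URL atRoot = loader.getResource(NEW_LOCATION);
        URL underOldPath = loader.getResource(OLD_LOCATION);
        if (atRoot != null) {
            return "found at classpath root: " + atRoot;
        }
        if (underOldPath != null) {
            return "found only under the old path, move it to the classpath root: " + underOldPath;
        }
        return "no custom-mimetypes.xml on the classpath";
    }

    public static void main(String[] args) {
        System.out.println(report(CustomMimeTypesCheck.class.getClassLoader()));
    }
}
```

Running the class with the application's usual classpath prints one of the three messages; the middle one indicates a file that would need to be moved under the new rule.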

The following people have contributed to Tika 3.0.0-BETA by submitting or commenting on the issues resolved in this release:

  • Cassandra Xia
  • Desmond David
  • Florent Valdelievre
  • Kenneth William Krugler
  • Maxim Solodovnik
  • NW Brad
  • RaahulUmapathy
  • Sandeep Kulkarni
  • Thorsten Heit
  • Tilman Hausherr
  • Tim Allison

See https://s.apache.org/15jlf for more details on these contributions.