Apache Tika 4.0.0-beta-1

The most notable changes in Tika 4.0.0-beta-1 over the previous release are:

Breaking Changes

  • The default content handler is now Markdown. tika-app, tika-server (the /tika and /rmeta endpoints), and the async/pipes CLI now emit Markdown by default; previously, the default was XHTML/XML (plain text for the async CLI). To get the previous format, request it explicitly: for example, -x/--xml for tika-app, the /tika/xml and /rmeta/xml paths (or the X-Tika-Handler header) for the server, and --handler x for the async CLI (TIKA-4663).
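
As an illustration of the server-side opt-out described above, the following is a minimal sketch of a client that targets the /tika/xml path instead of /tika. It uses only the JDK's java.net.http classes. The base URL (http://localhost:9998), the use of PUT, and the class and method names are assumptions made for the example, not something these notes specify; adjust them to match your deployment.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of Tika.
final class TikaXmlRequest {

    // Builds (but does not send) a request whose path asks tika-server
    // for XHTML/XML output rather than the new Markdown default.
    static HttpRequest build(String baseUrl, byte[] document) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/tika/xml"))
                // Assumption: the document bytes are sent as the PUT body.
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) {
        // Assumption: a tika-server instance listening on localhost:9998.
        byte[] sample = "hello".getBytes(StandardCharsets.UTF_8);
        HttpRequest request = build("http://localhost:9998", sample);
        // prints "PUT http://localhost:9998/tika/xml"
        System.out.println(request.method() + " " + request.uri());
    }
}
```

To actually send the request, pass it to java.net.http.HttpClient#send with a running server. Swapping the path for /rmeta/xml should select the XML variant of the /rmeta endpoint in the same way.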

New Features

  • tika-app and tika-server can load extra jars (additional EncodingDetectors, Parsers, etc.) from the directory named by the -Dtika.extras.dir system property, without repackaging the application. Off by default; the directory is a trusted code location whose contents run with full process privileges. The extra jars are also forwarded onto forked pipes/server worker processes, so they are available where parsing actually happens (TIKA-4755).
  • More granular, default-deny capability flags for tika-server and tika-grpc. tika-server's enableUnsecureFeatures is split into allowPipes (gates the /pipes and /async endpoints) and allowPerRequestConfig (gates the /config endpoints and the multipart config part); the /status endpoint is no longer gated and is enabled simply by listing it under endpoints. tika-grpc gains the same allowPerRequestConfig flag plus allowComponentModifications (gates runtime Save/Delete of fetchers and pipes iterators). All flags default to false, so an out-of-the-box tika-grpc server no longer accepts per-request configuration or runtime store mutations (TIKA-4764).
  • Add a maxPages option to PDFParserConfig to cap how many pages are processed (default -1, no limit); processing stops early once the limit is reached, skipping text extraction and font/content-stream work for the remaining pages (via Julien Nioche) (Github-2803).
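
The maxPages behaviour described above can be pictured with a small, self-contained sketch. This is not Tika's implementation and does not call any Tika API: the class and method names are hypothetical, and it assumes that any negative value means "no limit" (the notes only state that the default of -1 does) and that a limit of 0 processes no pages.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a page cap; not Tika code.
final class MaxPagesSketch {

    // Returns the 1-based page numbers that would be processed.
    static List<Integer> pagesProcessed(int totalPages, int maxPages) {
        List<Integer> processed = new ArrayList<>();
        for (int page = 1; page <= totalPages; page++) {
            // Assumption: negative maxPages disables the cap.
            if (maxPages >= 0 && processed.size() >= maxPages) {
                break; // limit reached: remaining pages are skipped entirely
            }
            processed.add(page);
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(pagesProcessed(5, -1)); // prints [1, 2, 3, 4, 5]
        System.out.println(pagesProcessed(5, 2));  // prints [1, 2]
    }
}
```

The point of the early break is that skipped pages cost nothing: no text extraction or font/content-stream work is done for them.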

Other Changes

  • Release artifacts are now channel-specific. Maven Central gets slim per-module jars (plus pom, sources and javadoc); the Apache dist area gets runnable zip distributions (tika-app, tika-server-standard, tika-eval-app) and drop-in pf4j plugin zips; Docker Hub gets ready-to-run images. Fat/shaded artifacts no longer go to Maven Central, and release staging is validated so a missing artifact fails the build (TIKA-4733).
  • Improved charset (encoding) detection and junk/garbage-text detection, including more efficient common-token lookups via bloom filters (TIKA-4731, TIKA-4745, TIKA-4754).
  • Dependency upgrades, including Jetty 11 -> 12.0.36 and CXF 4.0 -> 4.1.7, plus routine library updates (TIKA-4327).

The following people have contributed to Tika 4.0.0-beta-1 by submitting or commenting on the issues resolved in this release:

  • Aashish Tudu
  • Adrian Bird
  • Alexander Veit
  • Chengxin Xu
  • Claude Warren
  • David Frizelle
  • Eric Schoen
  • Francesco
  • Ghiles OUAREZKI
  • Grigorii Ioffe
  • Iachimoe
  • james
  • Julien Nioche
  • Justin Deoliveira
  • Klara Mazurak
  • Konrad Windszus
  • Laura Delmaestro
  • Lawrence Moorehead
  • Leszek Sliwko
  • Lewis John McGibbney
  • Manish S N
  • Matt Dutton
  • Nino Skopac
  • Olivier Ceulemans
  • Peter Hoogendijk
  • Pleeplop
  • Ruairidh Williamson
  • Sandeep Kulkarni
  • Sebastian Nagel
  • Shawn Rutledge
  • Stephen H
  • Steven Huypens
  • Subbu
  • Tiancheng Dai
  • Tilman Hausherr
  • Tim Allison
  • Tim Barrett
  • Tom Brisland
  • Valery Yatsynovich
  • V. S.

See https://s.apache.org/wlh7f for more details on these contributions.