Apache Tika 4.0.0-beta-1
The most notable changes in Tika 4.0.0-beta-1 over the previous release are:
Breaking Changes
- The default content handler is now Markdown. tika-app, tika-server (the /tika and /rmeta endpoints), and the async/pipes CLI now emit Markdown by default instead of XHTML/XML (or, for the async CLI, plain text). To get the previous format, request it explicitly: tika-app -x/--xml, the tika-server /tika/xml and /rmeta/xml paths (or the X-Tika-Handler header), or the async CLI --handler x (TIKA-4663).
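For HTTP clients that still need XHTML from tika-server, the change above amounts to targeting the /tika/xml path instead of /tika. The following is a minimal sketch, not an official client: the class and method names are hypothetical, and the base URL (http://localhost:9998) is an assumption about where the server is listening. It only builds the request; sending it requires a running server.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class TikaXhtmlRequest {

    // Builds a PUT request against the explicit XHTML path named in the
    // release notes (/tika/xml). The baseUrl is supplied by the caller;
    // "http://localhost:9998" is only an assumed example value.
    static HttpRequest xhtmlRequest(String baseUrl, byte[] document) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/tika/xml"))
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = xhtmlRequest("http://localhost:9998", new byte[0]);
        // Print the method and target so the change of path is visible.
        System.out.println(request.method() + " " + request.uri());
    }
}
```

To send the request, pass it to java.net.http.HttpClient with a string body handler. Clients that are happy with Markdown can keep using the plain /tika path.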
New Features
- tika-app and tika-server can load extra jars (additional EncodingDetectors, Parsers, etc.) from the directory named by the -Dtika.extras.dir system property, without repackaging the application. This is off by default; the directory is a trusted code location whose contents run with full process privileges. The extra jars are also forwarded to forked pipes/server worker processes, so they are available where parsing actually happens (TIKA-4755).
- More granular, default-deny capability flags for tika-server and tika-grpc. tika-server's enableUnsecureFeatures is split into allowPipes (gates the /pipes and /async endpoints) and allowPerRequestConfig (gates the /config endpoints and the multipart config part); the /status endpoint is no longer gated and is enabled simply by listing it under endpoints. tika-grpc gains the same allowPerRequestConfig flag plus allowComponentModifications (gates runtime Save/Delete of fetchers and pipes iterators). All flags default to false, so an out-of-the-box tika-grpc server no longer accepts per-request configuration or runtime store mutations (TIKA-4764).
- Add a maxPages option to PDFParserConfig to cap how many pages are processed (default -1, no limit); processing stops early once the limit is reached, skipping text extraction and font/content-stream work for the remaining pages (via Julien Nioche) (Github-2803).
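The default-deny behaviour of the new tika-server capability flags can be summarised as a small decision function. The sketch below is a hypothetical model written for illustration, not Tika's implementation: the class and method names are invented, matching by path prefix is a simplification, and the assumption that the status endpoint is listed under the name "status" is not confirmed by these notes. It also ignores the multipart config part, which allowPerRequestConfig gates as well.

```java
import java.util.Set;

public class CapabilityGate {

    // Hypothetical model of the tika-server gating rules described above.
    // Both flags default to false in the server, so callers that pass
    // (false, false) see the out-of-the-box behaviour.
    static boolean isAllowed(String path,
                             boolean allowPipes,
                             boolean allowPerRequestConfig,
                             Set<String> endpoints) {
        if (path.startsWith("/pipes") || path.startsWith("/async")) {
            return allowPipes;                 // gated by allowPipes
        }
        if (path.startsWith("/config")) {
            return allowPerRequestConfig;      // gated by allowPerRequestConfig
        }
        if (path.equals("/status")) {
            // No longer gated by a flag; enabled by listing it under endpoints.
            // The entry name "status" is an assumption of this sketch.
            return endpoints.contains("status");
        }
        return true; // other endpoints are outside the scope of this sketch
    }

    public static void main(String[] args) {
        // With the defaults, /pipes is refused.
        System.out.println(isAllowed("/pipes", false, false, Set.of()));
        // /status is served once it is listed, regardless of the flags.
        System.out.println(isAllowed("/status", false, false, Set.of("status")));
    }
}
```

The point of the split is that each risky capability is opted into separately, so enabling /pipes does not also enable per-request configuration.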
Other Changes
- Release artifacts are now channel-specific. Maven Central gets slim per-module jars (plus pom, sources and javadoc); the Apache dist area gets runnable zip distributions (tika-app, tika-server-standard, tika-eval-app) and drop-in pf4j plugin zips; Docker Hub gets ready-to-run images. Fat/shaded artifacts no longer go to Maven Central, and release staging is validated so a missing artifact fails the build (TIKA-4733).
- Improved charset (encoding) detection and junk/garbage-text detection, including more efficient common-token lookups via bloom filters (TIKA-4731, TIKA-4745, TIKA-4754).
- Dependency upgrades, including Jetty 11 -> 12.0.36, CXF 4.0 -> 4.1.7, plus routine library updates (TIKA-4327).
The following people have contributed to Tika 4.0.0-beta-1 by submitting or commenting on the issues resolved in this release:
- Aashish Tudu
- Adrian Bird
- Alexander Veit
- Chengxin Xu
- Claude Warren
- David Frizelle
- Eric Schoen
- Francesco
- Ghiles OUAREZKI
- Grigorii Ioffe
- Iachimoe
- james
- Julien Nioche
- Justin Deoliveira
- Klara Mazurak
- Konrad Windszus
- Laura Delmaestro
- Lawrence Moorehead
- Leszek Sliwko
- Lewis John McGibbney
- Manish S N
- Matt Dutton
- Nino Skopac
- Olivier Ceulemans
- Peter Hoogendijk
- Pleeplop
- Ruairidh Williamson
- Sandeep Kulkarni
- Sebastian Nagel
- Shawn Rutledge
- Stephen H
- Steven Huypens
- Subbu
- Tiancheng Dai
- Tilman Hausherr
- Tim Allison
- Tim Barrett
- Tom Brisland
- Valery Yatsynovich
- V. S.
See https://s.apache.org/wlh7f for more details on these contributions.


