Security Model
Parsing is dangerous, and bad things can happen when parsing untrusted data. See, for example, our Security page, which documents vulnerabilities that have been fixed in Tika and its dependencies. These include, among others, denial of service, XML external entity injection/server-side request forgery, command injection, and deserialization of untrusted objects.
Apache Tika is primarily designed to work with trusted/sanitized data. Users are responsible for handling crashes and other consequences of parsing untrusted data. See the Robustness of Apache Tika page for guidance on how to run Tika more safely.
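As one illustration of what handling failures on the caller's side can look like, the sketch below runs a parse task on a worker thread and abandons it after a timeout. It uses only the Java standard library. The class name GuardedParse, the method run, and the Callable standing in for the actual parse call are all hypothetical and not part of Tika. A thread-level guard of this kind is weaker than process isolation: it cannot stop code that ignores interrupts, and it does not protect the JVM from out-of-memory conditions or native crashes.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tika: bounds how long the caller waits for a parse.
final class GuardedParse {

    /**
     * Runs parseTask on a worker thread and waits at most timeoutMillis.
     * Returns Optional.empty() on timeout or if the task throws.
     */
    static Optional<String> run(Callable<String> parseTask, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "guarded-parse");
            t.setDaemon(true); // a stuck parse must not keep the JVM alive
            return t;
        });
        Future<String> future = executor.submit(parseTask);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt only; code that ignores interrupts keeps running
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.empty(); // the task threw an exception or an Error
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return Optional.empty();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in tasks; a real caller would wrap its own parse call here.
        System.out.println(run(() -> "extracted text", 1_000).isPresent());
        System.out.println(run(() -> { Thread.sleep(5_000); return "late"; }, 100).isPresent());
    }
}
```

A caller that receives an empty result can then decide whether to quarantine the file, retry in a separate process, or report the failure, rather than letting a hung or crashing parse take down the surrounding application.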
The project does not view denial of service issues as security issues. Nevertheless, we do appreciate reports and pull requests to harden the codebase against denial of service and all vulnerabilities.
MIME detection and content extraction are both inherently challenging, error-prone tasks. In high-risk applications, such as cross-domain filtering or search, we advise against trusting either MIME detection or content extraction without verification.
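One simple form of verification is to treat a detected type as a single signal and fail closed unless it agrees with an independent one. The sketch below is a minimal example of that idea using only the Java standard library. The class MediaTypeGate and its methods are hypothetical, not Tika APIs, and the two type strings are assumed to come from the application (for example, a type declared by the sender and a type reported by a detector). Agreement between two signals does not prove a file is safe, since a crafted file can satisfy both, so a check like this only narrows what is accepted.

```java
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical gate, not part of Tika: accepts a file only when the declared and
// detected media types agree and the type is on an explicit allowlist.
final class MediaTypeGate {
    private final Set<String> allowed;

    MediaTypeGate(Set<String> allowed) {
        this.allowed = allowed.stream()
                .map(MediaTypeGate::normalize)
                .collect(Collectors.toUnmodifiableSet());
    }

    /** Strips parameters such as "; charset=utf-8" and lower-cases the base type. */
    static String normalize(String mediaType) {
        if (mediaType == null) {
            return "";
        }
        int semi = mediaType.indexOf(';');
        String base = semi >= 0 ? mediaType.substring(0, semi) : mediaType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** Fails closed: missing, mismatched, or non-allowlisted types are rejected. */
    boolean accept(String declaredType, String detectedType) {
        String declared = normalize(declaredType);
        String detected = normalize(detectedType);
        return !detected.isEmpty()
                && detected.equals(declared)
                && allowed.contains(detected);
    }

    public static void main(String[] args) {
        MediaTypeGate gate = new MediaTypeGate(Set.of("application/pdf", "text/plain"));
        System.out.println(gate.accept("application/pdf", "application/pdf"));
        System.out.println(gate.accept("application/pdf", "application/zip"));
    }
}
```

The allowlist and the choice of second signal are design decisions for the application; the point of the sketch is only that a mismatch or an unexpected type leads to rejection rather than to a best guess.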
Tika is not designed to identify, or to render safe, files that are crafted to trigger direct vulnerabilities or to create parser differentials (such as with polyglots, chimeras, schizophrenic files or ...).
Files can be crafted to evade detection, hinder analysis or otherwise cause mayhem in countless ways.
Running tika-server adds its own security risks. Depending on the settings and the loaded modules (tika-pipes, for example), it is possible that a client could have read and write access at the same level as the user running the application. We strongly encourage defense in depth with tika-server, including, for example, isolating access to its endpoints, setting up two-way TLS, and limiting the permissions of the user that runs tika-server, among other standard security practices.