Security Model

Parsing untrusted data is dangerous: malformed or malicious input can crash a parser, exhaust resources, or trigger other vulnerabilities. Apache Tika is primarily designed to work with trusted/sanitized data. Users are responsible for handling crashes and other consequences of parsing untrusted data. See the Robustness of Apache Tika for guidance on how to run Tika more safely.
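As one illustration of handling a hung or failing parse, the sketch below runs a parse call on a worker thread and gives up after a deadline. It is a minimal sketch, not Tika's own mechanism: the `GuardedParse` class, its `run` method, and the `parseTask` callable are hypothetical names, and `parseTask` stands in for whatever code actually invokes the parser. A timeout on a thread only interrupts the work; it cannot contain memory exhaustion or a JVM crash, for which a separate process is the stronger boundary.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class GuardedParse {

    /** Outcome of one guarded parse attempt. */
    public enum Status { OK, TIMEOUT, FAILED }

    public static final class Result {
        public final Status status;
        public final String text; // extracted text, or null when status is not OK

        Result(Status status, String text) {
            this.status = status;
            this.text = text;
        }
    }

    /**
     * Runs parseTask on a worker thread and stops waiting after timeoutMillis.
     * parseTask is a hypothetical stand-in for the call that invokes the parser.
     */
    public static Result run(Callable<String> parseTask, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "guarded-parse");
            t.setDaemon(true); // a stuck parse must not keep the JVM alive
            return t;
        });
        Future<String> future = executor.submit(parseTask);
        try {
            String text = future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return new Result(Status.OK, text);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: this interrupts, it cannot force-stop
            return new Result(Status.TIMEOUT, null);
        } catch (ExecutionException e) {
            // The task threw; the cause is available via e.getCause() for logging.
            return new Result(Status.FAILED, null);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return new Result(Status.FAILED, null);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller would typically treat `TIMEOUT` and `FAILED` as "no content extracted" and record the file for later inspection rather than retrying it in the same process.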

Further, MIME detection and content extraction are both inherently challenging and prone to errors. We advise against trusting either MIME detection or content extraction without verification in high-risk applications such as cross-domain filtering or search.
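One simple form of verification is to refuse a file unless independent signals agree. The sketch below is an assumption-laden illustration, not part of Tika: `MediaTypeCheck`, `check`, and the allow-list are hypothetical names, and the detected and declared types are passed in as plain strings from wherever the application obtains them. Agreement between signals reduces, but does not eliminate, the chance of a misidentified file.

```java
import java.util.Locale;
import java.util.Set;

public class MediaTypeCheck {

    public enum Decision { ACCEPT, REJECT_NOT_ALLOWED, REJECT_MISMATCH }

    /** Strips parameters such as "; charset=UTF-8" and lowercases the base type. */
    static String normalize(String mediaType) {
        if (mediaType == null) {
            return "";
        }
        int semi = mediaType.indexOf(';');
        String base = semi >= 0 ? mediaType.substring(0, semi) : mediaType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /**
     * Accepts only when the detected type is on the allow-list AND matches the
     * type declared by the sender. The allow-list is assumed to hold lowercase
     * base types without parameters.
     */
    public static Decision check(String detected, String declared, Set<String> allowed) {
        String detectedBase = normalize(detected);
        if (!allowed.contains(detectedBase)) {
            return Decision.REJECT_NOT_ALLOWED;
        }
        if (!detectedBase.equals(normalize(declared))) {
            return Decision.REJECT_MISMATCH;
        }
        return Decision.ACCEPT;
    }
}
```

Rejecting on mismatch is a deliberately conservative choice: in a high-risk pipeline, a file whose declared and detected types disagree is more safely quarantined than guessed at.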

Tika is not designed to identify, or to render safe, files that are crafted to create parser differentials (such as with polyglots, chimeras, schizophrenic files or ...).

Files can be crafted to evade detection, hinder analysis or otherwise cause mayhem in countless ways.

Running tika-server adds its own security risks. Depending on the settings and what modules are loaded (tika-pipes, for example), it is possible to grant read and write access at the same level as the user running the application. We strongly encourage defense in depth with tika-server, including isolating access to its endpoints, setting up two-way TLS, and limiting its user permissions.

The project makes every effort to prevent Denial of Service attacks and other software vulnerabilities, and we welcome reports and example proof-of-concept files. Some Denial of Service attacks are not easily fixed, and users need to take precautions when parsing untrusted data.

We welcome suggestions and pull requests for hardening the code base.

See our Security page for fixed vulnerabilities.