Apache Tika

Security Model

Parsing is dangerous, and bad things can happen when parsing untrusted data. Apache Tika is primarily designed to work with trusted or sanitized data. Users are responsible for handling crashes and other consequences of parsing untrusted data. See the Robustness of Apache Tika page for guidance on how to run Tika more safely.
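
As one illustration of handling crashes and hangs, a caller can run each parse in a separate child process with a time limit and treat any timeout or abnormal exit as a failed parse. The following is a minimal sketch under stated assumptions, not Tika's own mechanism: the `run_isolated` helper and the `ParseResult` type are hypothetical names, and the command passed in is whatever parser invocation the caller already uses.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ParseResult:
    ok: bool
    output: str
    error: Optional[str]


def run_isolated(cmd: Sequence[str], timeout_s: float) -> ParseResult:
    """Run a parser command in a child process and contain its failures.

    Hypothetical helper: `cmd` is assumed to be the caller's own parser
    invocation. A crash or hang in the child does not take down the caller.
    """
    try:
        proc = subprocess.run(
            list(cmd),
            capture_output=True,
            text=True,
            errors="replace",  # do not fail on undecodable bytes in the output
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising.
        return ParseResult(False, "", f"timed out after {timeout_s}s")
    except OSError as exc:
        return ParseResult(False, "", f"could not start: {exc}")
    if proc.returncode != 0:
        return ParseResult(False, "", f"exit code {proc.returncode}")
    return ParseResult(True, proc.stdout, None)


if __name__ == "__main__":
    # Stand-in command for demonstration; replace with a real parser call.
    print(run_isolated([sys.executable, "-c", "print('hello')"], 10.0))
```

A time limit alone does not bound memory or disk use; operating-system limits or a container would be needed for that.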

MIME detection and content extraction are both inherently challenging and error-prone tasks. We advise against trusting either MIME detection or content extraction without verification in high-risk applications such as cross-domain filtering or search.
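
One simple form of verification is to accept a file only when an independently declared media type agrees with the detected one and the type is on an explicit allowlist. The sketch below is an assumption-laden illustration rather than a recommended policy: `base_type` and `is_acceptable` are hypothetical helpers, and the detected value is assumed to be a media type string obtained from a detector elsewhere.

```python
from typing import AbstractSet, Optional


def base_type(media_type: Optional[str]) -> str:
    """Lowercase a media type and drop parameters such as '; charset=utf-8'."""
    if not media_type:
        return ""
    return media_type.split(";", 1)[0].strip().lower()


def is_acceptable(
    declared: Optional[str],
    detected: Optional[str],
    allowed: AbstractSet[str],
) -> bool:
    """Accept only if both types are present, agree, and are allowlisted.

    Hypothetical policy: any disagreement or missing value is a rejection.
    """
    declared_base = base_type(declared)
    detected_base = base_type(detected)
    return (
        bool(detected_base)
        and declared_base == detected_base
        and detected_base in allowed
    )


if __name__ == "__main__":
    allowed = frozenset({"application/pdf", "text/plain"})
    print(is_acceptable("text/plain; charset=utf-8", "text/plain", allowed))
    print(is_acceptable("application/pdf", "application/zip", allowed))
```

Agreement between two sources does not prove a file is benign; a crafted file can satisfy both. The check only narrows what reaches later processing.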

Tika is not designed to identify, or to render safe, files that are crafted to trigger direct vulnerabilities or to create parser differentials (such as polyglots, chimeras, schizophrenic files or ...).

Files can be crafted to evade detection, hinder analysis or otherwise cause mayhem in countless ways.

Running tika-server adds its own security risks. Depending on the settings and which modules are loaded (tika-pipes, for example), tika-server may grant read and write access with the same privileges as the user running the application. We strongly encourage defense in depth with tika-server, including isolating access to its endpoints, setting up two-way TLS, and limiting its user permissions.

Users need to take precautions when parsing untrusted data. We welcome suggestions and pull requests for hardening the code base.
