Security
This page covers security considerations when using Apache Tika.
Security Model
Apache Tika’s security model describes the trust boundaries and assumptions that govern how Tika processes content. Understanding this model is essential for deploying Tika securely.
Known Vulnerabilities
For information about known security vulnerabilities (CVEs) in Apache Tika and their remediation, please see:
External Command Security
Apache Tika can be configured to use external system commands for certain operations,
such as the FileCommandDetector and ExternalParser components.
External command configuration should only be performed by trusted administrators. Never allow untrusted users to configure command paths or arguments.
Security Best Practices
-
Restrict configuration access: Only allow administrators to modify Tika configuration files that specify external commands.
-
Use absolute paths: Always configure external commands with absolute paths to prevent PATH manipulation attacks.
-
Sandbox execution: Consider running Tika in a container or sandbox environment to limit the impact of any command execution vulnerabilities.
-
Audit command configuration: Regularly review configured external commands and their arguments.
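As a rough illustration of the absolute-path recommendation above, a deployment could validate command strings before they reach any configuration. The sketch below uses only the JDK; the class and method names are hypothetical and are not part of Tika.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper for vetting external command paths; not a Tika API. */
final class CommandPathCheck {

    /**
     * Returns the command as a Path if it is absolute, otherwise throws.
     * A bare name such as "file" would be resolved through PATH,
     * which an attacker may be able to influence.
     */
    static Path requireAbsolute(String command) {
        Path path = Paths.get(command);
        if (!path.isAbsolute()) {
            throw new IllegalArgumentException(
                    "External command must be an absolute path: " + command);
        }
        return path;
    }

    /** Additionally requires the target to be an existing, executable regular file. */
    static Path requireExecutable(String command) {
        Path path = requireAbsolute(command);
        if (!Files.isRegularFile(path) || !Files.isExecutable(path)) {
            throw new IllegalArgumentException("Not an executable file: " + path);
        }
        return path;
    }
}
```

A check like this only guards against PATH manipulation. It does not replace restricting who can edit the configuration in the first place.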
ExternalParser-Specific Risks
-
checkCommandLine runs at type-query time: If configured, the check command executes the first time getSupportedTypes() is called, not at parse time. This means merely querying which parsers are available can trigger process execution.
-
stderr information leakage: External programs often write file paths, system usernames, version strings, and internal errors to stderr. By default, returnStderr is false to prevent this data from leaking into metadata. If you enable returnStderr, be aware that the raw stderr content will be stored in the document’s metadata and may be visible to end users.
-
Buffer limits: The maxStdOut and maxStdErr settings control how much process output is captured in memory. Set these to reasonable values for your deployment to prevent memory exhaustion from misbehaving external programs.
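To make the buffer-limit point concrete, the following sketch shows the general idea of capping captured process output. It is not Tika's implementation, and how Tika enforces maxStdOut and maxStdErr internally is not shown here; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical illustration of a capture limit; not a Tika API. */
final class BoundedCapture {

    /**
     * Reads the stream to the end but keeps at most maxBytes in memory.
     * Bytes past the limit are read and discarded so that a child process
     * writing to a pipe is not left blocked on a full buffer.
     */
    static byte[] readAtMost(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream kept = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            int room = maxBytes - kept.size();
            if (room > 0) {
                // Keep only as many bytes as still fit under the limit.
                kept.write(buffer, 0, Math.min(n, room));
            }
            // Otherwise the limit is reached: drop the bytes and keep draining.
        }
        return kept.toByteArray();
    }
}
```

With a cap in place, a misbehaving program that prints gigabytes of output costs at most maxBytes of heap per stream, plus the fixed read buffer.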
Credential Handling
Password Storage in Memory
Tika stores some credentials as Java String objects. Because String is immutable, these values cannot be overwritten after use and remain in memory until garbage collected. For environments with strict security requirements:
-
Use environment variables: Configure credentials via environment variables rather than configuration files where possible.
-
Use secret managers: Integrate with HashiCorp Vault, AWS Secrets Manager, or similar services for production deployments.
-
Enable encryption: Use the AES encryption option in HttpClientFactory for stored passwords.
-
Minimize credential scope: Use credentials with the minimum necessary privileges and rotate them regularly.
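As a minimal sketch of the environment-variable recommendation, application code that supplies credentials to Tika might read them as follows. The class, its methods, and the variable name TIKA_EXAMPLE_PASSWORD are hypothetical and used only for illustration. Note that the value returned by System.getenv is itself a String, so wiping the char[] copy only limits additional lingering copies rather than removing the secret from memory entirely.

```java
import java.util.Arrays;
import java.util.Map;

/** Hypothetical helper for reading a secret from the environment; not a Tika API. */
final class EnvCredential {

    /** Copies the named variable's value into a char[]; throws if it is unset or empty. */
    static char[] fromEnvironment(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.isEmpty()) {
            throw new IllegalStateException("Missing environment variable: " + name);
        }
        return value.toCharArray();
    }

    /** Overwrites the array so the secret does not linger in this buffer. */
    static void wipe(char[] secret) {
        Arrays.fill(secret, '\0');
    }
}
```

A caller would typically pass System.getenv() as the map, for example `char[] pw = EnvCredential.fromEnvironment(System.getenv(), "TIKA_EXAMPLE_PASSWORD");`, and call `EnvCredential.wipe(pw)` in a finally block once the credential has been handed off.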