Security

This page covers security considerations when using Apache Tika.

Security Model

Apache Tika’s security model describes the trust boundaries and assumptions that govern how Tika processes content. Understanding this model is essential for deploying Tika securely.

Known Vulnerabilities

Known security vulnerabilities (CVEs) in Apache Tika, and their remediation, are documented separately by the project.

External Command Security

Apache Tika can be configured to run external system commands for certain operations, for example through the FileCommandDetector and ExternalParser components.

External command configuration should only be performed by trusted administrators. Never allow untrusted users to configure command paths or arguments.

Security Best Practices

  1. Restrict configuration access: Only allow administrators to modify Tika configuration files that specify external commands.

  2. Use absolute paths: Always configure external commands with absolute paths to prevent PATH manipulation attacks.

  3. Sandbox execution: Consider running Tika in a container or sandbox environment to limit the impact of any command execution vulnerabilities.

  4. Audit command configuration: Regularly review configured external commands and their arguments.
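The "use absolute paths" and "audit command configuration" practices can be partly automated. The following is a minimal sketch, not part of Tika: the class CommandPathCheck and its methods are hypothetical helpers that an administrator might run over configured command strings before deployment. It assumes a POSIX-style file system.

```java
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not a Tika class): sanity checks for a configured
// external command path before it is ever handed to a process launcher.
final class CommandPathCheck {
    private CommandPathCheck() {}

    // True only if the command is an absolute path with no "." or ".."
    // segments, so it cannot be resolved through PATH or redirected
    // by relative path tricks.
    static boolean isAbsoluteAndNormalized(String command) {
        if (command == null || command.trim().isEmpty()) {
            return false;
        }
        try {
            Path p = Paths.get(command);
            return p.isAbsolute() && p.equals(p.normalize());
        } catch (InvalidPathException e) {
            return false;
        }
    }

    // Additionally requires that the path points at an existing,
    // executable regular file on this machine.
    static boolean isUsableCommand(String command) {
        if (!isAbsoluteAndNormalized(command)) {
            return false;
        }
        Path p = Paths.get(command);
        return Files.isRegularFile(p) && Files.isExecutable(p);
    }
}
```

A check like this only catches configuration mistakes. It does not replace restricting who may edit the configuration in the first place.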

ExternalParser-Specific Risks

  • checkCommandLine runs at type-query time: If configured, the check command executes the first time getSupportedTypes() is called, not at parse time. Merely querying which parsers are available can therefore trigger process execution.

  • stderr information leakage: External programs often write file paths, system usernames, version strings, and internal errors to stderr. By default, returnStderr is false to prevent this data from leaking into metadata. If you enable returnStderr, be aware that the raw stderr content will be stored in the document’s metadata and may be visible to end users.

  • Buffer limits: The maxStdOut and maxStdErr settings control how much process output is captured in memory. Set these to reasonable values for your deployment to prevent memory exhaustion from misbehaving external programs.
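To illustrate why output limits matter, here is a generic sketch of bounded capture. It is not Tika's implementation of maxStdOut or maxStdErr; the class BoundedCapture and the method readAtMost are hypothetical names. The idea is to keep at most a fixed number of bytes in memory while still draining the rest of the stream, so that the external process does not stall on a full pipe.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper (not a Tika class): capture at most maxBytes
// from a process output stream and discard everything beyond that.
final class BoundedCapture {
    private BoundedCapture() {}

    static byte[] readAtMost(InputStream in, int maxBytes) {
        ByteArrayOutputStream kept = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                int room = maxBytes - kept.size();
                if (room > 0) {
                    // Store only as much as still fits under the limit.
                    kept.write(buf, 0, Math.min(n, room));
                }
                // Keep reading after the limit is reached: the extra
                // bytes are dropped, but the writer is never blocked.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return kept.toByteArray();
    }
}
```

With a limit of 5, an 11-byte stream yields only its first 5 bytes, and memory use stays bounded no matter how much the external program writes.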

Affected Components

  • FileCommandDetector: Uses the system file command for MIME type detection

  • ExternalParser: Executes configured external programs to extract content

  • ExternalEmbedder: Uses external tools to embed content

Credential Handling

Password Storage in Memory

Tika stores some credentials as Java String objects. Because strings are immutable, they cannot be overwritten after use and remain in memory until garbage collected. For environments with strict security requirements:

  1. Use environment variables: Configure credentials via environment variables rather than configuration files where possible.

  2. Use secret managers: Integrate with HashiCorp Vault, AWS Secrets Manager, or similar services for production deployments.

  3. Enable encryption: Use the AES encryption option in HttpClientFactory for stored passwords.

  4. Minimize credential scope: Use credentials with minimum necessary privileges and rotate them regularly.
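As an illustration of the first recommendation, application code that wraps Tika might resolve a credential from the environment and fail fast when it is missing. This is a minimal sketch under stated assumptions: the class SecretSource and the variable name TIKA_FETCHER_PASSWORD are hypothetical and are not defined by Tika.

```java
import java.util.Arrays;
import java.util.Map;

// Hypothetical helper (not a Tika class): look up a secret by name and
// return it as a char[] that the caller can overwrite after use.
final class SecretSource {
    private SecretSource() {}

    static char[] requireSecret(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.isEmpty()) {
            // Report only the variable name, never the secret itself.
            throw new IllegalStateException("Missing required secret: " + name);
        }
        return value.toCharArray();
    }

    // Overwrite the caller's copy once it is no longer needed.
    static void wipe(char[] secret) {
        Arrays.fill(secret, '\0');
    }
}
```

A caller would typically write `char[] pw = SecretSource.requireSecret(System.getenv(), "TIKA_FETCHER_PASSWORD");` and call `SecretSource.wipe(pw)` in a finally block. The value returned by System.getenv() is itself a String, so wiping the array only reduces the number of copies the application holds. It does not remove the secret from the JVM entirely.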