Security

Table of Contents

Security Model
Known Vulnerabilities
External Command Security
Credential Handling
- Password Storage in Memory

This page covers security considerations when using Apache Tika.

Security Model

Apache Tika’s security model describes the trust boundaries and assumptions that govern how Tika processes content. Understanding this model is essential for deploying Tika securely.

For the full description, see:

Apache Tika Security Model

Known Vulnerabilities

For information about known security vulnerabilities (CVEs) in Apache Tika and their remediation, please see:

Apache Tika Security Vulnerabilities

External Command Security

Some Apache Tika components, such as FileCommandDetector and ExternalParser, can be configured to run external system commands.

External command configuration should only be performed by trusted administrators. Never allow untrusted users to configure command paths or arguments.

Security Best Practices

Restrict configuration access: Only allow administrators to modify Tika configuration files that specify external commands.
Use absolute paths: Always configure external commands with absolute paths to prevent PATH manipulation attacks.
Sandbox execution: Consider running Tika in a container or sandbox environment to limit the impact of any command execution vulnerabilities.
Audit command configuration: Regularly review configured external commands and their arguments.
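
The absolute-path recommendation can be enforced with a small check before a command is written into configuration. The sketch below is illustrative only and is not part of Tika: the class CommandPathCheck and its method requireAbsoluteExecutable are hypothetical names, and the check relies solely on the standard java.nio.file API.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not a Tika class.
final class CommandPathCheck {
    private CommandPathCheck() {}

    /**
     * Returns the normalized path if the command is given as an absolute
     * path to an executable regular file; otherwise throws.
     */
    static Path requireAbsoluteExecutable(String command) {
        Path path = Paths.get(command);
        // A bare name such as "file" would be resolved through PATH, so reject it.
        if (!path.isAbsolute()) {
            throw new IllegalArgumentException(
                    "Command must be an absolute path: " + command);
        }
        // Collapse "." and ".." segments; symbolic links are not resolved here.
        Path normalized = path.normalize();
        if (!Files.isRegularFile(normalized) || !Files.isExecutable(normalized)) {
            throw new IllegalArgumentException(
                    "Not an executable file: " + normalized);
        }
        return normalized;
    }
}
```

With this sketch, requireAbsoluteExecutable("file") is rejected because the name would be looked up on PATH, while an absolute path such as /usr/bin/file is accepted only if that file exists and is executable on the host.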

ExternalParser-Specific Risks

checkCommandLine runs at type-query time: If configured, the check command executes the first time getSupportedTypes() is called, not at parse time. Merely querying which parsers are available can therefore trigger process execution.
stderr information leakage: External programs often write file paths, system usernames, version strings, and internal errors to stderr. By default, returnStderr is false to prevent this data from leaking into metadata. If you enable returnStderr, be aware that the raw stderr content will be stored in the document’s metadata and may be visible to end users.
Buffer limits: The maxStdOut and maxStdErr settings control how much process output is captured in memory. Set these to reasonable values for your deployment to prevent memory exhaustion from misbehaving external programs.
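
The idea behind the buffer limits can be illustrated independently of Tika. The sketch below is not Tika's implementation; it assumes a hypothetical helper, BoundedCapture.readAtMost, that keeps at most maxBytes of a stream in memory and discards the rest, while still reading to the end so that a child process does not stall on a full pipe.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not a Tika class.
final class BoundedCapture {
    private BoundedCapture() {}

    /**
     * Reads the stream to the end but retains only the first maxBytes bytes.
     */
    static byte[] readAtMost(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream kept = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            int room = maxBytes - kept.size();
            if (room > 0) {
                // Store only as much of this chunk as still fits under the cap.
                kept.write(buf, 0, Math.min(n, room));
            }
            // Once the cap is reached, later chunks are read and dropped.
        }
        return kept.toByteArray();
    }
}
```

In a real deployment the stream would come from Process.getInputStream() or Process.getErrorStream(); the cap bounds memory use regardless of how much the external program writes.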

Affected Components

FileCommandDetector: Uses the system file command for MIME type detection
ExternalParser: Executes configured external programs to extract content
ExternalEmbedder: Uses external tools to embed content

Credential Handling

Password Storage in Memory

Tika stores some credentials as Java String objects. Strings are immutable and cannot be cleared by the application, so these values remain in memory until they are garbage collected. For environments with strict security requirements:

Use environment variables: Configure credentials via environment variables rather than configuration files where possible.
Use secret managers: Integrate with HashiCorp Vault, AWS Secrets Manager, or similar services for production deployments.
Enable encryption: Use the AES encryption option in HttpClientFactory for stored passwords.
Minimize credential scope: Use credentials with minimum necessary privileges and rotate them regularly.
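
For application code that supplies credentials to Tika, the environment-variable approach can be combined with a mutable buffer that is cleared after use. The sketch below is a minimal illustration, not a Tika API: the class EnvCredential and the variable name TIKA_HTTP_PASSWORD are hypothetical. The environment value is still held by the JVM as a String, so this only limits additional long-lived copies in application code.

```java
import java.util.Arrays;
import java.util.Map;

// Hypothetical helper, not a Tika class.
final class EnvCredential {
    private EnvCredential() {}

    /**
     * Copies the named variable from the given environment map into a
     * char[] that the caller can clear when finished.
     */
    static char[] readSecret(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.isEmpty()) {
            throw new IllegalStateException("Missing environment variable: " + name);
        }
        return value.toCharArray();
    }

    /** Overwrites the buffer so the secret no longer sits in this array. */
    static void wipe(char[] secret) {
        Arrays.fill(secret, '\0');
    }
}
```

A caller might obtain the secret with EnvCredential.readSecret(System.getenv(), "TIKA_HTTP_PASSWORD"), use it inside a try block, and call EnvCredential.wipe in the matching finally block. Passing the environment as a map keeps the helper easy to test.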