Configuring Digesters
Tika can compute cryptographic digests (hashes) of documents during parsing. This is useful for document deduplication, integrity verification, and forensic analysis.
Overview
Digesters compute hash values of document content and store them in metadata. The digest value
is stored with a key like X-TIKA:digest:SHA256 (for HEX encoding) or X-TIKA:digest:SHA256:BASE32
(for non-default encodings).
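The key convention described above can be sketched as a small helper. This is an illustration only, not Tika's own code: the class `DigestKeys` and the method `metadataKey` are hypothetical names, and the only assumption is the pattern stated above (the default HEX encoding is left out of the key, any other encoding is appended).

```java
// Hypothetical helper, not part of Tika: rebuilds the metadata key pattern described above.
final class DigestKeys {
    private static final String PREFIX = "X-TIKA:digest:";

    private DigestKeys() {}

    /** Returns the metadata key for an algorithm name and an encoding name. */
    static String metadataKey(String algorithm, String encoding) {
        // HEX is the default encoding, so it does not appear in the key.
        if (encoding == null || encoding.equals("HEX")) {
            return PREFIX + algorithm;
        }
        return PREFIX + algorithm + ":" + encoding;
    }

    public static void main(String[] args) {
        System.out.println(metadataKey("SHA256", "HEX"));    // X-TIKA:digest:SHA256
        System.out.println(metadataKey("SHA256", "BASE32")); // X-TIKA:digest:SHA256:BASE32
    }
}
```

A helper like this keeps key strings in one place when the same code has to read digests written with different encodings.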
Tika provides two digester implementations:
- CommonsDigesterFactory - Uses Apache Commons Codec. Supports MD2, MD5, SHA1, SHA256, SHA384, SHA512.
- BouncyCastleDigesterFactory - Uses BouncyCastle provider. Supports all Commons algorithms plus SHA3-256, SHA3-384, SHA3-512.
JSON Configuration
Configure digesters in the parse-context section of your tika-config.json.
Basic Example with CommonsDigester
This example configures multiple digest algorithms:
{
  "parse-context": {
    "commons-digester-factory": {
      "digests": [
        { "algorithm": "MD5" },
        { "algorithm": "SHA256" },
        { "algorithm": "SHA512" }
      ]
    }
  }
}
Using BouncyCastle for SHA3 Algorithms
For SHA3 algorithms, use the BouncyCastle digester:
{
  "parse-context": {
    "bouncy-castle-digester-factory": {
      "digests": [
        { "algorithm": "MD5" },
        { "algorithm": "SHA256" },
        { "algorithm": "SHA3_512" }
      ]
    }
  }
}
Custom Encoding
By default, digest values are encoded as lowercase hexadecimal. You can specify BASE32 or BASE64 encoding:
{
  "parse-context": {
    "commons-digester-factory": {
      "digests": [
        { "algorithm": "SHA256", "encoding": "BASE32" },
        { "algorithm": "MD5" }
      ]
    }
  }
}
Non-default encodings include the encoding in the metadata key: X-TIKA:digest:SHA256:BASE32.
Skip Container Document Digest
When processing documents with embedded content (e.g., a ZIP file with PDFs inside), you may
want to digest only the embedded documents, not the container. Set skipContainerDocumentDigest
to true:
{
  "parse-context": {
    "commons-digester-factory": {
      "digests": [
        { "algorithm": "MD5" }
      ],
      "skipContainerDocumentDigest": true
    }
  }
}
Supported Algorithms
| Algorithm | CommonsDigester | BouncyCastleDigester |
|---|---|---|
| MD2 | Yes | Yes |
| MD5 | Yes | Yes |
| SHA1 | Yes | Yes |
| SHA256 | Yes | Yes |
| SHA384 | Yes | Yes |
| SHA512 | Yes | Yes |
| SHA3_256 | No | Yes |
| SHA3_384 | No | Yes |
| SHA3_512 | No | Yes |
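The table above can be turned into a small lookup that picks the configuration key to use for a given set of algorithms. This is a sketch under stated assumptions: `DigesterChoice` and `factoryFor` are hypothetical names, the algorithm names are the ones listed in the table, and the returned strings are the JSON keys used in the configuration examples earlier on this page.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Tika: encodes the support table as a lookup.
final class DigesterChoice {
    // Algorithms that both digesters support, per the table.
    static final Set<String> COMMONS = new HashSet<>(
            Arrays.asList("MD2", "MD5", "SHA1", "SHA256", "SHA384", "SHA512"));
    // Algorithms that only the BouncyCastle digester supports, per the table.
    static final Set<String> BOUNCY_CASTLE_ONLY = new HashSet<>(
            Arrays.asList("SHA3_256", "SHA3_384", "SHA3_512"));

    private DigesterChoice() {}

    /** Returns the config key of a digester factory that covers all requested algorithms. */
    static String factoryFor(Collection<String> algorithms) {
        boolean needsBouncyCastle = false;
        for (String algorithm : algorithms) {
            if (BOUNCY_CASTLE_ONLY.contains(algorithm)) {
                needsBouncyCastle = true;
            } else if (!COMMONS.contains(algorithm)) {
                throw new IllegalArgumentException("Unsupported algorithm: " + algorithm);
            }
        }
        return needsBouncyCastle ? "bouncy-castle-digester-factory" : "commons-digester-factory";
    }

    public static void main(String[] args) {
        System.out.println(factoryFor(Arrays.asList("MD5", "SHA256")));
        System.out.println(factoryFor(Arrays.asList("MD5", "SHA3_512")));
    }
}
```

The sketch prefers the Commons digester whenever it is sufficient; that preference is a design choice of this example, not a Tika recommendation.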
Supported Encodings
- HEX (default) - Lowercase hexadecimal
- BASE32 - RFC 4648 Base32
- BASE64 - RFC 4648 Base64
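To make the difference between the three encodings concrete, the following sketch encodes the same SHA-256 digest bytes in each form using only the JDK. It is not Tika's implementation: `DigestEncodings` and its methods are hypothetical names, and the Base32 encoder is a hand-written RFC 4648 encoder with `=` padding. Whether Tika's own output includes padding is not stated here, so treat that detail as an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical illustration class, not part of Tika.
final class DigestEncodings {
    private static final String BASE32_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";

    private DigestEncodings() {}

    /** Computes the raw SHA-256 digest bytes with the JDK provider. */
    static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Lowercase hexadecimal, two characters per byte. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    /** RFC 4648 Base32 with '=' padding, five bits per character. */
    static String base32(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        int buffer = 0; // pending bits live in the low end of this value
        int bits = 0;   // number of pending bits
        for (byte b : bytes) {
            buffer = (buffer << 8) | (b & 0xFF);
            bits += 8;
            while (bits >= 5) {
                sb.append(BASE32_ALPHABET.charAt((buffer >> (bits - 5)) & 0x1F));
                bits -= 5;
            }
        }
        if (bits > 0) {
            // Left-align the remaining bits into a final 5-bit group.
            sb.append(BASE32_ALPHABET.charAt((buffer << (5 - bits)) & 0x1F));
        }
        while (sb.length() % 8 != 0) {
            sb.append('=');
        }
        return sb.toString();
    }

    /** RFC 4648 Base64 with padding, via the JDK encoder. */
    static String base64(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        byte[] digest = sha256("abc".getBytes(StandardCharsets.UTF_8));
        System.out.println("HEX:    " + hex(digest));
        System.out.println("BASE32: " + base32(digest));
        System.out.println("BASE64: " + base64(digest));
    }
}
```

All three strings represent the same 32 bytes; only the textual form differs, which is why the encoding has to be part of the metadata key for non-default encodings.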
Programmatic Configuration
You can also configure digesters programmatically via ParseContext:
// See: CommonsDigesterFactory.java
// Compute SHA256 (default HEX encoding) and MD5 (BASE32 encoding).
CommonsDigesterFactory factory = new CommonsDigesterFactory();
factory.setDigests(Arrays.asList(
        new DigestDef(DigestDef.Algorithm.SHA256),
        new DigestDef(DigestDef.Algorithm.MD5, DigestDef.Encoding.BASE32)
));
// Digest only embedded documents, not the container.
factory.setSkipContainerDocumentDigest(true);

// Register the factory in the ParseContext.
ParseContext context = new ParseContext();
context.set(DigesterFactory.class, factory);

// Use with AutoDetectParser; inputStream, handler, and metadata are supplied by the caller.
AutoDetectParser parser = new AutoDetectParser();
parser.parse(inputStream, handler, metadata, context);
See CommonsDigesterFactory.java and BouncyCastleDigesterFactory.java for implementation details.
Command Line Usage
When using the Tika CLI (tika-app), you can enable digesting with the --digest flag:
java -jar tika-app.jar --digest=SHA256 document.pdf
This computes a SHA256 digest of the document. The digest value appears in the metadata output.
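One way to sanity-check the value reported by the CLI is to compute the digest independently with the JDK and compare. The sketch below assumes the reported value is the hex-encoded SHA-256 of the file's raw bytes; `DigestCheck`, `sha256Hex`, and `matches` are hypothetical names and not part of Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical verification helper, not part of Tika.
final class DigestCheck {
    private DigestCheck() {}

    /** Streams the input through SHA-256 and returns 64 lowercase hex characters. */
    static String sha256Hex(InputStream in) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
            // %064x left-pads with zeros so leading zero bytes are preserved.
            return String.format("%064x", new BigInteger(1, md.digest()));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compares a locally computed digest with a reported hex value, ignoring case and whitespace. */
    static boolean matches(InputStream in, String reportedHex) {
        return sha256Hex(in).equalsIgnoreCase(reportedHex.trim());
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: DigestCheck <file> <reported-hex-digest>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(matches(in, args[1]) ? "match" : "MISMATCH");
        }
    }
}
```

Streaming the file in fixed-size chunks keeps memory use constant, which matters when checking large documents.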
Related Classes
- DigesterFactory - Factory interface
- DigestDef - Algorithm and encoding definition
- Digester - Digester interface
- DigestConfigTest - Test examples