JDBC Plugin
The JDBC plugin (tika-pipes-jdbc) provides emitter, iterator, and reporter interfaces for relational databases. The plugin is JDBC-driver-agnostic: any database with a working JDBC driver on the plugin’s classpath should work.
| Interface | Component name | Class |
|---|---|---|
| Emitter | `jdbc-emitter` | |
| Iterator | `jdbc-pipes-iterator` | |
| Reporter | `jdbc-reporter` | |
JDBC Drivers
The plugin does not bundle drivers. Drop the JDBC driver JAR for your database into the plugin’s lib/ directory alongside tika-pipes-jdbc.jar so the plugin class loader can find it. Tested drivers include H2, PostgreSQL, MySQL, SQLite, and SQL Server.
JDBC Emitter (jdbc-emitter)
Writes parsed documents into a relational table. The emitter uses a prepared statement built from the insert template; the emit key is always the first bound parameter, followed by one parameter per entry in keys.
{
  "emitters": {
    "jdbce": {
      "jdbc-emitter": {
        "connection": "jdbc:h2:mem:tika;DB_CLOSE_DELAY=-1",
        "createTable": "create table parsed_docs (path varchar(512) primary key, title varchar(1024), author varchar(512), content_length bigint, modified timestamp)",
        "insert": "insert into parsed_docs (path, title, author, content_length, modified) values (?,?,?,?,?)",
        "keys": {
          "dc:title": "string",
          "dc:creator": "string",
          "Content-Length": "long",
          "dcterms:modified": "timestamp"
        },
        "maxRetries": 0,
        "maxStringLength": 64000,
        "attachmentStrategy": "FIRST_ONLY",
        "multivaluedFieldStrategy": "CONCATENATE",
        "multivaluedFieldDelimiter": ", "
      }
    }
  }
}
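The binding order described above (emit key first, then one value per `keys` entry) can be illustrated outside Tika. The following is a minimal Python sketch that uses the standard-library `sqlite3` module in place of a JDBC driver. `bind_params` is a hypothetical helper, not part of the plugin, and the sample metadata values and the emit key are invented for illustration. It assumes metadata values arrive as strings and converts only the `long` type.

```python
import sqlite3

# Same shape as the "keys" map in the emitter config; dicts keep insertion order.
KEYS = {
    "dc:title": "string",
    "dc:creator": "string",
    "Content-Length": "long",
    "dcterms:modified": "timestamp",
}

INSERT = (
    "insert into parsed_docs (path, title, author, content_length, modified) "
    "values (?,?,?,?,?)"
)


def bind_params(emit_key, metadata, keys):
    """Hypothetical helper: emit key first, then one value per entry in keys."""
    params = [emit_key]
    for name, sql_type in keys.items():
        value = metadata.get(name)  # missing fields bind as NULL
        if value is not None and sql_type == "long":
            value = int(value)  # assumption: metadata values arrive as strings
        params.append(value)
    return params


conn = sqlite3.connect(":memory:")
conn.execute(
    "create table parsed_docs (path varchar(512) primary key, title varchar(1024), "
    "author varchar(512), content_length bigint, modified timestamp)"
)

# Invented sample metadata for one parsed document.
metadata = {
    "dc:title": "Quarterly report",
    "dc:creator": "A. Author",
    "Content-Length": "20480",
    "dcterms:modified": "2024-01-15T10:30:00Z",
}
params = bind_params("out/report.json", metadata, KEYS)
conn.execute(INSERT, params)

row = conn.execute("select path, title, content_length from parsed_docs").fetchone()
print(row)  # ('out/report.json', 'Quarterly report', 20480)
```

Because the values are bound by position, reordering the entries in `KEYS` without reordering the columns in `INSERT` would silently write values into the wrong columns or fail on a type mismatch.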
Configuration
| Field | Default | Description |
|---|---|---|
| `connection` | required | JDBC connection URL (validated non-blank). Example: `jdbc:h2:mem:tika;DB_CLOSE_DELAY=-1` |
| `insert` | required | Prepared-statement insert template. |
| `createTable` | optional | DDL executed once at startup. Use this to create the destination table if it does not already exist. |
| | optional | DDL executed once at startup, after `createTable`. |
| | optional | SQL executed every time a new connection is opened (e.g., pragma statements for SQLite). |
| `maxRetries` | | Number of times to retry a failed insert before giving up. |
| `maxStringLength` | | String columns longer than this are truncated. Set to |
| `keys` | required | Ordered map of metadata-field-name → SQL-type. Types: |
| `attachmentStrategy` | | How embedded documents are written. One of: |
| `multivaluedFieldStrategy` | | How multi-valued metadata fields are handled. One of: |
| `multivaluedFieldDelimiter` | | Separator used by `CONCATENATE`. |
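To make the interaction between the multi-valued delimiter and string truncation concrete, here is a small Python sketch. `flatten_value` is a hypothetical function, not the plugin's code. It assumes values are joined first and truncated afterwards, and that the limit counts characters rather than bytes; the source does not state either detail.

```python
def flatten_value(values, delimiter=", ", max_length=64000):
    """Hypothetical sketch: join multi-valued metadata, then truncate.

    Assumptions (not confirmed by the plugin docs): the join happens before
    truncation, and max_length is measured in characters.
    """
    joined = delimiter.join(values)
    return joined[:max_length]


print(flatten_value(["Alice", "Bob", "Carol"]))                # Alice, Bob, Carol
print(flatten_value(["Alice", "Bob", "Carol"], max_length=8))  # Alice, B
```

The second call shows why a low limit can cut a concatenated value in the middle of an entry, which is worth keeping in mind when sizing the destination column.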
JDBC Iterator (jdbc-pipes-iterator)
Walks rows returned by a SELECT statement, emitting one FetchEmitTuple per row.
{
  "pipes-iterator": {
    "jdbc-pipes-iterator": {
      "connection": "jdbc:h2:mem:tika;DB_CLOSE_DELAY=-1",
      "select": "select id, source_path, output_path from docs_to_parse where status = 'PENDING'",
      "idColumn": "id",
      "fetchKeyColumn": "source_path",
      "emitKeyColumn": "output_path",
      "fetchSize": 1000,
      "queryTimeoutSeconds": 60,
      "fetcherId": "fsf",
      "emitterId": "jdbce"
    }
  }
}
Configuration
| Field | Default | Description |
|---|---|---|
| `connection` | required | JDBC connection URL. |
| `select` | required | SELECT statement to enumerate. |
| `idColumn` | optional | Column whose value becomes the iterator’s row identifier. |
| `fetchKeyColumn` | optional | Column whose value becomes the fetch key on each emitted tuple. |
| `emitKeyColumn` | optional | Column whose value becomes the emit key on each emitted tuple. |
| | optional | Columns for range-based fetch keys (advanced). |
| `fetchSize` | | JDBC fetch size. |
| `queryTimeoutSeconds` | | JDBC statement timeout, in seconds. |
| `fetcherId`, `emitterId` | required | IDs of the fetcher and emitter to bind to each emitted tuple. |

See Pipes Iterators for the shared iterator contract.
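The row-to-tuple mapping can be sketched in a few lines of Python, again with `sqlite3` standing in for JDBC. `FetchEmitTupleSketch` is a hypothetical stand-in for Tika's `FetchEmitTuple` that carries only the fields discussed here, and `iterate` is an illustrative function rather than the plugin's implementation. The table contents are invented sample data.

```python
import sqlite3
from collections import namedtuple

# Hypothetical stand-in for Tika's FetchEmitTuple; only the fields used here.
FetchEmitTupleSketch = namedtuple(
    "FetchEmitTupleSketch",
    ["id", "fetcher_id", "fetch_key", "emitter_id", "emit_key"],
)


def iterate(conn, select, id_col, fetch_col, emit_col,
            fetcher_id, emitter_id, fetch_size=1000):
    """Yield one tuple per row, reading rows in batches of fetch_size."""
    cur = conn.execute(select)
    names = [d[0] for d in cur.description]  # column names from the result set
    while True:
        rows = cur.fetchmany(fetch_size)
        if not rows:
            break
        for row in rows:
            rec = dict(zip(names, row))
            yield FetchEmitTupleSketch(
                str(rec[id_col]), fetcher_id, rec[fetch_col],
                emitter_id, rec[emit_col],
            )


conn = sqlite3.connect(":memory:")
conn.execute(
    "create table docs_to_parse "
    "(id integer primary key, source_path text, output_path text, status text)"
)
conn.executemany(
    "insert into docs_to_parse values (?,?,?,?)",
    [
        (1, "a.pdf", "a.json", "PENDING"),
        (2, "b.docx", "b.json", "DONE"),
        (3, "c.html", "c.json", "PENDING"),
    ],
)

tuples = list(iterate(
    conn,
    "select id, source_path, output_path from docs_to_parse where status = 'PENDING'",
    id_col="id", fetch_col="source_path", emit_col="output_path",
    fetcher_id="fsf", emitter_id="jdbce",
))
for t in tuples:
    print(t)
```

Only the two `PENDING` rows produce tuples, and each tuple carries the same `fsf` and `jdbce` IDs taken from the iterator configuration.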
JDBC Reporter (jdbc-reporter)
Writes per-document processing status to a SQL table. Records are buffered in memory and flushed periodically.
{
  "pipes-reporters": {
    "jdbc-reporter": {
      "connectionString": "jdbc:h2:mem:tika;DB_CLOSE_DELAY=-1",
      "includes": ["PARSE_SUCCESS", "PARSE_EXCEPTION", "OOM", "TIMEOUT"],
      "tableName": "tika_reporter_status",
      "createTable": false,
      "reportWithinMs": 5000,
      "cacheSize": 500
    }
  }
}
`pipes-reporters` accepts multiple reporters keyed by type name. See Pipes Reporters for how multiple reporters compose.
Configuration
| Field | Default | Description |
|---|---|---|
| `connectionString` | required | JDBC connection URL for the status database. |
| `includes` | empty (all reported) | Set of |
| | empty | Set of |
| `tableName` | | Status table name. |
| `createTable` | | If |
| | no default | Custom prepared-statement template for inserting/updating status rows. If unset, the reporter uses |
| | no default | SQL executed each time a connection is opened (e.g., SQLite pragmas). |
| | empty | Names of the variables to bind to each |
| `reportWithinMs` | | Milliseconds between batched flushes from the in-memory cache to the database. |
| `cacheSize` | | Maximum in-memory cache size before a flush is forced. |
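The buffering behaviour controlled by `cacheSize`, `reportWithinMs`, and `includes` can be sketched as follows. `BufferedReporter` is a hypothetical Python class, not the plugin's implementation. The `(id, status)` column layout is an assumption, and the elapsed-time check runs only when a new record arrives, whereas a real reporter would more likely flush from a timer.

```python
import sqlite3
import time


class BufferedReporter:
    """Hypothetical sketch of a size- and time-triggered status buffer."""

    def __init__(self, conn, table="tika_reporter_status",
                 cache_size=500, report_within_ms=5000, includes=()):
        self.conn = conn
        self.table = table  # taken from trusted config, never from user input
        self.cache_size = cache_size
        self.report_within_s = report_within_ms / 1000.0
        self.includes = set(includes)  # empty set means "report everything"
        self.buffer = []
        self.last_flush = time.monotonic()

    def report(self, doc_id, status):
        if self.includes and status not in self.includes:
            return  # filtered out by the includes set
        self.buffer.append((doc_id, status))
        overdue = time.monotonic() - self.last_flush >= self.report_within_s
        if len(self.buffer) >= self.cache_size or overdue:
            self.flush()

    def flush(self):
        if self.buffer:
            # Assumed two-column layout; the real table schema may differ.
            self.conn.executemany(
                f"insert into {self.table} (id, status) values (?, ?)",
                self.buffer,
            )
            self.conn.commit()
            self.buffer.clear()
        self.last_flush = time.monotonic()


conn = sqlite3.connect(":memory:")
conn.execute("create table tika_reporter_status (id text, status text)")

reporter = BufferedReporter(
    conn, cache_size=2, includes=["PARSE_SUCCESS", "TIMEOUT", "OOM"]
)
reporter.report("a.pdf", "PARSE_SUCCESS")   # buffered
reporter.report("b.pdf", "EMIT_SUCCESS")    # dropped: not in includes
reporter.report("c.pdf", "TIMEOUT")         # buffer reaches cache_size, flushes
reporter.flush()                            # final flush at shutdown

count = conn.execute("select count(*) from tika_reporter_status").fetchone()[0]
print(count)  # 2
```

The explicit `flush()` at the end matters: any records still in memory when the process stops are lost unless they are written out first.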
Complete Pipeline Example
The example below combines a JDBC iterator (reading work items from one table), a filesystem fetcher (reading the actual document bytes), a JDBC emitter (writing parsed metadata to a results table), and a JDBC reporter (recording per-document outcomes).
{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "/data/input",
        "extractFileSystemMetadata": false
      }
    }
  },
  "emitters": {
    "jdbce": {
      "jdbc-emitter": {
        "connection": "jdbc:h2:mem:tika;DB_CLOSE_DELAY=-1",
        "createTable": "create table parsed_docs (path varchar(512) primary key, title varchar(1024), author varchar(512), content_length bigint, modified timestamp)",
        "insert": "insert into parsed_docs (path, title, author, content_length, modified) values (?,?,?,?,?)",
        "keys": {
          "dc:title": "string",
          "dc:creator": "string",
          "Content-Length": "long",
          "dcterms:modified": "timestamp"
        },
        "attachmentStrategy": "FIRST_ONLY",
        "multivaluedFieldStrategy": "CONCATENATE"
      }
    }
  },
  "pipes-iterator": {
    "jdbc-pipes-iterator": {
      "connection": "jdbc:h2:mem:tika;DB_CLOSE_DELAY=-1",
      "select": "select id, source_path, output_path from docs_to_parse where status = 'PENDING'",
      "idColumn": "id",
      "fetchKeyColumn": "source_path",
      "emitKeyColumn": "output_path",
      "fetcherId": "fsf",
      "emitterId": "jdbce"
    }
  },
  "pipes-reporters": {
    "jdbc-reporter": {
      "connectionString": "jdbc:h2:mem:tika;DB_CLOSE_DELAY=-1",
      "includes": ["PARSE_SUCCESS", "PARSE_EXCEPTION", "OOM", "TIMEOUT"]
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4
  }
}
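In a combined configuration like this one, the iterator's `fetcherId` and `emitterId` must match the keys under `fetchers` and `emitters` (`fsf` and `jdbce` here). The Python sketch below checks that cross-reference on a trimmed copy of the config. `find_dangling_ids` is a hypothetical pre-flight helper, not something Tika provides.

```python
import json

# Trimmed copy of the pipeline config; only the parts that carry IDs.
CONFIG = json.loads("""
{
  "fetchers": {"fsf": {"file-system-fetcher": {"basePath": "/data/input"}}},
  "emitters": {"jdbce": {"jdbc-emitter": {"connection": "jdbc:h2:mem:tika;DB_CLOSE_DELAY=-1"}}},
  "pipes-iterator": {
    "jdbc-pipes-iterator": {"fetcherId": "fsf", "emitterId": "jdbce"}
  }
}
""")


def find_dangling_ids(config):
    """Return messages for iterator IDs that name no configured component."""
    problems = []
    fetchers = config.get("fetchers", {})
    emitters = config.get("emitters", {})
    for name, settings in config.get("pipes-iterator", {}).items():
        if settings.get("fetcherId") not in fetchers:
            problems.append(f"{name}: unknown fetcherId {settings.get('fetcherId')!r}")
        if settings.get("emitterId") not in emitters:
            problems.append(f"{name}: unknown emitterId {settings.get('emitterId')!r}")
    return problems


print(find_dangling_ids(CONFIG))  # []
```

Running a check like this before starting a long job catches a mistyped ID early instead of at the first fetch or emit.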
Notes
- H2 (`jdbc:h2:mem:…`) is convenient for testing because it needs no setup, but the schema is lost when the process exits.
- The emitter’s `keys` map preserves insertion order (it’s a `LinkedHashMap` in Java). When writing the JSON, list the keys in the same order as the `?` placeholders in `insert` that follow the emit key.
- For high-throughput inserts, set `maxRetries` to a small positive number so transient connection failures don’t drop documents.
- Bind variables are typed by the SQL type declared in `keys`, not by the metadata value’s Java type. Mismatches between SQL type and column type cause inserts to fail, so coordinate `createTable` with `keys`.
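Since the emit key takes the first placeholder and each `keys` entry takes one more, a quick arithmetic check can catch a mismatch between `insert` and `keys` before any document is processed. The Python sketch below is a hypothetical helper, not part of the plugin. It counts `?` characters naively, so it assumes the template contains no `?` inside SQL string literals.

```python
def check_placeholders(insert_sql, keys):
    """Return (expected, actual) bind-parameter counts for an insert template."""
    expected = 1 + len(keys)        # emit key plus one parameter per keys entry
    actual = insert_sql.count("?")  # naive: assumes no "?" inside SQL literals
    return expected, actual


keys = {
    "dc:title": "string",
    "dc:creator": "string",
    "Content-Length": "long",
    "dcterms:modified": "timestamp",
}
insert = (
    "insert into parsed_docs (path, title, author, content_length, modified) "
    "values (?,?,?,?,?)"
)

expected, actual = check_placeholders(insert, keys)
print(expected, actual)  # 5 5
```

A count mismatch means the statement will fail at bind time; a matching count still leaves ordering and type agreement to be checked by hand against `createTable`.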