JDBC Plugin

Table of Contents

JDBC Drivers
JDBC Emitter (jdbc-emitter)
- Configuration
JDBC Iterator (jdbc-pipes-iterator)
- Configuration
JDBC Reporter (jdbc-reporter)
- Configuration
Complete Pipeline Example
Notes

The JDBC plugin (tika-pipes-jdbc) provides emitter, iterator, and reporter implementations for relational databases. The plugin is JDBC-driver-agnostic: any database with a working JDBC driver on the plugin’s classpath should work.

Interface	Component name	Class
Emitter	`jdbc-emitter`	`JDBCEmitter`
Iterator	`jdbc-pipes-iterator`	`JDBCPipesIterator`
Reporter	`jdbc-reporter`	`JDBCPipesReporter`

JDBC Drivers

The plugin does not bundle drivers. Drop the JDBC driver JAR for your database into the plugin’s lib/ directory alongside tika-pipes-jdbc.jar so the plugin class loader can find it. Tested drivers include H2, PostgreSQL, MySQL, SQLite, and SQL Server.

JDBC Emitter (`jdbc-emitter`)

Writes parsed documents into a relational table. The emitter uses a prepared statement built from the insert template; the emit key is always the first bound parameter, followed by one parameter per entry in keys.

{
  "emitters": {
    "jdbce": {
      "jdbc-emitter": {
        "connection": "jdbc:h2:mem:tika;DB_CLOSE_DELAY=-1",
        "createTable": "create table parsed_docs (path varchar(512) primary key, title varchar(1024), author varchar(512), content_length bigint, modified timestamp)",
        "insert": "insert into parsed_docs (path, title, author, content_length, modified) values (?,?,?,?,?)",
        "keys": {
          "dc:title": "string",
          "dc:creator": "string",
          "Content-Length": "long",
          "dcterms:modified": "timestamp"
        },
        "maxRetries": 0,
        "maxStringLength": 64000,
        "attachmentStrategy": "FIRST_ONLY",
        "multivaluedFieldStrategy": "CONCATENATE",
        "multivaluedFieldDelimiter": ", "
      }
    }
  }
}
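
The binding order described above can be sketched in plain Java. This is a minimal illustration rather than the plugin's actual code: the class `BindOrderSketch`, the method `bindValues`, and the choice to model metadata as a `Map<String, String>` are all assumptions made for the example, as is binding `null` for a missing field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of tika-pipes-jdbc: shows how values line up
// with the ? placeholders of the insert template.
class BindOrderSketch {

    // Returns values in placeholder order: the emit key first, then one value
    // per entry of keys, following the map's iteration order.
    static List<String> bindValues(String emitKey,
                                   Map<String, String> keys,
                                   Map<String, String> metadata) {
        List<String> values = new ArrayList<>();
        values.add(emitKey);                     // placeholder 1
        for (String field : keys.keySet()) {     // placeholders 2..n
            // Assumption: a field missing from the metadata binds as null.
            values.add(metadata.get(field));
        }
        return values;
    }

    // Builds an ordered keys map like the one in the JSON example.
    static Map<String, String> exampleKeys() {
        Map<String, String> keys = new LinkedHashMap<>();
        keys.put("dc:title", "string");
        keys.put("dc:creator", "string");
        return keys;
    }
}
```

With `keys` listing `dc:title` before `dc:creator`, the returned list holds the emit key, then the title, then the creator. That is why the column list in `insert` has to name the key column (`path` in the example) first.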

Configuration

Field	Default	Description
`connection`	required	JDBC connection URL (validated non-blank). Example: `jdbc:postgresql://db.example.com:5432/tika`.
`insert`	required	Prepared-statement `INSERT` template. Must use `?` placeholders. The first placeholder receives the emit key; subsequent placeholders receive values from `keys` in order.
`createTable`	optional	DDL executed once at startup. Use this to create the destination table if it does not already exist.
`alterTable`	optional	DDL executed once at startup, after `createTable`. Use for indexes or migrations.
`postConnection`	optional	SQL executed every time a new connection is opened (e.g., pragma statements for SQLite).
`maxRetries`	`0`	Number of times to retry a failed insert before giving up.
`maxStringLength`	`64000`	String values longer than this are truncated. Set to `-1` to disable truncation.
`keys`	required	Ordered map of metadata-field-name → SQL-type. Types: `string`, `int`, `long`, `bigint`, `boolean`, `timestamp`. The order matters — it must match the order of `?` placeholders in `insert`.
`attachmentStrategy`	`FIRST_ONLY`	How embedded documents are written. One of: `FIRST_ONLY` (only the parent document is inserted; attachments are dropped) or `ALL` (every document, parent and attachments, gets its own row).
`multivaluedFieldStrategy`	`CONCATENATE`	How multi-valued metadata fields are handled. One of: `FIRST_ONLY` (keep only the first value) or `CONCATENATE` (join values with `multivaluedFieldDelimiter`).
`multivaluedFieldDelimiter`	`", "`	Separator used by `CONCATENATE`.
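
The interaction between `multivaluedFieldStrategy`, `multivaluedFieldDelimiter`, and `maxStringLength` can be sketched as follows. This is an illustration under stated assumptions, not the plugin's code: `StringValueSketch` and `toColumnValue` are hypothetical names, length is counted in Java `char` units, and truncation is assumed to happen after concatenation.

```java
import java.util.List;

// Hypothetical helper, not part of tika-pipes-jdbc.
class StringValueSketch {

    enum MultivaluedFieldStrategy { FIRST_ONLY, CONCATENATE }

    // Collapses a multi-valued field to a single string, then truncates it.
    // A negative maxStringLength (the documented -1) disables truncation.
    static String toColumnValue(List<String> values,
                                MultivaluedFieldStrategy strategy,
                                String delimiter,
                                int maxStringLength) {
        if (values.isEmpty()) {
            return null; // assumption: an absent field binds as SQL NULL
        }
        String joined = strategy == MultivaluedFieldStrategy.FIRST_ONLY
                ? values.get(0)
                : String.join(delimiter, values);
        if (maxStringLength >= 0 && joined.length() > maxStringLength) {
            return joined.substring(0, maxStringLength);
        }
        return joined;
    }
}
```

Under these assumptions, a field with the values `a` and `b` becomes `a, b` with the default settings and `a` with `FIRST_ONLY`.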

JDBC Iterator (`jdbc-pipes-iterator`)

Walks rows returned by a SELECT statement, emitting one FetchEmitTuple per row.

{
  "pipes-iterator": {
    "jdbc-pipes-iterator": {
      "connection": "jdbc:h2:mem:tika;DB_CLOSE_DELAY=-1",
      "select": "select id, source_path, output_path from docs_to_parse where status = 'PENDING'",
      "idColumn": "id",
      "fetchKeyColumn": "source_path",
      "emitKeyColumn": "output_path",
      "fetchSize": 1000,
      "queryTimeoutSeconds": 60,
      "fetcherId": "fsf",
      "emitterId": "jdbce"
    }
  }
}

Configuration

Field	Default	Description
`connection`	required	JDBC connection URL.
`select`	required	SELECT statement to enumerate.
`idColumn`	optional	Column whose value becomes the iterator’s row identifier.
`fetchKeyColumn`	optional	Column whose value becomes the fetch key on each emitted tuple.
`emitKeyColumn`	optional	Column whose value becomes the emit key on each emitted tuple.
`fetchKeyRangeStartColumn` / `fetchKeyRangeEndColumn`	optional	Columns for range-based fetch keys (advanced).
`fetchSize`	`-1`	JDBC `fetchSize` hint. `-1` lets the driver choose.
`queryTimeoutSeconds`	`-1`	JDBC statement timeout. `-1` means no timeout.
`fetcherId` / `emitterId`	required	IDs of the fetcher and emitter to bind to each emitted tuple. See Pipes Iterators for the shared iterator contract.
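
To make the roles of the column settings concrete, the sketch below maps result rows to tuples. It is a simplified illustration only: `TupleSketch` is a hypothetical stand-in for `FetchEmitTuple` with just the fields discussed here, rows are modeled as `Map<String, String>` instead of a JDBC `ResultSet`, and `RowMappingSketch.toTuples` is an invented name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for FetchEmitTuple; it carries only the fields discussed here.
final class TupleSketch {
    final String id;
    final String fetcherId;
    final String fetchKey;
    final String emitterId;
    final String emitKey;

    TupleSketch(String id, String fetcherId, String fetchKey,
                String emitterId, String emitKey) {
        this.id = id;
        this.fetcherId = fetcherId;
        this.fetchKey = fetchKey;
        this.emitterId = emitterId;
        this.emitKey = emitKey;
    }
}

// Hypothetical helper, not part of tika-pipes-jdbc.
class RowMappingSketch {

    // Produces one tuple per row: the column settings pick values out of the row,
    // while fetcherId and emitterId are the same for every tuple.
    static List<TupleSketch> toTuples(List<Map<String, String>> rows,
                                      String idColumn,
                                      String fetchKeyColumn,
                                      String emitKeyColumn,
                                      String fetcherId,
                                      String emitterId) {
        List<TupleSketch> tuples = new ArrayList<>();
        for (Map<String, String> row : rows) {
            tuples.add(new TupleSketch(
                    row.get(idColumn),
                    fetcherId, row.get(fetchKeyColumn),
                    emitterId, row.get(emitKeyColumn)));
        }
        return tuples;
    }
}
```

For the example configuration, a row with `source_path` set to `in/a.pdf` and `output_path` set to `out/a.json` would be fetched through `fsf` and written through `jdbce` under the emit key `out/a.json`.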

JDBC Reporter (`jdbc-reporter`)

Writes per-document processing status to a SQL table. Records are buffered in memory and flushed periodically.

{
  "pipes-reporters": {
    "jdbc-reporter": {
      "connectionString": "jdbc:h2:mem:tika;DB_CLOSE_DELAY=-1",
      "includes": ["PARSE_SUCCESS", "PARSE_EXCEPTION", "OOM", "TIMEOUT"],
      "tableName": "tika_reporter_status",
      "createTable": false,
      "reportWithinMs": 5000,
      "cacheSize": 500
    }
  }
}

pipes-reporters accepts multiple reporters keyed by type name — see Pipes Reporters for how multiple reporters compose.

Configuration

Field	Default	Description
`connectionString`	required	JDBC connection URL for the status database.
`includes`	empty (all reported)	Set of `RESULT_STATUS` names to include (e.g., `PARSE_SUCCESS`, `PARSE_EXCEPTION`).
`excludes`	empty	Set of `RESULT_STATUS` names to skip. Applied after `includes`.
`tableName`	`tika_status`	Status table name.
`createTable`	`true`	If `true`, drop the existing status table (if any) and recreate it on startup. Set to `false` to preserve an existing table.
`reportSql`	no default	Custom prepared-statement template for inserting/updating status rows. If unset, the reporter uses `insert into <tableName> (id, status, timestamp) values (?,?,?)`. Coordinate with `reportVariables` when overriding.
`postConnectionSql`	no default	SQL executed each time a connection is opened (e.g., SQLite pragmas).
`reportVariables`	empty	Names of the variables to bind to each `?` placeholder in `reportSql`, in order. Available names: `id`, `status`, `timestamp`. Only needed when overriding `reportSql`.
`reportWithinMs`	`10000`	Milliseconds between batched flushes from the in-memory cache to the database.
`cacheSize`	`100`	Maximum in-memory cache size before a flush is forced.
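
The filtering and buffering settings can be pictured with the sketch below. It is a deliberately simplified model, not the reporter's implementation: `StatusBufferSketch` is a hypothetical class, the clock is passed in as a parameter, the flush only counts rows instead of writing to a database, and the time check runs only when a new record arrives (the real reporter may well flush on a background timer).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not part of tika-pipes-jdbc.
class StatusBufferSketch {
    private final Set<String> includes;
    private final Set<String> excludes;
    private final int cacheSize;
    private final long reportWithinMs;
    private final List<String[]> buffer = new ArrayList<>();
    private long lastFlushMs;

    int flushCount = 0;   // number of flushes so far, exposed for illustration
    int rowsWritten = 0;  // rows handed to the (simulated) database

    StatusBufferSketch(Set<String> includes, Set<String> excludes,
                       int cacheSize, long reportWithinMs, long nowMs) {
        this.includes = includes;
        this.excludes = excludes;
        this.cacheSize = cacheSize;
        this.reportWithinMs = reportWithinMs;
        this.lastFlushMs = nowMs;
    }

    // An empty includes set means "report everything"; excludes is applied afterwards.
    boolean accepts(String status) {
        return (includes.isEmpty() || includes.contains(status))
                && !excludes.contains(status);
    }

    // Buffers one record, then flushes if the cache is full or the interval has elapsed.
    void report(String id, String status, long nowMs) {
        if (!accepts(status)) {
            return;
        }
        buffer.add(new String[] {id, status, Long.toString(nowMs)});
        if (buffer.size() >= cacheSize || nowMs - lastFlushMs >= reportWithinMs) {
            flush(nowMs);
        }
    }

    // Stand-in for a batched insert into the status table.
    void flush(long nowMs) {
        rowsWritten += buffer.size();
        buffer.clear();
        flushCount++;
        lastFlushMs = nowMs;
    }
}
```

In this model a smaller `cacheSize` or `reportWithinMs` means more frequent, smaller batches, while larger values trade freshness of the status table for fewer round trips.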

Complete Pipeline Example

The example below combines a JDBC iterator (reading work items from one table), a filesystem fetcher (reading the actual document bytes), a JDBC emitter (writing parsed metadata to a results table), and a JDBC reporter (recording per-document outcomes).

{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "/data/input",
        "extractFileSystemMetadata": false
      }
    }
  },
  "emitters": {
    "jdbce": {
      "jdbc-emitter": {
        "connection": "jdbc:h2:mem:tika;DB_CLOSE_DELAY=-1",
        "createTable": "create table parsed_docs (path varchar(512) primary key, title varchar(1024), author varchar(512), content_length bigint, modified timestamp)",
        "insert": "insert into parsed_docs (path, title, author, content_length, modified) values (?,?,?,?,?)",
        "keys": {
          "dc:title": "string",
          "dc:creator": "string",
          "Content-Length": "long",
          "dcterms:modified": "timestamp"
        },
        "attachmentStrategy": "FIRST_ONLY",
        "multivaluedFieldStrategy": "CONCATENATE"
      }
    }
  },
  "pipes-iterator": {
    "jdbc-pipes-iterator": {
      "connection": "jdbc:h2:mem:tika;DB_CLOSE_DELAY=-1",
      "select": "select id, source_path, output_path from docs_to_parse where status = 'PENDING'",
      "idColumn": "id",
      "fetchKeyColumn": "source_path",
      "emitKeyColumn": "output_path",
      "fetcherId": "fsf",
      "emitterId": "jdbce"
    }
  },
  "pipes-reporters": {
    "jdbc-reporter": {
      "connectionString": "jdbc:h2:mem:tika;DB_CLOSE_DELAY=-1",
      "includes": ["PARSE_SUCCESS", "PARSE_EXCEPTION", "OOM", "TIMEOUT"]
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4
  }
}

Notes

- H2 in-memory databases (jdbc:h2:mem:…) are convenient for testing because they need no setup, but the schema and data are lost when the process exits.
- The emitter’s keys map preserves insertion order (it’s a LinkedHashMap in Java). When writing the JSON, list the keys in the same order as the ? placeholders that follow the emit key in insert.
- For high-throughput inserts, set maxRetries to a small positive number so transient connection failures don’t drop documents.
- Bind variables are typed by the SQL type declared in keys, not by the metadata value’s Java type. A mismatch between the declared type and the column type causes inserts to fail, so keep createTable and keys consistent.
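
The last note can be illustrated with a small conversion sketch. It only shows the idea that the declared type, not the value itself, decides how a metadata string is converted before binding. `TypedBindSketch` and `coerce` are hypothetical names, and the parsing rules (for example ISO-8601 UTC timestamps) are assumptions, not the plugin's documented behavior.

```java
import java.sql.Timestamp;
import java.time.Instant;

// Hypothetical helper, not part of tika-pipes-jdbc.
class TypedBindSketch {

    // Converts a metadata string to a Java value matching the type declared in keys.
    static Object coerce(String declaredType, String value) {
        if (value == null) {
            return null; // assumption: missing values bind as SQL NULL
        }
        switch (declaredType) {
            case "string":
                return value;
            case "int":
                return Integer.valueOf(value);
            case "long":
            case "bigint":
                return Long.valueOf(value);
            case "boolean":
                return Boolean.valueOf(value);
            case "timestamp":
                // Assumption: values look like 2024-01-15T10:30:00Z.
                return Timestamp.from(Instant.parse(value));
            default:
                throw new IllegalArgumentException("Unknown type: " + declaredType);
        }
    }
}
```

In this sketch, declaring `Content-Length` as `long` turns `"1024"` into a `Long`, while a value that cannot be parsed as a number raises a `NumberFormatException`. A column declared with an incompatible type in `createTable` would fail in a similar way at insert time.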
