JDBC Plugin

The JDBC plugin (tika-pipes-jdbc) provides emitter, iterator, and reporter components for relational databases. The plugin is JDBC-driver-agnostic: any database with a working JDBC driver on the plugin’s classpath should work.

| Interface | Component name | Class |
|---|---|---|
| Emitter | jdbc-emitter | JDBCEmitter |
| Iterator | jdbc-pipes-iterator | JDBCPipesIterator |
| Reporter | jdbc-reporter | JDBCPipesReporter |

JDBC Drivers

The plugin does not bundle drivers. Drop the JDBC driver JAR for your database into the plugin’s lib/ directory alongside tika-pipes-jdbc.jar so the plugin class loader can find it. Tested drivers include H2, PostgreSQL, MySQL, SQLite, and SQL Server.

JDBC Emitter (jdbc-emitter)

Writes parsed documents into a relational table. The emitter uses a prepared statement built from the insert template; the emit key is always the first bound parameter, followed by one parameter per entry in keys.

{
  "emitters": {
    "jdbce": {
      "jdbc-emitter": {
        "connection": "jdbc:h2:mem:tika;DB_CLOSE_DELAY=-1",
        "createTable": "create table parsed_docs (path varchar(512) primary key, title varchar(1024), author varchar(512), content_length bigint, modified timestamp)",
        "insert": "insert into parsed_docs (path, title, author, content_length, modified) values (?,?,?,?,?)",
        "keys": {
          "dc:title": "string",
          "dc:creator": "string",
          "Content-Length": "long",
          "dcterms:modified": "timestamp"
        },
        "maxRetries": 0,
        "maxStringLength": 64000,
        "attachmentStrategy": "FIRST_ONLY",
        "multivaluedFieldStrategy": "CONCATENATE",
        "multivaluedFieldDelimiter": ", "
      }
    }
  }
}
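
Because the emit key takes the first placeholder and each entry in keys takes one more, a configuration is only consistent when insert contains exactly one more ? than keys has entries. The Python sketch below is a standalone sanity check, not part of the plugin: the function name check_emitter_config is hypothetical, and it counts ? characters naively, so a ? inside a quoted SQL literal would be miscounted.

```python
import json

def check_emitter_config(cfg):
    """Return the bind order (emit key first) or raise if the placeholder count is off."""
    # Python dicts keep the key order of the parsed JSON object.
    bind_order = ["<emit key>"] + list(cfg["keys"])
    placeholders = cfg["insert"].count("?")  # naive count, ignores SQL quoting
    if placeholders != len(bind_order):
        raise ValueError(
            f"insert has {placeholders} placeholders, expected {len(bind_order)}"
        )
    return bind_order

config_text = """
{
  "insert": "insert into parsed_docs (path, title, author, content_length, modified) values (?,?,?,?,?)",
  "keys": {
    "dc:title": "string",
    "dc:creator": "string",
    "Content-Length": "long",
    "dcterms:modified": "timestamp"
  }
}
"""
bind_order = check_emitter_config(json.loads(config_text))
for position, name in enumerate(bind_order, start=1):
    print(position, name)
```

Running a check like this before starting a pipeline surfaces a mismatched insert template early instead of as a failed insert at emit time.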

Configuration

| Field | Default | Description |
|---|---|---|
| connection | required | JDBC connection URL; must be non-blank. Example: jdbc:postgresql://db.example.com:5432/tika. |
| insert | required | Prepared-statement INSERT template. Must use ? placeholders. The first placeholder receives the emit key; subsequent placeholders receive values from keys in order. |
| createTable | optional | DDL executed once at startup. Use this to create the destination table if it does not already exist. |
| alterTable | optional | DDL executed once at startup, after createTable. Use for indexes or migrations. |
| postConnection | optional | SQL executed every time a new connection is opened (e.g., pragma statements for SQLite). |
| maxRetries | 0 | Number of times to retry a failed insert before giving up. |
| maxStringLength | 64000 | String values longer than this are truncated. Set to -1 to disable. |
| keys | required | Ordered map of metadata field name to SQL type. Types: string, int, long, bigint, boolean, timestamp. Order matters: it must match the order of the ? placeholders in insert that follow the emit key. |
| attachmentStrategy | FIRST_ONLY | How embedded documents are written. One of: FIRST_ONLY (only the parent document is inserted; attachments are dropped) or ALL (every document, parent and attachments, gets its own row). |
| multivaluedFieldStrategy | CONCATENATE | How multi-valued metadata fields are handled. One of: FIRST_ONLY (keep only the first value) or CONCATENATE (join values with multivaluedFieldDelimiter). |
| multivaluedFieldDelimiter | ", " | Separator used by CONCATENATE. |
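
The interaction of attachmentStrategy, multivaluedFieldStrategy, and maxStringLength can be pictured with a small Python sketch. It only mirrors the behavior described in the table and is not the plugin's Java code. The function names are hypothetical, and two points are assumptions: that the parent document is the first entry in the list of parsed documents, and that truncation is applied after values are joined.

```python
def rows_to_insert(documents, attachment_strategy="FIRST_ONLY"):
    """Pick which parsed documents become rows.

    Assumption: documents[0] is the parent; the rest are attachments.
    """
    if attachment_strategy == "FIRST_ONLY":
        return documents[:1]
    if attachment_strategy == "ALL":
        return list(documents)
    raise ValueError(f"unknown attachmentStrategy: {attachment_strategy}")

def prepare_string(values, strategy="CONCATENATE", delimiter=", ", max_len=64000):
    """Collapse a multi-valued field to one string, then truncate it."""
    if not values:
        return None
    if strategy == "FIRST_ONLY":
        text = values[0]
    elif strategy == "CONCATENATE":
        text = delimiter.join(values)
    else:
        raise ValueError(f"unknown multivaluedFieldStrategy: {strategy}")
    # Assumption: truncation happens after joining; -1 disables it.
    if max_len >= 0 and len(text) > max_len:
        text = text[:max_len]
    return text

authors = ["Ada Lovelace", "Charles Babbage"]
print(prepare_string(authors))                         # Ada Lovelace, Charles Babbage
print(prepare_string(authors, strategy="FIRST_ONLY"))  # Ada Lovelace
print(prepare_string(authors, max_len=8))              # Ada Love
print(prepare_string(authors, max_len=-1))             # Ada Lovelace, Charles Babbage

docs = [{"path": "report.docx"}, {"path": "report.docx/image1.png"}]
print(len(rows_to_insert(docs)))         # 1
print(len(rows_to_insert(docs, "ALL")))  # 2
```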

JDBC Iterator (jdbc-pipes-iterator)

Walks rows returned by a SELECT statement, emitting one FetchEmitTuple per row.

{
  "pipes-iterator": {
    "jdbc-pipes-iterator": {
      "connection": "jdbc:h2:mem:tika;DB_CLOSE_DELAY=-1",
      "select": "select id, source_path, output_path from docs_to_parse where status = 'PENDING'",
      "idColumn": "id",
      "fetchKeyColumn": "source_path",
      "emitKeyColumn": "output_path",
      "fetchSize": 1000,
      "queryTimeoutSeconds": 60,
      "fetcherId": "fsf",
      "emitterId": "jdbce"
    }
  }
}
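
The row-to-tuple mapping in this configuration can be sketched in Python against an in-memory SQLite database. This is an illustration of the described behavior only: FetchEmitTupleSketch is a hypothetical stand-in for Tika's FetchEmitTuple, its field names are made up for the example, and the table contents are sample data.

```python
import sqlite3
from collections import namedtuple

# Hypothetical stand-in for FetchEmitTuple; field names are illustrative.
FetchEmitTupleSketch = namedtuple(
    "FetchEmitTupleSketch", "id fetcher_id fetch_key emitter_id emit_key"
)

def iterate_rows(conn, cfg):
    """Yield one tuple per row returned by the configured select."""
    cursor = conn.execute(cfg["select"])
    columns = [col[0] for col in cursor.description]
    for row in cursor:
        record = dict(zip(columns, row))
        yield FetchEmitTupleSketch(
            id=str(record[cfg["idColumn"]]),
            fetcher_id=cfg["fetcherId"],
            fetch_key=record[cfg["fetchKeyColumn"]],
            emitter_id=cfg["emitterId"],
            emit_key=record[cfg["emitKeyColumn"]],
        )

cfg = {
    "select": "select id, source_path, output_path from docs_to_parse where status = 'PENDING'",
    "idColumn": "id",
    "fetchKeyColumn": "source_path",
    "emitKeyColumn": "output_path",
    "fetcherId": "fsf",
    "emitterId": "jdbce",
}

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table docs_to_parse "
    "(id integer primary key, source_path text, output_path text, status text)"
)
conn.executemany(
    "insert into docs_to_parse values (?,?,?,?)",
    [
        (1, "a.pdf", "a.json", "PENDING"),
        (2, "b.pdf", "b.json", "DONE"),
        (3, "c.docx", "c.json", "PENDING"),
    ],
)

tuples = list(iterate_rows(conn, cfg))
for t in tuples:
    print(t)
```

Only the two PENDING rows produce tuples, and every tuple carries the same fetcherId and emitterId from the configuration.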

Configuration

| Field | Default | Description |
|---|---|---|
| connection | required | JDBC connection URL. |
| select | required | SELECT statement whose result rows are iterated. |
| idColumn | optional | Column whose value becomes the iterator’s row identifier. |
| fetchKeyColumn | optional | Column whose value becomes the fetch key on each emitted tuple. |
| emitKeyColumn | optional | Column whose value becomes the emit key on each emitted tuple. |
| fetchKeyRangeStartColumn / fetchKeyRangeEndColumn | optional | Columns for range-based fetch keys (advanced). |
| fetchSize | -1 | JDBC fetchSize hint. -1 lets the driver choose. |
| queryTimeoutSeconds | -1 | JDBC statement timeout in seconds. -1 means no timeout. |
| fetcherId / emitterId | required | IDs of the fetcher and emitter to bind to each emitted tuple. |

See Pipes Iterators for the shared iterator contract.

JDBC Reporter (jdbc-reporter)

Writes per-document processing status to a SQL table. Records are buffered in memory and flushed periodically.

{
  "pipes-reporters": {
    "jdbc-reporter": {
      "connectionString": "jdbc:h2:mem:tika;DB_CLOSE_DELAY=-1",
      "includes": ["PARSE_SUCCESS", "PARSE_EXCEPTION", "OOM", "TIMEOUT"],
      "tableName": "tika_reporter_status",
      "createTable": false,
      "reportWithinMs": 5000,
      "cacheSize": 500
    }
  }
}

pipes-reporters accepts multiple reporters keyed by type name — see Pipes Reporters for how multiple reporters compose.

Configuration

| Field | Default | Description |
|---|---|---|
| connectionString | required | JDBC connection URL for the status database. |
| includes | empty (all reported) | Set of RESULT_STATUS names to include (e.g., PARSE_SUCCESS, PARSE_EXCEPTION). |
| excludes | empty | Set of RESULT_STATUS names to skip. Applied after includes. |
| tableName | tika_status | Status table name. |
| createTable | true | If true, drop the existing status table (if any) and recreate it on startup. Set to false to preserve an existing table. |
| reportSql | no default | Custom prepared-statement template for inserting/updating status rows. If unset, the reporter uses insert into <tableName> (id, status, timestamp) values (?,?,?). Coordinate with reportVariables when overriding. |
| postConnectionSql | no default | SQL executed each time a connection is opened (e.g., SQLite pragmas). |
| reportVariables | empty | Names of the variables to bind to each ? placeholder in reportSql, in order. Available names: id, status, timestamp. Only needed when overriding reportSql. |
| reportWithinMs | 10000 | Milliseconds between batched flushes from the in-memory cache to the database. |
| cacheSize | 100 | Maximum in-memory cache size before a flush is forced. |
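
How includes, excludes, cacheSize, and reportWithinMs fit together can be sketched in Python. The class below is a hypothetical model of the behavior the table describes, not the reporter's actual implementation. In particular, it checks the time limit only when a new record arrives, whereas the real reporter may flush on its own schedule; that simplification is an assumption of the sketch.

```python
import time

class BufferedReporterSketch:
    """Hypothetical model: filter statuses, buffer them, flush in batches."""

    def __init__(self, flush_fn, includes=(), excludes=(),
                 cache_size=100, report_within_ms=10000):
        self.flush_fn = flush_fn
        self.includes = set(includes)
        self.excludes = set(excludes)
        self.cache_size = cache_size
        self.report_within_s = report_within_ms / 1000.0
        self.buffer = []
        self.last_flush = time.monotonic()

    def accepts(self, status):
        # An empty includes set means every status is reported.
        if self.includes and status not in self.includes:
            return False
        # excludes is applied after includes.
        return status not in self.excludes

    def report(self, doc_id, status):
        if self.accepts(status):
            self.buffer.append((doc_id, status))
        full = len(self.buffer) >= self.cache_size
        overdue = time.monotonic() - self.last_flush >= self.report_within_s
        if full or overdue:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        self.flush_fn(list(self.buffer))  # stands in for a batched SQL insert
        self.buffer.clear()
        self.last_flush = time.monotonic()

batches = []
reporter = BufferedReporterSketch(
    batches.append, includes=["PARSE_SUCCESS", "OOM"], cache_size=2
)
events = [("a", "PARSE_SUCCESS"), ("b", "TIMEOUT"), ("c", "OOM"), ("d", "PARSE_SUCCESS")]
for doc_id, status in events:
    reporter.report(doc_id, status)
reporter.flush()  # flush whatever is still buffered at shutdown
print(batches)
```

A larger cacheSize means fewer, bigger writes to the status table, at the cost of more unreported records held in memory if the process dies before a flush.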

Complete Pipeline Example

The example below combines a JDBC iterator (reading work items from one table), a filesystem fetcher (reading the actual document bytes), a JDBC emitter (writing parsed metadata to a results table), and a JDBC reporter (recording per-document outcomes).

{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "/data/input",
        "extractFileSystemMetadata": false
      }
    }
  },
  "emitters": {
    "jdbce": {
      "jdbc-emitter": {
        "connection": "jdbc:h2:mem:tika;DB_CLOSE_DELAY=-1",
        "createTable": "create table parsed_docs (path varchar(512) primary key, title varchar(1024), author varchar(512), content_length bigint, modified timestamp)",
        "insert": "insert into parsed_docs (path, title, author, content_length, modified) values (?,?,?,?,?)",
        "keys": {
          "dc:title": "string",
          "dc:creator": "string",
          "Content-Length": "long",
          "dcterms:modified": "timestamp"
        },
        "attachmentStrategy": "FIRST_ONLY",
        "multivaluedFieldStrategy": "CONCATENATE"
      }
    }
  },
  "pipes-iterator": {
    "jdbc-pipes-iterator": {
      "connection": "jdbc:h2:mem:tika;DB_CLOSE_DELAY=-1",
      "select": "select id, source_path, output_path from docs_to_parse where status = 'PENDING'",
      "idColumn": "id",
      "fetchKeyColumn": "source_path",
      "emitKeyColumn": "output_path",
      "fetcherId": "fsf",
      "emitterId": "jdbce"
    }
  },
  "pipes-reporters": {
    "jdbc-reporter": {
      "connectionString": "jdbc:h2:mem:tika;DB_CLOSE_DELAY=-1",
      "includes": ["PARSE_SUCCESS", "PARSE_EXCEPTION", "OOM", "TIMEOUT"]
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4
  }
}

Notes

  • H2 in-memory mode (jdbc:h2:mem:…) is convenient for testing because it needs no setup, but the schema and all data are lost when the process exits.

  • The emitter’s keys map preserves insertion order (it’s a LinkedHashMap in Java). When writing the JSON, list the keys in the same order as the ? placeholders in insert.

  • For high-throughput inserts, set maxRetries to a small positive number so transient connection failures don’t drop documents.

  • Bind variables are typed by the SQL type declared in keys, not by the metadata value’s Java type. Mismatches between SQL type and column type cause inserts to fail — coordinate createTable with keys.
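
The last note can be made concrete with a short Python sketch of type-driven binding: the declared type in keys decides how a metadata string is converted, regardless of what the value looks like. This is an illustration rather than the emitter's code. The coerce function is hypothetical, and parsing timestamps as ISO 8601 is an assumption about the input format.

```python
from datetime import datetime

def coerce(value, sql_type):
    """Convert a metadata string according to the type declared in keys."""
    if value is None:
        return None
    if sql_type == "string":
        return str(value)
    if sql_type in ("int", "long", "bigint"):
        return int(value)  # raises ValueError for non-numeric text
    if sql_type == "boolean":
        return str(value).strip().lower() == "true"
    if sql_type == "timestamp":
        # Assumption: ISO 8601 input; a trailing Z is rewritten as a UTC offset.
        return datetime.fromisoformat(str(value).replace("Z", "+00:00"))
    raise ValueError(f"unsupported type: {sql_type}")

keys = {
    "dc:title": "string",
    "dc:creator": "string",
    "Content-Length": "long",
    "dcterms:modified": "timestamp",
}
metadata = {
    "dc:title": "Quarterly report",
    "dc:creator": "Ada Lovelace",
    "Content-Length": "1024",
    "dcterms:modified": "2024-01-15T10:30:00Z",
}

# Bind values in the order of keys, which matches the ? placeholders after the emit key.
bound = [coerce(metadata.get(name), sql_type) for name, sql_type in keys.items()]
print(bound)
```

A value that cannot be converted to its declared type, such as "unknown" declared as long, raises an error in this sketch; in the plugin the analogous mismatch surfaces as a failed insert.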