Apache Solr Plugin

Table of Contents

Connection Modes
Solr Emitter (solr-emitter)
- Configuration
Solr Iterator (solr-pipes-iterator)
- Configuration
Complete Pipeline Example
Notes

The Apache Solr plugin (tika-pipes-solr) provides two components: an emitter, which writes parsed documents to a Solr collection, and an iterator, which enumerates documents already in Solr for re-processing.

Interface	Component name	Class
Emitter	`solr-emitter`	`SolrEmitter`
Iterator	`solr-pipes-iterator`	`SolrPipesIterator`

Connection Modes

Both components support two ways of locating a Solr cluster — pick exactly one:

- Direct URLs (solrUrls): list one or more node URLs. Use this for standalone Solr, or for SolrCloud when you want to bypass ZooKeeper for routing.
- ZooKeeper (solrZkHosts + solrZkChroot): list the ZooKeeper ensemble; the client discovers Solr nodes through ZooKeeper. Use this for SolrCloud deployments.

The emitter’s validate() enforces the XOR: setting neither or both raises TikaConfigException.
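
The exactly-one rule can be sketched as a small pre-flight check to run on a config before handing it to Tika. This is a Python stand-in, not Tika's own validate(): the function name `check_connection_mode` is hypothetical, ValueError substitutes for TikaConfigException, and treating an empty list as "unset" is an assumption.

```python
def check_connection_mode(cfg: dict) -> str:
    """Return "urls" or "zookeeper"; raise ValueError if neither or both are set.

    `cfg` is the settings object of a solr-emitter or solr-pipes-iterator.
    Assumption: an empty or missing list counts as "not set".
    """
    has_urls = bool(cfg.get("solrUrls"))
    has_zk = bool(cfg.get("solrZkHosts"))
    if has_urls == has_zk:
        # Both True (both set) or both False (neither set) violates the XOR.
        raise ValueError("set exactly one of solrUrls or solrZkHosts")
    return "urls" if has_urls else "zookeeper"
```

Calling it with only `solrUrls` set returns `"urls"`; calling it with an empty config, or with both keys set, raises ValueError.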

Solr Emitter (`solr-emitter`)

Writes parsed documents to a Solr collection.

{
  "emitters": {
    "solre": {
      "solr-emitter": {
        "solrCollection": "tika-docs",
        "solrUrls": ["http://solr1.example.com:8983/solr", "http://solr2.example.com:8983/solr"],
        "idField": "id",
        "commitWithin": 1000,
        "attachmentStrategy": "PARENT_CHILD",
        "updateStrategy": "ADD",
        "embeddedFileFieldName": "embedded",
        "connectionTimeoutMillis": 10000,
        "socketTimeoutMillis": 60000
      }
    }
  }
}

For SolrCloud with ZooKeeper-based routing, use solrZkHosts (and optionally solrZkChroot) instead of solrUrls:

{
  "emitters": {
    "solre": {
      "solr-emitter": {
        "solrCollection": "tika-docs",
        "solrZkHosts": ["zk1.example.com:2181", "zk2.example.com:2181", "zk3.example.com:2181"],
        "solrZkChroot": "/solr",
        "idField": "id",
        "commitWithin": 1000,
        "attachmentStrategy": "PARENT_CHILD",
        "updateStrategy": "ADD"
      }
    }
  }
}

Configuration

Field	Default	Description
`solrCollection`	required	Solr collection (or core) name (validated non-blank).
`solrUrls`	required (XOR)	List of node URLs, e.g., `["http://solr1.example.com:8983/solr"]`. Mutually exclusive with `solrZkHosts`.
`solrZkHosts`	required (XOR)	List of ZooKeeper hosts, e.g., `["zk1.example.com:2181"]`. Mutually exclusive with `solrUrls`.
`solrZkChroot`	optional	ZooKeeper chroot, when using `solrZkHosts`.
`idField`	`id`	Name of the field in the emitted document that holds the document's unique ID in Solr.
`commitWithin`	`1000`	Solr `commitWithin` value, in milliseconds.
`connectionTimeoutMillis`	`10000`	HTTP connect timeout.
`socketTimeoutMillis`	`60000`	HTTP socket read timeout.
`attachmentStrategy`	`PARENT_CHILD`	How attached/embedded documents are indexed. One of `SEPARATE_DOCUMENTS` (each attachment becomes its own top-level document) or `PARENT_CHILD` (attachments are nested under the parent).
`updateStrategy`	`ADD`	How existing documents are handled. One of `ADD` (replaces any existing document with the same ID), `UPDATE_MUST_EXIST` (fails if no document with that ID exists), or `UPDATE_MUST_NOT_EXIST` (fails if a document with that ID already exists).
`embeddedFileFieldName`	`embedded`	Field name used to hold embedded-file content (used by `PARENT_CHILD`).
`userName` / `password` / `authScheme`	optional	HTTP basic auth credentials.
`proxyHost` / `proxyPort`	optional	Host and port of an outbound HTTP proxy.
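
To make the `attachmentStrategy` and `updateStrategy` options more concrete, here is a rough Python sketch of how one parsed file and its attachments might be shaped into Solr documents. It is illustrative only and not taken from the plugin's source. Two assumptions are made: that the update strategies map onto Solr's optimistic-concurrency `_version_` markers (1 = document must exist, a negative value = must not exist), and that child IDs are the parent key plus a random suffix. The function name `build_solr_docs` is hypothetical.

```python
import uuid

# Solr optimistic-concurrency markers (assumed mapping for the three strategies).
VERSION_CONSTRAINT = {"ADD": None, "UPDATE_MUST_EXIST": 1, "UPDATE_MUST_NOT_EXIST": -1}


def build_solr_docs(emit_key, metadata_list, *, id_field="id",
                    attachment_strategy="PARENT_CHILD", update_strategy="ADD",
                    embedded_field="embedded"):
    """Shape a parsed container (first item) and its attachments (the rest)."""
    parent = {**metadata_list[0], id_field: emit_key}
    version = VERSION_CONSTRAINT[update_strategy]
    if version is not None:
        parent["_version_"] = version
    # Hypothetical child IDs: parent key plus a random suffix.
    children = [{**m, id_field: f"{emit_key}-{uuid.uuid4()}"} for m in metadata_list[1:]]
    if attachment_strategy == "PARENT_CHILD":
        if children:
            parent[embedded_field] = children  # nested under the parent
        return [parent]
    if attachment_strategy == "SEPARATE_DOCUMENTS":
        return [parent, *children]  # every attachment is a top-level document
    raise ValueError(f"unknown attachmentStrategy: {attachment_strategy}")
```

With the defaults, a file with one attachment yields a single document whose `embedded` field holds the child; with `SEPARATE_DOCUMENTS` the same input yields two top-level documents.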

Solr Iterator (`solr-pipes-iterator`)

Enumerates documents already in a Solr collection and emits one FetchEmitTuple per matching document. Useful for re-parsing existing documents — e.g., after a parser bug fix or a Tika upgrade.

{
  "pipes-iterator": {
    "solr-pipes-iterator": {
      "solrCollection": "tika-docs",
      "solrUrls": ["http://solr1.example.com:8983/solr"],
      "filters": ["status:NEEDS_REPARSE"],
      "idField": "id",
      "rows": 5000,
      "connectionTimeoutMillis": 10000,
      "socketTimeoutMillis": 60000,
      "fetcherId": "fsf",
      "emitterId": "solre"
    }
  }
}

Configuration

Field	Default	Description
`solrCollection`	required	Solr collection to iterate.
`solrUrls` / `solrZkHosts` / `solrZkChroot`	required (XOR)	Connection mode — see Connection Modes.
`filters`	empty	List of Solr filter queries to scope the iteration (e.g., `["status:NEEDS_REPARSE"]`).
`idField`	no default	Solr field used as the iterator’s row identifier.
`parsingIdField` / `failCountField` / `sizeFieldName` / `additionalFields`	optional	Extra Solr fields surfaced into the `FetchEmitTuple` metadata. Advanced; usually unset.
`rows`	`5000`	Page size for the underlying Solr query.
`connectionTimeoutMillis`	`10000`	HTTP connect timeout.
`socketTimeoutMillis`	`60000`	HTTP socket read timeout.
`userName` / `password` / `authScheme` / `proxyHost` / `proxyPort`	optional	Same as the emitter.
`fetcherId` / `emitterId`	required	IDs of the fetcher and emitter to bind to each emitted tuple. See Pipes Iterators for the shared iterator contract.
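
As a rough illustration of how `filters`, `idField`, and `rows` could translate into a Solr request, the Python sketch below builds the query string for one page. The mapping is an assumption, not the plugin's documented behavior: it supposes a match-all query, one `fq` per filter, and Solr's cursor-based paging (`cursorMark`, which requires a sort on the unique key). The function name `build_page_params` is hypothetical.

```python
from urllib.parse import urlencode


def build_page_params(filters, id_field, rows=5000, cursor_mark="*"):
    """Build the query string for one page of a filtered, cursor-paged scan."""
    params = [("q", "*:*")]                      # match everything, then narrow with fq
    params += [("fq", f) for f in filters]       # one fq parameter per filter query
    params += [
        ("fl", id_field),                        # only the identifier is needed per row
        ("rows", str(rows)),                     # page size
        ("sort", f"{id_field} asc"),             # cursor paging needs a unique-key sort
        ("cursorMark", cursor_mark),             # "*" starts a new scan
    ]
    return urlencode(params)
```

A caller would append the result to the collection's `/select` endpoint and pass the `nextCursorMark` from each response back in as `cursor_mark` until it stops changing.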

Complete Pipeline Example

The example below combines a filesystem iterator/fetcher with the Solr emitter — the common pattern for ingesting a directory of documents into Solr.

{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "/data/input",
        "extractFileSystemMetadata": false
      }
    }
  },
  "emitters": {
    "solre": {
      "solr-emitter": {
        "solrCollection": "tika-docs",
        "solrUrls": ["http://solr1.example.com:8983/solr"],
        "idField": "id",
        "commitWithin": 1000,
        "attachmentStrategy": "PARENT_CHILD",
        "updateStrategy": "ADD"
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "/data/input",
      "countTotal": true,
      "fetcherId": "fsf",
      "emitterId": "solre"
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4
  }
}
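
Because `fetcherId` and `emitterId` must match keys defined under `fetchers` and `emitters`, a quick consistency check can catch typos before a run. The Python sketch below assumes the config layout shown above; `check_iterator_bindings` is a hypothetical helper, not part of Tika.

```python
def check_iterator_bindings(config: dict) -> list[str]:
    """Return a list of problems; an empty list means both IDs resolve."""
    problems = []
    fetchers = config.get("fetchers", {})
    emitters = config.get("emitters", {})
    for name, settings in config.get("pipes-iterator", {}).items():
        fetcher_id = settings.get("fetcherId")
        emitter_id = settings.get("emitterId")
        if fetcher_id not in fetchers:
            problems.append(f"{name}: fetcherId {fetcher_id!r} is not defined under 'fetchers'")
        if emitter_id not in emitters:
            problems.append(f"{name}: emitterId {emitter_id!r} is not defined under 'emitters'")
    return problems
```

Loading the example above with `json.load` and passing it to this function should return an empty list, since `fsf` and `solre` are both defined.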

Notes

- The Solr plugin uses SolrJ (solr-solrj). HTTP/2 transport is used when available.
- For re-parsing workflows, point a solr-pipes-iterator at the same collection a solr-emitter writes to, but use UPDATE_MUST_EXIST on the emitter to avoid creating phantom rows.
- commitWithin is a soft guarantee: Solr may delay commits under load. For strict ordering, configure auto-commits on the Solr side and leave commitWithin at its default.
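
The re-parsing setup described in the notes can be sketched as a small Python helper that generates the emitter and iterator sections together, so both always point at the same collection. It uses only keys documented on this page; the helper name `build_reparse_config` is hypothetical, and the fetcher ID `fsf` is assumed to be defined elsewhere in the same config.

```python
import json


def build_reparse_config(collection, solr_urls, filters):
    """Pair a solr-pipes-iterator with a solr-emitter on the same collection."""
    return {
        "emitters": {"solre": {"solr-emitter": {
            "solrCollection": collection,
            "solrUrls": solr_urls,
            "idField": "id",
            # Only touch documents that already exist, so no new rows appear.
            "updateStrategy": "UPDATE_MUST_EXIST",
        }}},
        "pipes-iterator": {"solr-pipes-iterator": {
            "solrCollection": collection,
            "solrUrls": solr_urls,
            "filters": filters,
            "idField": "id",
            "fetcherId": "fsf",   # assumed to be defined under "fetchers"
            "emitterId": "solre",
        }},
    }


if __name__ == "__main__":
    cfg = build_reparse_config(
        "tika-docs", ["http://solr1.example.com:8983/solr"], ["status:NEEDS_REPARSE"]
    )
    print(json.dumps(cfg, indent=2))
```

The output still needs the `fetchers` and `pipes` sections merged in before it is a complete pipeline config.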
