Apache Solr Plugin

The Apache Solr plugin (tika-pipes-solr) provides an emitter, which writes parsed documents to a Solr collection, and an iterator, which enumerates documents already in Solr for re-processing.

| Interface | Component name | Class |
| --- | --- | --- |
| Emitter | solr-emitter | SolrEmitter |
| Iterator | solr-pipes-iterator | SolrPipesIterator |

Connection Modes

Both components support two ways of locating a Solr cluster. Pick exactly one:

  • Direct URLs (solrUrls): list one or more node URLs. Use this for standalone Solr, or for SolrCloud when you want to bypass ZooKeeper for routing.

  • ZooKeeper (solrZkHosts, plus an optional solrZkChroot): list the ZooKeeper ensemble, and the SolrJ client discovers the Solr nodes through it. Use this for SolrCloud deployments.

The emitter’s validate() enforces the XOR: setting neither or both raises TikaConfigException.
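
As an illustration of the rule, the fragment below is a sketch of an emitter configuration that should fail validation, because it sets both solrUrls and solrZkHosts. The emitter ID and host names are placeholders reused from the other examples on this page.

```json
{
  "emitters": {
    "solre": {
      "solr-emitter": {
        "solrCollection": "tika-docs",
        "solrUrls": ["http://solr1.example.com:8983/solr"],
        "solrZkHosts": ["zk1.example.com:2181"]
      }
    }
  }
}
```

Removing either solrUrls or solrZkHosts should leave a single connection mode and satisfy the check.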

Solr Emitter (solr-emitter)

Writes parsed documents to a Solr collection.

{
  "emitters": {
    "solre": {
      "solr-emitter": {
        "solrCollection": "tika-docs",
        "solrUrls": ["http://solr1.example.com:8983/solr", "http://solr2.example.com:8983/solr"],
        "idField": "id",
        "commitWithin": 1000,
        "attachmentStrategy": "PARENT_CHILD",
        "updateStrategy": "ADD",
        "embeddedFileFieldName": "embedded",
        "connectionTimeoutMillis": 10000,
        "socketTimeoutMillis": 60000
      }
    }
  }
}

For SolrCloud with ZooKeeper-based routing, use solrZkHosts (and optionally solrZkChroot) instead of solrUrls:

{
  "emitters": {
    "solre": {
      "solr-emitter": {
        "solrCollection": "tika-docs",
        "solrZkHosts": ["zk1.example.com:2181", "zk2.example.com:2181", "zk3.example.com:2181"],
        "solrZkChroot": "/solr",
        "idField": "id",
        "commitWithin": 1000,
        "attachmentStrategy": "PARENT_CHILD",
        "updateStrategy": "ADD"
      }
    }
  }
}

Configuration

| Field | Default | Description |
| --- | --- | --- |
| solrCollection | required | Name of the Solr collection (or core). Must not be blank. |
| solrUrls | required (XOR) | List of node URLs, e.g., ["http://solr1.example.com:8983/solr"]. Mutually exclusive with solrZkHosts. |
| solrZkHosts | required (XOR) | List of ZooKeeper hosts, e.g., ["zk1.example.com:2181"]. Mutually exclusive with solrUrls. |
| solrZkChroot | optional | ZooKeeper chroot. Applies only when solrZkHosts is set. |
| idField | id | Name of the Solr field that holds each document's unique key. |
| commitWithin | 1000 | Solr commitWithin value, in milliseconds. |
| connectionTimeoutMillis | 10000 | HTTP connect timeout, in milliseconds. |
| socketTimeoutMillis | 60000 | HTTP socket read timeout, in milliseconds. |
| attachmentStrategy | PARENT_CHILD | How attached/embedded documents are indexed. SEPARATE_DOCUMENTS: each attachment becomes its own top-level document. PARENT_CHILD: attachments are nested under the parent. |
| updateStrategy | ADD | How existing documents are handled. ADD: replaces any existing document with the same ID. UPDATE_MUST_EXIST: fails if no document with that ID exists. UPDATE_MUST_NOT_EXIST: fails if a document with that ID already exists. |
| embeddedFileFieldName | embedded | Field name used to hold embedded-file content (used by PARENT_CHILD). |
| userName / password / authScheme | optional | HTTP basic auth credentials. |
| proxyHost / proxyPort | optional | Outbound HTTP proxy. |
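
To show how the two strategy fields combine, here is a sketch of an emitter that indexes each attachment as its own top-level document and refuses to overwrite documents that already exist. It uses only fields and values listed in the table; the collection name and URL are the same placeholders used elsewhere on this page.

```json
{
  "emitters": {
    "solre": {
      "solr-emitter": {
        "solrCollection": "tika-docs",
        "solrUrls": ["http://solr1.example.com:8983/solr"],
        "idField": "id",
        "attachmentStrategy": "SEPARATE_DOCUMENTS",
        "updateStrategy": "UPDATE_MUST_NOT_EXIST"
      }
    }
  }
}
```

embeddedFileFieldName is left out here because the table describes it as used by PARENT_CHILD.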

Solr Iterator (solr-pipes-iterator)

Enumerates documents already in a Solr collection and emits one FetchEmitTuple per matching document. Useful for re-parsing existing documents — e.g., after a parser bug fix or a Tika upgrade.

{
  "pipes-iterator": {
    "solr-pipes-iterator": {
      "solrCollection": "tika-docs",
      "solrUrls": ["http://solr1.example.com:8983/solr"],
      "filters": ["status:NEEDS_REPARSE"],
      "idField": "id",
      "rows": 5000,
      "connectionTimeoutMillis": 10000,
      "socketTimeoutMillis": 60000,
      "fetcherId": "fsf",
      "emitterId": "solre"
    }
  }
}

Configuration

| Field | Default | Description |
| --- | --- | --- |
| solrCollection | required | Solr collection to iterate. |
| solrUrls / solrZkHosts / solrZkChroot | required (XOR) | Connection mode; see Connection Modes. |
| filters | empty | List of Solr filter queries to scope the iteration (e.g., ["status:NEEDS_REPARSE"]). |
| idField | no default | Solr field used as the iterator's row identifier. |
| parsingIdField / failCountField / sizeFieldName / additionalFields | optional | Extra Solr fields surfaced into the FetchEmitTuple metadata. Advanced; usually unset. |
| rows | 5000 | Page size for the underlying Solr query. |
| connectionTimeoutMillis | 10000 | HTTP connect timeout. |
| socketTimeoutMillis | 60000 | HTTP socket read timeout. |
| userName / password / authScheme / proxyHost / proxyPort | optional | Same as the emitter. |
| fetcherId / emitterId | required | IDs of the fetcher and emitter to bind to each emitted tuple. See Pipes Iterators for the shared iterator contract. |
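
The following sketch shows the iterator in ZooKeeper mode with more than one filter query and a smaller page size. The status field comes from the example above; the source field and its value are hypothetical and stand in for whatever your schema defines. Solr normally intersects multiple filter queries, so only documents matching both would be expected to be enumerated.

```json
{
  "pipes-iterator": {
    "solr-pipes-iterator": {
      "solrCollection": "tika-docs",
      "solrZkHosts": ["zk1.example.com:2181", "zk2.example.com:2181", "zk3.example.com:2181"],
      "solrZkChroot": "/solr",
      "filters": ["status:NEEDS_REPARSE", "source:archive"],
      "idField": "id",
      "rows": 1000,
      "fetcherId": "fsf",
      "emitterId": "solre"
    }
  }
}
```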

Complete Pipeline Example

The example below combines a filesystem iterator/fetcher with the Solr emitter — the common pattern for ingesting a directory of documents into Solr.

{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "/data/input",
        "extractFileSystemMetadata": false
      }
    }
  },
  "emitters": {
    "solre": {
      "solr-emitter": {
        "solrCollection": "tika-docs",
        "solrUrls": ["http://solr1.example.com:8983/solr"],
        "idField": "id",
        "commitWithin": 1000,
        "attachmentStrategy": "PARENT_CHILD",
        "updateStrategy": "ADD"
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "/data/input",
      "countTotal": true,
      "fetcherId": "fsf",
      "emitterId": "solre"
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4
  }
}

Notes

  • The Solr plugin uses SolrJ (solr-solrj). HTTP/2 transport is used when available.

  • For re-parsing workflows, point a solr-pipes-iterator at the same collection a solr-emitter writes to, and set updateStrategy to UPDATE_MUST_EXIST on the emitter so that an update for an ID that is no longer in the collection fails instead of creating a new document.

  • commitWithin is a soft guarantee: Solr may delay commits under load. If you need tighter control over when commits happen, configure auto-commits on the Solr side and leave commitWithin at its default.
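
The re-parsing workflow mentioned in the notes above could be configured roughly as follows. This is a sketch, not a tested pipeline: it assumes that the value stored in the id field of each Solr document is a key the fsf fetcher can resolve (for example, a path relative to basePath), and it reuses the placeholder collection, URL, and component IDs from the earlier examples.

```json
{
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "/data/input"
      }
    }
  },
  "emitters": {
    "solre": {
      "solr-emitter": {
        "solrCollection": "tika-docs",
        "solrUrls": ["http://solr1.example.com:8983/solr"],
        "idField": "id",
        "updateStrategy": "UPDATE_MUST_EXIST"
      }
    }
  },
  "pipes-iterator": {
    "solr-pipes-iterator": {
      "solrCollection": "tika-docs",
      "solrUrls": ["http://solr1.example.com:8983/solr"],
      "filters": ["status:NEEDS_REPARSE"],
      "idField": "id",
      "fetcherId": "fsf",
      "emitterId": "solre"
    }
  }
}
```

The iterator and the emitter point at the same collection, and the filter limits the run to documents flagged for re-parsing.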