Apache Solr Plugin
The Apache Solr plugin (tika-pipes-solr) provides an emitter, which writes parsed documents to a Solr collection, and an iterator, which enumerates documents already in Solr for re-processing.
| Interface | Component name | Class |
|---|---|---|
| Emitter | solr-emitter | |
| Iterator | solr-pipes-iterator | |
Connection Modes
Both components support two ways of locating a Solr cluster — pick exactly one:
- Direct URLs (solrUrls): list one or more node URLs. Use this for standalone Solr or for SolrCloud when you want to bypass ZooKeeper for routing.
- ZooKeeper (solrZkHosts + solrZkChroot): list the ZooKeeper ensemble; the client discovers Solr nodes through ZooKeeper. Use this for SolrCloud deployments.
The emitter’s validate() enforces the XOR: setting neither or both raises TikaConfigException.
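The exactly-one rule can be illustrated with a minimal Python sketch. This is not the plugin's Java code: the function name `check_connection_mode` is hypothetical, and `ValueError` stands in for `TikaConfigException`. It also assumes that an empty list counts as "not set".

```python
def check_connection_mode(config):
    """Return "urls" or "zookeeper"; raise unless exactly one mode is configured.

    Hypothetical helper that mirrors the XOR rule described above.
    """
    has_urls = bool(config.get("solrUrls"))
    has_zk = bool(config.get("solrZkHosts"))
    # XOR: both flags equal means either neither or both were set.
    if has_urls == has_zk:
        raise ValueError("set exactly one of solrUrls or solrZkHosts")
    return "urls" if has_urls else "zookeeper"


print(check_connection_mode({"solrUrls": ["http://solr1.example.com:8983/solr"]}))
```

Running a check like this before starting a pipeline surfaces a misconfigured connection block early, instead of at emitter initialization.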
Solr Emitter (solr-emitter)
Writes parsed documents to a Solr collection.
{
  "emitters": {
    "solre": {
      "solr-emitter": {
        "solrCollection": "tika-docs",
        "solrUrls": ["http://solr1.example.com:8983/solr", "http://solr2.example.com:8983/solr"],
        "idField": "id",
        "commitWithin": 1000,
        "attachmentStrategy": "PARENT_CHILD",
        "updateStrategy": "ADD",
        "embeddedFileFieldName": "embedded",
        "connectionTimeoutMillis": 10000,
        "socketTimeoutMillis": 60000
      }
    }
  }
}
For SolrCloud with ZooKeeper-based routing, use solrZkHosts (and optionally solrZkChroot) instead of solrUrls:
{
  "emitters": {
    "solre": {
      "solr-emitter": {
        "solrCollection": "tika-docs",
        "solrZkHosts": ["zk1.example.com:2181", "zk2.example.com:2181", "zk3.example.com:2181"],
        "solrZkChroot": "/solr",
        "idField": "id",
        "commitWithin": 1000,
        "attachmentStrategy": "PARENT_CHILD",
        "updateStrategy": "ADD"
      }
    }
  }
}
Configuration
| Field | Default | Description |
|---|---|---|
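| solrCollection | required | Solr collection (or core) name (validated non-blank). |
| solrUrls | required (XOR) | List of node URLs, e.g., http://solr1.example.com:8983/solr. |
| solrZkHosts | required (XOR) | List of ZooKeeper hosts, e.g., zk1.example.com:2181. |
| solrZkChroot | optional | ZooKeeper chroot, when using solrZkHosts. |
| idField | | Field in the emitted JSON document that holds the Solr document identifier. |
| commitWithin | | Solr commitWithin setting applied to updates. |
| connectionTimeoutMillis | | HTTP connect timeout. |
| socketTimeoutMillis | | HTTP socket read timeout. |
| attachmentStrategy | | How attached/embedded documents are indexed. The examples on this page use PARENT_CHILD. |
| updateStrategy | | How existing documents are handled. Values used on this page: ADD and UPDATE_MUST_EXIST. |
| embeddedFileFieldName | | Field name used to hold embedded-file content. |
| | optional | HTTP basic auth credentials. |
| | optional | Optional outbound HTTP proxy. |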
Solr Iterator (solr-pipes-iterator)
Enumerates documents already in a Solr collection and emits one FetchEmitTuple per matching document. Useful for re-parsing existing documents — e.g., after a parser bug fix or a Tika upgrade.
{
  "pipes-iterator": {
    "solr-pipes-iterator": {
      "solrCollection": "tika-docs",
      "solrUrls": ["http://solr1.example.com:8983/solr"],
      "filters": ["status:NEEDS_REPARSE"],
      "idField": "id",
      "rows": 5000,
      "connectionTimeoutMillis": 10000,
      "socketTimeoutMillis": 60000,
      "fetcherId": "fsf",
      "emitterId": "solre"
    }
  }
}
Configuration
| Field | Default | Description |
|---|---|---|
| solrCollection | required | Solr collection to iterate. |
| solrUrls / solrZkHosts | required (XOR) | Connection mode; see Connection Modes. |
| filters | empty | List of Solr filter queries to scope the iteration (e.g., status:NEEDS_REPARSE). |
| idField | no default | Solr field used as the iterator’s row identifier. |
| | optional | Extra Solr fields surfaced into the emitted FetchEmitTuple. |
| rows | | Page size for the underlying Solr query. |
| connectionTimeoutMillis | | HTTP connect timeout. |
| socketTimeoutMillis | | HTTP socket read timeout. |
| | optional | Same as the emitter. |
| fetcherId, emitterId | required | IDs of the fetcher and emitter to bind to each emitted tuple. See Pipes Iterators for the shared iterator contract. |
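To make the roles of filters, idField, and rows concrete, the sketch below builds the kind of request a paged scan of a collection would send to Solr's /select handler. It rests on assumptions: that each filter maps to an `fq` parameter, that paging uses Solr's `cursorMark` with a sort on the ID field, and that idField is the collection's unique key. The function name `build_select_url` is hypothetical, and the plugin itself goes through SolrJ rather than raw URLs.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, filters, id_field, rows, cursor_mark="*"):
    """Build one page request against Solr's /select handler (illustrative only)."""
    params = [("q", "*:*")]
    # Each filter becomes its own fq parameter; Solr intersects them.
    params += [("fq", f) for f in filters]
    params += [
        ("fl", id_field),
        ("rows", str(rows)),
        # cursorMark paging requires a sort that includes the unique key field.
        ("sort", f"{id_field} asc"),
        ("cursorMark", cursor_mark),
    ]
    return f"{base_url.rstrip('/')}/{collection}/select?{urlencode(params)}"


url = build_select_url(
    "http://solr1.example.com:8983/solr",
    "tika-docs",
    ["status:NEEDS_REPARSE"],
    "id",
    5000,
)
print(url)
```

With cursor-based paging, each response carries a `nextCursorMark` value that is passed as `cursor_mark` for the next page; the scan ends when the value stops changing.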
Complete Pipeline Example
The example below combines a filesystem iterator/fetcher with the Solr emitter — the common pattern for ingesting a directory of documents into Solr.
{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "/data/input",
        "extractFileSystemMetadata": false
      }
    }
  },
  "emitters": {
    "solre": {
      "solr-emitter": {
        "solrCollection": "tika-docs",
        "solrUrls": ["http://solr1.example.com:8983/solr"],
        "idField": "id",
        "commitWithin": 1000,
        "attachmentStrategy": "PARENT_CHILD",
        "updateStrategy": "ADD"
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "/data/input",
      "countTotal": true,
      "fetcherId": "fsf",
      "emitterId": "solre"
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4
  }
}
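In the pipeline above, the iterator's fetcherId and emitterId must match the keys defined under fetchers and emitters ("fsf" and "solre"). The sketch below is a pre-flight check you might run yourself over a loaded configuration. The function name `unresolved_ids` is hypothetical, and nothing here implies that Tika performs this exact check.

```python
def unresolved_ids(config):
    """Return messages for fetcherId/emitterId values that name no defined component."""
    fetchers = config.get("fetchers", {})
    emitters = config.get("emitters", {})
    problems = []
    for name, settings in config.get("pipes-iterator", {}).items():
        # Compare each reference against the keys of the matching top-level section.
        for key, defined in (("fetcherId", fetchers), ("emitterId", emitters)):
            value = settings.get(key)
            if value not in defined:
                problems.append(f"{name}: {key} {value!r} is not defined")
    return problems


config = {
    "fetchers": {"fsf": {"file-system-fetcher": {"basePath": "/data/input"}}},
    "emitters": {"solre": {"solr-emitter": {"solrCollection": "tika-docs"}}},
    "pipes-iterator": {
        "file-system-pipes-iterator": {
            "basePath": "/data/input",
            "fetcherId": "fsf",
            "emitterId": "solre",
        }
    },
}
print(unresolved_ids(config))
```

In practice the `config` dictionary would come from `json.load` on the pipeline file; an empty result means every reference resolves.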
Notes
- The Solr plugin uses SolrJ (solr-solrj). HTTP/2 transport is used when available.
- For re-parsing workflows, point a solr-pipes-iterator at the same collection a solr-emitter writes to, but use UPDATE_MUST_EXIST on the emitter to avoid creating phantom rows.
- commitWithin is a soft guarantee: Solr may delay commits under load. For strict ordering, configure auto-commits on the Solr side and leave commitWithin at its default.
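The re-parsing note can be sketched as a small Python helper that generates the iterator and emitter sections from one shared connection block, so both always target the same collection. The helper name `reparse_config` and its default IDs are hypothetical. The field names come from the examples on this page, and the fetcher named by fetcherId still has to be defined separately.

```python
import json


def reparse_config(collection, solr_urls, filters, fetcher_id="fsf", emitter_id="solre"):
    """Build iterator and emitter sections that share one Solr collection."""
    connection = {"solrCollection": collection, "solrUrls": list(solr_urls)}
    return {
        "emitters": {
            emitter_id: {
                "solr-emitter": {
                    **connection,
                    "idField": "id",
                    # Only touch documents that already exist in the collection.
                    "updateStrategy": "UPDATE_MUST_EXIST",
                }
            }
        },
        "pipes-iterator": {
            "solr-pipes-iterator": {
                **connection,
                "filters": list(filters),
                "idField": "id",
                "rows": 5000,
                "fetcherId": fetcher_id,
                "emitterId": emitter_id,
            }
        },
    }


cfg = reparse_config(
    "tika-docs",
    ["http://solr1.example.com:8983/solr"],
    ["status:NEEDS_REPARSE"],
)
print(json.dumps(cfg, indent=2))
```

Deriving both sections from the same `connection` dictionary removes the risk of the iterator reading one collection while the emitter writes to another.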