File System Plugin
The File System plugin (tika-pipes-file-system) is the most common starting point for Tika Pipes. It provides all four interfaces — fetcher, emitter, iterator, and reporter — backed by the local (or mounted) filesystem.
| Interface | Component name | Class |
|---|---|---|
| Fetcher | file-system-fetcher | |
| Emitter | file-system-emitter | |
| Iterator | file-system-pipes-iterator | |
| Reporter | file-system-reporter | |
Complete Pipeline Example
The example below is the canonical filesystem-to-filesystem integration test config. Tokens like FETCHER_BASE_PATH, EMITTER_BASE_PATH, and PLUGINS_PATHS are placeholders the test harness substitutes; replace them with real paths in your own config.
{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "FETCHER_BASE_PATH",
        "extractFileSystemMetadata": false
      }
    }
  },
  "emitters": {
    "fse": {
      "file-system-emitter": {
        "basePath": "EMITTER_BASE_PATH",
        "fileExtension": "json",
        "onExists": "EXCEPTION"
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "FETCHER_BASE_PATH",
      "countTotal": true,
      "fetcherId": "fsf",
      "emitterId": "fse"
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4,
    "emitIntermediateResults": "EMIT_INTERMEDIATE_RESULTS",
    "forkedJvmArgs": ["-Xmx512m"],
    "emitStrategy": {
      "type": "DYNAMIC",
      "thresholdBytes": 1000000
    }
  },
  "auto-detect-parser": {
    "throwOnZeroBytes": false
  },
  "parse-context": {
    "mock-digester-factory": {},
    "timeout-limits": {
      "progressTimeoutMillis": 5000
    }
  },
  "plugin-roots": "PLUGINS_PATHS"
}
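If you keep the placeholder tokens in a template of your own, the substitution step can be sketched in a few lines of Python. This is only an illustration, not how the test harness does it: the function name render_config, the trimmed template, and the /opt/tika/plugins location are all hypothetical, and treating plugin-roots as a single path is an assumption.

```python
import json
from pathlib import Path


def render_config(template_text: str, substitutions: dict) -> dict:
    """Replace placeholder tokens with real paths and parse the result as JSON."""
    rendered = template_text
    for token, path in substitutions.items():
        # as_posix() avoids backslashes, which would need escaping inside JSON strings.
        rendered = rendered.replace(token, Path(path).as_posix())
    # Parsing here surfaces a malformed template before the pipeline ever starts.
    return json.loads(rendered)


# A trimmed-down template that reuses the tokens from the full example above.
TEMPLATE = """
{
  "fetchers": {"fsf": {"file-system-fetcher": {"basePath": "FETCHER_BASE_PATH"}}},
  "emitters": {"fse": {"file-system-emitter": {"basePath": "EMITTER_BASE_PATH"}}},
  "plugin-roots": "PLUGINS_PATHS"
}
"""

config = render_config(
    TEMPLATE,
    {
        "FETCHER_BASE_PATH": Path("/data/input"),
        "EMITTER_BASE_PATH": Path("/data/output"),
        "PLUGINS_PATHS": Path("/opt/tika/plugins"),  # hypothetical location
    },
)
```

Parsing the rendered text immediately is a deliberate choice in this sketch: a leftover token still parses as a string, but a broken quote or brace fails early.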
File System Fetcher (file-system-fetcher)
Reads files from a local or mounted filesystem. Fetch keys are resolved relative to basePath.
{
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "/data/input",
        "extractFileSystemMetadata": true
      }
    }
  }
}
The outer key (fsf) is the fetcher ID, which the pipes iterator references through its fetcherId field elsewhere in the config.
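Because the iterator refers to fetchers and emitters by ID, a mistyped ID is an easy mistake to make. The following pre-flight check is a hypothetical helper (check_iterator_ids is not part of Tika) that assumes the config layout shown in this page and reports iterator references that do not match any defined ID.

```python
def check_iterator_ids(config: dict) -> list:
    """Return a list of problems with the iterator's fetcherId/emitterId references."""
    problems = []
    fetcher_ids = set(config.get("fetchers", {}))
    emitter_ids = set(config.get("emitters", {}))
    # "pipes-iterator" maps the iterator's component name to its settings.
    for component, settings in config.get("pipes-iterator", {}).items():
        fetcher_id = settings.get("fetcherId")
        emitter_id = settings.get("emitterId")
        if fetcher_id not in fetcher_ids:
            problems.append(f"{component}: unknown fetcherId {fetcher_id!r}")
        if emitter_id not in emitter_ids:
            problems.append(f"{component}: unknown emitterId {emitter_id!r}")
    return problems


example = {
    "fetchers": {"fsf": {"file-system-fetcher": {"basePath": "/data/input"}}},
    "emitters": {"fse": {"file-system-emitter": {"basePath": "/data/output"}}},
    "pipes-iterator": {
        "file-system-pipes-iterator": {
            "basePath": "/data/input",
            "fetcherId": "fsf",
            "emitterId": "fse",
        }
    },
}
```

An empty list from check_iterator_ids(example) means every reference resolves.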
Configuration
| Field | Default | Description |
|---|---|---|
| basePath | required | Base directory for fetch operations. Fetch keys are resolved relative to this path. |
| | | When |
| | | When |
File System Emitter (file-system-emitter)
Writes parsed results as files under basePath. The relative output path is derived from the emit key of each FetchEmitTuple.
{
  "emitters": {
    "fse": {
      "file-system-emitter": {
        "basePath": "/data/output",
        "fileExtension": "json",
        "onExists": "EXCEPTION",
        "prettyPrint": false
      }
    }
  }
}
Configuration
| Field | Default | Description |
|---|---|---|
| basePath | required | Base output directory. The emit key is resolved relative to this path. |
| fileExtension | | Extension appended to each output file. For |
| onExists | | Behavior when the output file already exists: |
| prettyPrint | | Pretty-print JSON output. Has no effect in |
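To make the path derivation concrete, here is a rough Python model of how an emit key might map to an output file. It is a sketch under assumptions, not the emitter's code: the names output_path and write_result are hypothetical, joining the extension with a dot is assumed, and only the EXCEPTION setting for onExists is modeled.

```python
from pathlib import Path


def output_path(base_path: str, emit_key: str, file_extension: str) -> Path:
    """Resolve the emit key under base_path and append the extension."""
    # Assumption: the extension is appended after a dot, e.g. a.pdf -> a.pdf.json.
    return Path(base_path) / f"{emit_key}.{file_extension}"


def write_result(target: Path, payload: str) -> None:
    """Write payload to target, failing if the file exists (models onExists=EXCEPTION)."""
    # Intermediate directories are created as needed.
    target.parent.mkdir(parents=True, exist_ok=True)
    # Mode "x" raises FileExistsError when the target is already present.
    with open(target, "x", encoding="utf-8") as handle:
        handle.write(payload)
```

With these assumptions, an emit key of reports/a.pdf under /data/output would land at /data/output/reports/a.pdf.json, and a second write to the same key would raise instead of overwriting.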
File System Iterator (file-system-pipes-iterator)
Recursively walks a directory tree, emitting one FetchEmitTuple per file found.
{
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "/data/input",
      "countTotal": true,
      "fetcherId": "fsf",
      "emitterId": "fse"
    }
  }
}
Configuration
| Field | Default | Description |
|---|---|---|
| basePath | required | Root directory to walk. |
| countTotal | | If |
| fetcherId, emitterId | required | IDs of the fetcher and emitter to bind to each emitted tuple. See Pipes Iterators for the shared iterator contract. |
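The recursive walk can be pictured with a short Python generator. This is an illustrative model only: FetchEmitPair is a hypothetical stand-in for FetchEmitTuple, and using the path relative to basePath as both the fetch key and the emit key is an assumption of the sketch.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator


@dataclass(frozen=True)
class FetchEmitPair:
    """Hypothetical stand-in for a FetchEmitTuple."""
    fetcher_id: str
    fetch_key: str
    emitter_id: str
    emit_key: str


def walk(base_path: str, fetcher_id: str, emitter_id: str) -> Iterator[FetchEmitPair]:
    """Yield one pair per regular file found under base_path."""
    base = Path(base_path)
    for dirpath, dirnames, filenames in os.walk(base):
        dirnames.sort()  # deterministic traversal order for the sketch
        for name in sorted(filenames):
            # Assumption: keys are paths relative to base_path, with forward slashes.
            relative = (Path(dirpath) / name).relative_to(base).as_posix()
            yield FetchEmitPair(fetcher_id, relative, emitter_id, relative)
```

Calling list(walk("/data/input", "fsf", "fse")) would produce one pair per file, each bound to the configured fetcher and emitter IDs.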
File System Reporter (file-system-reporter)
Maintains a JSON status file that summarizes pipeline progress. The reporter writes the file periodically on a background thread; per-record report() calls only update in-memory counters.
{
  "pipes-reporters": {
    "file-system-reporter": {
      "statusFile": "/var/log/tika/status.json",
      "reportUpdateMs": 1000
    }
  }
}
pipes-reporters accepts multiple reporters keyed by type name — see Pipes Reporters for how multiple reporters compose.
Configuration
| Field | Default | Description |
|---|---|---|
| statusFile | required | Path of the JSON status file. The file is created on first write and overwritten in place. |
| reportUpdateMs | no default | Interval in milliseconds between status-file writes. Typical values: |
Status file schema
The reporter serializes an AsyncStatus object to JSON, containing:
- asyncStatus: current pipeline phase (STARTED, COMPLETED, CRASHED).
- counts: map of RESULT_STATUS to count (e.g., PARSE_SUCCESS, PARSE_EXCEPTION, TIMEOUT, OOM).
- totalCountResult: total documents processed and whether the enumeration is complete.
- timestamp: when the file was last written.
- crashMessage: populated only on fatal pipeline failure.
The file is rewritten in full on each tick, not appended.
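As a sketch of how a script might consume one snapshot, the Python below condenses a status document into a single line. The top-level field names come from the list above, but the exact JSON layout is an assumption: the sample values and the timestamp format are made up, totalCountResult is left out because its inner structure is not described here, and summarize is a hypothetical helper.

```python
import json

# Made-up snapshot; only the top-level field names are taken from the schema above.
SAMPLE = """
{
  "asyncStatus": "STARTED",
  "counts": {"PARSE_SUCCESS": 40, "PARSE_EXCEPTION": 2, "TIMEOUT": 1},
  "timestamp": "2024-01-01T00:00:00Z",
  "crashMessage": null
}
"""


def summarize(status: dict) -> str:
    """Condense a status snapshot into a one-line progress message."""
    counts = status.get("counts") or {}
    processed = sum(counts.values())
    succeeded = counts.get("PARSE_SUCCESS", 0)
    line = (
        f"{status.get('asyncStatus')}: {processed} processed, "
        f"{processed - succeeded} not PARSE_SUCCESS"
    )
    # crashMessage is expected only after a fatal failure.
    if status.get("crashMessage"):
        line += f" (crash: {status['crashMessage']})"
    return line


summary = summarize(json.loads(SAMPLE))
```

Summing the counts map gives a processed figure without depending on any field whose layout is unknown.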
Live status for watching applications
The reporter is designed to support external "watchers" — UIs, dashboards, or monitoring scripts that poll the status file to display pipeline progress. To use it that way, set reportUpdateMs to match your desired refresh rate:
"reportUpdateMs": 250
The watcher polls statusFile on its own interval and reads the most recent snapshot. Because the file is rewritten in full with the latest status, watchers do not need to handle partial reads.
This pattern is used by tika-gui-v2 to drive its progress UI: the GUI starts a pipeline subprocess, points the reporter at a temp file, and polls that file every few hundred milliseconds.
Tradeoffs:
- Smaller reportUpdateMs values mean more disk writes. On a fast SSD this is negligible, but on a slow disk (or NFS) the writer thread can become a bottleneck.
- The reporter thread sleeps between writes, so the worst-case staleness of the file is reportUpdateMs milliseconds plus serialization time.
- Per-record report() calls are cheap (counter increment only). The cost of "watching" is bounded by the periodic write, not by document throughput.
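A watcher along these lines could be sketched in Python as follows. This is not how tika-gui-v2 is implemented; watch, its parameters, and the cap on polls are hypothetical, and treating COMPLETED and CRASHED as the phases that end the loop is an assumption. The handling of a missing or unparseable file is purely defensive (for example, the watcher may start before the first write).

```python
import json
import time
from pathlib import Path
from typing import Callable, Iterator

# Assumption: these two phases mean no further updates will arrive.
TERMINAL_PHASES = {"COMPLETED", "CRASHED"}


def watch(
    status_file: Path,
    poll_seconds: float = 0.25,
    sleep: Callable[[float], None] = time.sleep,
    max_polls: int = 1000,
) -> Iterator[dict]:
    """Yield each changed snapshot until a terminal phase or max_polls is reached."""
    last_seen = None
    for _ in range(max_polls):
        try:
            snapshot = json.loads(status_file.read_text(encoding="utf-8"))
        except (FileNotFoundError, json.JSONDecodeError):
            # Defensive: skip this tick and try again on the next one.
            snapshot = None
        if snapshot is not None and snapshot != last_seen:
            last_seen = snapshot
            yield snapshot
            if snapshot.get("asyncStatus") in TERMINAL_PHASES:
                return
        sleep(poll_seconds)
```

Polling faster than reportUpdateMs gains nothing, since the file only changes once per reporter tick; the sketch simply skips snapshots that are identical to the previous one.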
Security Notes
- basePath is a sandbox boundary. The fetcher and emitter reject fetch/emit keys that resolve outside basePath. Do not set allowAbsolutePaths=true unless the source of fetch keys is fully trusted; an attacker-controlled fetch key could otherwise read arbitrary files.
- Symlinks are followed. A symlink under basePath pointing outside basePath may still be readable. If you need strict containment, do not allow symlinks in your input tree.
- Output directories are created automatically. The emitter creates intermediate directories as needed. Make sure the process’s umask is appropriate for the data being written.
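If fetch keys come from an untrusted source, you may want your own containment check before keys ever reach the pipeline. The Python sketch below is one way to do that and is not a description of the plugin's internal check: resolve_within is a hypothetical helper, and it deliberately resolves symlinks before comparing, which is stricter than the behavior described in the notes above.

```python
from pathlib import Path


def resolve_within(base_path: str, key: str) -> Path:
    """Resolve key under base_path, rejecting results that escape the base directory."""
    base = Path(base_path).resolve()
    # resolve() collapses ".." segments and follows symlinks, so the comparison
    # below sees the real target. An absolute key replaces the base entirely.
    candidate = (base / key).resolve()
    try:
        candidate.relative_to(base)
    except ValueError:
        raise ValueError(f"key {key!r} resolves outside {base}") from None
    return candidate
```

Under these assumptions, resolve_within("/data/input", "a/b.pdf") returns a path under /data/input, while "../secrets.txt" or "/etc/passwd" raises ValueError.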