File System Plugin
The File System plugin (tika-pipes-file-system) is the most common starting point for Tika Pipes. It provides all four interfaces — fetcher, emitter, iterator, and reporter — backed by the local (or mounted) filesystem.
| Interface | Component name | Class |
|---|---|---|
| Fetcher | file-system-fetcher | |
| Emitter | file-system-emitter | |
| Iterator | file-system-pipes-iterator | |
| Reporter | file-system-reporter | |
Complete Pipeline Example
The example below is the canonical filesystem-to-filesystem integration test config. Tokens like FETCHER_BASE_PATH, EMITTER_BASE_PATH, PLUGINS_PATHS, and EMIT_INTERMEDIATE_RESULTS are placeholders that the test harness substitutes. To use the config yourself, replace the path tokens with real paths and replace the quoted "EMIT_INTERMEDIATE_RESULTS" token with the unquoted boolean true or false. See Pipes Configuration for what each setting does.
{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "FETCHER_BASE_PATH",
        "extractFileSystemMetadata": false
      }
    }
  },
  "emitters": {
    "fse": {
      "file-system-emitter": {
        "basePath": "EMITTER_BASE_PATH",
        "fileExtension": "json",
        "onExists": "EXCEPTION"
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "FETCHER_BASE_PATH",
      "countTotal": true,
      "fetcherId": "fsf",
      "emitterId": "fse"
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4,
    "emitIntermediateResults": "EMIT_INTERMEDIATE_RESULTS",
    "forkedJvmArgs": ["-Xmx512m"],
    "emitStrategy": {
      "type": "DYNAMIC",
      "thresholdBytes": 1000000
    }
  },
  "auto-detect-parser": {
    "throwOnZeroBytes": false
  },
  "parse-context": {
    "mock-digester-factory": {},
    "timeout-limits": {
      "progressTimeoutMillis": 5000
    }
  },
  "plugin-roots": "PLUGINS_PATHS"
}
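One way to turn the template into a usable config is to parse it as JSON first and then swap out any string value that exactly matches a token. The sketch below is only an illustration of that idea, not the test harness's actual substitution logic; the helper name fill_placeholders, the shortened template, and the /opt/tika/plugins path are all hypothetical. Replacing values after parsing means the EMIT_INTERMEDIATE_RESULTS token becomes a real JSON boolean rather than the string "true".

```python
import json


def fill_placeholders(node, values):
    """Recursively replace string values that exactly match a token name."""
    if isinstance(node, dict):
        return {key: fill_placeholders(child, values) for key, child in node.items()}
    if isinstance(node, list):
        return [fill_placeholders(child, values) for child in node]
    if isinstance(node, str) and node in values:
        return values[node]
    return node


# Shortened copy of the template above; the full config works the same way.
template = json.loads("""
{
  "fetchers": {"fsf": {"file-system-fetcher": {"basePath": "FETCHER_BASE_PATH"}}},
  "emitters": {"fse": {"file-system-emitter": {"basePath": "EMITTER_BASE_PATH"}}},
  "pipes": {"emitIntermediateResults": "EMIT_INTERMEDIATE_RESULTS"},
  "plugin-roots": "PLUGINS_PATHS"
}
""")

values = {
    "FETCHER_BASE_PATH": "/data/input",
    "EMITTER_BASE_PATH": "/data/output",
    "PLUGINS_PATHS": "/opt/tika/plugins",  # hypothetical plugin directory
    "EMIT_INTERMEDIATE_RESULTS": False,     # a real boolean, not a string
}

config = fill_placeholders(template, values)
print(json.dumps(config, indent=2))
```

Because the substitution happens on parsed values, paths containing backslashes or quotes do not need manual escaping.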
File System Fetcher (file-system-fetcher)
Reads files from a local or mounted filesystem. Fetch keys are resolved relative to basePath.
{
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "/data/input",
        "extractFileSystemMetadata": true
      }
    }
  }
}
The outer key (fsf) is the fetcher ID. It is the value that the fetcherId field of pipes-iterator refers to elsewhere in the config.
Configuration
| Field | Default | Description |
|---|---|---|
| basePath | required | Base directory for fetch operations. Fetch keys are resolved relative to this path. |
| | | When |
| | | When |
File System Emitter (file-system-emitter)
Writes parsed results as files under basePath. The relative output path is derived from the emit key of each FetchEmitTuple.
{
  "emitters": {
    "fse": {
      "file-system-emitter": {
        "basePath": "/data/output",
        "fileExtension": "json",
        "onExists": "EXCEPTION",
        "prettyPrint": false
      }
    }
  }
}
Configuration
| Field | Default | Description |
|---|---|---|
| basePath | required | Base output directory. The emit key is resolved relative to this path. |
| fileExtension | | Extension appended to each output file. For |
| onExists | | Behavior when the output file already exists: |
| prettyPrint | | Pretty-print JSON output. Has no effect in |
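To make the relationship between basePath, the emit key, and fileExtension concrete, here is a minimal sketch. It is not the emitter's implementation: the helper names emitter_target and write_result are hypothetical, and joining the extension with a dot is an assumption. Only the EXCEPTION behavior shown in the example config is modeled.

```python
import tempfile
from pathlib import Path


def emitter_target(base_path, emit_key, file_extension):
    """Build an output path from an emit key (assumed rule: key + '.' + extension)."""
    relative = f"{emit_key}.{file_extension}" if file_extension else emit_key
    return Path(base_path) / relative


def write_result(target, payload, on_exists="EXCEPTION"):
    """Write payload to target, refusing to overwrite an existing file."""
    if on_exists != "EXCEPTION":
        raise ValueError("this sketch only models onExists=EXCEPTION")
    # Intermediate directories are created as needed.
    target.parent.mkdir(parents=True, exist_ok=True)
    # Mode "x" raises FileExistsError if the file is already there.
    with open(target, "x", encoding="utf-8") as out:
        out.write(payload)


with tempfile.TemporaryDirectory() as tmp:
    target = emitter_target(tmp, "docs/report.pdf", "json")
    write_result(target, "[]")
    try:
        write_result(target, "[]")
        second_write = "overwrote"
    except FileExistsError:
        second_write = "rejected"
    print(target.name, second_write)
```

Opening the file in exclusive-create mode makes the existence check and the write a single step, so two writers cannot both pass the check.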
File System Iterator (file-system-pipes-iterator)
Recursively walks a directory tree, emitting one FetchEmitTuple per file found.
{
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "/data/input",
      "countTotal": true,
      "fetcherId": "fsf",
      "emitterId": "fse"
    }
  }
}
Configuration
| Field | Default | Description |
|---|---|---|
| basePath | required | Root directory to walk. |
| countTotal | | If |
| fetcherId, emitterId | required | IDs of the fetcher and emitter to bind to each emitted tuple. See Pipes Iterators for the shared iterator contract. |
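The walk-and-bind behavior can be pictured with a short sketch. This is an illustration only, not the iterator's code: the function name iter_tuples and the dictionary keys (fetchKey, emitKey) are hypothetical stand-ins for FetchEmitTuple fields, and using the same relative path for both keys is an assumption.

```python
import os
import tempfile


def iter_tuples(base_path, fetcher_id, emitter_id):
    """Yield one tuple-like dict per file found under base_path."""
    for dirpath, dirnames, filenames in os.walk(base_path):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            full_path = os.path.join(dirpath, name)
            # Keys are relative to base_path, so the fetcher can resolve them again.
            key = os.path.relpath(full_path, base_path).replace(os.sep, "/")
            yield {
                "fetcherId": fetcher_id,
                "fetchKey": key,
                "emitterId": emitter_id,
                "emitKey": key,
            }


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for relative in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(tmp, relative), "w", encoding="utf-8") as out:
            out.write("x")
    tuples = list(iter_tuples(tmp, "fsf", "fse"))
    keys = [t["fetchKey"] for t in tuples]
    print(keys)
```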
File System Reporter (file-system-reporter)
Maintains a JSON status file that summarizes pipeline progress. The reporter writes the file periodically on a background thread; per-record report() calls only update in-memory counters.
{
  "pipes-reporters": {
    "file-system-reporter": {
      "statusFile": "/var/log/tika/status.json",
      "reportUpdateMs": 1000
    }
  }
}
pipes-reporters accepts multiple reporters keyed by type name — see Pipes Reporters for how multiple reporters compose.
Configuration
| Field | Default | Description |
|---|---|---|
| statusFile | required | Path of the JSON status file. Absolute paths are written as given; relative paths resolve against the JVM’s working directory at startup. Parent directories that don’t exist are created automatically on first write. Always include a parent component (e.g., |
| reportUpdateMs | no default | Interval in milliseconds between status-file writes. Typical values: |
Status file schema
The reporter serializes an AsyncStatus object to JSON, containing:
- started: ISO-8601 timestamp of when the reporter was constructed.
- lastUpdate: ISO-8601 timestamp of the most recent write.
- asyncStatus: current pipeline phase (STARTED, COMPLETED, CRASHED).
- statusCounts: map of PipesResult.RESULT_STATUS to count (e.g., PARSE_SUCCESS, PARSE_EXCEPTION, TIMEOUT, OOM, EMIT_SUCCESS, EMIT_EXCEPTION).
- totalCountResult: total documents discovered by the iterator and whether the enumeration is complete.
- crashMessage: empty string under normal operation; populated with a stack trace on fatal pipeline failure.
The file is rewritten in full on each tick, not appended.
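As a rough illustration of how a consumer might use these fields, the sketch below condenses a parsed status document into a few display values. The sample numbers are made up, the function name summarize is hypothetical, and treating the sum of statusCounts as "results reported so far" is an assumption about how the counters relate to documents.

```python
import json

# Made-up sample document using the field names listed above.
sample = json.loads("""
{
  "asyncStatus": "STARTED",
  "statusCounts": {"PARSE_SUCCESS": 40, "PARSE_EXCEPTION": 2, "TIMEOUT": 1},
  "crashMessage": ""
}
""")


def summarize(status):
    """Reduce a status document to a few values a dashboard could show."""
    counts = status.get("statusCounts", {})
    return {
        "phase": status.get("asyncStatus"),
        "reported": sum(counts.values()),            # total results counted so far
        "crashed": bool(status.get("crashMessage")),  # empty string means no crash
    }


summary = summarize(sample)
print(summary)
```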
NOTE: The write is not atomic. The reporter opens the target path with Files.newBufferedWriter, truncates, and streams the JSON. A watcher reading concurrently with a write can observe a truncated or partial document. Have the watcher treat a parse error as "stale read, try again on the next poll" rather than as a real error.
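A watcher that follows this advice could look roughly like the sketch below. It is one possible shape, not part of Tika: the function names read_status and poll_until_done, the poll limit, and the default interval are all assumptions. A missing, empty, or half-written file is reported as "no fresh snapshot", and the loop keeps the last good one.

```python
import json
import os
import tempfile
import time


def read_status(path):
    """Return the parsed status, or None if the file is missing or mid-write."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            return json.load(handle)
    except (FileNotFoundError, ValueError):
        # ValueError covers json.JSONDecodeError and a cut-off multi-byte character.
        return None


def poll_until_done(path, interval_s=0.25, max_polls=40):
    """Poll the status file, keeping the last snapshot that parsed cleanly."""
    last_good = None
    for _ in range(max_polls):
        snapshot = read_status(path)
        if snapshot is not None:
            last_good = snapshot
            if snapshot.get("asyncStatus") in ("COMPLETED", "CRASHED"):
                break
        time.sleep(interval_s)
    return last_good


with tempfile.TemporaryDirectory() as tmp:
    status_path = os.path.join(tmp, "status.json")
    # Simulate a read that lands in the middle of a write.
    with open(status_path, "w", encoding="utf-8") as out:
        out.write('{"asyncStatus": "STA')
    partial = read_status(status_path)
    # Simulate the next completed write.
    with open(status_path, "w", encoding="utf-8") as out:
        json.dump({"asyncStatus": "COMPLETED", "statusCounts": {}}, out)
    final = poll_until_done(status_path, interval_s=0.0)
    print(partial, final["asyncStatus"])
```

Returning None instead of raising keeps the polling loop simple: the caller just redraws with whatever it last saw.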
Live status for watching applications
The reporter is designed to support external "watchers" — UIs, dashboards, or monitoring scripts that poll the status file to display pipeline progress. To use it that way, set reportUpdateMs to match your desired refresh rate:
"reportUpdateMs": 250
The watcher polls statusFile on its own interval and reads the most recent snapshot. Each tick rewrites the file in full, so every completed write is a self-contained snapshot. Because the write is not atomic, however, a watcher reading mid-write can see a truncated document. Tolerate JSON parse errors as transient and retry on the next poll (see the NOTE under Status file schema).
This pattern is used by tika-gui-v2 to drive its progress UI: the GUI starts a pipeline subprocess, points the reporter at a temp file, and polls that file every few hundred milliseconds.
Tradeoffs:
- Smaller reportUpdateMs values mean more disk writes. On a fast SSD this is negligible, but on a slow disk (or NFS) the writer thread can become a bottleneck.
- The reporter thread sleeps between writes, so the worst-case staleness of the file is reportUpdateMs milliseconds plus serialization time.
- Per-record report() calls are cheap (counter increment only). The cost of "watching" is bounded by the periodic write, not by document throughput.
Security Notes
- basePath is a sandbox boundary. The fetcher and emitter reject fetch/emit keys that resolve outside basePath. Do not set allowAbsolutePaths=true unless the source of fetch keys is fully trusted; an attacker-controlled fetch key could otherwise read arbitrary files.
- Symlinks are followed. A symlink under basePath pointing outside basePath may still be readable. If you need strict containment, do not allow symlinks in your input tree.
- Output directories are created automatically. The emitter creates intermediate directories as needed. Make sure the process’s umask is appropriate for the data being written.
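If fetch keys come from an untrusted source, it can help to screen them in your own code before they reach the pipeline. The sketch below shows one way to do a containment check. It is not the plugin's implementation, and the function name resolve_inside is hypothetical. Because Path.resolve() follows symlinks, this check also rejects a symlink that points outside the base directory, which is stricter than the behavior described above.

```python
import tempfile
from pathlib import Path


def resolve_inside(base_path, key):
    """Resolve key under base_path, raising ValueError if it escapes the base."""
    base = Path(base_path).resolve()
    # resolve() collapses ".." segments and follows symlinks; an absolute key
    # replaces the base entirely and is caught by the check below.
    candidate = (base / key).resolve()
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"key escapes base path: {key!r}")
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    accepted = resolve_inside(tmp, "docs/a.pdf")
    outcomes = []
    for key in ("../secret.txt", "/etc/passwd"):
        try:
            resolve_inside(tmp, key)
            outcomes.append("allowed")
        except ValueError:
            outcomes.append("rejected")
    print(accepted.name, outcomes)
```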