File System Plugin

The File System plugin (tika-pipes-file-system) is the most common starting point for Tika Pipes. It provides all four interfaces — fetcher, emitter, iterator, and reporter — backed by the local (or mounted) filesystem.

| Interface | Component name | Class |
|---|---|---|
| Fetcher | file-system-fetcher | FileSystemFetcher |
| Emitter | file-system-emitter | FileSystemEmitter |
| Iterator | file-system-pipes-iterator | FileSystemPipesIterator |
| Reporter | file-system-reporter | FileSystemStatusReporter |

Complete Pipeline Example

The example below is the canonical filesystem-to-filesystem integration test config. Tokens like FETCHER_BASE_PATH, EMITTER_BASE_PATH, and PLUGINS_PATHS are placeholders the test harness substitutes; replace them with real paths in your own config.

{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "FETCHER_BASE_PATH",
        "extractFileSystemMetadata": false
      }
    }
  },
  "emitters": {
    "fse": {
      "file-system-emitter": {
        "basePath": "EMITTER_BASE_PATH",
        "fileExtension": "json",
        "onExists": "EXCEPTION"
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "FETCHER_BASE_PATH",
      "countTotal": true,
      "fetcherId": "fsf",
      "emitterId": "fse"
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4,
    "emitIntermediateResults": "EMIT_INTERMEDIATE_RESULTS",
    "forkedJvmArgs": ["-Xmx512m"],
    "emitStrategy": {
      "type": "DYNAMIC",
      "thresholdBytes": 1000000
    }
  },
  "auto-detect-parser": {
    "throwOnZeroBytes": false
  },
  "parse-context": {
    "mock-digester-factory": {},
    "timeout-limits": {
      "progressTimeoutMillis": 5000
    }
  },
  "plugin-roots": "PLUGINS_PATHS"
}
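If you start from the test config as a template, the placeholder substitution can be scripted. The following is a minimal sketch, not the test harness's own mechanism: the helper name fill_placeholders, the trimmed-down template, and the example paths (including /opt/tika/plugins) are assumptions made for illustration.

```python
import json

def fill_placeholders(template: str, values: dict) -> dict:
    """Replace each quoted placeholder token with a JSON-escaped value, then parse."""
    for token, path in values.items():
        # Replace the token together with its surrounding quotes so that
        # json.dumps can escape backslashes (e.g. in Windows paths) correctly.
        template = template.replace(f'"{token}"', json.dumps(path))
    # Parsing the result confirms the substituted config is still valid JSON.
    return json.loads(template)

# Trimmed-down template; the full config above is handled the same way.
TEMPLATE = """
{
  "fetchers": {"fsf": {"file-system-fetcher": {"basePath": "FETCHER_BASE_PATH"}}},
  "emitters": {"fse": {"file-system-emitter": {"basePath": "EMITTER_BASE_PATH"}}},
  "plugin-roots": "PLUGINS_PATHS"
}
"""

config = fill_placeholders(TEMPLATE, {
    "FETCHER_BASE_PATH": "/data/input",
    "EMITTER_BASE_PATH": "/data/output",
    "PLUGINS_PATHS": "/opt/tika/plugins",  # hypothetical plugin directory
})
print(json.dumps(config, indent=2))
```

Substituting the quoted token rather than the bare token keeps the output valid JSON even when a path contains characters that need escaping.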

File System Fetcher (file-system-fetcher)

Reads files from a local or mounted filesystem. Fetch keys are resolved relative to basePath.

{
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "/data/input",
        "extractFileSystemMetadata": true
      }
    }
  }
}

The outer key (fsf) is the fetcher ID. The pipes-iterator section refers to it through its fetcherId field.

Configuration

| Field | Default | Description |
|---|---|---|
| basePath | required (may be omitted when allowAbsolutePaths is true) | Base directory for fetch operations. Fetch keys are resolved relative to this path. |
| extractFileSystemMetadata | false | When true, attach the file size and the created and modified timestamps to the metadata of each fetched document. |
| allowAbsolutePaths | false | When true, fetch keys may be absolute paths and basePath may be omitted. Use sparingly; see Security Notes. |
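To make the resolution rule concrete, here is a small sketch of how a fetch key could be resolved against basePath. It is an illustration written for this page, not the plugin's code: the function name resolve_fetch_key and the use of ValueError are assumptions. The check is purely lexical, so it does not look through symlinks.

```python
import os

def resolve_fetch_key(base_path, fetch_key, allow_absolute_paths=False):
    """Illustrative only: resolve a fetch key the way the fields above describe."""
    if os.path.isabs(fetch_key):
        if not allow_absolute_paths:
            raise ValueError(f"absolute fetch key not allowed: {fetch_key}")
        return os.path.normpath(fetch_key)

    base = os.path.abspath(base_path)
    # normpath collapses "." and ".." segments without touching the disk.
    candidate = os.path.normpath(os.path.join(base, fetch_key))
    # Reject keys such as "../secret.txt" that climb out of base_path.
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError(f"fetch key escapes basePath: {fetch_key}")
    return candidate
```

With basePath set to /data/input, the key reports/a.pdf resolves to /data/input/reports/a.pdf, while ../secret.txt is rejected.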

File System Emitter (file-system-emitter)

Writes parsed results as files under basePath. The relative output path is derived from the emit key of each FetchEmitTuple.

{
  "emitters": {
    "fse": {
      "file-system-emitter": {
        "basePath": "/data/output",
        "fileExtension": "json",
        "onExists": "EXCEPTION",
        "prettyPrint": false
      }
    }
  }
}

Configuration

| Field | Default | Description |
|---|---|---|
| basePath | required | Base output directory. The emit key is resolved relative to this path. |
| fileExtension | json | Extension appended to each output file. For CONTENT_ONLY mode, set this to match the handler type (txt, html, md, xml). |
| onExists | EXCEPTION | Behavior when the output file already exists: SKIP (do nothing), REPLACE (overwrite), EXCEPTION (fail loudly). |
| prettyPrint | false | Pretty-print JSON output. Has no effect in CONTENT_ONLY mode (raw bytes are written). |
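The interaction of fileExtension, onExists, and prettyPrint can be sketched as follows. This is a simplified model for illustration rather than the emitter's implementation: the function name emit, the dot-joined "emit key + extension" file name, and the use of FileExistsError are assumptions.

```python
import json
import os

def emit(base_path, emit_key, metadata_list,
         file_extension="json", on_exists="EXCEPTION", pretty_print=False):
    """Illustrative only: write one JSON result file under base_path."""
    if on_exists not in {"SKIP", "REPLACE", "EXCEPTION"}:
        raise ValueError(f"unknown onExists value: {on_exists}")

    # Assumption: the extension is appended to the emit key with a dot.
    target = os.path.join(base_path, f"{emit_key}.{file_extension}")

    if os.path.exists(target):
        if on_exists == "SKIP":
            return None                      # leave the existing file alone
        if on_exists == "EXCEPTION":
            raise FileExistsError(target)    # fail loudly
        # REPLACE falls through and overwrites the file below.

    # Intermediate directories are created as needed.
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "w", encoding="utf-8") as out:
        json.dump(metadata_list, out, indent=2 if pretty_print else None)
    return target
```

EXCEPTION is the safest setting for a first run because it surfaces accidental reuse of an output directory. SKIP suits resumed runs, and REPLACE suits deliberate reprocessing.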

File System Iterator (file-system-pipes-iterator)

Recursively walks a directory tree, emitting one FetchEmitTuple per file found.

{
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "/data/input",
      "countTotal": true,
      "fetcherId": "fsf",
      "emitterId": "fse"
    }
  }
}

Configuration

| Field | Default | Description |
|---|---|---|
| basePath | required | Root directory to walk. |
| countTotal | true | If true, walks the tree once to count files before processing begins. Enables progress reporting at the cost of an extra scan over the tree. |
| fetcherId / emitterId | required | IDs of the fetcher and emitter to bind to each emitted tuple. See Pipes Iterators for the shared iterator contract. |

Notes

  • Walk order is filesystem-dependent and not guaranteed stable across runs.

  • The relative path of each file (from basePath) becomes the fetch key, and by default also the emit key.

  • Symbolic links are followed.
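The notes above can be modelled with a short directory walk. This sketch is only an approximation of the iterator's behavior: the dictionary keys (fetchKey, emitKey, and so on) and the function names are assumptions chosen for illustration, not the FetchEmitTuple API.

```python
import os

def iter_tuples(base_path, fetcher_id, emitter_id):
    """Yield one tuple-like dict per file found under base_path."""
    # followlinks=True mirrors "symbolic links are followed"; note that a
    # symlink cycle would make this walk loop, so avoid cycles in the tree.
    for dirpath, _dirnames, filenames in os.walk(base_path, followlinks=True):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), base_path)
            key = rel.replace(os.sep, "/")
            # The relative path is the fetch key and, by default, the emit key.
            yield {"fetcherId": fetcher_id, "fetchKey": key,
                   "emitterId": emitter_id, "emitKey": key}

def count_total(base_path):
    """The extra scan that countTotal implies: count files before processing."""
    return sum(len(files) for _, _, files in os.walk(base_path, followlinks=True))
```

Because os.walk returns entries in whatever order the filesystem provides, consumers that need a reproducible order have to sort the keys themselves.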

File System Reporter (file-system-reporter)

Maintains a JSON status file that summarizes pipeline progress. The reporter writes the file periodically on a background thread; per-record report() calls only update in-memory counters.

{
  "pipes-reporters": {
    "file-system-reporter": {
      "statusFile": "/var/log/tika/status.json",
      "reportUpdateMs": 1000
    }
  }
}

pipes-reporters accepts multiple reporters keyed by type name — see Pipes Reporters for how multiple reporters compose.

Configuration

| Field | Default | Description |
|---|---|---|
| statusFile | required | Path of the JSON status file. The file is created on first write and overwritten in place. |
| reportUpdateMs | none | Interval in milliseconds between status-file writes. Typical values: 1000 for a low-overhead heartbeat, 100 for near-real-time updates. There is no built-in default, so always set this explicitly. |

Status file schema

The reporter serializes an AsyncStatus object to JSON, containing:

  • asyncStatus — current pipeline phase (STARTED, COMPLETED, CRASHED).

  • counts — map of RESULT_STATUS to count (e.g., PARSE_SUCCESS, PARSE_EXCEPTION, TIMEOUT, OOM).

  • totalCountResult — total documents processed and whether the enumeration is complete.

  • timestamp — when the file was last written.

  • crashMessage — populated only on fatal pipeline failure.

The file is rewritten in full on each tick, not appended.
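As an illustration of consuming this schema, the sketch below condenses a snapshot into a few headline numbers. The sample document is hypothetical: only the top-level field names come from the list above, the counts are invented, and totalCountResult and timestamp are left out because their exact shape is not specified here. It also assumes that each processed document contributes exactly one entry to counts.

```python
import json

# Hypothetical snapshot; the values are made up for the example.
SAMPLE = """
{
  "asyncStatus": "STARTED",
  "counts": {"PARSE_SUCCESS": 90, "PARSE_EXCEPTION": 8, "TIMEOUT": 2},
  "crashMessage": null
}
"""

def summarize(status: dict) -> dict:
    """Reduce a status snapshot to phase, totals, and a crash flag."""
    counts = status.get("counts") or {}
    processed = sum(counts.values())
    succeeded = counts.get("PARSE_SUCCESS", 0)
    return {
        "phase": status.get("asyncStatus"),
        "processed": processed,
        "parse_success": succeeded,
        "other": processed - succeeded,   # everything that is not PARSE_SUCCESS
        "crashed": bool(status.get("crashMessage")),
    }

print(summarize(json.loads(SAMPLE)))
```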

Live status for watching applications

The reporter is designed to support external "watchers" — UIs, dashboards, or monitoring scripts that poll the status file to display pipeline progress. To use it that way, set reportUpdateMs to match your desired refresh rate:

"reportUpdateMs": 250

The watcher polls statusFile on its own interval and reads the most recent snapshot. Because the file is rewritten in full with the latest status, watchers do not need to handle partial reads.

This pattern is used by tika-gui-v2 to drive its progress UI: the GUI starts a pipeline subprocess, points the reporter at a temp file, and polls that file every few hundred milliseconds.

Tradeoffs:

  • Smaller reportUpdateMs values mean more disk writes. On a fast SSD this is negligible, but on a slow disk (or NFS) the writer thread can become a bottleneck.

  • The reporter thread sleeps between writes, so the worst-case staleness of the file is reportUpdateMs milliseconds plus serialization time.

  • Per-record report() calls are cheap (counter increment only). The cost of "watching" is bounded by the periodic write, not by document throughput.
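A watcher along these lines can be very small. The following is a sketch under stated assumptions, not the tika-gui-v2 implementation: the function names, the max_polls safety limit, and the decision to stop on COMPLETED or CRASHED are illustrative choices. The JSON error handling is a defensive extra for files that are missing before the first write or otherwise unreadable on a given tick.

```python
import json
import time

TERMINAL_PHASES = {"COMPLETED", "CRASHED"}

def read_snapshot(path):
    """Return the parsed status file, or None if it is not readable yet."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return None  # try again on the next poll

def watch(path, poll_seconds=0.25, on_update=print, max_polls=None):
    """Poll the status file and call on_update whenever the snapshot changes."""
    last = None
    polls = 0
    while max_polls is None or polls < max_polls:
        snapshot = read_snapshot(path)
        if snapshot is not None and snapshot != last:
            on_update(snapshot)
            last = snapshot
            if snapshot.get("asyncStatus") in TERMINAL_PHASES:
                return snapshot  # pipeline finished or crashed
        polls += 1
        time.sleep(poll_seconds)
    return last
```

Polling faster than reportUpdateMs gains nothing, since the file only changes once per reporter tick. Matching the two intervals is usually enough.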

Security Notes

  • basePath is a sandbox boundary. The fetcher and emitter reject fetch/emit keys that resolve outside basePath. Do not set allowAbsolutePaths=true unless the source of fetch keys is fully trusted — an attacker-controlled fetch key could otherwise read arbitrary files.

  • Symlinks are followed. A symlink under basePath pointing outside basePath may still be readable. If you need strict containment, do not allow symlinks in your input tree.

  • Output directories are created automatically. The emitter creates intermediate directories as needed. Make sure the process’s umask is appropriate for the data being written.
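If strict containment matters, one option is to audit the input tree for escaping symlinks before starting a run. The script below is a possible pre-flight check, not something the plugin provides; the function name find_escaping_symlinks is hypothetical.

```python
import os

def find_escaping_symlinks(base_path):
    """List symlinks under base_path whose targets resolve outside it."""
    base = os.path.realpath(base_path)
    escaping = []
    # os.walk does not descend into symlinked directories by default,
    # but it still reports them in dirnames so they can be inspected.
    for dirpath, dirnames, filenames in os.walk(base):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                target = os.path.realpath(full)  # follow the link chain
                if os.path.commonpath([base, target]) != base:
                    escaping.append(os.path.relpath(full, base))
    return sorted(escaping)
```

An empty result means every symlink in the tree stays inside basePath at the time of the scan. Links created afterwards are not covered, so the check is only meaningful when the input tree is not writable by untrusted parties.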