File System Plugin

The File System plugin (tika-pipes-file-system) is the most common starting point for Tika Pipes. It provides all four interfaces — fetcher, emitter, iterator, and reporter — backed by the local (or mounted) filesystem.

| Interface | Component name | Class |
| --- | --- | --- |
| Fetcher | file-system-fetcher | FileSystemFetcher |
| Emitter | file-system-emitter | FileSystemEmitter |
| Iterator | file-system-pipes-iterator | FileSystemPipesIterator |
| Reporter | file-system-reporter | FileSystemStatusReporter |

Complete Pipeline Example

The example below is the canonical filesystem-to-filesystem integration test config. Tokens such as FETCHER_BASE_PATH, EMITTER_BASE_PATH, PLUGINS_PATHS, and EMIT_INTERMEDIATE_RESULTS are placeholders that the test harness substitutes. To use the config yourself, replace the path tokens with real paths, and replace "EMIT_INTERMEDIATE_RESULTS" (quotes included) with the JSON boolean true or false. See Pipes Configuration for what each setting does.

{
  "content-handler-factory": {
    "basic-content-handler-factory": {
      "type": "TEXT",
      "writeLimit": -1,
      "throwOnWriteLimitReached": true
    }
  },
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "FETCHER_BASE_PATH",
        "extractFileSystemMetadata": false
      }
    }
  },
  "emitters": {
    "fse": {
      "file-system-emitter": {
        "basePath": "EMITTER_BASE_PATH",
        "fileExtension": "json",
        "onExists": "EXCEPTION"
      }
    }
  },
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "FETCHER_BASE_PATH",
      "countTotal": true,
      "fetcherId": "fsf",
      "emitterId": "fse"
    }
  },
  "pipes": {
    "parseMode": "RMETA",
    "onParseException": "EMIT",
    "numClients": 4,
    "emitIntermediateResults": "EMIT_INTERMEDIATE_RESULTS",
    "forkedJvmArgs": ["-Xmx512m"],
    "emitStrategy": {
      "type": "DYNAMIC",
      "thresholdBytes": 1000000
    }
  },
  "auto-detect-parser": {
    "throwOnZeroBytes": false
  },
  "parse-context": {
    "mock-digester-factory": {},
    "timeout-limits": {
      "progressTimeoutMillis": 5000
    }
  },
  "plugin-roots": "PLUGINS_PATHS"
}
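The placeholder substitution described above can be sketched in a few lines of Python. This is only an illustration, not the test harness itself: the fill_placeholders helper, the example paths, and the plugin directory are all assumptions, and the template is abbreviated to three of the settings shown above. Substituting on the parsed structure rather than on raw text means the emitIntermediateResults value ends up as a real JSON boolean instead of a quoted string.

```python
import json


def fill_placeholders(node, values):
    """Recursively replace string values that exactly match a placeholder token."""
    if isinstance(node, dict):
        return {k: fill_placeholders(v, values) for k, v in node.items()}
    if isinstance(node, list):
        return [fill_placeholders(v, values) for v in node]
    if isinstance(node, str) and node in values:
        return values[node]
    return node


# Abbreviated template: only a few of the keys from the full example.
template = json.loads("""
{
  "fetchers": {"fsf": {"file-system-fetcher": {"basePath": "FETCHER_BASE_PATH"}}},
  "pipes": {"emitIntermediateResults": "EMIT_INTERMEDIATE_RESULTS"},
  "plugin-roots": "PLUGINS_PATHS"
}
""")

config = fill_placeholders(template, {
    "FETCHER_BASE_PATH": "/data/input",     # example path, adjust to your setup
    "EMITTER_BASE_PATH": "/data/output",    # example path (unused in this short template)
    "PLUGINS_PATHS": "/opt/tika/plugins",   # hypothetical plugin directory
    "EMIT_INTERMEDIATE_RESULTS": False,     # serialized as JSON false, not "false"
})

print(json.dumps(config, indent=2))
```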

File System Fetcher (file-system-fetcher)

Reads files from a local or mounted filesystem. Fetch keys are resolved relative to basePath.

{
  "fetchers": {
    "fsf": {
      "file-system-fetcher": {
        "basePath": "/data/input",
        "extractFileSystemMetadata": true
      }
    }
  }
}

The outer key (fsf) is the fetcher ID. The pipes-iterator section refers to it through its fetcherId field.

Configuration

| Field | Default | Description |
| --- | --- | --- |
| basePath | required | Base directory for fetch operations. Fetch keys are resolved relative to this path. |
| extractFileSystemMetadata | false | When true, attach file size, created, and modified timestamps to the metadata of each fetched document. |
| allowAbsolutePaths | false | When true, fetch keys may be absolute paths and basePath may be omitted. Use sparingly; see Security Notes. |

File System Emitter (file-system-emitter)

Writes parsed results as files under basePath. The relative output path is derived from the emit key of each FetchEmitTuple.

{
  "emitters": {
    "fse": {
      "file-system-emitter": {
        "basePath": "/data/output",
        "fileExtension": "json",
        "onExists": "EXCEPTION",
        "prettyPrint": false
      }
    }
  }
}

Configuration

| Field | Default | Description |
| --- | --- | --- |
| basePath | required | Base output directory. The emit key is resolved relative to this path. |
| fileExtension | json | Extension appended to each output file. For CONTENT_ONLY mode, set this to match the handler type (txt, html, md, xml). |
| onExists | EXCEPTION | Behavior when the output file already exists: SKIP (do nothing), REPLACE (overwrite), EXCEPTION (fail loudly). |
| prettyPrint | false | Pretty-print JSON output. Has no effect in CONTENT_ONLY mode (raw bytes are written). |
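To make the three onExists behaviors concrete, here is a minimal Python sketch of the decision an emitter has to make before writing. It is not the FileSystemEmitter code. The emit function is hypothetical, and joining the emit key and extension with a dot is an assumption about how the extension is appended.

```python
import os


def emit(base_path, emit_key, payload, file_extension="json", on_exists="EXCEPTION"):
    """Write payload under base_path; return the path written, or None if skipped."""
    if on_exists not in ("SKIP", "REPLACE", "EXCEPTION"):
        raise ValueError(f"unknown onExists value: {on_exists}")
    # Assumption: the extension is appended to the emit key after a dot.
    target = os.path.join(base_path, f"{emit_key}.{file_extension}")
    if os.path.exists(target):
        if on_exists == "SKIP":
            return None                      # leave the existing file untouched
        if on_exists == "EXCEPTION":
            raise FileExistsError(target)    # fail loudly
        # REPLACE falls through and overwrites the file below.
    os.makedirs(os.path.dirname(target), exist_ok=True)  # create intermediate dirs
    with open(target, "w", encoding="utf-8") as out:
        out.write(payload)
    return target
```

With the default EXCEPTION setting, re-running a pipeline into a non-empty output directory fails on the first file that already exists, which is usually what you want for a one-shot batch. SKIP suits resumable runs, and REPLACE suits reprocessing.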

File System Iterator (file-system-pipes-iterator)

Recursively walks a directory tree, emitting one FetchEmitTuple per file found.

{
  "pipes-iterator": {
    "file-system-pipes-iterator": {
      "basePath": "/data/input",
      "countTotal": true,
      "fetcherId": "fsf",
      "emitterId": "fse"
    }
  }
}

Configuration

| Field | Default | Description |
| --- | --- | --- |
| basePath | required | Root directory to walk. |
| countTotal | true | If true, walks the tree once to count files before processing begins. Enables progress reporting at the cost of an extra scan over the tree. |
| fetcherId / emitterId | required | IDs of the fetcher and emitter to bind to each emitted tuple. See Pipes Iterators for the shared iterator contract. |

Notes

  • Walk order is filesystem-dependent and not guaranteed stable across runs.

  • The relative path of each file (from basePath) becomes the fetch key, and by default also the emit key.

  • Symbolic links are followed.
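The notes above can be illustrated with a short Python sketch of a recursive walk that uses each file's path relative to basePath as both the fetch key and the emit key. This is not the FileSystemPipesIterator implementation. The function names and the dictionary field names (fetchKey, emitKey) are hypothetical stand-ins for a FetchEmitTuple.

```python
import os


def walk_tuples(base_path, fetcher_id, emitter_id):
    """Yield one dict per file, keyed by its path relative to base_path."""
    # followlinks=True mirrors "symbolic links are followed"; note that a
    # symlink cycle would make this walk loop, so keep input trees acyclic.
    for dirpath, _dirnames, filenames in os.walk(base_path, followlinks=True):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), base_path)
            key = rel.replace(os.sep, "/")
            # By default the emit key is the same as the fetch key.
            yield {"fetcherId": fetcher_id, "fetchKey": key,
                   "emitterId": emitter_id, "emitKey": key}


def count_total(base_path):
    """The extra pre-scan implied by countTotal=true: walk once and only count."""
    return sum(len(files) for _d, _s, files in os.walk(base_path, followlinks=True))
```

Because the order in which os.walk visits entries also depends on the filesystem, anything that compares runs should sort the keys first rather than rely on walk order.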

File System Reporter (file-system-reporter)

Maintains a JSON status file that summarizes pipeline progress. The reporter writes the file periodically on a background thread; per-record report() calls only update in-memory counters.

{
  "pipes-reporters": {
    "file-system-reporter": {
      "statusFile": "/var/log/tika/status.json",
      "reportUpdateMs": 1000
    }
  }
}

pipes-reporters accepts multiple reporters keyed by type name — see Pipes Reporters for how multiple reporters compose.

Configuration

| Field | Default | Description |
| --- | --- | --- |
| statusFile | required | Path of the JSON status file. Absolute paths are written as given; relative paths resolve against the JVM’s working directory at startup. Parent directories that don’t exist are created automatically on first write. Always include a parent component (e.g., ./status.json rather than bare status.json), because the auto-create step fails on a path with no parent. The file is created on first write and overwritten in place. |
| reportUpdateMs | no default | Interval in milliseconds between status-file writes. Typical values: 1000 for a low-overhead heartbeat, 100 for near-real-time updates. There is no built-in default, so always set this explicitly. |
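If you generate reporter configs from a script, the two pitfalls in the table (a bare file name with no parent component, and a missing reportUpdateMs) are easy to guard against. The sketch below is one way to do that in Python. The with_parent helper is hypothetical and not part of Tika.

```python
import json
import os


def with_parent(status_file):
    """Return status_file unchanged if it has a parent component, else prefix './'."""
    if os.path.dirname(status_file) == "":
        return os.path.join(".", status_file)
    return status_file


reporter_config = {
    "pipes-reporters": {
        "file-system-reporter": {
            "statusFile": with_parent("status.json"),
            "reportUpdateMs": 1000,  # no built-in default, so always set it
        }
    }
}

print(json.dumps(reporter_config, indent=2))
```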

Status file schema

The reporter serializes an AsyncStatus object to JSON, containing:

  • started — ISO-8601 timestamp of when the reporter was constructed.

  • lastUpdate — ISO-8601 timestamp of the most recent write.

  • asyncStatus — current pipeline phase (STARTED, COMPLETED, CRASHED).

  • statusCounts — map of PipesResult.RESULT_STATUS to count (e.g., PARSE_SUCCESS, PARSE_EXCEPTION, TIMEOUT, OOM, EMIT_SUCCESS, EMIT_EXCEPTION).

  • totalCountResult — total documents discovered by the iterator and whether the enumeration is complete.

  • crashMessage — empty string under normal operation; populated with a stack trace on fatal pipeline failure.

The file is rewritten in full on each tick, not appended.

The write is not atomic — the reporter opens the target path with Files.newBufferedWriter, truncates, and streams the JSON. A watcher reading concurrently with a write can observe a truncated or partial document. Have the watcher treat a parse error as "stale read, try again on the next poll" rather than as a real error.

Live status for watching applications

The reporter is designed to support external "watchers" — UIs, dashboards, or monitoring scripts that poll the status file to display pipeline progress. To use it that way, set reportUpdateMs to match your desired refresh rate:

"reportUpdateMs": 250

The watcher polls statusFile on its own interval and reads the most recent snapshot. Each tick rewrites the file in full, so every completed write is a self-contained snapshot rather than a delta. Because the write is not atomic, however, a watcher that reads mid-write can see a truncated document. Treat JSON parse errors as transient and retry on the next poll (see the discussion of non-atomic writes under Status file schema).

This pattern is used by tika-gui-v2 to drive its progress UI: the GUI starts a pipeline subprocess, points the reporter at a temp file, and polls that file every few hundred milliseconds.

Tradeoffs:

  • Smaller reportUpdateMs values mean more disk writes. On a fast SSD this is negligible, but on a slow disk (or NFS) the writer thread can become a bottleneck.

  • The reporter thread sleeps between writes, so the worst-case staleness of the file is reportUpdateMs milliseconds plus serialization time.

  • Per-record report() calls are cheap (counter increment only). The cost of "watching" is bounded by the periodic write, not by document throughput.
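A watcher along these lines can be sketched in Python as follows. It is a minimal illustration under stated assumptions, not the tika-gui-v2 code: the function names, the default poll interval, and the on_update callback are all hypothetical, and the only fields it reads are asyncStatus and statusCounts from the schema above. A missing or half-written file is treated as a stale read and retried on the next poll.

```python
import json
import time


def read_snapshot(path):
    """Return the parsed status document, or None if it is missing or mid-write."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except (FileNotFoundError, ValueError):
        # ValueError covers json.JSONDecodeError and truncated UTF-8 sequences.
        return None


def watch(path, poll_seconds=0.25, max_polls=None, on_update=print):
    """Poll until asyncStatus is COMPLETED or CRASHED; return the last good snapshot."""
    last = None
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        snapshot = read_snapshot(path)
        if snapshot is not None:
            if snapshot != last:
                on_update(snapshot.get("statusCounts", {}))
            last = snapshot
            if snapshot.get("asyncStatus") in ("COMPLETED", "CRASHED"):
                break
        time.sleep(poll_seconds)
    return last
```

Calling watch("/var/log/tika/status.json") blocks until the pipeline reports a terminal phase. Polling faster than reportUpdateMs gains nothing, since the file only changes once per tick.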

Security Notes

  • basePath is a sandbox boundary. The fetcher and emitter reject fetch/emit keys that resolve outside basePath. Do not set allowAbsolutePaths=true unless the source of fetch keys is fully trusted — an attacker-controlled fetch key could otherwise read arbitrary files.

  • Symlinks are followed. A symlink under basePath pointing outside basePath may still be readable. If you need strict containment, do not allow symlinks in your input tree.

  • Output directories are created automatically. The emitter creates intermediate directories as needed. Make sure the process’s umask is appropriate for the data being written.
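The sandbox idea in the first bullet can be illustrated with a small Python sketch of a containment check. This is not how Tika implements it. The resolve_key function is hypothetical, and its handling of allow_absolute_paths is an assumption about the general idea only. The check is purely lexical: it normalizes ".." segments but does not resolve symlinks, which is exactly the gap the second bullet warns about.

```python
import os


def resolve_key(base_path, key, allow_absolute_paths=False):
    """Resolve key under base_path, rejecting keys that escape it (lexical check)."""
    if os.path.isabs(key):
        if allow_absolute_paths:
            return os.path.normpath(key)   # trusted callers only
        raise ValueError(f"absolute key not allowed: {key}")
    base = os.path.abspath(base_path)      # abspath also normalizes the path
    target = os.path.normpath(os.path.join(base, key))
    # If the common prefix is not base itself, the key climbed out via "..".
    if os.path.commonpath([base, target]) != base:
        raise ValueError(f"key escapes basePath: {key}")
    return target
```

For example, resolve_key("/data/input", "reports/q1.pdf") returns a path under /data/input, while resolve_key("/data/input", "../../etc/passwd") raises ValueError.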