Pipes Troubleshooting

This page covers diagnosing problems with the forked PipesServer processes that Tika Pipes uses for per-document isolation. The most common symptom is a forked process that dies during startup, or one that becomes unresponsive mid-run.

When a forked server fails to start

The Tika parent process always logs the exit code of a failed fork. You will see something like:

ERROR  clientId=2: Process exited with code 1 before connecting to socket
ERROR  Shared server process exited with code 1 before becoming ready

For native JVM crashes (e.g. a segfault in a JNI parser), the JVM writes an hs_err_pid<N>.log file. Tika uses -XX:ErrorFile= to direct that file into the manager’s per-server temp directory, then copies its contents into the parent’s SLF4J log before cleanup:

ERROR  clientId=2: JVM crash log hs_err_pid12345.log:
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f...
...

So for native crashes, read the parent application’s log first — the hs_err contents are inlined there.
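The collection step described above can be sketched roughly as follows. This is a minimal illustration, not Tika's actual code: the class name CrashLogCollector and its collect method are hypothetical, and it assumes the crash files land directly in one temp directory and follow the JVM's default hs_err_pid<N>.log naming.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika's API.
final class CrashLogCollector {

    /**
     * Returns one entry per hs_err_pid*.log file found directly in tmpDir.
     * Each entry is a header line naming the file, followed by its contents.
     * Directory iteration order is unspecified.
     */
    static List<String> collect(Path tmpDir) throws IOException {
        List<String> reports = new ArrayList<>();
        // The glob is matched against file names in tmpDir (no recursion).
        try (DirectoryStream<Path> files =
                     Files.newDirectoryStream(tmpDir, "hs_err_pid*.log")) {
            for (Path file : files) {
                String body = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                reports.add("JVM crash log " + file.getFileName() + ":\n" + body);
            }
        }
        return reports;
    }
}
```

A caller would pass each returned entry to its own logger at ERROR level before deleting the directory, which is what makes the crash details visible in the parent's log.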

Child JVM stdout/stderr

By default the child PipesServer JVM inherits its stdout and stderr from the parent. This is the 12-factor / container-friendly default: when Tika runs in Docker or Kubernetes, the pipes-server’s log records flow through to the container’s stdio stream where the runtime (Docker, containerd) and any log aggregator (fluentd, fluent-bit, Promtail, the K8s log API, etc.) pick them up automatically. The default pipes-fork-server-default-log4j2.xml writes to SYSTEM_ERR, so inheritance is what makes those records visible to your observability stack.

Telling fork lines from parent lines

Since the fork and parent share a single stdio stream, the bundled pipes-fork-server-default-log4j2.xml pattern adds two orthogonal markers so you can read the interleaved output:

  • [fork] — present only on lines emitted by a forked PipesServer JVM. Lines from the parent process (PipesClient, AsyncProcessor, ConnectionHandler, tika-server, tika-grpc, etc.) do not carry this tag. The mechanism differs on each side: the fork gets the tag from the literal [fork] token in the bundled pattern, while the parent’s own log4j2/logback patterns simply do not include it.

  • pipesClientId=N — the same value on both sides of a pair. The parent’s PipesClient #N always connects to the fork running with -DpipesClientId=N, so the same N enables correlation across the process boundary. Use it to gather every log line about one conversation, regardless of which side emitted them.

A typical interleaved snippet:

INFO  [main] 14:23:45,123 [fork] pipesClientId=0 o.a.t.p.c.server.PipesServer received SHUT_DOWN
DEBUG [Thread-3] 14:23:45,124 o.a.t.p.c.async.AsyncProcessor pipesClientId=0, status=PARSE_SUCCESS

The first line is from inside fork 0 ([fork] present). The second is the parent talking about fork 0 ([fork] absent, but the same client id appears in the message body).
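If you post-process captured output, the two markers can be applied mechanically. The sketch below is one possible approach under stated assumptions: ForkLogFilter and its methods are hypothetical names, it treats any line containing the literal text [fork] as fork output (so a message body that happens to contain that text would be misclassified), and it looks for pipesClientId=N anywhere in the line.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical post-processing helper; not part of Tika.
final class ForkLogFilter {

    // Captures the digits following "pipesClientId=" anywhere in a line.
    private static final Pattern CLIENT_ID = Pattern.compile("pipesClientId=(\\d+)");

    /** True if the line carries the [fork] tag added by the bundled fork pattern. */
    static boolean isForkLine(String line) {
        return line.contains("[fork]");
    }

    /** True if the line mentions exactly the given client id (id 1 does not match 10). */
    static boolean mentionsClient(String line, int clientId) {
        String wanted = String.valueOf(clientId);
        Matcher m = CLIENT_ID.matcher(line);
        while (m.find()) {
            if (m.group(1).equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    /** Returns every line about one parent/fork pair, from either side, in input order. */
    static List<String> conversation(List<String> lines, int clientId) {
        return lines.stream()
                .filter(line -> mentionsClient(line, clientId))
                .collect(Collectors.toList());
    }
}
```

Applied to the two sample lines above, conversation(lines, 0) keeps both, and isForkLine distinguishes the fork's line from the parent's.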

If you don’t want the pipes-server’s output interleaved with your own — e.g. an embedded use case where the parent is producing its own structured stdout, or a test environment where you want a quieter console — set the system property tika.pipes.server.stdio=discard on the parent JVM:

java -Dtika.pipes.server.stdio=discard -jar your-app.jar ...

With this set, the child’s stdout and stderr are routed to the null sink and the pipes server’s log records are silently dropped at the OS level. (Records written via SLF4J inside the child can still be captured by configuring log4j2.xml / logback.xml to write to your own file or network appender, independent of the stdio setting.)
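For readers wondering what inherit versus discard means at the process level, here is a minimal sketch using the standard ProcessBuilder API. It is an illustration, not Tika's implementation: ChildStdio is a hypothetical class, and it assumes the property value is compared exactly and case-sensitively against discard.

```java
import java.lang.ProcessBuilder.Redirect;

// Hypothetical illustration of the inherit/discard choice; not Tika's code.
final class ChildStdio {

    /**
     * Maps the property value to a redirect: "discard" sends output to the
     * null sink, anything else (including null, i.e. unset) inherits the
     * parent's stream. Exact, case-sensitive matching is an assumption.
     */
    static Redirect redirectFor(String propertyValue) {
        return "discard".equals(propertyValue) ? Redirect.DISCARD : Redirect.INHERIT;
    }

    /** Applies the same redirect to both stdout and stderr of the child. */
    static ProcessBuilder configure(ProcessBuilder builder) {
        Redirect redirect = redirectFor(System.getProperty("tika.pipes.server.stdio"));
        return builder.redirectOutput(redirect).redirectError(redirect);
    }
}
```

Redirect.DISCARD requires Java 9 or later. With it, the child's writes are dropped by the operating system, so no parent-side thread is needed to drain a pipe.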

Safety of the inherit default on Windows

Earlier versions of Tika hit a surefire hang on Windows when inheriting child stdio: a forked child held a duplicate of the parent JVM’s stderr handle, so any reader upstream of the parent (typically a maven-surefire controller) never saw EOF after the parent died, because the child kept the pipe open. That class of hang is now mitigated structurally: every child PipesServer watches its parent’s process handle via ProcessHandle.onExit() (see Parent-death detection) and self-terminates within milliseconds of parent exit. The inherited handle is released essentially synchronously with the parent’s death, and upstream readers see EOF promptly.

Parent-death detection

The child PipesServer JVMs watch their parent’s PID via ProcessHandle.onExit() and self-terminate within milliseconds if the parent dies. The parent passes its own PID via the TIKA_PIPES_PARENT_PID environment variable when spawning the child.

This matters because the parent (e.g. tika-server) can be killed in ways that skip its JVM shutdown hooks — for instance, Process.destroy() on Windows is equivalent to TerminateProcess, which bypasses all hooks. Without parent-death detection, an orphaned PipesServer would only notice via TCP RST on its next socket read, and would not notice at all while busy in a parse, leaving it (and any external subprocess it had spawned, such as a tesseract OCR worker) running indefinitely.

When the watcher fires, the child exits via System.exit, which runs AbstractExternalProcessParser’s shutdown hook and cleans up any in-flight external subprocesses.
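The general shape of such a watcher, using only the standard ProcessHandle API, might look like the sketch below. ParentWatcher and its methods are hypothetical names, and the handling of a missing or unparsable PID (skip the watch) follows the behaviour described on this page rather than any inspection of Tika's source.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of a parent-death watcher; not Tika's implementation.
final class ParentWatcher {

    /** Parses the PID text; empty if it is null, blank, or not a number. */
    static Optional<Long> parseParentPid(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Long.parseLong(raw.trim()));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    /**
     * Installs a callback that runs when the process with the given PID exits.
     * Returns false (and installs nothing) if the PID is missing, unparsable,
     * or does not refer to a live process. A real implementation would
     * probably treat "parent already gone" as a reason to exit immediately.
     */
    static boolean watch(String rawPid, Runnable onParentExit) {
        Optional<ProcessHandle> parent = parseParentPid(rawPid).flatMap(ProcessHandle::of);
        if (!parent.isPresent()) {
            return false;
        }
        // onExit() completes asynchronously once the watched process terminates.
        CompletableFuture<ProcessHandle> exited = parent.get().onExit();
        exited.thenRun(onParentExit);
        return true;
    }
}
```

A child could call ParentWatcher.watch(System.getenv("TIKA_PIPES_PARENT_PID"), () -> System.exit(1)) early in startup; the exit code 1 here is an arbitrary choice for the example. Exiting through System.exit rather than halting is what lets shutdown hooks run.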

Log levels and sensitive data

Tika Pipes treats FetchKey and EmitKey values as potentially sensitive — they typically contain file paths, URLs, object-store keys, or other identifiers that may be private to the data owner. The convention across pipes core and the bundled plugins is:

What is logged at each level:

  • ERROR / WARN: Failures, exceptions, and configuration problems. Never the literal fetchKey/emitKey or any file content. When a failure refers to a specific document, it is identified by the non-sensitive FetchEmitTuple.id (e.g. parse exception: id=abc-123).

  • INFO: Lifecycle events such as server start/stop, plugin start/stop, mode banners, and restart events. Per-document or per-request lines have been demoted from INFO to DEBUG so production logs stay quiet.

  • DEBUG: Per-document progress and aggregated counts (e.g. pipesClientId=2, status=PARSE_SUCCESS, successfully emitted N docs). Safe to enable in production for troubleshooting; correlation is by FetchEmitTuple.id only.

  • TRACE: Verbose per-fetch and per-emit detail including the literal fetchKey/emitKey (URL, S3 key, blob path, etc.). Enable only when you need to correlate a Tika log line back to a specific resource, and accept that those keys will appear in the log destination.

The fetcher and emitter SPIs (Fetcher.fetch, Emitter.emit, StreamEmitter.emit) receive the literal key but not the tuple id, so plugin code can only log the literal key. Keeping that at TRACE keeps it out of any log destination that is configured at DEBUG or a less verbose level.

If you write your own fetcher or emitter plugin, please follow the same convention: literal keys at TRACE, everything else at DEBUG or above with no key in the message.
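As a rough illustration of that convention in plugin code, the sketch below logs the literal key only at TRACE and keeps the DEBUG message key-free. To stay self-contained it uses the JDK's java.lang.System.Logger as a stand-in for SLF4J; the class ExampleFetcherLogging, its methods, and the message wording are all hypothetical.

```java
import java.lang.System.Logger;
import java.lang.System.Logger.Level;

// Hypothetical plugin-side logging helper; System.Logger stands in for SLF4J.
final class ExampleFetcherLogging {

    private static final Logger LOG =
            System.getLogger(ExampleFetcherLogging.class.getName());

    /** TRACE-only message: the one place the literal key may appear. */
    static String keyMessage(String fetchKey, long bytes) {
        return "fetched fetchKey=" + fetchKey + ", bytes=" + bytes;
    }

    /** DEBUG message: progress detail with no key in it. */
    static String summaryMessage(long bytes) {
        return "fetch complete, bytes=" + bytes;
    }

    static void logFetch(String fetchKey, long bytes) {
        // Supplier form: the key-bearing string is only built if TRACE is enabled.
        LOG.log(Level.TRACE, () -> keyMessage(fetchKey, bytes));
        LOG.log(Level.DEBUG, () -> summaryMessage(bytes));
    }
}
```

Keeping the key-bearing text in a single method makes it easier to audit that nothing above TRACE can leak a key.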

Exception messages thrown out of a fetcher may still include response-body bytes for HTTP-style fetchers (configurable via maxErrMsgSize on HttpFetcherConfig). Those bytes appear in whatever log catches the thrown exception. Lower maxErrMsgSize — or set it to zero — if your responses can contain sensitive data.

Logging

Tika uses Log4j 2 for both tika-app and tika-server. Default output goes to SYSTEM_ERR with the pattern %-5p [%t] %d{HH:mm:ss,SSS} %c %m%n. Each forked PipesServer logs with its own line prefix so parent and child output stays distinguishable; see Telling fork lines from parent lines.

Default log4j2 configuration

Each distribution ships its own log4j2.xml at the root of the jar (i.e., the entry is just log4j2.xml, not under a package path):

  • tika-app: bundled in tika-app-<version>.jar.

  • tika-server: bundled in tika-server-standard-<version>.jar.

To inspect or extract the bundled config:

unzip -p tika-app-<version>.jar log4j2.xml
unzip -p tika-server-standard-<version>.jar log4j2.xml

Root level defaults to INFO. The bundled configurations are the source of truth — pull them out of the jar if you want to see exactly which loggers are tuned.

Changing the log level

In order of increasing reach:

  1. tika-app -v / --verbose — sets the root logger to DEBUG for the current invocation only. Cheapest knob if you just want a noisier one-off run.

  2. tika-server logLevel config field — set "server": {"logLevel": "debug"} (or "info") in tika-config.json. Applied at server startup.

  3. Custom log4j2.xml — for fine-grained control (per-logger levels, custom appenders, JSON output, file rotation), supply your own configuration via the standard Log4j 2 system property:

    java -Dlog4j.configurationFile=/path/to/my-log4j2.xml -jar tika-app.jar ...

    Your file overrides the bundled one entirely. Start from a copy of the bundled config and tighten or relax loggers from there.

Forked-process logging

Forked PipesServer JVMs inherit the parent’s log4j2 configuration unless tika.pipes.server.stdio=discard is set (in which case all child stdout/stderr is suppressed at the OS level — see Configuration knobs reference).

To debug a specific fork, leave stdio on inherit (the default) and grep parent log output for the pipesClientId=<n> marker that each fork includes.

Configuration knobs reference

Each system property or environment variable and its effect:

  • tika.pipes.server.stdio (system property): discard suppresses child stdout/stderr at the OS level. Anything else (or unset) inherits the child’s stdio from the parent JVM. Default: inherit.

  • TIKA_PIPES_PARENT_PID (env var): Set automatically by the parent manager when spawning a PipesServer child. The child uses it to watch its parent and self-terminate if the parent dies. Not normally set by users; if you launch PipesServer standalone (outside the normal manager flow) and leave it unset, the parent-watch is simply skipped.