Pipes Troubleshooting
This page covers diagnosing problems with the forked PipesServer processes
that Tika Pipes uses for per-document isolation. The most common symptom is a
forked process that dies during startup, or one that becomes unresponsive
mid-run.
When a forked server fails to start
The Tika parent process always logs the exit code of a failed fork. You will see something like:
ERROR clientId=2: Process exited with code 1 before connecting to socket
ERROR Shared server process exited with code 1 before becoming ready
For native JVM crashes (e.g. a segfault in a JNI parser), the JVM writes an
hs_err_pid<N>.log file. We direct that via -XX:ErrorFile= into the
manager’s per-server temp directory, then read it into the parent’s SLF4J
logger before cleanup:
ERROR clientId=2: JVM crash log hs_err_pid12345.log:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f...
...
So for native crashes, read the parent application’s log first — the hs_err contents are inlined there.
Child JVM stdout/stderr
By default the child PipesServer JVM inherits its stdout and stderr from
the parent. This is the 12-factor / container-friendly default: when Tika
runs in Docker or Kubernetes, the pipes-server’s log records flow through
to the container’s stdio stream where the runtime (Docker, containerd) and
any log aggregator (fluentd, fluent-bit, Promtail, the K8s log API, etc.)
pick them up automatically. The default pipes-fork-server-default-log4j2.xml
writes to SYSTEM_ERR, so inheritance is what makes those records visible
to your observability stack.
Telling fork lines from parent lines
Since the fork and parent share a single stdio stream, the bundled
pipes-fork-server-default-log4j2.xml pattern adds two orthogonal markers
so you can read the interleaved output:
- [fork]: present only on lines emitted by a forked PipesServer JVM. Lines from the parent process (PipesClient, AsyncProcessor, ConnectionHandler, tika-server, tika-grpc, etc.) do not carry this tag. The mechanism differs on each side: the fork has it injected via the bundled pattern’s literal [fork] token; the parent does not include it in its own log4j2/logback patterns.
- pipesClientId=N: the same value on both sides of a pair. The parent’s PipesClient #N always connects to the fork running with -DpipesClientId=N, so the same N enables correlation across the process boundary. Use it to gather every log line about one conversation, regardless of which side emitted it.
A typical interleaved snippet:
INFO [main] 14:23:45,123 [fork] pipesClientId=0 o.a.t.p.c.server.PipesServer received SHUT_DOWN
DEBUG [Thread-3] 14:23:45,124 o.a.t.p.c.async.AsyncProcessor pipesClientId=0, status=PARSE_SUCCESS
The first line is from inside fork 0 ([fork] present). The second is
the parent talking about fork 0 ([fork] absent, but the same client
id appears in the message body).
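The two markers are easy to filter on mechanically. The following is a minimal sketch, not part of Tika: the class name ForkLineClassifier and its methods are hypothetical, and it assumes the bundled pattern is in use so that the literal [fork] token and the pipesClientId=N text appear as shown above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading interleaved fork/parent output; not a Tika class.
public class ForkLineClassifier {
    // Literal tag that the bundled fork pattern adds to lines from a forked JVM.
    private static final String FORK_TAG = "[fork]";
    // Matches the client id on either side; the parent prints it in the message body.
    private static final Pattern CLIENT_ID = Pattern.compile("pipesClientId=(\\d+)");

    // True when the line carries the fork tag anywhere in its text.
    static boolean isForkLine(String line) {
        return line.contains(FORK_TAG);
    }

    // Extracts the first pipesClientId value on the line, if there is one.
    static Optional<Integer> clientId(String line) {
        Matcher m = CLIENT_ID.matcher(line);
        return m.find() ? Optional.of(Integer.parseInt(m.group(1))) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] lines = {
            "INFO [main] 14:23:45,123 [fork] pipesClientId=0 o.a.t.p.c.server.PipesServer received SHUT_DOWN",
            "DEBUG [Thread-3] 14:23:45,124 o.a.t.p.c.async.AsyncProcessor pipesClientId=0, status=PARSE_SUCCESS"
        };
        for (String line : lines) {
            String side = isForkLine(line) ? "fork" : "parent";
            String id = clientId(line).map(String::valueOf).orElse("?");
            System.out.println(side + " client=" + id);
        }
    }
}
```

A plain substring check is enough for a quick triage, but it would misclassify a parent line whose message body happens to contain the text [fork]; anchor the match to the position of the tag in your pattern if that matters.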
If you don’t want the pipes-server’s output interleaved with your own — e.g. an embedded use case where the parent is producing its own structured
stdout, or a test environment where you want a quieter console — set the
system property tika.pipes.server.stdio=discard on the parent JVM:
java -Dtika.pipes.server.stdio=discard -jar your-app.jar ...
With this set, the child’s stdout and stderr are routed to the null sink
and the pipes server’s log records are silently dropped at the OS level.
(Records written via SLF4J inside the child can still be captured by
configuring log4j2.xml / logback.xml to write to your own file or
network appender, independent of the stdio setting.)
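For readers who want to picture what the two modes mean at the process level, here is a minimal sketch using the standard ProcessBuilder API. It is an illustration under assumptions, not Tika's actual spawning code: the class ChildStdioSketch and its methods are hypothetical, and treating any value other than discard as inherit is a simplification made for this example.

```java
import java.lang.ProcessBuilder.Redirect;
import java.util.List;

// Hypothetical illustration of inherit vs. discard; not Tika's implementation.
public class ChildStdioSketch {
    static final String STDIO_PROPERTY = "tika.pipes.server.stdio";

    // Maps the property value to a redirect. Assumption: anything but "discard" means inherit.
    static Redirect redirectFor(String propertyValue) {
        return "discard".equals(propertyValue) ? Redirect.DISCARD : Redirect.INHERIT;
    }

    // Builds a ProcessBuilder whose stdout and stderr both follow the property.
    static ProcessBuilder configure(List<String> command) {
        Redirect redirect = redirectFor(System.getProperty(STDIO_PROPERTY));
        return new ProcessBuilder(command)
                .redirectOutput(redirect)
                .redirectError(redirect);
    }

    public static void main(String[] args) {
        ProcessBuilder pb = configure(List.of("java", "-version"));
        System.out.println("stdout -> " + pb.redirectOutput());
        System.out.println("stderr -> " + pb.redirectError());
    }
}
```

Redirect.DISCARD sends output to the operating system's null device, which is why records written to the console appender in the child cannot be recovered afterwards.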
Safety of the inherit default on Windows
Earlier versions of Tika hit a surefire hang on Windows when inheriting
child stdio: a forked child held a duplicate of the parent JVM’s stderr
handle, and any reader upstream of the parent (a maven-surefire controller,
typically) never saw EOF after the parent died — the child kept the pipe
open. That class of hang is now mitigated structurally: every child
PipesServer watches its parent’s process handle via
ProcessHandle.onExit() (see Parent-death detection) and self-
terminates within milliseconds of parent exit. The inherited handle is
released essentially synchronously with the parent’s death, and upstream
readers see EOF promptly.
Parent-death detection
The child PipesServer JVMs watch their parent’s PID via
ProcessHandle.onExit() and self-terminate within milliseconds if the
parent dies. The parent passes its own PID via the
TIKA_PIPES_PARENT_PID environment variable when spawning the child.
This matters because the parent (e.g. tika-server) can be killed in ways
that skip its JVM shutdown hooks — for instance,
Process.destroy() on Windows is equivalent to TerminateProcess, which
bypasses all hooks. Without parent-death detection, an orphaned PipesServer
would only notice via TCP RST on its next socket read, and would not
notice at all while busy in a parse, leaving it (and any external
subprocess it had spawned, such as a tesseract OCR worker) running
indefinitely.
When the watcher fires, the child exits via System.exit, which runs
AbstractExternalProcessParser’s shutdown hook and cleans up any
in-flight external subprocesses.
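The watcher pattern described here can be sketched with the standard ProcessHandle API. This is a simplified illustration, not the PipesServer source: the class ParentWatcherSketch, its method names, and the exit code 1 are assumptions made for the example; only the TIKA_PIPES_PARENT_PID variable and the use of ProcessHandle.onExit() come from the text above.

```java
import java.util.Optional;
import java.util.OptionalLong;

// Hypothetical sketch of parent-death detection; not the actual PipesServer code.
public class ParentWatcherSketch {

    // Parses the PID passed by the parent; empty when missing, malformed, or non-positive.
    static OptionalLong parseParentPid(String raw) {
        if (raw == null || raw.isBlank()) {
            return OptionalLong.empty();
        }
        try {
            long pid = Long.parseLong(raw.trim());
            return pid > 0 ? OptionalLong.of(pid) : OptionalLong.empty();
        } catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }

    // Runs onParentDeath when the given process exits, or immediately if it is already gone.
    static void watch(long parentPid, Runnable onParentDeath) {
        Optional<ProcessHandle> parent = ProcessHandle.of(parentPid);
        if (parent.isEmpty()) {
            onParentDeath.run();
            return;
        }
        parent.get().onExit().thenRun(onParentDeath);
    }

    public static void main(String[] args) {
        OptionalLong pid = parseParentPid(System.getenv("TIKA_PIPES_PARENT_PID"));
        if (pid.isPresent()) {
            // System.exit runs registered shutdown hooks; the exit code here is an arbitrary choice.
            watch(pid.getAsLong(), () -> System.exit(1));
        }
        // The server's normal work would continue here.
    }
}
```

Exiting through System.exit rather than Runtime.halt is the important design point, because only the former runs the shutdown hooks that clean up external subprocesses.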
Log levels and sensitive data
Tika Pipes treats FetchKey and EmitKey values as potentially sensitive — they typically contain file paths, URLs, object-store keys, or other identifiers
that may be private to the data owner. The convention across pipes core and the
bundled plugins is:
| Level | What is logged |
|---|---|
| WARN / ERROR | Failures, exceptions, and configuration problems. Never the literal FetchKey or EmitKey. |
| INFO | Lifecycle events: server start/stop, plugin start/stop, mode banners, restart events. Per-document or per-request lines have been demoted from INFO to DEBUG so production logs stay quiet. |
| DEBUG | Per-document progress and aggregated counts (e.g. …). |
| TRACE | Verbose per-fetch and per-emit detail, including the literal FetchKey and EmitKey. |
The fetcher and emitter SPIs (Fetcher.fetch, Emitter.emit,
StreamEmitter.emit) receive the literal key but not the tuple id, so
plugin code can only log the literal key. Keeping that at TRACE keeps it
out of any log destination that is configured at DEBUG or higher.
If you write your own fetcher or emitter plugin, please follow the same convention: literal keys at TRACE, everything else at DEBUG or above with no key in the message.
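As a rough illustration of that convention, the sketch below logs an outcome at DEBUG without the key and the key itself only at TRACE. It uses the JDK's built-in System.Logger so that it runs without dependencies; a real plugin would more likely log through SLF4J. The class KeyLoggingSketch and its methods are hypothetical and do not implement any Tika SPI.

```java
import java.lang.System.Logger;
import java.lang.System.Logger.Level;

// Hypothetical sketch of the key-logging convention; not a real Tika fetcher.
public class KeyLoggingSketch {
    private static final Logger LOG = System.getLogger(KeyLoggingSketch.class.getName());

    // DEBUG message: outcome and timing only, no key.
    static String debugMessage(boolean success, long elapsedMs) {
        return "fetch " + (success ? "succeeded" : "failed") + " in " + elapsedMs + " ms";
    }

    // TRACE message: the only place the literal key appears.
    static String traceMessage(String fetchKey, long elapsedMs) {
        return "fetched key=" + fetchKey + " in " + elapsedMs + " ms";
    }

    static void logFetch(String fetchKey, boolean success, long elapsedMs) {
        LOG.log(Level.DEBUG, debugMessage(success, elapsedMs));
        // The supplier form avoids building the string unless TRACE is enabled.
        LOG.log(Level.TRACE, () -> traceMessage(fetchKey, elapsedMs));
    }

    public static void main(String[] args) {
        logFetch("s3://example-bucket/private/report.pdf", true, 17);
    }
}
```

Keeping the two message builders separate makes it easy to check in a unit test that nothing above TRACE ever contains the key.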
Exception messages thrown out of a fetcher may still include
response-body bytes for HTTP-style fetchers (configurable via
maxErrMsgSize on HttpFetcherConfig). Those bytes appear in whatever
log catches the thrown exception. Lower maxErrMsgSize — or set it to
zero — if your responses can contain sensitive data.
Logging
Tika uses Log4j 2 for both tika-app and tika-server. Default output goes to SYSTEM_ERR with the pattern %-5p [%t] %d{HH:mm:ss,SSS} %c %m%n. Each forked PipesServer logs with its own line prefix so parent and child output stays distinguishable; see Telling fork lines from parent lines.
Default log4j2 configuration
Each distribution ships its own log4j2.xml at the root of the jar (i.e., the entry is just log4j2.xml, not under a package path):
- tika-app: bundled in tika-app-<version>.jar.
- tika-server: bundled in tika-server-standard-<version>.jar.
To inspect or extract the bundled config:
unzip -p tika-app-<version>.jar log4j2.xml
unzip -p tika-server-standard-<version>.jar log4j2.xml
Root level defaults to INFO. The bundled configurations are the source of truth — pull them out of the jar if you want to see exactly which loggers are tuned.
Changing the log level
In order of increasing reach:
- tika-app -v/--verbose: sets the root logger to DEBUG for the current invocation only. Cheapest knob if you just want a noisier one-off run.
- tika-server logLevel config field: set "server": {"logLevel": "debug"} (or "info") in tika-config.json. Applied at server startup.
- Custom log4j2.xml: for fine-grained control (per-logger levels, custom appenders, JSON output, file rotation), supply your own configuration via the standard Log4j 2 system property:
  java -Dlog4j.configurationFile=/path/to/my-log4j2.xml -jar tika-app.jar ...
  Your file overrides the bundled one entirely. Start from a copy of the bundled config and tighten or relax loggers from there.
Forked-process logging
Forked PipesServer JVMs write their log records to the stdout and stderr streams they inherit from the parent, unless tika.pipes.server.stdio=discard is set (in which case all child stdout/stderr is suppressed at the OS level; see Configuration knobs reference).
To debug a specific fork, leave stdio on inherit (the default) and grep parent log output for the pipesClientId=<n> marker that each fork includes.
Configuration knobs reference
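| System property / env var | Effect |
|---|---|
| tika.pipes.server.stdio | inherit (the default) passes the parent’s stdout and stderr to each child PipesServer; discard routes the child’s stdout and stderr to the null sink. |
| TIKA_PIPES_PARENT_PID | Set automatically by the parent manager when spawning a child PipesServer; carries the parent’s PID for parent-death detection. |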