Tika gRPC
This section covers using Apache Tika via gRPC.
Overview
Tika gRPC provides a high-performance interface for parsing documents over gRPC. It is useful in microservice architectures and polyglot environments, where clients are written in several languages.
The service definition lives in tika-grpc/src/main/proto/tika.proto. Clients
first register a fetcher (SaveFetcher) and then submit FetchAndParseRequest
messages. The server answers each request with a FetchAndParseReply that
carries the extracted metadata and content.
Per-Request ParseContext
FetchAndParseRequest.parse_context_json lets the caller override the
server’s default ParseContext on a per-request basis. Keys are
parse-context component names; values are their JSON configs.
{
  "basic-content-handler-factory": {"type": "HTML"},
  "timeout-limits": {"progressTimeoutMillis": 30000}
}
See META-INF/tika/parse-context.idx (generated at build time from
@TikaComponent annotations) for the available component names.
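
Because parse_context_json is carried as a JSON string, a client typically builds the overrides as a map and serializes it before filling in the request. The following is a minimal Python sketch of that step only. The helper name build_parse_context_json is hypothetical and not part of Tika, the component names and settings are copied from the example above, and the code that attaches the string to a FetchAndParseRequest is left out because it depends on the stubs generated from tika.proto.

```python
import json


def build_parse_context_json(overrides):
    """Serialize per-request ParseContext overrides to a JSON string.

    `overrides` maps parse-context component names to their JSON configs.
    Hypothetical helper for illustration; it is not part of Tika.
    """
    if not isinstance(overrides, dict):
        raise TypeError("overrides must map component names to configs")
    for name in overrides:
        # JSON object keys must be strings; reject empty names early.
        if not isinstance(name, str) or not name:
            raise ValueError("component names must be non-empty strings")
    # json.dumps raises TypeError if a config is not JSON-serializable.
    return json.dumps(overrides)


# Component names and settings copied from the example above.
parse_context_json = build_parse_context_json({
    "basic-content-handler-factory": {"type": "HTML"},
    "timeout-limits": {"progressTimeoutMillis": 30000},
})
print(parse_context_json)
```

The resulting string would then be assigned to the parse_context_json field of the request. Validating names against parse-context.idx is not attempted here, so a misspelled component name would only surface on the server side.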