Kristian Rickert created TIKA-4771:
--------------------------------------
Summary: Pluggable external parsers over gRPC: attach plugin-owned
results to Document via a typed Any envelope
Key: TIKA-4771
URL: https://issues.apache.org/jira/browse/TIKA-4771
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Kristian Rickert
Follow-up to TIKA-4766 (PR #2921). Design proposal – feedback wanted on the
open questions below before code.
h3. Problem
The typed {{Document}} contract deliberately does not model format- or
domain-specific result shapes (a document-layout model's tree, NLP annotations,
embeddings). Downstream projects want to attach such results to a Tika parse
without Tika ever having to model, depend on, or even load their types.
h3. Proposal
A third party implements a small {{ExternalParser}} gRPC service and registers
it with the Tika gRPC server. For each document whose content type matches the
registration, the server calls the plugin and appends its result to the
{{Document}}. The result rides in an envelope that wraps a
{{google.protobuf.Any}}:
{code:java}
message ExtensionResult {
  // plugin-owned message type; Tika never models it
  google.protobuf.Any payload = 1;
  string plugin_id = 2; // which registration produced this
  repeated string warnings = 3;
  int64 call_time_ms = 4;
  // future (additive, non-breaking): schema/descriptor reference, links into
  // Document blocks, common cross-plugin metadata
}

// on the Document from TIKA-4766:
repeated ExtensionResult extensions = 40;
{code}
Wrapping the {{Any}} (rather than a bare {{repeated Any}}) means envelope
metadata can be added later as {{optional}} fields without another contract
change.
This extends a pattern tika-grpc already has: the server currently brokers
registered fetchers, emitters, and pipes iterators; this applies the same
registration-and-routing model to parse-time enrichment.
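As a rough illustration of that registration-and-routing model, here is a minimal plain-Java sketch with no gRPC or protobuf dependency. Every name in it ({{Registration}}, {{PluginRegistry}}, {{matching}}, the targets and plugin ids) is hypothetical and not part of tika-grpc, and content-type matching is reduced to an exact string comparison for brevity.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registration record: plugin id, gRPC target, handled content types.
record Registration(String pluginId, String target, List<String> contentTypes) {}

// Hypothetical in-memory registry standing in for the server-side broker.
final class PluginRegistry {
    private final Map<String, Registration> byId = new ConcurrentHashMap<>();

    void register(Registration r) {
        byId.put(r.pluginId(), r);
    }

    boolean unregister(String pluginId) {
        return byId.remove(pluginId) != null;
    }

    // Every registration whose content types include the document's type,
    // sorted by plugin id so results would be appended in a stable order.
    List<Registration> matching(String contentType) {
        List<Registration> hits = new ArrayList<>();
        for (Registration r : byId.values()) {
            if (r.contentTypes().contains(contentType)) {
                hits.add(r);
            }
        }
        hits.sort(Comparator.comparing(Registration::pluginId));
        return hits;
    }
}

final class RoutingDemo {
    public static void main(String[] args) {
        PluginRegistry registry = new PluginRegistry();
        registry.register(new Registration("layout", "localhost:50061",
                List.of("application/pdf")));
        registry.register(new Registration("nlp", "localhost:50062",
                List.of("application/pdf", "text/plain")));
        for (Registration r : registry.matching("application/pdf")) {
            System.out.println("would call " + r.pluginId() + " at " + r.target());
        }
    }
}
```

A real implementation would presumably match on media-type hierarchies rather than exact strings, and would sit behind the same management gate as fetchers and emitters.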
h3. Sync and streaming plugin modes
This is where gRPC can shine and really speed up indexing pipelines. The
{{ExternalParser}} service offers
{{rpc Parse(...) returns (ExternalParseReply)}} (unary, required) and
{{rpc ParseStream(...) returns (stream ExternalParseReply)}} (optional). The
registration record declares which modes the plugin supports; plugin authors
pick their programming model, and Tika folds either into one or more
{{ExtensionResult}} entries. Under the document event stream (separate
follow-up proposal, ticket forthcoming), streamed plugin results are forwarded
to clients as extension events the moment they arrive, rather than after the
parse completes.
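To make the "fold either mode" idea concrete, here is a small plain-Java sketch. It is only an illustration under stated assumptions: {{Reply}} is a stand-in for {{ExternalParseReply}}, and {{PluginClient}}, {{supportsStreaming}}, and {{ModeFolding.collect}} are hypothetical names, not the prototype's API. The listener callback plays the role of forwarding an extension event as each reply arrives.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for ExternalParseReply: opaque payload bytes per reply.
record Reply(String pluginId, byte[] payload) {}

// Hypothetical client view of a plugin: unary is required, streaming is optional.
interface PluginClient {
    Reply parse(byte[] content);

    default boolean supportsStreaming() {
        return false;
    }

    default Iterator<Reply> parseStream(byte[] content) {
        throw new UnsupportedOperationException("plugin did not register streaming");
    }
}

final class ModeFolding {
    // Folds either mode into a list, notifying the listener as each reply arrives.
    static List<Reply> collect(PluginClient client, byte[] content,
                               Consumer<Reply> onArrival) {
        Iterator<Reply> replies = client.supportsStreaming()
                ? client.parseStream(content)
                : List.of(client.parse(content)).iterator();
        List<Reply> results = new ArrayList<>();
        while (replies.hasNext()) {
            Reply reply = replies.next();
            onArrival.accept(reply); // e.g. forward as an extension event
            results.add(reply);
        }
        return results;
    }
}

final class ModeDemo {
    public static void main(String[] args) {
        PluginClient unaryOnly = content -> new Reply("unary-plugin", content);
        List<Reply> results = ModeFolding.collect(unaryOnly, new byte[] {1, 2, 3},
                r -> System.out.println("arrived from " + r.pluginId()));
        System.out.println(results.size() + " result(s) folded");
    }
}
```

The caller sees the same list of results either way; only the timing of the per-reply callback differs between the two modes.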
h3. Type safety without coupling
Think of the {{Any}} object as similar to a struct in Java with (Object
payload, Class<T> clazz) as its members.
{{Any}} is lazy: the payload stays bytes until a consumer that has the plugin's
generated class unpacks it.
{code:java}
ExtensionResult r = document.getExtensions(0);
if (r.getPayload().is(DoclingDocument.class)) {
    // unpack(Class<T>) requires T extends Message and throws
    // InvalidProtocolBufferException if the bytes do not parse
    DoclingDocument doc = r.getPayload().unpack(DoclingDocument.class);
}
{code}
Tika itself never links against plugin classes. Java/Python/Go/Rust clients
each unpack with their own generated stubs from the plugin's proto. This also
spares the server from deserializing and re-serializing payloads when all it is
doing is brokering parts of messages.
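The brokering behavior can be sketched in plain Java without the protobuf library. This is a simplified model, not the real {{com.google.protobuf.Any}}: {{AnySketch}}, {{Broker}}, and the type name {{example.LayoutTree}} are hypothetical, and the "parser" is any function from bytes to a value. The point it illustrates is that the broker passes the bytes through untouched and decoding happens only in a consumer that knows the type.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Simplified stand-in for google.protobuf.Any: a type URL plus untouched bytes.
record AnySketch(String typeUrl, byte[] value) {
    static AnySketch pack(String fullTypeName, byte[] serialized) {
        return new AnySketch("type.googleapis.com/" + fullTypeName, serialized);
    }

    // Compares only the type name after the last '/' in the type URL.
    boolean is(String fullTypeName) {
        return typeUrl.substring(typeUrl.lastIndexOf('/') + 1).equals(fullTypeName);
    }

    // Decoding happens here and only here, with a parser the consumer supplies.
    <T> T unpack(String fullTypeName, Function<byte[], T> parser) {
        if (!is(fullTypeName)) {
            throw new IllegalArgumentException(
                    "payload is " + typeUrl + ", not " + fullTypeName);
        }
        return parser.apply(value);
    }
}

final class Broker {
    // The broker attaches the envelope without ever decoding the payload.
    static AnySketch forward(AnySketch payload) {
        return payload;
    }
}

final class LazyAnyDemo {
    public static void main(String[] args) {
        byte[] bytes = "layout-tree".getBytes(StandardCharsets.UTF_8);
        AnySketch forwarded = Broker.forward(AnySketch.pack("example.LayoutTree", bytes));
        if (forwarded.is("example.LayoutTree")) {
            String decoded = forwarded.unpack("example.LayoutTree",
                    b -> new String(b, StandardCharsets.UTF_8));
            System.out.println(decoded);
        }
    }
}
```

In the real API the type check and the parser both come from the generated message class, which is why only consumers need the plugin's stubs on their classpath.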
h3. Rendering without compiled classes (descriptors)
For JSON transcoding or debugging of a payload whose class is not on the
classpath, descriptors are required. Proposed path: the registration record may
optionally carry a serialized {{FileDescriptorSet}} (tika-grpc-api already
bundles its own descriptors this way), enabling {{DynamicMessage}} +
{{JsonFormat.TypeRegistry}} rendering. A full schema-registry integration can
come later as an additive feature; it is not a prerequisite.
h3. Security
Registration makes the server call an arbitrary network address on every future
matching parse. It must be gated exactly like fetcher/iterator management
({{allowComponentManagement}}, disabled by default) and must be TLS-capable. A
failing or unreachable plugin must never fail the parse (log + envelope warning
instead).
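The "never fail the parse" rule could look roughly like the following plain-Java sketch. It is an assumption-laden illustration rather than the prototype's code: {{Outcome}} loosely mirrors the envelope fields, {{SafePluginCall}} is a hypothetical helper, and the plugin call is modeled as a {{Supplier}} instead of a gRPC stub. Timeouts and plugin exceptions are both converted into a warning on an otherwise empty result.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical mirror of the envelope: payload bytes, warnings, call time.
record Outcome(String pluginId, byte[] payload, List<String> warnings, long callTimeMs) {}

final class SafePluginCall {
    // Calls the plugin with a deadline; any failure becomes a warning, never an exception.
    static Outcome call(String pluginId, Supplier<byte[]> plugin, long deadlineMs) {
        long start = System.nanoTime();
        CompletableFuture<byte[]> future = CompletableFuture.supplyAsync(plugin);
        try {
            byte[] payload = future.get(deadlineMs, TimeUnit.MILLISECONDS);
            return new Outcome(pluginId, payload, List.of(), elapsedMs(start));
        } catch (TimeoutException e) {
            // Marks the future cancelled; a real gRPC call would cancel the RPC here.
            future.cancel(true);
            return failed(pluginId, "deadline of " + deadlineMs + " ms exceeded", start);
        } catch (ExecutionException e) {
            return failed(pluginId, "plugin failed: " + e.getCause(), start);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return failed(pluginId, "interrupted while waiting for plugin", start);
        }
    }

    private static Outcome failed(String pluginId, String warning, long start) {
        return new Outcome(pluginId, new byte[0], List.of(warning), elapsedMs(start));
    }

    private static long elapsedMs(long start) {
        return (System.nanoTime() - start) / 1_000_000;
    }
}

final class SafeCallDemo {
    public static void main(String[] args) {
        Outcome broken = SafePluginCall.call("broken-plugin", () -> {
            throw new IllegalStateException("connection refused");
        }, 500);
        System.out.println(broken.warnings());
    }
}
```

The parse result itself is untouched in every branch; the caller just attaches the returned outcome, warnings included, and logs the failure.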
h3. Open questions (input wanted)
# Registration model: runtime RPCs ({{Register/Get/List/UnregisterExternalParser}}), static tika-config entries, or both?
# Plugin input: content bytes + content type + the parsed {{Document}}, so plugins never re-fetch or re-parse? Passing the full {{Document}} is the current lean.
# Channel management: pooled/cached channels per registered target, with a per-call deadline?
# Descriptor transport: inline {{FileDescriptorSet}} on the registration vs. external registry vs. defer entirely?
h3. Status
A working prototype (service proto, registry, registration RPCs, tests) exists
on a branch and will be reshaped to this envelope design after PR #2921
settles, so the wire contract tracks the final {{Document}} shape. A demo
consumer is planned on the OpenNLP side (OPENNLP-1833): Tika parse -> typed
Document -> NLP annotations/embeddings attached as an {{ExtensionResult}}.
h3. Opinionated side-note
A new parser ships in some other language every other week, and the pace is
accelerating. This proposal lets Tika ride that wave instead of chasing it:
each engine owns its result type end-to-end, and Tika orchestrates through the
common {{Document}} and envelope metadata. It's very much in the spirit of the
Pipes design – the same registration-and-routing idea that made fetchers and
emitters pluggable, extended to parse output. I think this makes integrations
dramatically easier and opens Tika up to parsing capability it would never want
to carry natively – as witnessed by my initial design.