Kristian Rickert created TIKA-4772:
--------------------------------------
Summary: Stream parsed-document events over gRPC; unary Document
becomes the fold of the stream
Key: TIKA-4772
URL: https://issues.apache.org/jira/browse/TIKA-4772
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Kristian Rickert
h3. Motivation
Tika's parsers are already incremental: every parser emits its output
progressively through the {{ContentHandler}} contract as it reads the input.
The buffered single-reply gRPC response is an artifact of the transport, not of
Tika's model. Exposing the parse as an event stream (a) matches what the
parsers already do, (b) gives time-to-first-block latency for RAG/embedding/NLP
pipelines that can begin work while a large document is still parsing, (c)
bounds client memory on huge documents, and (d) makes partial results on
failure precise instead of best-effort.
h3. Design: one contract, two delivery modes
{code:java}
rpc FetchAndParseEvents(FetchAndParseRequest) returns (stream DocumentEvent) {}
message DocumentEvent {
oneof event {
DocumentEnvelope envelope = 1; // first event: id, content_type, origin
BlockEvent block = 2; // one content Block, in document order
DocumentMetadata metadata = 3; // emitted when known (often late in the
parse)
MetadataField extra = 4;
ExtensionResult extension = 5; // external parser results, forwarded as
they arrive
ParseStatus status = 6; // terminal event
}
}
message BlockEvent {
repeated int32 embedded_path = 1; // [] = root document; [2] = third
embedded child
Block block = 2;
}
{code}
Events are append-only immutable facts (a late {{metadata}} event completes,
never revises). The defining invariant: *The fold of the event stream is
exactly the {{Document}} from TIKA-4766.* Backward compatibility falls out by
construction rather than by discipline: the existing unary-style reply is
implemented AS the fold – the server consumes its own event stream to
completion and emits the assembled {{Document}}. There is one parse path, so
the streamed and materialized forms cannot drift, and existing clients see
identical behavior forever. A mid-parse failure terminates the stream with
{{PARTIAL}}, and the fold yields a Document holding exactly the blocks parsed
before the failure -- {{PARTIAL}} becomes a precise statement instead of a
best-effort flag.
h3. The event model is the wire's native model
{{{}DocumentEvent{}}}/{{{}Block{}}} is the contract's native representation.
Tika's existing parsers keep their ContentHandler contract exactly as-is: a
bridge insideoutput into document events at the edge, and a reverse adapter
serves anything that consumes ContentHandlers today – nothing about the Java
parser APIchanges, now or later. The same event stream can equally be produced
directly, by server components or (via the external parser mechanism,
TIKA-4771) by services in any language, so every producer is first-class and
identical on the wire. Structure the block model cannot express maps to
{{{}HtmlBlock{}}}/metadata, never silently dropped.
This extends a pattern tika-grpc already has: the server currently brokers
registered fetchers, emitters, and pipes iterators; this applies the same
brokering to parse output.
h3. Phasing
# *Phase 1 (contract + cheap wins):* declare the RPC; serve a coarse but valid
stream (envelope, then blocks, then metadata, then terminal status); implement
the unary-style reply as the fold. Streaming results from external parsers
({{{}ParseStream{}}} in TIKA-4771) also land here - the gRPC server calls those
services directly, so forwarding their streams touches nothing outside
tika-grpc.
# *Phase 2 (fine granularity):* frame {{DocumentEvent}} across the tika-pipes
forked-worker boundary so events flow live during the parse. The worker
protocol speaks {{DocumentEvent}} frames; the existing parser fleet feeds it
through the {{ContentHandler}} bridge at the edge. This touches tika-pipes-core
and deserves its own discussion - maintainer input explicitly wanted on
whether/how the worker framing should support incremental events. Because unary
== fold, improving granularity later changes neither the contract nor unary
behavior.
h3. Open questions
# Pipes worker protocol: appetite and preferred shape for incremental event
framing across the fork boundary (this is the phase-2 gate)?
# Ordering/assembly: block order per {{embedded_path}} is document order; is
that sufficient, or do we want explicit sequence numbers for resumability?
# Flow control: rely on gRPC per-stream backpressure, or add a server-side
high-water mark with a documented buffering policy?
# Metadata timing: emit known-early fields (content type, origin) in the
envelope and the rest at completion, or allow multiple cumulative {{metadata}}
events?
h3. Relationship to other work
Depends on TIKA-4766 (the {{{}Document{}}}/{{{}Block{}}} contract this
streams). Relates to TIKA-4771 (external parsers; their streamed results ride
this stream as {{extension}} events). Phase 2 will spawn a linked tika-pipes
issue once there is agreement on direction.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)