[
https://issues.apache.org/jira/browse/TIKA-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18093082#comment-18093082
]
ASF GitHub Bot commented on TIKA-4766:
--------------------------------------
krickert opened a new pull request, #2921:
URL: https://github.com/apache/tika/pull/2921
## Summary
Follow-up to #2916, reshaped per the review there. Instead of mirroring
Tika's open metadata taxonomy in protobuf (~5k lines of proto, per-format
messages), this PR types the thing that is actually stable: **the parsed
document**. The contract is small (`document.proto` is 208 lines), and format
specifics live in per-parser mapping *code*, never on the wire.
`FetchAndParseReply.fields` (`map<string,string>`, field 2, now reserved) is
replaced by `FetchAndParseReply.document`.
## How this answers the #2916 review
| Concern from #2916 | Where it landed |
|---|---|
| "11k lines to nail down maybe 80% of an open set" | 208-line contract; whole PR is ~3.9k lines, most of it mapper code + tests |
| Clients rebuild when metadata definitions change | A metadata key is now *data* (`extra` tail), not schema: add/rename/retype a Tika `Property` and no client regenerates anything |
| Lossless catch-all as source of truth | `Document.extra` carries every Tika key, multivalue-preserving; a test asserts nothing is dropped |
| Special handling only for DC + core props | `DocumentMetadata` types only the bounded cross-format fields (title, authors, dates as `Timestamp`, counts, dimensions, rights) |
| Break it into individually reviewable tasks | This PR is the contract only; see "Deliberately not in this PR" |
## The shape
1. **Content tree**: `markdown` (the same render `ToMarkdownContentHandler`
already produces since TIKA-4730) plus `blocks` — that markdown parsed once,
format-agnostically, into a structured tree of
headings/paragraphs/lists/**tables**/code blocks/inline runs (CommonMark + GFM,
a spec that does not churn). This is what a downstream NLP/RAG/embeddings
consumer actually wants: typed tables and sections, not a string to re-parse.
2. **Typed common metadata**: `DocumentMetadata`, grouped by concern, not by
source format. Dates are `Timestamp`s, counts are ints — not strings that 12
language clients each re-parse.
3. **Tagged tail**: `extra` — every remaining key, typed only where Tika's
own `Property` declares a type (integer/real/boolean/date), string otherwise,
never guessed.
4. **`embedded`** recurses: a PDF with an embedded image is a parent
`Document` with a fully typed child — no forcing two formats into one bucket
(this was the oneof problem from #2916).
5. Adding a format = adding a `DocumentTransformer` (see
`tika-grpc-mapper/docs/EXTENSIONS.md`); `PdfDocumentTransformer` is 65 lines
and the wire contract does not move.
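
To make items 1 and 4 concrete, here is a minimal Java sketch of how a consumer might walk the block tree and recurse into embedded documents. The `Doc`, `Block`, `Heading`, `Paragraph`, and `Table` types are hypothetical stand-ins written for this example, not the classes generated from `document.proto`; the real field names and block kinds may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the generated protobuf classes (assumption:
// the real names and fields are whatever document.proto defines).
interface Block {}
record Heading(int level, String text) implements Block {}
record Paragraph(String text) implements Block {}
record Table(List<List<String>> rows) implements Block {}

// A parsed document: its own blocks plus fully typed embedded children.
record Doc(String contentType, List<Block> blocks, List<Doc> embedded) {}

final class BlockTreeSketch {
    // Collect every table in a document and, recursively, in its embedded children.
    static List<Table> tables(Doc doc) {
        List<Table> out = new ArrayList<>();
        for (Block b : doc.blocks()) {
            if (b instanceof Table t) {
                out.add(t);
            }
        }
        for (Doc child : doc.embedded()) {
            out.addAll(tables(child));
        }
        return out;
    }

    public static void main(String[] args) {
        Doc attachment = new Doc("text/csv",
                List.of(new Table(List.of(List.of("x", "y")))), List.of());
        Doc pdf = new Doc("application/pdf",
                List.of(new Heading(1, "Results"),
                        new Table(List.of(List.of("a", "b"), List.of("1", "2")))),
                List.of(attachment));
        System.out.println(tables(pdf).size()); // prints 2
    }
}
```

The point of the sketch is that a consumer reaches tables by type, at any nesting depth, without re-parsing a markdown string.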
## Deliberately not in this PR (follow-ups, each its own PR)
- **Pluggable external parsers**: registering a third-party gRPC service
whose output rides along on the `Document` as a `google.protobuf.Any` — so
wildly different result shapes (e.g. a document-layout model's tree) never
require Tika to model them. Built and tested on a branch; kept out to keep this
reviewable.
- **A Markdown parser** for `.md` input files (separate JIRA).
- Richer typed fields, if and only if real cross-format demand appears;
they'd be additive `optional` fields, compatible in both directions.
## Open decisions where reviewer preference wins
1. **Tail shape**: `repeated MetadataField` with a typed value oneof (as
implemented) vs the `map<string, StringList>` suggested in #2916. The
typed-where-declared tail preserves types without guessing; the map is
maximally churn-proof. Swapping is a one-message change — happy to go either
way.
2. **`markdown` + `blocks` both**: today both ship (string render +
structured tree). If payload size matters, a per-request flag choosing one is
easy.
3. **Hard removal vs staged**: `fields` is hard-removed (4.0, nothing
consumes it yet); can switch to deprecate-then-remove if preferred.
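
For decision 1, the trade-off between the two tail shapes can be sketched in Java as below. Everything here is hypothetical and written for illustration: the `DeclaredType` enum, the `TailField` record, and the lookup map stand in for Tika's `Property` type information and the `MetadataField` message. Falling back to the raw string when a declared value does not parse is an assumption of this sketch, not confirmed behaviour of the mapper.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the tail; the real messages live in document.proto.
enum DeclaredType { INTEGER, REAL, BOOLEAN, DATE }

record TailField(String key, Object value) {}

final class TailShapeSketch {
    // Type a raw value only when a declared type is known; otherwise keep the string.
    static TailField typed(String key, String raw, Map<String, DeclaredType> declared) {
        DeclaredType t = declared.get(key);
        if (t == null) {
            return new TailField(key, raw); // no declared type: never guess
        }
        try {
            switch (t) {
                case INTEGER:
                    return new TailField(key, Long.parseLong(raw));
                case REAL:
                    return new TailField(key, Double.parseDouble(raw));
                case DATE:
                    return new TailField(key, Instant.parse(raw));
                default: // BOOLEAN: accept only literal true/false
                    if (raw.equalsIgnoreCase("true") || raw.equalsIgnoreCase("false")) {
                        return new TailField(key, Boolean.parseBoolean(raw));
                    }
                    return new TailField(key, raw);
            }
        } catch (NumberFormatException | DateTimeParseException e) {
            // Assumption: an unparsable value stays a string rather than being dropped.
            return new TailField(key, raw);
        }
    }

    // Downgrade the typed tail to the map<string, StringList> shape.
    // Keys and multiple values survive; the value types do not.
    static Map<String, List<String>> toStringLists(List<TailField> tail) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (TailField f : tail) {
            out.computeIfAbsent(f.key(), k -> new ArrayList<>()).add(String.valueOf(f.value()));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, DeclaredType> declared = Map.of("xmpTPg:NPages", DeclaredType.INTEGER);
        TailField pages = typed("xmpTPg:NPages", "12", declared);
        System.out.println(pages.value().getClass().getSimpleName()); // prints Long
        System.out.println(toStringLists(List.of(pages)));
    }
}
```

Going from the typed tail to the map is mechanical, as `toStringLists` shows. Going the other way requires the declared-type lookup again, which is the information the map shape deliberately leaves out of the wire.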
## Client migration
| Before | After |
|--------|-------|
| `fields["X-TIKA:content"]` | `document.markdown` (or walk `document.blocks`) |
| `fields["Content-Type"]` | `document.content_type` |
| Ad hoc title/author/date strings | `document.metadata.title` / `.authors` / `.created` (`Timestamp`) |
| Any other key | `document.extra` (typed by declared `Property` type, string otherwise) |
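
A client that cannot switch all of its reads at once could, as a stopgap, rebuild the old flat map from the new document. The Java sketch below reads the table above from right to left. `TypedDoc` is a hypothetical, trimmed-down stand-in for the generated `Document` class, and joining multiple values with `", "` is an assumption made only for this example, since the old `map<string,string>` had no multivalue representation.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, trimmed-down view of the new reply (not the generated class).
record TypedDoc(String markdown, String contentType, Map<String, List<String>> extra) {}

final class LegacyShimSketch {
    // Rebuild a legacy-style flat map so old lookups keep working during migration.
    static Map<String, String> toLegacyFields(TypedDoc doc) {
        Map<String, String> fields = new LinkedHashMap<>();
        // Assumption: multiple values are flattened with ", ".
        doc.extra().forEach((key, values) -> fields.put(key, String.join(", ", values)));
        fields.put("X-TIKA:content", doc.markdown());
        fields.put("Content-Type", doc.contentType());
        return fields;
    }

    public static void main(String[] args) {
        TypedDoc doc = new TypedDoc("# Report", "application/pdf",
                Map.of("dc:subject", List.of("tika", "grpc")));
        Map<String, String> fields = toLegacyFields(doc);
        System.out.println(fields.get("Content-Type")); // prints application/pdf
        System.out.println(fields.get("dc:subject"));   // prints tika, grpc
    }
}
```

Such a shim discards exactly what the new contract adds (types, multivalues, structure), so it only makes sense as a temporary bridge.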
## Test plan
- [x] `./mvnw -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test` — green
(transformer tests against real parse fixtures per format, block-tree tests,
`DocumentBuilder` envelope/status/embedded tests, server tests reading
`FetchAndParseReply.document`)
- [x] `tika-grpc-api` jar bundles
`META-INF/org.apache.tika.grpc.v1.descriptors` (verified: contains
`document.proto`)
- [x] e2e `tika-grpc-e2e-test` compiles against the new API
- [ ] CI
Downstream context: this contract is what the OpenNLP gRPC work
(OPENNLP-1833) will consume as input — Tika parse → typed document →
NLP/embeddings without re-parsing strings.
> Typed ParseResponse for Tika gRPC
> ---------------------------------
>
> Key: TIKA-4766
> URL: https://issues.apache.org/jira/browse/TIKA-4766
> Project: Tika
> Issue Type: New Feature
> Components: tika-pipes
> Affects Versions: 4.0.0
> Reporter: Kristian Rickert
> Priority: Major
> Labels: grpc, pipes, protobuf
>
> h2. Summary
> Replace the flat {{FetchAndParseReply.fields}} map
> ({{map<string,string>}}) with a typed
> {{org.apache.tika.grpc.v1.ParseResponse}} on
> {{FetchAndParseReply.parse_response}}.
> Parse output is mapped from Tika {{Metadata}} into protobuf messages aligned
> with Tika property interfaces (PDF, Office, HTML, Image, Email, and the other
> supported formats). Dublin Core fields are normalized on the response root.
> Creative Commons / XMP rights metadata is exposed on a dedicated field when
> present.
> This is a *breaking change* for gRPC clients that read string keys from
> {{fields}}.
> h2. Motivation
> * Clients today must parse hundreds of ad hoc string keys with no schema or
> type safety.
> * Format-specific metadata is easier to consume and evolve with protobuf +
> bundled descriptors.
> * Mapping logic is separated from the gRPC server so it can be tested and
> reused (e.g. by downstream parse services).
> h2. Scope
> *New Maven modules (Tika fork: {{ai-pipestream/tika}}, branch
> {{typed-parse-response-grpc}}):*
> ||Module||Purpose||
> |{{tika-grpc-api}}|Protobuf schema ({{org.apache.tika.grpc.v1}}), Java stubs, {{FileDescriptorSet}} under {{META-INF/}}|
> |{{tika-grpc-mapper}}|{{ParseResponseMapper}} + format builders; optional {{ParseResponseDecorator}} hook for future outline enrichment|
> |{{tika-grpc}}|Wire {{parse_response}} on fetch-and-parse RPCs; remove {{fields}}|
> *Out of scope (follow-up tickets):*
> * PDF/HTML/Markdown outline decorators (proto fields + separate extension)
> * Full CF/NetCDF field mapping in {{ClimateForecastMetadataBuilder}}
> (currently additional_struct MVP)
> * Downstream consumer updates outside the Tika fork
> h2. API change
> ||Removed||Replacement||
> |{{FetchAndParseReply.fields}} (field 2, reserved)|{{FetchAndParseReply.parse_response}} (field 5)|
> *Client migration (examples):*
> {code:java}
> // Body text
> parse_response.content.body
> // Title
> parse_response.content.title
> parse_response.dublin_core.title
> // PDF
> parse_response.pdf.doc_info_producer
> parse_response.pdf.n_pages
> // Creative Commons (alongside primary type)
> parse_response.creative_commons.web_statement
> {code}
> h2. ParseResponse layout
> # *Envelope:* {{parse_id}}, {{parsed_at}}, {{status}},
> {{content}}, optional {{embedded_docs}}
> # *Dublin Core:* {{dublin_core}} (shared across formats)
> # *Format metadata:* oneof ({{pdf}}, {{office}}, {{html}},
> {{image}}, {{email}}, {{media}}, {{rtf}}, {{database}},
> {{font}}, {{epub}}, {{warc}}, {{climate_forecast}},
> {{generic}})
> # *Creative Commons:* {{creative_commons}} when XMP rights metadata is
> present
> h2. Architecture
> {code:none}
> Client
> -> TikaGrpcServerImpl (tika-grpc)
> -> Tika Pipes / parsers -> Metadata + body
> -> ParseResponseMapper (tika-grpc-mapper)
> -> format builders -> ParseResponse (tika-grpc-api)
> <- FetchAndParseReply.parse_response
> {code}
> *Module dependency:* {{tika-grpc}} depends on {{tika-grpc-api}} +
> {{tika-grpc-mapper}}. Clients may depend on {{tika-grpc-api}} alone for
> stubs/descriptors.
> h2. Implementation notes
> * Protos live under {{tika-grpc-api/src/main/proto/org/apache/tika/grpc/v1/}}
> * {{creative_commons}} is field 25 outside the document oneof so it can
> coexist with PDF/Office/etc.
> * {{ParseResponseDecorator}} + {{ParseMapContext}} allow optional
> post-processing (e.g. outlines) without coupling the core mapper to
> PDFBox/HTML libraries
> * Mapper tests use Tika parser test-jar fixtures (~35 tests in
> {{tika-grpc-mapper}})
> h2. Acceptance criteria
> * ( ) {{FetchAndParseReply}} exposes {{parse_response}}; {{fields}} is
> removed/reserved
> * ( ) {{tika-grpc-api}} jar bundles descriptor set at
> {{META-INF/org.apache.tika.grpc.v1.descriptors}}
> * ( ) PDF, Office, HTML, and at least one other format return the expected
> typed oneof in integration tests
> * ( ) {{mvn -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test}} passes
> * ( ) README in {{tika-grpc}} and {{tika-grpc-api}} documents migration for
> clients
> * ( ) Breaking change called out in release notes / migration guide
> h2. Test plan
> {code:bash}
> ./mvnw -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test
> {code}
> * Parse sample PDF and HTML via gRPC; verify {{parse_response.hasPdf()}} /
> {{hasHtml()}}, {{content.body}}, and representative typed fields
> * Confirm no regression in fetcher CRUD and streaming RPCs
> * Spot-check Dublin Core and Creative Commons overlay on a fixture with XMP
> rights