krickert opened a new pull request, #2921: URL: https://github.com/apache/tika/pull/2921
## Summary Follow-up to #2916, reshaped per the review there. Instead of mirroring Tika's open metadata taxonomy in protobuf (~5k lines of proto, per-format messages), this PR types the thing that is actually stable: **the parsed document**. One small contract — `document.proto` is 208 lines — and format specifics live in per-parser mapping *code*, never in the wire. `FetchAndParseReply.fields` (`map<string,string>`, field 2, now reserved) is replaced by `FetchAndParseReply.document`. ## How this answers the #2916 review | Concern from #2916 | Where it landed | |---|---| | "11k lines to nail down maybe 80% of an open set" | 208-line contract; whole PR is ~3.9k lines, most of it mapper code + tests | | Clients rebuild when metadata definitions change | A metadata key is now *data* (`extra` tail), not schema — add/rename/retype a Tika `Property` and no client regenerates anything | | Lossless catch-all as source of truth | `Document.extra` carries every Tika key, multivalue-preserving; a test asserts nothing is dropped | | Special handling only for DC + core props | `DocumentMetadata` types only the bounded cross-format fields (title, authors, dates as `Timestamp`, counts, dimensions, rights) | | Break it into individually reviewable tasks | This PR is the contract only; see "Deliberately not in this PR" | ## The shape 1. **Content tree**: `markdown` (the same render `ToMarkdownContentHandler` already produces since TIKA-4730) plus `blocks` — that markdown parsed once, format-agnostically, into a structured tree of headings/paragraphs/lists/**tables**/code blocks/inline runs (CommonMark + GFM, a spec that does not churn). This is what a downstream NLP/RAG/embeddings consumer actually wants: typed tables and sections, not a string to re-parse. 2. **Typed common metadata**: `DocumentMetadata`, grouped by concern, not by source format. Dates are `Timestamp`s, counts are ints — not strings that 12 language clients each re-parse. 3. **Tagged tail**: `extra` — every remaining key, typed only where Tika's own `Property` declares a type (integer/real/boolean/date), string otherwise, never guessed. 4. **`embedded`** recurses: a PDF with an embedded image is a parent `Document` with a fully typed child — no forcing two formats into one bucket (this was the oneof problem from #2916). 5. Adding a format = adding a `DocumentTransformer` (see `tika-grpc-mapper/docs/EXTENSIONS.md`); `PdfDocumentTransformer` is 65 lines and the wire contract does not move. ## Deliberately not in this PR (follow-ups, each its own PR) - **Pluggable external parsers**: registering a third-party gRPC service whose output rides along on the `Document` as a `google.protobuf.Any` — so wildly different result shapes (e.g. a document-layout model's tree) never require Tika to model them. Built and tested on a branch; kept out to keep this reviewable. - **A Markdown parser** for `.md` input files (separate JIRA). - Richer typed fields, if and only if real cross-format demand appears — they'd be additive `optional` fields, compatible both directions. ## Open decisions where reviewer preference wins 1. **Tail shape**: `repeated MetadataField` with a typed value oneof (as implemented) vs the `map<string, StringList>` suggested in #2916. The typed-where-declared tail preserves types without guessing; the map is maximally churn-proof. Swapping is a one-message change — happy to go either way. 2. **`markdown` + `blocks` both**: today both ship (string render + structured tree). If payload size matters, a per-request flag choosing one is easy. 3. **Hard removal vs staged**: `fields` is hard-removed (4.0, nothing consumes it yet); can switch to deprecate-then-remove if preferred. ## Client migration | Before | After | |--------|-------| | `fields["X-TIKA:content"]` | `document.markdown` (or walk `document.blocks`) | | `fields["Content-Type"]` | `document.content_type` | | Ad hoc title/author/date strings | `document.metadata.title` / `.authors` / `.created` (`Timestamp`) | | Any other key | `document.extra` (typed by declared `Property` type, string otherwise) | ## Test plan - [x] `./mvnw -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test` — green (transformer tests against real parse fixtures per format, block-tree tests, `DocumentBuilder` envelope/status/embedded tests, server tests reading `FetchAndParseReply.document`) - [x] `tika-grpc-api` jar bundles `META-INF/org.apache.tika.grpc.v1.descriptors` (verified: contains `document.proto`) - [x] e2e `tika-grpc-e2e-test` compiles against the new API - [ ] CI Downstream context: this contract is what the OpenNLP gRPC work (OPENNLP-1833) will consume as input — Tika parse → typed document → NLP/embeddings without re-parsing strings. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
