[ 
https://issues.apache.org/jira/browse/TIKA-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18093082#comment-18093082
 ] 

ASF GitHub Bot commented on TIKA-4766:
--------------------------------------

krickert opened a new pull request, #2921:
URL: https://github.com/apache/tika/pull/2921

   ## Summary
   
   Follow-up to #2916, reshaped per the review there. Instead of mirroring 
Tika's open metadata taxonomy in protobuf (~5k lines of proto, per-format 
messages), this PR types the thing that is actually stable: **the parsed 
document**. The contract is small (`document.proto` is 208 lines), and format 
specifics live in per-parser mapping *code*, never on the wire.
   
   `FetchAndParseReply.fields` (`map<string,string>`, field 2, now reserved) is 
replaced by `FetchAndParseReply.document`.
   
   ## How this answers the #2916 review
   
   | Concern from #2916 | Where it landed |
   |---|---|
   | "11k lines to nail down maybe 80% of an open set" | 208-line contract; 
whole PR is ~3.9k lines, most of it mapper code + tests |
   | Clients rebuild when metadata definitions change | A metadata key is now 
*data* (`extra` tail), not schema — add/rename/retype a Tika `Property` and no 
client regenerates anything |
   | Lossless catch-all as source of truth | `Document.extra` carries every 
Tika key, multivalue-preserving; a test asserts nothing is dropped |
   | Special handling only for DC + core props | `DocumentMetadata` types only 
the bounded cross-format fields (title, authors, dates as `Timestamp`, counts, 
dimensions, rights) |
   | Break it into individually reviewable tasks | This PR is the contract 
only; see "Deliberately not in this PR" |
   
   ## The shape
   
   1. **Content tree**: `markdown` (the same render `ToMarkdownContentHandler` 
already produces since TIKA-4730) plus `blocks` — that markdown parsed once, 
format-agnostically, into a structured tree of 
headings/paragraphs/lists/**tables**/code blocks/inline runs (CommonMark + GFM, 
a spec that does not churn). This is what a downstream NLP/RAG/embeddings 
consumer actually wants: typed tables and sections, not a string to re-parse.
   2. **Typed common metadata**: `DocumentMetadata`, grouped by concern, not by 
source format. Dates are `Timestamp`s, counts are ints — not strings that 12 
language clients each re-parse.
   3. **Tagged tail**: `extra` — every remaining key, typed only where Tika's 
own `Property` declares a type (integer/real/boolean/date), string otherwise, 
never guessed.
   4. **`embedded`** recurses: a PDF with an embedded image is a parent 
`Document` with a fully typed child — no forcing two formats into one bucket 
(this was the oneof problem from #2916).
   5. Adding a format = adding a `DocumentTransformer` (see 
`tika-grpc-mapper/docs/EXTENSIONS.md`); `PdfDocumentTransformer` is 65 lines 
and the wire contract does not move.
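   
   As a rough illustration of the shape above, here is a sketch in plain Java 
records. These are hypothetical stand-ins, not the classes generated from 
`document.proto`: the names `DocumentSketch`, `Doc`, `Meta`, `ExtraField`, the 
record field names, and the helper `countDocuments` are all assumptions made 
for this example, and `Instant` stands in for protobuf `Timestamp`.

```java
import java.time.Instant;
import java.util.List;

/** Hypothetical stand-ins for the wire messages; not the generated protobuf classes. */
final class DocumentSketch {

    /** Typed common metadata: dates as instants, counts as ints (field names assumed). */
    record Meta(String title, List<String> authors, Instant created, int pageCount) {}

    /** One entry of the tagged tail; value is a Long, Double, Boolean, Instant, or String. */
    record ExtraField(String key, Object value) {}

    /** A parsed document; embedded recurses, so each child keeps its own typed metadata. */
    record Doc(String contentType, String markdown, Meta metadata,
               List<ExtraField> extra, List<Doc> embedded) {}

    /** Counts a document plus all of its embedded descendants. */
    static int countDocuments(Doc doc) {
        int total = 1;
        for (Doc child : doc.embedded()) {
            total += countDocuments(child);
        }
        return total;
    }

    public static void main(String[] args) {
        // Illustrative values only; keys and metadata are made up for the example.
        Doc image = new Doc("image/png", "", new Meta(null, List.of(), null, 0),
                List.of(new ExtraField("tiff:ImageWidth", 640L)), List.of());
        Doc pdf = new Doc("application/pdf", "# Report\n\nBody text.",
                new Meta("Report", List.of("A. Author"),
                        Instant.parse("2024-01-01T00:00:00Z"), 3),
                List.of(new ExtraField("pdf:producer", "ExampleProducer")),
                List.of(image));
        System.out.println(countDocuments(pdf));                 // prints 2
        System.out.println(pdf.embedded().get(0).contentType()); // prints image/png
    }
}
```

   The point of the recursion is that the PDF parent and its image child are 
both complete documents, so neither has to share a format-specific slot with 
the other.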
   
   ## Deliberately not in this PR (follow-ups, each its own PR)
   
   - **Pluggable external parsers**: registering a third-party gRPC service 
whose output rides along on the `Document` as a `google.protobuf.Any` — so 
wildly different result shapes (e.g. a document-layout model's tree) never 
require Tika to model them. Built and tested on a branch; kept out to keep this 
reviewable.
   - **A Markdown parser** for `.md` input files (separate JIRA).
   - Richer typed fields, if and only if real cross-format demand appears; 
they'd be additive `optional` fields, compatible in both directions.
   
   ## Open decisions where reviewer preference wins
   
   1. **Tail shape**: `repeated MetadataField` with a typed value oneof (as 
implemented) vs the `map<string, StringList>` suggested in #2916. The 
typed-where-declared tail preserves types without guessing; the map is 
maximally churn-proof. Swapping is a one-message change — happy to go either 
way.
   2. **`markdown` + `blocks` both**: today both ship (string render + 
structured tree). If payload size matters, a per-request flag choosing one is 
easy.
   3. **Hard removal vs staged**: `fields` is hard-removed (4.0, nothing 
consumes it yet); can switch to deprecate-then-remove if preferred.
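   
   To make the first decision concrete, here is a sketch of the two tail 
shapes in plain Java. Everything in it is an assumption for illustration: 
`TailSketch`, `Declared`, `Field`, `typed`, and `toStringLists` are 
hypothetical names, not the mapper's API. Date handling is omitted, and the 
sketch lets number-parse errors propagate; how the real mapper treats a value 
that fails to parse is not covered here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the typed tail versus the map-of-string-lists tail. */
final class TailSketch {

    /** Declared value types (date omitted here), plus NONE for keys with no declared type. */
    enum Declared { INTEGER, REAL, BOOLEAN, NONE }

    /** One tail entry; repeated keys model multivalued metadata. */
    record Field(String key, Object value) {}

    /** Types a raw value only where a type is declared; otherwise keeps the string as-is. */
    static Field typed(String key, String raw, Declared declared) {
        switch (declared) {
            case INTEGER: return new Field(key, Long.parseLong(raw));
            case REAL:    return new Field(key, Double.parseDouble(raw));
            case BOOLEAN: return new Field(key, Boolean.parseBoolean(raw));
            default:      return new Field(key, raw); // never guessed
        }
    }

    /** Flattens the typed tail into the map alternative, keeping multivalues in order. */
    static Map<String, List<String>> toStringLists(List<Field> tail) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Field f : tail) {
            out.computeIfAbsent(f.key(), k -> new ArrayList<>())
               .add(String.valueOf(f.value()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Field> tail = List.of(
                typed("dc:creator", "A", Declared.NONE),
                typed("dc:creator", "B", Declared.NONE),
                typed("xmpTPg:NPages", "12", Declared.INTEGER));
        // prints {dc:creator=[A, B], xmpTPg:NPages=[12]}
        System.out.println(toStringLists(tail));
    }
}
```

   Going from the typed tail to the map only stringifies values, which is why 
the swap is cheap in that direction; going back would need the declared types 
again.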
   
   ## Client migration
   
   | Before | After |
   |--------|-------|
   | `fields["X-TIKA:content"]` | `document.markdown` (or walk 
`document.blocks`) |
   | `fields["Content-Type"]` | `document.content_type` |
   | Ad hoc title/author/date strings | `document.metadata.title` / `.authors` 
/ `.created` (`Timestamp`) |
   | Any other key | `document.extra` (typed by declared `Property` type, 
string otherwise) |
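   
   A small before/after sketch of what the table means for client code. It 
uses hypothetical plain-Java stand-ins (`MigrationSketch`, `Document`, 
`Metadata`) rather than the generated stubs, and it assumes the old map 
carried the creation date under `dcterms:created` as an ISO-8601 string; 
`Instant` stands in for protobuf `Timestamp`.

```java
import java.time.Instant;
import java.util.Map;

/** Hypothetical stand-ins showing the client-side difference; not the generated stubs. */
final class MigrationSketch {

    record Metadata(String title, Instant created) {}

    record Document(String contentType, String markdown, Metadata metadata) {}

    /** Before: the client looks up a string key and parses the date itself. */
    static Instant createdBefore(Map<String, String> fields) {
        return Instant.parse(fields.get("dcterms:created"));
    }

    /** After: the value arrives already typed. */
    static Instant createdAfter(Document document) {
        return document.metadata().created();
    }

    public static void main(String[] args) {
        Map<String, String> fields = Map.of(
                "X-TIKA:content", "Body text.",
                "Content-Type", "application/pdf",
                "dcterms:created", "2024-05-01T10:15:30Z");
        Document document = new Document("application/pdf", "Body text.",
                new Metadata("Report", Instant.parse("2024-05-01T10:15:30Z")));

        System.out.println(fields.get("Content-Type").equals(document.contentType())); // prints true
        System.out.println(createdBefore(fields).equals(createdAfter(document)));       // prints true
    }
}
```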
   
   ## Test plan
   
   - [x] `./mvnw -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test` — green 
(transformer tests against real parse fixtures per format, block-tree tests, 
`DocumentBuilder` envelope/status/embedded tests, server tests reading 
`FetchAndParseReply.document`)
   - [x] `tika-grpc-api` jar bundles 
`META-INF/org.apache.tika.grpc.v1.descriptors` (verified: contains 
`document.proto`)
   - [x] e2e `tika-grpc-e2e-test` compiles against the new API
   - [ ] CI
   
   Downstream context: this contract is what the OpenNLP gRPC work 
(OPENNLP-1833) will consume as input — Tika parse → typed document → 
NLP/embeddings without re-parsing strings.
   




> Typed ParseResponse for Tika gRPC
> ---------------------------------
>
>                 Key: TIKA-4766
>                 URL: https://issues.apache.org/jira/browse/TIKA-4766
>             Project: Tika
>          Issue Type: New Feature
>          Components: tika-pipes
>    Affects Versions: 4.0.0
>            Reporter: Kristian Rickert
>            Priority: Major
>              Labels: grpc, pipes, protobuf
>
> h2. Summary
> Replace the flat {{FetchAndParseReply.fields}} map 
> ({{map<string,string>}}) with a typed 
> {{org.apache.tika.grpc.v1.ParseResponse}} on 
> {{FetchAndParseReply.parse_response}}.
> Parse output is mapped from Tika {{Metadata}} into protobuf messages aligned 
> with Tika property interfaces (PDF, Office, HTML, Image, Email, and the other 
> supported formats). Dublin Core fields are normalized on the response root. 
> Creative Commons / XMP rights metadata is exposed on a dedicated field when 
> present.
> This is a *breaking change* for gRPC clients that read string keys from 
> {{fields}}.
> h2. Motivation
>  * Clients today must parse hundreds of ad hoc string keys with no schema or 
> type safety.
>  * Format-specific metadata is easier to consume and evolve with protobuf + 
> bundled descriptors.
>  * Mapping logic is separated from the gRPC server so it can be tested and 
> reused (e.g. by downstream parse services).
> h2. Scope
> *New Maven modules (Tika fork: {{ai-pipestream/tika}}, branch 
> {{typed-parse-response-grpc}}):*
> ||Module||Purpose||
> |{{tika-grpc-api}}|Protobuf schema ({{org.apache.tika.grpc.v1}}), Java 
> stubs, {{FileDescriptorSet}} under {{META-INF/}}|
> |{{tika-grpc-mapper}}|{{ParseResponseMapper}} + format builders; optional 
> {{ParseResponseDecorator}} hook for future outline enrichment|
> |{{tika-grpc}}|Wire {{parse_response}} on fetch-and-parse RPCs; remove 
> {{fields}}|
> *Out of scope (follow-up tickets):*
>  * PDF/HTML/Markdown outline decorators (proto fields + separate extension)
>  * Full CF/NetCDF field mapping in {{ClimateForecastMetadataBuilder}} 
> (currently an {{additional_struct}} MVP)
>  * Downstream consumer updates outside the Tika fork
> h2. API change
> ||Removed||Replacement||
> |{{FetchAndParseReply.fields}} (field 2, 
> reserved)|{{FetchAndParseReply.parse_response}} (field 5)|
> *Client migration (examples):*
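> {code:java}
> // Java getters as protobuf generates them; proto field paths in comments.
> // "reply" is a FetchAndParseReply returned by a fetch-and-parse RPC.
> ParseResponse r = reply.getParseResponse();
> // Body text
> r.getContent().getBody();              // parse_response.content.body
> // Title
> r.getContent().getTitle();             // parse_response.content.title
> r.getDublinCore().getTitle();          // parse_response.dublin_core.title
> // PDF
> r.getPdf().getDocInfoProducer();       // parse_response.pdf.doc_info_producer
> r.getPdf().getNPages();                // parse_response.pdf.n_pages
> // Creative Commons (alongside primary type)
> r.getCreativeCommons().getWebStatement(); // parse_response.creative_commons.web_statement
> {code}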
> h2. ParseResponse layout
>  # *Envelope:* {{parse_id}}, {{parsed_at}}, {{status}}, 
> {{content}}, optional {{embedded_docs}}
>  # *Dublin Core:* {{dublin_core}} (shared across formats)
>  # *Format metadata:* oneof ({{pdf}}, {{office}}, {{html}}, 
> {{image}}, {{email}}, {{media}}, {{rtf}}, {{database}}, 
> {{font}}, {{epub}}, {{warc}}, {{climate_forecast}}, 
> {{generic}})
>  # *Creative Commons:* {{creative_commons}} when XMP rights metadata is 
> present
> h2. Architecture
> {code:none}
> Client
>   -> TikaGrpcServerImpl (tika-grpc)
>        -> Tika Pipes / parsers -> Metadata + body
>        -> ParseResponseMapper (tika-grpc-mapper)
>             -> format builders -> ParseResponse (tika-grpc-api)
>   <- FetchAndParseReply.parse_response
> {code}
> *Module dependency:* {{tika-grpc}} depends on {{tika-grpc-api}} + 
> {{tika-grpc-mapper}}. Clients may depend on {{tika-grpc-api}} alone for 
> stubs/descriptors.
> h2. Implementation notes
>  * Protos live under {{tika-grpc-api/src/main/proto/org/apache/tika/grpc/v1/}}
>  * {{creative_commons}} is field 25 outside the document oneof so it can 
> coexist with PDF/Office/etc.
>  * {{ParseResponseDecorator}} + {{ParseMapContext}} allow optional 
> post-processing (e.g. outlines) without coupling the core mapper to 
> PDFBox/HTML libraries
>  * Mapper tests use Tika parser test-jar fixtures (~35 tests in 
> {{tika-grpc-mapper}})
> h2. Acceptance criteria
>  * ( ) {{FetchAndParseReply}} exposes {{parse_response}}; {{fields}} is 
> removed/reserved
>  * ( ) {{tika-grpc-api}} jar bundles descriptor set at 
> {{META-INF/org.apache.tika.grpc.v1.descriptors}}
>  * ( ) PDF, Office, HTML, and at least one other format return the expected 
> typed oneof in integration tests
>  * ( ) {{mvn -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test}} passes
>  * ( ) README in {{tika-grpc}} and {{tika-grpc-api}} documents migration for 
> clients
>  * ( ) Breaking change called out in release notes / migration guide
> h2. Test plan
> {code:bash}
> ./mvnw -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test
> {code}
>  * Parse sample PDF and HTML via gRPC; verify {{parse_response.hasPdf()}} / 
> {{hasHtml()}}, {{content.body}}, and representative typed fields
>  * Confirm no regression in fetcher CRUD and streaming RPCs
>  * Spot-check Dublin Core and Creative Commons overlay on a fixture with XMP 
> rights
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
