[
https://issues.apache.org/jira/browse/TIKA-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18092658#comment-18092658
]
ASF GitHub Bot commented on TIKA-4766:
--------------------------------------
krickert commented on PR #2916:
URL: https://github.com/apache/tika/pull/2916#issuecomment-4844253570
The Claude review was useful and I went through all 7 points. Pushed in
`99883db0c`.
Short version: the "augment, not replace" shape you recommended and where
this landed are now the same: one lossless, populated channel as the source of
truth (`ParseResponse.metadata`), a typed convenience layer on top,
Content-Type trusted, and the oneof dropped in favor of coexisting submessages.
Below is a summary of the changes that line up with the concerns that
you and Claude pointed out (summary provided by Claude based on my initial
write-up):
| # | Concern | Status | What changed |
| -
> Typed ParseResponse for Tika gRPC
> ---------------------------------
>
> Key: TIKA-4766
> URL: https://issues.apache.org/jira/browse/TIKA-4766
> Project: Tika
> Issue Type: New Feature
> Components: tika-pipes
> Affects Versions: 4.0.0
> Reporter: Kristian Rickert
> Priority: Major
> Labels: grpc, pipes, protobuf
>
> h2. Summary
> Replace the flat {{FetchAndParseReply.fields}} map
> ({{map<string,string>}}) with a typed
> {{org.apache.tika.grpc.v1.ParseResponse}} on
> {{FetchAndParseReply.parse_response}}.
> Parse output is mapped from Tika {{Metadata}} into protobuf messages aligned
> with Tika property interfaces (PDF, Office, HTML, Image, Email, and the other
> supported formats). Dublin Core fields are normalized on the response root.
> Creative Commons / XMP rights metadata is exposed on a dedicated field when
> present.
> This is a *breaking change* for gRPC clients that read string keys from
> {{fields}}.
> h2. Motivation
> * Clients today must parse hundreds of ad hoc string keys with no schema or
> type safety.
> * Format-specific metadata is easier to consume and evolve with protobuf +
> bundled descriptors.
> * Mapping logic is separated from the gRPC server so it can be tested and
> reused (e.g. by downstream parse services).
> h2. Scope
> *New Maven modules (Tika fork: {{ai-pipestream/tika}}, branch
> {{typed-parse-response-grpc}}):*
> ||Module||Purpose||
> |{{tika-grpc-api}}|Protobuf schema ({{org.apache.tika.grpc.v1}}), Java
> stubs, {{FileDescriptorSet}} under {{META-INF/}}|
> |{{tika-grpc-mapper}}|{{ParseResponseMapper}} + format builders; optional
> {{ParseResponseDecorator}} hook for future outline enrichment|
> |{{tika-grpc}}|Wire {{parse_response}} on fetch-and-parse RPCs; remove
> {{fields}}|
> *Out of scope (follow-up tickets):*
> * PDF/HTML/Markdown outline decorators (proto fields + separate extension)
> * Full CF/NetCDF field mapping in {{ClimateForecastMetadataBuilder}}
> (currently an {{additional_struct}} MVP)
> * Downstream consumer updates outside the Tika fork
> h2. API change
> ||Removed||Replacement||
> |{{FetchAndParseReply.fields}} (field 2,
> reserved)|{{FetchAndParseReply.parse_response}} (field 5)|
> *Client migration (examples):*
> {code:java}
> // Body text
> parse_response.content.body
> // Title
> parse_response.content.title
> parse_response.dublin_core.title
> // PDF
> parse_response.pdf.doc_info_producer
> parse_response.pdf.n_pages
> // Creative Commons (alongside primary type)
> parse_response.creative_commons.web_statement
> {code}
> h2. ParseResponse layout
> # *Envelope:* {{parse_id}}, {{parsed_at}}, {{status}},
> {{content}}, optional {{embedded_docs}}
> # *Dublin Core:* {{dublin_core}} (shared across formats)
> # *Format metadata:* oneof ({{pdf}}, {{office}}, {{html}},
> {{image}}, {{email}}, {{media}}, {{rtf}}, {{database}},
> {{font}}, {{epub}}, {{warc}}, {{climate_forecast}},
> {{generic}})
> # *Creative Commons:* {{creative_commons}} when XMP rights metadata is
> present
> h2. Architecture
> {code:none}
> Client
> -> TikaGrpcServerImpl (tika-grpc)
> -> Tika Pipes / parsers -> Metadata + body
> -> ParseResponseMapper (tika-grpc-mapper)
> -> format builders -> ParseResponse (tika-grpc-api)
> <- FetchAndParseReply.parse_response
> {code}
> *Module dependency:* {{tika-grpc}} depends on {{tika-grpc-api}} +
> {{tika-grpc-mapper}}. Clients may depend on {{tika-grpc-api}} alone for
> stubs/descriptors.
> h2. Implementation notes
> * Protos live under {{tika-grpc-api/src/main/proto/org/apache/tika/grpc/v1/}}
> * {{creative_commons}} is field 25, outside the document oneof, so it can
> coexist with PDF/Office/etc.
> * {{ParseResponseDecorator}} + {{ParseMapContext}} allow optional
> post-processing (e.g. outlines) without coupling the core mapper to
> PDFBox/HTML libraries
> * Mapper tests use Tika parser test-jar fixtures (~35 tests in
> {{tika-grpc-mapper}})
> h2. Acceptance criteria
> * ( ) {{FetchAndParseReply}} exposes {{parse_response}}; {{fields}} is
> removed/reserved
> * ( ) {{tika-grpc-api}} jar bundles descriptor set at
> {{META-INF/org.apache.tika.grpc.v1.descriptors}}
> * ( ) PDF, Office, HTML, and at least one other format return the expected
> typed oneof in integration tests
> * ( ) {{mvn -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test}} passes
> * ( ) README in {{tika-grpc}} and {{tika-grpc-api}} documents migration for
> clients
> * ( ) Breaking change called out in release notes / migration guide
> h2. Test plan
> {code:bash}
> ./mvnw -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test
> {code}
> * Parse sample PDF and HTML via gRPC; verify {{parse_response.hasPdf()}} /
> {{hasHtml()}}, {{content.body}}, and representative typed fields
> * Confirm no regression in fetcher CRUD and streaming RPCs
> * Spot-check Dublin Core and Creative Commons overlay on a fixture with XMP
> rights
>
>
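To make the "lossless channel plus typed convenience layer" shape concrete, here is a minimal, self-contained Java sketch. It does not use the generated stubs or the real `ParseResponseMapper`: `TypedParseSketch`, `ParseView`, `PdfView`, and `fromFlat` are hypothetical names, and the metadata keys (`dc:title`, `pdf:docinfo:producer`, `xmpTPg:NPages`) are assumed to be what the flat map carried. It only illustrates the idea of keeping every raw key while offering typed accessors on top.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class TypedParseSketch {

    /** Hypothetical stand-in for a typed PDF submessage. */
    static final class PdfView {
        final String docInfoProducer; // null when the key is absent
        final Integer nPages;         // null when absent or not a number

        PdfView(String docInfoProducer, Integer nPages) {
            this.docInfoProducer = docInfoProducer;
            this.nPages = nPages;
        }
    }

    /** Hypothetical stand-in for the response: raw metadata plus typed views. */
    static final class ParseView {
        final Map<String, String> metadata; // lossless copy of every key
        final String title;                 // typed convenience field
        final PdfView pdf;                  // null when no PDF keys are present

        ParseView(Map<String, String> metadata, String title, PdfView pdf) {
            this.metadata = metadata;
            this.title = title;
            this.pdf = pdf;
        }
    }

    // Assumed key names; the real mapper may read different properties.
    static final String KEY_TITLE = "dc:title";
    static final String KEY_PRODUCER = "pdf:docinfo:producer";
    static final String KEY_PAGES = "xmpTPg:NPages";

    /** Builds the typed view without discarding any of the original keys. */
    static ParseView fromFlat(Map<String, String> flat) {
        PdfView pdf = null;
        if (flat.containsKey(KEY_PRODUCER) || flat.containsKey(KEY_PAGES)) {
            pdf = new PdfView(flat.get(KEY_PRODUCER), parseIntOrNull(flat.get(KEY_PAGES)));
        }
        Map<String, String> copy = Collections.unmodifiableMap(new LinkedHashMap<>(flat));
        return new ParseView(copy, flat.get(KEY_TITLE), pdf);
    }

    /** Returns null instead of throwing so a malformed value never fails the parse. */
    static Integer parseIntOrNull(String s) {
        if (s == null) {
            return null;
        }
        try {
            return Integer.valueOf(s.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        Map<String, String> flat = new LinkedHashMap<>();
        flat.put(KEY_TITLE, "Sample");
        flat.put(KEY_PAGES, "12");
        flat.put("custom:unmapped", "still here");

        ParseView view = fromFlat(flat);
        System.out.println("title=" + view.title);
        System.out.println("pages=" + view.pdf.nPages);
        System.out.println("raw=" + view.metadata.get("custom:unmapped"));
    }
}
```

In this sketch a key with no typed home (`custom:unmapped`) stays readable through `metadata`, which is the property that makes the typed layer an augmentation and not a replacement.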