[
https://issues.apache.org/jira/browse/TIKA-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18092731#comment-18092731
]
ASF GitHub Bot commented on TIKA-4766:
--------------------------------------
tballison commented on PR #2916:
URL: https://github.com/apache/tika/pull/2916#issuecomment-4846751029
Even with agents to help out, I can't stomach 11k lines of code to nail down
maybe 80% of an open set.
I'm really worried about maintenance within the project and then clients
having to rebuild their protos when we change metadata definitions.
We've had churn on value types EVEN for Dublin Core over the history of the
project. Even if we limit custom handling to that, clients will still have to
rebuild their protos when we make changes.
I'd be ok, maybe, with special handling for Dublin Core and some of the Tika
core properties: media type, etc.
Fellow devs (@nddipiazza) what do you think about this?
From Claude: The lossless catch-all is the right idea and the part that
belongs in Tika — it's what should replace the removed fields map. I'd simplify
its shape, though: from repeated MetadataEntry with a typed oneof to a plain
multivalue map<string, StringList>. That keeps the native dict lookup clients
had with the old map<string,string>, fixes the real gap (multivalue), and drops
the per-value typing — which for dynamic keys forces clients to branch on a
6-way union on every read without giving them a compile-time typed accessor
anyway. A new or renamed metadata key still never forces a client rebuild,
because a key is data, not schema. On top of that map I'd add only
special-cased DC + a few core props as typed strings.
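To make the suggested shape concrete, here is a minimal Java sketch of what a plain multivalue map looks like from the client side. It is an illustration only, not Tika or generated protobuf code: the class name `MultiValueMetadata`, its methods, and the keys `dc:creator` and `x-custom:new-key` are all hypothetical stand-ins for a `map<string, StringList>` field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a map<string, StringList> field: key -> ordered values.
// Nothing here is a Tika or protobuf-generated API.
class MultiValueMetadata {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Append a value under a key; repeated keys accumulate instead of overwriting.
    public void add(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Plain dictionary lookup: all values for a key, or an empty list if absent.
    public List<String> getAll(String key) {
        return values.getOrDefault(key, List.of());
    }

    // Convenience for single-valued keys: first value, or null if absent.
    public String getFirst(String key) {
        List<String> all = getAll(key);
        return all.isEmpty() ? null : all.get(0);
    }

    public static void main(String[] args) {
        MultiValueMetadata md = new MultiValueMetadata();
        md.add("dc:creator", "Alice");
        md.add("dc:creator", "Bob");       // multivalue: both creators are kept
        md.add("x-custom:new-key", "42");  // a new key is just data, no schema change

        System.out.println(md.getAll("dc:creator"));        // [Alice, Bob]
        System.out.println(md.getFirst("x-custom:new-key")); // 42
        System.out.println(md.getAll("missing"));            // []
    }
}
```

Every read is a string lookup with no per-value type branching, which is the trade-off described above: clients give up typed values for dynamic keys and in exchange never rebuild when a key is added or renamed.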
@krickert what, specifically, do you need within the Tika project and what
can you do outside of Tika to meet your objectives?
> Typed ParseResponse for Tika gRPC
> ---------------------------------
>
> Key: TIKA-4766
> URL: https://issues.apache.org/jira/browse/TIKA-4766
> Project: Tika
> Issue Type: New Feature
> Components: tika-pipes
> Affects Versions: 4.0.0
> Reporter: Kristian Rickert
> Priority: Major
> Labels: grpc, pipes, protobuf
>
> h2. Summary
> Replace the flat {{FetchAndParseReply.fields}} map
> ({{map<string,string>}}) with a typed
> {{org.apache.tika.grpc.v1.ParseResponse}} on
> {{FetchAndParseReply.parse_response}}.
> Parse output is mapped from Tika {{Metadata}} into protobuf messages aligned
> with Tika property interfaces (PDF, Office, HTML, Image, Email, and the other
> supported formats). Dublin Core fields are normalized on the response root.
> Creative Commons / XMP rights metadata is exposed on a dedicated field when
> present.
> This is a *breaking change* for gRPC clients that read string keys from
> {{fields}}.
> h2. Motivation
> * Clients today must parse hundreds of ad hoc string keys with no schema or
> type safety.
> * Format-specific metadata is easier to consume and evolve with protobuf +
> bundled descriptors.
> * Mapping logic is separated from the gRPC server so it can be tested and
> reused (e.g. by downstream parse services).
> h2. Scope
> *New Maven modules (Tika fork: {{ai-pipestream/tika}}, branch {{typed-parse-response-grpc}}):*
> ||Module||Purpose||
> |{{tika-grpc-api}}|Protobuf schema ({{org.apache.tika.grpc.v1}}), Java stubs, {{FileDescriptorSet}} under {{META-INF/}}|
> |{{tika-grpc-mapper}}|{{ParseResponseMapper}} + format builders; optional {{ParseResponseDecorator}} hook for future outline enrichment|
> |{{tika-grpc}}|Wire {{parse_response}} on fetch-and-parse RPCs; remove {{fields}}|
> *Out of scope (follow-up tickets):*
> * PDF/HTML/Markdown outline decorators (proto fields + separate extension)
> * Full CF/NetCDF field mapping in {{ClimateForecastMetadataBuilder}}
> (currently additional_struct MVP)
> * Downstream consumer updates outside the Tika fork
> h2. API change
> ||Removed||Replacement||
> |{{FetchAndParseReply.fields}} (field 2, reserved)|{{FetchAndParseReply.parse_response}} (field 5)|
> *Client migration (examples):*
> {code:java}
> // Body text
> parse_response.content.body
> // Title
> parse_response.content.title
> parse_response.dublin_core.title
> // PDF
> parse_response.pdf.doc_info_producer
> parse_response.pdf.n_pages
> // Creative Commons (alongside primary type)
> parse_response.creative_commons.web_statement
> {code}
> h2. ParseResponse layout
> # *Envelope:* {{parse_id}}, {{parsed_at}}, {{status}},
> {{content}}, optional {{embedded_docs}}
> # *Dublin Core:* {{dublin_core}} (shared across formats)
> # *Format metadata:* oneof ({{pdf}}, {{office}}, {{html}},
> {{image}}, {{email}}, {{media}}, {{rtf}}, {{database}},
> {{font}}, {{epub}}, {{warc}}, {{climate_forecast}},
> {{generic}})
> # *Creative Commons:* {{creative_commons}} when XMP rights metadata is
> present
> h2. Architecture
> {code:none}
> Client
> -> TikaGrpcServerImpl (tika-grpc)
> -> Tika Pipes / parsers -> Metadata + body
> -> ParseResponseMapper (tika-grpc-mapper)
> -> format builders -> ParseResponse (tika-grpc-api)
> <- FetchAndParseReply.parse_response
> {code}
> *Module dependency:* {{tika-grpc}} depends on {{tika-grpc-api}} +
> {{tika-grpc-mapper}}. Clients may depend on {{tika-grpc-api}} alone for
> stubs/descriptors.
> h2. Implementation notes
> * Protos live under {{tika-grpc-api/src/main/proto/org/apache/tika/grpc/v1/}}
> * {{creative_commons}} is field 25 outside the document oneof so it can
> coexist with PDF/Office/etc.
> * {{ParseResponseDecorator}} + {{ParseMapContext}} allow optional
> post-processing (e.g. outlines) without coupling the core mapper to
> PDFBox/HTML libraries
> * Mapper tests use Tika parser test-jar fixtures (~35 tests in
> {{tika-grpc-mapper}})
> h2. Acceptance criteria
> * ( ) {{FetchAndParseReply}} exposes {{parse_response}}; {{fields}} is
> removed/reserved
> * ( ) {{tika-grpc-api}} jar bundles descriptor set at
> {{META-INF/org.apache.tika.grpc.v1.descriptors}}
> * ( ) PDF, Office, HTML, and at least one other format return the expected
> typed oneof in integration tests
> * ( ) {{mvn -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test}} passes
> * ( ) README in {{tika-grpc}} and {{tika-grpc-api}} documents migration for
> clients
> * ( ) Breaking change called out in release notes / migration guide
> h2. Test plan
> {code:bash}
> ./mvnw -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test
> {code}
> * Parse sample PDF and HTML via gRPC; verify {{parse_response.hasPdf()}} /
> {{hasHtml()}}, {{content.body}}, and representative typed fields
> * Confirm no regression in fetcher CRUD and streaming RPCs
> * Spot-check Dublin Core and Creative Commons overlay on a fixture with XMP
> rights
>
>