[ 
https://issues.apache.org/jira/browse/TIKA-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18093033#comment-18093033
 ] 

ASF GitHub Bot commented on TIKA-4766:
--------------------------------------

krickert commented on PR #2916:
URL: https://github.com/apache/tika/pull/2916#issuecomment-4859189416

   @nddipiazza the biggest problem is that gRPC is not compatible with REST, 
and vice versa.  They are built on different modeling concepts (I think 
the proto spec is 100x better).  However, protos are easy to return as JSON; 
the reverse is not true.
   
   I'm going to close this branch in favor of my new one.  It comes down to this: 
you should always model your gRPC service as gRPC, not as a mimic of REST.  
It's sort of like someone using Hungarian notation in Java code: it just looks 
strange.
   
   That being said, the app entities are not good at this.  Everyone treats 
gRPC as second-class and wants a shortcut, but that's the wrong approach.  
   
   When you define the interface, you are coding 12 clients at the same time.  
It may look like a lot, but it's far less code.  




> Typed ParseResponse for Tika gRPC
> ---------------------------------
>
>                 Key: TIKA-4766
>                 URL: https://issues.apache.org/jira/browse/TIKA-4766
>             Project: Tika
>          Issue Type: New Feature
>          Components: tika-pipes
>    Affects Versions: 4.0.0
>            Reporter: Kristian Rickert
>            Priority: Major
>              Labels: grpc, pipes, protobuf
>
> h2. Summary
> Replace the flat {{FetchAndParseReply.fields}} map 
> ({{map<string,string>}}) with a typed 
> {{org.apache.tika.grpc.v1.ParseResponse}} on 
> {{FetchAndParseReply.parse_response}}.
> Parse output is mapped from Tika {{Metadata}} into protobuf messages aligned 
> with Tika property interfaces (PDF, Office, HTML, Image, Email, and the other 
> supported formats). Dublin Core fields are normalized on the response root. 
> Creative Commons / XMP rights metadata is exposed on a dedicated field when 
> present.
> This is a *breaking change* for gRPC clients that read string keys from 
> {{fields}}.
> h2. Motivation
>  * Clients today must parse hundreds of ad hoc string keys with no schema or 
> type safety.
>  * Format-specific metadata is easier to consume and evolve with protobuf + 
> bundled descriptors.
>  * Mapping logic is separated from the gRPC server so it can be tested and 
> reused (e.g. by downstream parse services).
> h2. Scope
> *New Maven modules (Tika fork: {{ai-pipestream/tika}}, branch 
> {{typed-parse-response-grpc}}):*
> ||Module||Purpose||
> |{{tika-grpc-api}}|Protobuf schema ({{org.apache.tika.grpc.v1}}), Java 
> stubs, {{FileDescriptorSet}} under {{META-INF/}}|
> |{{tika-grpc-mapper}}|{{ParseResponseMapper}} + format builders; optional 
> {{ParseResponseDecorator}} hook for future outline enrichment|
> |{{tika-grpc}}|Wire {{parse_response}} on fetch-and-parse RPCs; remove 
> {{fields}}|
> *Out of scope (follow-up tickets):*
>  * PDF/HTML/Markdown outline decorators (proto fields + separate extension)
>  * Full CF/NetCDF field mapping in {{ClimateForecastMetadataBuilder}} 
> (currently an {{additional_struct}} MVP)
>  * Downstream consumer updates outside the Tika fork
> h2. API change
> ||Removed||Replacement||
> |{{FetchAndParseReply.fields}} (field 2, 
> reserved)|{{FetchAndParseReply.parse_response}} (field 5)|
> *Client migration (examples):*
> {code:java}
> // Body text
> parse_response.content.body
> // Title
> parse_response.content.title
> parse_response.dublin_core.title
> // PDF
> parse_response.pdf.doc_info_producer
> parse_response.pdf.n_pages
> // Creative Commons (alongside primary type)
> parse_response.creative_commons.web_statement
> {code}
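
To make the migration above concrete, here is a minimal Java sketch contrasting the old string-keyed lookup with typed access. The nested records are hypothetical stand-ins for the generated stubs (the real classes would come from tika-grpc-api and presumably expose protobuf-style getters), and the metadata key used in the map is only illustrative.

```java
import java.util.Map;

public class MigrationSketch {
    // Hypothetical stand-ins for the generated protobuf messages.
    public record Content(String title, String body) {}
    public record Pdf(String docInfoProducer, int nPages) {}
    public record ParseResponse(Content content, Pdf pdf) {}

    // Before: the client looks up a string key and parses the value itself.
    // The key name is an illustrative assumption.
    public static int pageCountFromFields(Map<String, String> fields) {
        String raw = fields.get("xmpTPg:NPages");
        return raw == null ? 0 : Integer.parseInt(raw);
    }

    // After: the value is already typed; only presence has to be checked.
    public static int pageCountFromTyped(ParseResponse response) {
        return response.pdf() == null ? 0 : response.pdf().nPages();
    }

    public static void main(String[] args) {
        Map<String, String> fields = Map.of("xmpTPg:NPages", "3");
        ParseResponse typed = new ParseResponse(
                new Content("Sample", "Hello"), new Pdf("ExampleProducer", 3));
        System.out.println(pageCountFromFields(fields));
        System.out.println(pageCountFromTyped(typed));
    }
}
```

The typed path removes the per-client parsing step and turns a misspelled key into a compile error rather than a silent null.
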
> h2. ParseResponse layout
>  # *Envelope:* {{parse_id}}, {{parsed_at}}, {{status}}, 
> {{content}}, optional {{embedded_docs}}
>  # *Dublin Core:* {{dublin_core}} (shared across formats)
>  # *Format metadata:* oneof ({{pdf}}, {{office}}, {{html}}, 
> {{image}}, {{email}}, {{media}}, {{rtf}}, {{database}}, 
> {{font}}, {{epub}}, {{warc}}, {{climate_forecast}}, 
> {{generic}})
>  # *Creative Commons:* {{creative_commons}} when XMP rights metadata is 
> present
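
The layout above can be read as "exactly one format payload, plus optional shared overlays". A rough Java analogy, assuming nothing about the real generated classes: a sealed interface plays the role of the oneof, while the Creative Commons field sits beside it. All type names here are hypothetical, and only three of the listed formats are modeled.

```java
public class LayoutSketch {
    // Stand-in for the oneof: a response carries exactly one of these.
    public sealed interface FormatMetadata permits PdfMeta, HtmlMeta, GenericMeta {}
    public record PdfMeta(int nPages) implements FormatMetadata {}
    public record HtmlMeta(String title) implements FormatMetadata {}
    public record GenericMeta() implements FormatMetadata {}

    // Lives outside the oneof, so it can accompany any format.
    public record CreativeCommons(String webStatement) {}

    public record ParseResponse(String parseId,
                                FormatMetadata format,
                                CreativeCommons creativeCommons) {}

    // Branch on the format that is set, then note the overlay if present.
    public static String describe(ParseResponse response) {
        String kind;
        if (response.format() instanceof PdfMeta pdf) {
            kind = "pdf(" + pdf.nPages() + " pages)";
        } else if (response.format() instanceof HtmlMeta html) {
            kind = "html(" + html.title() + ")";
        } else {
            kind = "generic";
        }
        return response.creativeCommons() == null ? kind : kind + "+cc";
    }
}
```

Keeping the rights metadata outside the sealed hierarchy mirrors the ticket's choice to let it coexist with whichever format is reported.
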
> h2. Architecture
> {code:none}
> Client
>   -> TikaGrpcServerImpl (tika-grpc)
>        -> Tika Pipes / parsers -> Metadata + body
>        -> ParseResponseMapper (tika-grpc-mapper)
>             -> format builders -> ParseResponse (tika-grpc-api)
>   <- FetchAndParseReply.parse_response
> {code}
> *Module dependency:* {{tika-grpc}} depends on {{tika-grpc-api}} + 
> {{tika-grpc-mapper}}. Clients may depend on {{tika-grpc-api}} alone for 
> stubs/descriptors.
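
One way to picture the "mapper -> format builders" step in the diagram is a lookup from content type to a builder function, with a generic fallback. This is only a sketch under assumptions: the registry, the `Typed` record, and the metadata keys are invented for illustration and are not taken from the actual {{ParseResponseMapper}}.

```java
import java.util.Map;
import java.util.function.Function;

public class MapperSketch {
    // Hypothetical simplified result: which oneof branch was chosen, plus its values.
    public record Typed(String kind, Map<String, String> values) {}

    // Hypothetical registry of format builders keyed by MIME type.
    private static final Map<String, Function<Map<String, String>, Typed>> BUILDERS = Map.of(
            "application/pdf",
            md -> new Typed("pdf", Map.of("n_pages", md.getOrDefault("xmpTPg:NPages", "0"))),
            "text/html",
            md -> new Typed("html", Map.of("title", md.getOrDefault("dc:title", ""))));

    // Fallback used when no format-specific builder is registered.
    private static Typed generic(Map<String, String> md) {
        return new Typed("generic", Map.copyOf(md));
    }

    public static Typed map(Map<String, String> metadata) {
        String type = metadata.getOrDefault("Content-Type", "");
        int semicolon = type.indexOf(';');
        if (semicolon >= 0) {
            // Drop parameters such as "; charset=UTF-8" before the lookup.
            type = type.substring(0, semicolon);
        }
        Function<Map<String, String>, Typed> builder = BUILDERS.get(type.trim());
        return builder == null ? generic(metadata) : builder.apply(metadata);
    }
}
```

Because the mapping depends only on a metadata map, it can be unit-tested without starting a gRPC server, which is the separation the ticket argues for.
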
> h2. Implementation notes
>  * Protos live under {{tika-grpc-api/src/main/proto/org/apache/tika/grpc/v1/}}
>  * {{creative_commons}} is field 25 outside the document oneof so it can 
> coexist with PDF/Office/etc.
>  * {{ParseResponseDecorator}} + {{ParseMapContext}} allow optional 
> post-processing (e.g. outlines) without coupling the core mapper to 
> PDFBox/HTML libraries
>  * Mapper tests use Tika parser test-jar fixtures (~35 tests in 
> {{tika-grpc-mapper}})
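
The decorator hook mentioned in the notes above could look roughly like the following. The names {{ParseResponseDecorator}} and {{ParseMapContext}} come from the ticket, but their shapes here (method signature, record fields, the simplified `Response`) are assumptions made for the sketch, not the fork's actual API.

```java
import java.util.List;
import java.util.Map;

public class DecoratorSketch {
    // Assumed context: what the core mapper already knows about the document.
    public record ParseMapContext(String contentType, Map<String, String> metadata) {}

    // Simplified stand-in for the typed response.
    public record Response(String title, String outline) {}

    // Assumed hook shape: take the mapped response and return an enriched copy.
    @FunctionalInterface
    public interface ParseResponseDecorator {
        Response decorate(Response base, ParseMapContext context);
    }

    // Core mapping first, then each optional decorator in order.
    public static Response map(ParseMapContext context,
                               List<ParseResponseDecorator> decorators) {
        Response response = new Response(
                context.metadata().getOrDefault("dc:title", ""), "");
        for (ParseResponseDecorator decorator : decorators) {
            response = decorator.decorate(response, context);
        }
        return response;
    }
}
```

With this shape, an outline extension that needs PDFBox or an HTML library can live in its own module and be passed in as a decorator, so the core mapper never imports those libraries.
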
> h2. Acceptance criteria
>  * ( ) {{FetchAndParseReply}} exposes {{parse_response}}; {{fields}} is 
> removed/reserved
>  * ( ) {{tika-grpc-api}} jar bundles descriptor set at 
> {{META-INF/org.apache.tika.grpc.v1.descriptors}}
>  * ( ) PDF, Office, HTML, and at least one other format return the expected 
> typed oneof in integration tests
>  * ( ) {{mvn -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test}} passes
>  * ( ) README in {{tika-grpc}} and {{tika-grpc-api}} documents migration for 
> clients
>  * ( ) Breaking change called out in release notes / migration guide
> h2. Test plan
> {code:bash}
> ./mvnw -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test
> {code}
>  * Parse sample PDF and HTML via gRPC; verify {{parse_response.hasPdf()}} / 
> {{hasHtml()}}, {{content.body}}, and representative typed fields
>  * Confirm no regression in fetcher CRUD and streaming RPCs
>  * Spot-check Dublin Core and Creative Commons overlay on a fixture with XMP 
> rights
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
