[ 
https://issues.apache.org/jira/browse/TIKA-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18092719#comment-18092719
 ] 

ASF GitHub Bot commented on TIKA-4766:
--------------------------------------

krickert opened a new pull request, #2916:
URL: https://github.com/apache/tika/pull/2916

   ## Summary
   
   This change replaces the flat `map<string,string>` on `FetchAndParseReply`
   with a typed `org.apache.tika.grpc.v1.ParseResponse`. Parse metadata is
   mapped from Tika `Metadata` into protobuf messages aligned with Tika
   property interfaces (PDF, Office, HTML, and the other supported formats),
   with Dublin Core normalized at the response root and Creative Commons
   licensing carried on a dedicated field when present.
   
   The work is split into three Maven modules so clients can depend on the
   schema and generated stubs without pulling in the server, and so mapping
   logic stays testable outside the gRPC layer:
   
   - **tika-grpc-api**: protobuf sources, Java generation, bundled
     `FileDescriptorSet`
   - **tika-grpc-mapper**: `ParseResponseMapper` and format builders; optional
     `ParseResponseDecorator` for future extensions (for example, document
     outlines)
   - **tika-grpc**: existing service; `FetchAndParseReply.parse_response`
     replaces the removed `fields` map
   
   This is a breaking change for gRPC clients that read string keys from
   `fields`. Migration is documented in `tika-grpc-api/README.md` and
   summarized in `tika-grpc/README.md`.
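   
   As a rough illustration of what the migration means for a client, the
   sketch below contrasts a string-key lookup with typed access. The records
   are hypothetical stand-ins for the generated stubs (the real classes come
   from `tika-grpc-api` and use protobuf getters), and the metadata key is
   only an illustrative example of the kind of key a client might have read
   from `fields`.
   
   ```java
   import java.util.Map;
   
   // Hypothetical stand-ins for the generated protobuf classes; the real stubs
   // come from tika-grpc-api and expose getters rather than record accessors.
   class MigrationSketch {
       record PdfMetadata(String docInfoProducer, int nPages) {}
       record ParseContent(String body, String title) {}
       record ParseResponse(ParseContent content, PdfMetadata pdf) {}
   
       // Before: every value is a string looked up by an ad hoc key.
       static int pagesFromFields(Map<String, String> fields) {
           String raw = fields.get("xmpTPg:NPages"); // illustrative key
           return raw == null ? 0 : Integer.parseInt(raw);
       }
   
       // After: the field is typed, so no key lookup or number parsing is needed.
       static int pagesFromResponse(ParseResponse response) {
           return response.pdf() == null ? 0 : response.pdf().nPages();
       }
   
       public static void main(String[] args) {
           Map<String, String> fields = Map.of("xmpTPg:NPages", "12");
           ParseResponse response = new ParseResponse(
                   new ParseContent("Hello", "Report"),
                   new PdfMetadata("ExampleProducer", 12));
           System.out.println(pagesFromFields(fields));
           System.out.println(pagesFromResponse(response));
       }
   }
   ```
   
   With the typed form, a missing or misspelled key becomes a compile-time
   error instead of a silent `null` at runtime.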
   
   ## Architecture
   
   ### Module dependencies
   
   ```mermaid
   flowchart TB
     subgraph clients [Clients]
       C[gRPC / Java clients]
     end
   
     subgraph server [tika-grpc]
       S[TikaGrpcServerImpl]
       P[Tika Pipes workers]
     end
   
     subgraph mapper [tika-grpc-mapper]
       M[ParseResponseMapper]
       B[Format metadata builders]
       D[ParseResponseDecorator optional]
     end
   
     subgraph api [tika-grpc-api]
       PR[parse_response.proto and format protos]
       FD[FileDescriptorSet in META-INF]
       J[Generated Java stubs]
     end
   
     subgraph tika [Tika core]
       AD[AutoDetectParser / Pipes]
       MD[Metadata]
     end
   
     C -->|FetchAndParse| S
     S --> P
     P --> AD
     AD --> MD
     S --> M
     M --> B
     M --> D
     M --> PR
     B --> PR
     PR --> J
     PR --> FD
     S -->|FetchAndParseReply.parse_response| C
   ```
   
   ### Parse pipeline
   
   ```mermaid
   sequenceDiagram
     participant Client
     participant Grpc as TikaGrpcServerImpl
     participant Pipes as Tika Pipes
     participant Parser as Tika parser
     participant Mapper as ParseResponseMapper
   
     Client->>Grpc: FetchAndParseRequest
     Grpc->>Pipes: fetch and parse
     Pipes->>Parser: parse bytes
     Parser-->>Pipes: Metadata, body text
     Pipes-->>Grpc: parse result
     Grpc->>Mapper: map metadata, body, status
     Mapper->>Mapper: detect format oneof
     Mapper->>Mapper: build dublin_core
     Mapper->>Mapper: optional CC overlay
     Mapper-->>Grpc: ParseResponse
     Grpc-->>Client: FetchAndParseReply.parse_response
   ```
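   
   The three mapper steps in the diagram (detect the format, build
   `dublin_core`, apply the optional Creative Commons overlay) can be sketched
   roughly as follows. This is a simplified model, not the actual
   `ParseResponseMapper`: a plain `Map` stands in for Tika `Metadata`, all
   types are hypothetical, and the metadata keys are illustrative assumptions.
   
   ```java
   import java.util.Map;
   import java.util.Optional;
   
   // Hypothetical, simplified model of the mapping steps; none of these types
   // are the real tika-grpc-mapper or tika-grpc-api classes.
   class MapperSketch {
       interface DocumentMetadata {} // stands in for the protobuf oneof
       record PdfMetadata(int nPages) implements DocumentMetadata {}
       record HtmlMetadata(String encoding) implements DocumentMetadata {}
       record GenericMetadata(String contentType) implements DocumentMetadata {}
       record DublinCore(String title, String creator) {}
       record CreativeCommons(String webStatement) {}
       record ParseResponse(String body, DublinCore dublinCore,
                            DocumentMetadata document,
                            Optional<CreativeCommons> creativeCommons) {}
   
       static ParseResponse map(Map<String, String> metadata, String body) {
           // 1. Pick exactly one format-specific message from the content type.
           String type = metadata.getOrDefault("Content-Type", "application/octet-stream");
           DocumentMetadata document;
           if (type.startsWith("application/pdf")) {
               document = new PdfMetadata(
                       Integer.parseInt(metadata.getOrDefault("xmpTPg:NPages", "0")));
           } else if (type.startsWith("text/html")) {
               document = new HtmlMetadata(metadata.getOrDefault("Content-Encoding", ""));
           } else {
               document = new GenericMetadata(type);
           }
           // 2. Dublin Core is built for every format, at the response root.
           DublinCore dc = new DublinCore(
                   metadata.getOrDefault("dc:title", ""),
                   metadata.getOrDefault("dc:creator", ""));
           // 3. The Creative Commons overlay is set only when rights metadata exists.
           Optional<CreativeCommons> cc = Optional
                   .ofNullable(metadata.get("xmpRights:WebStatement"))
                   .map(CreativeCommons::new);
           return new ParseResponse(body, dc, document, cc);
       }
   }
   ```
   
   Keeping this logic free of gRPC types is what allows it to be unit-tested
   without starting the server.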
   
   ### ParseResponse layout
   
   ```mermaid
   classDiagram
     class ParseResponse {
       string parse_id
       ParseStatus status
       ParseContent content
       DublinCoreMetadata dublin_core
       oneof document_metadata
       CreativeCommonsMetadata creative_commons
       repeated EmbeddedDocument embedded_docs
     }
   
     class ParseContent {
       string body
       string title
     }
   
     class PdfMetadata
     class OfficeMetadata
     class HtmlMetadata
     class ImageMetadata
     class GenericMetadata
   
     ParseResponse --> ParseContent
     ParseResponse --> DublinCoreMetadata
     ParseResponse --> PdfMetadata : pdf
     ParseResponse --> OfficeMetadata : office
     ParseResponse --> HtmlMetadata : html
     ParseResponse --> ImageMetadata : image
     ParseResponse --> GenericMetadata : generic
   ```
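   
   On the client side, a oneof is typically consumed by switching on its case.
   The sketch below assumes the usual protobuf-java convention of a generated
   case enum; the enum and record here are hypothetical stand-ins written by
   hand, and only a subset of the formats is listed.
   
   ```java
   // Hand-written stand-ins; the real case enum and message are generated
   // from the protos in tika-grpc-api and may differ in detail.
   class OneofSketch {
       enum DocumentMetadataCase {
           PDF, OFFICE, HTML, IMAGE, GENERIC, DOCUMENTMETADATA_NOT_SET
       }
   
       // Flattened stand-in for ParseResponse: one shared Dublin Core title
       // plus the marker for which format branch is populated.
       record ParseResponse(String dublinCoreTitle,
                            DocumentMetadataCase documentMetadataCase) {}
   
       static String describe(ParseResponse response) {
           // Exactly one branch of the oneof is set, so the switch is exhaustive.
           String kind = switch (response.documentMetadataCase()) {
               case PDF -> "PDF document";
               case OFFICE -> "Office document";
               case HTML -> "HTML page";
               case IMAGE -> "image";
               case GENERIC -> "generic file";
               case DOCUMENTMETADATA_NOT_SET -> "unknown format";
           };
           // Dublin Core lives at the root, so it is available for every branch.
           return kind + ": " + response.dublinCoreTitle();
       }
   }
   ```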
   
   ## Breaking API change
   
   | Removed | Replacement |
   | ---------------------------------------------

> Typed ParseResponse for Tika gRPC
> ---------------------------------
>
>                 Key: TIKA-4766
>                 URL: https://issues.apache.org/jira/browse/TIKA-4766
>             Project: Tika
>          Issue Type: New Feature
>          Components: tika-pipes
>    Affects Versions: 4.0.0
>            Reporter: Kristian Rickert
>            Priority: Major
>              Labels: grpc, pipes, protobuf
>
> h2. Summary
> Replace the flat {{FetchAndParseReply.fields}} map 
> ({{map<string,string>}}) with a typed 
> {{org.apache.tika.grpc.v1.ParseResponse}} on 
> {{FetchAndParseReply.parse_response}}.
> Parse output is mapped from Tika {{Metadata}} into protobuf messages aligned 
> with Tika property interfaces (PDF, Office, HTML, Image, Email, and the other 
> supported formats). Dublin Core fields are normalized on the response root. 
> Creative Commons / XMP rights metadata is exposed on a dedicated field when 
> present.
> This is a *breaking change* for gRPC clients that read string keys from 
> {{fields}}.
> h2. Motivation
>  * Clients today must parse hundreds of ad hoc string keys with no schema or 
> type safety.
>  * Format-specific metadata is easier to consume and evolve with protobuf + 
> bundled descriptors.
>  * Mapping logic is separated from the gRPC server so it can be tested and 
> reused (e.g. by downstream parse services).
> h2. Scope
> *New Maven modules (Tika fork: {{ai-pipestream/tika}}, branch {{typed-parse-response-grpc}}):*
> ||Module||Purpose||
> |{{tika-grpc-api}}|Protobuf schema ({{org.apache.tika.grpc.v1}}), Java stubs, {{FileDescriptorSet}} under {{META-INF/}}|
> |{{tika-grpc-mapper}}|{{ParseResponseMapper}} + format builders; optional {{ParseResponseDecorator}} hook for future outline enrichment|
> |{{tika-grpc}}|Wire {{parse_response}} on fetch-and-parse RPCs; remove {{fields}}|
> *Out of scope (follow-up tickets):*
>  * PDF/HTML/Markdown outline decorators (proto fields + separate extension)
>  * Full CF/NetCDF field mapping in {{ClimateForecastMetadataBuilder}} 
> (currently additional_struct MVP)
>  * Downstream consumer updates outside the Tika fork
> h2. API change
> ||Removed||Replacement||
> |{{FetchAndParseReply.fields}} (field 2, reserved)|{{FetchAndParseReply.parse_response}} (field 5)|
> *Client migration (examples):*
> {code:java}
> // Body text
> parse_response.content.body
> // Title
> parse_response.content.title
> parse_response.dublin_core.title
> // PDF
> parse_response.pdf.doc_info_producer
> parse_response.pdf.n_pages
> // Creative Commons (alongside primary type)
> parse_response.creative_commons.web_statement
> {code}
> h2. ParseResponse layout
>  # *Envelope:* {{parse_id}}, {{parsed_at}}, {{status}}, {{content}}, 
> optional {{embedded_docs}}
>  # *Dublin Core:* {{dublin_core}} (shared across formats)
>  # *Format metadata:* oneof ({{pdf}}, {{office}}, {{html}}, {{image}}, 
> {{email}}, {{media}}, {{rtf}}, {{database}}, {{font}}, {{epub}}, {{warc}}, 
> {{climate_forecast}}, {{generic}})
>  # *Creative Commons:* {{creative_commons}} when XMP rights metadata is 
> present
> h2. Architecture
> {code:none}
> Client
>   -> TikaGrpcServerImpl (tika-grpc)
>        -> Tika Pipes / parsers -> Metadata + body
>        -> ParseResponseMapper (tika-grpc-mapper)
>             -> format builders -> ParseResponse (tika-grpc-api)
>   <- FetchAndParseReply.parse_response
> {code}
> *Module dependency:* {{tika-grpc}} depends on {{tika-grpc-api}} + 
> {{tika-grpc-mapper}}. Clients may depend on {{tika-grpc-api}} alone for 
> stubs/descriptors.
> h2. Implementation notes
>  * Protos live under {{tika-grpc-api/src/main/proto/org/apache/tika/grpc/v1/}}
>  * {{creative_commons}} is field 25 outside the document oneof so it can 
> coexist with PDF/Office/etc.
>  * {{ParseResponseDecorator}} + {{ParseMapContext}} allow optional 
> post-processing (e.g. outlines) without coupling the core mapper to 
> PDFBox/HTML libraries
>  * Mapper tests use Tika parser test-jar fixtures (~35 tests in 
> {{tika-grpc-mapper}})
> h2. Acceptance criteria
>  * ( ) {{FetchAndParseReply}} exposes {{parse_response}}; {{fields}} is 
> removed/reserved
>  * ( ) {{tika-grpc-api}} jar bundles descriptor set at 
> {{META-INF/org.apache.tika.grpc.v1.descriptors}}
>  * ( ) PDF, Office, HTML, and at least one other format return the expected 
> typed oneof in integration tests
>  * ( ) {{mvn -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test}} passes
>  * ( ) README in {{tika-grpc}} and {{tika-grpc-api}} documents migration for 
> clients
>  * ( ) Breaking change called out in release notes / migration guide
> h2. Test plan
> {code:bash}
> ./mvnw -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test
> {code}
>  * Parse sample PDF and HTML via gRPC; verify {{parse_response.hasPdf()}} / 
> {{hasHtml()}}, {{content.body}}, and representative typed fields
>  * Confirm no regression in fetcher CRUD and streaming RPCs
>  * Spot-check Dublin Core and Creative Commons overlay on a fixture with XMP 
> rights
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
