[
https://issues.apache.org/jira/browse/TIKA-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18092719#comment-18092719
]
ASF GitHub Bot commented on TIKA-4766:
--------------------------------------
krickert opened a new pull request, #2916:
URL: https://github.com/apache/tika/pull/2916
## Summary
This change replaces the flat `map<string,string>` on `FetchAndParseReply` with a
typed `org.apache.tika.grpc.v1.ParseResponse`. Parse metadata is mapped from Tika
`Metadata` into protobuf messages aligned with Tika property interfaces (PDF,
Office, HTML, and the other supported formats), with Dublin Core normalized at the
response root and Creative Commons licensing carried on a dedicated field when
present.
The work is split into three Maven modules so clients can depend on the schema and
generated stubs without pulling in the server, and so mapping logic stays testable
outside the gRPC layer:
- **tika-grpc-api**: protobuf sources, Java generation, bundled `FileDescriptorSet`
- **tika-grpc-mapper**: `ParseResponseMapper` and format builders; optional `ParseResponseDecorator` for future extensions (for example document outlines)
- **tika-grpc**: existing service; `FetchAndParseReply.parse_response` replaces the removed `fields` map
This is a breaking change for gRPC clients that read string keys from `fields`.
Migration is documented in `tika-grpc-api/README.md` and summarized in
`tika-grpc/README.md`.
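
To make the migration concrete, here is a minimal before/after sketch. It is not the generated API: the nested records are hypothetical stand-ins for the protobuf classes in `tika-grpc-api` (real stubs would use `getContent()` / `getPdf()` style accessors), the legacy key `xmpTPg:NPages` is an assumption about what the old `fields` map carried, and `n_pages` is assumed to be an integer field.

```java
import java.util.Map;

class MigrationSketch {
    // Hypothetical stand-ins for the generated protobuf messages. The real
    // classes would come from tika-grpc-api and expose getX()/hasX() accessors.
    record ParseContent(String body, String title) {}
    record PdfMetadata(String docInfoProducer, int nPages) {}
    record ParseResponse(ParseContent content, PdfMetadata pdf) {}

    // Before: every value is a string behind a string key, so the client
    // has to know the key and parse the number itself.
    static int pageCountFromFields(Map<String, String> fields) {
        String raw = fields.get("xmpTPg:NPages"); // assumed legacy key
        return raw == null ? 0 : Integer.parseInt(raw);
    }

    // After: the PDF branch carries a typed page count. A null pdf here
    // stands in for "the pdf branch of the oneof is not set".
    static int pageCountFromTyped(ParseResponse response) {
        return response.pdf() == null ? 0 : response.pdf().nPages();
    }

    public static void main(String[] args) {
        Map<String, String> fields = Map.of("xmpTPg:NPages", "12");
        ParseResponse typed = new ParseResponse(
                new ParseContent("body text", "Example"),
                new PdfMetadata("ExampleProducer", 12));
        System.out.println(pageCountFromFields(fields));
        System.out.println(pageCountFromTyped(typed));
    }
}
```

The typed path removes both the key lookup and the string-to-number parsing from client code, which is the main ergonomic gain the summary describes.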
## Architecture
### Module dependencies
```mermaid
flowchart TB
subgraph clients [Clients]
C[gRPC / Java clients]
end
subgraph server [tika-grpc]
S[TikaGrpcServerImpl]
P[Tika Pipes workers]
end
subgraph mapper [tika-grpc-mapper]
M[ParseResponseMapper]
B[Format metadata builders]
D[ParseResponseDecorator optional]
end
subgraph api [tika-grpc-api]
PR[parse_response.proto and format protos]
FD[FileDescriptorSet in META-INF]
J[Generated Java stubs]
end
subgraph tika [Tika core]
AD[AutoDetectParser / Pipes]
MD[Metadata]
end
C -->|FetchAndParse| S
S --> P
P --> AD
AD --> MD
S --> M
M --> B
M --> D
M --> PR
B --> PR
PR --> J
PR --> FD
S -->|FetchAndParseReply.parse_response| C
```
### Parse pipeline
```mermaid
sequenceDiagram
participant Client
participant Grpc as TikaGrpcServerImpl
participant Pipes as Tika Pipes
participant Parser as Tika parser
participant Mapper as ParseResponseMapper
Client->>Grpc: FetchAndParseRequest
Grpc->>Pipes: fetch and parse
Pipes->>Parser: parse bytes
Parser-->>Pipes: Metadata, body text
Pipes-->>Grpc: parse result
Grpc->>Mapper: map metadata, body, status
Mapper->>Mapper: detect format oneof
Mapper->>Mapper: build dublin_core
Mapper->>Mapper: optional CC overlay
Mapper-->>Grpc: ParseResponse
Grpc-->>Client: FetchAndParseReply.parse_response
```
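
The three mapper steps in the diagram (detect the format branch, build Dublin Core, overlay Creative Commons when present) can be sketched roughly as below. This is an illustration under stated assumptions, not the actual `ParseResponseMapper`: a plain `Map<String, String>` stands in for Tika `Metadata`, the nested types are hypothetical, and the keys `Content-Type`, `dc:title`, `dc:creator`, and `xmpRights:WebStatement` are assumed to be the relevant metadata names.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

class MapperPipelineSketch {
    // Hypothetical stand-ins for the oneof branches and mapped output.
    enum Format { PDF, OFFICE, HTML, IMAGE, GENERIC }
    record DublinCore(String title, String creator) {}
    record Mapped(Format format, DublinCore dublinCore, Optional<String> ccWebStatement) {}

    // Step 1: pick the format branch from the detected media type.
    static Format detectFormat(String contentType) {
        if (contentType == null) {
            return Format.GENERIC;
        }
        String type = contentType.toLowerCase(Locale.ROOT);
        if (type.startsWith("application/pdf")) return Format.PDF;
        if (type.startsWith("text/html") || type.startsWith("application/xhtml+xml")) return Format.HTML;
        if (type.startsWith("image/")) return Format.IMAGE;
        if (type.contains("officedocument") || type.startsWith("application/msword")) return Format.OFFICE;
        return Format.GENERIC;
    }

    static Mapped map(Map<String, String> metadata) {
        Format format = detectFormat(metadata.get("Content-Type"));
        // Step 2: Dublin Core is built for every format.
        DublinCore dc = new DublinCore(
                metadata.getOrDefault("dc:title", ""),
                metadata.getOrDefault("dc:creator", ""));
        // Step 3: the Creative Commons overlay is optional and independent
        // of the format branch chosen in step 1.
        Optional<String> cc = Optional.ofNullable(metadata.get("xmpRights:WebStatement"));
        return new Mapped(format, dc, cc);
    }

    public static void main(String[] args) {
        Mapped mapped = map(Map.of(
                "Content-Type", "application/pdf",
                "dc:title", "Example",
                "xmpRights:WebStatement", "https://example.org/license"));
        System.out.println(mapped);
    }
}
```

Keeping the steps as pure functions over metadata is one way to get the "testable outside the gRPC layer" property the PR aims for.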
### ParseResponse layout
```mermaid
classDiagram
class ParseResponse {
string parse_id
ParseStatus status
ParseContent content
DublinCoreMetadata dublin_core
oneof document_metadata
CreativeCommonsMetadata creative_commons
repeated EmbeddedDocument embedded_docs
}
class ParseContent {
string body
string title
}
class PdfMetadata
class OfficeMetadata
class HtmlMetadata
class ImageMetadata
class GenericMetadata
ParseResponse --> ParseContent
ParseResponse --> DublinCoreMetadata
ParseResponse --> PdfMetadata : pdf
ParseResponse --> OfficeMetadata : office
ParseResponse --> HtmlMetadata : html
ParseResponse --> ImageMetadata : image
ParseResponse --> GenericMetadata : generic
```
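
A client would typically branch on which `document_metadata` case is set and read `creative_commons` separately, since it sits outside the oneof. The sketch below is hypothetical: the nested types stand in for the generated classes, and only the five branches shown in the diagram are modeled. The enum mirrors the shape protoc generates for a Java oneof (a `...Case` enum with a `..._NOT_SET` value), but the exact generated names should be checked against the real stubs.

```java
class OneofDispatchSketch {
    // Stand-in for the enum protoc generates for a oneof named document_metadata.
    enum DocumentMetadataCase { PDF, OFFICE, HTML, IMAGE, GENERIC, DOCUMENTMETADATA_NOT_SET }

    // Hypothetical, heavily reduced stand-ins for the generated messages.
    record PdfMetadata(int nPages) {}
    record ParseResponse(DocumentMetadataCase documentMetadataCase,
                         PdfMetadata pdf,
                         String ccWebStatement) {}

    static String describe(ParseResponse response) {
        // Exactly one format branch (or none) is set at a time.
        String format = switch (response.documentMetadataCase()) {
            case PDF -> "pdf, pages=" + response.pdf().nPages();
            case OFFICE -> "office";
            case HTML -> "html";
            case IMAGE -> "image";
            case GENERIC -> "generic";
            case DOCUMENTMETADATA_NOT_SET -> "no format metadata";
        };
        // creative_commons is outside the oneof, so it can accompany any branch.
        return response.ccWebStatement() == null
                ? format
                : format + ", cc=" + response.ccWebStatement();
    }

    public static void main(String[] args) {
        ParseResponse response = new ParseResponse(
                DocumentMetadataCase.PDF, new PdfMetadata(3), "https://example.org/license");
        System.out.println(describe(response));
    }
}
```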
## Breaking API change
| Removed | Replacement |
| --- | --- |
| `FetchAndParseReply.fields` (field 2, reserved) | `FetchAndParseReply.parse_response` (field 5) |
> Typed ParseResponse for Tika gRPC
> ---------------------------------
>
> Key: TIKA-4766
> URL: https://issues.apache.org/jira/browse/TIKA-4766
> Project: Tika
> Issue Type: New Feature
> Components: tika-pipes
> Affects Versions: 4.0.0
> Reporter: Kristian Rickert
> Priority: Major
> Labels: grpc, pipes, protobuf
>
> h2. Summary
> Replace the flat {{FetchAndParseReply.fields}} map ({{map<string,string>}})
> with a typed {{org.apache.tika.grpc.v1.ParseResponse}} on
> {{FetchAndParseReply.parse_response}}.
> Parse output is mapped from Tika {{Metadata}} into protobuf messages aligned
> with Tika property interfaces (PDF, Office, HTML, Image, Email, and the other
> supported formats). Dublin Core fields are normalized on the response root.
> Creative Commons / XMP rights metadata is exposed on a dedicated field when
> present.
> This is a *breaking change* for gRPC clients that read string keys from
> {{fields}}.
> h2. Motivation
> * Clients today must parse hundreds of ad hoc string keys with no schema or
> type safety.
> * Format-specific metadata is easier to consume and evolve with protobuf +
> bundled descriptors.
> * Mapping logic is separated from the gRPC server so it can be tested and
> reused (e.g. by downstream parse services).
> h2. Scope
> *New Maven modules (Tika fork: {{ai-pipestream/tika}}, branch
> {{typed-parse-response-grpc}}):*
> ||Module||Purpose||
> |{{tika-grpc-api}}|Protobuf schema ({{org.apache.tika.grpc.v1}}), Java stubs, {{FileDescriptorSet}} under {{META-INF/}}|
> |{{tika-grpc-mapper}}|{{ParseResponseMapper}} + format builders; optional {{ParseResponseDecorator}} hook for future outline enrichment|
> |{{tika-grpc}}|Wire {{parse_response}} on fetch-and-parse RPCs; remove {{fields}}|
> *Out of scope (follow-up tickets):*
> * PDF/HTML/Markdown outline decorators (proto fields + separate extension)
> * Full CF/NetCDF field mapping in {{ClimateForecastMetadataBuilder}}
> (the current MVP only populates {{additional_struct}})
> * Downstream consumer updates outside the Tika fork
> h2. API change
> ||Removed||Replacement||
> |{{FetchAndParseReply.fields}} (field 2, reserved)|{{FetchAndParseReply.parse_response}} (field 5)|
> *Client migration (examples):*
> {code:java}
> // Body text
> parse_response.content.body
> // Title
> parse_response.content.title
> parse_response.dublin_core.title
> // PDF
> parse_response.pdf.doc_info_producer
> parse_response.pdf.n_pages
> // Creative Commons (alongside primary type)
> parse_response.creative_commons.web_statement
> {code}
> h2. ParseResponse layout
> # *Envelope:* {{parse_id}}, {{parsed_at}}, {{status}}, {{content}}, optional
> {{embedded_docs}}
> # *Dublin Core:* {{dublin_core}} (shared across formats)
> # *Format metadata:* oneof ({{pdf}}, {{office}}, {{html}}, {{image}},
> {{email}}, {{media}}, {{rtf}}, {{database}}, {{font}}, {{epub}}, {{warc}},
> {{climate_forecast}}, {{generic}})
> # *Creative Commons:* {{creative_commons}} when XMP rights metadata is
> present
> h2. Architecture
> {code:none}
> Client
> -> TikaGrpcServerImpl (tika-grpc)
> -> Tika Pipes / parsers -> Metadata + body
> -> ParseResponseMapper (tika-grpc-mapper)
> -> format builders -> ParseResponse (tika-grpc-api)
> <- FetchAndParseReply.parse_response
> {code}
> *Module dependency:* {{tika-grpc}} depends on {{tika-grpc-api}} +
> {{tika-grpc-mapper}}. Clients may depend on {{tika-grpc-api}} alone for
> stubs/descriptors.
> h2. Implementation notes
> * Protos live under {{tika-grpc-api/src/main/proto/org/apache/tika/grpc/v1/}}
> * {{creative_commons}} is field 25 outside the document oneof so it can
> coexist with PDF/Office/etc.
> * {{ParseResponseDecorator}} + {{ParseMapContext}} allow optional
> post-processing (e.g. outlines) without coupling the core mapper to
> PDFBox/HTML libraries
> * Mapper tests use Tika parser test-jar fixtures (~35 tests in
> {{tika-grpc-mapper}})
> h2. Acceptance criteria
> * ( ) {{FetchAndParseReply}} exposes {{parse_response}}; {{fields}} is
> removed/reserved
> * ( ) {{tika-grpc-api}} jar bundles descriptor set at
> {{META-INF/org.apache.tika.grpc.v1.descriptors}}
> * ( ) PDF, Office, HTML, and at least one other format return the expected
> typed oneof in integration tests
> * ( ) {{mvn -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test}} passes
> * ( ) README in {{tika-grpc}} and {{tika-grpc-api}} documents migration for
> clients
> * ( ) Breaking change called out in release notes / migration guide
> h2. Test plan
> {code:bash}
> ./mvnw -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test
> {code}
> * Parse sample PDF and HTML via gRPC; verify {{parse_response.hasPdf()}} /
> {{hasHtml()}}, {{content.body}}, and representative typed fields
> * Confirm no regression in fetcher CRUD and streaming RPCs
> * Spot-check Dublin Core and Creative Commons overlay on a fixture with XMP
> rights
>
>