[
https://issues.apache.org/jira/browse/TIKA-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kristian Rickert updated TIKA-4766:
-----------------------------------
Summary: Replace tika-grpc fields map with a typed Document parse contract
(was: Typed ParseResponse for Tika gRPC)
> Replace tika-grpc fields map with a typed Document parse contract
> -----------------------------------------------------------------
>
> Key: TIKA-4766
> URL: https://issues.apache.org/jira/browse/TIKA-4766
> Project: Tika
> Issue Type: New Feature
> Components: tika-pipes
> Affects Versions: 4.0.0
> Reporter: Kristian Rickert
> Priority: Major
> Labels: grpc, pipes, protobuf
>
> h2. Summary
> Replace the flat {{FetchAndParseReply.fields}} map
> ({{map<string,string>}}) with a typed
> {{org.apache.tika.grpc.v1.ParseResponse}} on
> {{FetchAndParseReply.parse_response}}.
> Parse output is mapped from Tika {{Metadata}} into protobuf messages aligned
> with Tika property interfaces (PDF, Office, HTML, Image, Email, and the other
> supported formats). Dublin Core fields are normalized on the response root.
> Creative Commons / XMP rights metadata is exposed on a dedicated field when
> present.
> This is a *breaking change* for gRPC clients that read string keys from
> {{fields}}.
> h2. Motivation
> * Clients today must parse hundreds of ad hoc string keys with no schema or
> type safety.
> * Format-specific metadata is easier to consume and evolve with protobuf +
> bundled descriptors.
> * Mapping logic is separated from the gRPC server so it can be tested and
> reused (e.g. by downstream parse services).
> h2. Scope
> *New Maven modules (Tika fork: {{ai-pipestream/tika}}, branch {{typed-parse-response-grpc}}):*
> ||Module||Purpose||
> |{{tika-grpc-api}}|Protobuf schema ({{org.apache.tika.grpc.v1}}), Java stubs, {{FileDescriptorSet}} under {{META-INF/}}|
> |{{tika-grpc-mapper}}|{{ParseResponseMapper}} + format builders; optional {{ParseResponseDecorator}} hook for future outline enrichment|
> |{{tika-grpc}}|Wire {{parse_response}} on fetch-and-parse RPCs; remove {{fields}}|
> *Out of scope (follow-up tickets):*
> * PDF/HTML/Markdown outline decorators (proto fields + separate extension)
> * Full CF/NetCDF field mapping in {{ClimateForecastMetadataBuilder}}
> (currently an MVP that relies on {{additional_struct}})
> * Downstream consumer updates outside the Tika fork
> h2. API change
> ||Removed||Replacement||
> |{{FetchAndParseReply.fields}} (field 2, reserved)|{{FetchAndParseReply.parse_response}} (field 5)|
> *Client migration (examples):*
> {code:java}
> // Body text
> parse_response.content.body
> // Title
> parse_response.content.title
> parse_response.dublin_core.title
> // PDF
> parse_response.pdf.doc_info_producer
> parse_response.pdf.n_pages
> // Creative Commons (alongside primary type)
> parse_response.creative_commons.web_statement
> {code}
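> The migration above might look roughly like this in client code. This is
> only a self-contained sketch: the records are hypothetical stand-ins for the
> generated stubs in {{tika-grpc-api}} (which would expose protobuf getters
> rather than record accessors), and the string key {{xmpTPg:NPages}} is used
> purely as an illustration of the old map-based lookup.
> {code:java}
> import java.util.Map;
>
> public class MigrationSketch {
>     // Hypothetical stand-ins for the generated protobuf messages.
>     public record Content(String body, String title) {}
>     public record Pdf(String docInfoProducer, int nPages) {}
>     public record ParseResponse(Content content, Pdf pdf) {}
>
>     // Before: look up a string key and parse the value by hand.
>     public static int pagesFromFields(Map<String, String> fields) {
>         String raw = fields.get("xmpTPg:NPages"); // illustrative key
>         return raw == null ? 0 : Integer.parseInt(raw);
>     }
>
>     // After: the page count is already a typed field.
>     public static int pagesFromTyped(ParseResponse response) {
>         return response.pdf() == null ? 0 : response.pdf().nPages();
>     }
>
>     public static void main(String[] args) {
>         ParseResponse typed = new ParseResponse(
>                 new Content("Hello", "Sample"),
>                 new Pdf("ExampleProducer", 3));
>         System.out.println(pagesFromFields(Map.of("xmpTPg:NPages", "3")));
>         System.out.println(pagesFromTyped(typed));
>     }
> }
> {code}
> The typed path removes the client-side string parsing and the need to know
> the key spelling.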
> h2. ParseResponse layout
> # *Envelope:* {{parse_id}}, {{parsed_at}}, {{status}},
> {{content}}, optional {{embedded_docs}}
> # *Dublin Core:* {{dublin_core}} (shared across formats)
> # *Format metadata:* oneof ({{pdf}}, {{office}}, {{html}},
> {{image}}, {{email}}, {{media}}, {{rtf}}, {{database}},
> {{font}}, {{epub}}, {{warc}}, {{climate_forecast}},
> {{generic}})
> # *Creative Commons:* {{creative_commons}} when XMP rights metadata is
> present
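> One way to picture the layout above is a response that carries exactly one
> format variant plus an optional Creative Commons block alongside it. The
> sketch below models that with plain Java types; all type and method names
> are hypothetical and only three of the listed variants are shown. The real
> messages are protobuf-generated.
> {code:java}
> import java.util.Optional;
>
> public class LayoutSketch {
>     // Stand-in for the format oneof: one variant per response.
>     public sealed interface FormatMetadata {}
>     public record Pdf(int nPages) implements FormatMetadata {}
>     public record Html(String title) implements FormatMetadata {}
>     public record Generic() implements FormatMetadata {}
>
>     // Stand-in for creative_commons, which sits outside the oneof.
>     public record CreativeCommons(String webStatement) {}
>
>     public record ParseResponse(FormatMetadata format,
>                                 Optional<CreativeCommons> creativeCommons) {}
>
>     // Names the variant that is set on the response.
>     public static String kindOf(FormatMetadata format) {
>         if (format instanceof Pdf pdf) {
>             return "pdf(" + pdf.nPages() + " pages)";
>         }
>         if (format instanceof Html html) {
>             return "html(" + html.title() + ")";
>         }
>         return "generic";
>     }
>
>     // Creative Commons, when present, is reported next to the primary type.
>     public static String describe(ParseResponse response) {
>         String kind = kindOf(response.format());
>         return response.creativeCommons().isPresent() ? kind + " + cc" : kind;
>     }
>
>     public static void main(String[] args) {
>         System.out.println(describe(new ParseResponse(
>                 new Pdf(2), Optional.of(new CreativeCommons("https://example.org/rights")))));
>     }
> }
> {code}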
> h2. Architecture
> {code:none}
> Client
> -> TikaGrpcServerImpl (tika-grpc)
> -> Tika Pipes / parsers -> Metadata + body
> -> ParseResponseMapper (tika-grpc-mapper)
> -> format builders -> ParseResponse (tika-grpc-api)
> <- FetchAndParseReply.parse_response
> {code}
> *Module dependency:* {{tika-grpc}} depends on {{tika-grpc-api}} +
> {{tika-grpc-mapper}}. Clients may depend on {{tika-grpc-api}} alone for
> stubs/descriptors.
> h2. Implementation notes
> * Protos live under {{tika-grpc-api/src/main/proto/org/apache/tika/grpc/v1/}}
> * {{creative_commons}} is field 25, placed outside the document oneof, so it
> can coexist with PDF/Office/etc.
> * {{ParseResponseDecorator}} + {{ParseMapContext}} allow optional
> post-processing (e.g. outlines) without coupling the core mapper to
> PDFBox/HTML libraries
> * Mapper tests use Tika parser test-jar fixtures (~35 tests in
> {{tika-grpc-mapper}})
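> The decorator hook mentioned above could take a shape like the following.
> The ticket names {{ParseResponseDecorator}} and {{ParseMapContext}} but does
> not give their signatures, so every method, field, and the
> {{ResponseBuilder}} type here is an assumption made for illustration. The
> point is only that the core mapper runs decorators after its own mapping
> and needs no knowledge of what they do.
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> import java.util.Map;
>
> public class DecoratorSketch {
>     // Assumed shape: the raw metadata and body the mapper worked from.
>     public record ParseMapContext(Map<String, String> metadata, String body) {}
>
>     // Assumed stand-in for a ParseResponse builder.
>     public static final class ResponseBuilder {
>         public final List<String> outline = new ArrayList<>();
>     }
>
>     // Assumed hook signature: mutate the response after core mapping.
>     public interface ParseResponseDecorator {
>         void decorate(ResponseBuilder response, ParseMapContext context);
>     }
>
>     // The core mapping would happen first; decorators then run in order.
>     public static ResponseBuilder map(ParseMapContext context,
>                                       List<ParseResponseDecorator> decorators) {
>         ResponseBuilder response = new ResponseBuilder();
>         for (ParseResponseDecorator decorator : decorators) {
>             decorator.decorate(response, context);
>         }
>         return response;
>     }
>
>     // Example decorator: treat lines starting with "# " as outline entries.
>     public static final ParseResponseDecorator HEADING_OUTLINE =
>             (response, context) -> context.body().lines()
>                     .filter(line -> line.startsWith("# "))
>                     .map(line -> line.substring(2))
>                     .forEach(response.outline::add);
>
>     public static void main(String[] args) {
>         ParseMapContext context = new ParseMapContext(Map.of(), "# Intro\ntext\n# Usage");
>         System.out.println(map(context, List.of(HEADING_OUTLINE)).outline);
>     }
> }
> {code}
> Keeping the outline logic in a decorator means a library-specific
> implementation (for example one built on PDFBox) can live in a separate
> module.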
> h2. Acceptance criteria
> * ( ) {{FetchAndParseReply}} exposes {{parse_response}}; {{fields}} is
> removed/reserved
> * ( ) {{tika-grpc-api}} jar bundles descriptor set at
> {{META-INF/org.apache.tika.grpc.v1.descriptors}}
> * ( ) PDF, Office, HTML, and at least one other format return the expected
> typed oneof in integration tests
> * ( ) {{mvn -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test}} passes
> * ( ) README in {{tika-grpc}} and {{tika-grpc-api}} documents migration for
> clients
> * ( ) Breaking change called out in release notes / migration guide
> h2. Test plan
> {code:bash}
> ./mvnw -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test
> {code}
> * Parse sample PDF and HTML via gRPC; verify {{parse_response.hasPdf()}} /
> {{hasHtml()}}, {{content.body}}, and representative typed fields
> * Confirm no regression in fetcher CRUD and streaming RPCs
> * Spot-check Dublin Core and Creative Commons overlay on a fixture with XMP
> rights
>
>