Kristian Rickert created TIKA-4766:
--------------------------------------

             Summary: Typed ParseResponse for Tika gRPC
                 Key: TIKA-4766
                 URL: https://issues.apache.org/jira/browse/TIKA-4766
             Project: Tika
          Issue Type: New Feature
          Components: tika-pipes
    Affects Versions: 4.0.0
            Reporter: Kristian Rickert


h2. Summary

Replace the flat {{FetchAndParseReply.fields}} map ({{map<string,string>}}) 
with a typed {{org.apache.tika.grpc.v1.ParseResponse}} on 
{{FetchAndParseReply.parse_response}}.

Parse output is mapped from Tika {{Metadata}} into protobuf messages aligned 
with Tika property interfaces (PDF, Office, HTML, Image, Email, and the other 
supported formats). Dublin Core fields are normalized on the response root. 
Creative Commons / XMP rights metadata is exposed on a dedicated field when 
present.

This is a *breaking change* for gRPC clients that read string keys from 
{{fields}}.
h2. Motivation
 * Clients today must parse hundreds of ad hoc string keys with no schema or 
type safety.
 * Format-specific metadata is easier to consume and evolve with protobuf + 
bundled descriptors.
 * Mapping logic is separated from the gRPC server so it can be tested and 
reused (e.g. by downstream parse services).

h2. Scope

*New Maven modules (Tika fork: {{ai-pipestream/tika}}, branch 
{{typed-parse-response-grpc}}):*
||Module||Purpose||
|{{tika-grpc-api}}|Protobuf schema ({{org.apache.tika.grpc.v1}}), Java stubs, {{FileDescriptorSet}} under {{META-INF/}}|
|{{tika-grpc-mapper}}|{{ParseResponseMapper}} + format builders; optional {{ParseResponseDecorator}} hook for future outline enrichment|
|{{tika-grpc}}|Wire {{parse_response}} on fetch-and-parse RPCs; remove {{fields}}|

*Out of scope (follow-up tickets):*
 * PDF/HTML/Markdown outline decorators (proto fields + separate extension)
 * Full CF/NetCDF field mapping in {{ClimateForecastMetadataBuilder}} 
(currently an MVP backed by {{additional_struct}})
 * Downstream consumer updates outside the Tika fork

h2. API change
||Removed||Replacement||
|{{FetchAndParseReply.fields}} (field 2, reserved)|{{FetchAndParseReply.parse_response}} (field 5)|

*Client migration (examples):*
{code:java}
// Body text
parse_response.content.body

// Title
parse_response.content.title
parse_response.dublin_core.title

// PDF
parse_response.pdf.doc_info_producer
parse_response.pdf.n_pages

// Creative Commons (alongside primary type)
parse_response.creative_commons.web_statement
{code}
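In Java, the field paths above would be read through generated protobuf getters instead of map lookups. The sketch below contrasts the two styles. The classes are hypothetical stand-ins for the stubs shipped in {{tika-grpc-api}}; only the getter/{{hasX()}} shape follows the usual protobuf Java conventions, and the map key is shown purely as an illustration of the old string-keyed access.

```java
import java.util.Map;

// Hypothetical stand-ins for the generated stubs; the real classes come from tika-grpc-api.
final class Content {
    private final String title;
    private final String body;
    Content(String title, String body) { this.title = title; this.body = body; }
    String getTitle() { return title; }
    String getBody() { return body; }
}

final class PdfMetadata {
    private final int nPages;
    PdfMetadata(int nPages) { this.nPages = nPages; }
    int getNPages() { return nPages; }
}

final class ParseResponse {
    private final Content content;
    private final PdfMetadata pdf; // null models an unset oneof branch
    ParseResponse(Content content, PdfMetadata pdf) { this.content = content; this.pdf = pdf; }
    Content getContent() { return content; }
    boolean hasPdf() { return pdf != null; }
    PdfMetadata getPdf() { return pdf; }
}

final class MigrationExample {
    // Before: string key, string value, manual parsing (key is illustrative).
    static int pageCountFromFields(Map<String, String> fields) {
        String raw = fields.get("xmpTPg:NPages");
        return raw == null ? 0 : Integer.parseInt(raw);
    }

    // After: typed accessor, guarded by the oneof presence check.
    static int pageCountFromResponse(ParseResponse response) {
        return response.hasPdf() ? response.getPdf().getNPages() : 0;
    }
}
```

The presence check matters because a non-PDF document leaves the {{pdf}} branch unset; in this sketch that case falls back to 0.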
h2. ParseResponse layout
 # *Envelope:* {{parse_id}}, {{parsed_at}}, {{status}}, 
{{content}}, optional {{embedded_docs}}
 # *Dublin Core:* {{dublin_core}} (shared across formats)
 # *Format metadata:* oneof ({{pdf}}, {{office}}, {{html}}, 
{{image}}, {{email}}, {{media}}, {{rtf}}, {{database}}, 
{{font}}, {{epub}}, {{warc}}, {{climate_forecast}}, 
{{generic}})
 # *Creative Commons:* {{creative_commons}} when XMP rights metadata is present
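A client would typically branch on the format oneof and read {{creative_commons}} separately. The sketch below models that with a plain enum. {{FormatCase}}, {{Response}}, and the label strings are hypothetical; the actual oneof name, and therefore the generated case enum, is defined by the proto in {{tika-grpc-api}}.

```java
import java.util.Optional;

final class LayoutSketch {
    // Hypothetical mirror of the case enum protoc generates for a oneof; only a few cases shown.
    enum FormatCase { PDF, OFFICE, HTML, GENERIC, FORMAT_NOT_SET }

    // Stand-in for ParseResponse: one oneof branch plus an independent creative_commons field.
    static final class Response {
        final FormatCase formatCase;
        final Optional<String> ccWebStatement;

        Response(FormatCase formatCase, String ccWebStatement) {
            this.formatCase = formatCase;
            this.ccWebStatement = Optional.ofNullable(ccWebStatement);
        }
    }

    // At most one format branch applies per response.
    static String formatLabel(FormatCase formatCase) {
        switch (formatCase) {
            case PDF: return "pdf";
            case OFFICE: return "office";
            case HTML: return "html";
            case GENERIC: return "generic";
            default: return "none";
        }
    }

    // creative_commons is read independently of whichever oneof branch is set.
    static String describe(Response response) {
        String label = formatLabel(response.formatCase);
        return response.ccWebStatement.map(url -> label + "+creative_commons").orElse(label);
    }
}
```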

h2. Architecture
{code:none}
Client
  -> TikaGrpcServerImpl (tika-grpc)
       -> Tika Pipes / parsers -> Metadata + body
       -> ParseResponseMapper (tika-grpc-mapper)
            -> format builders -> ParseResponse (tika-grpc-api)
  <- FetchAndParseReply.parse_response
{code}
*Module dependency:* {{tika-grpc}} depends on {{tika-grpc-api}} + 
{{tika-grpc-mapper}}. Clients may depend on {{tika-grpc-api}} alone for 
stubs/descriptors.
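The mapper step in the flow above can be pictured as "normalize shared fields, then delegate to the first format builder that accepts the content type". The sketch below is a minimal illustration under stated assumptions, not the code in {{tika-grpc-mapper}}: {{FormatBuilder}}, {{PdfBuilder}}, {{GenericBuilder}}, and {{MapperSketch}} are hypothetical names, a {{Map<String,String>}} stands in for Tika {{Metadata}}, a nested map stands in for the protobuf message, and the metadata keys are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical builder contract: one implementation per format branch.
interface FormatBuilder {
    boolean supports(String contentType);
    String formatName();
    Map<String, Object> build(Map<String, String> metadata);
}

final class PdfBuilder implements FormatBuilder {
    public boolean supports(String contentType) {
        return contentType != null && contentType.startsWith("application/pdf");
    }
    public String formatName() { return "pdf"; }
    public Map<String, Object> build(Map<String, String> metadata) {
        Map<String, Object> out = new LinkedHashMap<>();
        String pages = metadata.get("xmpTPg:NPages"); // illustrative key
        if (pages != null) out.put("n_pages", Integer.parseInt(pages));
        String producer = metadata.get("pdf:docinfo:producer"); // illustrative key
        if (producer != null) out.put("doc_info_producer", producer);
        return out;
    }
}

// Fallback that accepts anything no other builder claimed.
final class GenericBuilder implements FormatBuilder {
    public boolean supports(String contentType) { return true; }
    public String formatName() { return "generic"; }
    public Map<String, Object> build(Map<String, String> metadata) {
        return new LinkedHashMap<>();
    }
}

final class MapperSketch {
    // Order matters: specific builders first, the generic fallback last.
    private static final List<FormatBuilder> BUILDERS =
            List.of(new PdfBuilder(), new GenericBuilder());

    static Map<String, Object> map(Map<String, String> metadata, String body) {
        Map<String, Object> response = new LinkedHashMap<>();
        response.put("content.body", body);
        // Shared Dublin Core fields go on the response root.
        String title = metadata.get("dc:title");
        if (title != null) response.put("dublin_core.title", title);
        // Exactly one format branch is populated, mirroring the oneof.
        String contentType = metadata.get("Content-Type");
        for (FormatBuilder builder : BUILDERS) {
            if (builder.supports(contentType)) {
                response.put(builder.formatName(), builder.build(metadata));
                break;
            }
        }
        return response;
    }
}
```

Keeping the builders behind a small interface is what lets the mapping be tested without starting the gRPC server.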
h2. Implementation notes
 * Protos live under {{tika-grpc-api/src/main/proto/org/apache/tika/grpc/v1/}}
 * {{creative_commons}} is field 25, outside the document oneof, so it can 
coexist with PDF/Office/etc.
 * {{ParseResponseDecorator}} + {{ParseMapContext}} allow optional 
post-processing (e.g. outlines) without coupling the core mapper to PDFBox/HTML 
libraries
 * Mapper tests use Tika parser test-jar fixtures (~35 tests in 
{{tika-grpc-mapper}})
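One way to read the decorator note: the core mapper builds the response, then hands it to zero or more registered decorators that may enrich it. The sketch below only illustrates that pattern. The method signature of {{ParseResponseDecorator}}, the fields of {{ParseMapContext}}, and the {{outline.heading_count}} key are assumptions made for the example; the real types in {{tika-grpc-mapper}} may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical context passed to decorators; real fields may differ.
final class ParseMapContext {
    final String contentType;
    final String body;
    ParseMapContext(String contentType, String body) {
        this.contentType = contentType;
        this.body = body;
    }
}

// Hypothetical hook: mutates the response after the core mapping is done.
interface ParseResponseDecorator {
    void decorate(Map<String, Object> response, ParseMapContext context);
}

final class DecoratingMapper {
    private final List<ParseResponseDecorator> decorators;

    DecoratingMapper(List<ParseResponseDecorator> decorators) {
        this.decorators = new ArrayList<>(decorators);
    }

    Map<String, Object> map(ParseMapContext context) {
        // Core mapping knows nothing about outlines, PDFBox, or HTML libraries.
        Map<String, Object> response = new LinkedHashMap<>();
        response.put("content.body", context.body);
        // Optional post-processing runs in registration order.
        for (ParseResponseDecorator decorator : decorators) {
            decorator.decorate(response, context);
        }
        return response;
    }
}

final class DecoratorExample {
    // Stand-in decorator: counts lines that look like Markdown headings.
    static final ParseResponseDecorator HEADING_COUNT = (response, context) -> {
        long count = 0;
        for (String line : context.body.split("\n")) {
            if (line.startsWith("#")) count++;
        }
        response.put("outline.heading_count", count);
    };
}
```

With an empty decorator list the mapper returns only the core fields, which is the behavior the core module would keep by default.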

h2. Acceptance criteria
 * ( ) {{FetchAndParseReply}} exposes {{parse_response}}; {{fields}} is 
removed/reserved
 * ( ) {{tika-grpc-api}} jar bundles descriptor set at 
{{META-INF/org.apache.tika.grpc.v1.descriptors}}
 * ( ) PDF, Office, HTML, and at least one other format return the expected 
typed oneof in integration tests
 * ( ) {{./mvnw -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test}} passes
 * ( ) README in {{tika-grpc}} and {{tika-grpc-api}} documents migration for 
clients
 * ( ) Breaking change called out in release notes / migration guide

h2. Test plan
{code:bash}
./mvnw -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test
{code}
 * Parse sample PDF and HTML via gRPC; verify {{parse_response.hasPdf()}} / 
{{hasHtml()}}, {{content.body}}, and representative typed fields
 * Confirm no regression in fetcher CRUD and streaming RPCs
 * Spot-check Dublin Core and Creative Commons overlay on a fixture with XMP 
rights

h2. References
 * Fork branch: {{typed-parse-response-grpc}} on 
[ai-pipestream/tika|https://github.com/ai-pipestream/tika/tree/typed-parse-response-grpc]
 * PR body / architecture detail: 
{{tika-grpc/scripts/pr-typed-parse-response.md}} in that branch
 * Server README: {{tika-grpc/README.md}} (Typed parse output section)

 


