[ 
https://issues.apache.org/jira/browse/TIKA-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kristian Rickert updated TIKA-4766:
-----------------------------------
    Description: 

Replace the flat {{FetchAndParseReply.fields}} map ({{map<string,string>}}) 
with a typed parse result.

Approach (reshaped from the original {{ParseResponse}} design after review in 
[PR #2916|https://github.com/apache/tika/pull/2916]): a single small 
{{Document}} proto (~200 lines) rather than per-format metadata messages.

* *Content*: a structured markdown block tree -- headings, paragraphs, lists, 
tables, code blocks, inline runs (CommonMark + GFM) -- plus the rendered 
markdown string that {{ToMarkdownContentHandler}} already produces (TIKA-4730).
* *Typed common metadata*: title, authors, keywords, languages, 
created/modified as {{Timestamp}}s, page/word/character counts, dimensions, 
rights.
* *Lossless tagged tail*: every remaining metadata key, multivalue-preserving, 
typed only where Tika's own {{Property}} declares a type, string otherwise -- 
never guessed.
* *Embedded documents* recurse as fully typed child {{Document}}s.
* Format specifics live in per-parser {{DocumentTransformer}} code 
({{tika-grpc-mapper}}), never in the wire contract, so metadata churn never 
forces a client rebuild.
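
The tagged-tail and embedded-document bullets above can be sketched in plain Java. This is only an illustration under stated assumptions, not the code in the PR: {{DocumentSketch}}, {{Doc}}, {{TailValue}}, {{DeclaredType}} and the {{example:*}} keys are hypothetical names, and a simple key-to-type map stands in for the types that Tika's {{Property}} declares. Values are typed only when a type is declared for the key; everything else, including values that fail to parse, stays a string.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DocumentSketch {

    /** Hypothetical stand-in for the value type a Tika Property declares. */
    enum DeclaredType { INTEGER, BOOLEAN }

    /** One tail value; exactly one component is non-null. */
    record TailValue(String text, Long integer, Boolean bool) {
        static TailValue ofText(String s) { return new TailValue(s, null, null); }
        static TailValue ofInteger(long n) { return new TailValue(null, n, null); }
        static TailValue ofBool(boolean b) { return new TailValue(null, null, b); }
    }

    /** Illustrative document: one typed field, the tagged tail, typed children. */
    record Doc(String title, Map<String, List<TailValue>> tail, List<Doc> embedded) { }

    /** Converts one raw string, using the declared type if there is one. */
    static TailValue convert(String raw, DeclaredType type) {
        if (type == DeclaredType.INTEGER) {
            try {
                return TailValue.ofInteger(Long.parseLong(raw.trim()));
            } catch (NumberFormatException e) {
                return TailValue.ofText(raw); // assumption: keep the raw string, never drop it
            }
        }
        if (type == DeclaredType.BOOLEAN
                && (raw.equalsIgnoreCase("true") || raw.equalsIgnoreCase("false"))) {
            return TailValue.ofBool(Boolean.parseBoolean(raw));
        }
        return TailValue.ofText(raw); // undeclared keys stay strings, no guessing
    }

    /** Builds the tail, preserving key order and every value of multivalued keys. */
    static Map<String, List<TailValue>> buildTail(
            Map<String, List<String>> metadata, Map<String, DeclaredType> declared) {
        Map<String, List<TailValue>> tail = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
            DeclaredType type = declared.get(entry.getKey()); // null when undeclared
            List<TailValue> values = new ArrayList<>();
            for (String raw : entry.getValue()) {
                values.add(convert(raw, type));
            }
            tail.put(entry.getKey(), values);
        }
        return tail;
    }

    /** Counts a document plus all of its embedded descendants. */
    static int countDocuments(Doc doc) {
        int total = 1;
        for (Doc child : doc.embedded()) {
            total += countDocuments(child);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("example:page-hint", List.of("12"));    // declared INTEGER below
        metadata.put("example:tag", List.of("12", "draft")); // undeclared, multivalued
        Map<String, DeclaredType> declared =
                Map.of("example:page-hint", DeclaredType.INTEGER);

        Doc attachment = new Doc("attachment", Map.of(), List.of());
        Doc root = new Doc("report", buildTail(metadata, declared), List.of(attachment));
        System.out.println(root.tail());
        System.out.println(countDocuments(root));
    }
}
```

Note that the value "12" under the undeclared key stays a string even though it looks numeric, which is the "never guessed" rule in miniature.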

Modules: {{tika-grpc-api}} (proto + generated messages + bundled 
{{FileDescriptorSet}}), {{tika-grpc-mapper}}, {{tika-grpc}} integration. 
Breaking change for clients reading {{fields}} (field number reserved).

PR: [https://github.com/apache/tika/pull/2921] (supersedes 
[#2916|https://github.com/apache/tika/pull/2916])

Follow-ups will be tracked in separate issues: pluggable external parsers 
(opaque {{Any}} extension results from registered gRPC services), and a 
Markdown input parser.

  was:
h2. Summary

Replace the flat {{FetchAndParseReply.fields}} map ({{map<string,string>}}) 
with a typed {{org.apache.tika.grpc.v1.ParseResponse}} on 
{{FetchAndParseReply.parse_response}}.

Parse output is mapped from Tika {{Metadata}} into protobuf messages aligned 
with Tika property interfaces (PDF, Office, HTML, Image, Email, and the other 
supported formats). Dublin Core fields are normalized on the response root. 
Creative Commons / XMP rights metadata is exposed on a dedicated field when 
present.

This is a *breaking change* for gRPC clients that read string keys from 
{{fields}}.
h2. Motivation
 * Clients today must parse hundreds of ad hoc string keys with no schema or 
type safety.
 * Format-specific metadata is easier to consume and evolve with protobuf + 
bundled descriptors.
 * Mapping logic is separated from the gRPC server so it can be tested and 
reused (e.g. by downstream parse services).

h2. Scope

*New Maven modules (Tika fork: {{ai-pipestream/tika}}, branch 
{{typed-parse-response-grpc}}):*
||Module||Purpose||
|{{tika-grpc-api}}|Protobuf schema ({{org.apache.tika.grpc.v1}}), Java stubs, {{FileDescriptorSet}} under {{META-INF/}}|
|{{tika-grpc-mapper}}|{{ParseResponseMapper}} + format builders; optional {{ParseResponseDecorator}} hook for future outline enrichment|
|{{tika-grpc}}|Wire {{parse_response}} on fetch-and-parse RPCs; remove {{fields}}|

*Out of scope (follow-up tickets):*
 * PDF/HTML/Markdown outline decorators (proto fields + separate extension)
 * Full CF/NetCDF field mapping in {{ClimateForecastMetadataBuilder}} 
(currently an {{additional_struct}} MVP)
 * Downstream consumer updates outside the Tika fork

h2. API change
||Removed||Replacement||
|{{FetchAndParseReply.fields}} (field 2, reserved)|{{FetchAndParseReply.parse_response}} (field 5)|

*Client migration (examples):*
{code:java}
// Body text
parse_response.content.body

// Title
parse_response.content.title
parse_response.dublin_core.title

// PDF
parse_response.pdf.doc_info_producer
parse_response.pdf.n_pages

// Creative Commons (alongside primary type)
parse_response.creative_commons.web_statement
{code}
h2. ParseResponse layout
 # *Envelope:* {{parse_id}}, {{parsed_at}}, {{status}}, {{content}}, 
optional {{embedded_docs}}
 # *Dublin Core:* {{dublin_core}} (shared across formats)
 # *Format metadata:* oneof ({{pdf}}, {{office}}, {{html}}, {{image}}, 
{{email}}, {{media}}, {{rtf}}, {{database}}, {{font}}, {{epub}}, {{warc}}, 
{{climate_forecast}}, {{generic}})
 # *Creative Commons:* {{creative_commons}} when XMP rights metadata is present

h2. Architecture
{code:none}
Client
  -> TikaGrpcServerImpl (tika-grpc)
       -> Tika Pipes / parsers -> Metadata + body
       -> ParseResponseMapper (tika-grpc-mapper)
            -> format builders -> ParseResponse (tika-grpc-api)
  <- FetchAndParseReply.parse_response
{code}
*Module dependency:* {{tika-grpc}} depends on {{tika-grpc-api}} + 
{{tika-grpc-mapper}}. Clients may depend on {{tika-grpc-api}} alone for 
stubs/descriptors.
h2. Implementation notes
 * Protos live under {{tika-grpc-api/src/main/proto/org/apache/tika/grpc/v1/}}
 * {{creative_commons}} is field 25 outside the document oneof so it can 
coexist with PDF/Office/etc.
 * {{ParseResponseDecorator}} + {{ParseMapContext}} allow optional 
post-processing (e.g. outlines) without coupling the core mapper to PDFBox/HTML 
libraries
 * Mapper tests use Tika parser test-jar fixtures (~35 tests in 
{{tika-grpc-mapper}})
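
The decorator note above describes a hook that lets post-processing run after core mapping without the mapper knowing about any particular library. A minimal sketch of that pattern follows. The ticket names {{ParseResponseDecorator}} and {{ParseMapContext}} but does not give their signatures, so {{DecoratorSketch}}, {{Result}}, {{Decorator}} and {{map}} here are hypothetical stand-ins, and the heading-based outline is just an example decorator.

```java
import java.util.ArrayList;
import java.util.List;

final class DecoratorSketch {

    /** Hypothetical mapped result: body text plus an optional outline. */
    static final class Result {
        final String body;
        final List<String> outline = new ArrayList<>();
        Result(String body) { this.body = body; }
    }

    /** Hypothetical hook that runs after core mapping. */
    interface Decorator {
        void decorate(Result result);
    }

    /** Core mapper: depends only on the hook interface, not on any decorator. */
    static Result map(String body, List<Decorator> decorators) {
        Result result = new Result(body);
        for (Decorator decorator : decorators) {
            decorator.decorate(result);
        }
        return result;
    }

    public static void main(String[] args) {
        // Example decorator: collects Markdown-style heading lines as an outline.
        Decorator headings = r -> {
            for (String line : r.body.split("\n")) {
                if (line.startsWith("#")) {
                    r.outline.add(line.replaceFirst("^#+\\s*", ""));
                }
            }
        };
        Result result = map("# Intro\ntext\n## Details", List.of(headings));
        System.out.println(result.outline);
    }
}
```

With an empty decorator list the mapper still returns a complete result, which is what keeps the enrichment optional.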

h2. Acceptance criteria
 * ( ) {{FetchAndParseReply}} exposes {{parse_response}}; {{fields}} is 
removed/reserved
 * ( ) {{tika-grpc-api}} jar bundles descriptor set at 
{{META-INF/org.apache.tika.grpc.v1.descriptors}}
 * ( ) PDF, Office, HTML, and at least one other format return the expected 
typed oneof in integration tests
 * ( ) {{mvn -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test}} passes
 * ( ) README in {{tika-grpc}} and {{tika-grpc-api}} documents migration for 
clients
 * ( ) Breaking change called out in release notes / migration guide

h2. Test plan
{code:bash}
./mvnw -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test
{code}
 * Parse sample PDF and HTML via gRPC; verify {{parse_response.hasPdf()}} / 
{{hasHtml()}}, {{content.body}}, and representative typed fields
 * Confirm no regression in fetcher CRUD and streaming RPCs
 * Spot-check Dublin Core and Creative Commons overlay on a fixture with XMP 
rights

 

 


> Replace tika-grpc fields map with a typed Document parse contract
> -----------------------------------------------------------------
>
>                 Key: TIKA-4766
>                 URL: https://issues.apache.org/jira/browse/TIKA-4766
>             Project: Tika
>          Issue Type: New Feature
>          Components: tika-pipes
>    Affects Versions: 4.0.0
>            Reporter: Kristian Rickert
>            Priority: Major
>              Labels: grpc, pipes, protobuf



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
