[
https://issues.apache.org/jira/browse/TIKA-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kristian Rickert updated TIKA-4766:
-----------------------------------
Description:
Replace the flat {{FetchAndParseReply.fields}} map ({{map<string,string>}})
with a typed parse result.
Approach (reshaped from the original {{ParseResponse}} design after review in
[PR #2916|https://github.com/apache/tika/pull/2916]): a single small
{{Document}} proto (~200 lines) rather than per-format metadata messages.
* *Content*: a structured markdown block tree -- headings, paragraphs, lists,
tables, code blocks, inline runs (CommonMark + GFM) -- plus the rendered
markdown string {{ToMarkdownContentHandler}} already produces (TIKA-4730).
* *Typed common metadata*: title, authors, keywords, languages,
created/modified as {{Timestamp}}s, page/word/character counts, dimensions,
rights.
* *Lossless tagged tail*: every remaining metadata key, multivalue-preserving,
typed only where Tika's own {{Property}} declares a type, string otherwise --
never guessed.
* *Embedded documents* recurse as fully typed child {{Document}}s.
* Format specifics live in per-parser {{DocumentTransformer}} code
({{tika-grpc-mapper}}), never in the wire contract, so metadata churn never
forces a client rebuild.
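
The "lossless tagged tail" rule above could be sketched roughly as follows. This is an illustration under stated assumptions, not the mapper in the PR: {{TaggedTailSketch}}, {{Kind}}, {{TaggedValue}} and {{buildTail}} are hypothetical names, the metadata is modelled as a plain map instead of Tika's {{Metadata}}, and falling back to strings when a declared type does not parse is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Illustrative only: every name here is hypothetical; none is taken from the PR.
final class TaggedTailSketch {

    /** Value kinds a declared property type might map to; STRING is the fallback. */
    enum Kind { STRING, INTEGER, REAL }

    /** One tail entry: the original key, the kind used, and all values in source order. */
    static final class TaggedValue {
        final String key;
        final Kind kind;
        final List<Object> values;

        TaggedValue(String key, Kind kind, List<?> values) {
            this.key = key;
            this.kind = kind;
            this.values = Collections.unmodifiableList(new ArrayList<Object>(values));
        }
    }

    /**
     * Builds the tail from leftover metadata. A key is typed only when
     * declaredKinds names a kind for it; every other key stays a string.
     */
    static List<TaggedValue> buildTail(Map<String, List<String>> metadata,
                                       Map<String, Kind> declaredKinds) {
        List<TaggedValue> tail = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
            Kind declared = declaredKinds.getOrDefault(entry.getKey(), Kind.STRING);
            tail.add(convert(entry.getKey(), declared, entry.getValue()));
        }
        return tail;
    }

    private static TaggedValue convert(String key, Kind kind, List<String> raw) {
        if (kind == Kind.STRING) {
            return new TaggedValue(key, Kind.STRING, raw);
        }
        List<Object> typed = new ArrayList<>();
        try {
            for (String value : raw) {
                if (kind == Kind.INTEGER) {
                    typed.add(Long.parseLong(value.trim()));
                } else {
                    typed.add(Double.parseDouble(value.trim()));
                }
            }
        } catch (NumberFormatException e) {
            // Assumption of this sketch: if any value contradicts the declared
            // type, keep all raw strings so that nothing is lost.
            return new TaggedValue(key, Kind.STRING, raw);
        }
        return new TaggedValue(key, kind, typed);
    }
}
```

A value such as {{"7"}} under a key with no declared kind stays a string here, which is the "never guessed" part; multi-valued keys keep every value in order.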
Modules: {{tika-grpc-api}} (proto + generated messages + bundled
{{FileDescriptorSet}}), {{tika-grpc-mapper}}, {{tika-grpc}} integration.
Breaking change for clients reading {{fields}} (field number reserved).
PR: [https://github.com/apache/tika/pull/2921] (supersedes
[#2916|https://github.com/apache/tika/pull/2916])
Follow-ups will be tracked in separate issues: pluggable external parsers
(opaque {{Any}} extension results from registered gRPC services) and a
Markdown input parser.
was:
h2. Summary
Replace the flat {{FetchAndParseReply.fields}} map ({{map<string,string>}})
with a typed {{org.apache.tika.grpc.v1.ParseResponse}} on
{{FetchAndParseReply.parse_response}}.
Parse output is mapped from Tika {{Metadata}} into protobuf messages aligned
with Tika property interfaces (PDF, Office, HTML, Image, Email, and the other
supported formats). Dublin Core fields are normalized on the response root.
Creative Commons / XMP rights metadata is exposed on a dedicated field when
present.
This is a *breaking change* for gRPC clients that read string keys from
{{fields}}.
h2. Motivation
* Clients today must parse hundreds of ad hoc string keys with no schema or
type safety.
* Format-specific metadata is easier to consume and evolve with protobuf +
bundled descriptors.
* Mapping logic is separated from the gRPC server so it can be tested and
reused (e.g. by downstream parse services).
h2. Scope
*New Maven modules (Tika fork: {{ai-pipestream/tika}}, branch
{{typed-parse-response-grpc}}):*
||Module||Purpose||
|{{tika-grpc-api}}|Protobuf schema ({{org.apache.tika.grpc.v1}}), Java stubs, {{FileDescriptorSet}} under {{META-INF/}}|
|{{tika-grpc-mapper}}|{{ParseResponseMapper}} + format builders; optional {{ParseResponseDecorator}} hook for future outline enrichment|
|{{tika-grpc}}|Wire {{parse_response}} on fetch-and-parse RPCs; remove {{fields}}|
*Out of scope (follow-up tickets):*
* PDF/HTML/Markdown outline decorators (proto fields + separate extension)
* Full CF/NetCDF field mapping in {{ClimateForecastMetadataBuilder}}
(currently an {{additional_struct}}-based MVP)
* Downstream consumer updates outside the Tika fork
h2. API change
||Removed||Replacement||
|{{FetchAndParseReply.fields}} (field 2, reserved)|{{FetchAndParseReply.parse_response}} (field 5)|
*Client migration (examples):*
{code:java}
// Body text
parse_response.content.body
// Title
parse_response.content.title
parse_response.dublin_core.title
// PDF
parse_response.pdf.doc_info_producer
parse_response.pdf.n_pages
// Creative Commons (alongside primary type)
parse_response.creative_commons.web_statement
{code}
h2. ParseResponse layout
# *Envelope:* {{parse_id}}, {{parsed_at}}, {{status}},
{{content}}, optional {{embedded_docs}}
# *Dublin Core:* {{dublin_core}} (shared across formats)
# *Format metadata:* oneof ({{pdf}}, {{office}}, {{html}},
{{image}}, {{email}}, {{media}}, {{rtf}}, {{database}},
{{font}}, {{epub}}, {{warc}}, {{climate_forecast}},
{{generic}})
# *Creative Commons:* {{creative_commons}} when XMP rights metadata is present
h2. Architecture
{code:none}
Client
-> TikaGrpcServerImpl (tika-grpc)
-> Tika Pipes / parsers -> Metadata + body
-> ParseResponseMapper (tika-grpc-mapper)
-> format builders -> ParseResponse (tika-grpc-api)
<- FetchAndParseReply.parse_response
{code}
*Module dependency:* {{tika-grpc}} depends on {{tika-grpc-api}} +
{{tika-grpc-mapper}}. Clients may depend on {{tika-grpc-api}} alone for
stubs/descriptors.
h2. Implementation notes
* Protos live under {{tika-grpc-api/src/main/proto/org/apache/tika/grpc/v1/}}
* {{creative_commons}} is field 25 outside the document oneof so it can
coexist with PDF/Office/etc.
* {{ParseResponseDecorator}} + {{ParseMapContext}} allow optional
post-processing (e.g. outlines) without coupling the core mapper to PDFBox/HTML
libraries
* Mapper tests use Tika parser test-jar fixtures (~35 tests in
{{tika-grpc-mapper}})
h2. Acceptance criteria
* ( ) {{FetchAndParseReply}} exposes {{parse_response}}; {{fields}} is
removed/reserved
* ( ) {{tika-grpc-api}} jar bundles descriptor set at
{{META-INF/org.apache.tika.grpc.v1.descriptors}}
* ( ) PDF, Office, HTML, and at least one other format return the expected
typed oneof in integration tests
* ( ) {{mvn -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test}} passes
* ( ) README in {{tika-grpc}} and {{tika-grpc-api}} documents migration for
clients
* ( ) Breaking change called out in release notes / migration guide
h2. Test plan
{code:bash}
./mvnw -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test
{code}
* Parse sample PDF and HTML via gRPC; verify {{parse_response.hasPdf()}} /
{{hasHtml()}}, {{content.body}}, and representative typed fields
* Confirm no regression in fetcher CRUD and streaming RPCs
* Spot-check Dublin Core and Creative Commons overlay on a fixture with XMP
rights
> Replace tika-grpc fields map with a typed Document parse contract
> -----------------------------------------------------------------
>
> Key: TIKA-4766
> URL: https://issues.apache.org/jira/browse/TIKA-4766
> Project: Tika
> Issue Type: New Feature
> Components: tika-pipes
> Affects Versions: 4.0.0
> Reporter: Kristian Rickert
> Priority: Major
> Labels: grpc, pipes, protobuf