[ 
https://issues.apache.org/jira/browse/TIKA-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18092794#comment-18092794
 ] 

ASF GitHub Bot commented on TIKA-4766:
--------------------------------------

krickert commented on PR #2916:
URL: https://github.com/apache/tika/pull/2916#issuecomment-4848745681

   Tim, you're right - I'll make a new proto solution and we can table this one. 
It's too complex.
   
   My use case - I want to take Tika's output and use it as an input for 
OpenNLP in a type-safe way.
   
   But mirroring an open metadata taxonomy in protos is the wrong thing to sign 
the project up for. 
   
   I'd maintain it, but that's not scalable.  So let me drop that framing and 
come at it from the other side.
   
   The thing worth typing isn't Tika's metadata - it's the parsed document. 
Here's the shape I'd propose, and it's one small stable proto, not a per-format 
taxonomy:
   
   - **Content as structured blocks** - headings, paragraphs, lists, tables, 
code, images. It's a standard markdown document model, so it renders straight 
back to markdown and it's exactly what a RAG/embeddings pipeline wants to 
consume. This is the actual product, and it's anchored to a spec that doesn't 
churn.
   - **Common metadata typed** - title, authors, created/modified as 
`Timestamp`, page/word counts, language. The cross-format stuff everyone always 
wants, and where a date has to be a `Timestamp`, not a string 12 languages 
re-parse.
   - **Everything else in one native tagged tail** - typed where Tika already 
declares the type, string otherwise (never guessed). That's the lossless map 
that replaces the old `fields` map, just multivalue and type-aware.
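   
   To make the three parts above concrete, here is a minimal sketch in plain 
Java rather than proto. Every name in it (`DocumentSketch`, `Block`, 
`CommonMetadata`, `TagValue`, `tag`) is hypothetical and only illustrates the 
shape; none of it is existing Tika API or the final proto. The `declaredType` 
strings are also an assumption about how a declared property type might be 
passed in.

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch only: none of these names exist in Tika or in the PR.
final class DocumentSketch {

    // Content as structured blocks; each block can render itself back to markdown.
    interface Block {
        String toMarkdown();
    }

    record Heading(int level, String text) implements Block {
        public String toMarkdown() {
            return "#".repeat(level) + " " + text;
        }
    }

    record Paragraph(String text) implements Block {
        public String toMarkdown() {
            return text;
        }
    }

    record BulletList(List<String> items) implements Block {
        public String toMarkdown() {
            return items.stream().map(i -> "- " + i).collect(Collectors.joining("\n"));
        }
    }

    // Common cross-format metadata; the date is an Instant, not a string.
    record CommonMetadata(String title, List<String> authors, Instant created, String language) {}

    // Values in the tagged tail: typed when a type is declared, string otherwise.
    interface TagValue {}

    record StringValue(String value) implements TagValue {}

    record IntValue(long value) implements TagValue {}

    record DateValue(Instant value) implements TagValue {}

    // Types a raw value only when a type is declared; never guesses from the text.
    static TagValue tag(String declaredType, String raw) {
        if ("integer".equals(declaredType)) {
            return new IntValue(Long.parseLong(raw));
        }
        if ("date".equals(declaredType)) {
            return new DateValue(Instant.parse(raw));
        }
        return new StringValue(raw);
    }

    // The whole document: blocks, typed common metadata, and a multivalue tail.
    record Document(List<Block> blocks, CommonMetadata metadata, Map<String, List<TagValue>> tail) {
        public String toMarkdown() {
            return blocks.stream().map(Block::toMarkdown).collect(Collectors.joining("\n\n"));
        }
    }
}
```

   Note that `tag(null, "42")` stays a `StringValue` even though it looks 
numeric. That is the "never guessed" rule: only a declared type promotes a 
value out of string form.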
   
   This is actually close to where you landed - the tail is your `map`, just 
multivalue and type-aware, and the typed surface is the common cross-format 
fields, a bit past Dublin Core but nowhere near a taxonomy mirror.
   
   On the maintenance worry, which is the real one: format specifics don't go 
on the wire. They go in a per-parser transformer (just code). One `Document` 
proto. Adding a parser is adding a transformer - the contract doesn't move, so 
clients never rebuild for it. And to be precise about the rebuild fear: in 
proto3, adding `optional` fields is backward and forward compatible. Existing 
clients keep working and simply don't see the new field. Nobody is forced to 
regenerate unless they actually want the new data. So our metadata churn lands 
in the mapper and the tail, never in a contract clients have to chase.
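   
   A rough sketch of the "adding a parser is adding a transformer" idea, 
assuming a registry keyed by media type with a generic fallback. The names 
(`TransformerSketch`, `Doc`, `Transformer`, `Registry`) are hypothetical, `Doc` 
is a simplified stand-in for the single `Document` contract, and the metadata 
keys are used purely for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: not the PR's transformer interface.
final class TransformerSketch {

    // Simplified stand-in for the one wire contract; it never changes per format.
    record Doc(String title, Map<String, List<String>> tail) {}

    // A per-parser transformer is just code that maps raw metadata to Doc.
    interface Transformer {
        Doc transform(Map<String, List<String>> metadata);
    }

    // Picks a format-specific transformer when one is registered, else the generic one.
    static final class Registry {
        private final Map<String, Transformer> byMediaType = new HashMap<>();
        private final Transformer generic;

        Registry(Transformer generic) {
            this.generic = generic;
        }

        void register(String mediaType, Transformer transformer) {
            byMediaType.put(mediaType, transformer);
        }

        Doc transform(String mediaType, Map<String, List<String>> metadata) {
            return byMediaType.getOrDefault(mediaType, generic).transform(metadata);
        }
    }

    // Returns the first value for a key, or an empty string when absent.
    static String first(Map<String, List<String>> metadata, String key) {
        List<String> values = metadata.get(key);
        return (values == null || values.isEmpty()) ? "" : values.get(0);
    }

    // Generic transformer: reads only the cross-format title key.
    static final Transformer GENERIC = md -> new Doc(first(md, "dc:title"), md);

    // Illustrative PDF transformer: falls back to a PDF-specific key for the title.
    static final Transformer PDF = md -> {
        String title = first(md, "dc:title");
        if (title.isEmpty()) {
            title = first(md, "pdf:docinfo:title");
        }
        return new Doc(title, md);
    };
}
```

   Registering `PDF` under `application/pdf` changes what ends up in the typed 
`title`, but the shape of `Doc` stays the same, which is the property that 
keeps clients from having to regenerate anything.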
   
   To answer your question directly - what I need in Tika vs outside:
   - **In Tika:** the `Document` proto, a generic transformer, and the tagged 
tail replacing the `fields` map. Small and stable.
   - **Outside / pluggable:** the richer per-parser transformers can ship as 
add-on modules. Tika owns a clean contract; the heavy mapping is opt-in.
   
   On why bother typing it at all, since I know that's the undercurrent: the 
whole point of gRPC is that the message *is* the typed object. If the client 
still has to crawl and re-parse strings, then gRPC is doing nothing but serde 
and we've handed the work back to the user. Protobuf gives you clean JSON for free on top 
of that, and going the other way never gives you a typed contract. So this 
isn't type-safety for its own sake - it's what lets Tika be a first-class 
parser from Rust, Python and Go, not just Java, with one contract across all of 
them.
   
   I'll redo this - give me a day to reshape it.  It'll be far fewer fields 
to maintain, and there will be a transformation interface.  Anything that 
doesn't fit a typed field we'll put in a struct.




> Typed ParseResponse for Tika gRPC
> ---------------------------------
>
>                 Key: TIKA-4766
>                 URL: https://issues.apache.org/jira/browse/TIKA-4766
>             Project: Tika
>          Issue Type: New Feature
>          Components: tika-pipes
>    Affects Versions: 4.0.0
>            Reporter: Kristian Rickert
>            Priority: Major
>              Labels: grpc, pipes, protobuf
>
> h2. Summary
> Replace the flat {{FetchAndParseReply.fields}} map 
> ({{{}map<string,string>{}}}) with a typed 
> {{org.apache.tika.grpc.v1.ParseResponse}} on 
> {{{}FetchAndParseReply.parse_response{}}}.
> Parse output is mapped from Tika {{Metadata}} into protobuf messages aligned 
> with Tika property interfaces (PDF, Office, HTML, Image, Email, and the other 
> supported formats). Dublin Core fields are normalized on the response root. 
> Creative Commons / XMP rights metadata is exposed on a dedicated field when 
> present.
> This is a *breaking change* for gRPC clients that read string keys from 
> {{{}fields{}}}.
> h2. Motivation
>  * Clients today must parse hundreds of ad hoc string keys with no schema or 
> type safety.
>  * Format-specific metadata is easier to consume and evolve with protobuf + 
> bundled descriptors.
>  * Mapping logic is separated from the gRPC server so it can be tested and 
> reused (e.g. by downstream parse services).
> h2. Scope
> *New Maven modules (Tika fork: {{{}ai-pipestream/tika{}}}, branch 
> {{{}typed-parse-response-grpc{}}}):*
> ||Module||Purpose||
> |{{tika-grpc-api}}|Protobuf schema ({{{}org.apache.tika.grpc.v1{}}}), Java 
> stubs, {{FileDescriptorSet}} under {{META-INF/}}|
> |{{tika-grpc-mapper}}|{{ParseResponseMapper}} + format builders; optional 
> {{ParseResponseDecorator}} hook for future outline enrichment|
> |{{tika-grpc}}|Wire {{parse_response}} on fetch-and-parse RPCs; remove 
> {{fields}}|
> *Out of scope (follow-up tickets):*
>  * PDF/HTML/Markdown outline decorators (proto fields + separate extension)
>  * Full CF/NetCDF field mapping in {{ClimateForecastMetadataBuilder}} 
> (currently additional_struct MVP)
>  * Downstream consumer updates outside the Tika fork
> h2. API change
> ||Removed||Replacement||
> |{{FetchAndParseReply.fields}} (field 2, 
> reserved)|{{FetchAndParseReply.parse_response}} (field 5)|
> *Client migration (examples):*
> {code:java}
> // Body text
> parse_response.content.body
> // Title
> parse_response.content.title
> parse_response.dublin_core.title
> // PDF
> parse_response.pdf.doc_info_producer
> parse_response.pdf.n_pages
> // Creative Commons (alongside primary type)
> parse_response.creative_commons.web_statement
> {code}
> h2. ParseResponse layout
>  # *Envelope:* {{{}parse_id{}}}, {{{}parsed_at{}}}, {{{}status{}}}, 
> {{{}content{}}}, optional {{embedded_docs}}
>  # *Dublin Core:* {{dublin_core}} (shared across formats)
>  # *Format metadata:* oneof ({{{}pdf{}}}, {{{}office{}}}, {{{}html{}}}, 
> {{{}image{}}}, {{{}email{}}}, {{{}media{}}}, {{{}rtf{}}}, {{{}database{}}}, 
> {{{}font{}}}, {{{}epub{}}}, {{{}warc{}}}, {{{}climate_forecast{}}}, 
> {{{}generic{}}})
>  # *Creative Commons:* {{creative_commons}} when XMP rights metadata is 
> present
> h2. Architecture
> {code:none}
> Client
>   -> TikaGrpcServerImpl (tika-grpc)
>        -> Tika Pipes / parsers -> Metadata + body
>        -> ParseResponseMapper (tika-grpc-mapper)
>             -> format builders -> ParseResponse (tika-grpc-api)
>   <- FetchAndParseReply.parse_response
> {code}
> *Module dependency:* {{tika-grpc}} depends on {{tika-grpc-api}} + 
> {{{}tika-grpc-mapper{}}}. Clients may depend on {{tika-grpc-api}} alone for 
> stubs/descriptors.
> h2. Implementation notes
>  * Protos live under {{tika-grpc-api/src/main/proto/org/apache/tika/grpc/v1/}}
>  * {{creative_commons}} is field 25 outside the document oneof so it can 
> coexist with PDF/Office/etc.
>  * {{ParseResponseDecorator}} + {{ParseMapContext}} allow optional 
> post-processing (e.g. outlines) without coupling the core mapper to 
> PDFBox/HTML libraries
>  * Mapper tests use Tika parser test-jar fixtures (~35 tests in 
> {{{}tika-grpc-mapper{}}})
> h2. Acceptance criteria
>  * ( ) {{FetchAndParseReply}} exposes {{{}parse_response{}}}; {{fields}} is 
> removed/reserved
>  * ( ) {{tika-grpc-api}} jar bundles descriptor set at 
> {{META-INF/org.apache.tika.grpc.v1.descriptors}}
>  * ( ) PDF, Office, HTML, and at least one other format return the expected 
> typed oneof in integration tests
>  * ( ) {{mvn -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test}} passes
>  * ( ) README in {{tika-grpc}} and {{tika-grpc-api}} documents migration for 
> clients
>  * ( ) Breaking change called out in release notes / migration guide
> h2. Test plan
> {code:bash}
> ./mvnw -pl tika-grpc-api,tika-grpc-mapper,tika-grpc test
> {code}
>  * Parse sample PDF and HTML via gRPC; verify {{parse_response.hasPdf()}} / 
> {{{}hasHtml(){}}}, {{{}content.body{}}}, and representative typed fields
>  * Confirm no regression in fetcher CRUD and streaming RPCs
>  * Spot-check Dublin Core and Creative Commons overlay on a fixture with XMP 
> rights
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
