krickert commented on PR #2916:
URL: https://github.com/apache/tika/pull/2916#issuecomment-4848745681

   Tim, you're right - I'll make a new proto solution, we can table this one.  
It's too complex.
   
   My use case - I want to take Tika's output and use it as input for 
OpenNLP in a type-safe way.
   
   But mirroring an open metadata taxonomy in protos is the wrong thing to sign 
the project up for. 
   
   I'd maintain it, but that's not scalable.  So let me drop that framing and 
come at it from the other side.
   
   The thing worth typing isn't Tika's metadata - it's the parsed document. 
Here's the shape I'd propose, and it's one small stable proto, not a per-format 
taxonomy:
   
   - **Content as structured blocks** - headings, paragraphs, lists, tables, 
code, images. It's a standard markdown document model, so it renders straight 
back to markdown and it's exactly what a RAG/embeddings pipeline wants to 
consume. This is the actual product, and it's anchored to a spec that doesn't 
churn.
   - **Common metadata typed** - title, authors, created/modified as 
`Timestamp`, page/word counts, language. This is the cross-format stuff 
everyone always wants, and it's where a date has to be a `Timestamp`, not a 
string that 12 languages each have to re-parse.
   - **Everything else in one native tagged tail** - typed where Tika already 
declares the type, string otherwise (never guessed). That's the lossless map 
that replaces the old `fields` map, just multivalue and type-aware.
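
A rough sketch of that three-part shape, written as Python dataclasses rather than proto just so it runs. Every name below (`Block`, `CommonMetadata`, `Document`, `add_tail`, `to_markdown`) is hypothetical and only illustrates the idea; it is not the proto I'd submit, and the metadata keys are just examples.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional, Union

# Hypothetical names throughout; this mirrors the proposed shape, not a real Tika API.
TailValue = Union[str, int, datetime]


@dataclass
class Block:
    kind: str        # e.g. "heading", "paragraph", "list", "table", "code", "image"
    text: str = ""
    level: int = 0   # heading depth; unused for other kinds


@dataclass
class CommonMetadata:
    title: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    created: Optional[datetime] = None   # stands in for a protobuf Timestamp
    modified: Optional[datetime] = None
    page_count: Optional[int] = None
    word_count: Optional[int] = None
    language: Optional[str] = None


@dataclass
class Document:
    blocks: List[Block] = field(default_factory=list)
    metadata: CommonMetadata = field(default_factory=CommonMetadata)
    # The tagged tail: multivalue, typed only when a type was declared.
    tail: Dict[str, List[TailValue]] = field(default_factory=dict)


def add_tail(doc: Document, key: str, raw: str, declared: Optional[type] = None) -> None:
    """Store a value typed if a type was declared, as a plain string otherwise."""
    if declared is int:
        value: TailValue = int(raw)
    elif declared is datetime:
        value = datetime.fromisoformat(raw)
    else:
        value = raw  # never guessed
    doc.tail.setdefault(key, []).append(value)


def to_markdown(doc: Document) -> str:
    """Render headings and plain text blocks; other kinds are left out of this sketch."""
    parts = []
    for block in doc.blocks:
        if block.kind == "heading":
            parts.append("#" * max(block.level, 1) + " " + block.text)
        else:
            parts.append(block.text)
    return "\n\n".join(parts)
```

The point of the sketch is that the tail never infers a type: a value is an `int` or a `datetime` only when the caller passes the declared type, and repeated keys accumulate instead of overwriting.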
   
   This is actually close to where you landed - the tail is your `map`, and 
the typed surface is the common cross-format fields, a bit past Dublin Core 
but nowhere near a taxonomy mirror.
   
   On the maintenance worry, which is the real one: format specifics don't go 
on the wire. They go in a per-parser transformer (just code). One `Document` 
proto. Adding a parser is adding a transformer - the contract doesn't move, so 
clients never rebuild for it. And to be precise about the rebuild fear: in 
proto3, adding fields under new field numbers (`optional` or not) is backward and forward compatible. Existing 
clients keep working and simply don't see the new field. Nobody is forced to 
regenerate unless they actually want the new data. So our metadata churn lands 
in the mapper and the tail, never in a contract clients have to chase.
   
   To answer your question directly - what I need in Tika vs outside:
   - **In Tika:** the `Document` proto, a generic transformer, and the tagged 
tail replacing the `fields` map. Small and stable.
   - **Outside / pluggable:** the richer per-parser transformers can ship as 
add-on modules. Tika owns a clean contract; the heavy mapping is opt-in.
   
   On why bother typing it at all, since I know that's the undercurrent: the 
whole point of gRPC is that the message *is* the typed object. If the client 
still has to crawl and re-parse strings, then gRPC is only doing the serde and 
we've handed the real work back to the user. Protobuf gives you clean JSON for 
free on top of that, while starting from JSON never gives you a typed 
contract. So this isn't type-safety for its own sake - it's what lets Tika be 
a first-class parser from Rust, Python and Go, not just Java, with one 
contract across all of them.
   
   I'll redo this - give me a day to reshape it.  It'll be far fewer fields 
to maintain, and there will be a transformation interface for mapping them.  
Anything it doesn't cover we'll put in a struct.


