krickert commented on PR #2916:
URL: https://github.com/apache/tika/pull/2916#issuecomment-4844253570
The Claude review was useful and I went through all 7 points. Pushed in
`99883db0c`.
Short version: the "augment, not replace" shape you recommended and where
this landed are the same now. One lossless, populated channel as the source of
truth (`ParseResponse.metadata`), a typed convenience layer on top,
Content-Type trusted, and the oneof dropped for coexisting submessages.
Below is a summary of the changes that line up with the concerns that
yourself and Claude pointed out (summary provided by claude based on my initial
write up) -
| # | Concern | Status | What changed
|
| --- | ------------------------- | ------------ |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
| 1 | Dead wire fields | Fixed | `metadata` (field 12)
populated; `raw_properties` (14) reserved; empty Structs removed
|
| 2 | Four overlapping channels | Down to one | only
`ParseResponse.metadata` is a catch-all now; the 13 per-format
`additional_metadata`, `BaseFields.raw_metadata`, and 4 `GenericMetadata`
Structs are gone (numbers reserved) |
| 3 | Re-detects format, twice | Fixed | single `detect()`, reads
`Content-Type` first, exposed as a typed `primary_format` enum
|
| 4 | oneof fights Tika's model | Fixed + test | oneof to coexisting
`optional` peers; `ParseResponseCoexistenceTest`
|
| 5 | Maintaining the mirror | Bounded | desync no longer drops
data, see below
|
| 6 | Fidelity loss (Struct) | Fixed | Struct gone;
`MetadataEntry` keeps multivalue and native types; Property constants
|
| 7 | Scope creep | Partial | helper script removed;
EPUB extractor I can split if you want
|
Write up (provided by me, edited for grammar by claude)
**1. Dead fields.** `metadata` is populated for every key now,
`raw_properties` is reserved. `ParseResponseMetadataPopulationTest` asserts
`metadata.names().length == response.getMetadataCount()`, so there's no
always-empty public field anymore and it's guard-railed.
**2. Overlapping channels.** One catch-all: `ParseResponse.metadata`. Every
per-format `additional_metadata`, `BaseFields.raw_metadata` (that one was a
full dump embedded in every message), and the four `GenericMetadata` Structs
are removed, field numbers reserved. Precedence is written up in
`tika-grpc-api/README.md`.
**3. Detection.** One `detect()` call that leads with `Content-Type`
(extension and CC are fallbacks only when MIME is missing), surfaced as
`DocumentFormatCategory primary_format` so clients read an enum instead of
re-deriving.
**4. The oneof.** Gone, replaced with independent `optional` submessages
(same field numbers, so it stays wire-compatible). They coexist, so DC and CC
aren't special cases anymore. `ParseResponseCoexistenceTest` covers both a
format submessage plus the CC overlay on one response, and a parent PDF with an
embedded child typed `IMAGE` on its own `parsed_content`. That last bit is
where "EXIF on embedded images" actually belongs, instead of forcing a second
bucket onto the parent.
**5. Maintenance.** This is the big one. It's often the first "shock"
people get when surfacing a gRPC interface - it's it's also why most
implementations never take off. I believe in keeping the ugly parsing details
behind the contract and not pushing that work to the user. That's the center
of the decision - and there' s honestly no good compromise (avoid frameworks
like bazel as it turns into just as much mapping scaffolding and you're at the
whim of yet another layer - its best to own it and go first class. LLMs are
great at writing mapping code and will only get better).
Two things keep it bounded:
- No silent data loss when a field desyncs. The mirror is lossless and the
test above proves every Tika key lands in `metadata`, typed. So if core renames
or adds a Property and a typed field stops matching, the value doesn't
disappear into a lossy Struct, it's still in `metadata` with the right type.
Worst case is a convenience field is briefly missing, not a dropped value. That
kills the specific failure mode the review called out.
- It's cheap. I lifted this model from an OSS pipeline I run live and it's
been about an hour a year to keep current (thanks to LLMs this is now easy).
The interfaces don't move much at all. If they do, we still capture unknowns.
(Also, I'll be glad to maintain this - as I need to maintain it anyway for
other projects)
And the typed layer is opt-in. If you want zero coupling to the taxonomy you
read `metadata` and ignore the typed fields. The mirror is the contract, the
typed fields are convenience. My take is still that if the contract stays
dynamic, JSON is the better fit, the whole reason to be on gRPC is the typed
schema. Simply put - grpc is pretty bad at being dynamic. So don't fight it,
embrace the type safety and downstream client code looks great and all those
parsing casting errors you've seen for years just go away.
**6. Fidelity.** No more Struct, so no JSON-ish doubles and no lost
String[]. `MetadataEntry` keeps multivalue via `text_list` and native types
(int64/double/bool/Timestamp) coerced from Tika Property definitions. Swapped
the hardcoded "Keywords"/"Content-Length" for Property constants, and
multivalue keywords are joined.
**7. Scope.** The `commit-typed-parse-response.sh` script is gone. The
detector trusts Content-Type and sits in the mapper as routing, not parsing,
but I can move it. `EpubStructureExtractor` is a fair hit. I can split it, and
trim WARC/font/climate to a thinner first cut, into a follow-up if you'd rather
keep this PR lean.
On staging the break: nothing consumes the gRPC `fields` map yet and this is
4.0, so I did a hard removal instead of deprecate-then-remove. Easy to switch
to staged if you'd prefer not to break it in one shot. But again, the fields
map is not really the design you want - the JSON interface is far better for
this.
Why I care about the typing: it lets me wire tika into the opennlp-sandbox
gRPC work (OPENNLP-1833) and on to embeddings without re-parsing metadata Tika
already typed. I also created mapping tools that I'm going to share that make
the mappings between protobufs via CEL selectors - making mapping truely
dynamic while not resorting to java reflection.
I know this is a lot - and I'll keep up with responding to any concerns.
Where I'd want your read: how curated the typed surface should be (you
floated DC plus a dozen common fields), the EPUB-extractor and detector
altitude, and hard-remove vs staged. Which of those matter most to you?
I think that the popularity of this resides on being able to get other
languages like Rust and Python to play well with the interface. So a strongly
typed one - where java can feel like a pain - is actually strongly preferred in
a python IDE and would make this package very attractive as a first class
parser that exceeds speed and capabilities of any parser out there.
Tika is great because it's a powerhouse for speed. It's reliable. This
would polish up the surface and give that to 12 languages in a solid contract.
So to me the contract is most important and why this reply focuses so heavily
on it.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]