tballison commented on PR #2916:
URL: https://github.com/apache/tika/pull/2916#issuecomment-4842589016

   I've only had a chance to look at this briefly. 
   
   The challenges as you know: a) under the hood, we're currently storing 
metadata keys as strings, b) metadata keys are an open set with 
user-customizable keys, c) there are a lot of keys.
   
   This was claude's review
   
   ```
   What's legitimate
   
     - Replacing a flat map<string,string> with a typed contract is a 
reasonable thing to want — string-keyed maps are a weak contract for gRPC 
clients.
     - The tika-grpc-api / tika-grpc-mapper / tika-grpc module split (clients 
depend on schema, not server) is good hygiene.
     - Unmapped keys aren't simply dropped: each format message has an 
additional_metadata Struct (e.g. pdf_metadata.proto:253) and the builders fill 
it via
     MetadataUtils.buildAdditionalMetadata(..., mappedFields). So it's not as 
lossy as the "replace the map" framing first suggests.
   
     What doesn't hold up
   
     1. Dead fields in the public wire contract. ParseResponse.raw_properties 
(field 14) and repeated MetadataEntry metadata (field 12) are declared in the 
proto with
     comments promising "anything not captured in typed metadata" — but the 
mapper populates neither (0 references in ParseResponseMapper.java). Shipping 
public proto
     fields that are always empty is a contract smell; a client will reasonably 
read raw_properties and get nothing.
   
     2. Four overlapping places a value can live. typed format field → 
per-format additional_metadata Struct → dublin_core → creative_commons, plus 
the two dead
     channels. There's no single predictable place to find a given property. 
That's harder to consume correctly than the map it replaces.
   
     3. It re-detects the format Tika already told you. 
DocumentTypeDetector.detect() (316 lines of heuristics over metadata) chooses 
the oneof branch — but Tika
     already set Content-Type. Re-deriving it heuristically is fragile and 
duplicative, and it's called twice (mapper lines 198 and 327).
   
     4. The oneof document_metadata fights Tika's data model. Tika metadata 
routinely spans namespaces (a PDF carrying XMP/DC, EXIF on embedded images, …). 
Forcing
     exactly one format bucket is precisely why DC and CC had to be pulled out 
of the oneof as special cases — the model is already breaking under its own 
weight, and
     every new cross-cutting namespace adds another special case.
   
     5. A hand-maintained parallel copy of Tika's entire metadata taxonomy. 
~5,000 lines of proto + ~4,000 lines of builders that must track 
org.apache.tika.metadata.*
     by hand forever. When core adds/renames a Property, the typed field 
silently desyncs and the value quietly drops into additional_metadata. That's a 
large
     permanent maintenance tax for marginal gain over a good catch-all.
   
     6. Fidelity loss. Catch-alls use google.protobuf.Struct, which is JSON-ish 
(numbers are doubles, no native String[]); Tika metadata is uniformly 
multivalued
     strings. And small tells confirm looseness: hardcoded keys 
"Keywords"/"Content-Length" instead of Property constants, and content.keywords 
is a scalar though Tika
     keywords are multivalue.
   
     7. Scope creep past "typed parse response." EpubStructureExtractor (483 
lines) and the detector live in the gRPC mapper — that's parser/detection logic 
migrating
     into the serialization layer, the wrong altitude. 
WARC/font/climate-forecast/database schemas widen the surface a lot for a first 
cut. There's also a
     commit-typed-parse-response.sh script checked in, which doesn't belong in 
the PR.
   
   My recommendation
   
     The shape I'd push toward is augment, not replace:
     - Keep one lossless, uniform, actually-populated channel as source of 
truth (a repeated key/value that preserves multivalue), then layer a small 
curated set of
     typed convenience fields (DC + a dozen genuinely common PDF/office fields) 
on top.
     - Trust Content-Type; drop the format oneof in favor of coexisting 
sub-messages (or just DC + raw).
     - Don't move EPUB structure extraction / type detection into the mapper.
     - Wire or delete the dead fields, and stage the break (add parse_response 
alongside fields, deprecate, then remove) rather than a 10.7k-line breaking 
mega-PR.
   
   ```
   
   
   5. gives me great concern from a maintenance perspective. I haven't looked 
carefully enough to know if this is real.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to