tballison commented on PR #2916: URL: https://github.com/apache/tika/pull/2916#issuecomment-4846751029
Even with agents to help out, I can't stomach 11k lines of code to nail down maybe 80% of an open set. I'm really worried about maintenance within the project and then clients having to rebuild their protos when we change metadata definitions. We've had churn on value types EVEN for Dublin Core over the history of the project. Even if we limit custom handling to that, clients will still have to rebuild their protos when we make changes.

I'd be ok, maybe, with special handling for Dublin Core and some of the Tika core properties: media type, etc.

Fellow devs (@nddipiazza) what do you think about this?

From Claude:

The lossless catch-all is the right idea and the part that belongs in Tika; it's what should replace the removed fields map. I'd simplify its shape, though: from repeated MetadataEntry with a typed oneof to a plain multivalue map<string, StringList>. That keeps the native dict lookup clients had with the old map<string,string>, fixes the real gap (multivalue), and drops the per-value typing, which for dynamic keys forces clients to branch on a 6-way union on every read without giving them a compile-time typed accessor anyway. A new or renamed metadata key still never forces a client rebuild, because a key is data, not schema. On top of that map I'd add only special-cased DC + a few core props as typed strings.

@krickert what, specifically, do you need within the Tika project and what can you do outside of Tika to meet your objectives?
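
To make the difference between the two shapes concrete, here is a rough sketch of what a client read might look like in each case. It is plain Python rather than generated proto code, and everything in it is an assumption for illustration: the `MetadataEntry` dataclass, the arms of the `TypedValue` union, the helper names `read_typed` and `read_multivalue`, and the sample keys and values are hypothetical stand-ins, not the definitions in the PR.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

# Stand-in for a typed oneof value. The arms are illustrative only;
# the actual arms of the PR's MetadataEntry are not reproduced here.
TypedValue = Union[str, int, float, bool, bytes]


@dataclass
class MetadataEntry:
    key: str
    value: TypedValue


def read_typed(entries: List[MetadataEntry], key: str) -> List[str]:
    """Shape A: scan the repeated entries and branch on the value kind."""
    out: List[str] = []
    for entry in entries:
        if entry.key != key:
            continue
        v = entry.value
        # bool must be checked before int because bool is a subclass of int.
        if isinstance(v, bool):
            out.append("true" if v else "false")
        elif isinstance(v, (int, float)):
            out.append(str(v))
        elif isinstance(v, bytes):
            out.append(v.hex())
        else:
            out.append(v)
    return out


def read_multivalue(metadata: Dict[str, List[str]], key: str) -> List[str]:
    """Shape B: direct dict lookup; every value is already a string."""
    return metadata.get(key, [])


# Hypothetical sample data expressed in both shapes.
entries = [
    MetadataEntry("dc:creator", "Alice"),
    MetadataEntry("dc:creator", "Bob"),
    MetadataEntry("xmpTPg:NPages", 12),
]
metadata = {
    "dc:creator": ["Alice", "Bob"],
    "xmpTPg:NPages": ["12"],
}

# A key the client has never seen is just another map entry: no schema change.
metadata["custom:new-key"] = ["value"]
```

In this sketch the map shape keeps lookups to a single `get`, handles multiple values per key, and accepts a new key without any change to the types, which is the point made above about a key being data rather than schema.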
