janhoy commented on PR #3670: URL: https://github.com/apache/solr/pull/3670#issuecomment-3328791839
My thought was to land tikaserver in Solr 9.x as opt-in while deprecating local. The server variant need not respond with exactly the same metadata, and some of the tests that specifically test 1.x functionality can be moved to that test class. But for the simple use cases that 90% of users need, like extracting text and normal metadata from PDF, Word etc., we get feature parity. Then we remove the local Tika parser in 10.0 and make server the default. I.e. users will have a transition path even in 9.x.

I started with the JSON output from Tika Server, but since it does not support streaming and instead builds a full copy in memory, I'm moving to the `/tika` endpoint with XML response, where the TikaServer streams XHTML as parsing happens, without buffering everything in memory first. Same on the SolrCell side: I'm successfully parsing the XHTML with SAX, including all the `<meta>` tags. Next is to feed the SAX stream into `SolrContentHandler`, which will handle the capturing stuff. This should both give a small memory footprint and unlock more of the SolrCell features.

While it is true that there are many breaking changes between Tika 1.x and Tika 3.x, they are mainly in the Java API. The XML parse result, which is a `content` string and a `metadata` map, stays the same, so there is no conceptual difference there. The metadata keys are a bit different/normalized, but we don't need to bridge that. We can simply document that when using `tikaserver` users should look for `dc:title` instead of `title`, and SolrCell already allows you to map those to whatever schema field you like.

The big question is of course whether we manage to get a stable Tika Server impl that is production ready before 10.0, and whether the refactoring leaves the old `local` impl as stable as it has been, e.g. whether its memory footprint has increased.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
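For readers following along, the SAX approach described in the comment can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in the PR: the class `XhtmlMetaSketch`, its `Result` holder, and the sample XHTML document are all hypothetical, and the real implementation would forward the SAX events to `SolrContentHandler` rather than collect them into a map and a string buffer. The sketch uses only the JDK's built-in SAX parser.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: stream an XHTML document through SAX, collecting
// <meta name="..." content="..."/> pairs and the text inside <body>.
class XhtmlMetaSketch {

  // Hypothetical holder for the "metadata map" and "content string".
  static final class Result {
    final Map<String, String> metadata = new LinkedHashMap<>();
    final StringBuilder content = new StringBuilder();
  }

  static Result parse(InputStream xhtml) throws Exception {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true); // XHTML elements live in a namespace
    Result result = new Result();

    factory.newSAXParser().parse(xhtml, new DefaultHandler() {
      private boolean inBody = false;

      @Override
      public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("meta".equals(localName)) {
          String name = atts.getValue("name");
          String value = atts.getValue("content");
          if (name != null && value != null) {
            // Simplification: a repeated key overwrites the earlier value.
            result.metadata.put(name, value);
          }
        } else if ("body".equals(localName)) {
          inBody = true;
        }
      }

      @Override
      public void endElement(String uri, String localName, String qName) {
        if ("body".equals(localName)) {
          inBody = false;
        }
      }

      @Override
      public void characters(char[] ch, int start, int length) {
        // Text arrives in chunks as the parser reads; nothing is buffered up front.
        if (inBody) {
          result.content.append(ch, start, length);
        }
      }
    });
    return result;
  }

  public static void main(String[] args) throws Exception {
    // Assumed sample input, shaped like an XHTML response with <meta> tags in <head>.
    String sample =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>"
            + "<meta name=\"dc:title\" content=\"Quarterly report\"/>"
            + "<meta name=\"Content-Type\" content=\"application/pdf\"/>"
            + "</head><body><p>Hello from Tika</p></body></html>";
    Result r = parse(new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)));
    System.out.println(r.metadata);
    System.out.println(r.content);
  }
}
```

Because the handler reacts to events as they arrive, the same pattern would work on a network stream from the server without holding the whole response in memory, which is the point of moving from the JSON output to the streamed XHTML.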
