janhoy commented on PR #3670:
URL: https://github.com/apache/solr/pull/3670#issuecomment-3328791839

   My thought was to land tikaserver in Solr 9.x as opt-in while deprecating 
local. The server variant need not respond with exactly the same metadata, and 
some of the tests that specifically test 1.x functionality can be moved to that 
test class. But for the simple use cases that 90% of users need, like extracting 
text and normal metadata from PDF, Word, etc., we get feature parity. Then we 
remove the local Tika parser in 10.0 and make server the default. That is, users 
will have a transition path even in 9.x.
   
   I started with the JSON output from Tika Server, but since it does not 
support streaming and instead builds a full copy in memory, I'm moving to the 
`/tika` endpoint with XML response, where Tika Server streams XHTML as parsing 
happens, without buffering everything in memory first. Same on the SolrCell 
side: I'm successfully parsing the XHTML with SAX, reading all the `<meta>` 
tags. Next is to feed the SAX stream into `SolrContentHandler`, which will 
handle the capturing stuff. This should both give a small memory footprint and 
unlock more of the SolrCell features.
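   
   A minimal sketch of the SAX side, for illustration only. It assumes the 
XHTML carries metadata as `<meta name="..." content="..."/>` elements in 
`<head>` and text in `<body>`. The class name `XhtmlMetaSketch` and the 
`Collector` handler are hypothetical and not part of the PR. It also buffers the 
body text for brevity, where a real implementation would forward the events to a 
downstream handler such as `SolrContentHandler`:

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class XhtmlMetaSketch {

  /** Hypothetical handler: collects <meta> name/content pairs and body text. */
  static class Collector extends DefaultHandler {
    final Map<String, String> metadata = new LinkedHashMap<>();
    final StringBuilder text = new StringBuilder();
    private boolean inBody = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
      if ("meta".equals(qName)) {
        String name = atts.getValue("name");
        String content = atts.getValue("content");
        if (name != null && content != null) {
          metadata.put(name, content);
        }
      } else if ("body".equals(qName)) {
        inBody = true;
      }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
      if ("body".equals(qName)) {
        inBody = false;
      }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
      // Buffered here only to keep the sketch short; a streaming
      // implementation would pass these events on instead.
      if (inBody) {
        text.append(ch, start, length);
      }
    }
  }

  /** Parses an XHTML string; a real client would pass the HTTP response stream. */
  static Collector parse(String xhtml) throws Exception {
    // The default factory is not namespace-aware, so qName is the plain tag name.
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    Collector collector = new Collector();
    parser.parse(new InputSource(new StringReader(xhtml)), collector);
    return collector;
  }

  public static void main(String[] args) throws Exception {
    String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>"
        + "<meta name=\"dc:title\" content=\"Hello\"/>"
        + "<meta name=\"Content-Type\" content=\"application/pdf\"/>"
        + "</head><body><p>Some text</p></body></html>";
    Collector c = parse(sample);
    System.out.println(c.metadata);
    System.out.println(c.text);
  }
}
```

   Because SAX delivers events as the input arrives, the handler never needs the 
whole document in memory, which is the property the `/tika` XHTML response is 
meant to preserve end to end.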
   
   While it is true that there are many breaking changes between Tika 1.x and 
Tika 3.x, they are mainly in the Java API. The XML parse result, which is a 
`content` string and a `metadata` map, stays the same, so there is no conceptual 
difference there. The metadata keys are a bit different/normalized, but we don't 
need to bridge that. We can simply document that when using `tikaserver` users 
should look for `dc:title` instead of `title`, and SolrCell already allows you 
to map those to whatever schema field you like.
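   
   To illustrate the kind of key mapping meant here, a small sketch that 
renames metadata keys through a user-supplied map and leaves unmapped keys 
untouched. The class `MetadataKeyMapSketch` and its `remap` helper are 
hypothetical and only mimic the idea; they are not SolrCell's actual 
implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataKeyMapSketch {

  /** Returns a copy of metadata with keys renamed according to fieldMap. */
  static Map<String, String> remap(Map<String, String> metadata, Map<String, String> fieldMap) {
    Map<String, String> out = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : metadata.entrySet()) {
      // Keys without a mapping keep their original name.
      String target = fieldMap.getOrDefault(e.getKey(), e.getKey());
      out.put(target, e.getValue());
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, String> fromServer = new LinkedHashMap<>();
    fromServer.put("dc:title", "Annual report");
    fromServer.put("Content-Type", "application/pdf");

    // Map the normalized key back to the schema field used previously.
    Map<String, String> mapped = remap(fromServer, Map.of("dc:title", "title"));
    System.out.println(mapped);
  }
}
```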
   
   The big question is of course whether we manage to get a stable Tika Server 
impl that is production-ready before 10.0, and whether the refactoring leaves 
the old `local` impl as stable as it has been; its memory footprint may have 
increased, for example.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

