Re: [PR] SOLR-7632 TikaServer as pluggable backend to existing extraction handler [solr]

via GitHub Thu, 25 Sep 2025 08:09:34 -0700


janhoy commented on PR #3670:
URL: https://github.com/apache/solr/pull/3670#issuecomment-3334645293


   So, pushed a commit with some nice changes:
   
   * Refactor some logic back to ExtractingDocumentLoader, simplify 
ExtractionBackend interface to two methods
   * Add `backCompatibility=true` config option to enable duplicating some 
metadata like Tika 1.x did, e.g. both `dc:title` and `title`
   * Fix true SAX streaming parsing of the Tika-Server XML response. We now have 
our own `TikaXmlResponseSaxContentHandler`, which takes care of pulling metadata 
from the response while delegating all other SAX parsing to whatever 
`ContentHandler` is passed to the parse method. This lets us re-use existing 
code to extract plain text, XML, or captured XPath-style tags
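
To illustrate the delegation idea in the last bullet, here is a minimal sketch. It is not the code from the PR: the class name `MetaCapturingHandler`, the helper `TextCollector`, and the assumption that metadata arrives as `<meta name="..." content="..."/>` elements in the XHTML response are all hypothetical and chosen for illustration only.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: captures <meta> elements itself and forwards every
// other SAX event to the delegate handler supplied by the caller.
class MetaCapturingHandler extends DefaultHandler {
  private final ContentHandler delegate;
  private final Map<String, String> metadata = new LinkedHashMap<>();

  MetaCapturingHandler(ContentHandler delegate) {
    this.delegate = delegate;
  }

  Map<String, String> getMetadata() {
    return metadata;
  }

  // Without namespace awareness localName is empty, so fall back to qName.
  private static boolean isMeta(String localName, String qName) {
    return "meta".equals(localName.isEmpty() ? qName : localName);
  }

  @Override
  public void startDocument() throws SAXException {
    delegate.startDocument();
  }

  @Override
  public void endDocument() throws SAXException {
    delegate.endDocument();
  }

  @Override
  public void startElement(String uri, String localName, String qName, Attributes atts)
      throws SAXException {
    if (isMeta(localName, qName)) {
      String name = atts.getValue("name");
      String content = atts.getValue("content");
      if (name != null && content != null) {
        metadata.put(name, content); // keep metadata, do not forward the tag
      }
      return;
    }
    delegate.startElement(uri, localName, qName, atts);
  }

  @Override
  public void endElement(String uri, String localName, String qName) throws SAXException {
    if (isMeta(localName, qName)) {
      return; // the matching start tag was swallowed as well
    }
    delegate.endElement(uri, localName, qName);
  }

  @Override
  public void characters(char[] ch, int start, int length) throws SAXException {
    delegate.characters(ch, start, length);
  }

  // Tiny delegate used for the demo: collects character data as plain text.
  static class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
      text.append(ch, start, length);
    }

    @Override
    public String toString() {
      return text.toString();
    }
  }

  // Streams the XML through the wrapper and returns the captured metadata.
  static Map<String, String> parse(String xml, ContentHandler delegate) throws Exception {
    MetaCapturingHandler handler = new MetaCapturingHandler(delegate);
    SAXParserFactory.newInstance()
        .newSAXParser()
        .parse(new InputSource(new StringReader(xml)), handler);
    return handler.getMetadata();
  }

  public static void main(String[] args) throws Exception {
    String xml =
        "<html><head><meta name=\"dc:title\" content=\"Hello\"/></head>"
            + "<body><p>Body text</p></body></html>";
    TextCollector text = new TextCollector();
    Map<String, String> meta = parse(xml, text);
    System.out.println(meta);
    System.out.println(text);
  }
}
```

Because the wrapper only intercepts the metadata elements, any existing `ContentHandler` (plain text, XML serialization, or tag capture) could be passed in as the delegate unchanged, which is the re-use the bullet describes.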
   
   Not all tests pass, but two more are green.
   
   <img width="321" height="384" alt="Screenshot 2025-09-25 at 17 02 08" 
src="https://github.com/user-attachments/assets/74f683f7-9c98-4106-9fa8-2c712529897e"
 />
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

