Re: [PR] SOLR-7632 TikaServer as pluggable backend to existing extraction handler [solr]

via GitHub Thu, 25 Sep 2025 17:43:43 -0700


janhoy commented on PR #3670:
URL: https://github.com/apache/solr/pull/3670#issuecomment-3336397547


   The `textXpath` test tries to capture only the `<a>` tags directly under 
`<body>`, but it also captures the `<a>` nested inside `<div>`. I compared the 
XML I get from local Tika with the XML we get from Tika Server 3, and they 
differ: Tika Server strips all the `<div>` tags, so the nested `<a>` element 
appears directly below `<body>`. I believe this is because the default HTML 
parser is now JSoup, which applies different rules. See 
https://issues.apache.org/jira/browse/TIKA-2562
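
   To illustrate the effect, here is a minimal sketch in Python using only the 
standard library. The two XHTML snippets are hypothetical stand-ins, not the 
actual test document or real Tika output, and XML namespaces are omitted for 
brevity. The path `./body/a` is meant to mimic a "direct child of `<body>`" 
XPath; the exact expression used by the test is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical output where the <div> wrapper is preserved
# (roughly what the comment describes for local Tika).
WITH_DIV = (
    "<html><body>"
    "<a href='u1'>direct</a>"
    "<div><a href='u2'>nested</a></div>"
    "</body></html>"
)

# Hypothetical output where the <div> wrapper has been stripped
# (roughly what the comment describes for Tika Server 3).
WITHOUT_DIV = (
    "<html><body>"
    "<a href='u1'>direct</a>"
    "<a href='u2'>nested</a>"
    "</body></html>"
)


def direct_body_links(xhtml):
    """Return the text of every <a> that is a direct child of <body>."""
    root = ET.fromstring(xhtml)
    # "./body/a" matches only <a> elements whose parent is <body>.
    return [a.text for a in root.findall("./body/a")]


print(direct_body_links(WITH_DIV))     # ['direct']
print(direct_body_links(WITHOUT_DIV))  # ['direct', 'nested']
```

   With the `<div>` stripped, the same path expression matches one more 
element, which is consistent with the extra capture seen in the test.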
   
   Thus, the test document can be rewritten to use an element other than 
`<div>`, and the test should then pass.
   
   I believe the `testCapture` test has the same issue, as it relies on 
capturing `<div>`.
   
   That gives us a solution for the remaining three failing tests 🥳


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

