janhoy commented on PR #3670: URL: https://github.com/apache/solr/pull/3670#issuecomment-3336397547
The `textXpath` test tries to capture `<a>` tags directly under `<body>`, but it also captures the `<a>` tag nested in `<div><a>`. I compared the XML I get from local Tika with the XML we get from Tika Server 3, and they differ: Tika Server strips all `<div>` tags, so the nested `<a>` element appears to sit directly below `<body>`. I believe this is because the default HTML parser is now JSoup, which follows different rules. See https://issues.apache.org/jira/browse/TIKA-2562

Thus, the test document can be rewritten to use something other than `<div>`, and the test will work. I believe the `testCapture` test has the same issue, as it relies on capturing `<div>`.

That gives us a solution for the remaining three failing tests 🥳
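
To illustrate why the direct-child match picks up the extra link, here is a minimal sketch in Python. The two sample documents are hypothetical stand-ins, not the actual test fixture or real Tika output, and the XHTML namespace that Tika emits is omitted for brevity. The helper name `direct_body_links` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML as a parser that preserves <div> wrappers might emit it.
WITH_DIV = """<html><body>
<a href="http://direct.example/">direct</a>
<div><a href="http://nested.example/">nested</a></div>
</body></html>"""

# Hypothetical XHTML where the <div> wrappers have been stripped,
# which is the behaviour described for Tika Server above.
WITHOUT_DIV = """<html><body>
<a href="http://direct.example/">direct</a>
<a href="http://nested.example/">nested</a>
</body></html>"""


def direct_body_links(xhtml):
    """Return the text of every <a> that is a direct child of <body>."""
    root = ET.fromstring(xhtml)  # root is the <html> element
    # "./body/a" selects only <a> elements one level below <body>.
    return [a.text for a in root.findall("./body/a")]


if __name__ == "__main__":
    print(direct_body_links(WITH_DIV))
    print(direct_body_links(WITHOUT_DIV))
```

With the `<div>` preserved, only the first link is a direct child of `<body>`. Once the wrapper is stripped, both links match the same path, which is consistent with the extra capture seen in the failing test.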
