janhoy commented on PR #3670: URL: https://github.com/apache/solr/pull/3670#issuecomment-3314351420
Status:
* Parses docs using TikaServer
* Can switch between `xml` (html) and `text` format of the content field
* Randomized the choice of backend for the main test class
* ExtractOnly not fully implemented for tikaserver, some tests fail

TBD:
* The whole xpath / SAX parsing of XML response is missing
* We use JDK HTTP client, could perhaps use Jetty client. See other POC for example, including making timeouts configurable
* Must make sure that `tikaserver.url` is only configurable on requesthandler config in solrconfig, not as a request parameter (security)
* RefGuide docs, especially how to start TikaServer etc
* Remove the DummyExtractionBackend

Anyone, please feel free to hack away on this if it looks exciting, committing directly to the PR branch.

Question: Would it bring value to isolate the refactoring in one PR and then another one to add the tikaserver impl?
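
To illustrate the HTTP client item in the TBD list, here is a minimal sketch of calling TikaServer with the JDK `HttpClient` and configurable timeouts. It is not the code in this PR: the class name `TikaServerClientSketch`, its constructor parameters and the `xml` flag are hypothetical. It assumes TikaServer's `/tika` endpoint, where the `Accept` header selects `text/html` or `text/plain` output. The base URL is passed in at construction time, as it would be if it came from the request handler config rather than from a request parameter.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical sketch only; names and structure are assumptions, not the PR's implementation. */
public class TikaServerClientSketch {
  private final URI tikaEndpoint;
  private final Duration requestTimeout;
  private final HttpClient client;

  /**
   * @param baseUrl TikaServer base URL, e.g. http://localhost:9998 (assumed to come from handler config)
   * @param connectTimeout timeout for establishing the connection
   * @param requestTimeout timeout for the whole request/response exchange
   */
  public TikaServerClientSketch(String baseUrl, Duration connectTimeout, Duration requestTimeout) {
    // Strip one trailing slash so we do not end up with "//tika"
    String base = baseUrl.endsWith("/") ? baseUrl.substring(0, baseUrl.length() - 1) : baseUrl;
    this.tikaEndpoint = URI.create(base + "/tika");
    this.requestTimeout = requestTimeout;
    this.client = HttpClient.newBuilder().connectTimeout(connectTimeout).build();
  }

  /** Builds the PUT request; kept separate so it can be inspected without a running server. */
  public HttpRequest buildRequest(byte[] document, boolean xml) {
    return HttpRequest.newBuilder(tikaEndpoint)
        .timeout(requestTimeout)
        // Accept header chooses the output format of the extracted content
        .header("Accept", xml ? "text/html" : "text/plain")
        .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
        .build();
  }

  /** Sends the document to TikaServer and returns the extracted content as a string. */
  public String extract(byte[] document, boolean xml) throws IOException, InterruptedException {
    HttpResponse<String> rsp =
        client.send(buildRequest(document, xml), HttpResponse.BodyHandlers.ofString());
    if (rsp.statusCode() != 200) {
      throw new IOException("TikaServer returned HTTP " + rsp.statusCode());
    }
    return rsp.body();
  }
}
```

Keeping the URL out of the per-request arguments means a caller can only choose the document and the output format, never the server that receives it.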
