janhoy commented on PR #3670:
URL: https://github.com/apache/solr/pull/3670#issuecomment-3314351420

   Status: 
   * Parses docs using TikaServer
   * Can switch between `xml` (html) and `text` format of the content field
   * Randomized the choice of backend for the main test class
   * `extractOnly` is not fully implemented for the TikaServer backend, so some tests fail
   
   TBD:
   * XPath / SAX parsing of the XML response is missing entirely
   * We use the JDK HTTP client; we could perhaps use the Jetty client instead. See the other POC for an example, including how to make timeouts configurable
   * Must make sure that `tikaserver.url` is only configurable in the request handler config in solrconfig, not as a request parameter (security)
   * RefGuide docs, especially how to start TikaServer etc
   * Remove the DummyExtractionBackend
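
A minimal sketch of the JDK HTTP client item above, not the code in the PR. It assumes Tika Server's `/tika` endpoint, which accepts a `PUT` with the document body and picks the output format from the `Accept` header (`text/html` for the XHTML/`xml` variant, `text/plain` for `text`). The class name `TikaServerSketch`, its method names, and the timeout values are hypothetical; the base URL is passed in by the caller, standing in for a value read from the request handler config rather than from request parameters.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper, for illustration only.
public class TikaServerSketch {

  // Build the PUT request; the Accept header selects xml (XHTML) or plain text output.
  static HttpRequest buildRequest(String baseUrl, byte[] doc, boolean xml, Duration timeout) {
    return HttpRequest.newBuilder(URI.create(baseUrl + "/tika"))
        .timeout(timeout) // per-request timeout, meant to come from config
        .header("Accept", xml ? "text/html" : "text/plain")
        .PUT(HttpRequest.BodyPublishers.ofByteArray(doc))
        .build();
  }

  // Create a client with a configurable connect timeout.
  static HttpClient newClient(Duration connectTimeout) {
    return HttpClient.newBuilder().connectTimeout(connectTimeout).build();
  }

  // Send the request and return the extracted content, failing on non-200 responses.
  static String extract(HttpClient client, HttpRequest request) throws Exception {
    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
    if (response.statusCode() != 200) {
      throw new IllegalStateException("TikaServer returned HTTP " + response.statusCode());
    }
    return response.body();
  }
}
```

Usage would be along the lines of `extract(newClient(Duration.ofSeconds(5)), buildRequest(url, bytes, true, Duration.ofSeconds(60)))`, with `url` taken from solrconfig only.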
   
   Anyone, please feel free to hack away on this if it looks exciting, committing directly to the PR branch.
   
   Question: Would it bring value to isolate the refactoring in one PR and then add the TikaServer implementation in a second one?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

