On 9/13/22 20:07, Jonathan Tan wrote:
The SolrZkClient takes a startUpTimeOut and a startUpZkTimeOut property, and only checks whether ZK is available within that period. Once that timeout has been exceeded, it declares that the Solr node was unable to load the cores, and then it does nothing else. Subsequent incoming requests (like /solr/admin/info/system) check the CoreContainer state, and if that's not in a good state (and it won't be if ZK wasn't available at startup), the request simply fails, and nothing else happens.
Sounds like that needs a little work. I think Solr should never get into a state where it stops trying to connect to ZK. If only a single ZK node is still available, then Solr cannot run in read-write mode, but it should keep working in read-only mode, and it should always try to reconnect to the full ensemble when/if it becomes available.
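As a rough sketch of the "never stop trying" idea, the loop below keeps retrying a connection attempt with capped exponential backoff instead of giving up after a fixed startup timeout. It is not Solr or ZooKeeper code: the class name, the method name, and the `BooleanSupplier` standing in for "try to connect to ZK" are all hypothetical, chosen only for illustration.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch, not Solr code: retry until connected, never give up on a timeout. */
final class ZkReconnectSketch {

    /**
     * Calls connectAttempt repeatedly until it returns true.
     * The delay between attempts doubles each time, capped at maxDelayMs.
     * Returns the number of attempts made, or -1 if the thread was interrupted.
     */
    static int retryUntilConnected(BooleanSupplier connectAttempt,
                                   long initialDelayMs,
                                   long maxDelayMs) {
        long delayMs = initialDelayMs;
        int attempts = 0;
        while (true) {
            attempts++;
            if (connectAttempt.getAsBoolean()) {
                return attempts; // connected, stop retrying
            }
            try {
                Thread.sleep(delayMs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag so the caller can see the shutdown request.
                Thread.currentThread().interrupt();
                return -1;
            }
            // Back off, but never wait longer than the cap.
            delayMs = Math.min(delayMs * 2, maxDelayMs);
        }
    }
}
```

In a real server this would presumably run on a background thread, so that requests can keep being answered (or rejected with a clear error) while the reconnect attempts continue.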
I'm reasonably sure that at the moment, a DNS lookup for the ZK hosts only happens when Solr first starts. That is probably another thing that could use some work in Solr. I could be wrong ... I am not familiar with the code for SolrCloud.
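If the lookup really does happen only once, one possible way around it is to keep the ZK host list unresolved and look the names up again on every connection attempt. The sketch below illustrates that with plain JDK classes. It is an assumption-laden example, not how Solr or the ZK client actually handles this: the class and method names are hypothetical, the parsing ignores IPv6 literals, and the JVM may still cache lookups on its own (see the `networkaddress.cache.ttl` security property).

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not Solr code: defer DNS lookups until each connection attempt. */
final class ZkHostResolveSketch {

    /**
     * Parses a zkHost string such as "zk1:2181,zk2:2181/solr" into
     * unresolved addresses. Any chroot suffix ("/solr") is ignored here.
     * Entries without a port get 2181, the default ZK client port.
     */
    static List<InetSocketAddress> parseUnresolved(String zkHost) {
        int slash = zkHost.indexOf('/');
        String hosts = slash >= 0 ? zkHost.substring(0, slash) : zkHost;
        List<InetSocketAddress> result = new ArrayList<>();
        for (String entry : hosts.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int colon = trimmed.lastIndexOf(':');
            String host = colon >= 0 ? trimmed.substring(0, colon) : trimmed;
            int port = colon >= 0 ? Integer.parseInt(trimmed.substring(colon + 1)) : 2181;
            // createUnresolved performs no DNS lookup.
            result.add(InetSocketAddress.createUnresolved(host, port));
        }
        return result;
    }

    /**
     * Looks the host name up at call time. Calling this on every reconnect
     * attempt, rather than once at startup, lets a changed DNS record be noticed.
     */
    static InetSocketAddress resolveNow(InetSocketAddress unresolved) {
        return new InetSocketAddress(unresolved.getHostString(), unresolved.getPort());
    }
}
```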
As of ZK 3.5, ZK supports dynamic reconfiguration of the ensemble. I read something somewhere by one of our devs saying that Solr doesn't support this, but I would think support would depend on the ZK client, not Solr. If you are using ZK's dynamic reconfig when you add/remove ZK nodes, then IMHO Solr should just work when membership in the ZK cluster changes.
Thanks, Shawn