[ https://issues.apache.org/jira/browse/SOLR-17519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17912139#comment-17912139 ]

Matthew Biscocho commented on SOLR-17519:
-----------------------------------------

Using a static set of URLs would simplify ClusterStateProvider, but losing that "dynamic node discovery" changes the behavior users are probably used to. Correct me if I am wrong, but CloudSolrClient used to get live nodes from ZK directly, so this was probably never an issue, even with node movements or migrations?

If I ask for cluster state, I would assume CloudSolrClient knows the latest nodes that Solr knows about and not just the ones passed in, though this change can be documented. For example, if I initialize the client with node-0, ask for cluster state, and get back "node-0 and node-1", I would assume CloudSolrClient knows about node-1 and can get state from it even if node-0 is gone, because the cluster state returned it. Or maybe that's the wrong way of thinking.

We do node migrations regularly, so this would put maintaining the list of Solr nodes on the user, and a stale list would become their point of failure. That may be cumbersome at scale, but it's debatable whether the user should be maintaining it anyway.

I think the proposal makes sense, though. It is easy to document, easy for the user to understand, and easy to maintain going forward, and it would probably have avoided this bug altogether, which is a plus.

> CloudSolrClient with HTTP ClusterState can forget live nodes and then fail
> --------------------------------------------------------------------------
>
>                 Key: SOLR-17519
>                 URL: https://issues.apache.org/jira/browse/SOLR-17519
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: SolrCloud, SolrJ
>            Reporter: David Smiley
>            Priority: Major
>              Labels: newdev, pull-request-available
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> When using CloudSolrClient with HTTP URLs to Solr for the cluster state:
> If all live nodes disappear temporarily (hard cluster restart?), the client
> can permanently fail to talk to the cluster, and thus would need to be
> restarted to recover.
> Credit [~ilan] on the dev list:
> {quote}The current implementation removes non live nodes from the set of
> nodes to connect to. Getting the live nodes requires connecting to a specific
> node in the cluster that is therefore live when that happens. Worst case, if
> there is a single node up in the cluster, the client ends with a single node
> in its connection candidates list. For the issue to manifest, that Solr node
> then has to go down. Subsequently, even if other nodes are up, the client
> only has the address of a down node and can't connect.
> The fix is not a big deal. Nodes initially passed as configuration to the
> client should never be removed from the set of candidate nodes to connect to,
> even if they are not live. Other live nodes could be added to that set (and
> removed from it if we so desire when they are no longer live) to increase
> resiliency in case the cluster does have live nodes but all initially
> configured nodes are not live. The design issue is treating the configured
> set of nodes to connect to and the set of live nodes as one thing.
> {quote}
> See org.apache.solr.client.solrj.impl.BaseHttpClusterStateProvider
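
The fix described in the quoted issue (keep the configured nodes and the discovered live nodes as two separate sets) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in BaseHttpClusterStateProvider: the class name `CandidateNodes`, its methods, and the node URLs are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical sketch: tracks which Solr base URLs a client may try when it
 * needs to fetch cluster state. Not part of SolrJ.
 */
final class CandidateNodes {
    // URLs passed in at construction time. Never removed, even when not live.
    private final Set<String> configured;
    // Extra URLs learned from the latest live-nodes response. Replaced on each refresh.
    private Set<String> discovered = new LinkedHashSet<>();

    CandidateNodes(Collection<String> configuredUrls) {
        if (configuredUrls.isEmpty()) {
            throw new IllegalArgumentException("at least one configured URL is required");
        }
        this.configured = Collections.unmodifiableSet(new LinkedHashSet<>(configuredUrls));
    }

    /** Replaces the discovered set with the live nodes that are not already configured. */
    synchronized void updateLiveNodes(Collection<String> liveUrls) {
        Set<String> next = new LinkedHashSet<>(liveUrls);
        next.removeAll(configured);
        this.discovered = next;
    }

    /** Returns configured URLs first, followed by any discovered live nodes. */
    synchronized List<String> candidates() {
        List<String> all = new ArrayList<>(configured);
        all.addAll(discovered);
        return all;
    }
}
```

With this shape, a refresh that reports no live nodes (for example, during a hard cluster restart) empties only the discovered set, so the client still has the configured URLs to retry once the cluster comes back. Whether discovered nodes should also be used as candidates is the trade-off raised in the comment above; dropping the `discovered` field would give the static-list variant.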