[ https://issues.apache.org/jira/browse/SOLR-17519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17912109#comment-17912109 ]
David Smiley commented on SOLR-17519:
-------------------------------------

Here's another proposal that I'd like an opinion on, designed to both simplify and actually give the deployer a bit of control that they don't have today:

Only use the configured URLs for all cluster state API interactions. Easy. No need to even convert a "liveNode" url-ish thing to a URL.

The deployer's job is to pick the node list wisely – nodes that will always be there, notwithstanding restarts. But that's true today as well. Perhaps the deployer designates a node-0 node (and one or two others fallback) that already has a node role to be the Overseer and/or for doing coordinator stuff.

The risk with my proposal is that by *not* considering live nodes (those returned from Solr), perhaps the initial list becomes inaccessible. But that's a risk anyway at the time a client starts.

CC [~mbiscocho]

> CloudSolrClient with HTTP ClusterState can forget live nodes and then fail
> --------------------------------------------------------------------------
>
> Key: SOLR-17519
> URL: https://issues.apache.org/jira/browse/SOLR-17519
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: SolrCloud, SolrJ
> Reporter: David Smiley
> Priority: Major
> Labels: newdev, pull-request-available
> Time Spent: 2h
> Remaining Estimate: 0h
>
> When using CloudSolrClient with HTTP URLs to Solr for the cluster state:
> If all live nodes disappear temporarily (hard cluster restart?), the client
> can permanently fail to talk to the cluster, and thus would need to be
> restarted to recover.
> Credit [~ilan] on the dev list:
> {quote}The current implementation removes non live nodes from the set of
> nodes to connect to. Getting the live nodes requires connecting to a specific
> node in the cluster that is therefore live when that happens.
> Worst case, if there is a single node up in the cluster, the client ends
> with a single node in its connection candidates list. For the issue to
> manifest, that Solr node then has to go down. Subsequently, even if other
> nodes are up, the client only has the address of a down node and can't
> connect.
> The fix is not a big deal. Nodes initially passed as configuration to the
> client should never be removed from the set of candidate nodes to connect to,
> even if they are not live. Other live nodes could be added to that set (and
> removed from it if we so desire when they are no longer live) to increase
> resiliency in case the cluster does have live nodes but all initially
> configured nodes are not live. The design issue is treating the configured
> set of nodes to connect to and the set of live nodes as one thing.
> {quote}
> See org.apache.solr.client.solrj.impl.BaseHttpClusterStateProvider
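
For illustration, the separation described in the quoted issue text (configured nodes are never dropped, live nodes come and go) might be sketched roughly as below. CandidateNodes and its methods are hypothetical names invented for this sketch; they are not existing SolrJ API, and this is not how BaseHttpClusterStateProvider is actually written.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not part of SolrJ): keeps configured and live nodes apart. */
final class CandidateNodes {
  // URLs the deployer passed to the client at construction time; never removed.
  private final Set<String> configured;
  // Nodes most recently reported live by the cluster; replaced on every refresh.
  private Set<String> live = new LinkedHashSet<>();

  CandidateNodes(Collection<String> configuredUrls) {
    if (configuredUrls.isEmpty()) {
      throw new IllegalArgumentException("at least one configured URL is required");
    }
    this.configured = new LinkedHashSet<>(configuredUrls);
  }

  /** Replaces the live set; the configured set is untouched even if none of it is live. */
  synchronized void updateLiveNodes(Collection<String> reportedLive) {
    live = new LinkedHashSet<>(reportedLive);
  }

  /** Nodes to try for cluster state: configured first, then any additional live nodes. */
  synchronized List<String> candidates() {
    Set<String> all = new LinkedHashSet<>(configured);
    all.addAll(live);
    return List.copyOf(all);
  }
}
```

Under the alternative proposed in the comment above (use only the configured URLs), candidates() would simply return the configured set and updateLiveNodes would be unnecessary.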