[jira] [Commented] (SOLR-17519) CloudSolrClient with HTTP ClusterState can forget live nodes and then fail

David Smiley (Jira) Fri, 10 Jan 2025 12:25:04 -0800


    [ 
https://issues.apache.org/jira/browse/SOLR-17519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17912109#comment-17912109
 ]


David Smiley commented on SOLR-17519:
-------------------------------------

Here's another proposal that I'd like an opinion on, designed to both simplify 
things and give the deployer a bit of control that they don't have today: 
only use the configured URLs for all cluster state API interactions.  Easy.  No 
need to even convert a "liveNode" URL-ish string to a URL.  The deployer's job 
is to pick the node list wisely – nodes that will always be there, 
notwithstanding restarts.  But that's true today as well.  Perhaps the deployer 
designates a node-0 node (and one or two others as fallbacks) that already has 
a node role to be the Overseer and/or to act as a coordinator.  The risk with 
my proposal is that by *not* considering live nodes (those returned from Solr), 
the initial list may become inaccessible.  But that's a risk anyway at the 
time a client starts.
CC [~mbiscocho] 
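
A minimal sketch of the proposal above, for discussion only. It is not the
actual SolrJ code: the class name ConfiguredUrlsOnlyFetcher and the
fetchFromUrl callback are hypothetical, and the callback stands in for
whatever HTTP call retrieves the cluster state from one node. The point it
illustrates is that the URL list is fixed at construction and is never
replaced by live nodes.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical illustration; not org.apache.solr.client.solrj.impl code. */
final class ConfiguredUrlsOnlyFetcher<T> {
    // Fixed at construction; never shrunk or replaced by live nodes.
    private final List<String> configuredUrls;

    ConfiguredUrlsOnlyFetcher(List<String> configuredUrls) {
        if (configuredUrls.isEmpty()) {
            throw new IllegalArgumentException("At least one Solr URL is required");
        }
        this.configuredUrls = List.copyOf(configuredUrls);
    }

    /**
     * Tries each configured URL in order and returns the first result.
     * fetchFromUrl returns Optional.empty() when that node is unreachable.
     */
    T fetch(Function<String, Optional<T>> fetchFromUrl) {
        for (String url : configuredUrls) {
            Optional<T> result = fetchFromUrl.apply(url);
            if (result.isPresent()) {
                return result.get();
            }
        }
        throw new IllegalStateException(
            "None of the configured nodes is reachable: " + configuredUrls);
    }
}
```

Because the list never changes, a node that was down during one call is
simply tried again on the next call, so a temporary outage of every
configured node cannot leave the client stuck permanently.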

> CloudSolrClient with HTTP ClusterState can forget live nodes and then fail
> --------------------------------------------------------------------------
>
>                 Key: SOLR-17519
>                 URL: https://issues.apache.org/jira/browse/SOLR-17519
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud, SolrJ
>            Reporter: David Smiley
>            Priority: Major
>              Labels: newdev, pull-request-available
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> When using CloudSolrClient with HTTP URLs to Solr for the cluster state:
> If all live nodes disappear temporarily (hard cluster restart?), the client 
> can permanently fail to talk to the cluster, and thus would need to be 
> restarted to recover.
> Credit [~ilan] on the dev list:
> {quote}The current implementation removes non live nodes from the set of 
> nodes to connect to. Getting the live nodes requires connecting to a specific 
> node in the cluster that is therefore live when that happens. Worst case, if 
> there is a single node up in the cluster, the client ends with a single node 
> in its connection candidates list. For the issue to manifest, that Solr node 
> then has to go down. Subsequently, even if other nodes are up, the client 
> only has the address of a down node and can't connect.
> The fix is not a big deal. Nodes initially passed as configuration to the 
> client should never be removed from the set of candidate nodes to connect to, 
> even if they are not live. Other live nodes could be added to that set (and 
> removed from it if we so desire when they are no longer live) to increase 
> resiliency in case the cluster does have live nodes but all initially 
> configured nodes are not live. The design issue is treating the configured 
> set of nodes to connect to and the set of live nodes as one thing.
> {quote}
> See org.apache.solr.client.solrj.impl.BaseHttpClusterStateProvider
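
The fix described in the quote above can be sketched as follows. This is an
illustration under stated assumptions, not the BaseHttpClusterStateProvider
implementation: the CandidateNodes class and its methods are hypothetical,
and live nodes are assumed to have already been converted to URLs. It keeps
the configured set and the live set as two separate things and derives the
connection candidates from their union.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of keeping configured and live nodes separate. */
final class CandidateNodes {
    // Nodes passed as client configuration; never removed.
    private final Set<String> configured;
    // Nodes most recently reported as live by the cluster; replaced on refresh.
    private Set<String> live = new LinkedHashSet<>();

    CandidateNodes(List<String> configuredUrls) {
        this.configured =
            Collections.unmodifiableSet(new LinkedHashSet<>(configuredUrls));
    }

    /** Replaces the live set; the configured set is left untouched. */
    void updateLiveNodes(Set<String> liveUrls) {
        this.live = new LinkedHashSet<>(liveUrls);
    }

    /** Configured nodes first, then any additional live nodes. */
    Set<String> candidates() {
        Set<String> all = new LinkedHashSet<>(configured);
        all.addAll(live);
        return all;
    }
}
```

Even when the cluster reports no live nodes at all, candidates() still
returns every configured node, so the client always has somewhere to retry.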



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org
