[jira] [Commented] (SOLR-17519) CloudSolrClient with HTTP ClusterState can forget live nodes and then fail

Houston Putman (Jira) Tue, 14 Jan 2025 14:27:34 -0800


    [ 
https://issues.apache.org/jira/browse/SOLR-17519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17913086#comment-17913086
 ]


Houston Putman commented on SOLR-17519:
---------------------------------------

{quote}Another reason to exclusively use the configured Solr URL(s) is that 
the operator might want to use a load balancer (may even _already_ be using 
one), not directly a Solr node itself. It might be a single dependable 
URL.
{quote}
This is absolutely true, but then the user shouldn't be using the 
"CloudSolrClient", correct? I think the naming is bad here. If I'm not mistaken, 
the "CloudSolrClient"s act more like "LiveNodeSolrClients", as opposed to the 
LoadBalancingSolrClient, which just load balances between a set of URLs. So in 
this case, an operator who puts a single load balancer in front of the cluster 
shouldn't use the "CloudSolrClient" at all. Maybe this requires a better name 
or better documentation. I'm not sure.

While I think this is true, I also don't think that the 
HttpClusterStateProvider should be locked into this pattern, even if this is 
the only use case for it (I'm not sure what else it's used for). The 
HttpClusterStateProvider should have the option to either use dynamic node 
discovery or stick with its initial set of URLs (as David is pushing for). The 
CloudSolrClient should probably then use the dynamic reloading, for the reasons 
listed above.

To summarize, I like the PR, but I think the dynamic part should be 
opt-out-able in the HttpClusterStateProvider.
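
A minimal sketch of what "opt-out-able" could look like, assuming a plain 
boolean switch. The class ClusterUrlSource, its dynamicDiscovery flag, and its 
method names are hypothetical and are not part of SolrJ or of the PR; the point 
is only that the configured URLs are always kept, and live nodes are added on 
top of them when discovery is enabled.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration only; not the HttpClusterStateProvider API. */
class ClusterUrlSource {
  private final List<String> configuredUrls;   // fixed at construction, never pruned
  private final boolean dynamicDiscovery;      // assumed opt-out switch
  private final Set<String> liveUrls = new LinkedHashSet<>();

  ClusterUrlSource(List<String> configuredUrls, boolean dynamicDiscovery) {
    this.configuredUrls = List.copyOf(configuredUrls);
    this.dynamicDiscovery = dynamicDiscovery;
  }

  /** Called with the latest live-node list; ignored when discovery is off. */
  void updateLiveNodes(List<String> live) {
    if (!dynamicDiscovery) {
      return;
    }
    liveUrls.clear();
    liveUrls.addAll(live);
  }

  /** Configured URLs first, then any additional discovered live nodes. */
  List<String> urlsToTry() {
    Set<String> urls = new LinkedHashSet<>(configuredUrls);
    urls.addAll(liveUrls);
    return new ArrayList<>(urls);
  }
}
```

With the flag off, a single load balancer URL stays the only candidate no 
matter what the cluster reports; with it on, discovered nodes extend the list 
but never replace the configured entries.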

> CloudSolrClient with HTTP ClusterState can forget live nodes and then fail
> --------------------------------------------------------------------------
>
>                 Key: SOLR-17519
>                 URL: https://issues.apache.org/jira/browse/SOLR-17519
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud, SolrJ
>            Reporter: David Smiley
>            Priority: Major
>              Labels: newdev, pull-request-available
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> When using CloudSolrClient with HTTP URLs to Solr for the cluster state:
> If all live nodes disappear temporarily (hard cluster restart?), the client 
> can permanently fail to talk to the cluster, and thus would need to be 
> restarted to recover.
> Credit [~ilan] on the dev list:
> {quote}The current implementation removes non live nodes from the set of 
> nodes to connect to. Getting the live nodes requires connecting to a specific 
> node in the cluster that is therefore live when that happens. Worst case, if 
> there is a single node up in the cluster, the client ends with a single node 
> in its connection candidates list. For the issue to manifest, that Solr node 
> then has to go down. Subsequently, even if other nodes are up, the client 
> only has the address of a down node and can't connect.
> The fix is not a big deal. Nodes initially passed as configuration to the 
> client should never be removed from the set of candidate nodes to connect to, 
> even if they are not live. Other live nodes could be added to that set (and 
> removed from it if we so desire when they are no longer live) to increase 
> resiliency in case the cluster does have live nodes but all initially 
> configured nodes are not live. The design issue is treating the configured 
> set of nodes to connect to and the set of live nodes as one thing.
> {quote}
> See org.apache.solr.client.solrj.impl.BaseHttpClusterStateProvider
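
The failure sequence in the description can be reproduced with a toy model. 
This is a sketch under stated assumptions, not the code in 
BaseHttpClusterStateProvider: the class ForgetfulCandidates and its methods are 
hypothetical, and node reachability is simulated with a plain set of "up" URLs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical toy model of the reported behavior; not SolrJ code. */
class ForgetfulCandidates {
  private List<String> candidates;

  ForgetfulCandidates(List<String> configuredUrls) {
    this.candidates = new ArrayList<>(configuredUrls);
  }

  /**
   * Mimics the problem: the first reachable candidate reports the live nodes,
   * and that report overwrites the whole candidate list.
   */
  void refresh(Set<String> upNodes) {
    for (String url : candidates) {
      if (upNodes.contains(url)) {
        candidates = new ArrayList<>(upNodes);
        return;
      }
    }
    // No candidate is reachable: the list stays as it was, so the client is stuck.
  }

  /** True if at least one remembered candidate is currently up. */
  boolean canConnect(Set<String> upNodes) {
    return candidates.stream().anyMatch(upNodes::contains);
  }
}
```

Starting from three configured nodes, a refresh while only one node is up 
shrinks the list to that node. If that node then goes down and the other two 
come back, canConnect returns false even though the cluster is healthy.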



