[jira] [Commented] (SOLR-6236) Need an optional fallback mechanism for selecting a leader when all replicas are in leader-initiated recovery.

David Smiley (Jira) Tue, 28 Feb 2023 06:57:18 -0800


    [ 
https://issues.apache.org/jira/browse/SOLR-6236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17694600#comment-17694600
 ]


David Smiley commented on SOLR-6236:
------------------------------------

Closing because we don't have LIR anymore thanks to SOLR-11702 (ZK shard terms).

> Need an optional fallback mechanism for selecting a leader when all replicas 
> are in leader-initiated recovery.
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-6236
>                 URL: https://issues.apache.org/jira/browse/SOLR-6236
>             Project: Solr
>          Issue Type: Improvement
>          Components: SolrCloud
>            Reporter: Timothy Potter
>            Priority: Major
>         Attachments: SOLR-6236.patch
>
>
> Offshoot from discussion in SOLR-6235, key points are:
> Tim: In ElectionContext, when running shouldIBeLeader, the node will choose 
> to not be the leader if it is in LIR. However, this could lead to no leader. 
> My thinking there is the state is bad enough that we would need manual 
> intervention to clear one of the LIR znodes to allow a replica to get past 
> this point. But maybe we can do better here?
> Shalin: Good question. With careful use of minRf, the user can retry 
> operations and maintain consistency even if we arbitrarily elect a leader in 
> this case. But most people won't use minRf and don't care about consistency 
> as much as availability. For them there should be a way to get out of this 
> mess easily. We can have a collection property (boolean + timeout value) to 
> force elect a leader even if all shards were in LIR. What do you think?
> Mark: Indeed, it's a current limitation that you can have all nodes in a 
> shard thinking they cannot be leader, even when all of them are available. 
> This is not required by the distributed model we have at all, it's just a 
> consequence of being over restrictive on the initial implementation - if all 
> known replicas are participating, you should be able to get a leader. So I'm 
> not sure if this case should be optional. But iff not all known replicas are 
> participating and you still want to force a leader, that should be optional - 
> I think it should default to false though. I think the system should default 
> to reasonable data safety in these cases.
> How best to solve this, I'm not quite sure, but happy to look at a patch. How 
> do you plan on monitoring and taking action? Via the Overseer? It seems 
> tricky to do it from the replicas.
> Tim: We have a similar issue where a replica attempting to be the leader 
> needs to wait a while to see other replicas before declaring itself the 
> leader, see ElectionContext around line 200:
> int leaderVoteWait = cc.getZkController().getLeaderVoteWait();
> if (!weAreReplacement) {
>   waitForReplicasToComeUp(weAreReplacement, leaderVoteWait);
> }
> So one quick idea might be to have the code that checks if it's in LIR see if 
> all replicas are in LIR and if so, wait out the leaderVoteWait period and 
> check again. If all are still in LIR, then move on with becoming the leader 
> (in the spirit of availability).
> {quote}
> But iff not all known replicas are participating and you still want to force 
> a leader, that should be optional - I think it should default to false 
> though. I think the system should default to reasonable data safety in these 
> cases.
> {quote}
> Shalin: That's the same case as the leaderVoteWait situation and we do go 
> ahead after that amount of time even if all replicas aren't participating. 
> Therefore, I think that we should handle it the same way. But to help people 
> who care about consistency over availability, there should be a configurable 
> property which bans this auto-promotion completely.
> In any case, we should switch to coreNodeName instead of coreName and open an 
> issue to improve the leader election part.
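
The fallback Tim describes in the quoted discussion (if every replica is in LIR, wait out leaderVoteWait, check again, then go ahead and lead) can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the attached patch or from Solr: the class LirFallbackSketch, the method decide, the Decision enum, and the autoPromotionAllowed flag are hypothetical names, with the flag standing in for the configurable property Shalin suggests for users who prefer consistency over availability.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch: none of these names exist in Solr.
class LirFallbackSketch {

    enum Decision { BECOME_LEADER, DEFER }

    /**
     * @param selfInLir            whether this replica is marked as in leader-initiated recovery
     * @param allReplicasInLir     check (assumed to read cluster state) that every replica is in LIR
     * @param leaderVoteWaitMs     how long to wait before re-checking, in milliseconds
     * @param autoPromotionAllowed assumed opt-out switch for consistency-first deployments
     */
    static Decision decide(boolean selfInLir,
                           BooleanSupplier allReplicasInLir,
                           long leaderVoteWaitMs,
                           boolean autoPromotionAllowed) {
        // A replica that is not in LIR can lead right away.
        if (!selfInLir) {
            return Decision.BECOME_LEADER;
        }
        // If auto-promotion is banned, or some other replica is not in LIR,
        // this replica keeps declining leadership.
        if (!autoPromotionAllowed || !allReplicasInLir.getAsBoolean()) {
            return Decision.DEFER;
        }
        // Every replica is in LIR: wait out the vote-wait period, then check again.
        try {
            Thread.sleep(leaderVoteWaitMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Decision.DEFER;
        }
        // Still all in LIR: go ahead and lead, in the spirit of availability.
        return allReplicasInLir.getAsBoolean() ? Decision.BECOME_LEADER : Decision.DEFER;
    }

    public static void main(String[] args) {
        System.out.println(decide(true, () -> true, 10L, true));   // BECOME_LEADER
        System.out.println(decide(true, () -> true, 10L, false));  // DEFER
    }
}
```

The second check after the wait matters: if another replica leaves LIR during the wait, this replica defers to it rather than promoting itself.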



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org
