[ https://issues.apache.org/jira/browse/SOLR-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720336#comment-14720336 ]
Mark Miller commented on SOLR-7844:
-----------------------------------
I'm also sad not to see a way to get the version back from a created node, only
the path?
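For reference, the stock ZooKeeper client has the same gap: create() hands back
only the final path, so the closest workaround is a follow-up exists() read to
fetch the Stat. A minimal sketch against the raw org.apache.zookeeper API (not
SolrZkClient); note the extra round trip means the version can already be stale
by the time it returns:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    class CreateThenReadVersion {
      // create() returns only the created path, so learning the node's
      // initial version takes a second round trip via exists().
      static int createAndGetVersion(ZooKeeper zk, String path, byte[] data)
          throws KeeperException, InterruptedException {
        String created = zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE,
            CreateMode.EPHEMERAL);
        Stat stat = zk.exists(created, false);  // null if the node is already gone
        return stat == null ? -1 : stat.getVersion();
      }
    }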
> Zookeeper session expiry during shard leader election can cause multiple
> leaders.
> ---------------------------------------------------------------------------------
>
> Key: SOLR-7844
> URL: https://issues.apache.org/jira/browse/SOLR-7844
> Project: Solr
> Issue Type: Bug
> Affects Versions: 4.10.4
> Reporter: Mike Roberts
> Assignee: Mark Miller
> Fix For: Trunk, 5.4
>
> Attachments: SOLR-7844.patch, SOLR-7844.patch, SOLR-7844.patch
>
>
> If the ZooKeeper session expires for a host during shard leader election, the
> ephemeral leader_elect nodes are removed. However, the threads that were
> processing the election are still present (and could believe the host won the
> election). They will then incorrectly create leader nodes once a new
> ZooKeeper session is established.
> This introduces a subtle race condition that could cause two hosts to become
> leader.
> Scenario:
> a three-machine cluster where all of the machines are restarting at
> approximately the same time.
> The first machine starts and writes a leader_elect ephemeral node; it's the
> only candidate in the election, so it wins and starts the leadership process.
> As it knows it has peers, it begins to block waiting for the peers to arrive.
> During this period of blocking[1], the ZK connection drops and the session
> expires.
> A new ZK session is established, and ElectionContext.cancelElection is
> called. Then register() is called and a new set of leader_elect ephemeral
> nodes is created.
> During the period between the ZK session expiring and the new set of
> leader_elect nodes being created, the second machine starts.
> It creates its leader_elect ephemeral nodes; as there are no other nodes, it
> wins the election and starts the leadership process. As it's still missing
> one of its peers, it begins to block waiting for the third machine to join.
> There is now a race between machine1 & machine2, both of whom think they are
> the leader.
> So far, this isn't too bad, because the machine that loses the race will fail
> when it tries to create the /collection/name/leader/shard1 node (as it
> already exists), and will rejoin the election.
> While this is happening, machine3 has started and has queued for leadership
> behind machine2.
> If the loser of the race is machine2, when it rejoins the election it cancels
> the current context, deleting its leader_elect ephemeral nodes.
> At this point, machine3 believes it has become leader (the watcher it has on
> the leader_elect node fires), and it runs the LeaderElector::checkIfIAmLeader
> method. This method DELETES the current /collection/name/leader/shard1 node,
> then starts the leadership process (as all three machines are now running, it
> does not block to wait).
> So, machine1 won the race with machine2 and declared its leadership and
> created the nodes. However, machine3 has just deleted them, and recreated
> them for itself. So machine1 and machine3 both believe they are the leader.
> I am thinking that the fix should be to cancel & close all election contexts
> immediately on reconnect (we do cancel them, however that is done serially,
> which has blocking issues, and just cancelling does not cause the wait loop
> to exit). The election context logic already has checks on the closed flag,
> so those threads should exit once they see the context has been closed.
> I'm working on a patch for this.
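For what it's worth, here is a rough sketch of the shape of the fix described
above. The ElectionContext and ReconnectHandler classes below are illustrative
placeholders rather than the actual Solr code, but they show the two pieces in
play: closing every election context in parallel on reconnect, and having the
blocking wait loop observe the closed flag.

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicBoolean;

    // Illustrative placeholder, not the actual Solr ElectionContext class.
    class ElectionContext {
      private final AtomicBoolean closed = new AtomicBoolean(false);

      // Cancel the election and mark the context closed so any thread still
      // blocked in the leadership process can notice and bail out.
      void cancelAndClose() {
        closed.set(true);
        // ... delete this context's leader_elect ephemeral nodes here ...
      }

      boolean isClosed() {
        return closed.get();
      }

      // The "block waiting for peers" loop: it re-checks the closed flag on
      // every iteration instead of waiting only for peer arrival.
      void waitForReplicas(long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
          if (isClosed()) {
            return;  // the session expired and a new context took over; give up
          }
          // ... check whether all expected replicas have shown up ...
          Thread.sleep(250);
        }
      }
    }

    // Illustrative reconnect handling: close every live context in parallel so
    // one blocked context cannot delay the others once the new session starts.
    class ReconnectHandler {
      void onReconnect(List<ElectionContext> liveContexts) throws InterruptedException {
        ExecutorService pool = Executors.newCachedThreadPool();
        for (ElectionContext ctx : liveContexts) {
          pool.execute(ctx::cancelAndClose);
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        // ... only now re-register and create fresh leader_elect nodes ...
      }
    }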