[jira] [Commented] (SOLR-5952) Recovery race/ error

Jessica Cheng (JIRA) Wed, 02 Apr 2014 20:33:07 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958489#comment-13958489
 ]


Jessica Cheng commented on SOLR-5952:
-------------------------------------

{quote}
Oh - I thought you were talking about the isLeader flag that is kept on the 
CloudDescriptor.
{quote}

Ah, I see. Well, I guess it could've been either. I'd just assumed that 
clusterstate was the one that said it was the leader and CloudDescriptor was 
the one that said it wasn't, based on the check below throwing:

{quote}
if (isLeader && !cloudDesc.isLeader()) {
    throw new SolrException(ErrorCode.SERVER_ERROR, "Cloud state still says we are leader.");
}
{quote}

where isLeader was determined from clusterstate.
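
To make the two sources of truth concrete, here is a minimal, self-contained sketch of that check. It is only a model: LocalDescriptor is a hypothetical stand-in for the flag kept on CloudDescriptor, the boolean parameter stands in for whatever was read from clusterstate, and IllegalStateException replaces SolrException so the snippet runs without Solr on the classpath.

```java
// Minimal model of the two leader flags discussed above.
// LocalDescriptor is a hypothetical stand-in, not Solr's CloudDescriptor.
final class LeaderStateCheckSketch {

    /** Stand-in for the node-local isLeader flag. */
    static final class LocalDescriptor {
        private final boolean leader;

        LocalDescriptor(boolean leader) {
            this.leader = leader;
        }

        boolean isLeader() {
            return leader;
        }
    }

    /** True only when clusterstate says "leader" but the local flag says "not leader". */
    static boolean statesDisagree(boolean leaderPerClusterState, LocalDescriptor desc) {
        return leaderPerClusterState && !desc.isLeader();
    }

    /** Same shape as the quoted check; IllegalStateException replaces SolrException. */
    static void checkBeforeRecovery(boolean leaderPerClusterState, LocalDescriptor desc) {
        if (statesDisagree(leaderPerClusterState, desc)) {
            throw new IllegalStateException("Cloud state still says we are leader.");
        }
    }

    public static void main(String[] args) {
        // Consistent views: the check passes silently.
        checkBeforeRecovery(false, new LocalDescriptor(false));
        checkBeforeRecovery(true, new LocalDescriptor(true));

        // Stale clusterstate: the check throws.
        try {
            checkBeforeRecovery(true, new LocalDescriptor(false));
        } catch (IllegalStateException e) {
            System.out.println("recovery aborted: " + e.getMessage());
        }
    }
}
```

In this model the exception fires only in the one combination I assumed above: clusterstate still reports leader while the local flag has already been cleared.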

> Recovery race/ error
> --------------------
>
>                 Key: SOLR-5952
>                 URL: https://issues.apache.org/jira/browse/SOLR-5952
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.7
>            Reporter: Jessica Cheng
>            Assignee: Mark Miller
>              Labels: leader, recovery, solrcloud, zookeeper
>             Fix For: 4.8, 5.0
>
>         Attachments: recovery-failure-host1-log.txt, 
> recovery-failure-host2-log.txt
>
>
> We're seeing some shard recovery errors in our cluster when a zookeeper 
> "error event" happens. In this particular case, we had two replicas. The 
> events in the logs look roughly like this:
> 18:40:36 follower (host2) disconnected from zk
> 18:40:38 original leader started recovery (there was no log about why it 
> needed recovery though) and failed because cluster state still says it's the 
> leader
> 18:40:39 follower successfully connected to zk after some trouble
> 19:03:35 follower register core/replica
> 19:16:36 follower registration fails due to no leader (NoNode for 
> /collections/test-1/leaders/shard2)
> Essentially, I think the problem is that the isLeader property on the cluster 
> state is never cleaned up, so neither replica is able to recover/register 
> in order to participate in leader election again.
> From the code, it looks like the only place where the isLeader property is 
> cleared from the cluster state is ElectionContext.runLeaderProcess, 
> which assumes that the replica with the next election seqId will notice the 
> leader's node disappearing and run the leader process. This assumption fails 
> in this scenario because the follower experienced the same zookeeper "error 
> event" and never handled the event of the leader going away. (Mark, this is 
> what I meant in SOLR-3582 when I said that maybe the watcher in 
> LeaderElector.checkIfIamLeader does need to handle "Expired" by somehow 
> realizing that the leader is gone and at least clearing the isLeader state; 
> it currently ignores all EventType.None events.)
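
The watcher idea in the description can be sketched roughly as follows. This is not the LeaderElector code: the enums are local stand-ins that reuse a few constant names from ZooKeeper's Watcher.Event.EventType and Watcher.Event.KeeperState, and the Action values and both method names are hypothetical. The "current" method only models the behaviour described above (all EventType.None events are ignored); what the real watcher does for other event types is assumed here to be a re-run of the leader check.

```java
// Sketch of a watcher decision that treats session expiry specially.
// All types here are local stand-ins, not ZooKeeper or Solr classes.
final class ExpiredAwareWatcherSketch {

    /** Subset of names from ZooKeeper's Watcher.Event.EventType. */
    enum EventType { None, NodeDeleted }

    /** Subset of names from ZooKeeper's Watcher.Event.KeeperState. */
    enum KeeperState { SyncConnected, Disconnected, Expired }

    /** Hypothetical outcomes of handling a watch event. */
    enum Action { IGNORE, RECHECK_LEADER, CLEAR_LEADER_STATE }

    /** Models the behaviour described in the issue: every None event is dropped. */
    static Action currentBehaviour(EventType type, KeeperState state) {
        if (type == EventType.None) {
            return Action.IGNORE;
        }
        return Action.RECHECK_LEADER; // assumption: other events re-run the leader check
    }

    /** Proposed variant: a None event carrying Expired clears the stale leader state. */
    static Action proposedBehaviour(EventType type, KeeperState state) {
        if (type == EventType.None) {
            return state == KeeperState.Expired
                    ? Action.CLEAR_LEADER_STATE
                    : Action.IGNORE;
        }
        return Action.RECHECK_LEADER;
    }

    public static void main(String[] args) {
        System.out.println("current:  "
                + currentBehaviour(EventType.None, KeeperState.Expired));
        System.out.println("proposed: "
                + proposedBehaviour(EventType.None, KeeperState.Expired));
    }
}
```

The only difference between the two methods is the Expired branch, which is the case the timeline above appears to hit: both replicas lose their sessions, so nobody observes the leader node being deleted.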



