[jira] [Commented] (SOLR-12011) Consistence problem when in-sync replicas are DOWN

Cao Manh Dat (JIRA) Fri, 02 Mar 2018 01:52:41 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383405#comment-16383405
 ]


Cao Manh Dat commented on SOLR-12011:
-------------------------------------

Thanks a lot for your hard works, [~shalinmangar]

The latest patch for this ticket will commit soon.

> Consistence problem when in-sync replicas are DOWN
> --------------------------------------------------
>
>                 Key: SOLR-12011
>                 URL: https://issues.apache.org/jira/browse/SOLR-12011
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: Cao Manh Dat
>            Assignee: Cao Manh Dat
>            Priority: Major
>         Attachments: SOLR-12011.patch, SOLR-12011.patch, SOLR-12011.patch, 
> SOLR-12011.patch, SOLR-12011.patch
>
>
> Currently, we will meet consistency problem when in-sync replicas are DOWN. 
> For example:
>  1. A collection with 1 shard with 1 leader and 2 replicas
>  2. Nodes contain 2 replicas go down
>  3. The leader receives an update A, success
>  4. The node contains the leader goes down
>  5. 2 replicas come back
>  6. One of them become leader --> But they shouldn't become leader since they 
> missed the update A
> A solution to this issue :
>  * The idea here is using term value of each replica (SOLR-11702) will be 
> enough to tell that a replica received the latest updates or not. Therefore 
> only replicas with the highest term can become the leader.
>  * There are a couple of things need to be done on this issue
>  ** When leader receives the first updates, its term should be changed from 0 
> -> 1, so further replicas added to the same shard won't be able to become 
> leader (their term = 0) until they finish recovery
>  ** For DOWN replicas, the leader should also need to check (in DUP.finish()) 
> that those replicas have term less than leader before return results to users
>  ** Just by looking at term value of replica, it is not enough to tell us 
> that replica is in-sync with leader or not. Because that replica might not 
> finish the recovery process. We need to introduce another flag (stored on 
> shard term node on ZK) to tell us that replica finished recovery or not. It 
> will look like this.
>  *** {"code_node1" : 1, "core_node2" : 0} — (when core_node2 start recovery) 
> --->
>  *** {"core_node1" : 1, "core_node2" : 1, "core_node2_recovering" : 1} — 
> (when core_node2 finish recovery) --->
>  *** {"core_node1" : 1, "core_node2" : 1}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SOLR-12011) Consistence problem when in-sync replicas are DOWN

Reply via email to