[ 
https://issues.apache.org/jira/browse/SOLR-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17897749#comment-17897749
 ] 

Mark Robert Miller commented on SOLR-17557:
-------------------------------------------

There is always a most up to date replica. If you peersync and another replica 
gives you updates, that replica is more up to date than you. Peersyncing 
against all the replicas is really just checking: am I the most up to date, or 
is there a better leader, in which case I should fail this leader attempt so 
they can try? As an optimization we say, okay, we are not the most up to date, 
but we can peersync the updates we are missing from the other replicas. That 
optimization fails if the number of updates we are missing is larger than the 
peersync window; in that case we decline leadership and give the next replica 
a chance to claim it. You could remove sending any peersync updates and there 
would be no change in data loss. 

So I’m suggesting essentially dropping the optimization, with the thought that 
maybe it’s not such an optimization when you don’t have to peersync in the 
first place to see if you are up to date, because the LIR terms immediately 
tell us if we are up to date. Given that peersync means talking to every 
replica and then either transferring data or bailing on the leader attempt 
anyway, just bailing to begin with based on the LIR term removes code and 
complexity and is almost certainly faster. 

If no replica that has the highest term is up, electing a leader currently 
means data loss. But you can just check whether a replica with the highest LIR 
term is in live nodes and, if it’s not, consider the replica with the next 
highest term a proper leader to keep that behavior. 
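
To make the proposed check concrete, here is a rough sketch of the decision 
described above. It is not Solr code: the class name, method name, and the 
plain Map/Set inputs are hypothetical stand-ins for the shard terms and live 
nodes that Solr keeps in ZooKeeper. A candidate claims leadership only if it 
holds the highest term among replicas that are currently live; otherwise it 
bails without peersyncing.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; these names are not part of Solr's API.
final class TermBasedLeaderCheck {

    /**
     * Returns true if the candidate holds the highest term among live replicas.
     *
     * @param candidate    name of the replica attempting to become leader
     * @param terms        replica name -> LIR term (assumed snapshot of shard terms)
     * @param liveReplicas replicas whose nodes are currently live
     */
    static boolean shouldClaimLeadership(
            String candidate, Map<String, Long> terms, Set<String> liveReplicas) {
        // A candidate that is not live or has no recorded term cannot lead.
        if (!liveReplicas.contains(candidate) || !terms.containsKey(candidate)) {
            return false;
        }
        // Find the highest term among live replicas only. If the replica with
        // the overall highest term is down, the next highest live term wins.
        long highestLiveTerm = Long.MIN_VALUE;
        for (Map.Entry<String, Long> entry : terms.entrySet()) {
            if (liveReplicas.contains(entry.getKey())) {
                highestLiveTerm = Math.max(highestLiveTerm, entry.getValue());
            }
        }
        // Bail on the leader attempt unless we are at that term.
        return terms.get(candidate) == highestLiveTerm;
    }
}
```

With terms r1=5, r2=4, r3=4 and only r2 and r3 live, r2 would be allowed to 
claim leadership because no live replica has a higher term. With r1 live as 
well, r2 would decline and leave the attempt to r1.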

> PeerSync should only be called when the ZkShardTerm is not the highest
> ----------------------------------------------------------------------
>
>                 Key: SOLR-17557
>                 URL: https://issues.apache.org/jira/browse/SOLR-17557
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: Houston Putman
>            Priority: Major
>
> Currently when a leader is elected for a shard, PeerSync is called after 
> election to make sure that the new leader is not missing documents that other 
> replicas have.
> With the "new" LeaderInitiatedRecovery (LIR) implementation based on 
> ZkShardTerms, we now have a much better idea as to which replicas have all 
> the documents that the old leader had. So if the newly elected leader has the 
> highest ZkShardTerm (i.e. it was already in sync with the old leader before 
> the leader election), then we shouldn't need to run PeerSync.
> For the break-glass scenario where the newly elected leader does *not* have 
> the highest ZkShardTerm, then we will probably still want to run PeerSync, 
> just to be safe, as there will probably be data loss and we want to minimize 
> how much data that is.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org
