[jira] [Commented] (SOLR-8760) PeerSync replay of ADDs older than ourLowThreshold interacting with DBQs to stall new leadership

Christine Poerschke (JIRA) Tue, 01 Mar 2016 04:38:38 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-8760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173689#comment-15173689
 ]


Christine Poerschke commented on SOLR-8760:
-------------------------------------------

[[email protected]] and [[email protected]] - would you perhaps recall the 
role of the {{completeList}} flag in the 
[PeerSync.java|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/update/PeerSync.java#L387]
 logic?

SOLR-3126 added the flag and 
[3bbd90ebd552740b82697115409de48650bfe8b4|https://github.com/apache/lucene-solr/commit/3bbd90ebd552740b82697115409de48650bfe8b4#diff-d7fada5b4fec0b0efc216a64235043a1]
 and 
[e2ebd116a11bc45f528001cf9157a6e69b9553ef|https://github.com/apache/lucene-solr/commit/e2ebd116a11bc45f528001cf9157a6e69b9553ef#diff-d7fada5b4fec0b0efc216a64235043a1]
 are the relevant commits.

----

Here's something I tried to help understand the {{completeList}} vs. 
{{!completeList}} boolean-ness:
{code}
- boolean completeList = otherVersions.size() < nUpdates;  // do we have their complete list of updates?
+ boolean weWantedMoreThanWeGot = otherVersions.size() < nUpdates;
+ boolean weGotWhatWeNeeded = !weWantedMoreThanWeGot;

- if (!completeList && Math.abs(otherVersion) < ourLowThreshold) break;
+ // stop only if the supplier of other versions was 'sufficiently informed'
+ // i.e. we got all the nUpdates versions that we needed and asked for
+ if (weGotWhatWeNeeded && Math.abs(otherVersion) < ourLowThreshold) break;
{code}
However, it eludes me thus far why the logic would stop/break only if the 
supplier of the other versions was 'sufficiently informed', i.e. would the 
'otherVersion vs. ourLowThreshold' comparison alone not be sufficient?
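
To make the question concrete, here is a minimal standalone sketch of the two break conditions. It is not the actual PeerSync code: the class name, the {{versionsToRequest}} helper and the {{useCompleteListFlag}} switch are hypothetical, the other replica's versions are assumed to arrive newest first, and the real check for versions we already have is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only; not taken from PeerSync.java.
class CompleteListSketch {

    /**
     * Returns the versions a simplified sync loop would ask the other replica for.
     * Assumes otherVersions is ordered newest first (by absolute value).
     */
    static List<Long> versionsToRequest(List<Long> otherVersions, int nUpdates,
                                        long ourLowThreshold, boolean useCompleteListFlag) {
        // Same expression as in the snippet above: true when the other replica
        // returned fewer versions than the nUpdates we asked for.
        boolean completeList = otherVersions.size() < nUpdates;

        List<Long> toRequest = new ArrayList<>();
        for (long otherVersion : otherVersions) {
            boolean olderThanThreshold = Math.abs(otherVersion) < ourLowThreshold;
            if (useCompleteListFlag) {
                // Current behaviour: stop only when the list is NOT complete.
                if (!completeList && olderThanThreshold) break;
            } else {
                // The alternative being asked about: the threshold alone decides.
                if (olderThanThreshold) break;
            }
            toRequest.add(otherVersion);
        }
        return toRequest;
    }

    public static void main(String[] args) {
        List<Long> other = Arrays.asList(105L, 104L, 90L, 80L);
        // Short list (4 < 100), so completeList is true and the flag disables the break.
        System.out.println(versionsToRequest(other, 100, 100L, true));
        // Threshold-only comparison stops at the first version older than 100.
        System.out.println(versionsToRequest(other, 100, 100L, false));
    }
}
```

In this sketch, a short list from the other replica means the threshold check never fires, so versions older than ourLowThreshold are requested as well; with the threshold-only comparison they are not.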

> PeerSync replay of ADDs older than ourLowThreshold interacting with DBQs to 
> stall new leadership
> ------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-8760
>                 URL: https://issues.apache.org/jira/browse/SOLR-8760
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Ramsey Haddad
>            Priority: Minor
>         Attachments: solr-8760-fixA.patch, solr-8760-fixB.patch
>
>
> When we are doing rolling restarts of our Solr servers, we are sometimes 
> hitting painfully long times without a shard leader. What happens is that a 
> new leader is elected, but first needs to fully sync old updates before it 
> assumes the leadership role and accepts new updates. The syncing process is 
> taking unusually long because of an interaction between having one of our 
> hourly garbage collection DBQs in the update logs and the replaying of old 
> ADDs. If there is a single DBQ, and 1000 older ADDs that are getting 
> replayed, then the DBQ is replayed 1000 times, instead of once. This itself 
> may be hard to fix. But, the thing that is easier to fix is that most of the 
> ADDs getting replayed shouldn't need to get replayed in the first place, 
> since they are older than ourLowThreshold.
> The problem can be fixed by eliminating or by modifying the way that the 
> "completeList" term is used to affect the PeerSync lists.
> We propose two alternatives to fix this:
> FixA: Based on my possibly incomplete understanding of PeerSync, the 
> completeList term should be eliminated. If updates older than ourLowThreshold 
> need to be replayed, then aren't all the prerequisites for PeerSync violated 
> and hence we should fall back to SnapPull? (My gut suspects that a later bug 
> fix to PeerSync fixed whatever issue completeList was trying to deal with.)
> FixB: The patch that added the completeList term mentions that it is needed 
> for the replay of some DELETEs. Well, if that is true and we do need to 
> replay some DELETEs older than ourLowThreshold, then there is still no need 
> to replay any ADDs older than ourLowThreshold, right??
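
The FixB idea in the quoted description could be sketched roughly as below. This is an illustration only and not the attached solr-8760-fixB.patch: the class and method names are hypothetical, and it assumes that deletes are recorded with negative version numbers while ADDs carry positive ones.

```java
// Hypothetical sketch of the FixB idea; not the attached patch.
class FixBSketch {

    /**
     * Decides whether a version reported by the other replica should be requested.
     * Assumption: a negative version denotes a delete, a positive one an ADD.
     */
    static boolean shouldRequest(long otherVersion, long ourLowThreshold) {
        boolean isDelete = otherVersion < 0;
        boolean olderThanThreshold = Math.abs(otherVersion) < ourLowThreshold;
        // Deletes are always kept; ADDs are kept only if not older than the threshold.
        return isDelete || !olderThanThreshold;
    }

    public static void main(String[] args) {
        System.out.println(shouldRequest(-90L, 100L)); // old delete
        System.out.println(shouldRequest(90L, 100L));  // old ADD
        System.out.println(shouldRequest(105L, 100L)); // recent ADD
    }
}
```

Under these assumptions an old delete would still be replayed while an old ADD would be skipped, which is the distinction FixB draws.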



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

