[
https://issues.apache.org/jira/browse/SOLR-8760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173689#comment-15173689
]
Christine Poerschke commented on SOLR-8760:
-------------------------------------------
[[email protected]] and [[email protected]] - would you perhaps recall the
role of the {{completeList}} flag in the
[PeerSync.java|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/update/PeerSync.java#L387]
logic?
SOLR-3126 added the flag and
[3bbd90ebd552740b82697115409de48650bfe8b4|https://github.com/apache/lucene-solr/commit/3bbd90ebd552740b82697115409de48650bfe8b4#diff-d7fada5b4fec0b0efc216a64235043a1]
and
[e2ebd116a11bc45f528001cf9157a6e69b9553ef|https://github.com/apache/lucene-solr/commit/e2ebd116a11bc45f528001cf9157a6e69b9553ef#diff-d7fada5b4fec0b0efc216a64235043a1]
are the relevant commits.
----
Here's something I tried to help understand the {{completeList}} vs.
{{!completeList}} boolean-ness:
{code}
- boolean completeList = otherVersions.size() < nUpdates; // do we have their complete list of updates?
+ boolean weWantedMoreThanWeGot = otherVersions.size() < nUpdates;
+ boolean weGotWhatWeNeeded = !weWantedMoreThanWeGot;
- if (!completeList && Math.abs(otherVersion) < ourLowThreshold) break;
+ // stop only if the supplier of other versions was 'sufficiently informed'
+ // i.e. we got all the nUpdates versions that we needed and asked for
+ if (weGotWhatWeNeeded && Math.abs(otherVersion) < ourLowThreshold) break;
{code}
However, it eludes me so far why the logic should stop/break only if the
supplier of the other versions was 'sufficiently informed', i.e. would the
'otherVersion vs. ourLowThreshold' comparison alone not be sufficient?
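To make the question concrete, here is a minimal standalone sketch of the loop as I read it. It is not the actual PeerSync code: the class {{PeerSyncSketch}}, the method {{versionsToRequest}} and the {{honourCompleteList}} switch are hypothetical names made up for illustration, and the version numbers are invented. It assumes the other replica's versions arrive newest first and it ignores the bookkeeping for already-requested updates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of the PeerSync version-selection loop.
class PeerSyncSketch {

  // Returns the versions we would request from the other replica.
  // honourCompleteList = true  mimics the current behaviour,
  // honourCompleteList = false mimics dropping the flag (the FixA idea).
  static List<Long> versionsToRequest(List<Long> otherVersions, Set<Long> ourUpdates,
                                      long ourLowThreshold, int nUpdates,
                                      boolean honourCompleteList) {
    // Fewer versions than we asked for is taken to mean "this is their complete list".
    boolean completeList = otherVersions.size() < nUpdates;
    List<Long> toRequest = new ArrayList<>();
    for (long otherVersion : otherVersions) {
      boolean olderThanThreshold = Math.abs(otherVersion) < ourLowThreshold;
      // Current logic: stop at old versions only when the list is NOT complete.
      if (olderThanThreshold && !(honourCompleteList && completeList)) break;
      if (ourUpdates.contains(otherVersion)) continue; // we already have this update
      toRequest.add(otherVersion);
    }
    return toRequest;
  }

  public static void main(String[] args) {
    List<Long> other = Arrays.asList(105L, 104L, 103L, 3L, 2L, 1L); // newest first
    Set<Long> ours = new HashSet<>(Arrays.asList(105L, 104L));
    long ourLowThreshold = 100L;

    // Other replica returned fewer than nUpdates versions: old versions get requested too.
    System.out.println(versionsToRequest(other, ours, ourLowThreshold, 100, true));
    // Other replica returned a full nUpdates versions: we stop at the threshold.
    System.out.println(versionsToRequest(other, ours, ourLowThreshold, 6, true));
    // Threshold comparison only, flag ignored.
    System.out.println(versionsToRequest(other, ours, ourLowThreshold, 100, false));
  }
}
```

With these made-up numbers the first call returns [103, 3, 2, 1] while the other two return [103], i.e. the flag only changes the outcome when the other replica's list is shorter than nUpdates, which is exactly the case where versions older than ourLowThreshold end up being requested.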
> PeerSync replay of ADDs older than ourLowThreshold interacting with DBQs to
> stall new leadership
> ------------------------------------------------------------------------------------------------
>
> Key: SOLR-8760
> URL: https://issues.apache.org/jira/browse/SOLR-8760
> Project: Solr
> Issue Type: Bug
> Reporter: Ramsey Haddad
> Priority: Minor
> Attachments: solr-8760-fixA.patch, solr-8760-fixB.patch
>
>
> When we are doing rolling restarts of our Solr servers, we are sometimes
> hitting painfully long times without a shard leader. What happens is that a
> new leader is elected, but first needs to fully sync old updates before it
> assumes the leadership role and accepts new updates. The syncing process is
> taking unusually long because of an interaction between having one of our
> hourly garbage collection DBQs in the update logs and the replaying of old
> ADDs. If there is a single DBQ, and 1000 older ADDs that are getting
> replayed, then the DBQ is replayed 1000 times, instead of once. This itself
> may be hard to fix. But, the thing that is easier to fix is that most of the
> ADDs getting replayed shouldn't need to get replayed in the first place,
> since they are older than ourLowThreshold.
> The problem can be fixed by eliminating or by modifying the way that the
> "completeList" term is used to affect the PeerSync lists.
> We propose two alternatives to fix this:
> FixA: Based on my possibly incomplete understanding of PeerSync, the
> completeList term should be eliminated. If updates older than ourLowThreshold
> need to be replayed, then aren't all the prerequisites for PeerSync violated
> and hence we should fall back to SnapPull? (My gut suspects that a later bug
> fix to PeerSync fixed whatever issue completeList was trying to deal with.)
> FixB: The patch that added the completeList term mentions that it is needed
> for the replay of some DELETEs. Well, if that is true and we do need to
> replay some DELETEs older than ourLowThreshold, then there is still no need
> to replay any ADDs older than ourLowThreshold, right??