[ https://issues.apache.org/jira/browse/SOLR-8760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171913#comment-15171913 ]
Ramsey Haddad commented on SOLR-8760:
-------------------------------------
More details about the conditions leading up to this problem are in:
http://mail-archives.apache.org/mod_mbox/lucene-dev/201602.mbox/%3ccac2x+z3at7ileypotx3xzrp5qysklaatgm-xtjn1a8zpxus...@mail.gmail.com%3E
> PeerSync replay of ADDs older than ourLowThreshold interacting with DBQs to
> stall new leadership
> ------------------------------------------------------------------------------------------------
>
> Key: SOLR-8760
> URL: https://issues.apache.org/jira/browse/SOLR-8760
> Project: Solr
> Issue Type: Bug
> Reporter: Ramsey Haddad
> Priority: Minor
> Attachments: solr-8760-fixA.patch, solr-8760-fixB.patch
>
>
> When we are doing rolling restarts of our Solr servers, we are sometimes
> hitting painfully long times without a shard leader. What happens is that a
> new leader is elected, but first needs to fully sync old updates before it
> assumes the leadership role and accepts new updates. The syncing process is
> taking unusually long because of an interaction between having one of our
> hourly garbage collection DBQs in the update logs and the replaying of old
> ADDs. If there is a single DBQ, and 1000 older ADDs that are getting
> replayed, then the DBQ is replayed 1000 times, instead of once. This itself
> may be hard to fix. But, the thing that is easier to fix is that most of the
> ADDs getting replayed shouldn't need to get replayed in the first place,
> since they are older than ourLowThreshold.
> The problem can be fixed by eliminating or by modifying the way that the
> "completeList" term is used to affect the PeerSync lists.
> We propose two alternatives to fix this:
> FixA: Based on my possibly incomplete understanding of PeerSync, the
> completeList term should be eliminated. If updates older than ourLowThreshold
> need to be replayed, then aren't all the prerequisites for PeerSync violated
> and hence we should fall back to SnapPull? (My gut suspects that a later bug
> fix to PeerSync fixed whatever issue completeList was trying to deal with.)
> FixB: The patch that added the ourLowThreshold term mentions that it is
> needed for the replay of some DELETEs. Well, if that is true and we do need
> to replay some DELETEs older than ourLowThreshold, then there is still no
> need to replay any ADDs older than ourLowThreshold, right??