[ https://issues.apache.org/jira/browse/SOLR-17656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris M. Hostetter updated SOLR-17656:
--------------------------------------
    Attachment: SOLR-17656.patch
        Status: Open  (was: Open)

The attached patch implements this idea via a new {{skipLeaderRecovery}} replica property.

This patch leverages the existing {{skipRecovery}} logic in {{ZkController}} – all it does is add a new condition under which {{skipRecovery}} may be set to {{true}}. The new logic includes sanity checks to ensure that even if {{skipLeaderRecovery==true}}, the property will be ignored (and an error logged) if either:
* the replica type requires a transaction log ({{requireTransactionLog}})
* the replica does not have *ANY* local index commit (i.e., it was restarted before it ever did a single fetch from the leader)

{{TestPullReplica}} has been updated to confirm that this new property allows a PULL replica on a restarted Solr node to become ACTIVE even if the leader is DOWN.

Feedback welcome. I feel like it would be pretty useful, and that the patch is basically good to go from a code standpoint – but it obviously needs some ref-guide updates. (I was holding off pending any objections to the name and/or the restrictions on when it's respected. I'm assuming it would make sense to document this [HERE|https://solr.apache.org/guide/solr/latest/deployment-guide/solrcloud-shards-indexing.html#recovery-with-pull-replicas])

> Add expert level option to allow PULL replicas to go ACTIVE w/o RECOVERING
> --------------------------------------------------------------------------
>
>                 Key: SOLR-17656
>                 URL: https://issues.apache.org/jira/browse/SOLR-17656
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Chris M. Hostetter
>            Assignee: Chris M. Hostetter
>            Priority: Major
>         Attachments: SOLR-17656.patch
>
>
> In situations where a Solr cluster undergoes a rolling restart (or some other
> "catastrophic" failure situation requiring/causing Solr node restarts) there
> can be a snowball effect of poor performance (or even Solr nodes crashing) due
> to fewer than normal replicas serving query requests while replicas on
> restarting nodes are DOWN or RECOVERING – especially if shard leaders are
> also affected, and (restarting) replicas must first wait for a leader
> election before they can recover (or wait to finish recovery from an
> over-worked leader).
> For NRT type usecases, RECOVERING is really a necessary evil to ensure every
> replica is up to date before handling NRT requests – but in the case of PULL
> replicas, which are expected to routinely "lag" behind their leader, I've
> talked to a lot of Solr users w/usecases where they would be happy to have
> PULL replicas back online serving "stale" data ASAP, and let normal
> IndexFetching "catch up" with the leader later.
> I propose we support a new "advanced" replica property that can be set on
> PULL replicas by expert level users, to indicate: on (re)init, these replicas
> may skip RECOVERING and go directly to ACTIVE.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
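
For readers who have not opened the patch, the two sanity checks described in the comment can be sketched roughly as follows. This is only an illustration of the decision logic, not the code in SOLR-17656.patch: the class name, the {{maySkipRecovery}} method, the {{hasLocalIndexCommit}} parameter, and the standalone {{ReplicaType}} enum are all hypothetical stand-ins, and the real change lives inside {{ZkController}}.

```java
import java.util.logging.Logger;

/** Illustrative sketch only; all names here are hypothetical, not taken from the patch. */
final class SkipLeaderRecoverySketch {

    private static final Logger LOG =
        Logger.getLogger(SkipLeaderRecoverySketch.class.getName());

    /** Hypothetical stand-in for Solr's replica types and their transaction-log requirement. */
    enum ReplicaType {
        NRT(true), TLOG(true), PULL(false);

        final boolean requireTransactionLog;

        ReplicaType(boolean requireTransactionLog) {
            this.requireTransactionLog = requireTransactionLog;
        }
    }

    /**
     * Returns true only when the property is set AND both sanity checks pass:
     * the replica type needs no transaction log, and a local index commit exists.
     */
    static boolean maySkipRecovery(boolean skipLeaderRecovery,
                                   ReplicaType type,
                                   boolean hasLocalIndexCommit) {
        if (!skipLeaderRecovery) {
            return false; // property not set: normal recovery applies
        }
        if (type.requireTransactionLog) {
            LOG.severe("Ignoring skipLeaderRecovery: replica type " + type
                + " requires a transaction log");
            return false;
        }
        if (!hasLocalIndexCommit) {
            LOG.severe("Ignoring skipLeaderRecovery: replica has no local index commit");
            return false;
        }
        return true;
    }
}
```

Under these assumptions, only a PULL replica that has already fetched at least one commit from its leader would go straight to ACTIVE; every other combination falls back to the normal recovery path.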