[ 
https://issues.apache.org/jira/browse/SOLR-17656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris M. Hostetter updated SOLR-17656:
--------------------------------------
    Attachment: SOLR-17656.patch
        Status: Open  (was: Open)

The attached patch implements this idea via a new {{skipLeaderRecovery}} 
replica property.

This patch leverages existing {{skipRecovery}} logic in {{ZkController}} – all 
it does is add a new condition in which {{skipRecovery}} may be set to 
{{true}}.

The new logic includes sanity checks to ensure that even if 
{{skipLeaderRecovery==true}}, the property will be ignored (and an error 
logged) if either:
 * the replica type has {{requireTransactionLog}} set
 * the replica does not have *ANY* local index commit (i.e., it was restarted 
before it ever did a single fetch from the leader)
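
To make the checks above concrete, here is a rough standalone sketch of the decision. It is not code from the patch: the class, enum, method, and parameter names are all hypothetical stand-ins, and the real logic lives in {{ZkController}} and may be structured differently.

```java
// Illustrative sketch only: every name here is hypothetical, not taken from the patch.
class SkipLeaderRecoverySketch {

    /** Stand-in for a replica type; models only whether a transaction log is required. */
    enum ReplicaType {
        NRT(true),
        TLOG(true),
        PULL(false);

        final boolean requireTransactionLog;

        ReplicaType(boolean requireTransactionLog) {
            this.requireTransactionLog = requireTransactionLog;
        }
    }

    /**
     * Returns true only when the property is set AND both sanity checks pass.
     * When the property is set but a check fails, it is ignored and an error is logged.
     */
    static boolean shouldSkipRecovery(
            boolean skipLeaderRecovery, ReplicaType type, boolean hasLocalIndexCommit) {
        if (!skipLeaderRecovery) {
            return false; // property not set: normal recovery applies
        }
        if (type.requireTransactionLog) {
            System.err.println(
                "Ignoring skipLeaderRecovery: replica type " + type + " requires a transaction log");
            return false;
        }
        if (!hasLocalIndexCommit) {
            System.err.println(
                "Ignoring skipLeaderRecovery: replica has no local index commit to serve");
            return false;
        }
        return true;
    }
}
```

Under these assumptions, {{shouldSkipRecovery(true, ReplicaType.PULL, true)}} is the only combination that skips recovery; a TLOG or NRT replica, or a PULL replica that never completed a fetch, still recovers normally.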

{{TestPullReplica}} has been updated to confirm that this new property can 
allow a PULL replica on a restarted Solr node to become ACTIVE even if the 
leader is DOWN.

Feedback welcome. I feel like it would be pretty useful, and that the patch is 
basically good to go from a code standpoint – but it obviously needs some 
ref-guide updates.

(I was holding off pending any objections to the name and/or restrictions on 
when it's respected. I'm assuming it would make sense to document this 
[HERE|https://solr.apache.org/guide/solr/latest/deployment-guide/solrcloud-shards-indexing.html#recovery-with-pull-replicas])

> Add expert level option to allow PULL replicas to go ACTIVE w/o RECOVERING
> --------------------------------------------------------------------------
>
>                 Key: SOLR-17656
>                 URL: https://issues.apache.org/jira/browse/SOLR-17656
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Chris M. Hostetter
>            Assignee: Chris M. Hostetter
>            Priority: Major
>         Attachments: SOLR-17656.patch
>
>
> In situations where a Solr cluster undergoes a rolling restart (or some other 
> "catastrophic" failure situation requiring/causing Solr node restarts) there 
> can be a snowball effect of poor performance (or even Solr nodes crashing) due 
> to fewer than normal replicas serving query requests while replicas on 
> restarting nodes are DOWN or RECOVERING – especially if shard leaders are 
> also affected, and (restarting) replicas first must wait for a leader 
> election before they can recover (or wait to finish recovery from an 
> over-worked leader).
> For NRT type use cases, RECOVERING is really a necessary evil to ensure every 
> replica is up to date before handling NRT requests – but in the case of PULL 
> replicas, which are expected to routinely "lag" behind their leader, I've 
> talked to a lot of Solr users w/use cases where they would be happy to have 
> PULL replicas back online serving "stale" data ASAP, and let normal 
> IndexFetching "catch up" with the leader later.
> I propose we support a new "advanced" replica property that can be set on 
> PULL replicas by expert level users, to indicate: on (re)init, these replicas 
> may skip RECOVERING and go directly to ACTIVE.



