[ https://issues.apache.org/jira/browse/SOLR-17656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17935318#comment-17935318 ]
Chris M. Hostetter commented on SOLR-17656:
-------------------------------------------

{quote}
So that made me eventually think that it was a client problem and not actually a problem with the node. ... I added the state wait at the beginning because I wasn't entirely sure this was the issue at the time, but now I know that part is useless, it can be removed.
{quote}

Ha ha ... ok ... my confusion was thinking the "state wait" was the *_important_* part of your fix, and that the client changes were just you doing a little refactoring/cleanup of duplicated code in the same commit.

Now that I understand that your changes to which {{SolrClient}} gets used in the test are what *REALLY* made the failures stop, things are (a little) less confusing ... maybe the underlying issue relates to connection caching in the low level {{org.apache.http.client.HttpClient}}, and the jetty restarts breaking those connections in a way that the client doesn't notice until it tries to send a request? (but I don't really understand the retry weirdness you mentioned)

> Add expert level option to allow PULL replicas to go ACTIVE w/o RECOVERING
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-17656
>                 URL: https://issues.apache.org/jira/browse/SOLR-17656
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Chris M. Hostetter
>            Assignee: Chris M. Hostetter
>            Priority: Major
>             Fix For: main (10.0), 9.9
>
>         Attachments: SOLR-17656-1.patch, SOLR-17656.patch
>
>
> In situations where a Solr cluster undergoes a rolling restart (or some other "catastrophic" failure situation requiring/causing solr node restarts) there can be a snowball effect of poor performance (or even solr nodes crashing) due to fewer than normal replicas serving query requests while replicas on restarting nodes are DOWN or RECOVERING – especially if shard leaders are also affected, and (restarting) replicas first must wait for a leader election before they can recover (or wait to finish recovery from an over-worked leader).
> For NRT type use cases, RECOVERING is really a necessary evil to ensure every replica is up to date before handling NRT requests – but in the case of PULL replicas, which are expected to routinely "lag" behind their leader, I've talked to a lot of Solr users with use cases where they would be happy to have PULL replicas back online serving "stale" data ASAP, and let normal IndexFetching "catch up" with the leader later.
> I propose we support a new "advanced" replica property that can be set on PULL replicas by expert level users, to indicate: on (re)init, these replicas may skip RECOVERING and go directly to ACTIVE.
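As background on the mechanism the proposal builds on: Solr's Collections API already manages per-replica properties via the {{ADDREPLICAPROP}} action. A minimal, stdlib-only Java sketch of constructing such a request follows; the property name {{skipRecovery}} is purely hypothetical for illustration (SOLR-17656 does not settle on a name here), and whether the feature is ultimately exposed through {{ADDREPLICAPROP}} at all is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class ReplicaPropExample {

    // Builds an ADDREPLICAPROP Collections API URL.  NOTE: production code
    // should URL-encode each value; these sample values need no encoding.
    static String buildAddReplicaPropUrl(String baseUrl, String collection,
                                         String shard, String replica,
                                         String property, String value) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("action", "ADDREPLICAPROP");
        params.put("collection", collection);
        params.put("shard", shard);
        params.put("replica", replica);
        params.put("property", property);
        params.put("property.value", value);
        return baseUrl + "/admin/collections?" + params.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        // "skipRecovery" is a hypothetical property name, not the final one.
        System.out.println(buildAddReplicaPropUrl(
                "http://localhost:8983/solr", "techproducts",
                "shard1", "core_node2", "skipRecovery", "true"));
    }
}
```

In practice such a property would be set by an operator on each PULL replica they are willing to serve "stale" data from, then honored by the replica on (re)init.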