[ https://issues.apache.org/jira/browse/SOLR-17656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17933573#comment-17933573 ]
Chris M. Hostetter commented on SOLR-17656:
-------------------------------------------

[~houston] - I appreciate your help w/ the test – but after reading your change a couple of times now, comparing the old version against the new version, and re-running w/ your exact seed (and confirming that {{testSkipLeaderRecoveryProperty}} ran before {{testRealTimeGet}}), I still can't for the life of me understand when/why that failure would have happened (in a way that your commit would now reliably prevent it from happening).

If the issue is that one of the jetties never restarted (or restarted too slowly), then...
* I'd expect {{testSkipLeaderRecoveryProperty}} to have failed as well? ...and if it didn't, we should also harden its assertions
** but even then, I'd expect {{testRealTimeGet}} to time out or fail when creating its collection -- not when adding a doc to that collection
* But also: {{testSkipLeaderRecoveryProperty}} isn't the first test in this class to stop/restart jetty instances -- so if that's the problem, it seems like it could happen in other test methods as well, and the better fix would be to put your "wait for live nodes" logic in a before/after method

...can you help walk me through what exactly the problem was and how your patch fixed it?

> Add expert level option to allow PULL replicas to go ACTIVE w/o RECOVERING
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-17656
>                 URL: https://issues.apache.org/jira/browse/SOLR-17656
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Chris M. Hostetter
>            Assignee: Chris M. Hostetter
>            Priority: Major
>             Fix For: main (10.0), 9.9
>
>         Attachments: SOLR-17656-1.patch, SOLR-17656.patch
>
>
> In situations where a Solr cluster undergoes a rolling restart (or some other "catastrophic" failure situation requiring/causing Solr node restarts), there can be a snowball effect of poor performance (or even Solr nodes crashing) due to fewer than normal replicas serving query requests while replicas on restarting nodes are DOWN or RECOVERING – especially if shard leaders are also affected, and (restarting) replicas must first wait for a leader election before they can recover (or wait to finish recovery from an over-worked leader).
> For NRT type usecases, RECOVERING is really a necessary evil to ensure every replica is up to date before handling NRT requests – but in the case of PULL replicas, which are expected to routinely "lag" behind their leader, I've talked to a lot of Solr users w/ usecases where they would be happy to have PULL replicas back online serving "stale" data ASAP, and let normal IndexFetching "catch up" with the leader later.
> I propose we support a new "advanced" replica property that can be set on PULL replicas by expert-level users, to indicate: on (re)init, these replicas may skip RECOVERING and go directly to ACTIVE.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
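On the "wait for live nodes in a before/after method" idea above: a minimal, self-contained sketch of the kind of poll-with-timeout helper such a setup method could be built around. This is purely illustrative — the class and method names here are hypothetical, not Solr's actual test APIs (a real Solr test would more likely lean on existing cluster-wait utilities, e.g. {{MiniSolrCloudCluster.waitForAllNodes}}):

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public class WaitForLiveNodes {

    // Generic "poll until the condition is true or the timeout expires" helper,
    // similar in spirit to the wait-for-live-nodes logic a @Before method would run.
    static boolean waitFor(BooleanSupplier condition, long timeoutMs, long pollMs)
            throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (System.nanoTime() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(pollMs);
        }
        // One final check at the deadline, so a slow poll interval can't miss success.
        return condition.getAsBoolean();
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a node that only becomes "live" ~200ms after the test starts.
        long start = System.currentTimeMillis();
        boolean live = waitFor(() -> System.currentTimeMillis() - start > 200, 5000, 50);
        System.out.println(live ? "all nodes live" : "timed out");
    }
}
```

The point of putting this in a shared before/after hook rather than inside one test method is that every test in the class then starts from a known-good cluster state, regardless of which earlier test stopped or restarted jetties.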