[ https://issues.apache.org/jira/browse/SOLR-17656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17935274#comment-17935274 ]
Houston Putman commented on SOLR-17656:
---------------------------------------

[~hossman] I could never reproduce it locally either, but the smoking gun for me was that the node was available (the logs said it was fine, and the collection creation command worked without issue), yet the actual client request got the IOException. That eventually made me think it was a client problem and not actually a problem with the node.

If you look at the logs of the failing tests, they include many of these:
{quote}
o.a.h.i.e.RetryExec I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {s}->https://127.0.0.1:38235: The target server failed to respond
o.a.h.i.e.RetryExec Retrying request to {s}->https://127.0.0.1:38235
{quote}
The weird thing here is that bad requests are only supposed to be retried when the request is non-admin, yet all of the requests being retried appear to be Collections API calls. The weirder thing is that this same retry logic is not used when the same exception happens during the "/get" call in {{testRealTimeGet()}}, so it fails in the way described above. These NoHttpResponseExceptions happen all over the test, probably because of the number of restarts being done, but since they get retried almost every time it's not a big deal.

I don't know why the "/get" request is not being retried. What I do know is that creating a new client is a safe way to get the test to pass, and ultimately the client is not what is being tested here. I added the state wait at the beginning because I wasn't entirely sure this was the issue at the time; now that I know that part is useless, it can be removed.

> Add expert level option to allow PULL replicas to go ACTIVE w/o RECOVERING
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-17656
>                 URL: https://issues.apache.org/jira/browse/SOLR-17656
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Chris M. Hostetter
>            Assignee: Chris M. Hostetter
>            Priority: Major
>             Fix For: main (10.0), 9.9
>
>         Attachments: SOLR-17656-1.patch, SOLR-17656.patch
>
>
> In situations where a Solr cluster undergoes a rolling restart (or some other "catastrophic" failure situation requiring/causing Solr node restarts) there can be a snowball effect of poor performance (or even Solr nodes crashing) due to fewer than normal replicas serving query requests while replicas on restarting nodes are DOWN or RECOVERING – especially if shard leaders are also affected, and (restarting) replicas first must wait for a leader election before they can recover (or wait to finish recovery from an over-worked leader).
> For NRT type usecases, RECOVERING is really a necessary evil to ensure every replica is up to date before handling NRT requests – but in the case of PULL replicas, which are expected to routinely "lag" behind their leader, I've talked to a lot of Solr users with usecases where they would be happy to have PULL replicas back online serving "stale" data ASAP, and let normal IndexFetching "catch up" with the leader later.
> I propose we support a new "advanced" replica property that can be set on PULL replicas by expert level users, to indicate: on (re)init, these replicas may skip RECOVERING and go directly to ACTIVE.
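For context on the retry behavior discussed in the comment above: RetryExec delegates the retry decision to an HttpRequestRetryHandler. The sketch below is only a minimal illustration, assuming Apache HttpClient 4.x, and is not Solr's actual handler or client wiring; it shows how such a handler can resend a request after a NoHttpResponseException (the server dropped the connection before responding) while declining to retry other I/O failures for requests that carry a body.
{code:java}
import java.io.IOException;

import org.apache.http.HttpEntityEnclosingRequest;
import org.apache.http.HttpRequest;
import org.apache.http.NoHttpResponseException;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.protocol.HttpContext;

public class RetryHandlerSketch {

  // Illustrative only; the class and method names here are hypothetical.
  public static CloseableHttpClient buildClient() {
    HttpRequestRetryHandler retryHandler =
        (IOException exception, int executionCount, HttpContext context) -> {
          if (executionCount > 3) {
            return false; // give up after a few attempts
          }
          if (exception instanceof NoHttpResponseException) {
            // The server closed the connection without responding, so the
            // request was most likely never processed; resending is treated
            // as safe here.
            return true;
          }
          // For other I/O failures, only retry requests without a body
          // (a rough stand-in for an "idempotent / non-admin" check).
          HttpRequest request = HttpClientContext.adapt(context).getRequest();
          return !(request instanceof HttpEntityEnclosingRequest);
        };

    return HttpClients.custom()
        .setRetryHandler(retryHandler)
        .build();
  }
}
{code}
Creating a fresh client before the "/get" call, as described above, sidesteps the question entirely, presumably because the new connection pool has no stale keep-alive connections to provoke a NoHttpResponseException in the first place.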