[ 
https://issues.apache.org/jira/browse/SOLR-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690673#comment-17690673
 ] 

Ishan Chattopadhyaya commented on SOLR-6405:
--------------------------------------------

Through my testing with solr-bench, I've seen many cases (say 1 in 25-30) where 
nodes come up and recovery succeeds for a few replicas but never completes for 
the rest (so the restarted node stays with some replicas in DOWN state). I 
tracked these cases down to Solr not reconnecting to ZooKeeper after a session 
loss.

I should add that this is repeatable for me, but reproducing it takes several 
hours of running (or even days). The situation was so annoying while developing 
the test suite (the wait for all replicas to come up would hang indefinitely) 
that I bailed out of those runs with a timeout, failed the test, and moved on. 
It is definitely on my radar to revisit and fix. FYI [~noblepaul].
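
For illustration, the timeout bail-out described above could look roughly like 
the sketch below. This is not the actual solr-bench code: the class name 
BoundedWait, the method waitUntil, and the stand-in condition are all 
hypothetical, and a real check would query the cluster state for replicas that 
are not yet active.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical helper for illustration; not part of Solr or solr-bench. */
final class BoundedWait {
    private BoundedWait() {}

    /**
     * Polls the condition until it holds or the timeout elapses.
     * Returns true if the condition held in time, false on timeout or interrupt.
     */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false; // give up instead of waiting forever
            }
            try {
                // Never sleep past the deadline.
                long remainingMillis = Math.max(1, remainingNanos / 1_000_000);
                Thread.sleep(Math.min(pollInterval.toMillis(), remainingMillis));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        // Stand-in for "all replicas are up"; here it simply becomes true after ~200 ms.
        BooleanSupplier allReplicasUp =
                () -> System.nanoTime() - start > Duration.ofMillis(200).toNanos();

        boolean recovered = waitUntil(allReplicasUp, Duration.ofSeconds(5), Duration.ofMillis(50));
        if (!recovered) {
            // Fail the test rather than hang.
            throw new IllegalStateException("replicas did not recover before the timeout");
        }
        System.out.println("all replicas recovered");
    }
}
```

A wait like this only keeps the test suite from hanging; it does nothing about 
the underlying failure to reconnect to ZooKeeper.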

> ZooKeeper calls can easily not be retried enough on ConnectionLoss.
> -------------------------------------------------------------------
>
>                 Key: SOLR-6405
>                 URL: https://issues.apache.org/jira/browse/SOLR-6405
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Mark Miller
>            Assignee: Mark Miller
>            Priority: Critical
>             Fix For: 4.10, 6.0
>
>         Attachments: SOLR-6405.patch
>
>
> The current design requires that we are sure we retry on connection loss 
> until session expiration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
