[ 
https://issues.apache.org/jira/browse/SOLR-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15519503#comment-15519503
 ] 

Alan Woodward commented on SOLR-9555:
-------------------------------------

The race looks something like this:
* A node goes down, and then restarts
* The leader tries to send a document to the starting node, and gets a 503 'not 
ready yet'
* The node publishes its state as RECOVERING
* The leader's LIR thread publishes the recovering node's state as DOWN
* The node sends a PREPRECOVERY request to the leader
* The leader waits for the node's state to be RECOVERING, but as it's just been 
set as DOWN by the LIR thread, everything hangs
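
To make the ordering concrete, here is a minimal single-threaded replay of the interleaving above. It is only a sketch: `LirRaceSketch`, `publish` and `leaderSeesRecovering` are hypothetical names, the static field stands in for the replica state held in ZooKeeper, and the initial ACTIVE state is an assumption. None of this is Solr's actual code.

```java
import java.util.ArrayList;
import java.util.List;

public class LirRaceSketch {
    enum State { ACTIVE, RECOVERING, DOWN }

    // Hypothetical stand-in for the replica state stored in ZooKeeper.
    static State replicaState = State.ACTIVE;
    static final List<String> log = new ArrayList<>();

    // Unconditional write: the last writer wins, whatever the current state is.
    static void publish(String who, State newState) {
        replicaState = newState;
        log.add(who + " -> " + newState);
    }

    // The condition the leader waits on while handling PREPRECOVERY.
    static boolean leaderSeesRecovering() {
        return replicaState == State.RECOVERING;
    }

    // Replays the problematic ordering and reports whether the leader's
    // wait condition holds afterwards.
    static boolean replayRace() {
        replicaState = State.ACTIVE;
        log.clear();
        publish("restarting node", State.RECOVERING);
        publish("leader LIR thread", State.DOWN); // overwrites RECOVERING
        return leaderSeesRecovering();
    }

    public static void main(String[] args) {
        boolean satisfied = replayRace();
        System.out.println(log);
        System.out.println("leader sees RECOVERING: " + satisfied);
    }
}
```

Because the LIR write lands after the node's own RECOVERING write, the condition the leader waits on is false once both writes are done, and nothing in this ordering sets it back to RECOVERING.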

I *think* the fix is for the LIR thread to only set the node's state as DOWN if 
its current state is ACTIVE, using ZK versions to check that the leader's view 
of that state is up-to-date, but I'd like to get comments from people who know 
this bit of the code better - [~shalinmangar] [~thelabdude] does this look 
right to you?
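
A rough sketch of that idea, using an in-memory versioned store rather than a real ZooKeeper client. `ConditionalDownSketch`, `StateStore`, `setIfVersion` and `markDownIfStillActive` are all hypothetical names invented for illustration. The version check is meant to mimic the way a ZooKeeper write that supplies an expected version is rejected when the znode has changed since it was read.

```java
import java.util.concurrent.atomic.AtomicReference;

public class ConditionalDownSketch {
    enum State { ACTIVE, RECOVERING, DOWN }

    // A state plus the version it was written at.
    static final class Versioned {
        final State state;
        final int version;
        Versioned(State state, int version) {
            this.state = state;
            this.version = version;
        }
    }

    // Hypothetical in-memory stand-in for the replica's state znode.
    static final class StateStore {
        private final AtomicReference<Versioned> ref =
            new AtomicReference<>(new Versioned(State.ACTIVE, 0));

        Versioned read() {
            return ref.get();
        }

        // Unconditional publish, as the node does for its own state.
        void publish(State newState) {
            ref.updateAndGet(v -> new Versioned(newState, v.version + 1));
        }

        // Conditional write: succeeds only if nobody has written since
        // expectedVersion was read.
        boolean setIfVersion(State newState, int expectedVersion) {
            Versioned current = ref.get();
            return current.version == expectedVersion
                && ref.compareAndSet(current, new Versioned(newState, current.version + 1));
        }
    }

    // LIR-side guard: only mark the replica DOWN if the leader last saw it
    // ACTIVE and that view is still current.
    static boolean markDownIfStillActive(StateStore store, Versioned seenByLeader) {
        if (seenByLeader.state != State.ACTIVE) {
            return false;
        }
        return store.setIfVersion(State.DOWN, seenByLeader.version);
    }

    public static void main(String[] args) {
        StateStore store = new StateStore();
        Versioned seen = store.read();        // leader reads ACTIVE
        store.publish(State.RECOVERING);      // node publishes RECOVERING
        boolean wrote = markDownIfStillActive(store, seen);
        System.out.println("LIR wrote DOWN: " + wrote
            + ", state is " + store.read().state);
    }
}
```

With this guard the stale LIR write is dropped, so the node's RECOVERING state survives and the leader's wait can complete. Whether skipping the DOWN write is safe in every other LIR scenario is exactly the open question above.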

> Recovery can hang if a node is put into LIR as it is starting up
> ----------------------------------------------------------------
>
>                 Key: SOLR-9555
>                 URL: https://issues.apache.org/jira/browse/SOLR-9555
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Alan Woodward
>
> See 
> https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/17888/consoleFull 
> for an example



