[jira] [Commented] (SOLR-7021) Leader will not publish core as active without recovering first, but never recovers

Andrey Prokopenko (JIRA) Mon, 11 May 2015 04:39:07 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-7021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537838#comment-14537838
 ]


Andrey Prokopenko commented on SOLR-7021:
-----------------------------------------

In my experience I've overcome this issue by first stop the cluster, then 
starting nodes in the same shard which were not in recovery state at the time 
when all the replicas in particular shard went down.
Seems we have a deadlock conditions here: node is trying to recover from the 
leader, which in turn cannot recover itself.


> Leader will not publish core as active without recovering first, but never 
> recovers
> -----------------------------------------------------------------------------------
>
>                 Key: SOLR-7021
>                 URL: https://issues.apache.org/jira/browse/SOLR-7021
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.10
>            Reporter: James Hardwick
>            Priority: Critical
>              Labels: recovery, solrcloud, zookeeper
>
> A little background: 1 core solr-cloud cluster across 3 nodes, each with its 
> own shard and each shard with a single replica hence each replica is itself a 
> leader. 
> For reasons we won't get into, we witnessed a shard go down in our cluster. 
> We restarted the cluster but our core/shards still did not come back up. 
> After inspecting the logs, we found this:
> {code}
> 015-01-21 15:51:56,494 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - We are http://xxx.xxx.xxx.35:8081/solr/xyzcore/ and leader is 
> http://xxx.xxx.xxx.35:8081/solr/xyzcore/
> 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - No LogReplay needed for core=xyzcore baseURL=http://xxx.xxx.xxx.35:8081/solr
> 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - I am the leader, no recovery necessary
> 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - publishing core=xyzcore state=active collection=xyzcore
> 2015-01-21 15:51:56,497 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - numShards not found on descriptor - reading it from system property
> 2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - publishing core=xyzcore state=down collection=xyzcore
> 2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - numShards not found on descriptor - reading it from system property
> 2015-01-21 15:51:56,501 [coreZkRegister-1-thread-2] ERROR core.ZkContainer  - 
> :org.apache.solr.common.SolrException: Cannot publish state of core 'xyzcore' 
> as active without recovering first!
>       at org.apache.solr.cloud.ZkController.publish(ZkController.java:1075)
> {code}
> And at this point the necessary shards never recover correctly and hence our 
> core never returns to a functional state. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SOLR-7021) Leader will not publish core as active without recovering first, but never recovers

Reply via email to