[
https://issues.apache.org/jira/browse/SOLR-7021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537838#comment-14537838
]
Andrey Prokopenko commented on SOLR-7021:
-----------------------------------------
In my experience, I've overcome this issue by first stopping the cluster, then
starting the nodes in the affected shard that were not in recovery state at the
time when all the replicas in that shard went down.
It seems we have a deadlock condition here: a node is trying to recover from the
leader, which in turn cannot recover itself.
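A rough sketch of how that restart order could be worked out, in case it helps anyone scripting the workaround. This is not Solr code: `restart_order`, the node names, and the idea of recording each replica's last known state are all assumptions for illustration, and it reads the workaround as "bring up the replicas that were not recovering before the ones that were".

```python
# Hypothetical helper, not part of Solr: decide the order in which to restart
# the replicas of one shard after the whole cluster has been stopped.
# Assumption: we recorded each replica's state at the moment the shard went down.

RECOVERY_STATE = "recovering"  # assumed label for a replica that was mid-recovery


def restart_order(last_known_state):
    """Return node names with non-recovering replicas first.

    last_known_state maps a node name to the state string that replica had
    when the shard went down, e.g. "active" or "recovering".
    """
    # Replicas that were not recovering can serve as a source to recover from.
    not_recovering = sorted(
        node for node, state in last_known_state.items() if state != RECOVERY_STATE
    )
    # Replicas that were recovering are started only after the others are up.
    recovering = sorted(
        node for node, state in last_known_state.items() if state == RECOVERY_STATE
    )
    return not_recovering + recovering


if __name__ == "__main__":
    states = {"node1": "recovering", "node2": "active", "node3": "recovering"}
    print(restart_order(states))
```

With the example states above, node2 is started first and the two replicas that were recovering follow it. If every replica in the shard was recovering when it went down, the ordering gives no help, which matches the deadlock described above.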
> Leader will not publish core as active without recovering first, but never
> recovers
> -----------------------------------------------------------------------------------
>
> Key: SOLR-7021
> URL: https://issues.apache.org/jira/browse/SOLR-7021
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.10
> Reporter: James Hardwick
> Priority: Critical
> Labels: recovery, solrcloud, zookeeper
>
> A little background: a single-core SolrCloud cluster across 3 nodes, each node
> with its own shard and each shard with a single replica, hence each replica is
> itself a leader.
> For reasons we won't get into, we witnessed a shard go down in our cluster.
> We restarted the cluster but our core/shards still did not come back up.
> After inspecting the logs, we found this:
> {code}
> 2015-01-21 15:51:56,494 [coreZkRegister-1-thread-2] INFO cloud.ZkController
> - We are http://xxx.xxx.xxx.35:8081/solr/xyzcore/ and leader is
> http://xxx.xxx.xxx.35:8081/solr/xyzcore/
> 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO cloud.ZkController
> - No LogReplay needed for core=xyzcore baseURL=http://xxx.xxx.xxx.35:8081/solr
> 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO cloud.ZkController
> - I am the leader, no recovery necessary
> 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO cloud.ZkController
> - publishing core=xyzcore state=active collection=xyzcore
> 2015-01-21 15:51:56,497 [coreZkRegister-1-thread-2] INFO cloud.ZkController
> - numShards not found on descriptor - reading it from system property
> 2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO cloud.ZkController
> - publishing core=xyzcore state=down collection=xyzcore
> 2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO cloud.ZkController
> - numShards not found on descriptor - reading it from system property
> 2015-01-21 15:51:56,501 [coreZkRegister-1-thread-2] ERROR core.ZkContainer -
> :org.apache.solr.common.SolrException: Cannot publish state of core 'xyzcore'
> as active without recovering first!
> at org.apache.solr.cloud.ZkController.publish(ZkController.java:1075)
> {code}
> And at this point the necessary shards never recover correctly and hence our
> core never returns to a functional state.