[ https://issues.apache.org/jira/browse/SOLR-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160156#comment-14160156 ]
Shalin Shekhar Mangar edited comment on SOLR-6591 at 10/6/14 11:37 AM:
-----------------------------------------------------------------------
What happens here is:
# The stress collection creation thread in that test is busy creating
collections (which use stateFormat=2).
# The overseer gets a "state" message from a new core being created through the
CoreAdmin API. This should implicitly create a new collection:
{code}
[junit4] 2> 561673 T45931 oasc.Overseer$ClusterStateUpdater.updateState Update state numShards=1 message={
[junit4] 2> "collection":"halfcollectionblocker",
[junit4] 2> "base_url":"http://127.0.0.1:42021",
[junit4] 2> "state":"down",
[junit4] 2> "numShards":"1",
[junit4] 2> "node_name":"127.0.0.1:42021_",
[junit4] 2> "roles":null,
[junit4] 2> "shard":null,
[junit4] 2> "operation":"state",
[junit4] 2> "core":"halfcollection_shard1_replica1"}
[junit4] 2> 561674 T45931 oasc.Overseer$ClusterStateUpdater.createCollection Create collection halfcollectionblocker with shards [shard1]
[junit4] 2> 561674 T45931 oasc.Overseer$ClusterStateUpdater.createCollection state version halfcollectionblocker 1
[junit4] 2> 561679 T45931 oasc.Overseer$ClusterStateUpdater.updateState Assigning new node to shard shard=shard1
{code}
# Right after the above message, the overseer gets a message to create
'awholynewstresscollection_collection4_1' (I'm assuming through a "state"
message). Processing it fails with the following exception:
{code}
[junit4] 2> 561682 T45931 oasc.Overseer$ClusterStateUpdater.run ERROR Exception in Overseer main queue loop
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/awholynewstresscollection_collection4_1/state.json
[junit4] 2> at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
[junit4] 2> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
[junit4] 2> at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
[junit4] 2> at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:382)
[junit4] 2> at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:379)
[junit4] 2> at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
[junit4] 2> at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:379)
[junit4] 2> at org.apache.solr.cloud.Overseer$ClusterStateUpdater.updateZkStates(Overseer.java:358)
[junit4] 2> at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:311)
[junit4] 2> at java.lang.Thread.run(Thread.java:745)
[junit4] 2>
{code}
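# This exception causes the "state" message already processed for the
'halfcollectionblocker' collection to be lost. The message is still present in
the work queue, but because the overseer itself is healthy, it simply continues
executing the main queue and never re-applies that update. The new core then
waits for a shard id that never shows up, and core creation eventually fails: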
{code}
[junit4] 2> 881993 T46259 oasc.ZkController.waitForShardId waiting to find shard id in clusterstate for halfcollection_shard1_replica1
[junit4] 2> 1202711 T46259 oasc.CoreContainer.create ERROR Error creating core [halfcollection_shard1_replica1]: Could not get shard id for core: halfcollection_shard1_replica1 org.apache.solr.common.SolrException: Could not get shard id for core: halfcollection_shard1_replica1
[junit4] 2> at org.apache.solr.cloud.ZkController.waitForShardId(ZkController.java:1425)
[junit4] 2> at org.apache.solr.cloud.ZkController.doGetShardIdAndNodeNameProcess(ZkController.java:1371)
[junit4] 2> at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1513)
[junit4] 2> at org.apache.solr.core.CoreContainer.create(CoreContainer.java:504)
[junit4] 2> at org.apache.solr.core.CoreContainer.create(CoreContainer.java:484)
[junit4] 2> at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:575)
{code}
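The failure mode walked through above can be illustrated with a toy model of the overseer's main-queue loop. This is a sketch under stated assumptions, not Solr's Overseer code: messages are reduced to collection names, ZooKeeper is a plain HashMap of path to data, every collection is given its own state.json for simplicity, and a batch is assumed to be flushed all-or-nothing. The class LostUpdateSketch, its methods, and the "retry each in-flight message individually" mitigation are all hypothetical and are not the change committed for SOLR-6591.

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical toy model of the Overseer main-queue loop (not Solr's implementation). */
public class LostUpdateSketch {
    static class NoNodeException extends Exception {
        NoNodeException(String path) { super("KeeperErrorCode = NoNode for " + path); }
    }

    final Map<String, String> zk = new HashMap<>();      // stand-in for ZooKeeper
    final Deque<String> mainQueue = new ArrayDeque<>();  // incoming "state" messages
    final Deque<String> workQueue = new ArrayDeque<>();  // in-flight messages of the current batch

    static String pathFor(String collection) {
        return "/collections/" + collection + "/state.json";
    }

    // Assumption: the batch is flushed as a unit, so one missing parent node aborts every write.
    void flush(Map<String, String> pending) throws NoNodeException {
        for (String path : pending.keySet()) {
            String parent = path.substring(0, path.lastIndexOf('/'));
            if (!zk.containsKey(parent)) throw new NoNodeException(path);
        }
        zk.putAll(pending);
    }

    /** One pass of the loop; retryIndividually enables the hypothetical mitigation. */
    void runOnce(boolean retryIndividually) {
        Map<String, String> pending = new LinkedHashMap<>();  // in-memory cluster state changes
        try {
            while (!mainQueue.isEmpty()) {
                String msg = mainQueue.peek();
                pending.put(pathFor(msg), "{\"state\":\"down\"}");
                workQueue.offer(mainQueue.poll());  // message leaves the main queue for the work queue
            }
            flush(pending);
            workQueue.clear();  // batch committed, in-flight messages no longer needed
        } catch (NoNodeException e) {
            System.out.println("Exception in Overseer main queue loop: " + e.getMessage());
            if (retryIndividually) {
                // Re-apply each in-flight message in its own flush so one bad message
                // cannot take the others down with it; the bad one is logged and dropped.
                while (!workQueue.isEmpty()) {
                    String msg = workQueue.poll();
                    Map<String, String> single = new LinkedHashMap<>();
                    single.put(pathFor(msg), "{\"state\":\"down\"}");
                    try {
                        flush(single);
                    } catch (NoNodeException inner) {
                        System.out.println("Dropping message for " + msg + ": " + inner.getMessage());
                    }
                }
            }
            // Without the mitigation the messages just sit in workQueue: nothing in this
            // loop re-applies them, and the next successful batch's clear() discards them.
        }
    }

    public static void main(String[] args) {
        for (boolean retry : new boolean[] {false, true}) {
            LostUpdateSketch o = new LostUpdateSketch();
            o.zk.put("/collections/halfcollectionblocker", "");  // parent node exists
            // No parent node for the stress collection, mimicking the NoNode in the log.
            o.mainQueue.add("halfcollectionblocker");
            o.mainQueue.add("awholynewstresscollection_collection4_1");
            o.runOnce(retry);
            System.out.println("retry=" + retry + " halfcollectionblocker written: "
                + o.zk.containsKey(pathFor("halfcollectionblocker")));
        }
    }
}
{code}

In this model, running main with retry=false leaves the halfcollectionblocker update unwritten (the analogue of the "Could not get shard id" timeout), while retrying the in-flight messages one at a time keeps the good update and drops only the message whose parent node is missing. Whether a failing message should be dropped, retried later, or make the overseer step down is a design question the sketch does not settle.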
> Cluster state updates can be lost on exception in main queue loop
> -----------------------------------------------------------------
>
> Key: SOLR-6591
> URL: https://issues.apache.org/jira/browse/SOLR-6591
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: Trunk
> Reporter: Shalin Shekhar Mangar
> Fix For: Trunk
>
>
> I found this bug while going through the failure on Jenkins:
> https://builds.apache.org/job/Lucene-Solr-NightlyTests-trunk/648/
> {code}
> 2 tests failed.
> REGRESSION: org.apache.solr.cloud.CollectionsAPIDistributedZkTest.testDistribSearch
> Error Message:
> Error CREATEing SolrCore 'halfcollection_shard1_replica1': Unable to create core [halfcollection_shard1_replica1] Caused by: Could not get shard id for core: halfcollection_shard1_replica1
> Stack Trace:
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error CREATEing SolrCore 'halfcollection_shard1_replica1': Unable to create core [halfcollection_shard1_replica1] Caused by: Could not get shard id for core: halfcollection_shard1_replica1
> at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:570)
> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:215)
> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:211)
> at org.apache.solr.cloud.CollectionsAPIDistributedZkTest.testErrorHandling(CollectionsAPIDistributedZkTest.java:583)
> at org.apache.solr.cloud.CollectionsAPIDistributedZkTest.doTest(CollectionsAPIDistributedZkTest.java:205)
> at org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:869)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1618)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)