[ https://issues.apache.org/jira/browse/SOLR-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shalin Shekhar Mangar updated SOLR-6591:
----------------------------------------
Attachment: SOLR-6591.patch
I found a related (same?) bug while investigating #2 above. The overseer loop
can sometimes use stale cluster state for collections with stateFormat > 1.
This happens because ZkStateReader.removeZKWatch removes the collection from
the 'watchedCollections' set but doesn't remove its cached state from the
watchedCollectionStates map. So when a replica of a collection is unloaded,
the watch is removed but the cached state still exists. If the overseer
happens to be on the same node that hosted the replica, it will continue
reading the old state, causing replica or leader information to be lost.
I've added a test which reproduces the problem (it hangs for a long time on
getLeaderRetry before failing to create the collection). The patch fixes the
problem by also removing the collection from watchedCollectionStates in
ZkStateReader.removeZKWatch.
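
For readers without the patch at hand, here is a minimal sketch of the
failure mode described above. It is not the real ZkStateReader code: the
class WatchedStateSketch, its method names, the String stand-in for the
cluster state, and the "cache first, then fresh read" lookup order are all
assumptions made for illustration. Only the names watchedCollections and
watchedCollectionStates come from the description.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the watch bookkeeping described above.
// This is NOT the real ZkStateReader; it only models the set and the map.
class WatchedStateSketch {
    // Collections that currently have a watch registered.
    final Set<String> watchedCollections = ConcurrentHashMap.newKeySet();
    // Last state cached per collection (a String stands in for the real state object).
    final Map<String, String> watchedCollectionStates = new ConcurrentHashMap<>();

    void addZkWatch(String coll, String state) {
        watchedCollections.add(coll);
        watchedCollectionStates.put(coll, state);
    }

    // Behaviour described as the bug: the watch goes away, the cached state stays.
    void removeZkWatchBuggy(String coll) {
        watchedCollections.remove(coll);
    }

    // Behaviour described as the fix: drop the cached state together with the watch.
    void removeZkWatchFixed(String coll) {
        watchedCollections.remove(coll);
        watchedCollectionStates.remove(coll);
    }

    // Assumed lookup order: trust the cache if present, otherwise use a fresh read.
    String getState(String coll, String freshState) {
        String cached = watchedCollectionStates.get(coll);
        return cached != null ? cached : freshState;
    }

    public static void main(String[] args) {
        WatchedStateSketch buggy = new WatchedStateSketch();
        buggy.addZkWatch("c1", "v1");
        buggy.removeZkWatchBuggy("c1");
        // The state has moved on to "v2", but the leftover cache entry wins.
        System.out.println("buggy: " + buggy.getState("c1", "v2")); // prints "buggy: v1"

        WatchedStateSketch fixed = new WatchedStateSketch();
        fixed.addZkWatch("c1", "v1");
        fixed.removeZkWatchFixed("c1");
        System.out.println("fixed: " + fixed.getState("c1", "v2")); // prints "fixed: v2"
    }
}
```

In the buggy variant nothing refreshes the leftover entry once the watch is
gone, which matches the stale reads the overseer sees when it runs on the
node that hosted the unloaded replica.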
> Cluster state updates can be lost on exception in main queue loop
> -----------------------------------------------------------------
>
> Key: SOLR-6591
> URL: https://issues.apache.org/jira/browse/SOLR-6591
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: Trunk
> Reporter: Shalin Shekhar Mangar
> Assignee: Shalin Shekhar Mangar
> Fix For: Trunk
>
> Attachments: SOLR-6591.patch
>
>
> I found this bug while going through a failure on Jenkins:
> https://builds.apache.org/job/Lucene-Solr-NightlyTests-trunk/648/
> {code}
> 2 tests failed.
> REGRESSION: org.apache.solr.cloud.CollectionsAPIDistributedZkTest.testDistribSearch
> Error Message:
> Error CREATEing SolrCore 'halfcollection_shard1_replica1': Unable to create core [halfcollection_shard1_replica1] Caused by: Could not get shard id for core: halfcollection_shard1_replica1
> Stack Trace:
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error CREATEing SolrCore 'halfcollection_shard1_replica1': Unable to create core [halfcollection_shard1_replica1] Caused by: Could not get shard id for core: halfcollection_shard1_replica1
> at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:570)
> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:215)
> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:211)
> at org.apache.solr.cloud.CollectionsAPIDistributedZkTest.testErrorHandling(CollectionsAPIDistributedZkTest.java:583)
> at org.apache.solr.cloud.CollectionsAPIDistributedZkTest.doTest(CollectionsAPIDistributedZkTest.java:205)
> at org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:869)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1618)
> {code}