[ 
https://issues.apache.org/jira/browse/SOLR-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-6591:
----------------------------------------
    Attachment: SOLR-6591.patch

I found a related (same?) bug while investigating #2 above. The overseer loop 
can sometimes use stale cluster state for collections with stateFormat > 1. 
This happens because ZkStateReader.removeZKWatch removes the collection from 
the 'watchedCollections' set but does not remove its cached state from the 
watchedCollectionStates map. So when a replica of a collection is unloaded, 
the watch is removed but the cached state remains. If the overseer happens to 
be on the same node that hosted the replica, it will continue reading the old 
state, causing replica or leader information to be lost.

I've added a test that reproduces the problem (it hangs for a long time on 
getLeaderRetry before failing to create the collection). The patch fixes the 
problem by removing the collection from watchedCollectionStates in 
ZkStateReader.removeZKWatch.
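
To make the failure mode concrete, here is a minimal sketch of the pattern 
described above. It is a simplified, hypothetical model and not the actual 
ZkStateReader code or the attached patch: only the names watchedCollections 
and watchedCollectionStates come from this issue; the class, the in-memory 
"zk" map, and every method name are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of a reader that caches state for watched collections.
// NOT the real ZkStateReader; only the two field names come from the issue.
class StaleWatchSketch {
    private final Set<String> watchedCollections = new HashSet<>();
    private final Map<String, String> watchedCollectionStates = new HashMap<>();
    // Stand-in for ZooKeeper (assumption): collection name -> latest state.
    private final Map<String, String> zk = new HashMap<>();

    // Writes a new state; the cache is refreshed only for watched collections.
    void publish(String collection, String state) {
        zk.put(collection, state);
        if (watchedCollections.contains(collection)) {
            watchedCollectionStates.put(collection, state);
        }
    }

    // Starts watching a collection and caches its current state.
    void addWatch(String collection) {
        watchedCollections.add(collection);
        watchedCollectionStates.put(collection, zk.get(collection));
    }

    // Buggy removal: the watch goes away but the cached state is left behind.
    void removeWatchBuggy(String collection) {
        watchedCollections.remove(collection);
    }

    // Fixed removal: drop the cached state together with the watch.
    void removeWatchFixed(String collection) {
        watchedCollections.remove(collection);
        watchedCollectionStates.remove(collection);
    }

    // Prefers the cached state and falls back to the backing store.
    String getState(String collection) {
        String cached = watchedCollectionStates.get(collection);
        return cached != null ? cached : zk.get(collection);
    }

    public static void main(String[] args) {
        StaleWatchSketch reader = new StaleWatchSketch();
        reader.publish("c1", "v1");
        reader.addWatch("c1");
        reader.removeWatchBuggy("c1");
        reader.publish("c1", "v2");
        // The cache was never cleared, so the old state is still returned.
        System.out.println(reader.getState("c1"));
    }
}
```

In this sketch, swapping removeWatchBuggy for removeWatchFixed makes getState 
fall through to the backing store and return the current state, which is the 
same idea as the fix described above.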

> Cluster state updates can be lost on exception in main queue loop
> -----------------------------------------------------------------
>
>                 Key: SOLR-6591
>                 URL: https://issues.apache.org/jira/browse/SOLR-6591
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: Trunk
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: Trunk
>
>         Attachments: SOLR-6591.patch
>
>
> I found this bug while going through the failure on jenkins:
> https://builds.apache.org/job/Lucene-Solr-NightlyTests-trunk/648/
> {code}
> 2 tests failed.
> REGRESSION:  org.apache.solr.cloud.CollectionsAPIDistributedZkTest.testDistribSearch
> Error Message:
> Error CREATEing SolrCore 'halfcollection_shard1_replica1': Unable to create core [halfcollection_shard1_replica1] Caused by: Could not get shard id for core: halfcollection_shard1_replica1
> Stack Trace:
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error CREATEing SolrCore 'halfcollection_shard1_replica1': Unable to create core [halfcollection_shard1_replica1] Caused by: Could not get shard id for core: halfcollection_shard1_replica1
>         at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:570)
>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:215)
>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:211)
>         at org.apache.solr.cloud.CollectionsAPIDistributedZkTest.testErrorHandling(CollectionsAPIDistributedZkTest.java:583)
>         at org.apache.solr.cloud.CollectionsAPIDistributedZkTest.doTest(CollectionsAPIDistributedZkTest.java:205)
>         at org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:869)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1618)
> {code}



