[ 
https://issues.apache.org/jira/browse/SOLR-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654472#comment-14654472
 ] 

Scott Blum commented on SOLR-7869:
----------------------------------

But what's the right fix?  Having looked through the code a bit now, I see that 
Overseer.ClusterStateUpdater has a *very* baked-in assumption that no one else 
is updating cluster state.  Copies of ClusterState float around and get updated 
over and over during processing, with the assumption that the local node is 
performing an atomic sequence of operations to get to a desired end state.  How 
can external changes be merged in?  My impulse was to catch 
BadVersionException, refresh ClusterState from ZK, then re-apply all the queued 
updates against the refreshed state.  However, I'm afraid that approach 
violates all of ClusterStateUpdater's assumptions.  I think the only thing to 
do is just clobber whatever is in ZK with what Overseer wants to write, even 
though that seems less than ideal.
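
To make the two options above concrete, here is a rough sketch against an 
in-memory stand-in for a znode.  Nothing in it is Solr or ZooKeeper code: 
VersionedNode, Snapshot, refreshAndReapply and clobber are hypothetical names, 
and cluster state is reduced to a plain String.  The one real behavior it 
borrows is that ZooKeeper's setData skips the version check when passed -1.  
It also assumes every queued update is a pure function of the state it is 
given, which is exactly the assumption that may not hold in the real code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical in-memory model; none of these types are Solr or ZooKeeper classes.
class OptimisticWriteSketch {

    /** Stand-in for ZooKeeper's BadVersionException (unchecked here to keep the sketch short). */
    static class BadVersionException extends RuntimeException {
        BadVersionException(int expected, int actual) {
            super("expected version " + expected + " but node is at version " + actual);
        }
    }

    /** Immutable pair of data and the version it was read at. */
    static final class Snapshot {
        final String data;
        final int version;
        Snapshot(String data, int version) { this.data = data; this.version = version; }
    }

    /** Stand-in for a znode such as /clusterstate.json: data plus a version bumped on each write. */
    static final class VersionedNode {
        private String data;
        private int version = 0;

        VersionedNode(String data) { this.data = data; }

        synchronized Snapshot read() { return new Snapshot(data, version); }

        /** Conditional write; expectedVersion -1 skips the check, as ZooKeeper's setData does. */
        synchronized void setData(String newData, int expectedVersion) {
            if (expectedVersion != -1 && expectedVersion != version) {
                throw new BadVersionException(expectedVersion, version);
            }
            data = newData;
            version++;
        }
    }

    /** Option 1: on a conflict, re-read the node and replay every queued update on the fresh copy. */
    static String refreshAndReapply(VersionedNode node, Snapshot cached,
                                    List<UnaryOperator<String>> queued, int maxAttempts) {
        Snapshot base = cached;
        for (int attempt = 1; ; attempt++) {
            String next = base.data;
            for (UnaryOperator<String> update : queued) {
                next = update.apply(next);
            }
            try {
                node.setData(next, base.version);
                return next;
            } catch (BadVersionException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up instead of looping forever
                }
                base = node.read(); // pick up the external change
            }
        }
    }

    /** Option 2: ignore the version and overwrite the node with the local state. */
    static void clobber(VersionedNode node, String localState) {
        node.setData(localState, -1);
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> queued = new ArrayList<>();
        queued.add(state -> state + ",replica1=down");

        VersionedNode merged = new VersionedNode("base");
        Snapshot stale = merged.read();
        merged.setData("base,external", stale.version); // someone else edits the node
        refreshAndReapply(merged, stale, queued, 3);
        System.out.println(merged.read().data); // base,external,replica1=down

        VersionedNode clobbered = new VersionedNode("base");
        clobbered.setData("base,external", 0);
        clobber(clobbered, "base,replica1=down");
        System.out.println(clobbered.read().data); // base,replica1=down
    }
}
```

In this toy model the first option keeps the external edit and the second 
silently drops it.  The bounded retry count is only there so the sketch cannot 
spin forever on repeated conflicts; it is not a claim about how Overseer 
should behave.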

> Overseer does not handle BadVersionException correctly
> ------------------------------------------------------
>
>                 Key: SOLR-7869
>                 URL: https://issues.apache.org/jira/browse/SOLR-7869
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.2.1
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>              Labels: difficulty-medium, impact-low
>             Fix For: 5.3, Trunk
>
>
> If the /clusterstate.json is modified externally then the Overseer can go 
> into an infinite loop upon a BadVersionException alternately trying to 
> execute main queue and then the work queue:
> {code}
> ERROR - 2015-08-04 18:49:56.224; [   ] 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; Exception in Overseer 
> work queue loop
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
> BadVersion for /clusterstate.json
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
>         at 
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:362)
>         at 
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:359)
>         at 
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
>         at 
> org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:359)
>         at 
> org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:180)
>         at 
> org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:67)
>         at 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:286)
>         at 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:168)
>         at java.lang.Thread.run(Thread.java:745)
> INFO  - 2015-08-04 18:49:56.224; [   ] 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: 
> queueSize: 1, message = {
>   "operation":"state",
>   "state":"down",
>   "base_url":"http://127.0.1.1:7574/solr",
>   "core":"test_shard1_replica1",
>   "roles":null,
>   "node_name":"127.0.1.1:7574_solr",
>   "shard":null,
>   "collection":"test",
>   "core_node_name":"core_node1"} current state version: 9
> INFO  - 2015-08-04 18:49:56.224; [   ] 
> org.apache.solr.cloud.overseer.ReplicaMutator; Update state numShards=null 
> message={
>   "operation":"state",
>   "state":"down",
>   "base_url":"http://127.0.1.1:7574/solr",
>   "core":"test_shard1_replica1",
>   "roles":null,
>   "node_name":"127.0.1.1:7574_solr",
>   "shard":null,
>   "collection":"test",
>   "core_node_name":"core_node1"}
> INFO  - 2015-08-04 18:49:56.224; [   ] 
> org.apache.solr.cloud.overseer.ReplicaMutator; shard=shard1 is already 
> registered
> ERROR - 2015-08-04 18:49:56.225; [   ] 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; Exception in Overseer 
> main queue loop
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
> BadVersion for /clusterstate.json
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
>         at 
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:362)
>         at 
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:359)
>         at 
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
>         at 
> org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:359)
>         at 
> org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:180)
>         at 
> org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:67)
>         at 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:286)
>         at 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:213)
>         at java.lang.Thread.run(Thread.java:745)
> INFO  - 2015-08-04 18:49:56.225; [   ] 
> org.apache.solr.common.cloud.ZkStateReader; Updating data for gettingstarted 
> to ver 8
> {code}



