[
https://issues.apache.org/jira/browse/SOLR-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14655057#comment-14655057
]
Shalin Shekhar Mangar commented on SOLR-7869:
---------------------------------------------
Thanks Scott!
bq. My impulse was to catch BadVersionException, refresh ClusterState from ZK,
then re-apply all the queued updates against the refreshed state.
That is the right fix. That is how I intended it to work but I obviously didn't
write enough tests.
bq. However, I'm afraid that approach violates all of ClusterUpdater's
assumptions.
Originally the Overseer would force-refresh the cluster state at the beginning
of the loop, apply the updates, and write to ZK. This was wasteful because,
most of the time, the Overseer is the only writer of the ZK state. This is why
I introduced a local cluster state that is written to ZK using CAS, removing
the need to refresh the cluster state on every iteration. If that CAS fails,
it means that either someone has changed the state externally or, due to a
bug, multiple overseers have started processing. At that point, we go back to
the beginning of the loop, check whether we are still the leader, force-refresh
the cluster state, process the work queue, and then continue on to the main
queue.
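
To make the intended behaviour concrete, here is a minimal, self-contained
sketch of the CAS-then-refresh loop described above. It is not the Overseer
code: VersionedStore is a hypothetical in-memory stand-in for the versioned
/clusterstate.json znode, Updater is a hypothetical stand-in for the updater
loop, and the leader check and the work queue are left out for brevity.

```java
import java.util.Arrays;
import java.util.List;

class CasLoopSketch {

    /** Hypothetical in-memory stand-in for a versioned znode. */
    static final class VersionedStore {
        private String data = "";
        private int version = 0;

        synchronized String data() { return data; }
        synchronized int version() { return version; }

        /** Compare-and-set write: applied only if expectedVersion is current. */
        synchronized boolean setData(String newData, int expectedVersion) {
            if (expectedVersion != version) {
                return false; // plays the role of BadVersionException
            }
            data = newData;
            version++;
            return true;
        }
    }

    /** Hypothetical updater that keeps a local copy of the state. */
    static final class Updater {
        private final VersionedStore store;
        private String localData;
        private int localVersion;
        int casFailures = 0;

        Updater(VersionedStore store) {
            this.store = store;
            refresh();
        }

        /** Force-refresh the local copy from the store. */
        private void refresh() {
            synchronized (store) {
                localData = store.data();
                localVersion = store.version();
            }
        }

        /** Apply all updates to the local state and write them with CAS. */
        void process(List<String> updates) {
            while (true) {
                StringBuilder next = new StringBuilder(localData);
                for (String update : updates) {
                    next.append(update);
                }
                if (store.setData(next.toString(), localVersion)) {
                    localData = next.toString();
                    localVersion++;
                    return;
                }
                // CAS failed: someone else wrote. Refresh, then re-apply
                // the same updates against the refreshed state.
                casFailures++;
                refresh();
            }
        }
    }

    public static void main(String[] args) {
        VersionedStore store = new VersionedStore();
        Updater updater = new Updater(store);
        updater.process(Arrays.asList("a"));
        store.setData("X", store.version()); // simulated external modification
        updater.process(Arrays.asList("b"));
        System.out.println(store.data() + " casFailures=" + updater.casFailures);
    }
}
```

The point of the sketch is that a failed CAS must lead to a refresh before the
retry. Retrying with the stale local version fails every time, which is the
kind of loop reported in this issue.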
> Overseer does not handle BadVersionException correctly
> ------------------------------------------------------
>
> Key: SOLR-7869
> URL: https://issues.apache.org/jira/browse/SOLR-7869
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 5.2.1
> Reporter: Shalin Shekhar Mangar
> Assignee: Shalin Shekhar Mangar
> Labels: difficulty-medium, impact-low
> Fix For: 5.3, Trunk
>
> Attachments: SOLR-7869.patch
>
>
> If the /clusterstate.json is modified externally then the Overseer can go
> into an infinite loop upon a BadVersionException alternately trying to
> execute main queue and then the work queue:
> {code}
> ERROR - 2015-08-04 18:49:56.224; [ ]
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; Exception in Overseer
> work queue loop
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
> BadVersion for /clusterstate.json
> at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
> at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
> at
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:362)
> at
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:359)
> at
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
> at
> org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:359)
> at
> org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:180)
> at
> org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:67)
> at
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:286)
> at
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:168)
> at java.lang.Thread.run(Thread.java:745)
> INFO - 2015-08-04 18:49:56.224; [ ]
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage:
> queueSize: 1, message = {
> "operation":"state",
> "state":"down",
> "base_url":"http://127.0.1.1:7574/solr",
> "core":"test_shard1_replica1",
> "roles":null,
> "node_name":"127.0.1.1:7574_solr",
> "shard":null,
> "collection":"test",
> "core_node_name":"core_node1"} current state version: 9
> INFO - 2015-08-04 18:49:56.224; [ ]
> org.apache.solr.cloud.overseer.ReplicaMutator; Update state numShards=null
> message={
> "operation":"state",
> "state":"down",
> "base_url":"http://127.0.1.1:7574/solr",
> "core":"test_shard1_replica1",
> "roles":null,
> "node_name":"127.0.1.1:7574_solr",
> "shard":null,
> "collection":"test",
> "core_node_name":"core_node1"}
> INFO - 2015-08-04 18:49:56.224; [ ]
> org.apache.solr.cloud.overseer.ReplicaMutator; shard=shard1 is already
> registered
> ERROR - 2015-08-04 18:49:56.225; [ ]
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; Exception in Overseer
> main queue loop
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
> BadVersion for /clusterstate.json
> at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
> at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
> at
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:362)
> at
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:359)
> at
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
> at
> org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:359)
> at
> org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:180)
> at
> org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:67)
> at
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:286)
> at
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:213)
> at java.lang.Thread.run(Thread.java:745)
> INFO - 2015-08-04 18:49:56.225; [ ]
> org.apache.solr.common.cloud.ZkStateReader; Updating data for gettingstarted
> to ver 8
> {code}