[
https://issues.apache.org/jira/browse/SOLR-5961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308716#comment-14308716
]
Gopal Patwa commented on SOLR-5961:
-----------------------------------
We also hit a similar problem today in our production system, as Ugo mentioned.
It was triggered by rebooting the machines of our ZooKeeper ensemble (5 nodes)
and 8-node SolrCloud cluster (single shard) to install a Unix security patch.
JDK 7, Solr 4.10.3, CentOS
After the reboot, we saw a huge number of messages in overseer/queue:
./zkCli.sh -server localhost:2181 ls /search/catalog/overseer/queue | sed 's/,/\n/g' | wc -l
178587
We have a very small cluster (8 nodes); how did overseer/queue end up with
178k+ messages? Because of this, the leader node took a few hours to come out
of recovery.
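For reference, here is a minimal sketch of doing the same count with the plain
ZooKeeper Java client instead of piping zkCli.sh output through sed/wc; the
connect string and the /search/catalog chroot are assumptions copied from the
command above, not something verified against our cluster:

import java.util.List;

import org.apache.zookeeper.ZooKeeper;

public class OverseerQueueCount {
    public static void main(String[] args) throws Exception {
        // Assumed ZK host and chroot, matching the zkCli.sh call above
        ZooKeeper zk = new ZooKeeper("localhost:2181/search/catalog", 30000, event -> { });
        try {
            // Count the children of /overseer/queue without setting a watch
            List<String> children = zk.getChildren("/overseer/queue", false);
            System.out.println("overseer queue size: " + children.size());
        } finally {
            zk.close();
        }
    }
}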
Logs from ZooKeeper:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode =
ConnectionLoss for /overseer/queue
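One plausible reading of that ConnectionLoss is that a getChildren on ~178k
queue entries exceeds ZooKeeper's default 1 MB jute.maxbuffer, so the response
never makes it back to the client. Below is a hedged sketch (not an official
Solr tool) of draining /overseer/queue with the ZooKeeper Java client while
every Solr node is stopped, i.e. the same kind of manual ZooKeeper cleanup the
original report resorted to; the connect string is an assumption, and a larger
-Djute.maxbuffer would likely be needed on both the ZooKeeper servers and this
client JVM for the listing to succeed:

import java.util.List;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class OverseerQueueDrain {
    public static void main(String[] args) throws Exception {
        // Assumed connect string/chroot; only run this while all Solr nodes are stopped
        ZooKeeper zk = new ZooKeeper("localhost:2181/search/catalog", 30000, event -> { });
        try {
            List<String> children = zk.getChildren("/overseer/queue", false);
            for (String child : children) {
                try {
                    // -1 means "delete regardless of znode version"
                    zk.delete("/overseer/queue/" + child, -1);
                } catch (KeeperException.NoNodeException ignore) {
                    // entry was already consumed or removed; keep going
                }
            }
        } finally {
            zk.close();
        }
    }
}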
> Solr gets crazy on /overseer/queue state change
> -----------------------------------------------
>
> Key: SOLR-5961
> URL: https://issues.apache.org/jira/browse/SOLR-5961
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.7.1
> Environment: CentOS, 1 shard - 3 replicas, ZK cluster with 3 nodes
> (separate machines)
> Reporter: Maxim Novikov
> Assignee: Shalin Shekhar Mangar
> Priority: Critical
>
> No idea how to reproduce it, but sometimes Solr starts littering the log with
> the following messages:
> 419158 [localhost-startStop-1-EventThread] INFO
> org.apache.solr.cloud.DistributedQueue ? LatchChildWatcher fired on path:
> /overseer/queue state: SyncConnected type NodeChildrenChanged
> 419190 [Thread-3] INFO org.apache.solr.cloud.Overseer ? Update state
> numShards=1 message={
> "operation":"state",
> "state":"recovering",
> "base_url":"http://${IP_ADDRESS}/solr",
> "core":"${CORE_NAME}",
> "roles":null,
> "node_name":"${NODE_NAME}_solr",
> "shard":"shard1",
> "collection":"${COLLECTION_NAME}",
> "numShards":"1",
> "core_node_name":"core_node2"}
> It keeps spamming these messages with no delay, and restarting all the nodes
> does not help. I have even tried stopping all the nodes in the cluster first,
> but when I start one again the behavior doesn't change; it goes crazy over
> this "/overseer/queue state" thing again.
> PS The only way to handle this was to stop everything, manually clean up all
> the Solr-related data in ZooKeeper, and then rebuild everything from scratch.
> As you can imagine, that is unbearable in a production environment.