[ 
https://issues.apache.org/jira/browse/SOLR-5961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146217#comment-14146217
 ] 

Ugo Matrangolo edited comment on SOLR-5961 at 9/24/14 11:23 AM:
----------------------------------------------------------------

Happened again :/

After routine maintenance on our network caused a ~30-second connectivity 
hiccup, the Solr cluster started to spam /overseer/queue with more than 47k 
events.

{code}
[zk: zookeeper4:2181(CONNECTED) 26] get /gilt/config/solr/overseer/queue
null
cZxid = 0x290008df29
ctime = Fri Aug 29 02:06:47 GMT+00:00 2014
mZxid = 0x290008df29
mtime = Fri Aug 29 02:06:47 GMT+00:00 2014
pZxid = 0x290023cedd
cversion = 60632
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 47822
[zk: zookeeper4:2181(CONNECTED) 27]
{code}

This time we tried waiting for it to heal itself and watched the numChildren 
count go down, but then it climbed back up again: there was no way it was going 
to recover on its own.
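
For reference, this is roughly how we watched the queue depth while waiting (a 
minimal sketch, assuming zkCli.sh is available and using the same 
/gilt/config/solr chroot and hosts as above):

{code}
# Hypothetical polling loop: zkCli.sh runs a single command passed after the
# connection arguments, so we just grep numChildren out of the stat output.
while true; do
  zkCli.sh -server zookeeper4:2181 stat /gilt/config/solr/overseer/queue 2>/dev/null | grep numChildren
  sleep 10
done
{code}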

As usual, we had to shut down the whole cluster, rmr /overseer/queue, and 
restart.
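
In case it helps anyone else hitting this, the recovery boils down to the 
following (a sketch under the same assumptions as above):

{code}
# 1. Stop every Solr node in the cluster.
# 2. Recursively delete the flooded queue znode (rmr is the recursive delete
#    in this zkCli version):
zkCli.sh -server zookeeper4:2181 rmr /gilt/config/solr/overseer/queue
# 3. Start the Solr nodes again so a fresh Overseer gets elected.
{code}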

Annoying :/



> Solr gets crazy on /overseer/queue state change
> -----------------------------------------------
>
>                 Key: SOLR-5961
>                 URL: https://issues.apache.org/jira/browse/SOLR-5961
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.7.1
>         Environment: CentOS, 1 shard - 3 replicas, ZK cluster with 3 nodes 
> (separate machines)
>            Reporter: Maxim Novikov
>            Priority: Critical
>
> No idea how to reproduce it, but sometimes Solr starts littering the log with 
> the following messages:
> 419158 [localhost-startStop-1-EventThread] INFO  
> org.apache.solr.cloud.DistributedQueue  – LatchChildWatcher fired on path: 
> /overseer/queue state: SyncConnected type NodeChildrenChanged
> 419190 [Thread-3] INFO  org.apache.solr.cloud.Overseer  – Update state 
> numShards=1 message={
>   "operation":"state",
>   "state":"recovering",
>   "base_url":"http://${IP_ADDRESS}/solr";,
>   "core":"${CORE_NAME}",
>   "roles":null,
>   "node_name":"${NODE_NAME}_solr",
>   "shard":"shard1",
>   "collection":"${COLLECTION_NAME}",
>   "numShards":"1",
>   "core_node_name":"core_node2"}
> It continues spamming these messages with no delay, and restarting all the 
> nodes does not help. I have even tried stopping all the nodes in the cluster 
> first, but when I start one again the behavior doesn't change: it goes crazy 
> with this "/overseer/queue state" spam again.
> PS The only way to handle this was to stop everything, manually clean up all 
> the Solr-related data in ZooKeeper, and then rebuild everything from scratch. 
> As you can understand, that is pretty much unbearable in a production 
> environment.


