On 7 Mar 2014, at 5:54 pm, Attila Megyeri <amegy...@minerva-soft.com> wrote:
> Thanks for the quick response!
>
>> -----Original Message-----
>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>> Sent: Friday, March 07, 2014 3:48 AM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>
>>
>> On 7 Mar 2014, at 5:31 am, Attila Megyeri <amegy...@minerva-soft.com> wrote:
>>
>>> Hello,
>>>
>>> We have a strange issue with Corosync/Pacemaker.
>>> From time to time, something unexpected happens and suddenly the
>>> crm_mon output remains static.
>>> When I check the CPU usage, I see that one of the cores uses 100% CPU, but
>>> I cannot actually match it to either corosync or one of the pacemaker
>>> processes.
>>>
>>> In such a case, this high CPU usage is happening on all 7 nodes.
>>> I have to manually go to each node, stop pacemaker, restart corosync, then
>>> start pacemaker. Stopping pacemaker and corosync does not work in most
>>> cases; usually a kill -9 is needed.
>>>
>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
>>>
>>> Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
>>>
>>> Logs are usually flooded with CPG related messages, such as:
>>>
>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>>
>>> OR
>>>
>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
>>
>> That is usually a symptom of corosync getting into a horribly confused state.
>> Version? Distro? Have you checked for an update?
>> Odd that the user of all that CPU isn't showing up though.
>>
>>>
>
> As I wrote, I use Ubuntu trusty; the exact package versions are:
>
> corosync 2.3.0-1ubuntu5
> pacemaker 1.1.10+git20130802-1ubuntu2

Ah sorry, I seem to have missed that part.

>
> There are no updates available. The only option is to install from sources,
> but that would be very difficult to maintain and I'm not sure I would get rid
> of this issue.
>
> What do you recommend?

The same thing as Lars, or switch to a distro that stays current with upstream (git shows 5 newer releases for that branch since it was released 3 years ago).
If you do build from source, it's probably best to go with v1.4.6.

>
>
>>>
>>> HTOP shows something like this (sorted by TIME+ descending):
>>>
>>>
>>>
>>>   1 [||||||||||||||||||||||||||||||||||||||||100.0%]     Tasks: 59, 4 thr; 2 running
>>>   2 [|                                         0.7%]     Load average: 1.00 0.99 1.02
>>>   Mem[||||||||||||||||||||||||||||||||    165/994MB]     Uptime: 1 day, 10:22:03
>>>   Swp[                                      0/509MB]
>>>
>>>   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%    TIME+ Command
>>>   921 root       20   0  188M 49220 33856 R  0.0  4.8  3h33:58 /usr/sbin/corosync
>>>  1277 snmp       20   0 45708  4248  1472 S  0.0  0.4  1:33.07 /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
>>>  1311 hacluster  20   0  109M 16160  9640 S  0.0  1.6  1:12.71 /usr/lib/pacemaker/cib
>>>  1312 root       20   0  104M  7484  3780 S  0.0  0.7  0:38.06 /usr/lib/pacemaker/stonithd
>>>  1611 root       -2   0  4408  2356  2000 S  0.0  0.2  0:24.15 /usr/sbin/watchdog
>>>  1316 hacluster  20   0  122M  9756  5924 S  0.0  1.0  0:22.62 /usr/lib/pacemaker/crmd
>>>  1313 root       20   0 81784  3800  2876 S  0.0  0.4  0:18.64 /usr/lib/pacemaker/lrmd
>>>  1314 hacluster  20   0 96616  4132  2604 S  0.0  0.4  0:16.01 /usr/lib/pacemaker/attrd
>>>  1309 root       20   0  104M  4804  2580 S  0.0  0.5  0:15.56 pacemakerd
>>>  1250 root       20   0 33000  1192   928 S  0.0  0.1  0:13.59 ha_logd: read process
>>>  1315 hacluster  20   0 73892  2652  1952 S  0.0  0.3  0:13.25 /usr/lib/pacemaker/pengine
>>>  1252 root       20   0 33000   712   456 S  0.0  0.1  0:13.03 ha_logd: write process
>>>  1835 ntp        20   0 27216  1980  1408 S  0.0  0.2  0:11.80 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:112
>>>   899 root       20   0 19168   700   488 S  0.0  0.1  0:09.75 /usr/sbin/irqbalance
>>>  1642 root       20   0 30696  1556   912 S  0.0  0.2  0:06.49 /usr/bin/monit -c /etc/monit/monitrc
>>>  4374 kamailio   20   0  291M  7272  2188 S  0.0  0.7  0:02.77 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>  3079 root        0 -20 16864  4592  3508 S  0.0  0.5  0:01.51 /usr/bin/atop -a -w /var/log/atop/atop_20140306 6
>>>   445 syslog     20   0  249M  6276   976 S  0.0  0.6  0:01.16 rsyslogd
>>>  4373 kamailio   20   0  291M  7492  2396 S  0.0  0.7  0:01.03 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>     1 root       20   0 33376  2632  1404 S  0.0  0.3  0:00.63 /sbin/init
>>>   453 syslog     20   0  249M  6276   976 S  0.0  0.6  0:00.63 rsyslogd
>>>   451 syslog     20   0  249M  6276   976 S  0.0  0.6  0:00.53 rsyslogd
>>>  4379 kamailio   20   0  291M  6224  1132 S  0.0  0.6  0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>  4380 kamailio   20   0  291M  8516  3084 S  0.0  0.8  0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>  4381 kamailio   20   0  291M  8252  2828 S  0.0  0.8  0:00.37 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>> 23315 root       20   0 24872  2476  1412 R  0.7  0.2  0:00.37 htop
>>>  4367 kamailio   20   0  291M 10000  4864 S  0.0  1.0  0:00.36 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>
>>>
>>> My questions:
>>> - Is this a corosync or pacemaker issue?
>>> - What are the CPG messages? Is it possible that we have a firewall issue?
>>>
>>>
>>> Any hints would be great!
>>>
>>> Thanks,
>>> Attila
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
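
For anyone chasing the same symptom, here is a minimal sketch (not part of Pacemaker or corosync) that counts the "Sent 0 CPG messages ... Try again" lines quoted above, grouped by host, daemon, and the `last=` value. The assumption, labeled as such, is that many hits with an unchanging `last=` mean the daemon keeps retrying the same send. The regex is inferred only from the log excerpts in this thread, and the names `STALL_RE`, `count_stalled_flushes`, and `SAMPLE` are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the log excerpts in this thread (an assumption,
# not an official Pacemaker log grammar). Whitespace is matched loosely.
STALL_RE = re.compile(
    r"\[(?P<pid>\d+)\]\s+(?P<host>\S+)\s+(?P<daemon>\w+):\s+info:\s+"
    r"crm_cs_flush:\s+Sent 0 CPG messages\s+"
    r"\((?P<remaining>\d+) remaining, last=(?P<last>\d+)\)"
)


def count_stalled_flushes(lines):
    """Count 'Sent 0 CPG messages' lines per (host, daemon, last) triple."""
    counts = Counter()
    for line in lines:
        match = STALL_RE.search(line)
        if match:
            key = (match["host"], match["daemon"], int(match["last"]))
            counts[key] += 1
    return counts


# Sample lines copied from the excerpts above, plus one unrelated line.
SAMPLE = [
    "Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: "
    "Sent 0 CPG messages (1 remaining, last=8): Try again (6)",
    "Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: "
    "Sent 0 CPG messages (1 remaining, last=8): Try again (6)",
    "Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: "
    "Sent 0 CPG messages (1 remaining, last=10933): Try again (",
    "Mar 06 17:46:25 [1341] ctdb1 cib: info: some unrelated message",
]

if __name__ == "__main__":
    for (host, daemon, last), hits in count_stalled_flushes(SAMPLE).items():
        print(f"{host} {daemon} last={last}: {hits} stalled flush attempts")
```

To run it against a real log, pass an open file object instead of `SAMPLE`; the log path depends on the local logging configuration and is not given in this thread.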