Hi David, Jan,

For the time being, corosync 2.3.3 looks stable with libqb 0.17.0, both built from source.
Thank you very much for the guidance!
Attila

> -----Original Message-----
> From: David Vossel [mailto:dvos...@redhat.com]
> Sent: Thursday, March 13, 2014 9:22 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>
> ----- Original Message -----
> > From: "Jan Friesse" <jfrie...@redhat.com>
> > To: "The Pacemaker cluster resource manager"
> > <pacemaker@oss.clusterlabs.org>
> > Sent: Thursday, March 13, 2014 4:03:28 AM
> > Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >
> > ...
> >
> > >>>>
> > >>>> Also can you please try to set debug: on in corosync.conf and
> > >>>> paste full corosync.log then?
> > >>>
> > >>> I set debug to on, and did a few restarts but could not reproduce
> > >>> the issue yet - will post the logs as soon as I manage to reproduce.
> > >>>
> > >>
> > >> Perfect.
> > >>
> > >> Another option you can try to set is netmtu (1200 is usually safe).
> > >
> > > Finally I was able to reproduce the issue.
> > > I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately
> > > (not when node was up again).
> > >
> > > The corosync log with debug on is available at:
> > > http://pastebin.com/kTpDqqtm
> > >
> > > To be honest, I had to wait much longer for this reproduction as
> > > before, even though there was no change in the corosync
> > > configuration - just potentially some system updates. But anyway,
> > > the issue is unfortunately still there.
> > > Previously, when this issue came, cpu was at 100% on all nodes -
> > > this time only on ctmgr, which was the DC...
> > >
> > > I hope you can find some useful details in the log.
> > >
> >
> > Attila,
> > what seems to be interesting is
> >
> > Configuration ERRORs found during PE processing. Please run
> > "crm_verify -L" to identify issues.
> >
> > I'm unsure how much is this problem but I'm really not pacemaker expert.
> >
> > Anyway, I have theory what may happening and it looks like related
> > with IPC (and probably not related to network).
> > But to make sure we will not try fixing already fixed bug, can you
> > please build:
> > - New libqb (0.17.0). There are plenty of fixes in IPC
> > - Corosync 2.3.3 (already plenty IPC fixes)
>
> yes, there was a libqb/corosync interoperation problem that showed these
> same symptoms last year. Updating to the latest corosync and libqb will likely
> resolve this.
>
> > - And maybe also newer pacemaker
> >
> > I know you were not very happy using hand-compiled sources, but please
> > give them at least a try.
> >
> > Thanks,
> > Honza
> >
> > > Thanks,
> > > Attila
> > >
> > >>
> > >> Regards,
> > >> Honza
> > >>
> > >>>
> > >>> There are also a few things that might or might not be related:
> > >>>
> > >>> 1) Whenever I want to edit the configuration with "crm configure
> > >>> edit",
> >
> > ...
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org