Hi,

On Tue, Oct 01, 2013 at 11:07:35AM +0200, Joschi Brauchle wrote:
> Hello everyone,
>
> on two (recently upgraded) SLES11SP3 machines, we are running an
> active/passive NFS fileserver and several other high availability
> services using corosync + pacemaker (see version numbers below).
>
> We are having severe problems with resource monitors timing out
> during our system backup at night, where the active machine is under
> high IO load. These problems did not exist under SLES11SP1, from
> which we just upgraded some days ago.
>
>
> After some diagnosis, it turns out that actually all cluster
> resources which are started by pacemaker are running with realtime
> priority, which includes our backup service. This seems not to be
> correct!
>
>
> See this output of "ps --forest -Ao cls,rtprio,pri,comm --sort cls":
> ------------
> RR   1  41 corosync
> RR   1  41  \_ cib
> RR   1  41  \_ stonithd
> RR   1  41  \_ lrmd
> RR   1  41  \_ attrd
> RR   1  41  \_ pengine
> RR   1  41  \_ crmd
> RR   1  41  \_ mgmtd
> RR   1  41 krb5kdc
> RR   1  41 slapd
> RR   1  41 cupsd
> RR   1  41 rpc.svcgssd
> RR   1  41 rpc.gssd
> RR   1  41 rpc.idmapd
> RR   1  41 rpc.mountd
> RR   1  41 rpc.statd
> RR   1  41 rpc.rquotad
> RR   1  41 httpd2-prefork
> RR   1  41  \_ httpd2-prefork
> RR   1  41  \_ httpd2-prefork
> RR   1  41  \_ httpd2-prefork
> RR   1  41  \_ httpd2-prefork
> RR   1  41  \_ httpd2-prefork
> RR   1  41  \_ httpd2-prefork
> RR   1  41 dsmcad
> ------------
> Clearly, corosync itself **plus all cluster services** (like cups,
> slapd, httpd2) are running with realtime priority (process class
> being "RR").
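
For anyone who wants to repeat the check quoted above on their own nodes, here is a minimal Python sketch that filters "ps -Ao cls,rtprio,pri,comm" output for processes in a realtime class. The function name, the SAMPLE text and the "sshd" line in it are made up for illustration; only the RR lines are taken from the report. It assumes the procps class codes "RR" (SCHED_RR) and "FF" (SCHED_FIFO).

```python
import re

# Scheduling classes that ps prints for realtime policies.
REALTIME_CLASSES = {"RR", "FF"}


def realtime_processes(ps_output):
    """Return (cls, rtprio, comm) tuples for processes in a realtime class.

    Expects text in the layout of `ps -Ao cls,rtprio,pri,comm`,
    with or without --forest.
    """
    found = []
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4 or fields[0] not in REALTIME_CLASSES:
            continue  # header, blank line, or non-realtime process
        cls, rtprio, _pri, comm = fields
        # Drop the "\_" tree decoration that --forest puts before child names.
        comm = re.sub(r"^[\s|]*\\_\s*", "", comm).strip()
        found.append((cls, int(rtprio), comm))
    return found


# Hypothetical sample; the TS line is invented to show a non-realtime process.
SAMPLE = """\
CLS RTPRIO PRI COMMAND
TS       -  19 sshd
RR       1  41 corosync
RR       1  41  \\_ cib
RR       1  41 dsmcad
"""

if __name__ == "__main__":
    for cls, rtprio, comm in realtime_processes(SAMPLE):
        print(cls, rtprio, comm)
```

On a live system the same function could be fed the output of subprocess.run(["ps", "-Ao", "cls,rtprio,pri,comm"], capture_output=True, text=True).stdout instead of SAMPLE.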

Oops. Looks like neither corosync nor lrmd reset the priority and
scheduler for their children.

> As far as we remember from SLES11SP1, the resources were not running
> in realtime priority there. Hence, this looks like a bug in the more
> recent pacemaker/corosync version?!?

Looks like it. Can you please open a support call?

Thanks,

Dejan

> We suspect that the backup software "dsmcad" running in realtime
> priority causes the monitors to time out, as the system is under
> heavy IO load and may not respond in time for the monitors.
>
>
> More details about our setup:
> ------------
> # hb_report -V
> cluster-glue: 1.0.11 (8347e8c9b94f111400dd844f11bc6ede98cc11a5)
> # zypper -q if cluster-glue pacemaker corosync
> Information for package cluster-glue:
>
> Repository: SLE11-HAE-SP3-Pool
> Name: cluster-glue
> Version: 1.0.11-0.15.28
> Arch: x86_64
> ...
> Information for package pacemaker:
>
> Repository: SLE11-HAE-SP3-Pool
> Name: pacemaker
> Version: 1.1.9-0.19.102
> Arch: x86_64
> ...
> Information for package corosync:
>
> Repository: SLE11-HAE-SP3-Pool
> Name: corosync
> Version: 1.4.5-0.18.15
> Arch: x86_64
> ------------
>
> I can provide more required information on request. We would be glad
> for any hints or suggestions on how to fix this problem.
>
> Best regards,
> J Brauchle
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
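
To illustrate what "reset the priority and scheduler for their children" means in practice: a parent running under a realtime policy passes that policy on across fork() and exec() unless the child switches back to the default policy first. The Python sketch below shows the general idea only. It is not the corosync or lrmd code (those are written in C) and not the eventual fix; the helper names are hypothetical. It assumes Linux and Python 3.7 or later.

```python
import os
import subprocess
import sys


def reset_to_default_scheduler():
    """Runs in the child between fork() and exec().

    Switches the calling process back to SCHED_OTHER (static priority 0),
    so the program started afterwards does not inherit a realtime policy.
    """
    os.sched_setscheduler(0, os.SCHED_OTHER, os.sched_param(0))


def spawn_with_default_scheduler(argv):
    """Start argv with the default scheduling policy and wait for it."""
    return subprocess.run(
        argv,
        preexec_fn=reset_to_default_scheduler,
        capture_output=True,
        text=True,
        check=True,
    )


def child_policy():
    """Ask a freshly spawned child which scheduling policy it runs under."""
    result = spawn_with_default_scheduler(
        [sys.executable, "-c", "import os; print(os.sched_getscheduler(0))"]
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    print("child runs under SCHED_OTHER:", child_policy() == os.SCHED_OTHER)
```

The equivalent in C would be a sched_setscheduler(2) call in the child before exec; whether the nice value also needs resetting depends on how the parent was configured, which this sketch does not cover.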