Hello everyone,

on two (recently upgraded) SLES11SP3 machines, we are running an active/passive NFS fileserver and several other high availability services using corosync + pacemaker (see version numbers below).
We are having severe problems with resource monitors timing out during our nightly system backup, when the active machine is under high I/O load. These problems did not exist under SLES11SP1, from which we upgraded a few days ago.
After some diagnosis, it turns out that actually all cluster resources which are started by pacemaker are running with realtime priority, which includes our backup service. This seems not to be correct!
See this output of "ps --forest -Ao cls,rtprio,pri,comm --sort cls":

------------
RR 1 41 corosync
RR 1 41  \_ cib
RR 1 41  \_ stonithd
RR 1 41  \_ lrmd
RR 1 41  \_ attrd
RR 1 41  \_ pengine
RR 1 41  \_ crmd
RR 1 41  \_ mgmtd
RR 1 41 krb5kdc
RR 1 41 slapd
RR 1 41 cupsd
RR 1 41 rpc.svcgssd
RR 1 41 rpc.gssd
RR 1 41 rpc.idmapd
RR 1 41 rpc.mountd
RR 1 41 rpc.statd
RR 1 41 rpc.rquotad
RR 1 41 httpd2-prefork
RR 1 41  \_ httpd2-prefork
RR 1 41  \_ httpd2-prefork
RR 1 41  \_ httpd2-prefork
RR 1 41  \_ httpd2-prefork
RR 1 41  \_ httpd2-prefork
RR 1 41  \_ httpd2-prefork
RR 1 41 dsmcad
------------

Clearly, corosync itself **plus all cluster services** (like cups, slapd, httpd2) are running with realtime priority (process class being "RR").
As far as we remember from SLES11SP1, the resources were not running in realtime priority there. Hence, this looks like a bug in the more recent pacemaker/corosync version?!?
We suspect that the backup software "dsmcad" running in realtime priority causes the monitors to time out, as the system is under heavy IO load and may not respond in time for the monitors.
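For anyone who wants to check the same thing on their own nodes, here is a minimal sketch of how the realtime processes can be listed and, as a stopgap only, moved back to the normal scheduler. The function names list_rt_procs and demote_to_other are made up for illustration, and resetting the policy with chrt is an assumption about a possible workaround; it does not address why the resources inherit the realtime class in the first place.

```shell
# Print the command name of every row whose scheduling class is realtime
# (RR = round robin, FF = FIFO). Expects "cls rtprio pri comm" rows on stdin,
# i.e. ps output produced WITHOUT --forest, so that field 4 is the command.
list_rt_procs() {
    awk '$1 == "RR" || $1 == "FF" { print $4 }'
}

# Hypothetical stopgap: put every process with the exact name $1 back into
# SCHED_OTHER with static priority 0. Needs root; does not fix the root cause.
demote_to_other() {
    for pid in $(pgrep -x "$1"); do
        chrt -o -p 0 "$pid"
    done
}

# Live usage on a cluster node:
#   ps -Ao cls,rtprio,pri,comm | list_rt_procs
#   demote_to_other dsmcad
```

A process that is restarted by the cluster would presumably come up in the realtime class again, so this would have to be repeated after every resource start.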
More details about our setup:

------------
# hb_report -V
cluster-glue: 1.0.11 (8347e8c9b94f111400dd844f11bc6ede98cc11a5)

# zypper -q if cluster-glue pacemaker corosync
Information for package cluster-glue:
Repository: SLE11-HAE-SP3-Pool
Name: cluster-glue
Version: 1.0.11-0.15.28
Arch: x86_64
...
Information for package pacemaker:
Repository: SLE11-HAE-SP3-Pool
Name: pacemaker
Version: 1.1.9-0.19.102
Arch: x86_64
...
Information for package corosync:
Repository: SLE11-HAE-SP3-Pool
Name: corosync
Version: 1.4.5-0.18.15
Arch: x86_64
------------

I can provide more information on request. We would be glad for any hints or suggestions on how to fix this problem.
Best regards, J Brauchle
_______________________________________________ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org