[Pacemaker] Bug? Resources running with realtime priority - possibly causing monitor timeouts

Joschi Brauchle Tue, 01 Oct 2013 02:16:23 -0700

Hello everyone,

On two recently upgraded SLES11SP3 machines, we are running an active/passive NFS fileserver and several other high-availability services using corosync + pacemaker (see version numbers below).

We are having severe problems with resource monitors timing out during our nightly system backup, when the active machine is under high IO load. These problems did not exist under SLES11SP1, from which we upgraded a few days ago.

After some diagnosis, it turns out that all cluster resources started by pacemaker, including our backup service, are running with realtime priority. This does not seem to be correct!



See this output of "ps --forest -Ao cls,rtprio,pri,comm --sort cls":
------------
 RR      1  41 corosync
 RR      1  41  \_ cib
 RR      1  41  \_ stonithd
 RR      1  41  \_ lrmd
 RR      1  41  \_ attrd
 RR      1  41  \_ pengine
 RR      1  41  \_ crmd
 RR      1  41  \_ mgmtd
 RR      1  41 krb5kdc
 RR      1  41 slapd
 RR      1  41 cupsd
 RR      1  41 rpc.svcgssd
 RR      1  41 rpc.gssd
 RR      1  41 rpc.idmapd
 RR      1  41 rpc.mountd
 RR      1  41 rpc.statd
 RR      1  41 rpc.rquotad
 RR      1  41 httpd2-prefork
 RR      1  41  \_ httpd2-prefork
 RR      1  41  \_ httpd2-prefork
 RR      1  41  \_ httpd2-prefork
 RR      1  41  \_ httpd2-prefork
 RR      1  41  \_ httpd2-prefork
 RR      1  41  \_ httpd2-prefork
 RR      1  41 dsmcad
------------

Clearly, corosync itself **plus all cluster services** (like cups, slapd, httpd2) are running with realtime priority (scheduling class "RR", i.e. SCHED_RR round-robin).
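
To make this check repeatable, here is a minimal sketch that filters such ps output for processes in a realtime class. It assumes the four-column format produced by the command above (cls, rtprio, pri, comm); the function name realtime_commands is only illustrative and not part of any cluster tool.

```python
import re

# Scheduling class codes as printed by ps in the "cls" column:
# RR = SCHED_RR, FF = SCHED_FIFO (both realtime), TS = SCHED_OTHER.
REALTIME_CLASSES = {"RR", "FF"}


def realtime_commands(ps_output):
    """Return (command, rtprio) pairs for every realtime process.

    Expects lines in the format of
    `ps --forest -Ao cls,rtprio,pri,comm`.
    """
    hits = []
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue  # separator or empty line
        cls, rtprio, _pri, comm = fields
        if cls not in REALTIME_CLASSES:
            continue  # header line or non-realtime process
        # Drop the tree decoration that --forest puts before child names.
        name = re.sub(r"^[|\s]*\\_\s*", "", comm).strip()
        hits.append((name, int(rtprio)))
    return hits


if __name__ == "__main__":
    sample = (
        " RR      1  41 corosync\n"
        " RR      1  41  \\_ lrmd\n"
        " TS      -  19 bash\n"
        " RR      1  41 dsmcad\n"
    )
    for name, prio in realtime_commands(sample):
        print(name, prio)
```

On a live node the real ps output could be piped into the function instead of the sample string.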

As far as we remember, the resources were not running with realtime priority under SLES11SP1. Hence, this looks like a bug in the more recent pacemaker/corosync version.

We suspect that the backup software "dsmcad" running in realtime priority causes the monitors to time out, as the system is under heavy IO load and may not respond in time for the monitors.
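
For context: on Linux, a process created with fork() inherits the scheduling policy of its parent unless the parent set the reset-on-fork flag, which would be consistent with the ps output above. The following is only a sketch to illustrate that behaviour and one conceivable workaround (resetting a process to the normal SCHED_OTHER policy); it is not taken from pacemaker or corosync, and the helper names policy_of, child_policy and reset_to_normal are hypothetical.

```python
import os

# Map kernel policies to the class codes that ps prints in the "cls" column.
POLICY_NAMES = {
    os.SCHED_OTHER: "TS",
    os.SCHED_FIFO: "FF",
    os.SCHED_RR: "RR",
    os.SCHED_BATCH: "B",
    os.SCHED_IDLE: "IDL",
}


def policy_of(pid=0):
    """Return the ps-style class code for a PID (0 = calling process)."""
    policy = os.sched_getscheduler(pid)
    # The reset-on-fork flag may be OR'ed into the returned value.
    policy &= ~os.SCHED_RESET_ON_FORK
    return POLICY_NAMES.get(policy, "?")


def child_policy():
    """Fork a child and return the policy the child reports for itself."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: report own policy through the pipe, then exit immediately.
        os.close(read_fd)
        os.write(write_fd, policy_of(0).encode())
        os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        result = pipe.read()
    os.waitpid(pid, 0)
    return result


def reset_to_normal(pid=0):
    """Switch a process to SCHED_OTHER (static priority must be 0)."""
    os.sched_setscheduler(pid, os.SCHED_OTHER, os.sched_param(0))


if __name__ == "__main__":
    print("parent:", policy_of(0), "child:", child_policy())
    reset_to_normal()
    print("after reset:", policy_of(0))
```

Started from a realtime parent, the child would report the same class as the parent; changing the policy of another user's process requires appropriate privileges.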



More details about our setup:
------------
# hb_report -V
cluster-glue: 1.0.11 (8347e8c9b94f111400dd844f11bc6ede98cc11a5)
# zypper -q if cluster-glue pacemaker corosync
Information for package cluster-glue:

Repository: SLE11-HAE-SP3-Pool
Name: cluster-glue
Version: 1.0.11-0.15.28
Arch: x86_64
...
Information for package pacemaker:

Repository: SLE11-HAE-SP3-Pool
Name: pacemaker
Version: 1.1.9-0.19.102
Arch: x86_64
...
Information for package corosync:

Repository: SLE11-HAE-SP3-Pool
Name: corosync
Version: 1.4.5-0.18.15
Arch: x86_64
------------

I can provide more information on request. We would be glad for any hints or suggestions on how to fix this problem.


Best regards,
J Brauchle


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
