Re: [Pacemaker] Occasional nonsensical resource agent errors

Ken Gaillot Tue, 15 Jul 2014 12:42:33 -0700

On 07/15/2014 02:31 PM, Andrew Daugherity wrote:

Message: 1
Date: Sat, 12 Jul 2014 09:42:57 -0400
From: Ken Gaillot <kjgai...@gleim.com>
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] Occasional nonsensical resource agent errors
        since Debian 3.2.57-3+deb7u1 kernel update


Hi,

We run multiple deployments of corosync+pacemaker on Debian "wheezy" for
high-availability of various resources. The configurations are unchanged
and ran without any issues for many months. However, since we applied
the Debian 3.2.57-3+deb7u1 kernel update in May, we have been getting
resource agent errors on rare occasions, with error messages that are
clearly incorrect.


[....]

Given the odd error messages from the resource agent, I suspect it's a
memory corruption error of some sort. We've been unable to find anything
else useful in the logs, and we'll probably end up reverting to the
prior kernel version. But given the rarity of the issue, it would be a
long while before we could be confident that fixed it.

Is anyone else running pacemaker on Debian with 3.2.57-3+deb7u1 kernel
or later? Has anyone had any similar issues?


Just curious, I see you're running Xen; are you setting dom0_mem?  I had similar
issues with SLES 11 SP2 and SP3 (but not <= SP1) that were apparently caused by
random memory corruption due to a kernel bug.  They were mostly random, but I did
eventually find a repeatable test case: checksum verification of a kernel build
tree with mtree; on affected systems there would usually be a few files that
failed to verify.
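
A rough stand-in for that kind of test, for systems where mtree is not
available, might look like the Python sketch below. It is not the mtree
invocation used above (that is not given in this thread); it only
illustrates the idea of recording checksums for a file tree once and then
re-verifying them. The names build_manifest, sha256_of, and verify are
made up for illustration.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file under root (relative path) to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            manifest[os.path.relpath(path, root)] = sha256_of(path)
    return manifest


def verify(root, manifest):
    """Return the sorted relative paths that are missing or whose digest changed."""
    failed = []
    for rel, expected in manifest.items():
        try:
            actual = sha256_of(os.path.join(root, rel))
        except OSError:
            failed.append(rel)
            continue
        if actual != expected:
            failed.append(rel)
    return sorted(failed)
```

The manifest would be built once with build_manifest(root) on a tree that
is not being modified, and verify(root, manifest) run repeatedly afterwards.
Since the files themselves do not change, any path it reports would suggest
the data was altered somewhere between disk and the checksum calculation.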

I had been setting dom0_mem=768M, as that was a good balance between maximizing
memory available for VMs and keeping enough for services in Dom0 (including
pacemaker/corosync), and I set node attributes for pacemaker utilization to 1GB
less than physical RAM, leaving 256M available for Xen overhead, etc.  Raising
it to 2048M (or not setting it at all) was a sufficient workaround to avoid the
bug, but I have finally received a fixed kernel from Novell support.
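
The memory budget described above can be written out as simple arithmetic.
The sketch below is only an illustration: memory_budget_mb is a made-up
helper, and the 16384 MB host in the usage note is a hypothetical figure,
not one taken from this thread.

```python
def memory_budget_mb(physical_mb, dom0_mb=768, reserved_mb=1024):
    """Split physical RAM (in MB) into dom0, VM utilization, and Xen overhead.

    dom0_mb     -- the dom0_mem setting (768M in the description above)
    reserved_mb -- amount held back from pacemaker utilization (1GB above)
    """
    if reserved_mb < dom0_mb:
        raise ValueError("reserved memory must cover dom0_mem")
    if physical_mb <= reserved_mb:
        raise ValueError("physical memory must exceed the reserved amount")
    return {
        "dom0": dom0_mb,
        # Value to use for the node's utilization attribute.
        "utilization": physical_mb - reserved_mb,
        # What remains for the hypervisor itself.
        "xen_overhead": reserved_mb - dom0_mb,
    }
```

For a hypothetical host with 16384 MB of RAM, memory_budget_mb(16384) gives
a utilization value of 15360 MB and leaves 256 MB for Xen overhead, which
matches the 1GB minus 768M figure above.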

Note: this fix has not yet made it into any official updates for SLES 11 -- 
Novell/SUSE say it will be in the next kernel version, whenever that happens.  
Recent openSUSE kernels are also affected (and have yet to be fixed).

-Andrew


Hi Andrew,

Thanks for the feedback!

The nodes in our "aries/taurus" cluster are Xen dom0s, and we pin dom0_mem
so there's at least 1GB RAM reported in the dom0 OS. (The version of
Xen+Linux kernel in wheezy has an issue where the reported RAM is less
than the dom0_mem value, so dom0_mem is actually set higher.)

However, we are also seeing the issue on our "talos/pomona" cluster,
whose nodes are not dom0s, so I don't suspect Xen itself. But it could be
the same kernel issue.

mtree isn't packaged for Debian, and I'm not familiar with it, although
I did see a Linux port on Google Code. How do you use it for your test
case? What do the detected differences signify?

Do you know what kernel and Xen versions were in SP2/3, and what
specifically was fixed in the kernel they gave you?


-- Ken Gaillot <kjgai...@gleim.com>
Network Operations Center, Gleim Publications

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
