On 07/15/2014 02:31 PM, Andrew Daugherity wrote:
Message: 1
Date: Sat, 12 Jul 2014 09:42:57 -0400
From: Ken Gaillot <kjgai...@gleim.com>
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] Occasional nonsensical resource agent errors since Debian 3.2.57-3+deb7u1 kernel update
Hi,
We run multiple deployments of corosync+pacemaker on Debian "wheezy" for
high-availability of various resources. The configurations are unchanged
and ran without any issues for many months. However, since we applied
the Debian 3.2.57-3+deb7u1 kernel update in May, we have been getting
resource agent errors on rare occasions, with error messages that are
clearly incorrect.
[....]
Given the odd error messages from the resource agent, I suspect it's a
memory corruption error of some sort. We've been unable to find anything
else useful in the logs, and we'll probably end up reverting to the
prior kernel version. But given the rarity of the issue, it would be a
long while before we could be confident that the revert had fixed it.
Is anyone else running pacemaker on Debian with 3.2.57-3+deb7u1 kernel
or later? Has anyone had any similar issues?
Just curious, I see you're running Xen; are you setting dom0_mem? I had similar
issues with SLES 11 SP2 and SP3 (but not SP1 or earlier) that were apparently
random memory corruption caused by a kernel bug. The failures were mostly random,
but I did eventually find a repeatable test case: checksum verification of a
kernel build tree with mtree. On affected systems there would usually be a few
files that failed to verify.
I had been setting dom0_mem=768M, as that was a good balance between maximizing
memory available for VMs while keeping enough for services in Dom0 (including
pacemaker/corosync), and I set node attributes for pacemaker utilization to 1GB
less than physical RAM, leaving 256M available for Xen overhead, etc. Raising
it to 2048M (or not setting it at all) was a sufficient workaround to avoid the
bug, but I have finally received a fixed kernel from Novell support.
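To make the arithmetic in the paragraph above concrete, here is a small Python sketch. The 768M and 1GB figures come from the message; the 16GB host size and the function and key names are made up purely for illustration.

```python
# Illustrative arithmetic only. The host size used below is hypothetical.
DOM0_MEM_MB = 768    # dom0_mem value quoted in the message
RESERVED_MB = 1024   # utilization is set 1GB below physical RAM


def memory_budget(physical_mb):
    """Split physical RAM into the three pools described in the message."""
    if physical_mb <= RESERVED_MB:
        raise ValueError("host has no memory left over for VMs")
    return {
        # value that would go into the pacemaker utilization attribute
        "vm_utilization_mb": physical_mb - RESERVED_MB,
        # memory pinned for Dom0 services (pacemaker/corosync etc.)
        "dom0_mb": DOM0_MEM_MB,
        # what remains of the reserved 1GB for Xen overhead
        "xen_overhead_mb": RESERVED_MB - DOM0_MEM_MB,
    }


if __name__ == "__main__":
    # Hypothetical host with 16GB of physical RAM.
    print(memory_budget(16 * 1024))
```

With dom0_mem raised to 2048M, the reserved amount would have to grow as well for the same 256M of headroom to remain; the sketch does not model that case.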
Note: this fix has not yet made it into any official updates for SLES 11 --
Novell/SUSE say it will be in the next kernel version, whenever that happens.
Recent openSUSE kernels are also affected (and have yet to be fixed).
-Andrew
Hi Andrew,
Thanks for the feedback!
The nodes of our "aries/taurus" cluster are Xen dom0s, and we pin dom0_mem so
there's at least 1GB RAM reported in the dom0 OS. (The version of Xen+Linux
kernel in wheezy has an issue where the reported RAM is less than the
dom0_mem value, so the dom0_mem we set is actually higher than 1GB.)
However, we are also seeing the issue on our "talos/pomona" cluster,
whose nodes are not dom0s, so I don't suspect Xen itself. But it could be the
same kernel issue.
mtree isn't packaged for Debian, and I'm not familiar with it, although
I did see a Linux port on Google Code. How do you use it for your test
case? What do the detected differences signify?
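For reference, the general idea of a checksum verification pass over a source tree can be sketched without mtree. The Python below is only a rough stand-in under my own assumptions, not the mtree procedure itself: it records a SHA-256 digest per regular file, then re-reads the tree and reports files whose digest no longer matches. The function names and the example path are hypothetical.

```python
import hashlib
import os


def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file under root (relative path) to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files; only plain files are hashed.
            if os.path.isfile(path) and not os.path.islink(path):
                manifest[os.path.relpath(path, root)] = hash_file(path)
    return manifest


def verify_manifest(root, manifest):
    """Re-hash every file in the manifest; return paths that differ."""
    mismatches = []
    for relpath, expected in sorted(manifest.items()):
        try:
            actual = hash_file(os.path.join(root, relpath))
        except OSError:
            actual = None  # unreadable or missing counts as a mismatch
        if actual != expected:
            mismatches.append(relpath)
    return mismatches


if __name__ == "__main__":
    tree = "/usr/src/linux-build"  # hypothetical path to a build tree
    baseline = build_manifest(tree)
    for attempt in range(5):
        print(attempt, verify_manifest(tree, baseline))
```

On a tree that nothing is writing to, every pass should report an empty list. A file that shows up as a mismatch means the data read back differed from the first read, although the sketch cannot say at which layer the difference arose.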
Do you know what kernel and Xen versions were in SP2/3, and what
specifically was fixed in the kernel they gave you?
-- Ken Gaillot <kjgai...@gleim.com>
Network Operations Center, Gleim Publications