> Message: 6
> Date: Tue, 15 Jul 2014 15:36:40 -0400
> From: Ken Gaillot <kjgai...@gleim.com>
> To: The Pacemaker cluster resource manager
>       <pacemaker@oss.clusterlabs.org>
> Subject: Re: [Pacemaker] Occasional nonsensical resource agent errors
> Message-ID: <53c582c8.6090...@gleim.com>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
> 
> Hi Andrew,
> 
> Thanks for the feedback!
> 
> Our "aries/taurus" cluster are Xen dom0s, and we pin dom0_mem so there's 
> at least 1GB RAM reported in the dom0 OS. (The version of Xen+Linux 
> kernel in wheezy has an issue where the reported RAM is less than the 
> dom0_mem value, so dom0_mem is actually higher.)
> 
> However, we are also seeing the issue on our "talos/pomona" cluster, 
> whose nodes are not dom0s, so I don't suspect Xen itself. It could still 
> be the same kernel issue.
> 
> mtree isn't packaged for Debian, and I'm not familiar with it, although 
> I did see a Linux port on Google code. How do you use it for your test 
> case? What do the detected differences signify?
That mtree port from Google Code is what I used; fortunately for me it was 
already packaged in the OBS: http://software.opensuse.org/package/mtree
It looks like its only build dependency is openssl-devel, so it's not hard to 
build.  I'm sure there are other utilities that accomplish the same thing (e.g. 
tripwire), but I was familiar with mtree from BSD-land, so it's what I used.

Backtracking a bit: when I first saw these strange errors, running 'rpm -Va' 
(verify the installed files of all packages; there's probably a dpkg 
equivalent, but I don't know it off-hand) would sometimes, but not 
consistently, report errors.
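If you do want a Debian-side check, debsums (a separate package) or dpkg 
--verify on newer dpkg versions are probably the closest equivalents; I 
haven't used either here, so treat this as a rough sketch:
    # Debian rough equivalents of 'rpm -Va' (untested by me; a sketch only)
    apt-get install debsums
    debsums -c             # list files whose checksums differ from the package
    # or, with dpkg >= 1.17:
    dpkg --verify          # rpm -V style verification of all packages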

I decided that perhaps I needed a bigger dataset.  I had been playing with 
zfsonlinux on another box, which had several kernel trees extracted for that, 
so I tarballed the build dirs (2.6GB, 171k files), checksummed them with mtree, 
then copied the tarball and checksum file to the boxen with problems and 
verified them there.  I actually had to boot into a known-good kernel (in my 
case, kernel-default rather than kernel-xen) to get a clean untar.
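
Concretely, the round trip looked something like this (hostnames and paths 
below are placeholders, not the real ones):
    # on a known-good box: record checksums, then tarball the trees
    cd /data/kernel-trees
    mtree -c -K sha256digest > /tmp/kernel-trees.mtree
    cd /data
    tar czf /tmp/kernel-trees.tar.gz kernel-trees
    scp /tmp/kernel-trees.tar.gz /tmp/kernel-trees.mtree problembox:/tmp/

    # on the problem box: untar and verify against the recorded checksums
    cd /scratch
    tar xzf /tmp/kernel-trees.tar.gz
    cd kernel-trees
    mtree -f /tmp/kernel-trees.mtree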

Under the problematic kernels, a small number of files would fail to verify 
(which files failed tended to change, but I would almost always get some 
errors).  Occasionally the filesystem would also report I/O errors (much more 
likely under btrfs than xfs or ext3), but after rebooting and running fsck, 
xfs_repair, btrfs scrub, etc., the FS would check out clean.
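
For reference, the post-reboot checks were along these lines (device and 
mount-point names here are placeholders):
    # ext3: force a full check on the unmounted filesystem
    fsck -f /dev/sdXN
    # xfs: no-modify check on the unmounted filesystem
    xfs_repair -n /dev/sdXN
    # btrfs: scrub runs against the mounted filesystem
    btrfs scrub start -B /mnt/point
    btrfs scrub status /mnt/point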

Basic mtree usage--
  Generate checksum file:
    1) cd /path/to/testroot
    2) mtree -c -K sha256digest > /path/to/checksumfile  [outside testroot]
  Verify:
    1) cd /path/to/testroot
    2) mtree -f /path/to/checksumfile
As with diff, only differences (in file size, mode, checksum, etc.) are 
reported; no output means everything verified.
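Since the failures aren't deterministic, looping the verification helps when 
hunting for a reproducible case; assuming the Linux port keeps BSD mtree's 
behaviour of exiting nonzero on a mismatch, something like this works:
    cd /path/to/testroot
    for i in $(seq 1 20); do
        mtree -f /path/to/checksumfile > /tmp/mtree-run-$i.log 2>&1 \
            || echo "run $i: differences found, see /tmp/mtree-run-$i.log"
    done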

> Do you know what kernel and Xen versions were in SP2/3, and what 
> specifically was fixed in the kernel they gave you?
SLES 11 SP2 and SP3 seem to be based on the same 3.0.x kernel tree (SP1, which 
was unaffected, was on 2.6.32.x).  While SP2 was still supported (it has since 
dropped out of support), the versions tended to track closely but not exactly.  
Xen in SP3 is 4.2.4; SP2 was on 4.1.x.  In a bit of fortuitous timing, the 
official kernel update for SLES 11 SP3 was released yesterday; the version with 
the fix is 3.0.101-0.35.1.  The relevant changelog entry is:
====
* Thu Jun 05 2014 jbeul...@suse.com
- swiotlb: don't assume PA 0 is invalid (bnc#865882).
====
Unfortunately that bug is private, even to me, but the git tree is public:
http://kernel.opensuse.org/cgit/kernel-source/commit/?id=0a9fc1a8654e9f62d7a8173fef83c6949ed67e35
http://kernel.opensuse.org/cgit/kernel-source/commit/?h=SLE11-SP3&id=4461f4df6e363235e2ef3b61c41617f7c22dc510

The master (aka opensuse-factory) branch is on 3.16 (it was 3.15 at the time of 
this commit), while SLE11-SP3 remains on 3.0.x with backported fixes.  This may 
not be the bug you're hitting, but if you can find a reproducible test case, 
that's half the battle.

-Andrew

