On Oct 1, 2013, at 2:41 PM, pacemaker-requ...@oss.clusterlabs.org wrote:
> Message: 4
> Date: Tue, 1 Oct 2013 19:22:12 +0200
> From: Dejan Muhamedagic <deja...@fastmail.fm>
> To: pacemaker@oss.clusterlabs.org
> Subject: Re: [Pacemaker] Bug? Resources running with realtime priority
>       - possibly causing monitor timeouts
> Message-ID: <20131001172212.GC6892@walrus.homenet>
> Content-Type: text/plain; charset=us-ascii
> 
> Hi,
> 
> On Tue, Oct 01, 2013 at 11:07:35AM +0200, Joschi Brauchle wrote:
>> Hello everyone,
>> 
>> on two (recently upgraded) SLES11SP3 machines, we are running an
>> active/passive NFS fileserver and several other high availability
>> services using corosync + pacemaker (see version numbers below).
>> 
>> We are having severe problems with resource monitors timing out
>> during our system backup at night, where the active machine is under
>> high IO load. These problems did not exist under SLES11SP1, from
>> which we just upgraded some days ago.
>> 
>> After some diagnosis, it turns out that all cluster resources
>> started by pacemaker are running with realtime priority, including
>> our backup service. This does not seem correct!
>> 
> Oops. Looks like neither corosync nor lrmd resets the priority and
> scheduler for its children.
> 
>> As far as we remember, the resources were not running with realtime
>> priority under SLES11SP1. Hence, this looks like a bug in the more
>> recent pacemaker/corosync version?!?
> 
> Looks like it. Can you please open a support call?
Dejan,

Any idea if SP2 is also affected?

Fortunately, it shouldn't affect me, since I'm just managing VMs (and mounting 
filesystems) with pacemaker, and not spawning a bunch of long-running processes.
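
If you want to check whether a given node is affected, the scheduling 
class of the processes pacemaker spawns should show it.  Here's a quick 
sketch, assuming procps ps and util-linux chrt (<pid> is a placeholder):

    # List processes running with a realtime scheduling class
    # (FF = SCHED_FIFO, RR = SCHED_RR); normal ones show TS (SCHED_OTHER).
    ps -eo pid,cls,rtprio,comm | awk '$2 == "FF" || $2 == "RR"'

    # Inspect a single process:
    chrt -p <pid>

    # Stopgap until a proper fix: demote a process back to SCHED_OTHER.
    chrt --other --pid 0 <pid>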


Joschi,

As a workaround (and potential best practice anyway), try setting 
elevator=deadline in the kernel boot parameters.  This will give better 
response under heavy I/O load.  I'm not sure how effective it will be with 
everything running realtime priority, but assuming you're I/O-bound rather than 
CPU-bound, it should help, and is something I now set on all cluster members.
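
For reference, on SLES 11 that means adding it to the kernel line in 
/boot/grub/menu.lst, roughly like this (paths and devices here are 
examples; adjust for your setup):

    kernel /boot/vmlinuz root=/dev/system/root elevator=deadline

You can also switch schedulers at runtime, per block device, without a 
reboot (sda is just an example):

    echo deadline > /sys/block/sda/queue/scheduler
    cat /sys/block/sda/queue/scheduler   # active scheduler is shown in [brackets]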

Before I set this, during periods of high I/O on the SAN (such as migrating 
several VMs at once during 'rcopenais stop' on one node), monitor operations 
would occasionally time out, and pacemaker would needlessly stop and restart 
unrelated VMs, thinking they had failed.  Since setting it, I've had no more 
problems.


Andrew Daugherity
Systems Analyst
Division of Research, Texas A&M University
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
