Re: [Pacemaker] Pacemaker mount failures

Dejan Muhamedagic Mon, 02 Jun 2014 04:01:33 -0700

Hi,

On Fri, May 30, 2014 at 12:17:00PM +0100, Stuart Taylor wrote:
> Hi
> 
> I wonder if anyone on the list can help me - I’m new to Pacemaker so 
> apologies if I’m posting in the wrong place.
> 
> I have a four-node cluster running Pacemaker 1.1.10 with Corosync 1.4.1 on 
> CentOS 6.4.  Resource-wise I have eight Lustre storage targets on an iSCSI 
> SAN - two each colocated with a single heartbeat IP address on each node.  I 
> have redundant Corosync rings and Stonith is configured, and failover in 
> general works very well.  
> 
> My problem is that three of the storage targets refuse to mount via Pacemaker 
> on particular nodes, for no particular reason I can identify.  These 
> resources won’t start on the nodes they’re configured to in the constraints - 
> which is fine if all nodes are up, but not if certain nodes fail.
> 
> If I stop the resources I can manually mount the targets on the node without 
> any problem - so it seems to be a Pacemaker, rather than filesystem problem.
> 
> My resources look like this: http://pastebin.com/qQ1BR1yW and constraints 
> like this: http://pastebin.com/4w85MWUV
> 
> crm_mon -f gives the following output:
> 
> Last updated: Fri May 30 12:02:59 2014
> Last change: Fri May 30 12:02:38 2014 via crm_resource on oss-02
> Stack: classic openais (with plugin)
> Current DC: oss-02 - partition with quorum
> Version: 1.1.10-14.el6_5.3-368c726
> 4 Nodes configured, 4 expected votes
> 16 Resources configured
> 
> 
> Online: [ oss-01 oss-02 oss-03 oss-04 ]
> 
> ost-01  (ocf::heartbeat:Filesystem):    Started oss-01
> ost-02  (ocf::heartbeat:Filesystem):    Started oss-02
> stonith-oss-01  (stonith:fence_ipmilan):        Started oss-03
> stonith-oss-02  (stonith:fence_ipmilan):        Started oss-04
> ost-03  (ocf::heartbeat:Filesystem):    Started oss-04
> stonith-oss-03  (stonith:fence_ipmilan):        Started oss-01
> ost-05  (ocf::heartbeat:Filesystem):    Started oss-01
> ost-06  (ocf::heartbeat:Filesystem):    Started oss-02
> ost-07  (ocf::heartbeat:Filesystem):    Started oss-04
> ost-04  (ocf::heartbeat:Filesystem):    Started oss-03
> ost-08  (ocf::heartbeat:Filesystem):    Started oss-03
> oss-01-hb     (ocf::heartbeat:IPaddr2):       Started oss-01
> oss-02-hb     (ocf::heartbeat:IPaddr2):       Started oss-02
> oss-03-hb     (ocf::heartbeat:IPaddr2):       Started oss-04
> oss-04-hb     (ocf::heartbeat:IPaddr2):       Started oss-03
> stonith-oss-04  (stonith:fence_ipmilan):        Started oss-02
> 
> Migration summary:
> * Node oss-01: 
> * Node oss-02: 
> * Node oss-04: 
>    ost-04: migration-threshold=1000000 fail-count=1000000 last-failure='Fri May 30 11:25:11 2014'
>    ost-08: migration-threshold=1000000 fail-count=1000000 last-failure='Fri May 30 11:25:11 2014'
> * Node oss-03: 
>    ost-03: migration-threshold=1000000 fail-count=1000000 last-failure='Fri May 30 10:47:02 2014'
> 
> ost-03 is supposed to mount on oss-03, and ost-04 & ost-08 on oss-04, but 
> they fail to do so and the colo-ed IP resources are therefore swapped between 
> oss-03 and oss-04.
> 
> Log entries typically look like this, which doesn’t give me much to go on:
> 
> May 30 11:25:11 oss-04 lrmd[2179]:   notice: operation_finished: ost-08_start_0:2994:stderr [ mount.lustre: mount /dev/sdi at /lustre/ost-08 failed: Unknown error 524 ]


The mount command itself failed, whatever the difference may be
between mounting the filesystem by hand and the way the Filesystem
RA mounts it, and whatever error 524 means.

> Does anyone know / can anyone suggest how I might debug why Pacemaker can’t 
> mount these targets?

Assuming you have recent enough resource-agents and crmsh, you
can trace the Filesystem RA, say:

# crm resource trace ost-08 start

Then clean up the resource, which should make Pacemaker try to
start ost-08 again:

# crm resource cleanup ost-08

Then look for the trace file in /var/lib/heartbeat/trace_ra.
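
To pick out the most recent trace after a retry, something like the
following may help. It is only a sketch: latest_trace is a made-up
helper name, and it assumes the trace files are named
<resource>.<action>.<timestamp> somewhere below the trace directory,
which may differ between resource-agents versions.

```shell
# Hypothetical helper: print the newest trace file for a resource/action.
# Assumption: trace files are named <resource>.<action>.<timestamp> and
# live somewhere below the trace directory (possibly in a subdirectory).
latest_trace() {
    rsc=$1
    action=$2
    dir=${3:-/var/lib/heartbeat/trace_ra}
    # find collects the matching files, ls -t sorts them newest first,
    # and head keeps only the first one. No match prints nothing.
    find "$dir" -type f -name "${rsc}.${action}.*" -exec ls -t {} + 2>/dev/null | head -n 1
}
```

After the trace and cleanup commands above it could be used like this:

# less "$(latest_trace ost-08 start)"

Once done, tracing can be switched off again with
"crm resource untrace ost-08", assuming your crmsh provides it.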

Alternatively, you can add 'set -x' somewhere in the Filesystem
RA, then look at the logs.
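
If you would rather not edit the RA, a similar effect can probably be
had by running it by hand under "sh -x" with the OCF environment set.
This is a rough sketch, not a tested procedure: run_ra_traced is a
made-up name, the default OCF_ROOT of /usr/lib/ocf is the usual
location but still an assumption, and the parameter values must match
what is actually configured for the resource.

```shell
# Sketch: run an OCF resource agent by hand with shell tracing enabled.
# Arguments: RA path, action, device, mount point, filesystem type.
# Assumption: the RA is a shell script that reads its parameters from
# OCF_RESKEY_* variables, as the heartbeat Filesystem RA does.
run_ra_traced() {
    ra=$1
    action=$2
    device=$3
    directory=$4
    fstype=$5
    # The variables are set only for this one invocation; the trace
    # produced by "sh -x" goes to stderr.
    OCF_ROOT=${OCF_ROOT:-/usr/lib/ocf} \
    OCF_RESKEY_device=$device \
    OCF_RESKEY_directory=$directory \
    OCF_RESKEY_fstype=$fstype \
    sh -x "$ra" "$action"
}
```

With the resource stopped in the cluster, and with the device and
mount point from your log (the fstype is my guess), on oss-04:

# run_ra_traced /usr/lib/ocf/resource.d/heartbeat/Filesystem start /dev/sdi /lustre/ost-08 lustre

Any extra options set on the resource would have to be passed as
further OCF_RESKEY_ variables in the same way.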

Thanks,

Dejan

> 
> Many thanks
> Stuart
> 
> Stuart Taylor
> System Administrator
> Edinburgh Genomics
> 
> Web: http://genomics.ed.ac.uk/
> Tel: 0131 651 7403
> 
> 
> -- 
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

