Hi

I wonder if anyone on the list can help me - I’m new to Pacemaker so apologies 
if I’m posting in the wrong place.

I have a four-node cluster running Pacemaker 1.1.10 with Corosync 1.4.1 on 
CentOS 6.4.  Resource-wise I have eight Lustre storage targets on an iSCSI SAN 
- two per node, each pair colocated with a single heartbeat IP address.  I 
have redundant Corosync rings, STONITH is configured, and failover in general 
works very well.

My problem is that three of the storage targets refuse to mount via Pacemaker 
on particular nodes, for no reason I can identify.  These resources won't 
start on the nodes the constraints assign them to, although they start happily 
elsewhere - which is fine while all nodes are up, but not if certain nodes fail.

If I stop the resources I can manually mount the targets on the node in 
question without any problem - so it seems to be a Pacemaker problem rather 
than a filesystem one.
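
By way of illustration, something like this works fine when run by hand 
(device and mount point taken from the ost-08 log entry further down, so 
treat it as illustrative rather than my exact command line):

  mount -t lustre /dev/sdi /lustre/ost-08   # mounts cleanly when run manually
  umount /lustre/ost-08                     # unmounted again before handing back to Pacemaker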

My resources look like this: http://pastebin.com/qQ1BR1yW and constraints like 
this: http://pastebin.com/4w85MWUV
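
In case the pastebins expire, the definitions follow this general shape - the 
parameter values below are illustrative placeholders, not copied from the live 
config (that's what the pastebin links are for):

  # one Filesystem primitive per Lustre target (illustrative values)
  primitive ost-01 ocf:heartbeat:Filesystem \
          params device="/dev/sdb" directory="/lustre/ost-01" fstype="lustre" \
          op monitor interval="120s" timeout="60s" \
          op start timeout="300s" op stop timeout="300s"
  # one heartbeat IP per node (illustrative values)
  primitive oss-01-hb ocf:heartbeat:IPaddr2 \
          params ip="192.168.1.101" cidr_netmask="24" \
          op monitor interval="30s"
  # each target is colocated with its node's IP and prefers its home node
  colocation ost-01-with-oss-01-hb inf: ost-01 oss-01-hb
  location ost-01-prefers-oss-01 ost-01 100: oss-01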

crm_mon -f gives the following output:

Last updated: Fri May 30 12:02:59 2014
Last change: Fri May 30 12:02:38 2014 via crm_resource on oss-02
Stack: classic openais (with plugin)
Current DC: oss-02 - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
4 Nodes configured, 4 expected votes
16 Resources configured


Online: [ oss-01 oss-02 oss-03 oss-04 ]

ost-01  (ocf::heartbeat:Filesystem):    Started oss-01
ost-02  (ocf::heartbeat:Filesystem):    Started oss-02
stonith-oss-01  (stonith:fence_ipmilan):        Started oss-03
stonith-oss-02  (stonith:fence_ipmilan):        Started oss-04
ost-03  (ocf::heartbeat:Filesystem):    Started oss-04
stonith-oss-03  (stonith:fence_ipmilan):        Started oss-01
ost-05  (ocf::heartbeat:Filesystem):    Started oss-01
ost-06  (ocf::heartbeat:Filesystem):    Started oss-02
ost-07  (ocf::heartbeat:Filesystem):    Started oss-04
ost-04  (ocf::heartbeat:Filesystem):    Started oss-03
ost-08  (ocf::heartbeat:Filesystem):    Started oss-03
oss-01-hb       (ocf::heartbeat:IPaddr2):       Started oss-01
oss-02-hb       (ocf::heartbeat:IPaddr2):       Started oss-02
oss-03-hb       (ocf::heartbeat:IPaddr2):       Started oss-04
oss-04-hb       (ocf::heartbeat:IPaddr2):       Started oss-03
stonith-oss-04  (stonith:fence_ipmilan):        Started oss-02

Migration summary:
* Node oss-01: 
* Node oss-02: 
* Node oss-04: 
   ost-04: migration-threshold=1000000 fail-count=1000000 last-failure='Fri May 30 11:25:11 2014'
   ost-08: migration-threshold=1000000 fail-count=1000000 last-failure='Fri May 30 11:25:11 2014'
* Node oss-03: 
   ost-03: migration-threshold=1000000 fail-count=1000000 last-failure='Fri May 30 10:47:02 2014'

ost-03 is supposed to mount on oss-03, and ost-04 & ost-08 on oss-04, but they 
fail to do so and the colocated IP resources are therefore swapped between 
oss-03 and oss-04.
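
If I'm reading the migration summary right, the fail-count of 1000000 has 
reached the migration-threshold, so I assume those nodes won't even retry the 
resources until I clear the counts, e.g.:

  crm_resource --cleanup --resource ost-03

Please correct me if that's not the right way to reset things.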

Log entries typically look like this, which doesn’t give me much to go on:

May 30 11:25:11 oss-04 lrmd[2179]:   notice: operation_finished: ost-08_start_0:2994:stderr [ mount.lustre: mount /dev/sdi at /lustre/ost-08 failed: Unknown error 524 ]

Does anyone know what might cause this, or can anyone suggest how I might 
debug why Pacemaker can't mount these targets?
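
Would running the Filesystem agent by hand, outside Pacemaker, be a sensible 
way to reproduce this? I was thinking of something along these lines (device 
and directory taken from the log above; I'm guessing at the exact environment 
the agent expects):

  export OCF_ROOT=/usr/lib/ocf
  export OCF_RESKEY_device=/dev/sdi
  export OCF_RESKEY_directory=/lustre/ost-08
  export OCF_RESKEY_fstype=lustre
  /usr/lib/ocf/resource.d/heartbeat/Filesystem start; echo $?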

Many thanks
Stuart

Stuart Taylor
System Administrator
Edinburgh Genomics

Web: http://genomics.ed.ac.uk/
Tel: 0131 651 7403



