I haven't been able to find any documentation outside of the man pages to help 
troubleshoot this, so I've come to the experts...

I'm attempting to set up the following stack:

Services:              NFS and Samba
                 ------------------------
Filesystems:     /mnt/media | /mnt/datusr
                 ------------------------
Replicated LVMs: vgrData0   | vgrData1
                 ------------------------
Block Devices:   drbd0      | drbd1
                 ------------------------
Underlying LVMs: vgData0    | vgData1
                 ------------------------
Disks:           sdb1       | sdb2
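
Brought up by hand, the first column looks roughly like the sketch below
("data0" is the DRBD resource name used in the crm configuration further
down, and the init-script names match the lsb: resources there):

drbdadm up data0                        # bring up drbd0 on top of vgData0
drbdadm primary data0                   # promote this node
vgchange -a y vgrData0                  # activate the replicated VG on drbd0
mount -t ext4 /dev/vgrData0/lvrData0 /mnt/media
service nfs start
service smb start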

I'm able to get all of this to work manually (as above), without the 
Heartbeat service running. My eventual intended configuration is:

crm configure
primitive drbd_data0 ocf:linbit:drbd \
        params drbd_resource="data0" \
        op monitor interval="15s"
primitive drbd_data1 ocf:linbit:drbd \
        params drbd_resource="data1" \
        op monitor interval="15s"
ms ms_drbd_data0 drbd_data0 \
        meta master-max="1" master-node-max="1" \
        clone-max="2" clone-node-max="1" notify="true"
ms ms_drbd_data1 drbd_data1 \
        meta master-max="1" master-node-max="1" \
        clone-max="2" clone-node-max="1" notify="true"
primitive lvm_data0 ocf:heartbeat:LVM \
        params volgrpname="vgrData0" exclusive="yes" \
        op monitor depth="0" timeout="30" interval="10"
primitive lvm_data1 ocf:heartbeat:LVM \
        params volgrpname="vgrData1" exclusive="yes" \
        op monitor depth="0" timeout="30" interval="10"
primitive fs_data0 ocf:heartbeat:Filesystem \
        params device="/dev/vgrData0/lvrData0" \
        directory="/mnt/media" fstype="ext4"
primitive fs_data1 ocf:heartbeat:Filesystem \
        params device="/dev/vgrData1/lvrData1" \
        directory="/mnt/datusr" fstype="ext4"
primitive ip_data ocf:heartbeat:IPaddr2 params ip="192.168.67.101" nic="eth0"
primitive svc_nfs lsb:nfs
primitive svc_samba lsb:smb
colocation col_data00 inf: ms_drbd_data0:Master ms_drbd_data1:Master
colocation col_data01 inf: ms_drbd_data0:Master lvm_data0
colocation col_data02 inf: ms_drbd_data0:Master fs_data0
colocation col_data03 inf: ms_drbd_data0:Master lvm_data1
colocation col_data04 inf: ms_drbd_data0:Master fs_data1
colocation col_data05 inf: ms_drbd_data0:Master ip_data
colocation col_data06 inf: ms_drbd_data0:Master svc_nfs
colocation col_data07 inf: ms_drbd_data0:Master svc_samba
order ord_data00 inf: ms_drbd_data0:promote ms_drbd_data1:promote
order ord_data01 inf: ms_drbd_data0:promote lvm_data0:start
order ord_data02 inf: lvm_data0:start fs_data0:start
order ord_data03 inf: ms_drbd_data1:promote lvm_data1:start
order ord_data04 inf: lvm_data1:start fs_data1:start
order ord_data05 inf: fs_data0:start fs_data1:start
order ord_data06 inf: fs_data1:start ip_data:start
order ord_data07 inf: ip_data:start svc_nfs:start
order ord_data08 inf: ip_data:start svc_samba:start
commit
bye
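
(As an aside, I suspect most of the colocation/order sprawl above could be 
collapsed by putting the non-DRBD resources into a group. Something like the 
untested sketch below should express the same intent, keeping col_data00 and 
ord_data00 to tie the two masters together:)

crm configure
group grp_data lvm_data0 fs_data0 lvm_data1 fs_data1 ip_data svc_nfs svc_samba
colocation col_grp inf: grp_data ms_drbd_data0:Master
order ord_grp inf: ms_drbd_data0:promote grp_data:start
commit
bye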

However, so far I've only been able to get the following subset to work:

crm configure
primitive drbd_data0 ocf:linbit:drbd \
        params drbd_resource="data0" \
        op monitor interval="15s"
primitive drbd_data1 ocf:linbit:drbd \
        params drbd_resource="data1" \
        op monitor interval="15s"
ms ms_drbd_data0 drbd_data0 \
        meta master-max="1" master-node-max="1" \
        clone-max="2" clone-node-max="1" notify="true"
ms ms_drbd_data1 drbd_data1 \
        meta master-max="1" master-node-max="1" \
        clone-max="2" clone-node-max="1" notify="true"
primitive lvm_data0 ocf:heartbeat:LVM \
        params volgrpname="vgrData0" exclusive="yes" \
        op monitor depth="0" timeout="30" interval="10"
primitive fs_data0 ocf:heartbeat:Filesystem \
        params device="/dev/vgrData0/lvrData0" \
        directory="/mnt/media" fstype="ext4"
colocation col_data00 inf: ms_drbd_data0:Master ms_drbd_data1:Master
colocation col_data01 inf: ms_drbd_data0:Master lvm_data0
colocation col_data02 inf: ms_drbd_data0:Master fs_data0
order ord_data00 inf: ms_drbd_data0:promote ms_drbd_data1:promote
order ord_data01 inf: ms_drbd_data0:promote lvm_data0:start
order ord_data02 inf: lvm_data0:start fs_data0:start
commit
bye

And to get the above to work, I have to issue:

crm resource failcount lvm_data0 delete node01
crm resource failcount lvm_data0 delete node02
crm resource failcount fs_data0 delete node01
crm resource failcount fs_data0 delete node02
crm resource cleanup lvm_data0
crm resource cleanup fs_data0
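
(The same sequence as a loop, for convenience, since it has to be rerun 
after every reboot:)

for rsc in lvm_data0 fs_data0; do
    for node in node01 node02; do
        crm resource failcount $rsc delete $node
    done
    crm resource cleanup $rsc
done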

Then everything starts just fine on its own. After a reboot, however, it 
falls over again. The only resources that start reliably are the DRBD 
resources.

Looking through the logs (attached), it appears Pacemaker may be attempting 
to verify that the replicated LV (/dev/vgrData0/lvrData0) is down before 
starting the DRBD resources. Since the replicated LVs use the DRBD devices 
as their backing block devices, this first scan (the initial probe) fails, 
and I think this is where the failcount for the replicated LV hits INFINITY 
before we've ever even tried to start the LV.
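
For reference, which devices LVM will scan is controlled by the filter 
setting in /etc/lvm/lvm.conf, and the DRBD documentation suggests tightening 
it for nested LVM-over-DRBD stacks like this one. A sketch for this layout 
(untested; the exact regexes are my assumption):

# /etc/lvm/lvm.conf
devices {
    # Only scan the DRBD devices (vgrData0/vgrData1) and the sdb
    # partitions (vgData0/vgData1); reject everything else.
    filter = [ "a|^/dev/drbd.*|", "a|^/dev/sdb.*|", "r|.*|" ]
    # Don't trust cached scan results between runs.
    write_cache_state = 0
}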

So, assuming my analysis is correct, what ideas might you have for the best 
way to address this? I believe I either need a way to prevent Pacemaker from 
probing for the replicated LV before the DRBD resources are online, or a way 
to have Pacemaker automatically clear/clean up the failcounts/status for the 
LV once the DRBD resources come online.
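
For the second option, the failure-timeout meta attribute looks like it 
might do this automatically. A sketch of what I have in mind (untested, and 
I'm assuming failure-timeout and cluster-recheck-interval behave as the 
documentation describes):

crm configure
# expire failcounts 60s after the most recent failure
rsc_defaults failure-timeout="60s"
# re-run the policy engine periodically so the expiry is noticed
property cluster-recheck-interval="60s"
commit
bye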

Thoughts? Suggestions?

Thanks in advance.

DJ

Attachment: pacemakerlog01-node01.log
