Hi Andreas,
> hmm ... what is that fence-peer script doing? If you want to use
> resource-level fencing with the help of dopd, activate the
> drbd-peer-outdater script in the line above ... and double check if the
> path is correct

fence-peer is just a wrapper for drbd-peer-outdater that does some
additional logging. In my testing dopd has been working well.

>> I am thinking of making the following changes to the CIB (as per the
>> official DRBD guide,
>> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html)
>> in order to add the DRBD lsb service and require that it start before
>> the ocf:linbit:drbd resources. Does this look correct?
>
> Where did you read that? No, deactivate the startup of DRBD on system
> boot and let Pacemaker manage it completely.
>
>> primitive p_drbd-init lsb:drbd op monitor interval="30"
>> colocation c_drbd_together inf: p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master ms_drbd_mount2:Master
>> order drbd_init_first inf: ms_drbd_vmstore:promote ms_drbd_mount1:promote ms_drbd_mount2:promote p_drbd-init:start
>>
>> This doesn't seem to require that drbd be also running on the node where
>> the ocf:linbit:drbd resources are slave (which it would need to do to be
>> a DRBD SyncTarget) - how can I ensure that drbd is running everywhere?
>> (clone cl_drbd p_drbd-init ?)
>
> This is really not needed.

I was following the official DRBD Users Guide:
http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html

If I am understanding your previous message correctly, I do not need to add
a lsb primitive for the drbd daemon? It will be started/stopped/managed
automatically by my ocf:linbit:drbd resources (and I can remove the
/etc/rc* symlinks)?

Thanks,

Andrew

----- Original Message -----
From: "Andreas Kurz" <andr...@hastexo.com>
To: pacemaker@oss.clusterlabs.org
Sent: Wednesday, March 28, 2012 7:27:34 AM
Subject: Re: [Pacemaker] Nodes will not promote DRBD resources to master on failover

On 03/28/2012 12:13 AM, Andrew Martin wrote:
> Hi Andreas,
>
> Thanks, I've updated the colocation rule to be in the correct order. I
> also enabled the STONITH resource (this was temporarily disabled before
> for some additional testing). DRBD has its own network connection over
> the br1 interface (192.168.5.0/24 network), a direct crossover cable
> between node1 and node2:
>
> global { usage-count no; }
> common {
>     syncer { rate 110M; }
> }
> resource vmstore {
>     protocol C;
>     startup {
>         wfc-timeout 15;
>         degr-wfc-timeout 60;
>     }
>     handlers {
>         #fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
>         fence-peer "/usr/local/bin/fence-peer";

hmm ... what is that fence-peer script doing? If you want to use
resource-level fencing with the help of dopd, activate the
drbd-peer-outdater script in the line above ... and double check if the
path is correct

>         split-brain "/usr/lib/drbd/notify-split-brain.sh m...@example.com";
>     }
>     net {
>         after-sb-0pri discard-zero-changes;
>         after-sb-1pri discard-secondary;
>         after-sb-2pri disconnect;
>         cram-hmac-alg md5;
>         shared-secret "xxxxx";
>     }
>     disk {
>         fencing resource-only;
>     }
>     on node1 {
>         device /dev/drbd0;
>         disk /dev/sdb1;
>         address 192.168.5.10:7787;
>         meta-disk internal;
>     }
>     on node2 {
>         device /dev/drbd0;
>         disk /dev/sdf1;
>         address 192.168.5.11:7787;
>         meta-disk internal;
>     }
> }
> # and similar for mount1 and mount2
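
(Aside: as mentioned at the top, /usr/local/bin/fence-peer only wraps
drbd-peer-outdater and adds some logging. It amounts to something like the
following -- a simplified sketch rather than the exact script, with an
illustrative log tag and the same "-t 5" timeout as the commented-out line
above:)

#!/bin/sh
# DRBD calls this as its fence-peer handler and exports the resource
# name in the DRBD_RESOURCE environment variable.
logger -t fence-peer "outdating peer for resource ${DRBD_RESOURCE:-unknown} (args: $*)"
/usr/lib/heartbeat/drbd-peer-outdater -t 5 "$@"
rc=$?
logger -t fence-peer "drbd-peer-outdater exited with status $rc"
# Hand dopd's exit code back to DRBD unchanged; DRBD interprets it to
# decide whether the peer was successfully outdated.
exit $rc
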
> Also, here is my ha.cf. It uses both the direct link between the nodes
> (br1) and the shared LAN network on br0 for communicating:
>
> autojoin none
> mcast br0 239.0.0.43 694 1 0
> bcast br1
> warntime 5
> deadtime 15
> initdead 60
> keepalive 2
> node node1
> node node2
> node quorumnode
> crm respawn
> respawn hacluster /usr/lib/heartbeat/dopd
> apiauth dopd gid=haclient uid=hacluster
>
> I am thinking of making the following changes to the CIB (as per the
> official DRBD guide,
> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html)
> in order to add the DRBD lsb service and require that it start before
> the ocf:linbit:drbd resources. Does this look correct?

Where did you read that? No, deactivate the startup of DRBD on system
boot and let Pacemaker manage it completely.

> primitive p_drbd-init lsb:drbd op monitor interval="30"
> colocation c_drbd_together inf: p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master ms_drbd_mount2:Master
> order drbd_init_first inf: ms_drbd_vmstore:promote ms_drbd_mount1:promote ms_drbd_mount2:promote p_drbd-init:start
>
> This doesn't seem to require that drbd be also running on the node where
> the ocf:linbit:drbd resources are slave (which it would need to do to be
> a DRBD SyncTarget) - how can I ensure that drbd is running everywhere?
> (clone cl_drbd p_drbd-init ?)

This is really not needed.

Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now
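
(Assuming the answer to my /etc/rc* question at the top is yes, removing the
boot-time start of DRBD on Ubuntu 10.04 should just be a matter of dropping
the rc symlinks -- roughly the following; the "disable" variant applies only
if this version of sysv-rc supports it:)

update-rc.d -f drbd remove      # drop the /etc/rc*.d start/stop symlinks
# or, to keep the symlinks but turn them off:
# update-rc.d drbd disable
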
> Thanks,
>
> Andrew
>
> ------------------------------------------------------------------------
> *From: *"Andreas Kurz" <andr...@hastexo.com>
> *To: *pacemaker@oss.clusterlabs.org
> *Sent: *Monday, March 26, 2012 5:56:22 PM
> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
> master on failover
>
> On 03/24/2012 08:15 PM, Andrew Martin wrote:
>> Hi Andreas,
>>
>> My complete cluster configuration is as follows:
>>
>> ============
>> Last updated: Sat Mar 24 13:51:55 2012
>> Last change: Sat Mar 24 13:41:55 2012
>> Stack: Heartbeat
>> Current DC: node2 (9100538b-7a1f-41fd-9c1a-c6b4b1c32b18) - partition with quorum
>> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
>> 3 Nodes configured, unknown expected votes
>> 19 Resources configured.
>> ============
>>
>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE (standby)
>> Online: [ node2 node1 ]
>>
>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
>>     Masters: [ node2 ]
>>     Slaves: [ node1 ]
>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
>>     Masters: [ node2 ]
>>     Slaves: [ node1 ]
>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>>     Masters: [ node2 ]
>>     Slaves: [ node1 ]
>> Resource Group: g_vm
>>     p_fs_vmstore (ocf::heartbeat:Filesystem): Started node2
>>     p_vm (ocf::heartbeat:VirtualDomain): Started node2
>> Clone Set: cl_daemons [g_daemons]
>>     Started: [ node2 node1 ]
>>     Stopped: [ g_daemons:2 ]
>> Clone Set: cl_sysadmin_notify [p_sysadmin_notify]
>>     Started: [ node2 node1 ]
>>     Stopped: [ p_sysadmin_notify:2 ]
>> stonith-node1 (stonith:external/tripplitepdu): Started node2
>> stonith-node2 (stonith:external/tripplitepdu): Started node1
>> Clone Set: cl_ping [p_ping]
>>     Started: [ node2 node1 ]
>>     Stopped: [ p_ping:2 ]
>>
>> node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \
>>     attributes standby="off"
>> node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \
>>     attributes standby="off"
>> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \
>>     attributes standby="on"
>> primitive p_drbd_mount2 ocf:linbit:drbd \
>>     params drbd_resource="mount2" \
>>     op monitor interval="15" role="Master" \
>>     op monitor interval="30" role="Slave"
>> primitive p_drbd_mount1 ocf:linbit:drbd \
>>     params drbd_resource="mount1" \
>>     op monitor interval="15" role="Master" \
>>     op monitor interval="30" role="Slave"
>> primitive p_drbd_vmstore ocf:linbit:drbd \
>>     params drbd_resource="vmstore" \
>>     op monitor interval="15" role="Master" \
>>     op monitor interval="30" role="Slave"
>> primitive p_fs_vmstore ocf:heartbeat:Filesystem \
>>     params device="/dev/drbd0" directory="/vmstore" fstype="ext4" \
>>     op start interval="0" timeout="60s" \
>>     op stop interval="0" timeout="60s" \
>>     op monitor interval="20s" timeout="40s"
>> primitive p_libvirt-bin upstart:libvirt-bin \
>>     op monitor interval="30"
>> primitive p_ping ocf:pacemaker:ping \
>>     params name="p_ping" host_list="192.168.1.10 192.168.1.11" multiplier="1000" \
>>     op monitor interval="20s"
>> primitive p_sysadmin_notify ocf:heartbeat:MailTo \
>>     params email="m...@example.com" \
>>     params subject="Pacemaker Change" \
>>     op start interval="0" timeout="10" \
>>     op stop interval="0" timeout="10" \
>>     op monitor interval="10" timeout="10"
>> primitive p_vm ocf:heartbeat:VirtualDomain \
>>     params config="/vmstore/config/vm.xml" \
>>     meta allow-migrate="false" \
>>     op start interval="0" timeout="120s" \
>>     op stop interval="0" timeout="120s" \
>>     op monitor interval="10" timeout="30"
>> primitive stonith-node1 stonith:external/tripplitepdu \
>>     params pdu_ipaddr="192.168.1.12" pdu_port="1" pdu_username="xxx" pdu_password="xxx" hostname_to_stonith="node1"
>> primitive stonith-node2 stonith:external/tripplitepdu \
>>     params pdu_ipaddr="192.168.1.12" pdu_port="2" pdu_username="xxx" pdu_password="xxx" hostname_to_stonith="node2"
>> group g_daemons p_libvirt-bin
>> group g_vm p_fs_vmstore p_vm
>> ms ms_drbd_mount2 p_drbd_mount2 \
>>     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
>> ms ms_drbd_mount1 p_drbd_mount1 \
>>     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
>> ms ms_drbd_vmstore p_drbd_vmstore \
>>     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
>> clone cl_daemons g_daemons
>> clone cl_ping p_ping \
>>     meta interleave="true"
>> clone cl_sysadmin_notify p_sysadmin_notify
>> location l-st-node1 stonith-node1 -inf: node1
>> location l-st-node2 stonith-node2 -inf: node2
>> location l_run_on_most_connected p_vm \
>>     rule $id="l_run_on_most_connected-rule" p_ping: defined p_ping
>> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm
>
> As Emmanuel already said, g_vm has to come first in this colocation
> constraint .... g_vm must be colocated with the DRBD masters.
>
>> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote ms_drbd_mount1:promote ms_drbd_mount2:promote cl_daemons:start g_vm:start
>> property $id="cib-bootstrap-options" \
>>     dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
>>     cluster-infrastructure="Heartbeat" \
>>     stonith-enabled="false" \
>>     no-quorum-policy="stop" \
>>     last-lrm-refresh="1332539900" \
>>     cluster-recheck-interval="5m" \
>>     crmd-integration-timeout="3m" \
>>     shutdown-escalation="5m"
>>
>> The STONITH plugin is a custom plugin I wrote for the Tripp-Lite
>> PDUMH20ATNET that I'm using as the STONITH device:
>> http://www.tripplite.com/shared/product-pages/en/PDUMH20ATNET.pdf
>
> And why aren't you using it? .... stonith-enabled="false"
>
>> As you can see, I left the DRBD service to be started by the operating
>> system (as an lsb script at boot time); however, Pacemaker controls
>> actually bringing up/taking down the individual DRBD devices.
>
> Don't start drbd on system boot; give Pacemaker full control.
>
>> The behavior I observe is as follows: I issue "crm resource migrate p_vm"
>> on node1 and it fails over successfully to node2. During this time, node2
>> fences node1's DRBD devices (using dopd) and marks them as Outdated.
>> Meanwhile node2's DRBD devices are UpToDate. I then shut down both nodes
>> and bring them back up. They reconnect to the cluster (with quorum);
>> node1's DRBD devices are still Outdated, as expected, and node2's DRBD
>> devices are still UpToDate, as expected. At this point, DRBD starts on
>> both nodes, however node2 will not set DRBD as master:
>>
>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE (standby)
>> Online: [ node2 node1 ]
>>
>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
>>     Slaves: [ node1 node2 ]
>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
>>     Slaves: [ node1 node2 ]
>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>>     Slaves: [ node1 node2 ]
>
> There should really be no interruption of the drbd replication on vm
> migration that activates the dopd ... drbd has its own direct network
> connection?
>
> Please share your ha.cf file and your drbd configuration. Watch out for
> drbd messages in your kernel log file; they should give you additional
> information on when/why the drbd connection was lost.
>
> Regards,
> Andreas
>
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
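
(Aside, since log filtering comes up again just below: plain grep over the
standard Ubuntu log locations goes a long way. The commands here are only
illustrative, using the default log paths and the Pacemaker 1.1 daemon
names:)

# DRBD connection-state changes, fencing/outdate events:
grep -i drbd /var/log/kern.log

# Pacemaker's reasoning (promotions, failed ops, transitions):
grep -E 'pengine|crmd|lrmd' /var/log/daemon.log | grep -iE 'promote|master|fail|error'
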
>> I am having trouble sorting through the logging information because
>> there is so much of it in /var/log/daemon.log, but I can't find an
>> error message printed about why it will not promote node2. At this point
>> the DRBD devices are as follows:
>>
>> node2: cstate = WFConnection, dstate = UpToDate
>> node1: cstate = StandAlone, dstate = Outdated
>>
>> I don't see any reason why node2 can't become DRBD master, or am I
>> missing something? If I do "drbdadm connect all" on node1, then the
>> cstate on both nodes changes to "Connected" and node2 immediately
>> promotes the DRBD resources to master. Any ideas on why I'm observing
>> this incorrect behavior?
>>
>> Any tips on how I can better filter through the pacemaker/heartbeat logs
>> or how to get additional useful debug information?
>>
>> Thanks,
>>
>> Andrew
>>
>> ------------------------------------------------------------------------
>> *From: *"Andreas Kurz" <andr...@hastexo.com>
>> *To: *pacemaker@oss.clusterlabs.org
>> *Sent: *Wednesday, 1 February, 2012 4:19:25 PM
>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
>> master on failover
>>
>> On 01/25/2012 08:58 PM, Andrew Martin wrote:
>>> Hello,
>>>
>>> Recently I finished configuring a two-node cluster with pacemaker 1.1.6
>>> and heartbeat 3.0.5 on nodes running Ubuntu 10.04. This cluster includes
>>> the following resources:
>>> - primitives for DRBD storage devices
>>> - primitives for mounting the filesystem on the DRBD storage
>>> - primitives for some mount binds
>>> - primitive for starting apache
>>> - primitives for starting samba and nfs servers (following instructions
>>>   here <http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf>)
>>> - primitives for exporting nfs shares (ocf:heartbeat:exportfs)
>>
>> not enough information ... please share at least your complete cluster
>> configuration
>>
>> Regards,
>> Andreas
>>
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>
>>> Perhaps this is best described through the output of crm_mon:
>>>
>>> Online: [ node1 node2 ]
>>>
>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)
>>>     p_drbd_mount1:0 (ocf::linbit:drbd): Started node2 (unmanaged)
>>>     p_drbd_mount1:1 (ocf::linbit:drbd): Started node1 (unmanaged) FAILED
>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>>>     p_drbd_mount2:0 (ocf::linbit:drbd): Master node1 (unmanaged) FAILED
>>>     Slaves: [ node2 ]
>>> Resource Group: g_core
>>>     p_fs_mount1 (ocf::heartbeat:Filesystem): Started node1
>>>     p_fs_mount2 (ocf::heartbeat:Filesystem): Started node1
>>>     p_ip_nfs (ocf::heartbeat:IPaddr2): Started node1
>>> Resource Group: g_apache
>>>     p_fs_mountbind1 (ocf::heartbeat:Filesystem): Started node1
>>>     p_fs_mountbind2 (ocf::heartbeat:Filesystem): Started node1
>>>     p_fs_mountbind3 (ocf::heartbeat:Filesystem): Started node1
>>>     p_fs_varwww (ocf::heartbeat:Filesystem): Started node1
>>>     p_apache (ocf::heartbeat:apache): Started node1
>>> Resource Group: g_fileservers
>>>     p_lsb_smb (lsb:smbd): Started node1
>>>     p_lsb_nmb (lsb:nmbd): Started node1
>>>     p_lsb_nfsserver (lsb:nfs-kernel-server): Started node1
>>>     p_exportfs_mount1 (ocf::heartbeat:exportfs): Started node1
>>>     p_exportfs_mount2 (ocf::heartbeat:exportfs): Started node1
>>>
>>> I have read through the Pacemaker Explained documentation
>>> <http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained>,
>>> however I could not find a way to further debug these problems. First, I
>>> put node1 into standby mode to attempt failover to the other node
>>> (node2). Node2 appeared to start the transition to master, however it
>>> failed to promote the DRBD resources to master (the first step). I have
>>> attached a copy of this session in commands.log and additional excerpts
>>> from /var/log/syslog during important steps. I have attempted everything
>>> I can think of to try and start the DRBD resource (e.g.
>>> start/stop/promote/manage/cleanup under crm resource, restarting
>>> heartbeat) but cannot bring it out of the slave state.
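
(For reference, the "start/stop/promote/manage/cleanup under crm resource"
commands mentioned above are of this general form -- shown with one of the
resource names from the configuration earlier in the thread, purely as an
illustration of the syntax, not a record of the exact sequence that was run:)

crm resource cleanup p_drbd_mount1     # clear the failed-operation history
crm resource promote ms_drbd_mount1    # ask the cluster to promote the master/slave set
crm resource stop ms_drbd_mount1
crm resource start ms_drbd_mount1
crm resource unmanage ms_drbd_mount1   # temporarily hand control back to the admin
crm resource manage ms_drbd_mount1     # ...and return it to the cluster
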
>>> However, if I set it to unmanaged and then run "drbdadm primary all" in
>>> the terminal, pacemaker is satisfied and continues starting the rest of
>>> the resources. It then failed when attempting to mount the filesystem
>>> for mount2, the p_fs_mount2 resource. I attempted to mount the
>>> filesystem myself and was successful. I then unmounted it, ran cleanup
>>> on p_fs_mount2, and then it mounted. The rest of the resources started
>>> as expected until the p_exportfs_mount2 resource, which failed as
>>> follows:
>>>
>>> p_exportfs_mount2 (ocf::heartbeat:exportfs): started node2 (unmanaged) FAILED
>>>
>>> I ran cleanup on this and it started; however, when running this test
>>> earlier today no command could successfully start this exportfs
>>> resource.
>>>
>>> How can I configure pacemaker to better resolve these problems and be
>>> able to bring the node up successfully on its own? What can I check to
>>> determine why these failures are occurring? /var/log/syslog did not seem
>>> to contain very much useful information regarding why the failures
>>> occurred.
>>>
>>> Thanks,
>>>
>>> Andrew

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org