On 04/10/2012 04:29 PM, Andrew Martin wrote: > Hi Andreas, > > ----- Original Message ----- > >> From: "Andreas Kurz" <andr...@hastexo.com> >> To: pacemaker@oss.clusterlabs.org >> Sent: Tuesday, April 10, 2012 5:28:15 AM >> Subject: Re: [Pacemaker] Nodes will not promote DRBD resources to >> master on failover > >> On 04/10/2012 06:17 AM, Andrew Martin wrote: >>> Hi Andreas, >>> >>> Yes, I attempted to generalize hostnames and usernames/passwords in >>> the >>> archive. Sorry for making it more confusing :( >>> >>> I completely purged pacemaker from all 3 nodes and reinstalled >>> everything. I then completely rebuild the CIB by manually adding in >>> each >>> primitive/constraint one at a time and testing along the way. After >>> doing this DRBD appears to be working at least somewhat better - >>> the >>> ocf:linbit:drbd devices are started and managed by pacemaker. >>> However, >>> if for example a node is STONITHed when it comes back up it will >>> not >>> restart the ocf:linbit:drbd resources until I manually load the >>> DRBD >>> kernel module, bring the DRBD devices up (drbdadm up all), and >>> cleanup >>> the resources (e.g. crm resource cleanup ms_drbd_vmstore). Is it >>> possible that the DRBD kernel module needs to be loaded at boot >>> time, >>> independent of pacemaker? > >> No, this is done by the drbd OCF script on start. > > >>> >>> Here's the new CIB (mostly the same as before): >>> http://pastebin.com/MxrqBXMp
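As a quick sanity check that nothing outside Pacemaker is touching DRBD at boot (paths below assume the Ubuntu nodes, so this is only a sketch):

# lsmod | grep drbd
(the module should only appear after the ocf:linbit:drbd RA has started something)
# ls /etc/rc2.d/ | grep drbd
(should come back empty if the drbd init script is really disabled)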
There is that libvirt-bin upstart job resource but not cloned, producing this: Resource p_libvirt-bin (upstart::libvirt-bin) is active on 2 nodes attempting recovery ... errors. I'd say having upstart respawning libvirtd is quite fine. Removing this primitive and therefore also from the group with its dependencies is ok. >>> >>> Typically quorumnode stays in the OFFLINE (standby) state, though >>> occasionally it changes to pending. I have just tried >>> cleaning /var/lib/heartbeat/crm on quorumnode again so we will see >>> if >>> that helps keep it in the OFFLINE (standby) state. I have it >>> explicitly >>> set to standby in the CIB configuration and also created a rule to >>> prevent some of the resources from running on it? >>> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \ >>> attributes standby="on" >>> ... > >> The node should be in "ONLINE (standby)" state if you start heartbeat >> and pacemaker is enabled with "crm yes" or "crm respawn"in ha.cf > > I have never seen it listed as ONLINE (standby). Here's the ha.cf on > quorumnode: > autojoin none > mcast eth0 239.0.0.43 694 1 0 > warntime 5 > deadtime 15 > initdead 60 > keepalive 2 > node node1 > node node2 > node quorumnode > crm respawn > > And here's the ha.cf on node[12]: > autojoin none > mcast br0 239.0.0.43 694 1 0 > bcast br1 > warntime 5 > deadtime 15 > initdead 60 > keepalive 2 > node node1 > node node2 > node quorumnode > crm respawn > respawn hacluster /usr/lib/heartbeat/dopd > apiauth dopd gid=haclient uid=hacluster > > The only difference between these boxes is that quorumnode is a CentOS 5.5 > box so it is stuck at heartbeat 3.0.3, whereas node[12] are both on Ubuntu > 10.04 using the Ubuntu HA PPA, so they are running heartbeat 3.0.5. Would > this make a difference? > hmmm ... heartbeat 3.0.3 is about 2 years old IIRC and there have been some important fixes ... any heartbeat logs from quorumnode? tried using ucast for br0/eth0? Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now >>> location loc_not_on_quorumnode g_vm -inf: quorumnode >>> >>> Would it be wise to create additional constraints to prevent all >>> resources (including each ms_drbd resource) from running on it, >>> even >>> though this should be implied by standby? > >> There is no need for that. A node in standby will never run resources >> and if there is no DRBD and installed on that node your resources >> won't >> start anyways. > > I've removed this constraint > >>> >>> Below is a portion of the log from when I started a node yet DRBD >>> failed >>> to start. As you can see it thinks the DRBD device is operating >>> correctly as it proceeds to starting subsequent resources, e.g. >>> Apr 9 20:22:55 node1 Filesystem[2939]: [2956]: WARNING: Couldn't >>> find >>> device [/dev/drbd0]. Expected /dev/??? to exist >>> http://pastebin.com/zTCHPtWy > >> The only thing i can read from that log fragments is, that probes are >> running ... not enough information. Really interesting would be logs >> from the DC. > > Here is the log from the DC for that same time period: > http://pastebin.com/d4PGGLPi > >>> >>> After seeing these messages in the log I run >>> # service drbd start >>> # drbdadm up all >>> # crm resource cleanup ms_drbd_vmstore >>> # crm resource cleanup ms_drbd_mount1 >>> # crm resource clenaup ms_drbd_mount2 > >> That should all not be needed ... what is the output of "crm_mon >> -1frA" >> before you do all that cleanups? > > I will get this output the next time I can put the cluster in this state. 
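If you want to give ucast a try, the ha.cf syntax is just "ucast <dev> <peer-ip>"; a sketch with placeholder addresses (heartbeat ignores the entry that matches the local node, so the same two lines can go on node1 and node2, and quorumnode would use eth0 instead of br0):

ucast br0 192.168.3.10
ucast br0 192.168.3.11

replacing the mcast line for that interface.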
> >>> After this sequence of commands the DRBD resources appear to be >>> functioning normally and the subsequent resources start. Any ideas >>> on >>> why DRBD is not being started as expected, or why the cluster is >>> continuing with starting resources that according to the >>> o_drbd-fs-vm >>> constraint should not start until DRBD is master? > >> No idea, maybe creating a crm_report archive and sending it to the >> list >> can shed some light on that problem. > >> Regards, >> Andreas > >> -- >> Need help with Pacemaker? >> http://www.hastexo.com/now > > Thanks, > > Andrew > >>> >>> Thanks, >>> >>> Andrew >>> ------------------------------------------------------------------------ >>> *From: *"Andreas Kurz" <andr...@hastexo.com> >>> *To: *pacemaker@oss.clusterlabs.org >>> *Sent: *Monday, April 2, 2012 6:33:44 PM >>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to >>> master on failover >>> >>> On 04/02/2012 05:47 PM, Andrew Martin wrote: >>>> Hi Andreas, >>>> >>>> Here is the crm_report: >>>> http://dl.dropbox.com/u/2177298/pcmk-Mon-02-Apr-2012.bz2 >>> >>> You tried to do some obfuscation on parts of that archive? ... >>> doesn't >>> really make it easier to debug .... >>> >>> Does the third node ever change its state? >>> >>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): pending >>> >>> Looking at the logs and the transition graph says it aborts due to >>> un-runable operations on that node which seems to be related to >>> it's >>> pending state. >>> >>> Try to get that node up (or down) completely ... maybe a fresh >>> start-over with a clean /var/lib/heartbeat/crm directory is >>> sufficient. >>> >>> Regards, >>> Andreas >>> >>>> >>>> Hi Emmanuel, >>>> >>>> Here is the configuration: >>>> node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \ >>>> attributes standby="off" >>>> node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \ >>>> attributes standby="off" >>>> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \ >>>> attributes standby="on" >>>> primitive p_drbd_mount2 ocf:linbit:drbd \ >>>> params drbd_resource="mount2" \ >>>> op start interval="0" timeout="240" \ >>>> op stop interval="0" timeout="100" \ >>>> op monitor interval="10" role="Master" timeout="20" >>>> start-delay="1m" \ >>>> op monitor interval="20" role="Slave" timeout="20" >>>> start-delay="1m" >>>> primitive p_drbd_mount1 ocf:linbit:drbd \ >>>> params drbd_resource="mount1" \ >>>> op start interval="0" timeout="240" \ >>>> op stop interval="0" timeout="100" \ >>>> op monitor interval="10" role="Master" timeout="20" >>>> start-delay="1m" \ >>>> op monitor interval="20" role="Slave" timeout="20" >>>> start-delay="1m" >>>> primitive p_drbd_vmstore ocf:linbit:drbd \ >>>> params drbd_resource="vmstore" \ >>>> op start interval="0" timeout="240" \ >>>> op stop interval="0" timeout="100" \ >>>> op monitor interval="10" role="Master" timeout="20" >>>> start-delay="1m" \ >>>> op monitor interval="20" role="Slave" timeout="20" >>>> start-delay="1m" >>>> primitive p_fs_vmstore ocf:heartbeat:Filesystem \ >>>> params device="/dev/drbd0" directory="/mnt/storage/vmstore" >>> fstype="ext4" \ >>>> op start interval="0" timeout="60s" \ >>>> op stop interval="0" timeout="60s" \ >>>> op monitor interval="20s" timeout="40s" >>>> primitive p_libvirt-bin upstart:libvirt-bin \ >>>> op monitor interval="30" >>>> primitive p_ping ocf:pacemaker:ping \ >>>> params name="p_ping" host_list="192.168.3.1 192.168.3.2" >>> multiplier="1000" \ >>>> op monitor interval="20s" >>>> primitive 
p_sysadmin_notify ocf:heartbeat:MailTo \ >>>> params email="m...@example.com" \ >>>> params subject="Pacemaker Change" \ >>>> op start interval="0" timeout="10" \ >>>> op stop interval="0" timeout="10" \ >>>> op monitor interval="10" timeout="10" >>>> primitive p_vm ocf:heartbeat:VirtualDomain \ >>>> params config="/mnt/storage/vmstore/config/vm.xml" \ >>>> meta allow-migrate="false" \ >>>> op start interval="0" timeout="180" \ >>>> op stop interval="0" timeout="180" \ >>>> op monitor interval="10" timeout="30" >>>> primitive stonith-node1 stonith:external/tripplitepdu \ >>>> params pdu_ipaddr="192.168.3.100" pdu_port="1" pdu_username="xxx" >>>> pdu_password="xxx" hostname_to_stonith="node1" >>>> primitive stonith-node2 stonith:external/tripplitepdu \ >>>> params pdu_ipaddr="192.168.3.100" pdu_port="2" pdu_username="xxx" >>>> pdu_password="xxx" hostname_to_stonith="node2" >>>> group g_daemons p_libvirt-bin >>>> group g_vm p_fs_vmstore p_vm >>>> ms ms_drbd_mount2 p_drbd_mount2 \ >>>> meta master-max="1" master-node-max="1" clone-max="2" >>>> clone-node-max="1" >>>> notify="true" >>>> ms ms_drbd_mount1 p_drbd_mount1 \ >>>> meta master-max="1" master-node-max="1" clone-max="2" >>>> clone-node-max="1" >>>> notify="true" >>>> ms ms_drbd_vmstore p_drbd_vmstore \ >>>> meta master-max="1" master-node-max="1" clone-max="2" >>>> clone-node-max="1" >>>> notify="true" >>>> clone cl_daemons g_daemons >>>> clone cl_ping p_ping \ >>>> meta interleave="true" >>>> clone cl_sysadmin_notify p_sysadmin_notify \ >>>> meta target-role="Started" >>>> location l-st-node1 stonith-node1 -inf: node1 >>>> location l-st-node2 stonith-node2 -inf: node2 >>>> location l_run_on_most_connected p_vm \ >>>> rule $id="l_run_on_most_connected-rule" p_ping: defined p_ping >>>> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master >>>> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm >>>> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote >>>> ms_drbd_mount1:promote >>>> ms_drbd_mount2:promote cl_daemons:start g_vm:start >>>> property $id="cib-bootstrap-options" \ >>>> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \ >>>> cluster-infrastructure="Heartbeat" \ >>>> stonith-enabled="true" \ >>>> no-quorum-policy="freeze" \ >>>> last-lrm-refresh="1333041002" \ >>>> cluster-recheck-interval="5m" \ >>>> crmd-integration-timeout="3m" \ >>>> shutdown-escalation="5m" >>>> >>>> Thanks, >>>> >>>> Andrew >>>> >>>> >>>> ------------------------------------------------------------------------ >>>> *From: *"emmanuel segura" <emi2f...@gmail.com> >>>> *To: *"The Pacemaker cluster resource manager" >>>> <pacemaker@oss.clusterlabs.org> >>>> *Sent: *Monday, April 2, 2012 9:43:20 AM >>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources >>>> to >>>> master on failover >>>> >>>> Sorry Andrew >>>> >>>> Can you post me your crm configure show again? >>>> >>>> Thanks >>>> >>>> Il giorno 30 marzo 2012 18:53, Andrew Martin <amar...@xes-inc.com >>>> <mailto:amar...@xes-inc.com>> ha scritto: >>>> >>>> Hi Emmanuel, >>>> >>>> Thanks, that is a good idea. I updated the colocation contraint as >>>> you described. 
After, the cluster remains in this state (with the >>>> filesystem not mounted and the VM not started): >>>> Online: [ node2 node1 ] >>>> >>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore] >>>> Masters: [ node1 ] >>>> Slaves: [ node2 ] >>>> Master/Slave Set: ms_drbd_tools [p_drbd_mount1] >>>> Masters: [ node1 ] >>>> Slaves: [ node2 ] >>>> Master/Slave Set: ms_drbd_crm [p_drbd_mount2] >>>> Masters: [ node1 ] >>>> Slaves: [ node2 ] >>>> Clone Set: cl_daemons [g_daemons] >>>> Started: [ node2 node1 ] >>>> Stopped: [ g_daemons:2 ] >>>> stonith-node1 (stonith:external/tripplitepdu): Started node2 >>>> stonith-node2 (stonith:external/tripplitepdu): Started node1 >>>> >>>> I noticed that Pacemaker had not issued "drbdadm connect" for any >>>> of >>>> the DRBD resources on node2 >>>> # service drbd status >>>> drbd driver loaded OK; device status: >>>> version: 8.3.7 (api:88/proto:86-91) >>>> GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by >>>> root@node2, 2012-02-02 12:29:26 >>>> m:res cs ro ds p >>>> mounted fstype >>>> 0:vmstore StandAlone Secondary/Unknown Outdated/DUnknown r---- >>>> 1:mount1 StandAlone Secondary/Unknown Outdated/DUnknown r---- >>>> 2:mount2 StandAlone Secondary/Unknown Outdated/DUnknown r---- >>>> # drbdadm cstate all >>>> StandAlone >>>> StandAlone >>>> StandAlone >>>> >>>> After manually issuing "drbdadm connect all" on node2 the rest of >>>> the resources eventually started (several minutes later) on node1: >>>> Online: [ node2 node1 ] >>>> >>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore] >>>> Masters: [ node1 ] >>>> Slaves: [ node2 ] >>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] >>>> Masters: [ node1 ] >>>> Slaves: [ node2 ] >>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2] >>>> Masters: [ node1 ] >>>> Slaves: [ node2 ] >>>> Resource Group: g_vm >>>> p_fs_vmstore (ocf::heartbeat:Filesystem): Started node1 >>>> p_vm (ocf::heartbeat:VirtualDomain): Started node1 >>>> Clone Set: cl_daemons [g_daemons] >>>> Started: [ node2 node1 ] >>>> Stopped: [ g_daemons:2 ] >>>> Clone Set: cl_sysadmin_notify [p_sysadmin_notify] >>>> Started: [ node2 node1 ] >>>> Stopped: [ p_sysadmin_notify:2 ] >>>> stonith-node1 (stonith:external/tripplitepdu): Started node2 >>>> stonith-node2 (stonith:external/tripplitepdu): Started node1 >>>> Clone Set: cl_ping [p_ping] >>>> Started: [ node2 node1 ] >>>> Stopped: [ p_ping:2 ] >>>> >>>> The DRBD devices on node1 were all UpToDate, so it doesn't seem >>>> right that it would need to wait for node2 to be connected before >>>> it >>>> could continue promoting additional resources. I then restarted >>>> heartbeat on node2 to see if it would automatically connect the >>>> DRBD >>>> devices this time. 
After restarting it, the DRBD devices are not >>>> even configured: >>>> # service drbd status >>>> drbd driver loaded OK; device status: >>>> version: 8.3.7 (api:88/proto:86-91) >>>> GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by >>>> root@webapps2host, 2012-02-02 12:29:26 >>>> m:res cs ro ds p mounted fstype >>>> 0:vmstore Unconfigured >>>> 1:mount1 Unconfigured >>>> 2:mount2 Unconfigured >>>> >>>> Looking at the log I found this part about the drbd primitives: >>>> Mar 30 11:10:32 node2 lrmd: [10702]: info: operation monitor[2] on >>>> p_drbd_vmstore:1 for client 10705: pid 11065 exited with return >>>> code 7 >>>> Mar 30 11:10:32 node2 crmd: [10705]: info: process_lrm_event: LRM >>>> operation p_drbd_vmstore:1_monitor_0 (call=2, rc=7, cib-update=11, >>>> confirmed=true) not running >>>> Mar 30 11:10:32 node2 lrmd: [10702]: info: operation monitor[4] on >>>> p_drbd_mount2:1 for client 10705: pid 11069 exited with return >>>> code 7 >>>> Mar 30 11:10:32 node2 crmd: [10705]: info: process_lrm_event: LRM >>>> operation p_drbd_mount2:1_monitor_0 (call=4, rc=7, cib-update=12, >>>> confirmed=true) not running >>>> Mar 30 11:10:32 node2 lrmd: [10702]: info: operation monitor[3] on >>>> p_drbd_mount1:1 for client 10705: pid 11066 exited with return >>>> code 7 >>>> Mar 30 11:10:32 node2 crmd: [10705]: info: process_lrm_event: LRM >>>> operation p_drbd_mount1:1_monitor_0 (call=3, rc=7, cib-update=13, >>>> confirmed=true) not running >>>> >>>> I am not sure what exit code 7 is - is it possible to manually run >>>> the monitor code or somehow obtain more debug about this? Here is >>>> the complete log after restarting heartbeat on node2: >>>> http://pastebin.com/KsHKi3GW >>>> >>>> Thanks, >>>> >>>> Andrew >>>> >>>> >>> ------------------------------------------------------------------------ >>>> *From: *"emmanuel segura" <emi2f...@gmail.com >>>> <mailto:emi2f...@gmail.com>> >>>> *To: *"The Pacemaker cluster resource manager" >>>> <pacemaker@oss.clusterlabs.org >>>> <mailto:pacemaker@oss.clusterlabs.org>> >>>> *Sent: *Friday, March 30, 2012 10:26:48 AM >>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources >>>> to >>>> master on failover >>>> >>>> I think this constrain it's wrong >>>> ================================================== >>>> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master >>>> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm >>>> =================================================== >>>> >>>> change to >>>> ====================================================== >>>> colocation c_drbd_libvirt_vm inf: g_vm ms_drbd_vmstore:Master >>>> ms_drbd_mount1:Master ms_drbd_mount2:Master >>>> ======================================================= >>>> >>>> Il giorno 30 marzo 2012 17:16, Andrew Martin <amar...@xes-inc.com >>>> <mailto:amar...@xes-inc.com>> ha scritto: >>>> >>>> Hi Emmanuel, >>>> >>>> Here is the output of crm configure show: >>>> http://pastebin.com/NA1fZ8dL >>>> >>>> Thanks, >>>> >>>> Andrew >>>> >>>> >>> ------------------------------------------------------------------------ >>>> *From: *"emmanuel segura" <emi2f...@gmail.com >>>> <mailto:emi2f...@gmail.com>> >>>> *To: *"The Pacemaker cluster resource manager" >>>> <pacemaker@oss.clusterlabs.org >>>> <mailto:pacemaker@oss.clusterlabs.org>> >>>> *Sent: *Friday, March 30, 2012 9:43:45 AM >>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources >>>> to master on failover >>>> >>>> can you show me? 
>>>> >>>> crm configure show >>>> >>>> Il giorno 30 marzo 2012 16:10, Andrew Martin >>>> <amar...@xes-inc.com <mailto:amar...@xes-inc.com>> ha scritto: >>>> >>>> Hi Andreas, >>>> >>>> Here is a copy of my complete CIB: >>>> http://pastebin.com/v5wHVFuy >>>> >>>> I'll work on generating a report using crm_report as well. >>>> >>>> Thanks, >>>> >>>> Andrew >>>> >>>> >>> ------------------------------------------------------------------------ >>>> *From: *"Andreas Kurz" <andr...@hastexo.com >>>> <mailto:andr...@hastexo.com>> >>>> *To: *pacemaker@oss.clusterlabs.org >>>> <mailto:pacemaker@oss.clusterlabs.org> >>>> *Sent: *Friday, March 30, 2012 4:41:16 AM >>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD >>>> resources to master on failover >>>> >>>> On 03/28/2012 04:56 PM, Andrew Martin wrote: >>>>> Hi Andreas, >>>>> >>>>> I disabled the DRBD init script and then restarted the >>>> slave node >>>>> (node2). After it came back up, DRBD did not start: >>>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): >>>> pending >>>>> Online: [ node2 node1 ] >>>>> >>>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore] >>>>> Masters: [ node1 ] >>>>> Stopped: [ p_drbd_vmstore:1 ] >>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_tools] >>>>> Masters: [ node1 ] >>>>> Stopped: [ p_drbd_mount1:1 ] >>>>> Master/Slave Set: ms_drbd_mount2 [p_drbdmount2] >>>>> Masters: [ node1 ] >>>>> Stopped: [ p_drbd_mount2:1 ] >>>>> ... >>>>> >>>>> root@node2:~# service drbd status >>>>> drbd not loaded >>>> >>>> Yes, expected unless Pacemaker starts DRBD >>>> >>>>> >>>>> Is there something else I need to change in the CIB to >>>> ensure that DRBD >>>>> is started? All of my DRBD devices are configured like this: >>>>> primitive p_drbd_mount2 ocf:linbit:drbd \ >>>>> params drbd_resource="mount2" \ >>>>> op monitor interval="15" role="Master" \ >>>>> op monitor interval="30" role="Slave" >>>>> ms ms_drbd_mount2 p_drbd_mount2 \ >>>>> meta master-max="1" master-node-max="1" >>> clone-max="2" >>>>> clone-node-max="1" notify="true" >>>> >>>> That should be enough ... unable to say more without seeing >>>> the complete >>>> configuration ... too much fragments of information ;-) >>>> >>>> Please provide (e.g. pastebin) your complete cib (cibadmin >>>> -Q) when >>>> cluster is in that state ... or even better create a >>>> crm_report archive >>>> >>>>> >>>>> Here is the output from the syslog (grep -i drbd >>>> /var/log/syslog): >>>>> Mar 28 09:24:47 node2 crmd: [3213]: info: do_lrm_rsc_op: >>>> Performing >>>>> key=12:315:7:24416169-73ba-469b-a2e3-56a22b437cbc >>>>> op=p_drbd_vmstore:1_monitor_0 ) >>>>> Mar 28 09:24:47 node2 lrmd: [3210]: info: >>>> rsc:p_drbd_vmstore:1 probe[2] >>>>> (pid 3455) >>>>> Mar 28 09:24:47 node2 crmd: [3213]: info: do_lrm_rsc_op: >>>> Performing >>>>> key=13:315:7:24416169-73ba-469b-a2e3-56a22b437cbc >>>>> op=p_drbd_mount1:1_monitor_0 ) >>>>> Mar 28 09:24:48 node2 lrmd: [3210]: info: >>>> rsc:p_drbd_mount1:1 probe[3] >>>>> (pid 3456) >>>>> Mar 28 09:24:48 node2 crmd: [3213]: info: do_lrm_rsc_op: >>>> Performing >>>>> key=14:315:7:24416169-73ba-469b-a2e3-56a22b437cbc >>>>> op=p_drbd_mount2:1_monitor_0 ) >>>>> Mar 28 09:24:48 node2 lrmd: [3210]: info: >>>> rsc:p_drbd_mount2:1 probe[4] >>>>> (pid 3457) >>>>> Mar 28 09:24:48 node2 Filesystem[3458]: [3517]: WARNING: >>>> Couldn't find >>>>> device [/dev/drbd0]. Expected /dev/??? 
to exist >>>>> Mar 28 09:24:48 node2 crm_attribute: [3563]: info: Invoked: >>>>> crm_attribute -N node2 -n master-p_drbd_mount2:1 -l >>> reboot -D >>>>> Mar 28 09:24:48 node2 crm_attribute: [3557]: info: Invoked: >>>>> crm_attribute -N node2 -n master-p_drbd_vmstore:1 -l >>> reboot -D >>>>> Mar 28 09:24:48 node2 crm_attribute: [3562]: info: Invoked: >>>>> crm_attribute -N node2 -n master-p_drbd_mount1:1 -l >>> reboot -D >>>>> Mar 28 09:24:48 node2 lrmd: [3210]: info: operation >>>> monitor[4] on >>>>> p_drbd_mount2:1 for client 3213: pid 3457 exited with >>>> return code 7 >>>>> Mar 28 09:24:48 node2 lrmd: [3210]: info: operation >>>> monitor[2] on >>>>> p_drbd_vmstore:1 for client 3213: pid 3455 exited with >>>> return code 7 >>>>> Mar 28 09:24:48 node2 crmd: [3213]: info: >>>> process_lrm_event: LRM >>>>> operation p_drbd_mount2:1_monitor_0 (call=4, rc=7, >>>> cib-update=10, >>>>> confirmed=true) not running >>>>> Mar 28 09:24:48 node2 lrmd: [3210]: info: operation >>>> monitor[3] on >>>>> p_drbd_mount1:1 for client 3213: pid 3456 exited with >>>> return code 7 >>>>> Mar 28 09:24:48 node2 crmd: [3213]: info: >>>> process_lrm_event: LRM >>>>> operation p_drbd_vmstore:1_monitor_0 (call=2, rc=7, >>>> cib-update=11, >>>>> confirmed=true) not running >>>>> Mar 28 09:24:48 node2 crmd: [3213]: info: >>>> process_lrm_event: LRM >>>>> operation p_drbd_mount1:1_monitor_0 (call=3, rc=7, >>>> cib-update=12, >>>>> confirmed=true) not running >>>> >>>> No errors, just probing ... so for any reason Pacemaker does >>>> not like to >>>> start it ... use crm_simulate to find out why ... or provide >>>> information >>>> as requested above. >>>> >>>> Regards, >>>> Andreas >>>> >>>> -- >>>> Need help with Pacemaker? >>>> http://www.hastexo.com/now >>>> >>>>> >>>>> Thanks, >>>>> >>>>> Andrew >>>>> >>>>> >>>> >>> ------------------------------------------------------------------------ >>>>> *From: *"Andreas Kurz" <andr...@hastexo.com >>>> <mailto:andr...@hastexo.com>> >>>>> *To: *pacemaker@oss.clusterlabs.org >>>> <mailto:pacemaker@oss.clusterlabs.org> >>>>> *Sent: *Wednesday, March 28, 2012 9:03:06 AM >>>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD >>>> resources to >>>>> master on failover >>>>> >>>>> On 03/28/2012 03:47 PM, Andrew Martin wrote: >>>>>> Hi Andreas, >>>>>> >>>>>>> hmm ... what is that fence-peer script doing? If you >>>> want to use >>>>>>> resource-level fencing with the help of dopd, activate the >>>>>>> drbd-peer-outdater script in the line above ... and >>>> double check if the >>>>>>> path is correct >>>>>> fence-peer is just a wrapper for drbd-peer-outdater that >>>> does some >>>>>> additional logging. In my testing dopd has been working >>> well. >>>>> >>>>> I see >>>>> >>>>>> >>>>>>>> I am thinking of making the following changes to the >>>> CIB (as per the >>>>>>>> official DRBD >>>>>>>> guide >>>>>> >>>>> >>>> >>> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html) >>>> in >>>>>>>> order to add the DRBD lsb service and require that it >>>> start before the >>>>>>>> ocf:linbit:drbd resources. Does this look correct? >>>>>>> >>>>>>> Where did you read that? No, deactivate the startup of >>>> DRBD on system >>>>>>> boot and let Pacemaker manage it completely. 
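To take the distribution init script completely out of the picture (a sketch -- adjust to your init system):

# update-rc.d -f drbd remove      (on the Ubuntu 10.04 nodes)
# chkconfig drbd off              (on the CentOS quorum node, if drbd is installed there at all)

so that only the ocf:linbit:drbd resources load the module and bring the devices up or down.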
>>>>>>> >>>>>>>> primitive p_drbd-init lsb:drbd op monitor interval="30" >>>>>>>> colocation c_drbd_together inf: >>>>>>>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master >>>>>>>> ms_drbd_mount2:Master >>>>>>>> order drbd_init_first inf: ms_drbd_vmstore:promote >>>>>>>> ms_drbd_mount1:promote ms_drbd_mount2:promote >>>> p_drbd-init:start >>>>>>>> >>>>>>>> This doesn't seem to require that drbd be also running >>>> on the node where >>>>>>>> the ocf:linbit:drbd resources are slave (which it would >>>> need to do to be >>>>>>>> a DRBD SyncTarget) - how can I ensure that drbd is >>>> running everywhere? >>>>>>>> (clone cl_drbd p_drbd-init ?) >>>>>>> >>>>>>> This is really not needed. >>>>>> I was following the official DRBD Users Guide: >>>>>> >>>> >>> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html >>>>>> >>>>>> If I am understanding your previous message correctly, I >>>> do not need to >>>>>> add a lsb primitive for the drbd daemon? It will be >>>>>> started/stopped/managed automatically by my >>>> ocf:linbit:drbd resources >>>>>> (and I can remove the /etc/rc* symlinks)? >>>>> >>>>> Yes, you don't need that LSB script when using Pacemaker >>>> and should not >>>>> let init start it. >>>>> >>>>> Regards, >>>>> Andreas >>>>> >>>>> -- >>>>> Need help with Pacemaker? >>>>> http://www.hastexo.com/now >>>>> >>>>>> >>>>>> Thanks, >>>>>> >>>>>> Andrew >>>>>> >>>>>> >>>> >>> ------------------------------------------------------------------------ >>>>>> *From: *"Andreas Kurz" <andr...@hastexo.com >>>> <mailto:andr...@hastexo.com> <mailto:andr...@hastexo.com >>>> <mailto:andr...@hastexo.com>>> >>>>>> *To: *pacemaker@oss.clusterlabs.org >>>> <mailto:pacemaker@oss.clusterlabs.org> >>>> <mailto:pacemaker@oss.clusterlabs.org >>>> <mailto:pacemaker@oss.clusterlabs.org>> >>>>>> *Sent: *Wednesday, March 28, 2012 7:27:34 AM >>>>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD >>>> resources to >>>>>> master on failover >>>>>> >>>>>> On 03/28/2012 12:13 AM, Andrew Martin wrote: >>>>>>> Hi Andreas, >>>>>>> >>>>>>> Thanks, I've updated the colocation rule to be in the >>>> correct order. I >>>>>>> also enabled the STONITH resource (this was temporarily >>>> disabled before >>>>>>> for some additional testing). DRBD has its own network >>>> connection over >>>>>>> the br1 interface (192.168.5.0/24 >>>> <http://192.168.5.0/24> network), a direct crossover cable >>>>>>> between node1 and node2: >>>>>>> global { usage-count no; } >>>>>>> common { >>>>>>> syncer { rate 110M; } >>>>>>> } >>>>>>> resource vmstore { >>>>>>> protocol C; >>>>>>> startup { >>>>>>> wfc-timeout 15; >>>>>>> degr-wfc-timeout 60; >>>>>>> } >>>>>>> handlers { >>>>>>> #fence-peer >>>> "/usr/lib/heartbeat/drbd-peer-outdater -t 5"; >>>>>>> fence-peer "/usr/local/bin/fence-peer"; >>>>>> >>>>>> hmm ... what is that fence-peer script doing? If you want >>>> to use >>>>>> resource-level fencing with the help of dopd, activate the >>>>>> drbd-peer-outdater script in the line above ... 
and >>>> double check if the >>>>>> path is correct >>>>>> >>>>>>> split-brain >>>> "/usr/lib/drbd/notify-split-brain.sh >>>>>>> m...@example.com <mailto:m...@example.com> >>>> <mailto:m...@example.com <mailto:m...@example.com>>"; >>>>>>> } >>>>>>> net { >>>>>>> after-sb-0pri discard-zero-changes; >>>>>>> after-sb-1pri discard-secondary; >>>>>>> after-sb-2pri disconnect; >>>>>>> cram-hmac-alg md5; >>>>>>> shared-secret "xxxxx"; >>>>>>> } >>>>>>> disk { >>>>>>> fencing resource-only; >>>>>>> } >>>>>>> on node1 { >>>>>>> device /dev/drbd0; >>>>>>> disk /dev/sdb1; >>>>>>> address 192.168.5.10:7787 >>>> <http://192.168.5.10:7787>; >>>>>>> meta-disk internal; >>>>>>> } >>>>>>> on node2 { >>>>>>> device /dev/drbd0; >>>>>>> disk /dev/sdf1; >>>>>>> address 192.168.5.11:7787 >>>> <http://192.168.5.11:7787>; >>>>>>> meta-disk internal; >>>>>>> } >>>>>>> } >>>>>>> # and similar for mount1 and mount2 >>>>>>> >>>>>>> Also, here is my ha.cf <http://ha.cf>. It uses both the >>>> direct link between the nodes >>>>>>> (br1) and the shared LAN network on br0 for communicating: >>>>>>> autojoin none >>>>>>> mcast br0 239.0.0.43 694 1 0 >>>>>>> bcast br1 >>>>>>> warntime 5 >>>>>>> deadtime 15 >>>>>>> initdead 60 >>>>>>> keepalive 2 >>>>>>> node node1 >>>>>>> node node2 >>>>>>> node quorumnode >>>>>>> crm respawn >>>>>>> respawn hacluster /usr/lib/heartbeat/dopd >>>>>>> apiauth dopd gid=haclient uid=hacluster >>>>>>> >>>>>>> I am thinking of making the following changes to the CIB >>>> (as per the >>>>>>> official DRBD >>>>>>> guide >>>>>> >>>>> >>>> >>> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html) >>>> in >>>>>>> order to add the DRBD lsb service and require that it >>>> start before the >>>>>>> ocf:linbit:drbd resources. Does this look correct? >>>>>> >>>>>> Where did you read that? No, deactivate the startup of >>>> DRBD on system >>>>>> boot and let Pacemaker manage it completely. >>>>>> >>>>>>> primitive p_drbd-init lsb:drbd op monitor interval="30" >>>>>>> colocation c_drbd_together inf: >>>>>>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master >>>>>>> ms_drbd_mount2:Master >>>>>>> order drbd_init_first inf: ms_drbd_vmstore:promote >>>>>>> ms_drbd_mount1:promote ms_drbd_mount2:promote >>>> p_drbd-init:start >>>>>>> >>>>>>> This doesn't seem to require that drbd be also running >>>> on the node where >>>>>>> the ocf:linbit:drbd resources are slave (which it would >>>> need to do to be >>>>>>> a DRBD SyncTarget) - how can I ensure that drbd is >>>> running everywhere? >>>>>>> (clone cl_drbd p_drbd-init ?) >>>>>> >>>>>> This is really not needed. >>>>>> >>>>>> Regards, >>>>>> Andreas >>>>>> >>>>>> -- >>>>>> Need help with Pacemaker? 
>>>>>> http://www.hastexo.com/now >>>>>> >>>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> Andrew >>>>>>> >>>> >>> ------------------------------------------------------------------------ >>>>>>> *From: *"Andreas Kurz" <andr...@hastexo.com >>>> <mailto:andr...@hastexo.com> <mailto:andr...@hastexo.com >>>> <mailto:andr...@hastexo.com>>> >>>>>>> *To: *pacemaker@oss.clusterlabs.org >>>> <mailto:pacemaker@oss.clusterlabs.org> >>>>> <mailto:*pacemaker@oss.clusterlabs.org >>>> <mailto:pacemaker@oss.clusterlabs.org>> >>>>>>> *Sent: *Monday, March 26, 2012 5:56:22 PM >>>>>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD >>>> resources to >>>>>>> master on failover >>>>>>> >>>>>>> On 03/24/2012 08:15 PM, Andrew Martin wrote: >>>>>>>> Hi Andreas, >>>>>>>> >>>>>>>> My complete cluster configuration is as follows: >>>>>>>> ============ >>>>>>>> Last updated: Sat Mar 24 13:51:55 2012 >>>>>>>> Last change: Sat Mar 24 13:41:55 2012 >>>>>>>> Stack: Heartbeat >>>>>>>> Current DC: node2 >>>> (9100538b-7a1f-41fd-9c1a-c6b4b1c32b18) - partition >>>>>>>> with quorum >>>>>>>> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c >>>>>>>> 3 Nodes configured, unknown expected votes >>>>>>>> 19 Resources configured. >>>>>>>> ============ >>>>>>>> >>>>>>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): >>>> OFFLINE >>>>> (standby) >>>>>>>> Online: [ node2 node1 ] >>>>>>>> >>>>>>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore] >>>>>>>> Masters: [ node2 ] >>>>>>>> Slaves: [ node1 ] >>>>>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] >>>>>>>> Masters: [ node2 ] >>>>>>>> Slaves: [ node1 ] >>>>>>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2] >>>>>>>> Masters: [ node2 ] >>>>>>>> Slaves: [ node1 ] >>>>>>>> Resource Group: g_vm >>>>>>>> p_fs_vmstore(ocf::heartbeat:Filesystem):Started >>> node2 >>>>>>>> p_vm(ocf::heartbeat:VirtualDomain):Started node2 >>>>>>>> Clone Set: cl_daemons [g_daemons] >>>>>>>> Started: [ node2 node1 ] >>>>>>>> Stopped: [ g_daemons:2 ] >>>>>>>> Clone Set: cl_sysadmin_notify [p_sysadmin_notify] >>>>>>>> Started: [ node2 node1 ] >>>>>>>> Stopped: [ p_sysadmin_notify:2 ] >>>>>>>> stonith-node1(stonith:external/tripplitepdu):Started >>> node2 >>>>>>>> stonith-node2(stonith:external/tripplitepdu):Started >>> node1 >>>>>>>> Clone Set: cl_ping [p_ping] >>>>>>>> Started: [ node2 node1 ] >>>>>>>> Stopped: [ p_ping:2 ] >>>>>>>> >>>>>>>> node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \ >>>>>>>> attributes standby="off" >>>>>>>> node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \ >>>>>>>> attributes standby="off" >>>>>>>> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" >>>> quorumnode \ >>>>>>>> attributes standby="on" >>>>>>>> primitive p_drbd_mount2 ocf:linbit:drbd \ >>>>>>>> params drbd_resource="mount2" \ >>>>>>>> op monitor interval="15" role="Master" \ >>>>>>>> op monitor interval="30" role="Slave" >>>>>>>> primitive p_drbd_mount1 ocf:linbit:drbd \ >>>>>>>> params drbd_resource="mount1" \ >>>>>>>> op monitor interval="15" role="Master" \ >>>>>>>> op monitor interval="30" role="Slave" >>>>>>>> primitive p_drbd_vmstore ocf:linbit:drbd \ >>>>>>>> params drbd_resource="vmstore" \ >>>>>>>> op monitor interval="15" role="Master" \ >>>>>>>> op monitor interval="30" role="Slave" >>>>>>>> primitive p_fs_vmstore ocf:heartbeat:Filesystem \ >>>>>>>> params device="/dev/drbd0" directory="/vmstore" >>>> fstype="ext4" \ >>>>>>>> op start interval="0" timeout="60s" \ >>>>>>>> op stop interval="0" timeout="60s" \ >>>>>>>> op monitor interval="20s" timeout="40s" >>>>>>>> primitive 
p_libvirt-bin upstart:libvirt-bin \ >>>>>>>> op monitor interval="30" >>>>>>>> primitive p_ping ocf:pacemaker:ping \ >>>>>>>> params name="p_ping" host_list="192.168.1.10 >>>> 192.168.1.11" >>>>>>>> multiplier="1000" \ >>>>>>>> op monitor interval="20s" >>>>>>>> primitive p_sysadmin_notify ocf:heartbeat:MailTo \ >>>>>>>> params email="m...@example.com >>>> <mailto:m...@example.com> <mailto:m...@example.com >>>> <mailto:m...@example.com>>" \ >>>>>>>> params subject="Pacemaker Change" \ >>>>>>>> op start interval="0" timeout="10" \ >>>>>>>> op stop interval="0" timeout="10" \ >>>>>>>> op monitor interval="10" timeout="10" >>>>>>>> primitive p_vm ocf:heartbeat:VirtualDomain \ >>>>>>>> params config="/vmstore/config/vm.xml" \ >>>>>>>> meta allow-migrate="false" \ >>>>>>>> op start interval="0" timeout="120s" \ >>>>>>>> op stop interval="0" timeout="120s" \ >>>>>>>> op monitor interval="10" timeout="30" >>>>>>>> primitive stonith-node1 stonith:external/tripplitepdu \ >>>>>>>> params pdu_ipaddr="192.168.1.12" pdu_port="1" >>>> pdu_username="xxx" >>>>>>>> pdu_password="xxx" hostname_to_stonith="node1" >>>>>>>> primitive stonith-node2 stonith:external/tripplitepdu \ >>>>>>>> params pdu_ipaddr="192.168.1.12" pdu_port="2" >>>> pdu_username="xxx" >>>>>>>> pdu_password="xxx" hostname_to_stonith="node2" >>>>>>>> group g_daemons p_libvirt-bin >>>>>>>> group g_vm p_fs_vmstore p_vm >>>>>>>> ms ms_drbd_mount2 p_drbd_mount2 \ >>>>>>>> meta master-max="1" master-node-max="1" >>>> clone-max="2" >>>>>>>> clone-node-max="1" notify="true" >>>>>>>> ms ms_drbd_mount1 p_drbd_mount1 \ >>>>>>>> meta master-max="1" master-node-max="1" >>>> clone-max="2" >>>>>>>> clone-node-max="1" notify="true" >>>>>>>> ms ms_drbd_vmstore p_drbd_vmstore \ >>>>>>>> meta master-max="1" master-node-max="1" >>>> clone-max="2" >>>>>>>> clone-node-max="1" notify="true" >>>>>>>> clone cl_daemons g_daemons >>>>>>>> clone cl_ping p_ping \ >>>>>>>> meta interleave="true" >>>>>>>> clone cl_sysadmin_notify p_sysadmin_notify >>>>>>>> location l-st-node1 stonith-node1 -inf: node1 >>>>>>>> location l-st-node2 stonith-node2 -inf: node2 >>>>>>>> location l_run_on_most_connected p_vm \ >>>>>>>> rule $id="l_run_on_most_connected-rule" p_ping: >>>> defined p_ping >>>>>>>> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master >>>>>>>> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm >>>>>>> >>>>>>> As Emmanuel already said, g_vm has to be in the first >>>> place in this >>>>>>> collocation constraint .... g_vm must be colocated with >>>> the drbd masters. >>>>>>> >>>>>>>> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote >>>> ms_drbd_mount1:promote >>>>>>>> ms_drbd_mount2:promote cl_daemons:start g_vm:start >>>>>>>> property $id="cib-bootstrap-options" \ >>>>>>>> >>>> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \ >>>>>>>> cluster-infrastructure="Heartbeat" \ >>>>>>>> stonith-enabled="false" \ >>>>>>>> no-quorum-policy="stop" \ >>>>>>>> last-lrm-refresh="1332539900" \ >>>>>>>> cluster-recheck-interval="5m" \ >>>>>>>> crmd-integration-timeout="3m" \ >>>>>>>> shutdown-escalation="5m" >>>>>>>> >>>>>>>> The STONITH plugin is a custom plugin I wrote for the >>>> Tripp-Lite >>>>>>>> PDUMH20ATNET that I'm using as the STONITH device: >>>>>>>> >>>> >>> http://www.tripplite.com/shared/product-pages/en/PDUMH20ATNET.pdf >>>>>>> >>>>>>> And why don't using it? .... 
stonith-enabled="false" >>>>>>> >>>>>>>> >>>>>>>> As you can see, I left the DRBD service to be started >>>> by the operating >>>>>>>> system (as an lsb script at boot time) however >>>> Pacemaker controls >>>>>>>> actually bringing up/taking down the individual DRBD >>>> devices. >>>>>>> >>>>>>> Don't start drbd on system boot, give Pacemaker the full >>>> control. >>>>>>> >>>>>>> The >>>>>>>> behavior I observe is as follows: I issue "crm resource >>>> migrate p_vm" on >>>>>>>> node1 and failover successfully to node2. During this >>>> time, node2 fences >>>>>>>> node1's DRBD devices (using dopd) and marks them as >>>> Outdated. Meanwhile >>>>>>>> node2's DRBD devices are UpToDate. I then shutdown both >>>> nodes and then >>>>>>>> bring them back up. They reconnect to the cluster (with >>>> quorum), and >>>>>>>> node1's DRBD devices are still Outdated as expected and >>>> node2's DRBD >>>>>>>> devices are still UpToDate, as expected. At this point, >>>> DRBD starts on >>>>>>>> both nodes, however node2 will not set DRBD as master: >>>>>>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): >>>> OFFLINE >>>>> (standby) >>>>>>>> Online: [ node2 node1 ] >>>>>>>> >>>>>>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore] >>>>>>>> Slaves: [ node1 node2 ] >>>>>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] >>>>>>>> Slaves: [ node1 node 2 ] >>>>>>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2] >>>>>>>> Slaves: [ node1 node2 ] >>>>>>> >>>>>>> There should really be no interruption of the drbd >>>> replication on vm >>>>>>> migration that activates the dopd ... drbd has its own >>>> direct network >>>>>>> connection? >>>>>>> >>>>>>> Please share your ha.cf <http://ha.cf> file and your >>>> drbd configuration. Watch out for >>>>>>> drbd messages in your kernel log file, that should give >>>> you additional >>>>>>> information when/why the drbd connection was lost. >>>>>>> >>>>>>> Regards, >>>>>>> Andreas >>>>>>> >>>>>>> -- >>>>>>> Need help with Pacemaker? >>>>>>> http://www.hastexo.com/now >>>>>>> >>>>>>>> >>>>>>>> I am having trouble sorting through the logging >>>> information because >>>>>>>> there is so much of it in /var/log/daemon.log, but I >>>> can't find an >>>>>>>> error message printed about why it will not promote >>>> node2. At this point >>>>>>>> the DRBD devices are as follows: >>>>>>>> node2: cstate = WFConnection dstate=UpToDate >>>>>>>> node1: cstate = StandAlone dstate=Outdated >>>>>>>> >>>>>>>> I don't see any reason why node2 can't become DRBD >>>> master, or am I >>>>>>>> missing something? If I do "drbdadm connect all" on >>>> node1, then the >>>>>>>> cstate on both nodes changes to "Connected" and node2 >>>> immediately >>>>>>>> promotes the DRBD resources to master. Any ideas on why >>>> I'm observing >>>>>>>> this incorrect behavior? >>>>>>>> >>>>>>>> Any tips on how I can better filter through the >>>> pacemaker/heartbeat logs >>>>>>>> or how to get additional useful debug information? 
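A few things that usually help here -- only a sketch, paths and the RA location are assumed:

# grep -E 'pengine|crmd' /var/log/daemon.log | grep -Ei 'warn|error|promot'
(the pengine lines on the DC say why a promotion was or was not scheduled)
# grep -i drbd /var/log/kern.log
(connection loss and peer fencing show up in the kernel log)
# crm_simulate -sL
(allocation scores for the live CIB)
# OCF_ROOT=/usr/lib/ocf OCF_RESOURCE_INSTANCE=p_drbd_vmstore:0 \
  OCF_RESKEY_drbd_resource=vmstore \
  /usr/lib/ocf/resource.d/linbit/drbd monitor; echo $?
(runs the RA's monitor by hand: 0 means running as slave, 8 running as master, 7 is just OCF_NOT_RUNNING)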
>>>>>>>> >>>>>>>> Thanks, >>>>>>>> >>>>>>>> Andrew >>>>>>>> >>>>>>>> >>>> >>> ------------------------------------------------------------------------ >>>>>>>> *From: *"Andreas Kurz" <andr...@hastexo.com >>>> <mailto:andr...@hastexo.com> >>>>> <mailto:andr...@hastexo.com <mailto:andr...@hastexo.com>>> >>>>>>>> *To: *pacemaker@oss.clusterlabs.org >>>> <mailto:pacemaker@oss.clusterlabs.org> >>>>>> <mailto:*pacemaker@oss.clusterlabs.org >>>> <mailto:pacemaker@oss.clusterlabs.org>> >>>>>>>> *Sent: *Wednesday, 1 February, 2012 4:19:25 PM >>>>>>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD >>>> resources to >>>>>>>> master on failover >>>>>>>> >>>>>>>> On 01/25/2012 08:58 PM, Andrew Martin wrote: >>>>>>>>> Hello, >>>>>>>>> >>>>>>>>> Recently I finished configuring a two-node cluster >>>> with pacemaker 1.1.6 >>>>>>>>> and heartbeat 3.0.5 on nodes running Ubuntu 10.04. >>>> This cluster >>>>> includes >>>>>>>>> the following resources: >>>>>>>>> - primitives for DRBD storage devices >>>>>>>>> - primitives for mounting the filesystem on the DRBD >>>> storage >>>>>>>>> - primitives for some mount binds >>>>>>>>> - primitive for starting apache >>>>>>>>> - primitives for starting samba and nfs servers >>>> (following instructions >>>>>>>>> here >>>> <http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf>) >>>>>>>>> - primitives for exporting nfs shares >>>> (ocf:heartbeat:exportfs) >>>>>>>> >>>>>>>> not enough information ... please share at least your >>>> complete cluster >>>>>>>> configuration >>>>>>>> >>>>>>>> Regards, >>>>>>>> Andreas >>>>>>>> >>>>>>>> -- >>>>>>>> Need help with Pacemaker? >>>>>>>> http://www.hastexo.com/now >>>>>>>> >>>>>>>>> >>>>>>>>> Perhaps this is best described through the output of >>>> crm_mon: >>>>>>>>> Online: [ node1 node2 ] >>>>>>>>> >>>>>>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] >>>> (unmanaged) >>>>>>>>> p_drbd_mount1:0 (ocf::linbit:drbd): >>>> Started node2 >>>>>>> (unmanaged) >>>>>>>>> p_drbd_mount1:1 (ocf::linbit:drbd): >>>> Started node1 >>>>>>>>> (unmanaged) FAILED >>>>>>>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2] >>>>>>>>> p_drbd_mount2:0 (ocf::linbit:drbd): >>>> Master node1 >>>>>>>>> (unmanaged) FAILED >>>>>>>>> Slaves: [ node2 ] >>>>>>>>> Resource Group: g_core >>>>>>>>> p_fs_mount1 (ocf::heartbeat:Filesystem): >>>> Started node1 >>>>>>>>> p_fs_mount2 (ocf::heartbeat:Filesystem): >>>> Started node1 >>>>>>>>> p_ip_nfs (ocf::heartbeat:IPaddr2): >>>> Started node1 >>>>>>>>> Resource Group: g_apache >>>>>>>>> p_fs_mountbind1 (ocf::heartbeat:Filesystem): >>>> Started node1 >>>>>>>>> p_fs_mountbind2 (ocf::heartbeat:Filesystem): >>>> Started node1 >>>>>>>>> p_fs_mountbind3 (ocf::heartbeat:Filesystem): >>>> Started node1 >>>>>>>>> p_fs_varwww (ocf::heartbeat:Filesystem): >>>> Started node1 >>>>>>>>> p_apache (ocf::heartbeat:apache): >>>> Started node1 >>>>>>>>> Resource Group: g_fileservers >>>>>>>>> p_lsb_smb (lsb:smbd): Started node1 >>>>>>>>> p_lsb_nmb (lsb:nmbd): Started node1 >>>>>>>>> p_lsb_nfsserver (lsb:nfs-kernel-server): >>>> Started node1 >>>>>>>>> p_exportfs_mount1 (ocf::heartbeat:exportfs): >>>> Started node1 >>>>>>>>> p_exportfs_mount2 (ocf::heartbeat:exportfs): >>>> Started >>>>> node1 >>>>>>>>> >>>>>>>>> I have read through the Pacemaker Explained >>>>>>>>> >>>>>>>> >>>>>>> >>>>> >>>> >>> <http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained> >>>>>>>>> documentation, however could not find a way to further >>>> debug these >>>>>>>>> problems. 
First, I put node1 into standby mode to >>>> attempt failover to >>>>>>>>> the other node (node2). Node2 appeared to start the >>>> transition to >>>>>>>>> master, however it failed to promote the DRBD >>>> resources to master (the >>>>>>>>> first step). I have attached a copy of this session in >>>> commands.log and >>>>>>>>> additional excerpts from /var/log/syslog during >>>> important steps. I have >>>>>>>>> attempted everything I can think of to try and start >>>> the DRBD resource >>>>>>>>> (e.g. start/stop/promote/manage/cleanup under crm >>>> resource, restarting >>>>>>>>> heartbeat) but cannot bring it out of the slave state. >>>> However, if >>>>> I set >>>>>>>>> it to unmanaged and then run drbdadm primary all in >>>> the terminal, >>>>>>>>> pacemaker is satisfied and continues starting the rest >>>> of the >>>>> resources. >>>>>>>>> It then failed when attempting to mount the filesystem >>>> for mount2, the >>>>>>>>> p_fs_mount2 resource. I attempted to mount the >>>> filesystem myself >>>>> and was >>>>>>>>> successful. I then unmounted it and ran cleanup on >>>> p_fs_mount2 and then >>>>>>>>> it mounted. The rest of the resources started as >>>> expected until the >>>>>>>>> p_exportfs_mount2 resource, which failed as follows: >>>>>>>>> p_exportfs_mount2 (ocf::heartbeat:exportfs): >>>> started node2 >>>>>>>>> (unmanaged) FAILED >>>>>>>>> >>>>>>>>> I ran cleanup on this and it started, however when >>>> running this test >>>>>>>>> earlier today no command could successfully start this >>>> exportfs >>>>>> resource. >>>>>>>>> >>>>>>>>> How can I configure pacemaker to better resolve these >>>> problems and be >>>>>>>>> able to bring the node up successfully on its own? >>>> What can I check to >>>>>>>>> determine why these failures are occuring? >>>> /var/log/syslog did not seem >>>>>>>>> to contain very much useful information regarding why >>>> the failures >>>>>>>> occurred. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> >>>>>>>>> Andrew >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> This body part will be downloaded on demand. 
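For that kind of "resource sits there FAILED until someone intervenes" situation, these are the usual knobs -- again only a sketch, and the values are placeholders to tune:

# crm_mon -1f
(shows the per-resource fail counts alongside the status)
# crm resource failcount p_exportfs_mount2 show node2
# crm resource cleanup p_exportfs_mount2
(clears the failure so the policy engine tries the resource again)
# crm configure rsc_defaults migration-threshold="3" failure-timeout="10min"
(lets failures expire on their own instead of requiring a manual cleanup every time)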
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org