Hello,
I'm testing pacemaker with cman on CentOS 6.5, where I have a drbd resource in a classic primary/secondary setup with a master/slave configuration.
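For reference, this is roughly how I defined the DRBD resource and its master/slave wrapper in pacemaker (reconstructed from memory, so the monitor intervals below are only indicative, not necessarily my exact values):

# pcs resource create MyData ocf:linbit:drbd drbd_resource=res0 op monitor interval=29s role=Master op monitor interval=31s role=Slave
# pcs resource master ms_MyData MyData master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
# note: the two monitor intervals above are indicative values, not copied from my real CIB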
Relevant packages:
cman-3.0.12.1-59.el6_5.1.x86_64
pacemaker-1.1.10-14.el6_5.2.x86_64
kmod-drbd84-8.4.4-1.el6.elrepo.x86_64
drbd84-utils-8.4.4-2.el6.elrepo.x86_64
kernel 2.6.32-431.5.1.el6.x86_64

From the cman point of view, I delegated fencing to pacemaker with the fence_pcmk fence agent in cluster.conf.

From the pacemaker point of view:

# pcs cluster cib | grep cib-boot | egrep "quorum|stonith"
<nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
<nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>

From the drbd point of view:

resource res0 {
  disk {
    disk-flushes no;
    md-flushes no;
    fencing resource-only;
  }
  device minor 0;
  disk /dev/sdb;
  syncer {
    rate 30M;
    verify-alg md5;
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}

What is the expected behavior if I force a power-off of the primary node, i.e. the one where the resource is master? In my test, where I power off iclnode01, the status remains:

Last updated: Mon Mar 3 18:37:02 2014
Last change: Mon Mar 3 18:37:02 2014 via crmd on iclnode02
Stack: cman
Current DC: iclnode02 - partition WITHOUT quorum
Version: 1.1.10-14.el6_5.2-368c726
2 Nodes configured
12 Resources configured

Online: [ iclnode02 ]
OFFLINE: [ iclnode01 ]

Master/Slave Set: ms_MyData [MyData]
     Slaves: [ iclnode02 ]
     Stopped: [ iclnode01 ]

and

# cat /proc/drbd
version: 8.4.4 (api:1/proto:86-101)
GIT-hash: 599f286440bd633d15d5ff985204aff4bccffadd build by phil@Build64R6, 2013-10-14 15:33:06
 0: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/Outdated C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0

In the messages log I see that crm-fence-peer.sh did its job and placed the constraint:

Mar 3 18:25:35 node02 kernel: drbd res0: helper command: /sbin/drbdadm fence-peer res0
Mar 3 18:25:35 node02 crm-fence-peer.sh[7633]: invoked for res0
Mar 3 18:25:35 node02 cibadmin[7664]: notice: crm_log_args: Invoked: cibadmin -C -o constraints -X <rsc_location rsc="ms_MyData" id="drbd-fence-by-handler-res0-ms_MyData">#012 <rule role="Master" score="-INFINITY" id="drbd-fence-by-handler-res0-rule-ms_MyData">#012 <expression attribute="#uname" operation="ne" value="node02.localdomain.local" id="drbd-fence-by-handler-res0-expr-ms_MyData"/>#012 </rule>#012</rsc_location>
Mar 3 18:25:35 node02 stonith-ng[1894]: notice: unpack_config: On loss of CCM Quorum: Ignore
Mar 3 18:25:35 node02 cib[1893]: notice: cib:diff: Diff: --- 0.127.36
Mar 3 18:25:35 node02 cib[1893]: notice: cib:diff: Diff: +++ 0.128.1 6e071e71b96b076e87b27c299ba3057d
Mar 3 18:25:35 node02 cib[1893]: notice: cib:diff: -- <cib admin_epoch="0" epoch="127" num_updates="36"/>
Mar 3 18:25:35 node02 cib[1893]: notice: cib:diff: ++ <rsc_location rsc="ms_MyData" id="drbd-fence-by-handler-res0-ms_MyData">
Mar 3 18:25:35 node02 cib[1893]: notice: cib:diff: ++ <rule role="Master" score="-INFINITY" id="drbd-fence-by-handler-res0-rule-ms_MyData">
Mar 3 18:25:35 node02 cib[1893]: notice: cib:diff: ++ <expression attribute="#uname" operation="ne" value="node02.localdomain.local" id="drbd-fence-by-handler-res0-expr-ms_MyData"/>
Mar 3 18:25:35 node02 cib[1893]: notice: cib:diff: ++ </rule>
Mar 3 18:25:35 node02 cib[1893]: notice: cib:diff: ++ </rsc_location>
Mar 3 18:25:35 node02 crm-fence-peer.sh[7633]: INFO peer is not reachable, my disk is UpToDate: placed constraint 'drbd-fence-by-handler-res0-ms_MyData'
Mar 3 18:25:35 node02 kernel: drbd res0: helper command: /sbin/drbdadm fence-peer res0 exit code 5 (0x500)
Mar 3 18:25:35 node02 kernel: drbd res0: fence-peer helper returned 5 (peer is unreachable, assumed to be dead)
Mar 3 18:25:35 node02 kernel: drbd res0: pdsk( DUnknown -> Outdated )
Mar 3 18:25:35 node02 kernel: block drbd0: role( Secondary -> Primary )
Mar 3 18:25:35 node02 kernel: block drbd0: new current UUID 03E9D09641694365:B5B5224185905A78:83887B50434B5AB6:83877B50434B5AB6

but soon afterwards the resource is demoted again, because "monitor" found it active in master mode:

Mar 3 18:25:35 node02 crmd[1898]: notice: process_lrm_event: LRM operation MyData_promote_0 (call=305, rc=0, cib-update=90, confirmed=true) ok
Mar 3 18:25:35 node02 crmd[1898]: notice: te_rsc_command: Initiating action 54: notify MyData_post_notify_promote_0 on iclnode02 (local)
Mar 3 18:25:36 node02 crmd[1898]: notice: process_lrm_event: LRM operation MyData_notify_0 (call=308, rc=0, cib-update=0, confirmed=true) ok
Mar 3 18:25:36 node02 crmd[1898]: notice: run_graph: Transition 1 (Complete=9, Pending=0, Fired=0, Skipped=2, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-995.bz2): Stopped
Mar 3 18:25:36 node02 pengine[1897]: notice: unpack_config: On loss of CCM Quorum: Ignore
Mar 3 18:25:36 node02 pengine[1897]: notice: unpack_rsc_op: Operation monitor found resource MyData:0 active in master mode on iclnode02
Mar 3 18:25:36 node02 pengine[1897]: notice: LogActions: Demote MyData:0#011(Master -> Slave iclnode02)
Mar 3 18:25:36 node02 pengine[1897]: notice: process_pe_message: Calculated Transition 2: /var/lib/pacemaker/pengine/pe-input-996.bz2
Mar 3 18:25:36 node02 crmd[1898]: notice: te_rsc_command: Initiating action 53: notify MyData_pre_notify_demote_0 on iclnode02 (local)
Mar 3 18:25:36 node02 crmd[1898]: notice: process_lrm_event: LRM operation MyData_notify_0 (call=311, rc=0, cib-update=0, confirmed=true) ok
Mar 3 18:25:36 node02 crmd[1898]: notice: te_rsc_command: Initiating action 5: demote MyData_demote_0 on iclnode02 (local)
Mar 3 18:25:36 node02 kernel: block drbd0: role( Primary -> Secondary )
Mar 3 18:25:36 node02 kernel: block drbd0: bitmap WRITE of 0 pages took 0 jiffies
Mar 3 18:25:36 node02 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Mar 3 18:25:36 node02 crmd[1898]: notice: process_lrm_event: LRM operation MyData_demote_0 (call=314, rc=0, cib-update=92, confirmed=true) ok
Mar 3 18:25:36 node02 crmd[1898]: notice: te_rsc_command: Initiating action 54: notify MyData_post_notify_demote_0 on iclnode02 (local)
Mar 3 18:25:36 node02 crmd[1898]: notice: process_lrm_event: LRM operation MyData_notify_0 (call=317, rc=0, cib-update=0, confirmed=true) ok

Suppose I know that iclnode01 has a permanent problem and I cannot recover it for some time: what is the correct manual action, from the pacemaker point of view, to force iclnode02 to carry on the service? (I have a group configured in the standard way with colocation + order:

# pcs constraint colocation add Started my_group with Master ms_MyData INFINITY
# pcs constraint order promote ms_MyData then start my_group

) Also, is there any automated way to manage this kind of problem in two-node clusters? Can I solve it by configuring a stonith agent?

Thanks in advance,
Gianluca
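P.S. As a guess at the manual recovery step, I suppose I could simply remove the constraint that crm-fence-peer.sh placed and let pacemaker promote iclnode02 again, something like:

# pcs constraint remove drbd-fence-by-handler-res0-ms_MyData

but I'm not sure that is the correct or safe way. And if stonith really is the answer, I imagine the configuration would be something along the lines of the sketch below (fence_ipmilan is just an example agent, and the addresses and credentials are made-up placeholders, not my real setup):

# example only: one IPMI fence device per node, with placeholder BMC addresses and credentials
# pcs stonith create fence_node01 fence_ipmilan pcmk_host_list="iclnode01" ipaddr="192.168.1.101" login="admin" passwd="changeme" lanplus="1" op monitor interval=60s
# pcs stonith create fence_node02 fence_ipmilan pcmk_host_list="iclnode02" ipaddr="192.168.1.102" login="admin" passwd="changeme" lanplus="1" op monitor interval=60s
# pcs property set stonith-enabled=true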