On 19/06/14 12:06 AM, Digimer wrote:
<snip>

After sending this, I found that adding:

handlers {
     fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
     after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}

allowed the constraint to be removed, so node 2 (an-a04n02) eventually
promoted, but not before going into the failed state shown above.

A subsequent stop -> start of pacemaker on both nodes started cleanly, with
no fence action reported in /var/log/messages. I noticed this time that the
drbd module was loaded; not sure if that made a difference.

Will keep testing... Any insight is much appreciated.

Ok, that didn't help... It's still resource-fencing on start *most* (not all) of the time.

When I start pacemaker, and pacemaker starts DRBD (nearly simultaneously on both nodes), I see this:

====
Jun 19 00:14:22 an-a04n01 crmd[16895]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Jun 19 00:14:22 an-a04n01 attrd[16893]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Jun 19 00:14:22 an-a04n01 pengine[16894]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jun 19 00:14:22 an-a04n01 pengine[16894]: notice: LogActions: Start fence_n01_ipmi#011(an-a04n01.alteeve.ca)
Jun 19 00:14:22 an-a04n01 pengine[16894]: notice: LogActions: Start fence_n02_ipmi#011(an-a04n02.alteeve.ca)
Jun 19 00:14:22 an-a04n01 pengine[16894]: notice: LogActions: Start drbd_r0:0#011(an-a04n01.alteeve.ca)
Jun 19 00:14:22 an-a04n01 pengine[16894]: notice: LogActions: Start drbd_r0:1#011(an-a04n02.alteeve.ca)
Jun 19 00:14:22 an-a04n01 pengine[16894]: notice: process_pe_message: Calculated Transition 0: /var/lib/pacemaker/pengine/pe-input-230.bz2
Jun 19 00:14:22 an-a04n01 crmd[16895]: notice: te_rsc_command: Initiating action 8: monitor fence_n01_ipmi_monitor_0 on an-a04n02.alteeve.ca
Jun 19 00:14:22 an-a04n01 crmd[16895]: notice: te_rsc_command: Initiating action 4: monitor fence_n01_ipmi_monitor_0 on an-a04n01.alteeve.ca (local)
Jun 19 00:14:22 an-a04n01 crmd[16895]: notice: te_rsc_command: Initiating action 9: monitor fence_n02_ipmi_monitor_0 on an-a04n02.alteeve.ca
Jun 19 00:14:22 an-a04n01 crmd[16895]: notice: te_rsc_command: Initiating action 5: monitor fence_n02_ipmi_monitor_0 on an-a04n01.alteeve.ca (local)
Jun 19 00:14:22 an-a04n01 crmd[16895]: notice: te_rsc_command: Initiating action 6: monitor drbd_r0:0_monitor_0 on an-a04n01.alteeve.ca (local)
Jun 19 00:14:22 an-a04n01 crmd[16895]: notice: te_rsc_command: Initiating action 10: monitor drbd_r0:1_monitor_0 on an-a04n02.alteeve.ca
Jun 19 00:14:23 an-a04n01 crmd[16895]: notice: process_lrm_event: LRM operation drbd_r0_monitor_0 (call=14, rc=7, cib-update=28, confirmed=true) not running
Jun 19 00:14:23 an-a04n01 crmd[16895]: notice: process_lrm_event: an-a04n01.alteeve.ca-drbd_r0_monitor_0:14 [ \n ]
Jun 19 00:14:23 an-a04n01 crmd[16895]: notice: te_rsc_command: Initiating action 3: probe_complete probe_complete on an-a04n01.alteeve.ca (local) - no waiting
Jun 19 00:14:23 an-a04n01 attrd[16893]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Jun 19 00:14:23 an-a04n01 attrd[16893]: notice: attrd_perform_update: Sent update 4: probe_complete=true
Jun 19 00:14:23 an-a04n01 crmd[16895]: notice: te_rsc_command: Initiating action 7: probe_complete probe_complete on an-a04n02.alteeve.ca - no waiting
Jun 19 00:14:23 an-a04n01 crmd[16895]: notice: te_rsc_command: Initiating action 11: start fence_n01_ipmi_start_0 on an-a04n01.alteeve.ca (local)
Jun 19 00:14:23 an-a04n01 crmd[16895]: notice: te_rsc_command: Initiating action 13: start fence_n02_ipmi_start_0 on an-a04n02.alteeve.ca
Jun 19 00:14:23 an-a04n01 crmd[16895]: notice: te_rsc_command: Initiating action 15: start drbd_r0:0_start_0 on an-a04n01.alteeve.ca (local)
Jun 19 00:14:24 an-a04n01 stonith-ng[16891]: notice: stonith_device_register: Device 'fence_n01_ipmi' already existed in device list (2 active devices)
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: te_rsc_command: Initiating action 17: start drbd_r0:1_start_0 on an-a04n02.alteeve.ca
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: process_lrm_event: LRM operation fence_n01_ipmi_start_0 (call=19, rc=0, cib-update=29, confirmed=true) ok
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: te_rsc_command: Initiating action 12: monitor fence_n01_ipmi_monitor_60000 on an-a04n01.alteeve.ca (local)
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: te_rsc_command: Initiating action 14: monitor fence_n02_ipmi_monitor_60000 on an-a04n02.alteeve.ca
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: process_lrm_event: LRM operation fence_n01_ipmi_monitor_60000 (call=24, rc=0, cib-update=30, confirmed=false) ok
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: Starting worker thread (from cqueue [3265])
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: disk( Diskless -> Attaching )
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: Found 4 transactions (126 active extents) in activity log.
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: Method to ensure write ordering: flush
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: drbd_bm_resize called with capacity == 909525832
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: resync bitmap: bits=113690729 words=1776418 pages=3470
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: size = 434 GB (454762916 KB)
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: bitmap READ of 3470 pages took 8 jiffies
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: recounting of set bits took additional 16 jiffies
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: disk( Attaching -> Consistent )
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: attached to UUIDs 561F3328043888C0:0000000000000000:052A1A6B59936EC5:05291A6B59936EC5
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: conn( StandAlone -> Unconnected )
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: Starting receiver thread (from drbd0_worker [17045])
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: receiver (re)started
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: conn( Unconnected -> WFConnection )
Jun 19 00:14:24 an-a04n01 attrd[16893]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_r0 (5)
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: process_lrm_event: LRM operation drbd_r0_start_0 (call=21, rc=0, cib-update=31, confirmed=true) ok
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: te_rsc_command: Initiating action 48: notify drbd_r0:0_post_notify_start_0 on an-a04n01.alteeve.ca (local)
Jun 19 00:14:24 an-a04n01 attrd[16893]: notice: attrd_perform_update: Sent update 9: master-drbd_r0=5
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: te_rsc_command: Initiating action 49: notify drbd_r0:1_post_notify_start_0 on an-a04n02.alteeve.ca
Jun 19 00:14:24 an-a04n01 attrd[16893]: notice: attrd_perform_update: Sent update 11: master-drbd_r0=5
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: process_lrm_event: LRM operation drbd_r0_notify_0 (call=28, rc=0, cib-update=0, confirmed=true) ok
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: run_graph: Transition 0 (Complete=23, Pending=0, Fired=0, Skipped=2, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-230.bz2): Stopped
Jun 19 00:14:24 an-a04n01 pengine[16894]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jun 19 00:14:24 an-a04n01 pengine[16894]: notice: LogActions: Promote drbd_r0:0#011(Slave -> Master an-a04n01.alteeve.ca)
Jun 19 00:14:24 an-a04n01 pengine[16894]: notice: LogActions: Promote drbd_r0:1#011(Slave -> Master an-a04n02.alteeve.ca)
Jun 19 00:14:24 an-a04n01 pengine[16894]: notice: process_pe_message: Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-231.bz2
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: te_rsc_command: Initiating action 52: notify drbd_r0_pre_notify_promote_0 on an-a04n01.alteeve.ca (local)
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: te_rsc_command: Initiating action 54: notify drbd_r0_pre_notify_promote_0 on an-a04n02.alteeve.ca
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: process_lrm_event: LRM operation drbd_r0_notify_0 (call=31, rc=0, cib-update=0, confirmed=true) ok
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: te_rsc_command: Initiating action 13: promote drbd_r0_promote_0 on an-a04n01.alteeve.ca (local)
Jun 19 00:14:24 an-a04n01 crmd[16895]: notice: te_rsc_command: Initiating action 16: promote drbd_r0_promote_0 on an-a04n02.alteeve.ca
Jun 19 00:14:24 an-a04n01 kernel: block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: Handshake successful: Agreed network protocol version 97
Jun 19 00:14:25 an-a04n01 crm-fence-peer.sh[17156]: invoked for r0
Jun 19 00:14:25 an-a04n01 cibadmin[17188]: notice: crm_log_args: Invoked: cibadmin -C -o constraints -X <rsc_location rsc="drbd_r0_Clone" id="drbd-fence-by-handler-r0-drbd_r0_Clone">#012 <rule role="Master" score="-INFINITY" id="drbd-fence-by-handler-r0-rule-drbd_r0_Clone">#012 <expression attribute="#uname" operation="ne" value="an-a04n01.alteeve.ca" id="drbd-fence-by-handler-r0-expr-drbd_r0_Clone"/>#012 </rule>#012</rsc_location>
Jun 19 00:14:25 an-a04n01 crmd[16895]: notice: handle_request: Current ping state: S_TRANSITION_ENGINE
Jun 19 00:14:25 an-a04n01 cib[16890]:   notice: cib:diff: Diff: --- 0.94.19
Jun 19 00:14:25 an-a04n01 cib[16890]: notice: cib:diff: Diff: +++ 0.95.1 4f095b8add6dcbb173de1254bf02fcf6
Jun 19 00:14:25 an-a04n01 cib[16890]: notice: cib:diff: -- <cib admin_epoch="0" epoch="94" num_updates="19"/>
Jun 19 00:14:25 an-a04n01 cib[16890]: notice: cib:diff: ++ <rsc_location rsc="drbd_r0_Clone" id="drbd-fence-by-handler-r0-drbd_r0_Clone">
Jun 19 00:14:25 an-a04n01 cib[16890]: notice: cib:diff: ++ <rule role="Master" score="-INFINITY" id="drbd-fence-by-handler-r0-rule-drbd_r0_Clone">
Jun 19 00:14:25 an-a04n01 cib[16890]: notice: cib:diff: ++ <expression attribute="#uname" operation="ne" value="an-a04n01.alteeve.ca" id="drbd-fence-by-handler-r0-expr-drbd_r0_Clone"/>
Jun 19 00:14:25 an-a04n01 cib[16890]:   notice: cib:diff: ++         </rule>
Jun 19 00:14:25 an-a04n01 cib[16890]: notice: cib:diff: ++ </rsc_location>
Jun 19 00:14:25 an-a04n01 stonith-ng[16891]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jun 19 00:14:25 an-a04n01 crm-fence-peer.sh[17156]: INFO peer is reachable, my disk is Consistent: placed constraint 'drbd-fence-by-handler-r0-drbd_r0_Clone'
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: helper command: /sbin/drbdadm fence-peer minor-0 exit code 4 (0x400)
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: fence-peer helper returned 4 (peer was fenced)
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: role( Secondary -> Primary ) disk( Consistent -> UpToDate ) pdsk( DUnknown -> Outdated )
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: new current UUID 25DF173CF8D89023:561F3328043888C0:052A1A6B59936EC5:05291A6B59936EC5
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: conn( WFConnection -> WFReportParams )
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: Starting asender thread (from drbd0_receiver [17062])
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: data-integrity-alg: <not-used>
Jun 19 00:14:25 an-a04n01 stonith-ng[16891]: notice: stonith_device_register: Device 'fence_n01_ipmi' already existed in device list (2 active devices)
Jun 19 00:14:25 an-a04n01 cib[16890]: warning: update_results: Action cib_create failed: Name not unique on network (cde=-76)
Jun 19 00:14:25 an-a04n01 cib[16890]: error: cib_process_create: CIB Update failures <failed>
Jun 19 00:14:25 an-a04n01 cib[16890]: error: cib_process_create: CIB Update failures <failed_update id="drbd-fence-by-handler-r0-drbd_r0_Clone" object_type="rsc_location" operation="cib_create" reason="Name not unique on network">
Jun 19 00:14:25 an-a04n01 cib[16890]: error: cib_process_create: CIB Update failures <rsc_location rsc="drbd_r0_Clone" id="drbd-fence-by-handler-r0-drbd_r0_Clone">
Jun 19 00:14:25 an-a04n01 cib[16890]: error: cib_process_create: CIB Update failures <rule role="Master" score="-INFINITY" id="drbd-fence-by-handler-r0-rule-drbd_r0_Clone">
Jun 19 00:14:25 an-a04n01 cib[16890]: error: cib_process_create: CIB Update failures <expression attribute="#uname" operation="ne" value="an-a04n02.alteeve.ca" id="drbd-fence-by-handler-r0-expr-drbd_r0_Clone"/>
Jun 19 00:14:25 an-a04n01 cib[16890]: error: cib_process_create: CIB Update failures </rule>
Jun 19 00:14:25 an-a04n01 cib[16890]: error: cib_process_create: CIB Update failures </rsc_location>
Jun 19 00:14:25 an-a04n01 cib[16890]: error: cib_process_create: CIB Update failures </failed_update>
Jun 19 00:14:25 an-a04n01 cib[16890]: error: cib_process_create: CIB Update failures </failed>
Jun 19 00:14:25 an-a04n01 cib[16890]: warning: cib_process_request: Completed cib_create operation for section constraints: Name not unique on network (rc=-76, origin=an-a04n02.alteeve.ca/cibadmin/2, version=0.95.1)
Jun 19 00:14:25 an-a04n01 stonith-ng[16891]: notice: stonith_device_register: Added 'fence_n02_ipmi' to the device list (2 active devices)
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: drbd_sync_handshake:
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: self 25DF173CF8D89023:561F3328043888C0:052A1A6B59936EC5:05291A6B59936EC5 bits:0 flags:0
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: peer 561F3328043888C0:0000000000000000:052A1A6B59936EC4:05291A6B59936EC5 bits:0 flags:0
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: uuid_compare()=1 by rule 70
Jun 19 00:14:25 an-a04n01 kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Consistent )
Jun 19 00:14:25 an-a04n01 crmd[16895]: notice: process_lrm_event: LRM operation drbd_r0_promote_0 (call=34, rc=0, cib-update=33, confirmed=true) ok
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: Began resync as SyncSource (will sync 0 KB [0 bits set]).
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: updated sync UUID 25DF173CF8D89023:56203328043888C0:561F3328043888C0:052A1A6B59936EC5
Jun 19 00:14:26 an-a04n01 cib[16890]:   notice: cib:diff: Diff: --- 0.95.2
Jun 19 00:14:26 an-a04n01 cib[16890]: notice: cib:diff: Diff: +++ 0.96.1 86f147e11a7e9934f7b2a686715dcca6
Jun 19 00:14:26 an-a04n01 cib[16890]: notice: cib:diff: -- <rsc_location rsc="drbd_r0_Clone" id="drbd-fence-by-handler-r0-drbd_r0_Clone">
Jun 19 00:14:26 an-a04n01 cib[16890]: notice: cib:diff: -- <rule role="Master" score="-INFINITY" id="drbd-fence-by-handler-r0-rule-drbd_r0_Clone">
Jun 19 00:14:26 an-a04n01 cib[16890]: notice: cib:diff: -- <expression attribute="#uname" operation="ne" value="an-a04n01.alteeve.ca" id="drbd-fence-by-handler-r0-expr-drbd_r0_Clone"/>
Jun 19 00:14:26 an-a04n01 cib[16890]:   notice: cib:diff: --         </rule>
Jun 19 00:14:26 an-a04n01 cib[16890]: notice: cib:diff: -- </rsc_location>
Jun 19 00:14:26 an-a04n01 cib[16890]: notice: cib:diff: ++ <cib admin_epoch="0" cib-last-written="Thu Jun 19 00:14:26 2014" crm_feature_set="3.0.7" epoch="96" have-quorum="1" num_updates="1" update-client="cibadmin" update-origin="an-a04n02.alteeve.ca" validate-with="pacemaker-1.2" dc-uuid="an-a04n01.alteeve.ca"/>
Jun 19 00:14:26 an-a04n01 stonith-ng[16891]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jun 19 00:14:26 an-a04n01 stonith-ng[16891]: notice: stonith_device_register: Device 'fence_n01_ipmi' already existed in device list (2 active devices)
Jun 19 00:14:26 an-a04n01 stonith-ng[16891]: notice: stonith_device_register: Added 'fence_n02_ipmi' to the device list (2 active devices)
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: updated UUIDs 25DF173CF8D89023:0000000000000000:56203328043888C0:561F3328043888C0
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: bitmap WRITE of 3470 pages took 9 jiffies
Jun 19 00:14:26 an-a04n01 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
====
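
A note for anyone reading the log: the `#011` and `#012` sequences are syslog's octal escapes for tab and newline, which is why the `cibadmin` invocation shows up as one long line. Below is a minimal Python sketch for undoing them, assuming every escape is `#` followed by exactly three octal digits. The function name is made up for illustration and is not part of any DRBD or Pacemaker tooling.

```python
import re

# '#' followed by exactly three octal digits, e.g. '#011' (tab) or '#012' (newline).
# Caveat: a message that legitimately contains text like '#123' would also be rewritten.
_ESCAPE = re.compile(r"#([0-7]{3})")


def decode_syslog_escapes(line: str) -> str:
    """Replace each '#NNN' octal escape with the character it stands for."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), line)


sample = "LogActions: Start drbd_r0:0#011(an-a04n01.alteeve.ca)"
print(decode_syslog_escapes(sample))
```

Running the `cibadmin` line through this should put the `<rsc_location>` constraint back onto separate lines, which makes it easier to compare with the `cib:diff` entries that follow it.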

It seems to immediately fence as soon as DRBD starts, and I can't see why it feels the need to do this...
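
For reference, the kernel line `fence-peer minor-0 exit code 4 (0x400)` appears to print both the helper's exit code and the raw wait-style status it was taken from. The small Python sketch below shows how the two numbers relate, assuming the usual layout where the exit code sits in bits 8 to 15 of the status. The only meaning listed is the one the log itself reports; the names are illustrative.

```python
def helper_exit_code(raw_status: int) -> int:
    """Extract the exit code from a wait()-style status word (bits 8..15)."""
    return (raw_status >> 8) & 0xFF


# Only the code that actually appears in the log is listed here; other
# fence-peer return values are deliberately left out rather than guessed.
FENCE_PEER_MEANINGS = {4: "peer was fenced"}

raw = 0x400  # value printed in parentheses by the kernel
code = helper_exit_code(raw)
print(code, FENCE_PEER_MEANINGS.get(code, "not listed here"))
```

So the handler itself reported success (4) on an-a04n01, after which DRBD marked the peer disk Outdated and promoted locally.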

RHEL 6.5, DRBD 8.3.16.

I am really stumped... any help would be much appreciated!

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
