Hi,

I am not sure if this is the right mailing list to post for our problems, let me know if this is more a Corosync issue.

We are facing a weird behavior with Corosync/Pacemaker/DRBD, and I expect to have some inputs and help to troubleshoot what is going on. Let me try to explain the best I can what we do.

We have a cluster with two node using DRBD running SQL stuff. We do have a location constraint that tells Pacemaker that DRBD should be primary on NODE A when able. When we migrate the services to NODE B, and we reboot NODE A, we are facing split brains every single time.

From what we can see in the logs, it appears that the DRBD resource, for some reason, is not waiting for getting an established connection (to get initial sync) before changing its role to Primary. (I apologize for the length of the log):

May 1 10:30:34 npf1 kernel: d-con mysql: Starting worker thread (from drbdsetup [1960])
May  1 10:30:34 npf1 kernel: block drbd0: disk( Diskless -> Attaching )
May 1 10:30:34 npf1 kernel: block drbd0: Method to ensure write ordering: barrier
May  1 10:30:34 npf1 kernel: block drbd0: max BIO size = 1048576
May 1 10:30:34 npf1 kernel: block drbd0: drbd_bm_resize called with capacity == 143355552 May 1 10:30:34 npf1 kernel: block drbd0: resync bitmap: bits=17919444 words=279992 pages=547
May  1 10:30:34 npf1 kernel: block drbd0: size = 68 GB (71677776 KB)
May 1 10:30:34 npf1 kernel: block drbd0: bitmap READ of 547 pages took 26 jiffies May 1 10:30:34 npf1 kernel: block drbd0: recounting of set bits took additional 1 jiffies May 1 10:30:34 npf1 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
May  1 10:30:34 npf1 kernel: block drbd0: disk( Attaching -> UpToDate )
May 1 10:30:34 npf1 kernel: block drbd0: attached to UUIDs 343EF8A0E434C9D8:0000000000000000:282208A916EB5687:282108A916EB5687
May  1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:start:stdout)
May  1 10:30:34 npf1 kernel: d-con mysql: conn( StandAlone -> Unconnected )
May 1 10:30:34 npf1 kernel: d-con mysql: Starting receiver thread (from drbd_w_mysql [1961])
May  1 10:30:34 npf1 kernel: d-con mysql: receiver (re)started
May 1 10:30:34 npf1 kernel: d-con mysql: conn( Unconnected -> WFConnection )
May  1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:start:stdout)
May  1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:start:stdout)
May 1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation DRBD:0_start_0 (call=7, rc=0, cib-update=12, confirmed=true) ok May 1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing key=53:84:0:5bc3f587-ac97-491b-b102-2325c4352589 op=DRBD:0_notify_0 )
May  1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD:0:8: notify
May  1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:notify:stdout)
May 1 10:30:34 npf1 crmd: [1806]: info: send_direct_ack: ACK'ing resource op DRBD:0_notify_0 from 53:84:0:5bc3f587-ac97-491b-b102-2325c4352589: lrm_invoke-lrmd-1335882634-4 May 1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation DRBD:0_notify_0 (call=8, rc=0, cib-update=0, confirmed=true) ok May 1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing key=58:84:0:5bc3f587-ac97-491b-b102-2325c4352589 op=DRBD:0_notify_0 )
May  1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD:0:9: notify
May 1 10:30:34 npf1 crmd: [1806]: info: send_direct_ack: ACK'ing resource op DRBD:0_notify_0 from 58:84:0:5bc3f587-ac97-491b-b102-2325c4352589: lrm_invoke-lrmd-1335882634-5 May 1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation DRBD:0_notify_0 (call=9, rc=0, cib-update=0, confirmed=true) ok May 1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing key=20:84:0:5bc3f587-ac97-491b-b102-2325c4352589 op=DRBD:0_promote_0 )
May  1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD:0:10: promote
May  1 10:30:34 npf1 kernel: block drbd0: role( Secondary -> Primary )
May 1 10:30:34 npf1 kernel: block drbd0: new current UUID 66767846186A82BB:343EF8A0E434C9D8:282208A916EB5687:282108A916EB5687
May  1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:promote:stdout)
May 1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation DRBD:0_promote_0 (call=10, rc=0, cib-update=13, confirmed=true) ok May 1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing key=59:84:0:5bc3f587-ac97-491b-b102-2325c4352589 op=DRBD:0_notify_0 )
May  1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD:0:11: notify
May  1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:notify:stdout)
May 1 10:30:34 npf1 crmd: [1806]: info: send_direct_ack: ACK'ing resource op DRBD:0_notify_0 from 59:84:0:5bc3f587-ac97-491b-b102-2325c4352589: lrm_invoke-lrmd-1335882634-6 May 1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation DRBD:0_notify_0 (call=11, rc=0, cib-update=0, confirmed=true) ok May 1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing key=7:84:0:5bc3f587-ac97-491b-b102-2325c4352589 op=DRBD_fs_start_0 )
May  1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD_fs:12: start
May 1 10:30:34 npf1 Filesystem[2089]: INFO: Running start for /dev/drbd0 on /var/lib/mysql May 1 10:30:34 npf1 kernel: EXT4-fs (drbd0): mounted filesystem with ordered data mode. Opts: May 1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation DRBD_fs_start_0 (call=12, rc=0, cib-update=14, confirmed=true) ok May 1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing key=8:84:0:5bc3f587-ac97-491b-b102-2325c4352589 op=DRBD_fs_monitor_120000 )
May  1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD_fs:13: monitor
May 1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing key=9:84:0:5bc3f587-ac97-491b-b102-2325c4352589 op=MySQL_start_0 )
May  1 10:30:34 npf1 lrmd: [1803]: info: rsc:MySQL:14: start
May 1 10:30:34 npf1 lrmd: [2176]: WARN: For LSB init script, no additional parameters are needed. May 1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation DRBD_fs_monitor_120000 (call=13, rc=0, cib-update=15, confirmed=false) o May 1 10:30:35 npf1 kernel: d-con mysql: Handshake successful: Agreed network protocol version 100 May 1 10:30:35 npf1 kernel: d-con mysql: conn( WFConnection -> WFReportParams ) May 1 10:30:35 npf1 kernel: d-con mysql: Starting asender thread (from drbd_r_mysql [1973])
May  1 10:30:35 npf1 kernel: block drbd0: drbd_sync_handshake:
May 1 10:30:35 npf1 kernel: block drbd0: self 66767846186A82BB:343EF8A0E434C9D8:282208A916EB5687:282108A916EB5687 bits:1 flags:0 May 1 10:30:35 npf1 kernel: block drbd0: peer D06952D9712CA916:343EF8A0E434C9D8:282208A916EB5686:282108A916EB5687 bits:41 flags:0
May  1 10:30:35 npf1 kernel: block drbd0: uuid_compare()=100 by rule 90
May 1 10:30:35 npf1 kernel: block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 May 1 10:30:35 npf1 kernel: block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0) May 1 10:30:35 npf1 kernel: block drbd0: Split-Brain detected, 1 primaries, automatically solved. Sync from this node May 1 10:30:35 npf1 kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )

The only solutions we found is to put automatic SB resolving and resource-stickiness, but we do NOT want auto split brain recovery and we want the services to go back on NODE A when the cluster is back to a sane state.

I can provide the configs if needed.

Thanks for your help!

--
Francois Gaudreault, ing. jr
fgaudrea...@inverse.ca  ::  +1.514.447.4918 (x130) ::  www.inverse.ca
Inverse inc. :: Leaders behind SOGo (www.sogo.nu) and PacketFence (www.packetfence.org)

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

Reply via email to