Hello Francois can you post your cluster configuration using pastebin?
Thanks 2012/5/1 Francois Gaudreault <fgaudrea...@inverse.ca> > Hi, > > I am not sure if this is the right mailing list to post for our problems, > let me know if this is more a Corosync issue. > > We are facing a weird behavior with Corosync/Pacemaker/DRBD, and I expect > to have some inputs and help to troubleshoot what is going on. Let me try > to explain the best I can what we do. > > We have a cluster with two node using DRBD running SQL stuff. We do have > a location constraint that tells Pacemaker that DRBD should be primary on > NODE A when able. When we migrate the services to NODE B, and we reboot > NODE A, we are facing split brains every single time. > > From what we can see in the logs, it appears that the DRBD resource, for > some reason, is not waiting for getting an established connection (to get > initial sync) before changing its role to Primary. (I apologize for the > length of the log): > > May 1 10:30:34 npf1 kernel: d-con mysql: Starting worker thread (from > drbdsetup [1960]) > May 1 10:30:34 npf1 kernel: block drbd0: disk( Diskless -> Attaching ) > May 1 10:30:34 npf1 kernel: block drbd0: Method to ensure write ordering: > barrier > May 1 10:30:34 npf1 kernel: block drbd0: max BIO size = 1048576 > May 1 10:30:34 npf1 kernel: block drbd0: drbd_bm_resize called with > capacity == 143355552 > May 1 10:30:34 npf1 kernel: block drbd0: resync bitmap: bits=17919444 > words=279992 pages=547 > May 1 10:30:34 npf1 kernel: block drbd0: size = 68 GB (71677776 KB) > May 1 10:30:34 npf1 kernel: block drbd0: bitmap READ of 547 pages took 26 > jiffies > May 1 10:30:34 npf1 kernel: block drbd0: recounting of set bits took > additional 1 jiffies > May 1 10:30:34 npf1 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync > by on disk bit-map. > May 1 10:30:34 npf1 kernel: block drbd0: disk( Attaching -> UpToDate ) > May 1 10:30:34 npf1 kernel: block drbd0: attached to UUIDs > 343EF8A0E434C9D8:**0000000000000000:**282208A916EB5687:**282108A916EB5687 > May 1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:start:stdout) > May 1 10:30:34 npf1 kernel: d-con mysql: conn( StandAlone -> Unconnected ) > May 1 10:30:34 npf1 kernel: d-con mysql: Starting receiver thread (from > drbd_w_mysql [1961]) > May 1 10:30:34 npf1 kernel: d-con mysql: receiver (re)started > May 1 10:30:34 npf1 kernel: d-con mysql: conn( Unconnected -> > WFConnection ) > May 1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:start:stdout) > May 1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:start:stdout) > May 1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation > DRBD:0_start_0 (call=7, rc=0, cib-update=12, confirmed=true) ok > May 1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing > key=53:84:0:5bc3f587-ac97-**491b-b102-2325c4352589 op=DRBD:0_notify_0 ) > May 1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD:0:8: notify > May 1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:notify:stdout) > May 1 10:30:34 npf1 crmd: [1806]: info: send_direct_ack: ACK'ing resource > op DRBD:0_notify_0 from 53:84:0:5bc3f587-ac97-491b-**b102-2325c4352589: > lrm_invoke-lrmd-1335882634-4 > May 1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation > DRBD:0_notify_0 (call=8, rc=0, cib-update=0, confirmed=true) ok > May 1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing > key=58:84:0:5bc3f587-ac97-**491b-b102-2325c4352589 op=DRBD:0_notify_0 ) > May 1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD:0:9: notify > May 1 10:30:34 npf1 crmd: [1806]: info: send_direct_ack: ACK'ing resource > op DRBD:0_notify_0 from 58:84:0:5bc3f587-ac97-491b-**b102-2325c4352589: > lrm_invoke-lrmd-1335882634-5 > May 1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation > DRBD:0_notify_0 (call=9, rc=0, cib-update=0, confirmed=true) ok > May 1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing > key=20:84:0:5bc3f587-ac97-**491b-b102-2325c4352589 op=DRBD:0_promote_0 ) > May 1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD:0:10: promote > May 1 10:30:34 npf1 kernel: block drbd0: role( Secondary -> Primary ) > May 1 10:30:34 npf1 kernel: block drbd0: new current UUID > 66767846186A82BB:**343EF8A0E434C9D8:**282208A916EB5687:**282108A916EB5687 > May 1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:promote:stdout) > May 1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation > DRBD:0_promote_0 (call=10, rc=0, cib-update=13, confirmed=true) ok > May 1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing > key=59:84:0:5bc3f587-ac97-**491b-b102-2325c4352589 op=DRBD:0_notify_0 ) > May 1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD:0:11: notify > May 1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:notify:stdout) > May 1 10:30:34 npf1 crmd: [1806]: info: send_direct_ack: ACK'ing resource > op DRBD:0_notify_0 from 59:84:0:5bc3f587-ac97-491b-**b102-2325c4352589: > lrm_invoke-lrmd-1335882634-6 > May 1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation > DRBD:0_notify_0 (call=11, rc=0, cib-update=0, confirmed=true) ok > May 1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing > key=7:84:0:5bc3f587-ac97-491b-**b102-2325c4352589 op=DRBD_fs_start_0 ) > May 1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD_fs:12: start > May 1 10:30:34 npf1 Filesystem[2089]: INFO: Running start for /dev/drbd0 > on /var/lib/mysql > May 1 10:30:34 npf1 kernel: EXT4-fs (drbd0): mounted filesystem with > ordered data mode. Opts: > May 1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation > DRBD_fs_start_0 (call=12, rc=0, cib-update=14, confirmed=true) ok > May 1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing > key=8:84:0:5bc3f587-ac97-491b-**b102-2325c4352589 > op=DRBD_fs_monitor_120000 ) > May 1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD_fs:13: monitor > May 1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing > key=9:84:0:5bc3f587-ac97-491b-**b102-2325c4352589 op=MySQL_start_0 ) > May 1 10:30:34 npf1 lrmd: [1803]: info: rsc:MySQL:14: start > May 1 10:30:34 npf1 lrmd: [2176]: WARN: For LSB init script, no > additional parameters are needed. > May 1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation > DRBD_fs_monitor_120000 (call=13, rc=0, cib-update=15, confirmed=false) o > May 1 10:30:35 npf1 kernel: d-con mysql: Handshake successful: Agreed > network protocol version 100 > May 1 10:30:35 npf1 kernel: d-con mysql: conn( WFConnection -> > WFReportParams ) > May 1 10:30:35 npf1 kernel: d-con mysql: Starting asender thread (from > drbd_r_mysql [1973]) > May 1 10:30:35 npf1 kernel: block drbd0: drbd_sync_handshake: > May 1 10:30:35 npf1 kernel: block drbd0: self 66767846186A82BB:** > 343EF8A0E434C9D8:**282208A916EB5687:**282108A916EB5687 bits:1 flags:0 > May 1 10:30:35 npf1 kernel: block drbd0: peer D06952D9712CA916:** > 343EF8A0E434C9D8:**282208A916EB5686:**282108A916EB5687 bits:41 flags:0 > May 1 10:30:35 npf1 kernel: block drbd0: uuid_compare()=100 by rule 90 > May 1 10:30:35 npf1 kernel: block drbd0: helper command: /sbin/drbdadm > initial-split-brain minor-0 > May 1 10:30:35 npf1 kernel: block drbd0: helper command: /sbin/drbdadm > initial-split-brain minor-0 exit code 0 (0x0) > May 1 10:30:35 npf1 kernel: block drbd0: Split-Brain detected, 1 > primaries, automatically solved. Sync from this node > May 1 10:30:35 npf1 kernel: block drbd0: peer( Unknown -> Secondary ) > conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent ) > > The only solutions we found is to put automatic SB resolving and > resource-stickiness, but we do NOT want auto split brain recovery and we > want the services to go back on NODE A when the cluster is back to a sane > state. > > I can provide the configs if needed. > > Thanks for your help! > > -- > Francois Gaudreault, ing. jr > fgaudrea...@inverse.ca :: +1.514.447.4918 (x130) :: www.inverse.ca > Inverse inc. :: Leaders behind SOGo (www.sogo.nu) and PacketFence ( > www.packetfence.org) > > ______________________________**_________________ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/**mailman/listinfo/pacemaker<http://oss.clusterlabs.org/mailman/listinfo/pacemaker> > > Project Home: http://www.clusterlabs.org > Getting started: > http://www.clusterlabs.org/**doc/Cluster_from_Scratch.pdf<http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf> > Bugs: http://bugs.clusterlabs.org > -- esta es mi vida e me la vivo hasta que dios quiera
_______________________________________________ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org