Hello,

thanks for the information. I was looking at this page:
http://www.drbd.org/users-guide/s-pacemaker-fencing.html

I specified the following handlers:

handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}
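If I read that page correctly, its full example also sets the disk-level
fencing policy, not just the handlers, so the complete section would look
roughly like this (the resource name is a placeholder):

resource <resource> {
    disk {
        # invoke fence-peer when the replication link is lost;
        # "resource-only" does not block I/O while the handler runs
        fencing resource-only;
    }
    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
}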
I disconnected the network cable between the cluster nodes (corosync and
DRBD both use this link). I could see that the fence script added a
constraint:

location drbd-fence-by-handler-ms-drbd-supervision ms-drbd-supervision \
    rule $id="drbd-fence-by-handler-rule-ms-drbd-supervision" $role="Master" -inf: #uname ne host

But this left DRBD in:

cs:StandAlone ro:Secondary/Unknown ds:UpToDate/Outdated

I don't really understand what I should expect from those handlers.
When cleaning up the errors, I should delete the constraint, right?
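As far as I understand, crm-unfence-peer.sh would normally remove that
constraint itself after a successful resync, but with the connection in
StandAlone no resync ever happens. For the manual cleanup I have something
like this in mind (ids taken from the output above; untested):

# drop the constraint placed by crm-fence-peer.sh
crm configure delete drbd-fence-by-handler-ms-drbd-supervision

# clear the failed actions on the master/slave resource
crm resource cleanup ms-drbd-supervision

# re-establish the replication link (resource name is a placeholder)
drbdadm connect <resource>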
Regards,

Hugo

On 26 July 2011 19:27, Digimer <li...@alteeve.com> wrote:
> On 07/26/2011 11:43 AM, Lars Ellenberg wrote:
> > On Wed, Jul 20, 2011 at 11:36:25AM -0400, Digimer wrote:
> >> On 07/20/2011 11:24 AM, Hugo Deprez wrote:
> >>> Hello Andrew,
> >>>
> >>> in fact DRBD was in standalone mode but the cluster was working :
> >>>
> >>> Here is the syslog of the drbd's split brain :
> >>>
> >>> Jul 15 08:45:34 node1 kernel: [1536023.052245] block drbd0: Handshake successful: Agreed network protocol version 91
> >>> Jul 15 08:45:34 node1 kernel: [1536023.052267] block drbd0: conn( WFConnection -> WFReportParams )
> >>> Jul 15 08:45:34 node1 kernel: [1536023.066677] block drbd0: Starting asender thread (from drbd0_receiver [23281])
> >>> Jul 15 08:45:34 node1 kernel: [1536023.066863] block drbd0: data-integrity-alg: <not-used>
> >>> Jul 15 08:45:34 node1 kernel: [1536023.079182] block drbd0: drbd_sync_handshake:
> >>> Jul 15 08:45:34 node1 kernel: [1536023.079190] block drbd0: self BBA9B794EDB65CDF:9E8FB52F896EF383:C5FE44742558F9E1:1F9E06135B8E296F bits:75338 flags:0
> >>> Jul 15 08:45:34 node1 kernel: [1536023.079196] block drbd0: peer 8343B5F30B2BF674:9E8FB52F896EF382:C5FE44742558F9E0:1F9E06135B8E296F bits:769 flags:0
> >>> Jul 15 08:45:34 node1 kernel: [1536023.079200] block drbd0: uuid_compare()=100 by rule 90
> >>> Jul 15 08:45:34 node1 kernel: [1536023.079203] block drbd0: Split-Brain detected, dropping connection!
> >>> Jul 15 08:45:34 node1 kernel: [1536023.079439] block drbd0: helper command: /sbin/drbdadm split-brain minor-0
> >>> Jul 15 08:45:34 node1 kernel: [1536023.083955] block drbd0: meta connection shut down by peer.
> >>> Jul 15 08:45:34 node1 kernel: [1536023.084163] block drbd0: conn( WFReportParams -> NetworkFailure )
> >>> Jul 15 08:45:34 node1 kernel: [1536023.084173] block drbd0: asender terminated
> >>> Jul 15 08:45:34 node1 kernel: [1536023.084176] block drbd0: Terminating asender thread
> >>> Jul 15 08:45:34 node1 kernel: [1536023.084406] block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
> >>> Jul 15 08:45:34 node1 kernel: [1536023.084420] block drbd0: conn( NetworkFailure -> Disconnecting )
> >>> Jul 15 08:45:34 node1 kernel: [1536023.084430] block drbd0: error receiving ReportState, l: 4!
> >>> Jul 15 08:45:34 node1 kernel: [1536023.084789] block drbd0: Connection closed
> >>> Jul 15 08:45:34 node1 kernel: [1536023.084813] block drbd0: conn( Disconnecting -> StandAlone )
> >>> Jul 15 08:45:34 node1 kernel: [1536023.086345] block drbd0: receiver terminated
> >>> Jul 15 08:45:34 node1 kernel: [1536023.086349] block drbd0: Terminating receiver thread
> >>
> >> This was a DRBD split-brain, not a pacemaker split. I think that might
> >> have been the source of confusion.
> >>
> >> The split brain occurs when both DRBD nodes lose contact with one
> >> another and then proceed as StandAlone/Primary/UpToDate. To avoid this,
> >> configure fencing (stonith) in Pacemaker, then use 'crm-fence-peer.sh'
> >> in drbd.conf:
> >>
> >> ===
> >> disk {
> >>     fencing resource-and-stonith;
> >> }
> >>
> >> handlers {
> >>     outdate-peer "/path/to/crm-fence-peer.sh";
> >> }
> >> ===
> >
> > Thanks, that is basically right.
> > Let me fill in some details, though:
> >
> >> This will tell DRBD to block (resource) and fence (stonith). DRBD will
> >
> > drbd fencing options are "fencing resource-only"
> > and "fencing resource-and-stonith".
> >
> > "resource-only" does *not* block IO while the fencing handler runs.
> >
> > "resource-and-stonith" does block IO.
>
> Ahhh, that's why I was confused. I thought the 'resource' meant the same
> thing in both cases, but had only read the 'resource-and-stonith' section.
>
> >> not resume IO until either the fence script exits with a success, or
> >> until an admin types 'drbdadm resume-io <res>'.
> >
> >> The CRM script simply calls pacemaker and asks it to fence the other
> >> node.
> >
> > No. It tries to place a constraint forcing the Master role off of any
> > node but the one with the good data.
>
> Ok, I thought it was akin to the 'obliterate-peer.sh' script, which
> calls 'fence_node'... I made an assumption, which was not correct.
>
> >> When a node has actually failed, then the lost node is fenced. If
> >> both nodes are up but disconnected, as you had, then only the fastest
> >> node will succeed in calling the fence, and the slower node will be
> >> fenced before it can call a fence.
> >
> > "fenced" may be "restricted from being/becoming Master" by that fencing
> > constraint. Or, if pacemaker decided to do so, actually "shot" by some
> > node level fencing agent (stonith).
> >
> > All that resource-level fencing by placing some constraint stuff
> > obviously only works as long as the cluster communication is still up.
> > If not only the drbd replication link had issues, but the cluster
> > communication was down as well, it becomes a bit more complex.
>
> Thanks for the clarity. Today I learned. :)
>
> --
> Digimer
> E-Mail: digi...@alteeve.com
> Freenode handle: digimer
> Papers and Projects: http://alteeve.com
> Node Assassin: http://nodeassassin.org
> "At what point did we forget that the Space Shuttle was, essentially,
> a program that strapped human beings to an explosion and tried to stab
> through the sky with fire and math?"
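P.S. For the archives: if anyone runs into a split-brain like the one in
the log above, the manual recovery procedure from the DRBD user's guide
is, as far as I understand it, roughly the following (which node is the
victim is your call; the resource name is a placeholder):

# on the node whose changes will be discarded (the victim):
drbdadm secondary <resource>
drbdadm -- --discard-my-data connect <resource>

# on the node with the good data (the survivor), if it is also StandAlone:
drbdadm connect <resource>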
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org