On Mon, Sep 26, 2011 at 6:41 PM, Vladislav Bogdanov <bub...@hoster-ok.com> wrote:
> 26.09.2011 11:16, Andrew Beekhof wrote:
> [snip]
>>>> Regardless, for 1.1.6 the dlm would be better off making a call like:
>>>>
>>>>     rc = st->cmds->fence(st, st_opts, target, "reboot", 120);
>>>>
>>>> from fencing/admin.c
>>>>
>>>> That would talk directly to the fencing daemon, bypassing attrd, crmd
>>>> and PE - and thus be more reliable.
>>>>
>>>> This is what the cman plugin will be doing soon too.
>>>
>>> Great to know, I'll try that in the near future. Thank you very much
>>> for the pointer.
>>
>> 1.1.7 will actually make use of this API regardless of any *_controld
>> changes - I'm in the middle of updating the two library functions they
>> use (crm_terminate_member and crm_terminate_member_no_mainloop).
>
> Ah, then I will try your patch and wait for that to be resolved.
>
>>>>> I agree with Jiaju
>>>>> (https://lists.linux-foundation.org/pipermail/openais/2011-September/016713.html)
>>>>> that this could be solely a pacemaker problem, because it probably
>>>>> should originate fencing itself in such a situation, I think.
>>>>>
>>>>> So, using pacemaker/dlm with the openais stack is currently risky due
>>>>> to possible hangs of dlm_lockspaces.
>>>>
>>>> It shouldn't be, failing to connect to attrd is very unusual.
>>>
>>> By the way, one of the underlying problems, which actually made me
>>> notice all this, is that a pacemaker cluster does not fence its DC if it
>>> leaves the cluster for a very short time. That is what Jiaju said in his
>>> notes. And I can confirm that.
>>
>> That's highly surprising. Do the logs you sent display this behaviour?
>
> They do. The rest of the cluster begins the election, but then accepts
> the returned DC back (I write this from memory, I looked at the logs
> Sep 5-6, so I may mix something up).

Actually, this might be possible - if DC.old came back before DC.new
had a chance to get elected, run the PE and initiate fencing, then
there would be no need to fence.

> [snip]
>>>>> Although it took 25 seconds instead of 3 to break the cluster (I
>>>>> understand it is almost impossible to load a host that much, but
>>>>> anyway), I then got a real nightmare: two nodes of the 3-node cluster
>>>>> had cman stopped (and pacemaker too, because of the cman connection
>>>>> loss) - they asked to kick_node_from_cluster() for each other, and
>>>>> that succeeded. But fencing didn't happen (I still need to look into
>>>>> why, but this is cman specific).
>
> Btw, the underlying logic of this part is tricky for me to understand:
> * cman just stops cman processes on remote nodes, disregarding the
>   quorum. I hope that could be fixed in corosync, if I understand one of
>   the latest threads there right.
> * But cman does not fence those nodes, and they still run resources.
>   This could be extremely dangerous under some circumstances. And cman
>   does not do fencing even if it has fence devices configured in
>   cluster.conf (I verified that).
>
>>>>> The remaining node had pacemaker hung: it didn't even notice the
>>>>> cluster infrastructure change, down nodes were listed as online, one
>>>>> of them was the DC, and all resources were marked as started on all
>>>>> nodes (including the down ones). No log entries from pacemaker at all.
>>>>
>>>> Well, I can't see any logs from anyone, so it's hard for me to comment.
>>>
>>> Logs are sent privately.
>
> Vladislav

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker