On Tue, Sep 6, 2011 at 5:27 PM, Vladislav Bogdanov <bub...@hoster-ok.com> wrote:
> Hi Andrew, hi all,
>
> I'm further investigating the dlm lockspace hangs I described in
> https://www.redhat.com/archives/cluster-devel/2011-August/msg00133.html
> and in the thread starting from
> https://lists.linux-foundation.org/pipermail/openais/2011-September/016701.html
>
> What I described there is a setup involving pacemaker-1.1.6 with
> corosync-1.4.1 and dlm_controld.pcmk from cluster-3.0.17 (without cman).
> I use the openais stack for pacemaker.
>
> I found that it is possible to reproduce the dlm kern_stop state across the
> whole cluster with iptables on just one node: it is sufficient to block
> all (or just corosync-specific) incoming/outgoing UDP for several
> seconds (the exact time probably depends on corosync settings). In my case I
> reproduced the hang with a 3-second traffic block:
>
> iptables -I INPUT 1 -p udp -j REJECT; \
> iptables -I OUTPUT 1 -p udp -j REJECT; \
> sleep 3; \
> iptables -D INPUT 1; \
> iptables -D OUTPUT 1
>
> I tried to make dlm_controld schedule fencing on a CPG_REASON_NODEDOWN
> event (just to see whether it helps with the problems I described in the
> posts referenced above), but without much success; the following code does
> not work:
>
> int fd = pcmk_cluster_fd;
> int rc = crm_terminate_member_no_mainloop(nodeid, NULL, &fd);
>
> I get a "Could not kick node XXX from the cluster" message accompanied
> by "No connection to the cluster". That means that
> attrd_update_no_mainloop() fails.
>
> Andrew, could you please give some pointers on why it may fail? I'd then
> try to fix dlm_controld. I do not see any other uses of that function
> except in dlm_controld.pcmk.
I can't think of anything except that attrd might not be running. Is it?

Regardless, for 1.1.6 the dlm would be better off making a call like:

   rc = st->cmds->fence(st, st_opts, target, "reboot", 120);

from fencing/admin.c

That would talk directly to the fencing daemon, bypassing attrd, crmd and
the PE - and thus be more reliable. This is what the cman plugin will be
doing soon too.

> I agree with Jiaju
> (https://lists.linux-foundation.org/pipermail/openais/2011-September/016713.html)
> that this could be solely a pacemaker problem, because pacemaker itself
> should probably originate fencing in such a situation, I think.
>
> So, using pacemaker/dlm with the openais stack is currently risky due to
> possible hangs of dlm lockspaces.

It shouldn't be; failing to connect to attrd is very unusual.

> Originally I ran into it due to heavy load
> on one of the cluster nodes (actually on the host which has that cluster
> node running as a virtual guest).
>
> Ok, I switched to cman to see if it helps. Fencing is configured in
> pacemaker, not in cluster.conf.
>
> Things became even worse ;( .
>
> Although it took 25 seconds instead of 3 to break the cluster (I
> understand it is almost impossible to load a host that much, but
> anyway), I then got a real nightmare: two nodes of the 3-node cluster had
> cman stopped (and pacemaker too, because of the cman connection loss) - they
> each asked kick_node_from_cluster() to evict the other, and that succeeded.
> But fencing didn't happen (I still need to look into why, but this is
> cman-specific).
> The remaining node had pacemaker hung; it didn't even
> notice the cluster infrastructure change - the down nodes were listed as
> online, one of them was the DC, and all resources were marked as started
> on all nodes (the down ones too). No log entries from pacemaker at all.

Well, I can't see any logs from anyone, so it's hard for me to comment.

> So, from my PoV cman+pacemaker is not currently suitable for HA tasks either.
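For reference, a fuller sketch of the direct fencing call suggested above, modelled on fencing/admin.c. This is an untested sketch assuming the 1.1.6-era stonith-ng client API (stonith_api_new()/connect()/fence()); the "dlm_controld" client name, the st_opt_sync_call option, and the 120s timeout are illustrative choices, and the code needs a live stonithd to actually run:

```c
#include <crm/stonith-ng.h>  /* Pacemaker's fencing client API (assumed header) */

/* Ask stonithd to reboot 'target' directly, bypassing attrd, crmd and
 * the PE. Returns stonith_ok (0) on success, a stonith error otherwise. */
static int
fence_node_directly(const char *target)
{
    stonith_t *st = stonith_api_new();
    int rc = st->cmds->connect(st, "dlm_controld", NULL);

    if (rc == stonith_ok) {
        /* "reboot" with a 120-second timeout, as in the call above;
         * st_opt_sync_call makes the request block until it completes */
        rc = st->cmds->fence(st, st_opt_sync_call, target, "reboot", 120);
        st->cmds->disconnect(st);
    }
    stonith_api_delete(st);
    return rc;
}
```

Whether this is the right shape for dlm_controld's event loop (it blocks while the fencing operation runs) is a separate question; an async variant would register a callback instead.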
> That means that both possible alternatives are currently unusable if one
> needs a self-repairing pacemaker cluster with dlm support ;( That is
> really regrettable.
>
> I can provide all needed information and really hope that it is possible
> to fix both issues:
> * the dlm blockage with openais, and
> * the pacemaker lock-up with cman and no fencing from within dlm_controld
>
> I think both issues are really high priority, because it is definitely
> not acceptable that problems with load on one cluster node (or with the
> link to that node) lead to a total cluster lock-up or even crash.
>
> I also offer any possible assistance from my side (e.g. patch trials
> etc.) to get this all fixed. I can run either openais or cman and can
> quickly switch between the stacks.
>
> Sorry for not being brief,
>
> Best regards,
> Vladislav
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker