28.02.2013 07:21, Andrew Beekhof wrote:
> On Tue, Feb 26, 2013 at 7:36 PM, Vladislav Bogdanov
> <bub...@hoster-ok.com> wrote:
>> 26.02.2013 11:10, Andrew Beekhof wrote:
>>> On Mon, Feb 18, 2013 at 6:18 PM, Vladislav Bogdanov
>>> <bub...@hoster-ok.com> wrote:
>>>> Hi Andrew, all,
>>>>
>>>> I had an idea last night, that it may be worth implementing
>>>> fully-dynamic cluster resize support in pacemaker,
>>>
>>> We already support nodes being added on the fly. As soon as they show
>>> up in the membership we add them to the cib.
>>
>> Membership (runtime.totem.pg.mrp.srp.members) or nodelist (nodelist.node)?
>
> To my knowledge, only one (first) gets updated at runtime.
> Even if nodelist.node could be updated dynamically, we'd have to poll
> or be prompted to find out.

It can; please see the end of cmap_keys(8).
Please also see cmap_track_add(3) for the CMAP_TRACK_PREFIX flag (and my
original message ;) ).

>>
>> I recall that when I migrated from corosync 1.4 to 2.0 (somewhere near
>> pacemaker 1.1.8 release time) and replaced old-style UDPU member list
>> with nodelist.node, I saw all nodes configured in that nodelist appeared
>> in a CIB. For me that was a regression, because with old-style config
>> (and corosync 1.4) CIB contained only nodes seen online (4 of 16).
>
> That was a loophole that only worked when the entire cluster had been
> down and the <nodes> section was empty.

Aha, that is what I've been hit by.

> People filed bugs explicitly asking for that loophole to be closed
> because it was inconsistent with what the cluster did on every
> subsequent startup.

That is what I'm interested in too. And what I propose should fix that
as well.

>
>> That
>> would be OK if number of clone instances does not raise with that...
>
> Why? If clone-node-max=1, then you'll never have more than the number
> of active nodes - even if clone-max is greater.

Active (online) or known (existing in the <nodes> section)?
I've seen that as soon as a node appears in <nodes>, even in the offline
state, a new clone instance is allocated.

Also, on one cluster running post-1.1.7 with the openais plugin I have 16
nodes configured in totem.interface.members, but only three nodes in the
<nodes> CIB section, and I'm able to allocate at least 8-9 clone
instances with clone-max. I believe that pacemaker does not query
totem.interface.members directly with the openais plugin, and
runtime.totem.pg.mrp.srp.members has only three nodes.
Did that behavior change recently?

>
>>
>>
>>> For node removal we do require crm_node --remove.
>>>
>>> Is this not sufficient?
>>
>> I think it would be more straight-forward if there is only one origin of
>> membership information for entire cluster stack, so proposal is to
>> automatically remove node from CIB when it disappears from corosync
>> nodelist (due to removal by admin). That nodelist is not dynamic (read
>> from a config and then may be altered with cmapctl).
>
> Ok, but there still needs to be a trigger.
> Otherwise we waste cycles continuously polling corosync for something
> that is probably never going to happen.

Please see above (cmap_track_add).

>
> Btw. crm_node doesn't just remove the node from the cib, its existence
> is preserved in a number of caches which need to be purged.

That could be done in the cmap_track_add callback function too, I think.

> It could be possible to have crm_node also use the CMAP API to remove
> it from the running corosync, but something would still need to edit
> corosync.conf

Yes, that is up to the admin.
Btw, I am thinking more about the scenario Fabio explains in the
'allow_downscale' section of votequorum(8) - that is the one I'm
interested in.

>
> IIRC, pcs handles all three components (corosync.conf, CMAP, crm_node)
> as well as the "add" case.

Good to know. But I'm not ready to switch to it yet.

>
>> Of course, it is possible to use crm_node to remove node from CIB too
>> after it disappeared from corosync, but that is not as elegant as
>> automatic one IMHO. And, that should not be very difficult to implement.
>>
>>>
>>>> utilizing
>>>> possibilities CMAP and votequorum provide.
>>>>
>>>> Idea is to:
>>>> * Do not add nodes from nodelist to CIB if their join-count in cmap is
>>>> zero (but do not touch CIB nodes which exist in a nodelist and have zero
>>>> join-count in cmap).
>>>> * Install watches on a cmap nodelist.node and
>>>> runtime.totem.pg.mrp.srp.members subtrees (cmap_track_add).
>>>> * Add missing nodes to CIB as soon as they are both
>>>> ** defined in a nodelist
>>>> ** their join count becomes non-zero.
>>>> * Remove nodes from CIB when they are removed from a nodelist.
>>>
>>> From _a_ nodelist or _the_ (optional) corosync nodelist?
>>
>> From the nodelist.node subtree of CMAP tree.
>>
>>>
>>> Because removing a node from the cluster because it shut down is... an
>>> interesting idea.
>>
>> BTW even that could be possible if quorum.allow_downscale is enabled,
>> but requires much more thinking and probably more functionality from
>> corosync. I'm not ready to comment on that yet though.
>
> "A node left but I still have quorum" is very different to "a node
> left... what node?".
> Also, what happens after you fence a node... do we forget about it too?

quorum.allow_downscale takes effect only if a node leaves the cluster
in a clean state.

But, from what I know, corosync does not remove a node from
nodelist.node, either by itself or on request from votequorum; that's
why I talk about "more functionality from corosync".

If votequorum could distinguish a "static" node (listed in the config)
from a "dynamic" node (added on the fly), and manage the list of
"dynamic" ones when allow_downscale is enabled, that would do the trick.

>
>>
>> I was about node removal from a CMAP's nodelist with corosync_cmapctl
>> command. Of course, absence of (optional) nodelist in CMAP would result
>> in NOOP because there is no removal event on a nodelist.node tree from cmap.
>>
>>>
>>>> Certainly, this requires some CMAP values (especially votequorum ones
>>>> and may be totem mode) to have some 'well-known' values, f.e. only UDPU
>>>> mode and quorum.allow_downscale=1, that should be defined yet.
>>>>
>>>> May be, it also have sense to make this depend on some new CMAP
>>>> variable, f.e. nodelist.dynamic=1.
>>>>
>>>> I would even try to implement this if general agreement is gained and
>>>> nobody else wants to implement this.
>>>>
>>>> Can you please comment on this?
>>>>
>>>> Vladislav

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
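
To make the rules from the "Idea is to:" list in the original message easier to discuss, here is a minimal sketch of them as a toy model in Python. It is only an illustration under stated assumptions: the class and method names (ClusterView, on_nodelist_added and so on) are hypothetical, the three sets merely stand in for nodelist.node, the per-member join counts and the CIB <nodes> section, and nothing here calls corosync or pacemaker. In a real implementation the three handlers would presumably be driven by cmap_track_add callbacks instead of direct calls, and node removal would also have to purge the caches Andrew mentions.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterView:
    """Toy model of the node sets discussed in the thread (all hypothetical)."""
    nodelist: set = field(default_factory=set)      # stands in for nodelist.node
    join_count: dict = field(default_factory=dict)  # stands in for per-member join counts
    cib_nodes: set = field(default_factory=set)     # stands in for the CIB <nodes> section

    def on_nodelist_added(self, node):
        """A node appeared in the nodelist (e.g. added by the admin)."""
        self.nodelist.add(node)
        self._maybe_add_to_cib(node)

    def on_nodelist_removed(self, node):
        """A node was removed from the nodelist: drop it from the CIB too."""
        self.nodelist.discard(node)
        self.cib_nodes.discard(node)

    def on_join_count_changed(self, node, count):
        """A member's join count changed; a zero count never removes a CIB node."""
        self.join_count[node] = count
        self._maybe_add_to_cib(node)

    def _maybe_add_to_cib(self, node):
        # Add only when the node is both defined in the nodelist
        # and has been seen joining at least once.
        if node in self.nodelist and self.join_count.get(node, 0) > 0:
            self.cib_nodes.add(node)
```

With this model, a node that is listed but has never joined stays out of cib_nodes, a node that joined once stays in cib_nodes even if its join count is later reported as zero, and only removal from the nodelist takes it out again.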