12.03.2013 04:44, Andrew Beekhof wrote:
> On Thu, Mar 7, 2013 at 5:30 PM, Vladislav Bogdanov <bub...@hoster-ok.com> wrote:
>> 07.03.2013 03:37, Andrew Beekhof wrote:
>>> On Thu, Mar 7, 2013 at 2:41 AM, Vladislav Bogdanov <bub...@hoster-ok.com> wrote:
>>>> 06.03.2013 08:35, Andrew Beekhof wrote:
>>>>
>>>>> So basically, you want to be able to add/remove nodes from nodelist.*
>>>>> in corosync.conf and have pacemaker automatically add/remove them
>>>>> from itself?
>>>>
>>>> Not corosync.conf, but cmap, which is initially (partially) filled
>>>> with values from corosync.conf.
>>>>
>>>>> If corosync.conf gets out of sync (admin error, or maybe a node was
>>>>> down when you updated last) they might well get added back - I assume
>>>>> you're OK with that?
>>>>> Because there's no real way to know the difference between "added
>>>>> back" and "not removed from last time".
>>>>
>>>> Sorry, can you please reword?
>>>
>>> When node-A comes up with "node-X" that no-one else has, the cluster
>>> has no way to know if node-X was just added, or if the admin forgot to
>>> remove it on node-A.
>>
>> That is exactly not a problem if a node does not appear in the CIB
>> until it has been seen online.
>
> But that is at odds with the "read the corosync nodelist" part.
But it is not at odds with:
* Do not add nodes from the nodelist to the CIB if their join-count in
cmap is zero (but do not touch CIB nodes which exist in the nodelist
and have a zero join-count in cmap).

For the rest - I got your position on node deletion. It is not very
common to see so many words from you, so thank you for the interesting
discussion ;)

>
>> If node-A comes up, then it has just booted, and that means it has
>> not seen node-X online yet (if node-X is not actually online, of
>> course). And then node-X is not added to the CIB.
>>
>>>>> Or are you planning to never update the on-disk corosync.conf and
>>>>> only modify the in-memory nodelist?
>>>>
>>>> That depends on the actual use case, I think.
>>>>
>>>> Hm. Interesting how corosync behaves when new dynamic nodes are added
>>>> to the cluster... I mean the following: we have a static corosync.conf
>>>> with a nodelist containing e.g. 3 entries, then we add a fourth entry
>>>> via cmap and boot the fourth node. What should be in corosync.conf on
>>>> that node?
>>>
>>> I don't know actually. Try it and see if it works without the local
>>> node being defined?
>>>
>>>> I believe it won't work without _its own_ fourth entry there. Ugh.
>>>> If so, then the fully dynamic "elastic" cluster I was dreaming of is
>>>> still not possible out-of-the-box when using a dynamic nodelist.
>>>>
>>>> The only way I see to have this is a static nodelist in corosync.conf
>>>> with all possible nodes predefined - and never editing it in cmap. So
>>>> my original point
>>>> * Remove nodes from the CIB when they are removed from the nodelist.
>>>> does not fit.
>>>>
>>>> By elastic I mean what was discussed on the corosync list when Fabio
>>>> started with the votequorum design, and what then appeared in the
>>>> votequorum manpage:
>>>> ===========
>>>> allow_downscale: 1
>>>>
>>>> Enables the allow downscale (AD) feature (default: 0).
>>>>
>>>> The general behaviour of votequorum is to never decrease expected
>>>> votes or quorum.
>>>>
>>>> When AD is enabled, both expected votes and quorum are recalculated
>>>> when a node leaves the cluster in a clean state (normal corosync
>>>> shutdown process), down to the configured expected_votes.
>>>
>>> But that's very different to removing the node completely.
>>> You still want to know it is in a sane state.
>>
>> Isn't it enough to trust corosync here?
>
> Absolutely not.
> "Clean" to corosync means "did corosync on that node send me a message
> to say that it planned to exit".
> That implies nothing at all about the state of pacemaker or, more
> importantly, the cluster resources on that machine.
>
> In addition, "clean" is an internal corosync concept that is not
> reported to clients like pacemaker.
>
>> Of course - if it supplies some event saying "node X left the cluster
>> in a clean state and we lowered expected_votes and quorum".
>>
>> A clean corosync shutdown means either 'no more corosync clients
>> remained and it was safe to shut down' or 'corosync has a bug'.
>> Pacemaker is a corosync client, and corosync should not stop in a
>> clean state if pacemaker is still running there.
>>
>> And 'pacemaker is not running on node X' means that the pacemaker
>> instances on the other nodes accepted that. Otherwise the node is
>> scheduled for stonith and there is no 'clean' shutdown.
>>
>> Am I correct here?
>
> Not really, no.
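For reference, enabling that manpage behaviour in corosync 2.x should
look roughly like this in the quorum section of corosync.conf (a
minimal sketch; expected_votes: 3 matches the example use case quoted
below):

quorum {
    provider: corosync_votequorum
    expected_votes: 3
    allow_downscale: 1
}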
>
>>>
>>>> Example use case:
>>>>
>>>> 1) N node cluster (where N is any value higher than 3)
>>>> 2) expected_votes set to 3 in corosync.conf
>>>> 3) only 3 nodes are running
>>>> 4) admin requires to increase processing power and adds 10 nodes
>>>> 5) internal expected_votes is automatically set to 13
>>>> 6) minimum expected_votes is 3 (from configuration)
>>>> - up to this point this is standard votequorum behavior -
>>>> 7) once the work is done, admin wants to remove nodes from the cluster
>>>> 8) using an ordered shutdown the admin can reduce the cluster size
>>>> automatically back to 3, but not below 3, where normal quorum
>>>> operation will work as usual.
>>>> =============
>>>>
>>>> What I would expect from pacemaker is to automatically remove nodes
>>>> down to 3 at step 8 (just follow quorum) if AD is enabled AND
>>>> pacemaker is instructed to follow that (with some other cmap switch),
>>>> and also to reduce the number of allocated clone instances. Sure, all
>>>> nodes must have an equal number of votes (1).
>>>>
>>>> Is that OK for you?
>>>
>>> Not really.
>>> We simply don't have enough information to do the removal.
>>> All we get is "node gone"; we have to do a fair bit of work to
>>> calculate whether it was clean at the time or not (and clean to
>>> corosync doesn't always imply clean to pacemaker).
>>
>> Please see above.
>> There is always (at least with the mcp model) some time frame between
>> the pacemaker stop and corosync stop events.
>
> Relying on this would be a world of hurt.
> I have enough trouble dealing with timing issues to ever consider
> relying on one.
>
> Nodes routinely reappear in virtualised clusters before stonith
> reports they have been fenced.
> Or peers show up in the CPG membership 5 minutes before appearing
> in the cman/quorum membership.
>
>> And pacemaker should accept "node leave" after the first one (doesn't
>> it mark the node as 'pending' in that state?). And the second event
>> (corosync stop) is enough to remove that 'pending' node, I think.
>>
>> And I think there should be some special message from corosync (or
>> better, from votequorum)
>
> Please. Stop. votequorum excluding a node's vote is completely
> different to removing all trace of the node from the cluster.
> One is not a suitable trigger for the other.
>
>> saying 'I lowered quorum and expected_votes, accepting the clean leave
>> of node X'. Otherwise we'd have problems when two nodes leave at the
>> same time, but one is clean while the other is not. I hope the guys
>> who maintain corosync will accept a patch for that if it is not
>> possible to get a similar message with the current implementation.
>>
>>> So back to the start, why do you need pacemaker to forget about the
>>> other 10 nodes?
>>
>> Mostly aesthetic reasons.
>
> This is a massive amount of pain and complexity to create for the
> "aesthetic reasons" of (AFAICS) a very uncommon use-case.
> Why not simply set value-- for clone-max as part of the shutdown
> procedure for your extra nodes?
>
> (The cib actually understands value="value--")
>
>> I am thinking about using pacemaker as a backend for private (or
>> semi-private) virtualization clouds with varying load. E.g. 10 nodes,
>> 3 of them should run always, and the remaining 7 are only powered on
>> when peak load requires it (some heavy calculations should be done as
>> fast as possible). Power is not cheap today, so I'd prefer to have
>> those nodes powered off when they are not needed. That peak load is
>> required only 3 days per month.
>> So I want the cluster to "breathe" in accordance with user needs.
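To illustrate the value-- idea: a shutdown hook on one of the "extra"
nodes could decrement clone-max with something like the following
(an untested sketch; the nvpair id is made up - query the live CIB for
the real one first):

# decrement clone-max by one as part of an extra node's shutdown;
# find the real nvpair id with: cibadmin --query | grep clone-max
cibadmin --modify --xml-text \
  '<nvpair id="cl-vlan200-if-meta-clone-max" name="clone-max" value="value--"/>'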
>>
>> I agree that nothing hampers that being the case right now, and that
>> it will work, but e.g. crm_mon output will show all allocated but
>> stopped clone instances, and it would be much harder for an admin to
>> quickly find real problems.
>>
>> Please look:
>> * No extra nodes in the CIB, no extra clone instances
>> ** Everything is good:
>>
>> Clone Set: cl-vlan200-if [vlan200-if]
>>     Started: [ v03-b v03-a v03-c ]
>>
>> ** Real problem, cl-vlan200-if is not running on v03-c:
>>
>> Clone Set: cl-vlan200-if [vlan200-if]
>>     Started: [ v03-b v03-a ]
>>     Stopped: [ vlan200-if:2 ]
>>
>> * Extra nodes in the CIB, extra clone instances
>> ** Everything is good:
>>
>> Clone Set: cl-vlan200-if [vlan200-if]
>>     Started: [ v03-b v03-a v03-c ]
>>     Stopped: [ vlan200-if:3 vlan200-if:4 vlan200-if:5 vlan200-if:6
>> vlan200-if:7 ]
>>
>> ** Real problem, cl-vlan200-if is not running on v03-c:
>>
>> Clone Set: cl-vlan200-if [vlan200-if]
>>     Started: [ v03-b v03-a ]
>>     Stopped: [ vlan200-if:2 vlan200-if:3 vlan200-if:4 vlan200-if:5
>> vlan200-if:6 vlan200-if:7 ]
>>
>> Do the last two differ much, especially if there are dozens of such
>> lines? I think not.
>>
>> But an admin quickly sees the problem in the first case. And that is
>> probably relevant not only to crm_mon but to other management tools as
>> well, because there is mostly no generic way to distinguish between a
>> 'correctly not-running clone instance' and an 'accidentally not
>> running instance' without extra logic which complicates management
>> tools a lot.
>>
>> Does this make it clearer why I want all of that to be implemented?
>>
>>> (because everything apart from that should already work).
>>>
>>>>>>>> That would be OK if the number of clone instances does not grow
>>>>>>>> with that...
>>>>>>>
>>>>>>> Why? If clone-node-max=1, then you'll never have more than the
>>>>>>> number of active nodes - even if clone-max is greater.
>>>>>>
>>>>>> Active (online) or known (existing in a <nodes> section)?
>>>>>> I've seen that as soon as a node appears in <nodes>, even in an
>>>>>> offline state, a new clone instance is allocated.
>>>>>
>>>>> $num_known instances will "exist", but only $num_active will be
>>>>> running.
>>>>
>>>> Yep, that's what I am saying. I see them in crm_mon or 'crm status'
>>>> and they make my life harder ;)
>>>> The remaining instances are "allocated" but not running.
>>>>
>>>> I can agree that this issue is a very "cosmetic" one, but its
>>>> existence conflicts with my perfectionism, so I'd like to resolve
>>>> it ;)
>>>>
>>>>>> Also, on one cluster with post-1.1.7 with the openais plugin I have
>>>>>> 16 nodes configured in totem.interface.members, but only three
>>>>>> nodes in the <nodes> CIB section, and I'm able to allocate at least
>>>>>> 8-9 instances of clones with clone-max.
>>>>>
>>>>> Yes, but did you set clone-node-max? One is the global maximum, the
>>>>> other is the per-node maximum.
>>>>>
>>>>>> I believe that pacemaker does not query totem.interface.members
>>>>>> directly with the openais plugin,
>>>>>
>>>>> Correct.
>>>>>
>>>>>> and runtime.totem.pg.mrp.srp.members has only three nodes.
>>>>>> Did that behavior change recently?
>>>>>
>>>>> No.
>>>>>
>>>>>>>>> For node removal we do require crm_node --remove.
>>>>>>>>>
>>>>>>>>> Is this not sufficient?
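For the record, that manual cleanup is a single command run on any
remaining cluster node - something like the following (untested here,
assuming a 1.1-era crm_node; the node name is just an example):

# purge a departed node from the CIB and pacemaker's membership caches
crm_node --force --remove v03-d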
>>>>>>>> I think it would be more straightforward if there were only one
>>>>>>>> origin of membership information for the entire cluster stack, so
>>>>>>>> the proposal is to automatically remove a node from the CIB when
>>>>>>>> it disappears from the corosync nodelist (due to removal by an
>>>>>>>> admin). That nodelist is not dynamic (it is read from a config
>>>>>>>> and then may be altered with cmapctl).
>>>>>>>
>>>>>>> Ok, but there still needs to be a trigger.
>>>>>>> Otherwise we waste cycles continuously polling corosync for
>>>>>>> something that is probably never going to happen.
>>>>>>
>>>>>> Please see above (cmap_track_add).
>>>>>>
>>>>>>> Btw. crm_node doesn't just remove the node from the cib, its
>>>>>>> existence is preserved in a number of caches which need to be
>>>>>>> purged.
>>>>>>
>>>>>> That could be done in a cmap_track_add callback function too, I
>>>>>> think.
>>>>>>
>>>>>>> It could be possible to have crm_node also use the CMAP API to
>>>>>>> remove it from the running corosync, but something would still
>>>>>>> need to edit corosync.conf
>>>>>>
>>>>>> Yes, that is up to the admin.
>>>>>> Btw, I am thinking more about the scenario Fabio explains in
>>>>>> votequorum(8) in the 'allow_downscale' section - that is the one
>>>>>> I'm interested in.
>>>>>>
>>>>>>> IIRC, pcs handles all three components (corosync.conf, CMAP,
>>>>>>> crm_node) as well as the "add" case.
>>>>>>
>>>>>> Good to know. But I'm not ready yet to switch to it.
>>>>>>
>>>>>>>> Of course, it is possible to use crm_node to remove a node from
>>>>>>>> the CIB too after it disappeared from corosync, but that is not
>>>>>>>> as elegant as an automatic one IMHO. And it should not be very
>>>>>>>> difficult to implement.
>>>>>>>>>
>>>>>>>>>> utilizing possibilities CMAP and votequorum provide.
>>>>>>>>>>
>>>>>>>>>> The idea is to:
>>>>>>>>>> * Not add nodes from the nodelist to the CIB if their
>>>>>>>>>> join-count in cmap is zero (but not touch CIB nodes which exist
>>>>>>>>>> in the nodelist and have a zero join-count in cmap).
>>>>>>>>>> * Install watches on the cmap nodelist.node and
>>>>>>>>>> runtime.totem.pg.mrp.srp.members subtrees (cmap_track_add).
>>>>>>>>>> * Add missing nodes to the CIB as soon as they are both
>>>>>>>>>> ** defined in the nodelist, and
>>>>>>>>>> ** their join count becomes non-zero.
>>>>>>>>>> * Remove nodes from the CIB when they are removed from the
>>>>>>>>>> nodelist.
>>>>>>>>>
>>>>>>>>> From _a_ nodelist or _the_ (optional) corosync nodelist?
>>>>>>>>
>>>>>>>> From the nodelist.node subtree of the CMAP tree.
>>>>>>>>
>>>>>>>>> Because removing a node from the cluster because it shut down
>>>>>>>>> is... an interesting idea.
>>>>>>>>
>>>>>>>> BTW, even that could be possible if quorum.allow_downscale is
>>>>>>>> enabled, but it requires much more thinking and probably more
>>>>>>>> functionality from corosync. I'm not ready to comment on that yet
>>>>>>>> though.
>>>>>>>
>>>>>>> "A node left but I still have quorum" is very different to "a node
>>>>>>> left... what node?".
>>>>>>> Also, what happens after you fence a node... do we forget about it
>>>>>>> too?
>>>>>>
>>>>>> quorum.allow_downscale mandates that it is active only if a node
>>>>>> leaves the cluster in a clean state.
>>>>>>
>>>>>> But, from what I know, corosync does not remove a node from
>>>>>> nodelist.node either by itself or on request from votequorum;
>>>>>> that's why I talk about "more functionality from corosync".
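Just to make the trigger idea concrete: the watch proposed above could
look roughly like this with the corosync 2.x cmap API (a minimal,
untested sketch, not actual pacemaker code; error handling mostly
omitted; link with -lcmap):

#include <stdio.h>
#include <corosync/cmap.h>

/* called by libcmap for every tracked change under nodelist.node. */
static void nodelist_cb(cmap_handle_t handle, cmap_track_handle_t track,
                        int32_t event, const char *key_name,
                        struct cmap_notify_value new_val,
                        struct cmap_notify_value old_val, void *user_data)
{
    if (event == CMAP_TRACK_DELETE) {
        /* here pacemaker would purge the node from the CIB and its
         * membership caches (what crm_node --remove does today) */
        printf("nodelist key removed: %s\n", key_name);
    } else if (event == CMAP_TRACK_ADD) {
        /* here it would add the node once its join count is non-zero */
        printf("nodelist key added: %s\n", key_name);
    }
}

int main(void)
{
    cmap_handle_t handle;
    cmap_track_handle_t track;

    if (cmap_initialize(&handle) != CS_OK)
        return 1;

    /* watch the whole nodelist.node. prefix for adds and deletes */
    cmap_track_add(handle, "nodelist.node.",
                   CMAP_TRACK_ADD | CMAP_TRACK_DELETE | CMAP_TRACK_PREFIX,
                   nodelist_cb, NULL, &track);

    /* a real daemon would plug cmap_fd_get() into its mainloop;
     * this sketch just blocks and dispatches callbacks */
    cmap_dispatch(handle, CS_DISPATCH_BLOCKING);
    cmap_finalize(handle);
    return 0;
}

(The same cmap_track_add call with the runtime.totem.pg.mrp.srp.members.
prefix would cover the join-count side.)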
>>>>>> If votequorum could distinguish a "static" node (listed in the
>>>>>> config) from a "dynamic" node (added on-the-fly), and manage the
>>>>>> list of "dynamic" ones if allow_downscale is enabled, that would do
>>>>>> the trick.
>>>>>
>>>>> I doubt that request would get very far, but I could be wrong.
>>>>>
>>>>>>>> I was talking about node removal from CMAP's nodelist with the
>>>>>>>> corosync_cmapctl command. Of course, the absence of the
>>>>>>>> (optional) nodelist in CMAP would result in a NOOP, because there
>>>>>>>> is no removal event on the nodelist.node tree from cmap.
>>>>>>>>>
>>>>>>>>>> Certainly, this requires some CMAP values (especially the
>>>>>>>>>> votequorum ones and maybe the totem mode) to have 'well-known'
>>>>>>>>>> values, e.g. only UDPU mode and quorum.allow_downscale=1; those
>>>>>>>>>> are still to be defined.
>>>>>>>>>>
>>>>>>>>>> Maybe it also makes sense to make this depend on some new CMAP
>>>>>>>>>> variable, e.g. nodelist.dynamic=1.
>>>>>>>>>>
>>>>>>>>>> I would even try to implement this if general agreement is
>>>>>>>>>> reached and nobody else wants to implement it.
>>>>>>>>>>
>>>>>>>>>> Can you please comment on this?
>>>>>>>>>>
>>>>>>>>>> Vladislav
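P.S. For completeness, adding and later removing such a "dynamic" node
in the in-memory nodelist would be done with corosync_cmapctl, roughly
like this (a sketch assuming corosync 2.x key names; the node index, id
and address are just examples):

# add a fourth node to the in-memory nodelist
corosync_cmapctl -s nodelist.node.3.nodeid u32 4
corosync_cmapctl -s nodelist.node.3.ring0_addr str v03-d

# later remove it again - this is exactly the nodelist.node deletion
# event the proposed cmap watch would react to
corosync_cmapctl -d nodelist.node.3.ring0_addr
corosync_cmapctl -d nodelist.node.3.nodeid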
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org