Thank you Andrew. Sorry for misspelling your name in the previous email.

on 2015/05/05 14:25, Andrew Beekhof wrote:
>> On 5 May 2015, at 2:31 pm, Zhou Zheng Sheng / 周征晟 <[email protected]> wrote:
>>
>> Thank you Bogdan for clarifying the pacemaker promotion process for me.
>>
>> on 2015/05/05 10:32, Andrew Beekhof wrote:
>>>> On 29 Apr 2015, at 5:38 pm, Zhou Zheng Sheng / 周征晟 <[email protected]> wrote:
>>> [snip]
>>>
>>>> Batch is a pacemaker concept I found when I was reading its
>>>> documentation and code. There is a "batch-limit: 30" in the output of
>>>> "pcs property list --all". The official pacemaker documentation
>>>> explains that it is "The number of jobs that the TE is allowed to
>>>> execute in parallel." From my understanding, pacemaker maintains the
>>>> cluster state, and when we start/stop/promote/demote a resource, it
>>>> triggers a state transition. Pacemaker puts as many transition jobs as
>>>> possible into a batch and processes them in parallel.
>>>
>>> Technically it calculates an ordered graph of actions that need to be
>>> performed for a set of related resources.
>>> You can see an example of the kinds of graphs it produces at:
>>>
>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html
>>>
>>> There is a more complex one which includes promotion and demotion on the
>>> next page.
>>>
>>> The number of actions that can run at any one time is therefore limited by:
>>> - the value of batch-limit (the total number of in-flight actions)
>>> - the number of resources that do not have ordering constraints between
>>>   them (e.g. rsc{1,2,3} in the above example)
>>>
>>> So in the above example, if batch-limit >= 3, the monitor_0 actions will
>>> still all execute in parallel.
>>> If batch-limit == 2, one of them will be deferred until the others complete.
>>>
>>> Processing of the graph stops the moment any action returns a value that
>>> was not expected.
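For anyone following along, the batch-limit property Andrew describes can be inspected and tuned with the same pcs tool quoted above. A minimal sketch, to be run against a live cluster (the value 30 is just the default mentioned earlier in the thread):

```shell
# Show the current limit on in-flight actions (defaults included).
pcs property list --all | grep batch-limit

# Adjust the limit. Note this only caps parallelism; it does not remove
# ordering constraints, so dependent actions still wait for their
# prerequisites regardless of how high this is set.
pcs property set batch-limit=30
```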
>>> If that happens, we wait for currently in-flight actions to complete,
>>> re-calculate a new graph based on the new information and start again.
>>
>> So can I infer the following statement? In a big cluster with many
>> resources, chances are some resource agent actions return unexpected
>> values,
>
> The size of the cluster shouldn’t increase the chance of this happening
> unless you’ve set the timeouts too aggressively.
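As a side note for readers, the transition graphs being recalculated here can be captured from a live cluster with crm_simulate, which ships with Pacemaker; a sketch, assuming Graphviz is available for rendering:

```shell
# Compute the transition graph from the current live CIB and save both
# the raw graph and a dot representation of the ordered actions.
crm_simulate --live-check --save-graph tx.xml --save-dotfile tx.dot

# Render it, similar to the examples in Pacemaker Explained linked above.
dot -Tsvg tx.dot -o tx.svg
```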
If there are many types of resource agents, and any one of them is not
well written, it might cause trouble, right?

>> and if any of the in-flight actions has a long timeout, it would
>> block pacemaker from re-calculating a new transition graph?
>
> Yes, but it’s actually an argument for making the timeouts longer, not shorter.
> Setting the timeouts too aggressively actually increases downtime because of
> all the extra delays and recovery it induces.
> So set them to be long enough that there is unquestionably a problem if you
> hit them.
>
> But we absolutely recognise that starting/stopping a database can take a very
> long time comparatively and that it shouldn’t block recovery of other
> unrelated services.
> I would expect to see this land in Pacemaker 1.1.14

It will be great to see this in Pacemaker 1.1.14. From my experience using
Pacemaker, I think customized resource agents are possibly the weakest part.
This feature should improve the handling of resource action timeouts.

>> I see the current batch-limit is 30 and I tried to increase it to 100,
>> but it did not help.
>
> Correct. It only puts an upper limit on the number of in-flight actions;
> actions still need to wait for all their dependencies to complete before
> executing.
>
>> I'm sure that the cloned MySQL Galera resource is not related to the
>> master-slave RabbitMQ resource. I don't find any dependency, order or
>> rule connecting them in the cluster deployed by Fuel [1].
>
> In general it should not have needed to wait, but if you send me a crm_report
> covering the period you’re talking about I’ll be able to comment specifically
> about the behaviour you saw.

You are very kind, thank you. I uploaded the file generated by crm_report
to Google Drive.
https://drive.google.com/file/d/0B_vDkYRYHPSIZ29NdzV3NXotYU0/view?usp=sharing

>> Is there anything I can do to make sure all the resource actions return
>> expected values in a full reassembling?
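Andrew's advice to make timeouts generous rather than aggressive translates into per-operation timeouts on the resource. A hedged sketch with pcs; the resource name p_rabbitmq-server and the values are illustrative, not taken from the Fuel deployment discussed here:

```shell
# Give slow operations room: hitting the timeout should unquestionably
# mean something is wrong, not that the service is merely slow today.
pcs resource update p_rabbitmq-server \
    op start timeout=300s \
    op stop timeout=120s

# A cluster-wide default for operations without an explicit timeout.
pcs resource op defaults timeout=60s
```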
> In general, if we say ‘start’, do your best to start, or return ‘0’ if you
> already were started.
> Likewise for stop.
>
> Otherwise it’s really specific to your agent.
> For example an IP resource just needs to add itself to an interface - it can’t
> do much differently; if it times out then the system must be very very busy.
>
> The only other thing I would say is:
> - avoid blocking calls where possible
> - have empathy for the machine (do as little as is needed)

+1 for the empathy :)

>> Is it because node-1 and node-2 happen to boot faster than node-3 and
>> form a cluster, and when node-3 joins, it triggers a new state
>> transition? Or maybe because some resources are already started, so
>> pacemaker needs to stop them first?
>
> We only stop them if they shouldn’t yet be running (i.e. a colocation or
> ordering dependency is not yet started also).
>
>> Does setting default-resource-stickiness to 1 help?
>
> From 0 or INFINITY?

From 0 to 1. Is that enough to prevent the resources from being moved when
some nodes recover from power failure?

>> I also tried "crm history XXX" commands in a live and correct cluster,
>
> I’m not familiar with that tool anymore.
>
>> but didn't find much information. I can see there are many log entries
>> like "run_graph: Transition 7108 ...". Next I'll inspect the pacemaker
>> log to see which resource action returns the unexpected value or what
>> triggers a new state transition.
>>
>> [1] http://paste.openstack.org/show/214919/
>
> I’d not recommend mixing the two CLI tools.
>
>>>> The problem is that pacemaker can only promote a resource after it
>>>> detects the resource is started.
>>>
>>> First we do a non-recurring monitor (*_monitor_0) to check what state the
>>> resource is in.
>>> We can’t assume it’s off because a) we might have crashed, b) the admin
>>> might have accidentally configured it to start at boot, or c) the admin
>>> may have asked us to re-check everything.
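Andrew's rules (idempotent start/stop, and a monitor that distinguishes stopped / slave / master for the *_monitor_0 probe) can be sketched as a toy OCF-style agent. Everything here is illustrative: the demo_* names and the state file are hypothetical stand-ins for a real service, though the return codes follow the OCF convention (0 success, 7 not running, 8 running master):

```shell
#!/bin/sh
# Toy OCF-style actions; a state file stands in for a real service.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7
OCF_RUNNING_MASTER=8

STATE_FILE="${STATE_FILE:-/tmp/demo_resource.state}"

demo_monitor() {
    # A real agent would probe the service (pid file, status command, ...).
    [ -f "$STATE_FILE" ] || return "$OCF_NOT_RUNNING"
    if grep -q master "$STATE_FILE"; then
        return "$OCF_RUNNING_MASTER"   # running and promoted
    fi
    return "$OCF_SUCCESS"              # running as slave
}

demo_start() {
    # "Do your best to start, or return 0 if you already were started":
    # a second start must not fail, and must not demote a running master.
    demo_monitor
    case $? in
        "$OCF_SUCCESS"|"$OCF_RUNNING_MASTER") return "$OCF_SUCCESS" ;;
    esac
    echo slave > "$STATE_FILE"         # stand-in for launching the service
    return "$OCF_SUCCESS"
}

demo_stop() {
    # Likewise for stop: stopping an already-stopped resource succeeds.
    rm -f "$STATE_FILE"
    return "$OCF_SUCCESS"
}
```

The demote/stop/start/promote ordering Andrew lists below relies on exactly these codes: a monitor_0 returning 7 lets Pacemaker schedule a start, and only a reported start success makes the resource eligible for promotion in a later graph.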
>>>
>>>> During a full reassemble, in the first transition batch, pacemaker
>>>> starts all the resources including MySQL and RabbitMQ. Pacemaker
>>>> issues resource agent "start" invocations in parallel and reaps the
>>>> results.
>>>>
>>>> For a multi-state resource agent like RabbitMQ, pacemaker needs the
>>>> start result reported in the first batch, then the transition engine
>>>> and policy engine decide if it has to retry starting or promote, and
>>>> put this new transition job into a new batch.
>>>
>>> Also important to know, the order of actions is:
>>>
>>> 1. any necessary demotions
>>> 2. any necessary stops
>>> 3. any necessary starts
>>> 4. any necessary promotions
>>
>> --
>> Best wishes!
>> Zhou Zheng Sheng / 周征晟  Software Engineer
>> Beijing AWcloud Software Co., Ltd.

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
