Thank you, Bogdan, for clarifying the pacemaker promotion process for me.

On 2015/05/05 10:32, Andrew Beekhof wrote:
>> On 29 Apr 2015, at 5:38 pm, Zhou Zheng Sheng / 周征晟 <zhengsh...@awcloud.com> wrote:
> [snip]
>
>> Batch is a pacemaker concept I found when I was reading its
>> documentation and code. There is a "batch-limit: 30" in the output of
>> "pcs property list --all". The pacemaker official documentation
>> explanation is that it's "The number of jobs that the TE is allowed to
>> execute in parallel." From my understanding, pacemaker maintains cluster
>> states, and when we start/stop/promote/demote a resource, it triggers a
>> state transition. Pacemaker puts as many transition jobs as possible
>> into a batch, and processes them in parallel.
> Technically it calculates an ordered graph of actions that need to be
> performed for a set of related resources.
> You can see an example of the kinds of graphs it produces at:
>
>   http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html
>
> There is a more complex one which includes promotion and demotion on the next
> page.
>
> The number of actions that can run at any one time is therefore limited by
> - the value of batch-limit (the total number of in-flight actions)
> - the number of resources that do not have ordering constraints between them
>   (eg. rsc{1,2,3} in the above example)
>
> So in the above example, if batch-limit >= 3, the monitor_0 actions will
> still all execute in parallel.
> If batch-limit == 2, one of them will be deferred until the others complete.
>
> Processing of the graph stops the moment any action returns a value that was
> not expected.
> If that happens, we wait for currently in-flight actions to complete,
> re-calculate a new graph based on the new information and start again.

So can I infer the following statement?
In a big cluster with many resources, chances are that some resource agent
actions return unexpected values. If any of the in-flight actions has a long
timeout, would it block pacemaker from re-calculating a new transition graph?
I see the current batch-limit is 30. I tried increasing it to 100, but that
did not help.

I'm sure that the cloned MySQL Galera resource is not related to the
master-slave RabbitMQ resource. I can't find any dependency, ordering
constraint or rule connecting them in the cluster deployed by Fuel [1].
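
To check my understanding of the batch-limit explanation quoted above, here
is a minimal Python sketch of it. It is only a toy model, not pacemaker code:
the function run_graph and the action names are hypothetical, and it assumes
every action in a round finishes at the same time, whereas the real
transition engine may start a new action as soon as a slot frees up.

```python
def run_graph(actions, deps, batch_limit):
    """Toy model of executing an ordered action graph in rounds.

    actions     -- list of action names (hypothetical, e.g. "rsc1_monitor_0")
    deps        -- dict: action -> set of actions that must finish first
    batch_limit -- maximum number of in-flight actions per round

    Returns a list of rounds; each round lists the actions run together.
    Assumption: all actions in a round complete before the next round starts.
    """
    if batch_limit < 1:
        raise ValueError("this sketch needs batch_limit >= 1")
    done = set()
    pending = list(actions)
    rounds = []
    while pending:
        # An action is ready once everything it is ordered after has finished.
        ready = [a for a in pending if deps.get(a, set()) <= done]
        if not ready:
            raise ValueError("ordering constraints cannot be satisfied")
        # batch_limit caps how many ready actions go in flight at once.
        batch = ready[:batch_limit]
        rounds.append(batch)
        done.update(batch)
        pending = [a for a in pending if a not in done]
    return rounds


# Three probes with no ordering constraints between them.
monitors = ["rsc1_monitor_0", "rsc2_monitor_0", "rsc3_monitor_0"]

print(run_graph(monitors, {}, 3))  # one round containing all three probes
print(run_graph(monitors, {}, 2))  # two rounds: rsc3_monitor_0 is deferred
```

In this model, raising the limit beyond the number of unordered actions
changes nothing, which would be consistent with my change from 30 to 100
having no visible effect.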

Is there anything I can do to make sure all the resource actions return
expected values during a full reassembly? Is it because node-1 and node-2
happen to boot faster than node-3 and form a cluster, so that when node-3
joins it triggers a new state transition? Or maybe it is because some
resources are already started, so pacemaker needs to stop them first? Would
setting default-resource-stickiness to 1 help?

I also tried the "crm history XXX" commands in a live, correctly working
cluster, but didn't find much information. I can see there are many log
entries like "run_graph: Transition 7108 ...". Next I'll inspect the
pacemaker log to see which resource action returns the unexpected value, or
what triggers the new state transition.

[1] http://paste.openstack.org/show/214919/

>> The problem is that pacemaker can only promote a resource after it
>> detects the resource is started.
> First we do a non-recurring monitor (*_monitor_0) to check what state the
> resource is in.
> We can't assume it's off because a) we might have crashed, b) the admin might
> have accidentally configured it to start at boot or c) the admin may have
> asked us to re-check everything.
>
>> During a full reassemble, in the first
>> transition batch, pacemaker starts all the resources including MySQL and
>> RabbitMQ. Pacemaker issues resource agent "start" invocation in parallel
>> and reaps the results.
>>
>> For a multi-state resource agent like RabbitMQ, pacemaker needs the
>> start result reported in the first batch, then transition engine and
>> policy engine decide if it has to retry starting or promote, and put
>> this new transition job into a new batch.
> Also important to know, the order of actions is:
>
> 1. any necessary demotions
> 2. any necessary stops
> 3. any necessary starts
> 4. any necessary promotions
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

-- 
Best wishes!
Zhou Zheng Sheng / 周征晟  Software Engineer
Beijing AWcloud Software Co., Ltd.