Thank you, Bogdan, for clarifying the pacemaker promotion process for me.

on 2015/05/05 10:32, Andrew Beekhof wrote:
>> On 29 Apr 2015, at 5:38 pm, Zhou Zheng Sheng / 周征晟 <zhengsh...@awcloud.com> 
>> wrote:
> [snip]
>
>> Batch is a pacemaker concept I found when I was reading its
>> documentation and code. There is a "batch-limit: 30" in the output of
>> "pcs property list --all". The pacemaker official documentation
>> explanation is that it's "The number of jobs that the TE is allowed to
>> execute in parallel." From my understanding, pacemaker maintains cluster
>> states, and when we start/stop/promote/demote a resource, it triggers a
>> state transition. Pacemaker puts as many transition jobs as possible into a
>> batch and processes them in parallel.
> Technically it calculates an ordered graph of actions that need to be 
> performed for a set of related resources.
> You can see an example of the kinds of graphs it produces at:
>
>    
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html
>
> There is a more complex one which includes promotion and demotion on the next 
> page.
>
> The number of actions that can run at any one time is therefore limited by
> - the value of batch-limit (the total number of in-flight actions)
> - the number of resources that do not have ordering constraints between them 
> (e.g. rsc{1,2,3} in the above example)  
>
> So in the above example, if batch-limit >= 3, the monitor_0 actions will 
> still all execute in parallel.
> If batch-limit == 2, one of them will be deferred until the others complete.
>
> Processing of the graph stops the moment any action returns a value that was 
> not expected.
> If that happens, we wait for currently in-flight actions to complete, 
> re-calculate a new graph based on the new information and start again.
So can I infer the following? In a big cluster with many resources, chances
are that some resource agent actions return unexpected values, and if any
in-flight action has a long timeout, it would block pacemaker from
re-calculating a new transition graph until that action completes? The
current batch-limit is 30, and I tried increasing it to 100, but it did not
help. I'm sure the cloned MySQL Galera resource is not related to the
master/slave RabbitMQ resource; I can't find any dependency, ordering
constraint or rule connecting them in the cluster deployed by Fuel [1].
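For reference, these are the commands I used to inspect and raise the limit
(pcs syntax; on a crmsh-based deployment the equivalent is "crm configure
property batch-limit=100"):

```shell
# Show the current batch-limit (total in-flight actions per transition):
pcs property list --all | grep batch-limit

# Raise it. Note this only helps when independent actions are being
# deferred for lack of slots -- it does not unblock waiting on a slow
# in-flight action to complete or time out:
pcs property set batch-limit=100
```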

Is there anything I can do to make sure all the resource actions return
expected values during a full reassembly? Could it be that node-1 and node-2
happen to boot faster than node-3 and form a cluster first, so that node-3
joining triggers a new state transition? Or maybe some resources are already
started, so pacemaker needs to stop them first? Would setting
default-resource-stickiness to 1 help?

I also tried the "crm history XXX" commands on a live, healthy cluster, but
didn't find much information. I can see there are many log entries like
"run_graph: Transition 7108 ...". Next I'll inspect the pacemaker log to see
which resource action returns an unexpected value, or what else triggers a
new state transition.

[1] http://paste.openstack.org/show/214919/
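My plan for the log inspection, roughly, is to grep for transitions that
stopped early and for actions whose return code differed from the expected
one. The sample lines below are hypothetical (abbreviated from the
run_graph format mentioned above); the real log path depends on the distro
(/var/log/pacemaker.log or syslog):

```shell
# Hypothetical sample of pacemaker 1.1 log lines for illustration only:
cat > /tmp/pacemaker-sample.log <<'EOF'
crmd: notice: run_graph: Transition 7106 (Complete=12, Pending=0, Fired=0, Skipped=0, Incomplete=0): Complete
crmd: warning: status_from_rc: Action 31 (p_rabbitmq-server_monitor_0) on node-3 failed (target: 7 vs. rc: 0): Error
crmd: notice: run_graph: Transition 7107 (Complete=5, Pending=0, Fired=0, Skipped=9, Incomplete=3): Stopped
crmd: notice: run_graph: Transition 7108 (Complete=20, Pending=0, Fired=0, Skipped=0, Incomplete=0): Complete
EOF

# Transitions that did not run to completion:
grep -E 'run_graph: Transition [0-9]+ .*(Stopped|Terminated)' /tmp/pacemaker-sample.log

# Actions whose return code differed from the expected one:
grep -E 'target: [0-9]+ vs\. rc: [0-9]+' /tmp/pacemaker-sample.log
```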

>> The problem is that pacemaker can only promote a resource after it
>> detects the resource is started.
> First we do a non-recurring monitor (*_monitor_0) to check what state the 
> resource is in.
> We can't assume it's off because a) we might have crashed, b) the admin might 
> have accidentally configured it to start at boot or c) the admin may have 
> asked us to re-check everything.
>
>> During a full reassembly, in the first
>> transition batch, pacemaker starts all the resources including MySQL and
>> RabbitMQ. Pacemaker issues resource agent "start" invocations in parallel
>> and reaps the results.
>>
>> For a multi-state resource agent like RabbitMQ, pacemaker needs the
>> start result reported in the first batch; then the transition engine and
>> policy engine decide whether it has to retry starting or promote, and put
>> this new transition job into a new batch.
> Also important to know, the order of actions is:
>
> 1. any necessary demotions
> 2. any necessary stops
> 3. any necessary starts
> 4. any necessary promotions
>
>
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

-- 
Best wishes!
Zhou Zheng Sheng / 周征晟  Software Engineer
Beijing AWcloud Software Co., Ltd.



