On 13 Aug 2014, at 2:02 am, Cédric Dufour - Idiap Research Institute 
<cedric.duf...@idiap.ch> wrote:

> On 12/08/14 07:52, Andrew Beekhof wrote:
>> On 11 Aug 2014, at 10:10 pm, Cédric Dufour - Idiap Research Institute 
>> <cedric.duf...@idiap.ch> wrote:
>> 
>>> Hello,
>>> 
>>> Thanks to Pacemaker 1.1.12, I have been able to set up a (very?) large 
>>> cluster:
>> That's certainly up there as one of the biggest :)
> 
> Well, actually, I sized it down from 444 to 277 resources by merging the 
> 'VirtualDomain' and 'MailTo' RAs/primitives into a single custom 'LibvirtQemu' 
> one.
> The CIB is now ~3MiB uncompressed / ~100kiB compressed. (This also keeps the 
> informational-only 'MailTo' RA from burdening the cluster.)
> 'PCMK_ipc_buffer' at 2MiB might be overkill now... but I'd rather stay on the 
> safe side.
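For readers unfamiliar with writing custom agents: an OCF resource agent is just
an executable that dispatches on its first argument. Below is a purely
hypothetical sketch of what such a merged agent can look like; the parameter and
metadata names are assumptions, and this is not the actual LibvirtQemu agent:

    #!/bin/sh
    # Hypothetical skeleton only - NOT the actual LibvirtQemu agent discussed above.
    # An OCF agent is an executable that dispatches on its first argument and
    # reports status through the standard OCF exit codes.
    : ${OCF_FUNCTIONS_DIR:=${OCF_ROOT:-/usr/lib/ocf}/lib/heartbeat}
    . ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

    DOMAIN="${OCF_RESKEY_domain}"    # instance parameter; the name is an assumption

    monitor() {
        # The domain counts as running if libvirt reports it as such
        virsh domstate "$DOMAIN" 2>/dev/null | grep -q '^running' \
            && return $OCF_SUCCESS || return $OCF_NOT_RUNNING
    }

    case "$1" in
        start)     virsh start "$DOMAIN"    && exit $OCF_SUCCESS || exit $OCF_ERR_GENERIC ;;
        stop)      virsh shutdown "$DOMAIN" && exit $OCF_SUCCESS || exit $OCF_ERR_GENERIC ;;
        monitor)   monitor; exit $? ;;
        meta-data) # a real agent prints its full XML metadata description here
                   echo '<?xml version="1.0"?><resource-agent name="LibvirtQemu"/>'
                   exit $OCF_SUCCESS ;;
        *)         exit $OCF_ERR_UNIMPLEMENTED ;;
    esac

A real agent would of course also wait for the domain to actually reach the
requested state, and fold the old 'MailTo' notification into start/stop, before
returning.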
> 
> Q: Are there adverse effects in keeping 'PCMK_ipc_buffer' high?

Only that you'll have a larger mmap file shared between processes.
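For reference, it is an environment variable that must be visible to the cluster
daemons on every node before they start. A minimal sketch, assuming a
Debian-style environment file (with the 1.4 plugin the daemons inherit
corosync's environment, so the exact file may differ on your setup):

    # /etc/default/pacemaker (location assumed; adjust for your distribution/stack)
    # Size, in bytes, of the buffer used for IPC between the cluster daemons
    # and their clients; large CIBs need correspondingly large buffers.
    PCMK_ipc_buffer=2097152   # 2 MiB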

> 
> 277 resources are:
> - 22 (cloned) network-health (ping) resources
> - 88 (cloned) stonith resources (I have 4 stonith devices)
> - 167 LibvirtQemu resources (83 "general-purpose" servers and 84 SGE-driven 
> computation nodes)
> (and more LibvirtQemu resources are expected to come)
> 
>> Have you checked pacemaker's CPU usage during startup/failover?  I'd be 
>> interested in your results.
> 
> I finally set 'batch-limit' to 22 - the number of nodes - as this makes 
> sense when enabling a new primitive, since all monitor operations get 
> dispatched to all nodes at once.
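(For completeness, 'batch-limit' is an ordinary cluster property, so it can be
adjusted at runtime; a minimal sketch using Pacemaker's stock crm_attribute tool:)

    # Set the transition batch limit to the node count (22 here)
    crm_attribute --type crm_config --name batch-limit --update 22
    # Check the value currently in effect
    crm_attribute --type crm_config --name batch-limit --query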
> 
> When bringing a standby node to life:
> 
> - On the "waking" node (E5-2690v2): 167+5 resources monitoring operations get 
> dispatched; the CPU load of the 'cib' process remains below 100% as the 
> operations are executed, batched by 22 (though one can not see that 
> "batching", the monitoring operations succeeding very quickly), and complete 
> in ~2 seconds. With Pacemaker 1.1.7, the 'cib' load would have peaked to 100% 
> even before the first monitoring operation started (because of the CIB 
> refresh, I guess) and would remain so for several tens of seconds (often 
> resulting in timeouts and monitoring operations failure)
> 
> - On the DC node (E5-2690v2): the CPU also remains below 100%, alternating 
> between the 'cib', 'pengine' and 'crmd' processes. The DC is back to IDLE 
> within ~4 seconds.
> 
> I tried raising the 'batch-limit' to 50 and witnessed CPU load peaking at 
> 100% while carrying out the same procedure, but all went well nonetheless.
> 
> While I still had the ~450 resources, I also "accidentally" brought all 22 
> nodes back to life together (well, actually started the DC alone and then 
> started the remaining 21 nodes together). As could be expected, the DC got 
> quite busy (dispatching/executing the ~450*22 monitoring operations on all 
> nodes). It took 40 minutes for the cluster to stabilize. But it did 
> stabilize, with no timeouts and no monitor operation failures! A few "high 
> CIB load detected / throttle down mode" messages popped up, but all went well.


That works out to roughly 0.24s per operation (~450 resources x 22 nodes, i.e. 
~9,900 monitor operations in ~2,400 seconds) - not too bad.
More importantly, I'm glad to hear that real-world clusters are seeing the same 
kind of improvements as those in the lab.

It would be interesting to know how the 40 minutes compares to bringing one 
node online at a time.

> 
> Q: Is there a way to favour the more powerful nodes for the DC role (in other 
> words, to push the DC "election" process in a preferred direction)?

Only by starting it first and ensuring it doesn't die (we prioritize the node 
with the longest crmd process uptime).
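A rough sketch of that "only lever", assuming an init-script based stack (how
pacemaker itself is launched depends on the plugin's 'ver' setting, so the exact
commands may differ on your nodes):

    # On the preferred DC candidate (the most powerful node), first:
    service corosync start
    service pacemaker start     # only needed when the plugin uses 'ver: 1'
    # Confirm it has won the election before bringing up anyone else:
    crm_mon -1 | grep "Current DC"
    # Then start the cluster stack on the remaining 21 nodes.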

> 
>> 
>>> Last updated: Mon Aug 11 13:40:14 2014
>>> Last change: Mon Aug 11 13:37:55 2014
>>> Stack: classic openais (with plugin)
>> I would at least try running it with corosync 2.x (no plugin)
>> That will use CPG for messaging which should perform even better.
> 
> I'm running into a deadline now and will have to stick to 1.4.x for the 
> moment. But as soon as I can free up an old Intel modular test chassis I have 
> around, I'll try backporting Corosync 2.x from Debian/Experimental to 
> Debian/Wheezy and see how it goes.
> 
>> 
>>> Current DC: bc1hx5a05 - partition with quorum
>>> Version: 1.1.12-561c4cf
>>> 22 Nodes configured, 22 expected votes
>>> 444 Resources configured
>>> 
>>> PS: 'corosync' (1.4.7) traffic goes through a 10GbE network, with strict 
>>> QoS priority over all other traffic.
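As an illustration of the host side of such prioritization, a hypothetical 'tc'
sketch that puts corosync's default multicast port (5405/udp) into the highest
band of a simple prio qdisc; the interface name and port are assumptions, and
the switch-side QoS is a separate matter:

    # Hypothetical example; eth0 and port 5405 (corosync's default mcastport)
    # are assumptions - adjust to the actual interconnect configuration.
    tc qdisc add dev eth0 root handle 1: prio bands 3
    tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
        match ip protocol 17 0xff \
        match ip dport 5405 0xffff \
        flowid 1:1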
>>> 
>>> Are there recommended configuration tweaks I should not miss in such 
>>> situation?
>>> 
>>> So far, I have:
>>> - Raised the 'PCMK_ipc_buffer' size to 2MiB
>>> - Lowered the 'batch-limit' to 10 (though I believe my setup could sustain 
>>> the default 30)
>> Yep, definitely worth trying the higher value.
>> We _should_ automatically start throttling ourselves if things get too 
>> intense.
> 
> Yep. As mentioned above, I did see "high CIB load detected / throttle down 
> mode" messages pop up. Is this what you are referring to?
> 
>> 
>> Other than that, I would be making sure all the corosync.conf timeouts and 
>> other settings are appropriate.
> 
> Never paid much attention to it so far. But it seems to me the Debian 
> defaults are quite conservative, especially given my 10GbE (~0.2ms latency) 
> interconnect and the care I took in prioritizing Corosync traffic (thanks to 
> the switches' QoS/GMB and Linux 'tc'):
> 
>    token: 3000
>    token_retransmits_before_loss_const: 10
>    join: 60
>    consensus: 3600
>    vsftype: none
>    max_messages: 20
>    secauth: off
>    amf: disabled
> 
> Am I right?
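For readers comparing against their own clusters, here is a re-annotated sketch
of the quoted totem settings, with comments paraphrasing the corosync.conf(5)
man page (the values are simply the ones quoted above, not recommendations):

    totem {
        token: 3000                              # ms of token silence before a loss is declared
        token_retransmits_before_loss_const: 10  # retransmit attempts before giving up on the token
        consensus: 3600                          # ms to reach membership consensus; must exceed
                                                 # 'token' (the documented default is 1.2 * token)
        join: 60                                 # ms to wait for join messages during membership changes
        max_messages: 20                         # max messages one node may send per token receipt
        secauth: off                             # no HMAC authentication/encryption of totem traffic
        # 'vsftype: none' and 'amf: disabled' are legacy openais settings and can stay as shipped
    }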
> 
> PS: this work is being done within the context of the BEAT European research 
> project - https://www.beat-eu.org/ - which aims, among other things, to 
> "develop an online and open platform to transparently and independently 
> evaluate biometric systems against validated benchmarks". There will be some 
> publication about the infrastructure setup. If interested, I can keep you 
> posted.
> 
> Best,
> 
> Cédric
> 
>> 
>>> Thank you in advance for your response.
>>> 
>>> Best,
>>> 
>>> Cédric
>>> 
>>> -- 
>>> 
>>> Cédric Dufour @ Idiap Research Institute
>>> 
> 
> 


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
