On 12/08/14 07:52, Andrew Beekhof wrote:
> On 11 Aug 2014, at 10:10 pm, Cédric Dufour - Idiap Research Institute 
> <cedric.duf...@idiap.ch> wrote:
>
>> Hello,
>>
>> Thanks to Pacemaker 1.1.12, I have been able to setup a (very?) large 
>> cluster:
> Thats certainly up there as one of the biggest :)

Well, actually, I sized it down from 444 to 277 resources by merging the 
'VirtualDomain' and 'MailTo' RAs/primitives into a single custom 'LibvirtQemu' 
one.
The CIB is now ~3MiB uncompressed / ~100kiB compressed. (This also avoids having 
the informational-only 'MailTo' RA burden the cluster.)
'PCMK_ipc_buffer' at 2MiB might be overkill now... but I'd rather stay on the 
safe side.
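
For reference, 'PCMK_ipc_buffer' is an environment variable; a minimal sketch of 
the setting, assuming the Debian environment file path (/etc/default/pacemaker; 
RHEL-like systems use /etc/sysconfig/pacemaker):

    # Pacemaker environment file (path is distribution-dependent)
    # 2MiB IPC buffer, in bytes, so large CIB payloads fit in a single message
    PCMK_ipc_buffer=2097152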

Q: Are there adverse effects in keeping 'PCMK_ipc_buffer' high?

277 resources are:
 - 22 (cloned) network-health (ping) resources
 - 88 (cloned) stonith resources (I have 4 stonith devices)
 - 167 LibvirtQemu resources (83 "general-purpose" servers and 84 SGE-driven 
computation nodes)
(and more LibvirtQemu resources are expected to come)

> Have you checked pacemaker's CPU usage during startup/failover?  I'd be 
> interested in your results.

I finally set 'batch-limit' to 22 - the number of nodes - which makes sense when 
enabling a new primitive, since all monitor operations get dispatched to all 
nodes at once.
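
For the record, 'batch-limit' is just a cluster property; a minimal sketch of 
the setting, assuming the 'crm' shell is used (any other CIB tool, e.g. 
'cibadmin', would do as well):

    # set the transition batch-limit cluster property ('crm' shell syntax)
    crm configure property batch-limit=22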

When bringing a standby node to life:

 - On the "waking" node (E5-2690v2): 167+5 resource monitoring operations get 
dispatched; the CPU load of the 'cib' process remains below 100% while the 
operations are executed, batched by 22 (though one can hardly see that 
"batching", the monitoring operations succeeding very quickly), and everything 
completes in ~2 seconds. With Pacemaker 1.1.7, the 'cib' load would have peaked 
at 100% even before the first monitoring operation started (because of the CIB 
refresh, I guess) and would have remained there for several tens of seconds 
(often resulting in timeouts and monitor operation failures).

 - On the DC node (E5-2690v2): the CPU load also remains below 100%, 
alternating between the 'cib', 'pengine' and 'crmd' processes. The DC is back to 
IDLE within ~4 seconds.

I tried raising the 'batch-limit' to 50 and witnessed CPU load peaking at 100% 
while carrying out the same procedure, but all went well nonetheless.

While I still had the ~450 resources, I also "accidentally" brought all 22 
nodes back to life together (well, actually started the DC alone and then 
started the remaining 21 nodes together). As could be expected, the DC got 
quite busy (dispatching/executing the ~450*22 monitoring operations on all 
nodes). It took 40 minutes for the cluster to stabilize. But it did stabilize, 
with no timeouts and no monitor operation failures! A few "high CIB load 
detected / throttle down mode" messages popped up, but all went well.

Q: Is there a way to favour more powerful nodes for the DC role (in other 
words, push the DC "election" process in a preferred direction)?

>
>> Last updated: Mon Aug 11 13:40:14 2014
>> Last change: Mon Aug 11 13:37:55 2014
>> Stack: classic openais (with plugin)
> I would at least try running it with corosync 2.x (no plugin)
> That will use CPG for messaging which should perform even better.

I'm running into a deadline now and will have to stick to Corosync 1.4.x for 
the moment. But as soon as I can free up an old Intel modular test chassis I 
have around, I'll try backporting Corosync 2.x from Debian/Experimental to 
Debian/Wheezy and see how it goes.

>
>> Current DC: bc1hx5a05 - partition with quorum
>> Version: 1.1.12-561c4cf
>> 22 Nodes configured, 22 expected votes
>> 444 Resources configured
>>
>> PS: 'corosync' (1.4.7) traffic goes through a 10GbE network, with strict QoS 
>> priority over all other traffic.
>>
>> Are there recommended configuration tweaks I should not miss in such 
>> situation?
>>
>> So far, I have:
>> - Raised the 'PCMK_ipc_buffer' size to 2MiB
>> - Lowered the 'batch-limit' to 10 (though I believe my setup could sustain 
>> the default 30)
> Yep, definitely worth trying the higher value.
> We _should_ automatically start throttling ourselves if things get too 
> intense.

Yep. As mentioned above, I did see "high CIB load detected / throttle down 
mode" messages pop up. Is this what you are referring to?

>
> Other than that, I would be making sure all the corosync.conf timeouts and 
> other settings are appropriate.

I never paid much attention to them so far. But it seems to me the Debian 
defaults are quite conservative, all the more so given my 10GbE (~0.2 ms 
latency) interconnect and the care I took in prioritizing Corosync traffic 
(thanks to switch QoS/GMB and Linux 'tc'; a rough 'tc' sketch follows after the 
settings):

    token: 3000
    token_retransmits_before_loss_const: 10
    join: 60
    consensus: 3600
    vsftype: none
    max_messages: 20
    secauth: off
    amf: disabled

Am I right?
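
As for the Linux 'tc' side of that prioritization, it is nothing fancy; a rough 
sketch of the idea (the interface name is hypothetical, and I am assuming the 
default Corosync mcastport, 5405/udp):

    # hypothetical 10GbE interconnect interface; adjust to the actual one
    IF=eth2
    # simple 'prio' qdisc: band 0 (class 1:1) is always served first
    tc qdisc add dev $IF root handle 1: prio
    # steer Corosync/Totem traffic (UDP, dport 5405) into the highest-priority band
    tc filter add dev $IF parent 1: protocol ip prio 1 u32 \
        match ip protocol 17 0xff match ip dport 5405 0xffff flowid 1:1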

PS: this work is being done within the context of the BEAT European research 
project - https://www.beat-eu.org/ - which aims, among other things, to 
"develop an online and open platform to transparently and independently 
evaluate biometric systems against validated benchmarks". There should be some 
"publication" about the infrastructure setup. If interested, I can keep you 
posted.

Best,

Cédric

>
>> Thank you in advance for your response.
>>
>> Best,
>>
>> Cédric
>>
>> -- 
>>
>> Cédric Dufour @ Idiap Research Institute
>>


