----- Original Message -----
> On 12/08/14 07:52, Andrew Beekhof wrote:
> > On 11 Aug 2014, at 10:10 pm, Cédric Dufour - Idiap Research Institute
> > <cedric.duf...@idiap.ch> wrote:
> >
> >> Hello,
> >>
> >> Thanks to Pacemaker 1.1.12, I have been able to set up a (very?) large
> >> cluster:
> >
> > That's certainly up there as one of the biggest :)
>
> Well, actually, I sized it down from 444 to 277 resources by merging the
> 'VirtualDomain' and 'MailTo' RAs/primitives into a single custom
> 'LibvirtQemu' one. The CIB is now ~3MiB uncompressed / ~100kiB compressed.
> (This also keeps the informational-only 'MailTo' RA from burdening the
> cluster.)
>
> 'PCMK_ipc_buffer' at 2MiB might be overkill now... but I'd rather stay on
> the safe side.
>
> Q: Are there adverse effects in keeping 'PCMK_ipc_buffer' high?
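(For reference, 'PCMK_ipc_buffer' is read from the daemons' environment at
start-up; a minimal sketch, assuming Debian's /etc/default/pacemaker is the
environment file being sourced -- with the corosync 1.x plugin it may need to
be exported in corosync's environment instead:

    # /etc/default/pacemaker  (hypothetical excerpt)
    # IPC buffer size in bytes (2 MiB); daemons must be restarted to pick it up
    PCMK_ipc_buffer=2097152
)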
More system memory will be required for IPC connections. Unless you're
running low on RAM, you should be fine with the buffer you set.

> 277 resources are:
> - 22 (cloned) network-health (ping) resources
> - 88 (cloned) stonith resources (I have 4 stonith devices)
> - 167 LibvirtQemu resources (83 "general-purpose" servers and 84 SGE-driven
>   computation nodes)
> (and more LibvirtQemu resources are expected to come)
>
> > Have you checked pacemaker's CPU usage during startup/failover? I'd be
> > interested in your results.
>
> I finally set 'batch-limit' to 22 - the number of nodes - as it makes
> sense when enabling a new primitive, since all monitor operations get
> dispatched immediately to all nodes at once.
>
> When bringing a standby node to life:
>
> - On the "waking" node (E5-2690v2): 167+5 resource monitoring operations
>   get dispatched; the CPU load of the 'cib' process remains below 100% as
>   the operations are executed, in batches of 22 (though one cannot really
>   see that "batching", the monitoring operations succeeding very quickly),
>   and they complete in ~2 seconds. With Pacemaker 1.1.7, the 'cib' load
>   would have peaked at 100% even before the first monitoring operation
>   started (because of the CIB refresh, I guess) and would remain so for
>   several tens of seconds (often resulting in timeouts and monitoring
>   operation failures).
>
> - On the DC node (E5-2690v2): the CPU also remains below 100%, alternating
>   between the 'cib', 'pengine' and 'crmd' processes. The DC is back to
>   IDLE within ~4 seconds.
>
> I tried raising the 'batch-limit' to 50 and witnessed CPU load peaking at
> 100% while carrying out the same procedure, but all went well nonetheless.
>
> While I still had the ~450 resources, I also "accidentally" brought all 22
> nodes back to life together (well, actually started the DC alone and then
> started the remaining 21 nodes together). As could be expected, the DC got
> quite busy (dispatching/executing the ~450*22 monitoring operations on all
> nodes). It took 40 minutes for the cluster to stabilize. But it did
> stabilize, with no timeouts and no monitor operation failures! A few "high
> CIB load detected / throttle down mode" messages popped up, but all went
> well.
>
> Q: Is there a way to favor more powerful nodes for the DC (in other words,
> push the DC "election" process in a preferred direction)?
>
> >> Last updated: Mon Aug 11 13:40:14 2014
> >> Last change: Mon Aug 11 13:37:55 2014
> >> Stack: classic openais (with plugin)
> >
> > I would at least try running it with corosync 2.x (no plugin).
> > That will use CPG for messaging, which should perform even better.
>
> I'm running into a deadline now and will have to stick to 1.4.x for the
> moment. But as soon as I can free an old test Intel modular chassis I have
> around, I'll try backporting Corosync 2.x from Debian/Experimental to
> Debian/Wheezy and see how it goes.
>
> >> Current DC: bc1hx5a05 - partition with quorum
> >> Version: 1.1.12-561c4cf
> >> 22 Nodes configured, 22 expected votes
> >> 444 Resources configured
> >>
> >> PS: 'corosync' (1.4.7) traffic goes through a 10GbE network, with strict
> >> QoS priority over all other traffic.
> >>
> >> Are there recommended configuration tweaks I should not miss in such a
> >> situation?
> >>
> >> So far, I have:
> >> - Raised the 'PCMK_ipc_buffer' size to 2MiB
> >> - Lowered the 'batch-limit' to 10 (though I believe my setup could
> >>   sustain the default 30)
> >
> > Yep, definitely worth trying the higher value.
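(Side note: 'batch-limit' is a regular cluster property, so it can be changed
on a live cluster; a minimal sketch, assuming the crm shell is installed:

    # cap the number of actions the DC dispatches in parallel
    crm configure property batch-limit=22

    # equivalent with the low-level tool shipped with pacemaker
    crm_attribute --type crm_config --name batch-limit --update 22
)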
> > We _should_ automatically start throttling ourselves if things get too
> > intense.
>
> Yep. As mentioned above, I did see "high CIB load detected / throttle down
> mode" messages pop up. Is this what you are referring to?
>
> > Other than that, I would be making sure all the corosync.conf timeouts
> > and other settings are appropriate.
>
> Never paid much attention to them so far. But it seems to me the Debian
> defaults are quite conservative, all the more so given my 10GbE (~0.2ms
> latency) interconnect and the care I took in prioritizing Corosync traffic
> (thanks to the switches' QoS/GMB and Linux 'tc'):
>
>     token: 3000
>     token_retransmits_before_loss_const: 10
>     join: 60
>     consensus: 3600
>     vsftype: none
>     max_messages: 20
>     secauth: off
>     amf: disabled
>
> Am I right?
>
> PS: this work is being done within the context of the BEAT European
> research project - https://www.beat-eu.org/ - which aims, among other
> things, to "develop an online and open platform to transparently and
> independently evaluate biometric systems against validated benchmarks".
> There shall be some "publication" about the infrastructure setup. If
> interested, I can keep you posted.
>
> Best,
>
> Cédric
>
> >> Thank you in advance for your response.
> >>
> >> Best,
> >>
> >> Cédric
> >>
> >> --
> >>
> >> Cédric Dufour @ Idiap Research Institute

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org