On 05/30/2014 12:12 AM, Andrew Beekhof wrote:
> There have been some big steps forward in cib for the next upstream release
> (its basically 2 orders of magnitude faster/more efficient).
> Current versions will regularly max out a core, albeit for hopefully short
> periods of time depending on the cluster size:
>
> https://twitter.com/beekhof/status/412913549837475840
>
> Its also a vicious circle - a busy cib leads to failed resource actions,
> which leads to recovery operations, which leads to more work for the cib.
>
> Looking at the size of your cluster, 87 resources on 4 nodes... I can imagine
> that benefitting greatly from the coming version.
>
> I notice you're using a rhel package, are you a RH customer or is this on a
> clone?

Clone. CentOS.

> Also, did anything specific happen prior to the CIB going nuts?
>> Only thing that I can think of is a lot of calls to crm_mon via a shell
>> script that we use to check which resource groups each node is servicing
>> (attached if you're curious).
>> We use this script to apply puppet manifests conditionally to our nodes and
>> do some monitoring. Also we have cron jobs checking via the script if the
>> resource group is active before running.
>> Maybe the sum of that calls can make cib process very busy...?
> If you were running it every second... maybe. But something is _seriously_
> wrong if -KILL isn't working!
> I wonder how much memory it was using at the time... perhaps the kernel was
> trying to write a huge core file?

I don't think so. It was several days in that state.

Is there any way to check whether a node is running a resource group with a
single, simple call to crm resource? Because I didn't find a way, we had to
write a script that parses the entire crm_mon output.

>> Anyway, I've built 1.1.12 rc1 RPMS and this morning I've upgraded the
>> cluster. Will let you know if
>> there is something weird after this upgrade.
> Ok, I'd be interested to hear your feedback.

1.1.12 rc1 has been working flawlessly so far, so it looks like the problem is
fixed in that version. Thanks!
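
On the single-call question above, one possibility is a minimal sketch built on `crm_resource --locate`. It is not a tested replacement for the crm_mon-parsing script. It assumes the command prints lines of the form "resource <name> is running on: <node>", which may vary between Pacemaker versions, and that `uname -n` matches the cluster node name. The function names `runs_on_node` and `resource_is_local` are made up for illustration.

```sh
#!/bin/sh
# Sketch only. ASSUMPTION: `crm_resource --locate` prints lines such as
#   resource grp_web is running on: node1
# Check the actual output of your Pacemaker version before relying on this.

# Reads locate-style output on stdin; succeeds (exit 0) if any line names
# the given node right after "is running on:".
runs_on_node() {
    awk -v node="$1" '
        /is running on:/ {
            sub(/.*is running on:[ \t]*/, "")   # keep only what follows
            if ($1 == node) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Succeeds if resource (or group) $1 is reported as running on this host.
# ASSUMPTION: `uname -n` equals the cluster node name.
resource_is_local() {
    crm_resource --resource "$1" --locate 2>/dev/null | runs_on_node "$(uname -n)"
}

# Example use in a cron wrapper (grp_web is a placeholder group name):
#   resource_is_local grp_web && /usr/local/bin/nightly-job
```

Keeping the parsing in its own function means it can be checked against captured output without touching the cluster, and each check asks about one resource rather than pulling the full crm_mon status.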
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org