On 05/26/2014 08:56 PM, Andrew Beekhof wrote:
> On 27 May 2014, at 5:48 am, Gabriel Gomiz <ggo...@cooperativaobrera.coop> wrote:
>
>> Hello Andrew and cluster folks!
>>
>> For the last month we have been experiencing a weird problem with the cib
>> process on one of our nodes ('gandalf'); it's a 4-node cluster. Brief
>> description:
>>
>> For some undetermined reason (we still can't figure out why) it begins
>> looping infinitely and consuming 100% CPU.
> Apart from the CPU usage, is there something in particular that makes you
> think it's looping?
Maybe because stracing the process also hangs, and the process does not respond
to the kill signal (maybe it is stuck in a system call inside the kernel?).
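One rough way to check the "stuck inside the kernel" guess on Linux is to look at the process state letter in /proc/<pid>/stat. The sketch below is only an illustration and not something from the thread: the helper names are made up, and the PID of the cib process would have to be supplied by hand. A state of D (uninterruptible sleep) would be consistent with strace hanging and signals not being delivered, while R only says the process is runnable.

```python
import os


def parse_state(stat_line):
    """Extract the one-letter process state from the text of /proc/<pid>/stat."""
    # The second field (the command name) is wrapped in parentheses and may
    # itself contain spaces or ')', so split on the LAST ')' and take the
    # first field after it.
    return stat_line.rsplit(")", 1)[1].split()[0]


def process_state(pid):
    """Read /proc/<pid>/stat (Linux only) and return the state letter."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_state(f.read())


if __name__ == "__main__":
    # Hypothetical usage: inspect this script's own process. For the cib
    # case, pass the PID of the cib daemon instead.
    if os.path.exists("/proc/self/stat"):
        print(process_state(os.getpid()))
```

Common letters are R (running or runnable), S (interruptible sleep), D (uninterruptible sleep), T (stopped) and Z (zombie).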
> There have been some big steps forward in cib for the next upstream release
> (it's basically 2 orders of magnitude faster/more efficient).
> Current versions will regularly max out a core, albeit for hopefully short 
> periods of time depending on the cluster size:
>
>       https://twitter.com/beekhof/status/412913549837475840
>
> It's also a vicious circle - a busy cib leads to failed resource actions,
> which leads to recovery operations, which leads to more work for the cib.
>
> Looking at the size of your cluster, 87 resources on 4 nodes... I can imagine 
> that benefitting greatly from the coming version.
>
> I notice you're using a RHEL package; are you a RH customer or is this on a
> clone?
> Also, did anything specific happen prior to the CIB going nuts?
The only thing I can think of is a lot of calls to crm_mon via a shell script
that we use to check which resource groups each node is servicing (attached if
you're curious). We use this script to apply puppet manifests conditionally to
our nodes and to do some monitoring. We also have cron jobs that check via the
script whether a resource group is active before running. Maybe the sum of
those calls can make the cib process very busy...?
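If the repeated crm_mon calls do turn out to matter, one possible way to reduce them is to let all callers share a short-lived cached copy of the output instead of each querying the cluster. The following is a minimal sketch under that assumption, not what resources.sh actually does: the function name, the cache path and the 30-second lifetime are all hypothetical.

```python
import os
import subprocess
import time


def cached_output(cmd, cache_path, max_age=30.0):
    """Return stdout of cmd, reusing cache_path if it is younger than max_age seconds."""
    try:
        if time.time() - os.path.getmtime(cache_path) < max_age:
            with open(cache_path) as f:
                return f.read()
    except OSError:
        pass  # no cache yet (or unreadable): fall through and run the command

    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

    # Write to a temporary file and rename it into place, so concurrent
    # readers never see a partially written cache file.
    tmp = f"{cache_path}.{os.getpid()}.tmp"
    with open(tmp, "w") as f:
        f.write(out)
    os.replace(tmp, cache_path)
    return out
```

A caller could then use something like cached_output(["crm_mon", "-1"], "/var/run/crm_mon.cache"), where -1 is crm_mon's one-shot mode and the path is only an example. The trade-off is that cron jobs and puppet runs may act on status that is up to max_age seconds old.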

Anyway, I've built 1.1.12 rc1 RPMs and upgraded the cluster this morning. I
will let you know if anything weird happens after this upgrade.

Thanks!

Attachment: resources.sh
Description: application/shellscript

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
