On 29 May 2014, at 10:00 pm, Gabriel Gomiz <ggo...@cooperativaobrera.coop> 
wrote:

> On 05/26/2014 08:56 PM, Andrew Beekhof wrote:
>> On 27 May 2014, at 5:48 am, Gabriel Gomiz <ggo...@cooperativaobrera.coop> 
>> wrote:
>> 
>>> Hello Andrew and cluster folks!
>>> 
>>> For the last month we have been experiencing a weird problem with the cib
>>> process on one of our nodes ('gandalf'); it's a 4-node cluster. Brief
>>> description:
>>> 
>>> For some undetermined reason (we still can't figure out why) it begins
>>> looping infinitely and consuming 100% CPU.
>> Apart from the CPU usage, is there something in particular that makes you 
>> think it's looping?
> Maybe because stracing the process also hangs and the process is not
> receiving the kill signal (maybe stuck in a system call inside the kernel??).

I wasn't saying you were wrong, just that looping and 'crazy busy' can 
sometimes look similar.
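
One rough way to tell the two apart on Linux is the process state in /proc. The sketch below is only an illustration, not something from the Pacemaker tooling: the helper names and the hint texts are made up for this example. A process shown as R is burning CPU in user space, while D (uninterruptible sleep) means it is blocked inside the kernel and will not act on signals until that call returns, which would be consistent with both kill and strace appearing to hang.

```python
from pathlib import Path

# Rough meaning of the one-letter code on the "State:" line of
# /proc/<pid>/status. The wording of the hints is illustrative only.
STATE_HINTS = {
    "R": "running: busy or spinning in user space",
    "D": "uninterruptible sleep: blocked inside the kernel, signals are deferred",
    "S": "interruptible sleep: idle, waiting for an event",
    "T": "stopped",
    "Z": "zombie: already exited, waiting to be reaped",
}


def parse_state(status_text):
    """Return the one-letter state code from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            # The line looks like "State:\tD (disk sleep)".
            return line.split()[1]
    raise ValueError("no State: line found")


def describe_pid(pid):
    """Read /proc/<pid>/status (Linux only) and return (code, hint)."""
    text = Path(f"/proc/{pid}/status").read_text()
    code = parse_state(text)
    return code, STATE_HINTS.get(code, "other")
```

Calling describe_pid() with the pid of the cib process a few times in a row gives a first impression: a steady R suggests 'crazy busy' or looping, a steady D suggests it is stuck in the kernel.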

>> There have been some big steps forward in cib for the next upstream release 
>> (it's basically 2 orders of magnitude faster/more efficient).
>> Current versions will regularly max out a core, albeit for hopefully short 
>> periods of time depending on the cluster size:
>> 
>>      https://twitter.com/beekhof/status/412913549837475840
>> 
>> It's also a vicious circle - a busy cib leads to failed resource actions, 
>> which leads to recovery operations, which leads to more work for the cib.
>> 
>> Looking at the size of your cluster, 87 resources on 4 nodes... I can 
>> imagine that benefitting greatly from the coming version.
>> 
>> I notice you're using a rhel package, are you a RH customer or is this on a 
>> clone?
>> Also, did anything specific happen prior to the CIB going nuts?
> The only thing that I can think of is a lot of calls to crm_mon via a shell
> script that we use to check which resource groups each node is servicing
> (attached if you're curious).
> We use this script to apply puppet manifests conditionally to our nodes and
> do some monitoring. We also have cron jobs that check via the script whether
> the resource group is active before running.
> Maybe the sum of those calls can make the cib process very busy...?

If you were running it every second... maybe. But something is _seriously_ 
wrong if -KILL isn't working!
I wonder how much memory it was using at the time... perhaps the kernel was 
trying to write a huge core file?
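
If the crm_mon polling ever did turn out to be frequent enough to matter, one option would be to let the puppet and cron callers share a short-lived cached copy of the output instead of each querying the cib. The following is a minimal sketch under assumptions: the callers are assumed to parse the one-shot text output of `crm_mon -1`, and the function names, cache path and 30 second lifetime are hypothetical choices, not taken from the attached script.

```python
import os
import subprocess
import tempfile
import time


def run_crm_mon():
    """Run crm_mon once (-1 is one-shot mode) and return its text output."""
    result = subprocess.run(
        ["crm_mon", "-1"], capture_output=True, text=True, check=True
    )
    return result.stdout


def cached_output(cache_path, fetch=run_crm_mon, ttl=30.0, now=time.time):
    """Return the contents of cache_path if younger than ttl seconds,
    otherwise call fetch(), store the result and return it."""
    try:
        age = now() - os.path.getmtime(cache_path)
        if age < ttl:
            with open(cache_path) as fh:
                return fh.read()
    except FileNotFoundError:
        pass  # no cache yet (or it vanished): fall through and refresh

    text = fetch()
    # Write to a temporary file and rename it into place, so that a
    # concurrent reader never sees a half-written cache file.
    directory = os.path.dirname(cache_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        fh.write(text)
    os.replace(tmp_path, cache_path)
    return text
```

With something like this, any number of callers within the lifetime window cost the cib a single query. The trade-off is that a caller may act on cluster state that is up to ttl seconds old.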

> 
> Anyway, I've built 1.1.12 rc1 RPMs and this morning I've upgraded the
> cluster. Will let you know if there is something weird after this upgrade.

Ok, I'd be interested to hear your feedback.


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
