On 29 May 2014, at 10:00 pm, Gabriel Gomiz <ggo...@cooperativaobrera.coop> wrote:
> On 05/26/2014 08:56 PM, Andrew Beekhof wrote:
>> On 27 May 2014, at 5:48 am, Gabriel Gomiz <ggo...@cooperativaobrera.coop> wrote:
>>
>>> Hello Andrew and cluster folks!
>>>
>>> In the last month we are experiencing some weird problem with cib process in one of our nodes
>>> ('gandalf'), it's a 4-node cluster. Brief description:
>>>
>>> After some undetermined reason (we still can't figure out why) it begins looping infinitely and
>>> consuming 100% CPU.
>> Apart from the CPU usage, is there something in particular that makes you think it's looping?
> Maybe because stracing the process also hangs and the process is not receiving the kill signal
> (maybe stuck in a system call inside kernel??).

I wasn't saying you were wrong, just that looping and 'crazy busy' can sometimes look similar.

>> There have been some big steps forward in cib for the next upstream release
>> (it's basically 2 orders of magnitude faster/more efficient).
>> Current versions will regularly max out a core, albeit for hopefully short
>> periods of time depending on the cluster size:
>>
>> https://twitter.com/beekhof/status/412913549837475840
>>
>> It's also a vicious circle - a busy cib leads to failed resource actions,
>> which leads to recovery operations, which leads to more work for the cib.
>>
>> Looking at the size of your cluster, 87 resources on 4 nodes... I can
>> imagine that benefiting greatly from the coming version.
>>
>> I notice you're using a rhel package, are you a RH customer or is this on a clone?
>> Also, did anything specific happen prior to the CIB going nuts?
> Only thing that I can think of is a lot of calls to crm_mon via a shell script that we use to check
> which resource groups each node is servicing (attached if you're curious).
> We use this script to apply puppet manifests conditionally to our nodes and do some monitoring. Also
> we have cron jobs checking via the script if the resource group is active before running.
> Maybe the sum of those calls can make cib process very busy...?

If you were running it every second... maybe.
But something is _seriously_ wrong if -KILL isn't working!
I wonder how much memory it was using at the time... perhaps the kernel was trying to write a huge core file?

>
> Anyway, I've built 1.1.12 rc1 RPMS and this morning I've upgraded the cluster. Will let you know if
> there is something weird after this upgrade.

Ok, I'd be interested to hear your feedback.
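
If the combined crm_mon calls from puppet, monitoring and cron do turn out to matter, one way to cut the load on the cib would be to let all of those callers share a single cached status snapshot. The following is only a rough sketch and not something taken from the attached script: the helper name `cached_command_output`, the cache path and the 30 second default are all hypothetical. The only assumption about Pacemaker itself is that `crm_mon -1` prints the status once and exits.

```python
import os
import subprocess
import tempfile
import time


def cached_command_output(cmd, cache_path, max_age=30.0):
    """Return stdout of cmd, reusing a cached copy younger than max_age seconds.

    Hypothetical helper: name, cache location and default age are
    illustrative assumptions, not part of Pacemaker.
    """
    try:
        age = time.time() - os.path.getmtime(cache_path)
        if age < max_age:
            # Cache is fresh enough: answer from the file, do not run cmd.
            with open(cache_path, "r") as fh:
                return fh.read()
    except OSError:
        pass  # no usable cache yet; fall through and run the command

    output = subprocess.run(
        cmd, check=True, capture_output=True, text=True
    ).stdout

    # Write to a temporary file in the same directory and rename it into
    # place, so a concurrent reader never sees a half-written cache file.
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        fh.write(output)
    os.replace(tmp_path, cache_path)
    return output
```

A caller would then use something like `cached_command_output(["crm_mon", "-1"], "/tmp/crm_mon.cache")` and parse the returned text as before. The trade-off is that the answer can be up to `max_age` seconds stale, which may or may not be acceptable for a cron job that checks whether a resource group is active before running.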
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org