On Wed, Oct 10, 2012 at 6:44 PM, James Harper <james.har...@bendigoit.com.au> wrote:
>> On 10/09/2012 01:42 PM, James Harper wrote:
>> > As per previous post, I'm seeing very high cib load whenever I make a
>> > configuration change, enough load that things timeout seemingly
>> > instantly. I thought this was happening well before the configured
>> > timeout but now I'm not so sure, maybe the timeouts are actually
>> > working okay and it just seems instant. If the timeouts are in fact
>> > working correctly then it's keeping the CPU at 100% for over 30
>> > seconds to the exclusion of any monitoring checks (or maybe locking
>> > the cib so the checks can't run?)
>> >
>> > When I make a change I see the likes of this sort of thing in the
>> > logs (see data below email), which I thought might be solved by this
>> > https://github.com/ClusterLabs/pacemaker/commit/10e9e579ab032bde3938d7f3e13c414e297ba3e9
>> > but i just checked the 1.1.7 source that the Debian packages are
>> > built from and it turns out that that patch already exists in 1.1.7.
>> >
>> > Are the messages below actually an indication of a problem? If I
>> > understand it correctly it's failing to apply the configuration diff
>> > and is instead forcing a full resync of the configuration across
>> > some or all nodes, which is causing the high load.
>> >
>> > I ran the crm_report but it includes a lot of information I really
>> > need to remove so I'm reluctant to submit it in full unless it
>> > really all is required to resolve the problem.
>> >
>>
>> You already did some tuning like increasing batch-limit in your cluster
>> properties and increased corosync timings? Hard to say more without
>> getting more information ... if your configuration details are too
>> sensitive to post on a public mailing-list you can of course hire
>> someone and give that information under NDA ....
>>
>
> I guess I'd first like to know if the log entries I was seeing ("Failed
> application of an update diff" and "Requesting re-sync from peer") means
> that a full resync is being done, and if that's a problem or not.

There are occasions when it's not a problem, but I don't think any of
them apply to you.

Questions:
- are you making any config changes when this behaviour is occurring?
- if so, from one node only or many?
- what version is this? 1.1.7 or 1.1.7 plus some Debian patches? which patches?

> My understanding of my problem is that for whatever reason, a full resync
> is taking a lot more CPU than I would have expected, and is being
> triggered even for minor changes (eg adding a location to a resource).
> Resolving the former (if it's actually a problem?) would be nice, but
> resolving the latter would be acceptable for now.
>
> As for increasing batch limit,
> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html
> tells me that this is "The number of jobs that the TE is allowed to
> execute in parallel.". If changing a single resource location completely
> consumes all CPU in all nodes for many seconds, is allowing more work to
> be done in parallel really going to help?
>
> For the corosync timings, are these the "token", "join", etc values in
> corosync.conf? I don't have any evidence in my logs that that layer is
> having any problems, although a _lot_ of logs are generated and I could
> easily miss something. That would only sidestep the issue though I think.
>
> In any case I'll endeavour to clean up my logs as required and submit a
> bug report.
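
Since a lot of log output is generated, a rough count of the two messages quoted above may help show how often the resync path is hit per config change. The following is only a minimal sketch: the function name and dictionary keys are hypothetical, the match is a plain substring test on the two fragments quoted in this thread, and the exact wording of the log lines may differ between versions.

```python
# Hypothetical helper: count log lines that contain the two message
# fragments quoted in this thread. Nothing here is part of Pacemaker.
PATTERNS = {
    "failed_diff": "Failed application of an update diff",
    "resync_request": "Requesting re-sync from peer",
}


def count_resync_messages(lines):
    """Return how many lines contain each fragment (case-sensitive)."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, fragment in PATTERNS.items():
            if fragment in line:
                counts[name] += 1
    return counts


# Example use, assuming `path` points at whichever file your cluster
# logs to (the location depends on your logging setup):
#
#     with open(path) as log:
#         print(count_resync_messages(log))
```

Running it on each node's log after a single change would show whether the re-sync requests come from one node or from all of them.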
>
> Thanks for your time and patience
>
> James
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org