On 26.08.2010 10:38, Dejan Muhamedagic wrote:
> Hi,
>
> On Wed, Aug 25, 2010 at 08:56:08PM +0200, Cnut Jansen wrote:
>> On 25.08.2010 16:00, Dejan Muhamedagic wrote:
>>> Hi,
>>>
>>> On Tue, Aug 24, 2010 at 05:19:23PM +0200, Cnut Jansen wrote:
>>>> Hi,
>>>>
>>>> just (for now) a short question to make sure I didn't miss anything:
>>>> What's the designated reaction of Pacemaker when a resource agent
>>>> called for monitoring a resource, which is supposed to run and thus
>>>> result in a return of 0 (OCF_SUCCESS), returns 7 (OCF_NOT_RUNNING)?
>>>> Shall Pacemaker's very next call be for stopping the resource or shall
>>>> it be yet another (or even several) monitorings?
>>>
>>> It should be stop, followed by start, either on the same node or
>>> on another depending on the migration-threshold setting and
>>> failcount.
>>
>> Ok, that's what I expected.
>> So there are no so-far-unknown-to-me circumstances where it's by
>> design that Pacemaker - after having gotten a rc=7 from the RA; and for
>> adding a "FAILED" behind the resource in crm_mon, it obviously also
>> understood it correctly - calls the RA yet another several times for
>> monitoring (while letting the rest of the cluster hang) before finally
>> calling the desired stop, instead of immediately calling the RA for
>> stopping and continuing with the pending transactions and migrations.
>
> Yes, that sounds quite unusual.
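For illustration, the recovery sequence described in the quoted reply (stop, then start, on the same node or on another depending on migration-threshold and failcount) could be modelled roughly as follows. This is only a toy sketch in Python, not Pacemaker code: the function `plan_recovery`, its parameters and the node names are made up for the example. It assumes a node is given up once its fail count reaches the threshold, and it ignores the scores, stickiness and constraints that the real policy engine also weighs. It also treats every monitor result other than OCF_SUCCESS as a failure, which is a simplification.

```python
from typing import List, Optional, Tuple

# OCF return codes mentioned in the thread.
OCF_SUCCESS = 0
OCF_NOT_RUNNING = 7


def plan_recovery(
    node: str,
    failcount: int,
    monitor_rc: int,
    migration_threshold: Optional[int],
    other_nodes: List[str],
) -> List[Tuple[str, str]]:
    """Return the (action, node) steps that follow one monitor result.

    `failcount` is the hypothetical fail count on `node` before this
    monitor result; `migration_threshold=None` models "no threshold set".
    """
    if monitor_rc == OCF_SUCCESS:
        return []  # resource is healthy, nothing to recover

    # Simplification: any non-success rc counts as one more failure.
    failcount += 1

    # The next action is a stop on the current node, not another monitor.
    steps = [("stop", node)]

    # Assumption: move away once the fail count reaches the threshold
    # and some other node is available; otherwise restart in place.
    threshold_reached = (
        migration_threshold is not None and failcount >= migration_threshold
    )
    target = other_nodes[0] if threshold_reached and other_nodes else node
    steps.append(("start", target))
    return steps


if __name__ == "__main__":
    # First failure with a threshold of 3: restart on the same node.
    print(plan_recovery("node1", 0, OCF_NOT_RUNNING, 3, ["node2"]))
    # Third failure with a threshold of 3: start on the other node.
    print(plan_recovery("node1", 2, OCF_NOT_RUNNING, 3, ["node2"]))
```

In this model a monitor returning 7 is never followed by further monitor calls: the very next step is always a stop, which is the behaviour confirmed above.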
Just for reference: though I'm not absolutely sure about it, from today's point of view that strange not-stopping-the-resource-after-rc=7 behaviour may have been a symptom of a combination of a quite sluggish cluster (Pacemaker still waiting for returns from the RAs and/or from Pacemaker itself) and zombie monitor ops (in my own RAs' output I could only see that they were called for the monitor action, but not the id or anything else identifying the monitor op calling them).

Since yesterday, when we patched to the latest officially released SLES11-HAE-SP1 packages, the zombie monitor ops (as well as many other problems) are gone (with only a few minor new ones so far (-;). Though I haven't explicitly looked for it, I also haven't lately seen rc=7 being ignored and the monitor action being re-called several times. (But lately, thanks to enhancements to my own RAs (Tomcat6/Apache), I could also remove the 15-second start-delays from the monitor ops, which sped them up a lot, so they are now only rarely the ones attracting the zombie monitor ops.)

Current version now is (SLES11-HAE-SP1): 1.1.2-0.6.1 (Arch: x86_64)
Displays in crm_mon as: 1.1.2-ecb1e2ea172ba2551f0bd763e557fccde68c849b

>> (btw., jfyi: migration-thresholds are currently completely banned out of
>
> Why? Anything wrong with them?

See my other thread, the bugzilla entry linked there, and Andrew Beekhof's note in that bugzilla confirming that the fail-count issue has been cleared upstream:
http://developerbugs.linux-foundation.org/show_bug.cgi?id=2468
migration-threshold and failure-timeout seem to be fixed in this new, current SLES release too.

>> my configurations, so this is another issue; I probably also might have
>> yet another issue / possible bug regarding zombie-(monitor-)operations,
>> with symptoms like of an off-by-one-error)
>
> Please file a bugzilla if you find a bug.
Though I had already collected dozens of hb_reports with zombie monitor operations occurring, and could quite exactly "predict" such a zombie just from watching crm_mon while nodes were switching to standby, I haven't identified an exact cause for it yet (it at least turned out not to be an ordinary off-by-one error; in the beginning it often hit the resources controlled by my own RAs, which were the ones starting last, but after I had sped them up, it hit them the least (-#), so I haven't filed anything about it yet.

Anyway, those zombie monitor operations seem to be gone now too, so they were probably just another old-version bug that had long been resolved and only lingered because of the very conservative update policies of enterprise distributions.