Hi Andrew,
First, thanks for remembering my issue and looking into it :)
Jul 30 11:37:50 [..]
Yes, but... see the timeline pasted below (at 11:37, it starts to do
something).
11:20AM : cluster is up and running
11:25AM : shut down the IP
11:30AM : force a refresh with attrd_updater (because pingd is still 1)
It doesn't change anything; the IP is still seen as up...
11:37AM : change a value in the CIB (dampen from 120 to 121, for instance)
Now pingd is null on db2, but it is still 1 on db1. The crm changes were
made on db2; I don't know if that's related.
11:40AM : start the IP again
12:00PM : IP is still seen as down...
So if you can look earlier in the logs, you might see the problem.
Around 11:25AM I shut down the IP (see the timeline above), so the CIB
should have been updated with pingd=0 for both nodes, but it was not, or
only half done. At 11:37, I updated a value in the config, which usually
forces a flush of the CIB and fixes everything; that's what you saw. I can
redo a better test so you can maybe see more. So far, when the gateway
flips and pacemaker goes "berko", my trick to fix the status is
attrd_updater -R (sorry, I can't remember the exact syntax off the top of
my head), and then everything is fine again. Something is definitely not
updating the CIB, but I don't know what.
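
For the next test, I could compare what the CIB status section holds for
pingd on each node at every step of the timeline. Below is a minimal
sketch of how that might be done, assuming the status section is dumped
as XML (for example with cibadmin -Q -o status) and that pingd is stored
as an nvpair under each node's transient_attributes. The function name
pingd_by_node and the SAMPLE document are hypothetical, made up only to
mirror the 11:37 state described above; they are not real cluster output.

```python
import xml.etree.ElementTree as ET


def pingd_by_node(status_xml, attr_name="pingd"):
    """Map each node's uname to its attribute value in the CIB status
    section, or None when the attribute is not set for that node."""
    root = ET.fromstring(status_xml)
    result = {}
    for node_state in root.iter("node_state"):
        uname = node_state.get("uname") or node_state.get("id")
        value = None
        # Only look at transient (status) attributes, not resource history.
        for transient in node_state.iter("transient_attributes"):
            for nvpair in transient.iter("nvpair"):
                if nvpair.get("name") == attr_name:
                    value = nvpair.get("value")
        result[uname] = value
    return result


# Hypothetical sample mirroring the 11:37 state: db1 still 1, db2 unset.
SAMPLE = """
<status>
  <node_state id="db1" uname="db1">
    <transient_attributes id="db1">
      <instance_attributes id="status-db1">
        <nvpair id="status-db1-pingd" name="pingd" value="1"/>
      </instance_attributes>
    </transient_attributes>
  </node_state>
  <node_state id="db2" uname="db2">
    <transient_attributes id="db2">
      <instance_attributes id="status-db2"/>
    </transient_attributes>
  </node_state>
</status>
"""

if __name__ == "__main__":
    print(pingd_by_node(SAMPLE))
```

Running it against a dump taken at each step (11:25, 11:30, 11:37, 11:40)
would show which node's value failed to change and when.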
Alas there is no debug running so I can't say for sure that the call
returned, but this makes it pretty likely:
Anyway, how do I enable more debug logging so we can see what fails to
update the CIB? Then I will give you a fresh hb_report :)
Cheers,
Thomas
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker