Hi,

On Fri, Dec 18, 2009 at 03:44:11PM +0100, Sebastian Reitenbach wrote:
> Hi,
>
> I have a 4 node cluster, managing some XEN resources. The XEN resources have
> location constraints defined, based on pingd. On each node, a pingd clone is
> running. XEN resources are only started when pingd is able to ping the
> ping node. The xen nodes also have a preferred and fallback location defined.
> The pingd resources have a timeout of 60 seconds defined.
> The cluster nodes run on SLES11, x86_64, with those rpms installed:
> heartbeat-3.0.0-33.2
> pacemaker-1.0.5-4.1
> libpacemaker3-1.0.5-4.1
> pacemaker-mgmt-client-1.99.2-7.1
> pacemaker-mgmt-1.99.2-7.1
> openais-0.80.3-26.1
> libopenais2-0.80.3-26.1
>
> I want to switch to a redundant network layout, using spanning tree between
> the switches. In case of a spanning tree recalculation because of a path
> failure or whatever other reason, I don't want to have nodes declared as dead
> because they cannot send heartbeats to each other at that time.
>
> Therefore I tried to prepare pacemaker on the cluster nodes.
> I put the whole cluster in maintenance mode via the hb_gui.
>
> Then I reconfigured /etc/ha.d/ha.cf and defined deadtime 70 and initdead 100.
> Then I restarted heartbeat on each cluster node. I waited until all cluster
> members were marked green/online in the GUI again. Then I turned off the
> maintenance mode.
> All XEN resources were shut down immediately.

Oops.

> Then

A sentence missing?

> In the hb_gui, the pingd resources looked a bit "strange". After leaving the
> maintenance mode, only one pingd resource showed the description
> ocf.:pacemaker:pingd, in hb_gui under Management. They were green, and showed
> it running on ['<server>'].
>
> Then I tried to restart the XEN resources manually, but the cluster only
> tried to start them on one host, not on the preferred or fallback location.
>
> Then I shut down heartbeat on all 4 cluster nodes again, and put back the
> old ha.cf file, with deadtime 15 and initdead 40. And restarted heartbeat.
> After the cluster was running, the pingd resources were also started up.
> And then after the 60 seconds, the ping attribute was set, and the XEN
> resources were started up on all hosts.
>
> I wonder about some things:
> 1. Why three of the pingd resources had no description shown after leaving
> the maintenance mode.
>
> 2. Why all XEN resources were shut down after leaving the maintenance mode.
> Here I have a theory: In maintenance mode, the pingd attribute did not get
> updated, and because heartbeat was restarted on each node, the attribute was
> not set. Therefore when leaving the maintenance mode, pacemaker decided to
> shut down the XEN resources, because the pingd attribute was not set.

Sounds like a plausible explanation.

> 3. Why the pingd attribute was not set immediately after pingd started up
> and was able to ping the ping node. After pingd was started, it waited 60
> seconds (the timeout value) to set the attribute, so that the XEN
> resources were then able to start, due to their location constraint.
>
> 4. Maybe the answers to the other questions will answer this already:
> Why the cluster behaved that strangely at all with the large timeout values
> set in ha.cf.
>
> I could also send a cluster-report in case it may help to figure out what was
> wrong here, I just did not want to send a large attachment to the list in
> the first place.

Probably best to open a bugzilla and attach the report there.

I guess that special care is necessary when setting resources to
unmanaged mode in case there are constraints which depend on
pingd attributes.

Thanks,

Dejan

> regards,
> Sebastian
>
> _______________________________________________
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
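
The theory in point 2 can be illustrated with a small Python sketch. It is
only a model, not Pacemaker code. It assumes the location constraints are of
the common form that scores a node -INFINITY while the pingd attribute is
undefined or not positive; the actual constraints were not posted in this
thread. The node names, the scores (100 for the preferred node, 50 for the
others) and the function names are made up for illustration.

```python
from typing import Dict, Optional

NEG_INF = float("-inf")


def location_score(node_attrs: Dict[str, str], preferred: bool,
                   attr: str = "pingd") -> float:
    """Score one node for a Xen resource under the ASSUMED constraint."""
    value = node_attrs.get(attr)
    # Assumed rule: -INFINITY if the attribute is undefined or <= 0.
    if value is None or int(value) <= 0:
        return NEG_INF
    # Hypothetical scores for the preferred and the fallback location.
    return 100.0 if preferred else 50.0


def place(nodes: Dict[str, Dict[str, str]], preferred: str) -> Optional[str]:
    """Return the node the resource may run on, or None if it must stop."""
    scores = {name: location_score(attrs, name == preferred)
              for name, attrs in nodes.items()}
    if not scores:
        return None
    best = max(scores, key=scores.get)
    # If even the best node is banned, the resource cannot run anywhere.
    return best if scores[best] > NEG_INF else None


# Attribute set on both nodes: the resource runs on its preferred node.
before = {"node1": {"pingd": "100"}, "node2": {"pingd": "100"}}
print(place(before, "node1"))          # node1

# After a heartbeat restart with no attribute update, the attribute is
# missing on every node, so every node is banned and the resource stops.
after_restart = {"node1": {}, "node2": {}}
print(place(after_restart, "node1"))   # None
```

Under these assumptions, leaving maintenance mode while the attribute is
still undefined on every node leaves the resource with no allowed node,
which would match the immediate shutdown described above.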