hi,

On Sun, February 15, 2009 7:37 pm, Raoul Bhatia [IPAX] wrote:
> hi,
>
> i had the cluster up and running (2 nodes, wc01 and wc02).
>
> 1) following recent changes to the globally-unique handling,
> i reviewed all my settings and explicitly set globally-unique=false to
> several resources.
>
> 2) i tried to change the logging to log to the local7 changing my
> logd.cf and ha.cf configuration files on the non-dc node.
>
> 3) i configured rsyslog to redirect local7 to /var/log/ha.log and
> /var/log/ha-debug.log.
>
>
> 4) i issued /etc/init.d/heartbeat reload

So you wanted all resources to be migrated to wc02 ... as this kills and
restarts all heartbeat/pacemaker processes. Why not put the cluster into
unmanaged mode instead?

>
> i observed the following thing:
>
> a) the node kind of fenced itself. most programs (including ssh) got
> killed. wc02 took over all resources after some time.

Looks like "group_webservice" also needs a "globally-unique=false"
attribute:

Feb 15 18:10:15 wc01 crmd: [17757]: info: do_lrm_rsc_op: Performing key=134:44:0:9949f46a-b1a8-454e-a180-b7d746507937 op=fs_www:0_stop_0 )
Feb 15 18:10:15 wc01 lrmd: [17754]: info: rsc:fs_www:0: stop
Feb 15 18:10:15 wc01 crmd: [17757]: info: do_lrm_rsc_op: Performing key=146:44:0:9949f46a-b1a8-454e-a180-b7d746507937 op=fs_www:1_stop_0 )
Feb 15 18:10:15 wc01 lrmd: [17754]: info: rsc:fs_www:1: stop

... the clones cannot be told apart; both seem to be running on this
node, so both are stopped ...

Feb 15 18:10:15 wc01 Filesystem[22969]: INFO: unmounted /data/www successfully

... the first umount works as expected ...

Feb 15 18:10:15 wc01 lrmd: [17754]: info: RA output: (fs_www:1:stop:stderr) umount2: Invalid argument
Feb 15 18:10:15 wc01 lrmd: [17754]: info: RA output: (fs_www:1:stop:stderr) umount: /data/www: not mounted

... some sort of race condition ... the second umount returns an error,
and the Filesystem RA now tries to kill all processes accessing the
"filesystem", which is now the root filesystem where the mountpoint
/data/www resides :-(

Feb 15 18:10:15 wc01 lrmd: [17754]: info: RA output: (fs_www:1:stop:stderr)
Feb 15 18:10:15 wc01 Filesystem[22970]: ERROR: Couldn't unmount /data/www; trying cleanup with SIGTERM

... and _all_ processes accessing your root filesystem receive a
SIGTERM. So the node more or less killed itself.

Regards,
Andreas

>
> can any1 explain why this happened?
>
> cheers,
> raoul
> --
> ____________________________________________________________________
> DI (FH) Raoul Bhatia M.Sc.          email.          r.bha...@ipax.at
> Technischer Leiter
>
> IPAX - Aloy Bhatia Hava OEG         web.          http://www.ipax.at
> Barawitzkagasse 10/2/2/11           email.            off...@ipax.at
> 1190 Wien                           tel.               +43 1 3670030
> FN 277995t HG Wien                  fax.            +43 1 3670030 15
> ____________________________________________________________________
> _______________________________________________
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>

--
: Andreas Kurz
: LINBIT | Your Way to High Availability
: Tel +43-1-8178292-64, Fax +43-1-8178292-82
:
: http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
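
P.S. To illustrate the failure path described in the log walk-through, here is a toy Python model of two stop operations hitting the same mountpoint. It is only a sketch under assumptions and not the Filesystem RA's actual code: the function names (`filesystem_of`, `naive_stop`, `guarded_stop`) and the process table are hypothetical, and a real umount can also fail for other reasons (for example a busy filesystem) that are not modelled here.

```python
# Toy model only; NOT the real Filesystem resource agent.
MOUNTPOINT = "/data/www"


def filesystem_of(path, mounts):
    """Return the mounted filesystem containing `path` (longest matching mount)."""
    matches = [m for m in mounts
               if path == m or path.startswith(m.rstrip("/") + "/")]
    return max(matches, key=len)


def naive_stop(mounts, procs):
    """Unmount; if that fails, signal every process on the filesystem
    that currently holds the mountpoint. Returns the signalled names."""
    if MOUNTPOINT in mounts:
        mounts.remove(MOUNTPOINT)   # umount succeeds
        return []
    # umount failed ("not mounted"): the mountpoint is now just a
    # directory on the parent filesystem, so cleanup targets that one.
    target = filesystem_of(MOUNTPOINT, mounts)
    return sorted(name for name, cwd in procs.items()
                  if filesystem_of(cwd, mounts) == target)


def guarded_stop(mounts, procs):
    """Same as naive_stop, but treats 'already unmounted' as success."""
    if MOUNTPOINT not in mounts:
        return []                   # nothing to do, signal nobody
    return naive_stop(mounts, procs)


if __name__ == "__main__":
    mounts = {"/", "/data/www"}
    procs = {"sshd": "/", "bash": "/root"}   # hypothetical processes
    print("first stop signals: ", naive_stop(mounts, procs))
    print("second stop signals:", naive_stop(mounts, procs))
```

In this model the first stop unmounts cleanly, while the second one resolves /data/www to the root filesystem and signals everything running there. `guarded_stop` shows one possible defence, checking the mount state before any cleanup, offered as an assumption about what a safer stop might look like rather than as the agent's behaviour.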