Looks like the VirtualDomain RA isn't correctly implementing stop. Stop of an undefined domain shouldn't produce an error.
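To illustrate what I mean by a well-behaved stop, here is a rough sketch in Python rather than the agent's own shell. The function and its two callables (`probe_status`, `shutdown`) are hypothetical names for illustration, not code from the VirtualDomain RA. Only the numeric return codes come from the OCF resource agent API. The idea is that a domain which is already gone (stopped or undefined, e.g. after a successful migrate_to) counts as successfully stopped.

```python
# OCF return codes as defined by the OCF resource agent API.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def stop(probe_status, shutdown):
    """Idempotent stop action (hypothetical sketch, not the real RA).

    probe_status: callable returning OCF_SUCCESS if the domain is running,
                  OCF_NOT_RUNNING if it is stopped or undefined, or another
                  code if its state cannot be determined.
    shutdown:     callable that shuts the domain down and returns True
                  on success.
    """
    status = probe_status()

    # Nothing to stop: the domain is already down or no longer defined.
    # This is not an error for stop, so report success quietly.
    if status == OCF_NOT_RUNNING:
        return OCF_SUCCESS

    # Domain is running: stop succeeds only if the shutdown succeeds.
    if status == OCF_SUCCESS:
        return OCF_SUCCESS if shutdown() else OCF_ERR_GENERIC

    # State could not be determined: in this sketch, report a failure.
    return OCF_ERR_GENERIC
```

With something like this, a stop issued on the source node right after migrate_to has undefined the domain would just return success without logging an ERROR.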

On Mon, Jul 4, 2011 at 9:51 PM, Vladislav Bogdanov <bub...@hoster-ok.com> wrote:
> Hi all,
>
> There is feeling that race condition is possible during live migration
> of resources.
>
> I put one node to standby mode, that made all resources migrate to
> another one.
> Virtual machines were successfully live-migrated, but then marked as
> FAILED almost immediately.
> Logs show some interesting details:
> =========
> Jul 4 10:21:48 s01-1 VirtualDomain[22988]: INFO:
> mgmt01.c01.ttc.prague.cz.vds-ok.com: live migration to s01-0 succeeded.
> Jul 4 10:21:48 s01-1 lrmd: [7741]: info: RA output:
> (mgmt01.c01.ttc.prague.cz.vds-ok.com-vm:migrate_to:stdout) Domain
> mgmt01.c01.ttc.prague.cz.vds-ok.com has been undefined
> Jul 4 10:21:48 s01-0 VirtualDomain[4641]: INFO:
> mgmt01.c01.ttc.prague.cz.vds-ok.com: live migration from s01-1 succeeded.
> Jul 4 10:21:49 s01-0 lrmd: [1927]: info: RA output:
> (mgmt01.c01.ttc.prague.cz.vds-ok.com-vm:migrate_from:stderr)
> mgmt01.c01.ttc.prague.cz.vds-ok.com-vm is active on more than one node,
> returning the default value for <null>
> Jul 4 10:21:49 s01-1 crmd: [7744]: info: do_lrm_rsc_op: Performing
> key=110:695:0:7ae65826-5d35-41c0-945a-8336ecb0bc3c
> op=mgmt01.c01.ttc.prague.cz.vds-ok.com-vm_stop_0 )
> Jul 4 10:21:49 s01-1 lrmd: [7741]: info:
> rsc:mgmt01.c01.ttc.prague.cz.vds-ok.com-vm:1006: stop
> Jul 4 10:21:49 s01-1 VirtualDomain[24062]: ERROR: Virtual domain
> mgmt01.c01.ttc.prague.cz.vds-ok.com has no state during stop operation,
> bailing out.
> Jul 4 10:21:49 s01-1 crmd: [7744]: info: process_lrm_event: LRM
> operation mgmt01.c01.ttc.prague.cz.vds-ok.com-vm_stop_0 (call=1006,
> rc=0, cib-update=1031, confirmed=true) ok
> =========
> Note that line with "is active on more than one node" follows "migration
> from s01-1 succeeded" immediately in syslog (in both local and remote
> files), so it was put into syslog queue immediately after former one.
>
> From what I understand, lrmd made decision to fail resource just because
> 'stop' operation was not yet run on another node.
>
> What else can it be if my feeling is wrong?
>
> Version of pacemaker is 'almost' 1.1-devel tip.
> cluster-glue is 1.0.7
> I use own version of VirtualDomain RA, but it has the same migration
> logic as a stock one.
>
> Best,
> Vladislav
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs:
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker