Hi David,

2013/12/19 David Vossel <dvos...@redhat.com>:
>
> ----- Original Message -----
>> From: "Kazunori INOUE" <kazunori.ino...@gmail.com>
>> To: "pm" <pacemaker@oss.clusterlabs.org>
>> Sent: Wednesday, December 18, 2013 4:56:20 AM
>> Subject: [Pacemaker] Question about behavior of the post-failure during the migrate_to
>>
>> Hi,
>>
>> When a node crashed while a VM resource was migrating, the VM ended up
>> running on two nodes. [1]
>> Is this the intended behavior?
>>
>> [1]
>> Stack: corosync
>> Current DC: bl460g1n6 (3232261592) - partition with quorum
>> Version: 1.1.11-0.4.ce5d77c.git.el6-ce5d77c
>> 3 Nodes configured
>> 8 Resources configured
>>
>>
>> Online: [ bl460g1n6 bl460g1n8 ]
>> OFFLINE: [ bl460g1n7 ]
>>
>> Full list of resources:
>>
>>  prmDummy  (ocf::pacemaker:Dummy):          Started bl460g1n6
>>  prmVM2    (ocf::heartbeat:VirtualDomain):  Started bl460g1n8
>>
>>
>> # ssh bl460g1n6 virsh list --all
>>  Id    Name                           State
>> ----------------------------------------------------
>>  113   vm2                            running
>>
>> # ssh bl460g1n8 virsh list --all
>>  Id    Name                           State
>> ----------------------------------------------------
>>  34    vm2                            running
>>
>>
>> [Steps to reproduce]
>> 1) Before migration: vm2 running on bl460g1n7 (DC)
>>
>> Stack: corosync
>> Current DC: bl460g1n7 (3232261593) - partition with quorum
>> Version: 1.1.11-0.4.ce5d77c.git.el6-ce5d77c
>> 3 Nodes configured
>> 8 Resources configured
>>
>>
>> Online: [ bl460g1n6 bl460g1n7 bl460g1n8 ]
>>
>> Full list of resources:
>>
>>  prmDummy  (ocf::pacemaker:Dummy):          Started bl460g1n7
>>  prmVM2    (ocf::heartbeat:VirtualDomain):  Started bl460g1n7
>>
>> ...snip...
>>
>> 2) Migrate the VM resource:
>>
>> # crm resource move prmVM2
>>
>> bl460g1n6 was selected as the migration destination.
>>
>> Dec 18 14:11:36 bl460g1n7 crmd[6928]: notice: te_rsc_command:
>> Initiating action 47: migrate_to prmVM2_migrate_to_0 on bl460g1n7
>> (local)
>> Dec 18 14:11:36 bl460g1n7 lrmd[6925]: info:
>> cancel_recurring_action: Cancelling operation prmVM2_monitor_10000
>> Dec 18 14:11:36 bl460g1n7 crmd[6928]: info: do_lrm_rsc_op:
>> Performing key=47:5:0:ddf348fe-fbad-4abb-9a12-8250f71b075a
>> op=prmVM2_migrate_to_0
>> Dec 18 14:11:36 bl460g1n7 lrmd[6925]: info: log_execute:
>> executing - rsc:prmVM2 action:migrate_to call_id:33
>> Dec 18 14:11:36 bl460g1n7 crmd[6928]: info: process_lrm_event:
>> LRM operation prmVM2_monitor_10000 (call=31, status=1, cib-update=0,
>> confirmed=true) Cancelled
>> Dec 18 14:11:36 bl460g1n7 VirtualDomain(prmVM2)[7387]: INFO: vm2:
>> Starting live migration to bl460g1n6 (using remote hypervisor URI
>> qemu+ssh://bl460g1n6/system ).
>>
>> 3) Then, after the "virsh migrate" call inside VirtualDomain had
>> completed but before the migrate_to action itself completed, I made
>> bl460g1n7 crash.
>>
>> As a result, vm2 was already running on bl460g1n6, but it was also
>> started on bl460g1n8 by Pacemaker. [1]
>
> Oh, wow. I see what is going on. If the migrate_to action fails, we actually
> have to call stop on the target node. I believe we attempt to handle these
> "dangling migrations" already, but something about your situation must be
> different. Can you please create a crm_report so we can have your pengine
> files to test with?
>
> Creating a bug on bugs.clusterlabs.org to track this issue would also be a
> good idea. The holidays are coming up and I could see this getting lost
> otherwise.
>
> Thanks,
> -- Vossel
>
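For reference, a crm_report for an incident like this can be generated with
something along the following lines; the timestamps and archive name below
are only placeholders for the window around the failed migration, not the
exact command used:

# crm_report -f "2013-12-18 14:00:00" -t "2013-12-18 14:30:00" prmVM2-dual-start

crm_report gathers the CIB, the pengine inputs and the logs from the cluster
nodes it can reach into a single archive (prmVM2-dual-start.tar.bz2 in this
example), which should include the pengine files asked for above.
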
I opened a bug in Bugzilla about this:
 * http://bugs.clusterlabs.org/show_bug.cgi?id=5186

I attached a crm_report to the Bugzilla entry; is that enough information?

>
>
>> Dec 18 14:11:49 bl460g1n8 crmd[25981]: notice: process_lrm_event:
>> LRM operation prmVM2_start_0 (call=31, rc=0, cib-update=28,
>> confirmed=true) ok
>>
>>
>> Best Regards,
>> Kazunori INOUE
>>

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org