> Alexandre Derumier via pve-devel <pve-devel@lists.proxmox.com> wrote on
> 24.03.2025 12:15 CET:
> verify that node is dead from corosync && ssh
> and move config file from /etc/pve directly

there are two reasons why this is dangerous and why we haven't exposed anything 
like this in the API or UI.

the first one is the risk of corruption - just because a (supposedly dead) node 
X is not reachable from your local node A doesn't mean it isn't still running. 
if it is still running, any guests that were started before might still be 
running as well, and those guests might still be able to talk to shared 
storage. unless there are other safeguards in place (like MMP, which is not a 
given for all storages), this can easily corrupt guest volumes completely if 
you attempt to recover and start such a guest. HA protects against this - node 
X will fence itself before node A attempts recovery, so there is never a 
situation where both nodes try to write to the same volume. just checking 
whether other cluster nodes can still connect to node X is not enough by any 
stretch to make this safe.

the second one is ownership of a VM/CT - PVE relies on node-local locking of 
guests to avoid contention. this only works because each guest/VMID has a clear 
owner - the node the config currently lives on. if you steal a config by moving 
it, you are violating this assumption. we only change the owner of a VMID in 
two scenarios, each with careful consideration of the implications:
- when doing a migration, which is initiated by the source node that currently 
owns the guest, so it willingly hands over control to the new node - which is 
safe by definition (no stealing involved and proper locking in place)
- when doing a HA recovery, which is protected by the HA locks and the watchdog 
- we know that the original node has been fenced before the recovery happens, 
and we know it cannot do anything with the guest before it has been informed 
about the recovery (this is ensured by the design of the HA locks).
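
to illustrate the first scenario, here is a rough sketch (not the actual 
migration code - the helper name and structure are hypothetical, and the 
helper usage is simplified) of why a regular migration is safe: the owning 
node takes its own node-local config lock and moves the config itself, so 
nothing is ever stolen:

    use PVE::QemuConfig;

    # runs on the source node, i.e. the current owner of $vmid
    sub handover_to_target {
        my ($vmid, $target) = @_;

        # node-local config lock - only meaningful on the node owning the config
        PVE::QemuConfig->lock_config($vmid, sub {
            my $conf = PVE::QemuConfig->load_config($vmid);
            PVE::QemuConfig->check_lock($conf);

            # ... transfer disks/state to $target ...

            # the owner itself moves the config, under its own lock, so
            # ownership changes exactly once
            rename(
                PVE::QemuConfig->config_file($vmid),
                PVE::QemuConfig->config_file($vmid, $target),
            ) or die "failed to hand over config to '$target': $!\n";
        });
    }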

your code below is not protected by the HA stack, so there is a race involved - 
the node where the "deadnode migration" is initiated cannot lock the VMID in a 
way that the supposedly "dead" node knows about (config locking for guests is 
node-local, so it can only happen on the node that "owns" the config - anything 
else doesn't make sense and doesn't protect anything). if the "dead" node 
rejoins the cluster at the right moment, it still owns the VMID/config and can 
start it, while the other node thinks it can still steal it. there's also no 
protection against initiating multiple deadnode migrations in parallel for the 
same VMID, although of course all but one will fail, because pmxcfs ensures the 
VMID.conf only exists under a single node. we'd need to give up node-local 
guest locking to close this gap, which is a no-go for performance reasons.
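
to make the race concrete, here is a simplified sketch (variable names as in 
the patch below, API usage abbreviated - this is an illustration, not the real 
code paths): the two locks involved live on different nodes and can never 
conflict with each other:

    use PVE::QemuConfig;
    use PVE::GuestHelpers;

    # on node A (the initiator), the patch below does roughly this:
    PVE::GuestHelpers::guest_migration_lock($vmid, 10, sub {
        # this lock only exists on node A - node X never sees it
        rename(
            PVE::QemuConfig->config_file($vmid, $deadnode),
            PVE::QemuConfig->config_file($vmid, $target),
        ) or die "failed to steal config: $!\n";
    });

    # meanwhile, if node X rejoins at the wrong moment:
    PVE::QemuConfig->lock_config($vmid, sub {
        # this lock only exists on node X - node A never sees it.
        # node X still considers itself the owner and can happily start the
        # guest (e.g. via 'qm start'), racing the rename above
    });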

I understand that this would be convenient to expose, but it is also really 
dangerous without understanding the implications - and once there is an option 
to trigger it via the UI, no matter how many disclaimers you put on it, people 
will press that button, mess up and blame PVE. at the same time, there already 
is an implementation that handles this safely - it's called HA ;) so I'd rather 
spend the time on improving the robustness of our HA stack than on adding such 
a footgun.

> 
> Signed-off-by: Alexandre Derumier <alexandre.derum...@groupe-cyllene.com>
> ---
>  PVE/API2/Qemu.pm | 56 ++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 52 insertions(+), 4 deletions(-)
> 
> diff --git a/PVE/API2/Qemu.pm b/PVE/API2/Qemu.pm
> index 156b1c7b..58c454a6 100644
> --- a/PVE/API2/Qemu.pm
> +++ b/PVE/API2/Qemu.pm
> @@ -4764,6 +4764,9 @@ __PACKAGE__->register_method({
>               description => "Target node.",
>               completion =>  \&PVE::Cluster::complete_migration_target,
>              }),
> +            deadnode => get_standard_option('pve-node', {
> +                description => "Dead source node.",
> +            }),
>           online => {
>               type => 'boolean',
>               description => "Use online/live migration if VM is running. Ignored if VM is stopped.",
> @@ -4813,8 +4816,9 @@ __PACKAGE__->register_method({
>       my $authuser = $rpcenv->get_user();
>  
>       my $target = extract_param($param, 'target');
> +     my $deadnode = extract_param($param, 'deadnode');
>  
> -     my $localnode = PVE::INotify::nodename();
> +     my $localnode = $deadnode ? $deadnode : PVE::INotify::nodename();
>       raise_param_exc({ target => "target is local node."}) if $target eq $localnode;
>  
>       PVE::Cluster::check_cfs_quorum();
> @@ -4835,14 +4839,43 @@ __PACKAGE__->register_method({
>       raise_param_exc({ migration_network => "Only root may use this option." })
>           if $param->{migration_network} && $authuser ne 'root@pam';
>  
> +     raise_param_exc({ deadnode => "Only root may use this option." })
> +         if $param->{deadnode} && $authuser ne 'root@pam';
> +
>       # test if VM exists
> -     my $conf = PVE::QemuConfig->load_config($vmid);
> +     my $conf = $deadnode ? PVE::QemuConfig->load_config($vmid, $deadnode) : PVE::QemuConfig->load_config($vmid);
>  
>       # try to detect errors early
>  
>       PVE::QemuConfig->check_lock($conf);
>  
> -     if (PVE::QemuServer::check_running($vmid)) {
> +        if ($deadnode) {
> +         die "Can't do online migration of a dead node.\n" if $param->{online};
> +         my $members = PVE::Cluster::get_members();
> +         die "The deadnode $deadnode seem to be alive" if $members->{$deadnode} && $members->{$deadnode}->{online};
> +
> +         print "test if deadnode $deadnode respond to ping\n";
> +         eval {
> +             PVE::Tools::run_command("/usr/bin/ping -c 1 $members->{$deadnode}->{ip}");
> +         };
> +         if(!$@){
> +             die "error: ping to target $deadnode is still working. Node seem to be alive.";
> +         }
> +
> +         #make an extra ssh connection to double check that it's not just a corosync crash
> +         my $sshinfo = PVE::SSHInfo::get_ssh_info($deadnode);
> +         my $sshcmd = PVE::SSHInfo::ssh_info_to_command($sshinfo);
> +         push @$sshcmd, 'hostname';
> +         print "test if deadnode $deadnode respond to ssh\n";
> +         eval {
> +             PVE::Tools::run_command($sshcmd, timeout => 1);
> +         };
> +         if(!$@){
> +             die "error: ssh connection to target $deadnode is still working. Node seem to be alive.";
> +         }
> +
> +
> +     } elsif (PVE::QemuServer::check_running($vmid)) {
>           die "can't migrate running VM without --online\n" if !$param->{online};
>  
>           my $repl_conf = PVE::ReplicationConfig->new();
> @@ -4881,7 +4914,22 @@ __PACKAGE__->register_method({
>           PVE::QemuServer::check_storage_availability($storecfg, $conf, $target);
>       }
>  
> -     if (PVE::HA::Config::vm_is_ha_managed($vmid) && $rpcenv->{type} ne 'ha') {
> +     if ($deadnode) {
> +         my $realcmd = sub {
> +             my $config_fn = PVE::QemuConfig->config_file($vmid, $deadnode);
> +             my $new_config_fn = PVE::QemuConfig->config_file($vmid, $target);
> +
> +             rename($config_fn, $new_config_fn)
> +                 or die "failed to move config file to node '$target': $!\n";
> +         };
> +
> +         my $worker = sub {
> +             return PVE::GuestHelpers::guest_migration_lock($vmid, 10, $realcmd);
> +         };
> +
> +         return $rpcenv->fork_worker('qmigrate', $vmid, $authuser, $worker);
> +
> +        } elsif (PVE::HA::Config::vm_is_ha_managed($vmid) && $rpcenv->{type} ne 'ha') {
>  
>           my $hacmd = sub {
>               my $upid = shift;
> -- 
> 2.39.5

