On 01.04.25 at 11:52, Fabian Grünbichler wrote:
>> Alexandre Derumier via pve-devel <pve-devel@lists.proxmox.com> wrote on 
>> 24.03.2025 12:15 CET:
>> verify that the node is dead via corosync && ssh
>> and move the config file from /etc/pve directly
> there are two reasons why this is dangerous and why we haven't exposed 
> anything like this in the API or UI:
> 
> the first one is the risk of corruption: just because a (supposedly dead) 
> node X is not reachable from your local node A doesn't mean it isn't still 
> running. if it is still running, any guests that were started before might 
> still be running as well, and any such guests might still be able to talk 
> to shared storage. unless there are other safeguards in place (like 
> multiple-mount protection (MMP), which is not a given for all storages), 
> this can easily corrupt guest volumes if you attempt to recover and start 
> such a guest. HA protects against this: node X will fence itself before 
> node A attempts recovery, so there is never a situation where both nodes 
> try to write to the same volume. just checking whether other cluster nodes 
> can still connect to node X is not enough by any stretch to make this safe.
> 
> the second one is ownership of a VM/CT: PVE relies on node-local locking 
> of guests to avoid contention. this only works because each guest/VMID has 
> a clear owner - the node where the config currently resides. if you steal 
> a config by moving it, you violate this assumption. we only change the 
> owner of a VMID in two scenarios, each with careful consideration of the 
> implications:
> - when doing a migration, which is initiated by the source node that 
> currently owns the guest, so it willingly hands over control to the new 
> node - safe by definition (no stealing involved, and proper locking is in 
> place)
> - when doing an HA recovery, which is protected by the HA locks and the 
> watchdog - we know the original node has been fenced before the recovery 
> happens, and we know it cannot do anything with the guest before it has 
> been informed about the recovery (this is ensured by the design of the HA 
> locks).
> your code below is not protected by the HA stack, so there is a race: the 
> node where the "deadnode migration" is initiated cannot lock the VMID in a 
> way that the supposedly "dead" node knows about (config locking for guests 
> is node-local, so it can only happen on the node that "owns" the config - 
> anything else doesn't protect anything). if the "dead" node rejoins the 
> cluster at the right moment, it still owns the VMID/config and can start 
> the guest, while the other node thinks it can still steal it. there's also 
> no protection against initiating multiple deadnode migrations in parallel 
> for the same VMID, although all but one will fail, since pmxcfs ensures 
> that VMID.conf only exists under a single node. closing this gap would 
> mean giving up node-local guest locking, which is a no-go for performance 
> reasons.
> 
> I understand that this would be convenient to expose, but it is also 
> really dangerous without understanding the implications - and once there 
> is an option to trigger it via the UI, no matter how many disclaimers you 
> put on it, people will press that button, mess up, and blame PVE. at the 
> same time, there already is an implementation that does this safely - it's 
> called HA 😉 so I'd rather spend some time improving the robustness of our 
> HA stack than add such a footgun.
> 

+1 to all of the above.
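
To make it concrete: stripped of details, the proposed "deadnode migration"
boils down to something like the Python sketch below. This is a hypothetical
illustration, not the actual patch - the /etc/pve/nodes/<node>/... path is
the usual pmxcfs layout, but node_seems_dead(), steal_config(), and the ssh
probe are made up for illustration:

import shutil
import subprocess

def node_seems_dead(node: str) -> bool:
    # "unreachable via ssh" is the proposed liveness check -- the whole
    # point of the mail above is that this is NOT a safe fencing check
    try:
        r = subprocess.run(["ssh", node, "true"], timeout=5)
        return r.returncode != 0
    except subprocess.TimeoutExpired:
        return True

def steal_config(node: str, target: str, vmid: int) -> None:
    # /etc/pve/nodes/<node>/qemu-server/<vmid>.conf is where pmxcfs keeps
    # a VM config; moving it changes ownership with no cluster-wide lock
    src = f"/etc/pve/nodes/{node}/qemu-server/{vmid}.conf"
    dst = f"/etc/pve/nodes/{target}/qemu-server/{vmid}.conf"
    if node_seems_dead(node):
        shutil.move(src, dst)

node_seems_dead() is exactly the kind of check the first point above is
about: "unreachable from here" is not the same as "dead".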

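The corruption risk is easy to demonstrate with a toy model - pure Python,
no PVE involved; "volume" stands in for a shared disk image and the two
threads play the roles of the same guest running on nodes X and A:

import threading
import time

volume = []  # stands in for one shared disk image

def run_guest(node: str, blocks: int) -> None:
    # each "guest" writes its own view of the filesystem to the shared disk
    for i in range(blocks):
        volume.append(f"{node}:block{i}")
        time.sleep(0)  # yield, to encourage interleaving

# node X is merely *unreachable*, not dead: its guest keeps writing...
x = threading.Thread(target=run_guest, args=("X", 1000))
# ...while node A, having wrongly concluded X is dead, starts the same guest
a = threading.Thread(target=run_guest, args=("A", 1000))
x.start(); a.start()
x.join(); a.join()

# two uncoordinated writers on one disk mean corruption; here it shows up
# as interleaved blocks from both nodes
print("writers seen on the volume:", {b.split(":")[0] for b in volume})

With fencing, node X is guaranteed to be stopped before node A starts the
recovered guest, so only one writer ever touches the volume.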

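Similarly for the ownership race: in the toy model below, each node keeps
its own lock table, so node A taking "its" lock on a VMID is invisible to a
rejoining node X (again purely illustrative - the Node class and its names
are made up, not PVE internals):

import threading

class Node:
    # toy cluster node: the configs it owns plus a *node-local* lock table
    def __init__(self, name: str) -> None:
        self.name = name
        self.local_locks = {}   # vmid -> threading.Lock, local to this node
        self.configs = set()    # VMIDs whose config lives on this node

    def lock(self, vmid: int) -> threading.Lock:
        return self.local_locks.setdefault(vmid, threading.Lock())

    def start_guest(self, vmid: int) -> None:
        with self.lock(vmid):   # node-local, invisible to every other node
            if vmid in self.configs:
                print(f"node {self.name} starts guest {vmid}")

x, a = Node("X"), Node("A")
x.configs.add(100)              # X owns VMID 100

# X rejoins the cluster at just the wrong moment and starts its guest...
x.start_guest(100)

# ...while A, holding only *its own* lock on 100, steals the config
with a.lock(100):
    x.configs.discard(100)
    a.configs.add(100)          # the "move" of 100.conf
a.start_guest(100)              # both nodes have now run guest 100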