On 4/1/25 11:57, Thomas Lamprecht wrote:
On 01.04.25 at 11:52, Fabian Grünbichler wrote:
Alexandre Derumier via pve-devel <pve-devel@lists.proxmox.com> wrote on
24.03.2025 12:15 CET:
verify that the node is dead via corosync && ssh,
and move the config file from /etc/pve directly
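(for context, a rough sketch of what the described approach boils down to,
assuming PVE::Cluster::get_members() for the local corosync membership view
and a plain rename() of the config inside pmxcfs - this is an illustration of
the idea, not the actual patch code:)

use strict;
use warnings;

use PVE::Cluster;
use PVE::Tools;

# illustration only, not the actual patch: check the local corosync
# membership view, try ssh out-of-band, then "steal" the config by moving it
# inside pmxcfs
sub deadnode_migrate_sketch {
    my ($vmid, $deadnode, $targetnode) = @_;

    # 1) corosync view: is the node offline as seen from *this* node?
    my $members = PVE::Cluster::get_members();
    die "node '$deadnode' still online according to corosync\n"
        if $members->{$deadnode} && $members->{$deadnode}->{online};

    # 2) ssh check: does the node still answer out-of-band?
    my $ssh_ok = eval {
        PVE::Tools::run_command(
            ['ssh', '-o', 'BatchMode=yes', '-o', 'ConnectTimeout=5',
             "root\@$deadnode", '/bin/true']);
        1;
    };
    die "node '$deadnode' still answers via ssh\n" if $ssh_ok;

    # 3) move the config file between node directories in /etc/pve
    my $src = "/etc/pve/nodes/$deadnode/qemu-server/$vmid.conf";
    my $dst = "/etc/pve/nodes/$targetnode/qemu-server/$vmid.conf";
    rename($src, $dst) or die "moving config for VMID $vmid failed: $!\n";
}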
there are two reasons why this is dangerous and why we haven't exposed anything
like this in the API or UI.
the first one is the risk of corruption - just because a (supposedly dead) node
X is not reachable from your local node A doesn't mean it isn't still running.
if it is still running, any guests that were already started before might still
be running as well. any guests still running might still be able to talk to
shared storage. unless there are other safeguards in place (like MMP, i.e.
multi-mount protection, which is not a given for all storages), attempting to
recover and start such a guest can completely corrupt its volumes. HA
protects against this -
node X will fence itself before node A will attempt recovery, so there is never
a situation where both nodes will try to write to the same volume. just
checking whether other cluster nodes can still connect to node X is not enough
by any stretch to make this safe.
the second one is ownership of a VM/CT - PVE relies on node-local locking of
guests to avoid contention. this only works because each guest/VMID has a clear
owner - the node the config currently resides on. if you steal a config by
moving it, you are violating this assumption. we only change the owner of a
VMID in two scenarios with careful consideration of the implications:
- when doing a migration, which is initiated by the source node that
currently owns the guest, so it willingly hands over control to the new node
which is safe by definition (no stealing involved and proper locking in place)
- when doing a HA recovery, which is protected by the HA locks and the watchdog
- we know that the original node has been fenced before the recovery happens
and we know it cannot do anything with the guest before it has been informed
about the recovery (this is ensured by the design of the HA locks).
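(to make the "node-local" part concrete, here is a minimal sketch of what
such per-guest locking amounts to - an advisory flock on a file that only
exists on the owning node; the path and helper name are illustrative, the
real helpers live in PVE::Tools and PVE::AbstractConfig:)

use strict;
use warnings;

use Fcntl qw(:flock);

# minimal illustration of node-local guest locking: a second process on the
# *same* node will block here, but a process on another cluster node never
# sees this lock at all - which is why it cannot protect against a "dead"
# node that is in fact still alive and holding its own local lock
sub lock_guest_config {
    my ($vmid, $code) = @_;

    my $lockfile = "/var/lock/qemu-server/lock-$vmid.conf"; # node-local path

    open(my $fh, '>>', $lockfile) or die "cannot open $lockfile: $!\n";
    flock($fh, LOCK_EX) or die "cannot lock $lockfile: $!\n";

    my $res = eval { $code->() };
    my $err = $@;

    flock($fh, LOCK_UN);
    close($fh);

    die $err if $err;
    return $res;
}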
the code in your patch is not protected by the HA stack, so there is a race
involved: the node where the "deadnode migration" is initiated cannot lock
the VMID in a way that the supposedly "dead" node knows about (config locking
for guests is node-local, so it can only happen on the node that "owns" the
config; anything else doesn't make sense/doesn't protect anything). if the
"dead" node rejoins the cluster at the right moment, it still owns the
VMID/config and can start the guest, while the other node thinks it can still
steal it. there's also no protection against initiating multiple deadnode
migrations in parallel for the same VMID, although of course all but one will
fail, because pmxcfs ensures the VMID.conf only exists under a single node.
we'd need to give up node-local guest locking to close this gap, which is a
no-go for performance reasons.
I understand that this would be convenient to expose, but it is also really
dangerous without understanding the implications - and once there is an option
to trigger it via the UI, no matter how many disclaimers you put on it, people
will press that button, mess up, and blame PVE. at the same time, there
already is an implementation that does this safely - it's called HA 😉 so I'd
rather spend some time improving the robustness of our HA stack than add such
a footgun.
+1 to all of the above.
while I also agree with everything said here, I have one counterpoint to offer:
In the case that such an operation is necessary (e.g. HA is not
wanted/needed/possible for whatever reason), the user will fall back to doing
it manually (iow. 'mv source target'), which is at least as dangerous as
exposing it over the API, since
* the admins sharing the system now must share root@pam credentials
  (ssh/console access) (alternatively set up sudo, which has its own problems)
* it promotes manually modifying /etc/pve/ content
* any error could be even more fatal than if done via the API
  (e.g. mv of the wrong file, from the wrong node, etc.)
IMHO ways forward for this scenario could be:
* use cluster-level locking only for the config move? (not sure if performance
  is still a concern for this action, since parallel moves shouldn't happen
  that often?) - see the sketch after this list
* provide a special CLI tool/cmd to deal with that -> would minimize potential
  errors, but is still contained to root-equivalent users
* link to the docs section for it from the UI with a big caveat:
  https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_recovery
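regarding the first point, a rough sketch of what a cluster-wide lock around
the config move could look like, using PVE::Cluster::cfs_lock_domain() with a
made-up lock domain name ('guest-recovery') - just an assumption about how it
could be wired up, not existing code:

use strict;
use warnings;

use PVE::Cluster;

# sketch only: serialize the config move cluster-wide via a pmxcfs lock, so
# two nodes cannot try to "recover" the same VMID at the same time; the lock
# domain name 'guest-recovery' is made up for this example
sub move_config_with_cluster_lock {
    my ($vmid, $deadnode, $targetnode) = @_;

    my $code = sub {
        my $src = "/etc/pve/nodes/$deadnode/qemu-server/$vmid.conf";
        my $dst = "/etc/pve/nodes/$targetnode/qemu-server/$vmid.conf";

        die "VMID $vmid is not owned by node '$deadnode'\n" if !-f $src;

        rename($src, $dst) or die "moving config failed: $!\n";
    };

    # cluster-wide lock with a 10 second acquire timeout
    my $res = PVE::Cluster::cfs_lock_domain('guest-recovery', 10, $code);
    die $@ if $@;
    return $res;
}

note that this would only serialize the move between nodes that are still
part of the quorate cluster - it does nothing against the fundamental problem
Fabian describes, where the supposedly dead node is in fact still running the
guest.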
just my 2c
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel