On 4/1/25 14:54, Thomas Lamprecht wrote:
On 01.04.25 at 13:37, Dominik Csapak wrote:
Mhmm, what I meant here is that instructing the user to manually
do 'mv some-path some-other-path' has more error potential (e.g.
typos, misremembering nodenames/vmids/etc.) than e.g. clicking
the vm on the offline node and pressing a button (or
following a CLI tool output/options)

Which all have their error potential too, especially with hostnames
being free-form and not exclusive.

I mentioned it because Fabian wrote we could maybe solve it with a
cluster-wide VM lock. I think restricting the move to such a lock
could work, under the assumption that the admin makes sure the offline
node is and stays offline. (Which they have to do anyway.)

Still not sure what this would provide; pmxcfs already guarantees that the
VMID config can exist only once anyway, so only one node can do a move,
and such a move can only happen if it would be equal to a file rename,
as any resource must already be shared for this to work.
Well, replication could be fixed up I guess, but that can be handled on
VM start too. I cannot think of anything else (without an in-depth
evaluation though) that an API can/should do differently for the actual
move itself. Doing some up-front checks is a different story, but that
could also result in a false sense of safety.
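
For illustration only (a sketch, not an official procedure): assuming a QEMU
guest and the default pmxcfs mount at /etc/pve, with "deadnode", "newnode"
and VMID 100 as placeholders, such a move boils down to a single rename of
the config file, e.g.:

    mv /etc/pve/nodes/deadnode/qemu-server/100.conf /etc/pve/nodes/newnode/qemu-server/100.conf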

It still improves the UX for that situation since it's then a
provided/guided way vs. mv'ing files on the filesystem.

I'd not touch the move part though, at least for starters; just like the
upgrade checker scripts, it should only assist.

Just to clarify, I'm not for blindly implementing such an API call/CLI tool/etc.,
but wanted to argue that we probably want to improve the UX of that situation
as well as we can, and offered my thoughts on how we could do it.
That's certainly fine; having it improved would be good, but I'm very wary
of hot takes and hand waving (not meaning you here, just in general). This
isn't a purge/remove/wipe of some resource on a working system, like wiping
disks or removing guests, as those can present the information to the admin
from a known good node that manages its state itself.
An unknown/dead node is literally breaking a core clustering assumption that
we build upon in a lot of places, which IMO is a very different thing.
Mentioning this as it might be easy to question why other destructive actions
are exposed in the UI.

And FWIW, if I were to reconsider this, it would be much easier to argue for
further integration if the basic assistant/checker guide/tool had already
existed for some time and was somewhat battle-tested, as that would allow a
much more confident evaluation of options, whatever those then look like;
some "scary" hint in the UI with lots of exclamation marks does not cut it
for me though, no offense to anybody.


I agree with all of your points, so I think the best and easiest way to
improve the current situation would be to:

* Improve the docs to emphasize more that this situation should be an exception
  and that working around cluster assumptions can have severe consequences.
  (Maybe nudge users towards HA if this is a common situation for them.)
  It would also be good for the docs to be written in a check-list style
  (like you suggested), so that admins have a guided way to check for things
  like storage, running nodes, etc.

* Change the migration UI to show a warning that the node is offline
  and provide a direct link to the above-mentioned improved docs

What do you think?

