On 2/16/26 10:15 AM, Fiona Ebner wrote:
Am 16.02.26 um 9:42 AM schrieb Fabian Grünbichler:
On February 13, 2026 2:16 pm, Fiona Ebner wrote:
Am 13.02.26 um 1:20 PM schrieb Fabian Grünbichler:
On February 13, 2026 1:14 pm, Fiona Ebner wrote:
Am 10.02.26 um 12:14 PM schrieb Dominik Csapak:
+                my $timeout = 30;
+                my $starttime = time();
                  my $pid = PVE::QemuServer::check_running($vmid);
-                die "vm still running\n" if $pid;
+                warn "vm still running - waiting up to $timeout seconds\n" if 
$pid;

While we're at it, we could improve the message here. Something like
'QEMU process $pid for VM $vmid still running (or newly started)'
Having the PID is nice info for developers/support engineers and the
case where a new instance is started before the cleanup was done is also
possible.

In fact, the case with the new instance is easily triggered by 'stop'
mode backups. Maybe we should fix that up first before adding a timeout
here?

Feb 13 13:09:48 pve9a1 qm[92975]: <root@pam> end task
UPID:pve9a1:00016B30:000CDF80:698F1485:qmshutdown:102:root@pam: OK
Feb 13 13:09:48 pve9a1 systemd[1]: Started 102.scope.
Feb 13 13:09:48 pve9a1 qmeventd[93079]: Starting cleanup for 102
Feb 13 13:09:48 pve9a1 qmeventd[93079]: trying to acquire lock...
Feb 13 13:09:48 pve9a1 vzdump[92895]: VM 102 started with PID 93116.
Feb 13 13:09:48 pve9a1 qmeventd[93079]:  OK
Feb 13 13:09:48 pve9a1 qmeventd[93079]: vm still running

does this mean we should actually have some sort of mechanism similar to
the reboot flag to indicate a pending cleanup, and block/delay starts if
it is still set?

Blocking/delaying starts is not what happens for the reboot flag/file:

that's not what I meant, the similarity was just "have a flag", not
"have a flag that behaves identical" ;)

my proposal was:
- add a flag that indicates cleanup is pending (similar to reboot is
   pending)
- *handle that flag* in the start flow to wait for the cleanup to be
   done before starting

Shouldn't we change the reboot flag to also do this?

Feb 13 14:00:16 pve9a1 qm[124470]: <root@pam> starting task 
UPID:pve9a1:0001E639:001180FE:698F2060:qmreboot:102:root@pam:
Feb 13 14:00:16 pve9a1 qm[124472]: <root@pam> starting task 
UPID:pve9a1:0001E63A:0011811E:698F2060:qmstart:102:root@pam:
Feb 13 14:00:16 pve9a1 qm[124474]: start VM 102: 
UPID:pve9a1:0001E63A:0011811E:698F2060:qmstart:102:root@pam:
[...]
Feb 13 14:00:22 pve9a1 systemd[1]: 102.scope: Deactivated successfully.
Feb 13 14:00:22 pve9a1 systemd[1]: 102.scope: Consumed 2min 3.333s CPU time, 2G 
memory peak.
Feb 13 14:00:23 pve9a1 qmeventd[124565]: Starting cleanup for 102
Feb 13 14:00:23 pve9a1 qmeventd[124565]: trying to acquire lock...
Feb 13 14:00:23 pve9a1 qm[124470]: <root@pam> end task 
UPID:pve9a1:0001E639:001180FE:698F2060:qmreboot:102:root@pam: OK
Feb 13 14:00:23 pve9a1 systemd[1]: Started 102.scope.
Feb 13 14:00:23 pve9a1 qm[124474]: VM 102 started with PID 124620.
Feb 13 14:00:23 pve9a1 qmeventd[124565]:  OK
Feb 13 14:00:23 pve9a1 qmeventd[124565]: vm still running

Currently, it's just indicating whether the cleanup handler should start
the VM again afterwards.

Am 13.02.26 um 1:22 PM schrieb Dominik Csapak:
Sounds good, one possibility would be to do no cleanup at all when doing
a stop mode backup?
We already know we'll need the resources (pid/socket/etc. files, vgpus,...) 
again?

Or is there some situation where that might not be the case?

We do it for reboot (if not another start task sneaks in like in my
example above), and I don't see a good reason from the top of my head
why 'stop' mode backup should behave differently from a reboot (for
running VMs). It even applies pending changes just like a reboot right now.

but what about external callers doing something like:

- stop
- do whatever
- start

in rapid (automated) succession? those would still (possibly) trigger
cleanup after "doing whatever" and starting the VM again already? and in
particular if we skip cleanup for "our" cases of stop;start it will be
easy to introduce sideeffects in cleanup that break such usage?

I did not argue for skipping cleanup. I argued for being consistent with
reboot where we (try to) do cleanup. I just wasn't sure it's really needed.

I'm not sure if there is an actual need to do cleanup or if we could

I guess the actual need is to have more consistent behavior.


ok so i think we'd need to
* create a cleanup flag for each vm when qmevent detects a vm shutting down (in /var/run/qemu-server/VMID.cleanup, possibly with timestamp)
* removing that cleanup flag after cleanup (obviously)
* on start, check for that flag and block for some timeout before starting (e.g. check the timestamp in the flag if it's longer than some
time, start it regardless?)

?

also skip it when we are planning to spin up another instance right
away. But we do it for reboot, so the "safe" variant is also doing it
for 'stop' mode backup. History tells me it's been there since the
reboot functionality was added:
https://lists.proxmox.com/pipermail/pve-devel/2019-September/038988.html






Reply via email to