Am 19.02.26 um 11:15 AM schrieb Dominik Csapak: > On 2/16/26 10:15 AM, Fiona Ebner wrote: >> Am 16.02.26 um 9:42 AM schrieb Fabian Grünbichler: >>> On February 13, 2026 2:16 pm, Fiona Ebner wrote: >>>> Am 13.02.26 um 1:20 PM schrieb Fabian Grünbichler: >>>>> On February 13, 2026 1:14 pm, Fiona Ebner wrote: >>>>>> Am 10.02.26 um 12:14 PM schrieb Dominik Csapak: >>>>>>> + my $timeout = 30; >>>>>>> + my $starttime = time(); >>>>>>> my $pid = PVE::QemuServer::check_running($vmid); >>>>>>> - die "vm still running\n" if $pid; >>>>>>> + warn "vm still running - waiting up to $timeout >>>>>>> seconds\n" if $pid; >>>>>> >>>>>> While we're at it, we could improve the message here. Something like >>>>>> 'QEMU process $pid for VM $vmid still running (or newly started)' >>>>>> Having the PID is nice info for developers/support engineers and the >>>>>> case where a new instance is started before the cleanup was done >>>>>> is also >>>>>> possible. >>>>>> >>>>>> In fact, the case with the new instance is easily triggered by 'stop' >>>>>> mode backups. Maybe we should fix that up first before adding a >>>>>> timeout >>>>>> here? >>>>>> >>>>>> Feb 13 13:09:48 pve9a1 qm[92975]: <root@pam> end task >>>>>> UPID:pve9a1:00016B30:000CDF80:698F1485:qmshutdown:102:root@pam: OK >>>>>> Feb 13 13:09:48 pve9a1 systemd[1]: Started 102.scope. >>>>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: Starting cleanup for 102 >>>>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: trying to acquire lock... >>>>>> Feb 13 13:09:48 pve9a1 vzdump[92895]: VM 102 started with PID 93116. >>>>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: OK >>>>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: vm still running >>>>> >>>>> does this mean we should actually have some sort of mechanism >>>>> similar to >>>>> the reboot flag to indicate a pending cleanup, and block/delay >>>>> starts if >>>>> it is still set? >>>> >>>> Blocking/delaying starts is not what happens for the reboot flag/file: >>> >>> that's not what I meant, the similarity was just "have a flag", not >>> "have a flag that behaves identical" ;) >>> >>> my proposal was: >>> - add a flag that indicates cleanup is pending (similar to reboot is >>> pending) >>> - *handle that flag* in the start flow to wait for the cleanup to be >>> done before starting >> >> Shouldn't we change the reboot flag to also do this? >> >>>>> Feb 13 14:00:16 pve9a1 qm[124470]: <root@pam> starting task >>>>> UPID:pve9a1:0001E639:001180FE:698F2060:qmreboot:102:root@pam: >>>>> Feb 13 14:00:16 pve9a1 qm[124472]: <root@pam> starting task >>>>> UPID:pve9a1:0001E63A:0011811E:698F2060:qmstart:102:root@pam: >>>>> Feb 13 14:00:16 pve9a1 qm[124474]: start VM 102: >>>>> UPID:pve9a1:0001E63A:0011811E:698F2060:qmstart:102:root@pam: >>>>> [...] >>>>> Feb 13 14:00:22 pve9a1 systemd[1]: 102.scope: Deactivated >>>>> successfully. >>>>> Feb 13 14:00:22 pve9a1 systemd[1]: 102.scope: Consumed 2min 3.333s >>>>> CPU time, 2G memory peak. >>>>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: Starting cleanup for 102 >>>>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: trying to acquire lock... >>>>> Feb 13 14:00:23 pve9a1 qm[124470]: <root@pam> end task >>>>> UPID:pve9a1:0001E639:001180FE:698F2060:qmreboot:102:root@pam: OK >>>>> Feb 13 14:00:23 pve9a1 systemd[1]: Started 102.scope. >>>>> Feb 13 14:00:23 pve9a1 qm[124474]: VM 102 started with PID 124620. >>>>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: OK >>>>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: vm still running >>>> >>>> Currently, it's just indicating whether the cleanup handler should >>>> start >>>> the VM again afterwards. >>>> >>>> Am 13.02.26 um 1:22 PM schrieb Dominik Csapak: >>>>> Sounds good, one possibility would be to do no cleanup at all when >>>>> doing >>>>> a stop mode backup? >>>>> We already know we'll need the resources (pid/socket/etc. files, >>>>> vgpus,...) again? >>>>> >>>>> Or is there some situation where that might not be the case? >>>> >>>> We do it for reboot (if not another start task sneaks in like in my >>>> example above), and I don't see a good reason from the top of my head >>>> why 'stop' mode backup should behave differently from a reboot (for >>>> running VMs). It even applies pending changes just like a reboot >>>> right now. >>> >>> but what about external callers doing something like: >>> >>> - stop >>> - do whatever >>> - start >>> >>> in rapid (automated) succession? those would still (possibly) trigger >>> cleanup after "doing whatever" and starting the VM again already? and in >>> particular if we skip cleanup for "our" cases of stop;start it will be >>> easy to introduce sideeffects in cleanup that break such usage? >> >> I did not argue for skipping cleanup. I argued for being consistent with >> reboot where we (try to) do cleanup. I just wasn't sure it's really >> needed. >> >>>> I'm not sure if there is an actual need to do cleanup or if we could >> >> I guess the actual need is to have more consistent behavior. >> > > ok so i think we'd need to > * create a cleanup flag for each vm when qmevent detects a vm shutting > down (in /var/run/qemu-server/VMID.cleanup, possibly with timestamp) > * removing that cleanup flag after cleanup (obviously) > * on start, check for that flag and block for some timeout before > starting (e.g. check the timestamp in the flag if it's longer than some > time, start it regardless?)
Sounds good to me. Unfortunately, something else: turns out that we kinda rely on qmeventd not doing the cleanup for the optimization with keeping the volumes active (i.e. $keepActive). And actually, the optimization applies randomly depending on who wins the race. Output below with added log line "doing cleanup for $vmid with keepActive=$keepActive" in vm_stop_cleanup() to be able to see what happens. We try to use the optimization but qmeventd interferes: > Feb 19 14:09:43 pve9a1 vzdump[168878]: <root@pam> starting task > UPID:pve9a1:000293AF:0017CFF8:69970B97:vzdump:102:root@pam: > Feb 19 14:09:43 pve9a1 vzdump[168879]: INFO: starting new backup job: vzdump > 102 --storage pbs --mode stop > Feb 19 14:09:43 pve9a1 vzdump[168879]: INFO: Starting Backup of VM 102 (qemu) > Feb 19 14:09:44 pve9a1 qm[168960]: shutdown VM 102: > UPID:pve9a1:00029400:0017D035:69970B98:qmshutdown:102:root@pam: > Feb 19 14:09:44 pve9a1 qm[168959]: <root@pam> starting task > UPID:pve9a1:00029400:0017D035:69970B98:qmshutdown:102:root@pam: > Feb 19 14:09:47 pve9a1 qm[168960]: VM 102 qga command failed - VM 102 qga > command 'guest-ping' failed - got timeout > Feb 19 14:09:50 pve9a1 qmeventd[166736]: read: Connection reset by peer > Feb 19 14:09:50 pve9a1 pvedaemon[166884]: <root@pam> end task > UPID:pve9a1:000290CD:0017B515:69970B52:vncproxy:102:root@pam: OK > Feb 19 14:09:50 pve9a1 systemd[1]: 102.scope: Deactivated successfully. > Feb 19 14:09:50 pve9a1 systemd[1]: 102.scope: Consumed 41.780s CPU time, 1.9G > memory peak. > Feb 19 14:09:51 pve9a1 qm[168960]: doing cleanup for 102 with keepActive=1 > Feb 19 14:09:51 pve9a1 qm[168959]: <root@pam> end task > UPID:pve9a1:00029400:0017D035:69970B98:qmshutdown:102:root@pam: OK > Feb 19 14:09:51 pve9a1 qmeventd[168986]: Starting cleanup for 102 > Feb 19 14:09:51 pve9a1 qm[168986]: doing cleanup for 102 with keepActive=0 > Feb 19 14:09:51 pve9a1 qmeventd[168986]: Finished cleanup for 102 > Feb 19 14:09:51 pve9a1 systemd[1]: Started 102.scope. > Feb 19 14:09:51 pve9a1 vzdump[168879]: VM 102 started with PID 169021. We manage to get the optimization: > Feb 19 14:16:01 pve9a1 qm[174585]: shutdown VM 102: > UPID:pve9a1:0002A9F9:0018636B:69970D11:qmshutdown:102:root@pam: > Feb 19 14:16:04 pve9a1 qm[174585]: VM 102 qga command failed - VM 102 qga > command 'guest-ping' failed - got timeout > Feb 19 14:16:07 pve9a1 qmeventd[166736]: read: Connection reset by peer > Feb 19 14:16:07 pve9a1 systemd[1]: 102.scope: Deactivated successfully. > Feb 19 14:16:07 pve9a1 systemd[1]: 102.scope: Consumed 46.363s CPU time, 2G > memory peak. > Feb 19 14:16:08 pve9a1 qm[174585]: doing cleanup for 102 with keepActive=1 > Feb 19 14:16:08 pve9a1 qm[174582]: <root@pam> end task > UPID:pve9a1:0002A9F9:0018636B:69970D11:qmshutdown:102:root@pam: OK > Feb 19 14:16:08 pve9a1 systemd[1]: Started 102.scope. > Feb 19 14:16:08 pve9a1 qmeventd[174685]: Starting cleanup for 102 > Feb 19 14:16:08 pve9a1 qmeventd[174685]: trying to acquire lock... > Feb 19 14:16:08 pve9a1 vzdump[174326]: VM 102 started with PID 174718. > Feb 19 14:16:08 pve9a1 qmeventd[174685]: OK > Feb 19 14:16:08 pve9a1 qmeventd[174685]: vm still running For regular shutdown, we'll also do the cleanup twice. Maybe we also need a way to tell qmeventd that we already did the cleanup?
