On 20.02.26 at 10:36 AM, Dominik Csapak wrote:
> On 2/19/26 2:27 PM, Fiona Ebner wrote:
>> On 19.02.26 at 11:15 AM, Dominik Csapak wrote:
>>> On 2/16/26 10:15 AM, Fiona Ebner wrote:
>>>> On 16.02.26 at 9:42 AM, Fabian Grünbichler wrote:
>>>>> On February 13, 2026 2:16 pm, Fiona Ebner wrote:
>>>>
>>>> I guess the actual need is to have more consistent behavior.
>>>>
>>>
>>> ok so i think we'd need to
>>> * create a cleanup flag for each vm when qmeventd detects a vm
>>>   shutting down (in /var/run/qemu-server/VMID.cleanup, possibly with
>>>   a timestamp)
>>> * remove that cleanup flag after cleanup (obviously)
>>> * on start, check for that flag and block for some timeout before
>>>   starting (e.g. check the timestamp in the flag; if it's older than
>>>   some limit, start regardless?)
>>
>> Sounds good to me.
>>
>> Unfortunately, something else: it turns out that we kind of rely on
>> qmeventd not doing the cleanup for the optimization that keeps the
>> volumes active (i.e. $keepActive). And in fact, whether the
>> optimization applies is random, depending on who wins the race.
>>
>> Output below, with an added log line
>> "doing cleanup for $vmid with keepActive=$keepActive"
>> in vm_stop_cleanup() to be able to see what happens.
>>
>> We try to use the optimization, but qmeventd interferes:
>>
>>> Feb 19 14:09:43 pve9a1 vzdump[168878]: <root@pam> starting task UPID:pve9a1:000293AF:0017CFF8:69970B97:vzdump:102:root@pam:
>>> Feb 19 14:09:43 pve9a1 vzdump[168879]: INFO: starting new backup job: vzdump 102 --storage pbs --mode stop
>>> Feb 19 14:09:43 pve9a1 vzdump[168879]: INFO: Starting Backup of VM 102 (qemu)
>>> Feb 19 14:09:44 pve9a1 qm[168960]: shutdown VM 102: UPID:pve9a1:00029400:0017D035:69970B98:qmshutdown:102:root@pam:
>>> Feb 19 14:09:44 pve9a1 qm[168959]: <root@pam> starting task UPID:pve9a1:00029400:0017D035:69970B98:qmshutdown:102:root@pam:
>>> Feb 19 14:09:47 pve9a1 qm[168960]: VM 102 qga command failed - VM 102 qga command 'guest-ping' failed - got timeout
>>> Feb 19 14:09:50 pve9a1 qmeventd[166736]: read: Connection reset by peer
>>> Feb 19 14:09:50 pve9a1 pvedaemon[166884]: <root@pam> end task UPID:pve9a1:000290CD:0017B515:69970B52:vncproxy:102:root@pam: OK
>>> Feb 19 14:09:50 pve9a1 systemd[1]: 102.scope: Deactivated successfully.
>>> Feb 19 14:09:50 pve9a1 systemd[1]: 102.scope: Consumed 41.780s CPU time, 1.9G memory peak.
>>> Feb 19 14:09:51 pve9a1 qm[168960]: doing cleanup for 102 with keepActive=1
>>> Feb 19 14:09:51 pve9a1 qm[168959]: <root@pam> end task UPID:pve9a1:00029400:0017D035:69970B98:qmshutdown:102:root@pam: OK
>>> Feb 19 14:09:51 pve9a1 qmeventd[168986]: Starting cleanup for 102
>>> Feb 19 14:09:51 pve9a1 qm[168986]: doing cleanup for 102 with keepActive=0
>>> Feb 19 14:09:51 pve9a1 qmeventd[168986]: Finished cleanup for 102
>>> Feb 19 14:09:51 pve9a1 systemd[1]: Started 102.scope.
>>> Feb 19 14:09:51 pve9a1 vzdump[168879]: VM 102 started with PID 169021.
>>
>> We manage to get the optimization:
>>
>>> Feb 19 14:16:01 pve9a1 qm[174585]: shutdown VM 102: UPID:pve9a1:0002A9F9:0018636B:69970D11:qmshutdown:102:root@pam:
>>> Feb 19 14:16:04 pve9a1 qm[174585]: VM 102 qga command failed - VM 102 qga command 'guest-ping' failed - got timeout
>>> Feb 19 14:16:07 pve9a1 qmeventd[166736]: read: Connection reset by peer
>>> Feb 19 14:16:07 pve9a1 systemd[1]: 102.scope: Deactivated successfully.
>>> Feb 19 14:16:07 pve9a1 systemd[1]: 102.scope: Consumed 46.363s CPU time, 2G memory peak.
>>> Feb 19 14:16:08 pve9a1 qm[174585]: doing cleanup for 102 with keepActive=1
>>> Feb 19 14:16:08 pve9a1 qm[174582]: <root@pam> end task UPID:pve9a1:0002A9F9:0018636B:69970D11:qmshutdown:102:root@pam: OK
>>> Feb 19 14:16:08 pve9a1 systemd[1]: Started 102.scope.
>>> Feb 19 14:16:08 pve9a1 qmeventd[174685]: Starting cleanup for 102
>>> Feb 19 14:16:08 pve9a1 qmeventd[174685]: trying to acquire lock...
>>> Feb 19 14:16:08 pve9a1 vzdump[174326]: VM 102 started with PID 174718.
>>> Feb 19 14:16:08 pve9a1 qmeventd[174685]: OK
>>> Feb 19 14:16:08 pve9a1 qmeventd[174685]: vm still running
>>
>> For a regular shutdown, we'll also do the cleanup twice.
>>
>> Maybe we also need a way to tell qmeventd that we already did the
>> cleanup?
>
>
> ok well then i'd try to do something like this:
>
> in 'vm_stop' we'll create a cleanup flag with timestamp + state
> (e.g. 'queued')
>
> in vm_stop_cleanup we change/create the flag with 'started' and clear
> the flag after cleanup
Why is the one in vm_stop needed? Is there any advantage over creating
it directly in vm_stop_cleanup()?

> (if it's already here in 'started' state within a time limit, ignore
> it)
>
> in vm_start we block until the cleanup flag is gone or until some
> timeout
>
> in 'qm cleanup' we only start it if the flag does not exist

Hmm, it does also call vm_stop_cleanup(), so we could just re-use the
check there for that part? I guess doing an early check doesn't hurt
either, as long as we do call the post-stop hook.

> I think this should make the behavior consistent?
