> Denis Kanchev <denis.kanc...@storpool.com> wrote on 29.05.2025 09:33 CEST:
>
> The issue here is that the storage plugin's activate_volume() is called
> after migrate_cancel, which in the case of network-shared storage can make
> things go bad. This is a sort of race condition, because migrate_cancel
> won't stop the storage migration on the remote server. As you can see
> below, a call to activate_volume() is performed after migrate_cancel.
> In this case we issue a volume detach on the old node (to keep the data
> consistent) and we end up with a VM (not migrated) without this volume
> attached.
> We keep track of whether activate_volume() is called as part of a
> migration via the 'lock' => 'migrate' flag, which is cleared on
> migrate_cancel - in the migration case we won't detach the volume from
> the old VM.
> In short: when the parent of this storage migration task gets killed, the
> source node stops the migration, but the storage migration on the
> destination node continues.
>
> Source node:
> 2025-04-11 03:26:50 starting migration of VM 2421 to node 'telpr01pve03' (10.10.17.3)
> 2025-04-11 03:26:50 starting VM 2421 on remote node 'telpr01pve03'
> 2025-04-11 03:26:52 ERROR: online migrate failure - remote command failed with exit code 255
> 2025-04-11 03:26:52 aborting phase 2 - cleanup resources
> 2025-04-11 03:26:52 migrate_cancel   # <<< NOTE the time
> 2025-04-11 03:26:53 ERROR: migration finished with problems (duration 00:00:03)
> TASK ERROR: migration problems
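(For context, the guard Denis describes could look roughly like the sketch
below. This is illustrative only, not the actual StorPool plugin code: the
package name and the detach step are made up, while the activate_volume()
signature and the 'lock' config property are standard PVE. The point is that
once migrate_cancel has cleared the lock, a late activate_volume() call takes
the force-detach branch.)

package PVE::Storage::Custom::ExamplePlugin;

use strict;
use warnings;

use PVE::QemuConfig;

use base qw(PVE::Storage::Plugin);

sub activate_volume {
    my ($class, $storeid, $scfg, $volname, $snapname, $cache) = @_;

    # derive the owning VM from the volume name (default parser shown;
    # a real plugin would use its own naming scheme)
    my (undef, $name, $vmid) = $class->parse_volname($volname);

    if (defined($vmid)) {
        my $conf = PVE::QemuConfig->load_config($vmid);
        if (($conf->{lock} // '') eq 'migrate') {
            # live migration in flight: leave the volume attached on the
            # source node, only make it available here
            return;
        }
        # no 'migrate' lock (e.g. already cleared by migrate_cancel): the
        # plugin assumes a plain VM start and force-detaches the volume
        # from other nodes - the step that goes wrong when this runs
        # *after* an aborted migration
    }
    return;
}

1;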
could you provide the full migration task log and the VM config?

I thought your storage plugin is a shared storage, so there is no storage
migration at all, yet you keep talking about storage migration?

> Destination node:
> 2025-04-11T03:26:51.559671+07:00 telpr01pve03 qm[3670216]: <root@pam> starting task UPID:telpr01pve03:003800D4:00928867:67F8298B:qmstart:2421:root@pam:
> 2025-04-11T03:26:51.559897+07:00 telpr01pve03 qm[3670228]: start VM 2421: UPID:telpr01pve03:003800D4:00928867:67F8298B:qmstart:2421:root@pam:

so starting the VM on the target node failed? why?

> 2025-04-11T03:26:51.837905+07:00 telpr01pve03 qm[3670228]: StorPool plugin: Volume ~bj7n.b.abe is related to VM 2421, checking status   ### Call to PVE::Storage::Plugin::activate_volume()
> 2025-04-11T03:26:53.072206+07:00 telpr01pve03 qm[3670228]: StorPool plugin: NOT a live migration of VM 2421, will force detach volume ~bj7n.b.abe   ### 'lock' flag missing
> 2025-04-11T03:26:53.108206+07:00 telpr01pve03 qm[3670228]: StorPool plugin: Volume ~bj7n.b.sdj is related to VM 2421, checking status   ### Second call to activate_volume() after migrate_cancel
> 2025-04-11T03:26:53.903357+07:00 telpr01pve03 qm[3670228]: StorPool plugin: NOT a live migration of VM 2421, will force detach volume ~bj7n.b.sdj   ### 'lock' flag missing
>
> On Wed, May 28, 2025 at 9:33 AM Fabian Grünbichler
> <f.gruenbich...@proxmox.com> wrote:
> >
> > > Denis Kanchev <denis.kanc...@storpool.com> wrote on 28.05.2025 08:13 CEST:
> > >
> > > Here is the task log
> > > 2025-04-11 03:45:42 starting migration of VM 2282 to node 'telpr01pve05' (10.10.17.5)
> > > 2025-04-11 03:45:42 starting VM 2282 on remote node 'telpr01pve05'
> > > 2025-04-11 03:45:45 [telpr01pve05] Warning: sch_htb: quantum of class 10001 is big. Consider r2q change.
> > > 2025-04-11 03:45:46 [telpr01pve05] Dump was interrupted and may be inconsistent.
> > > 2025-04-11 03:45:46 [telpr01pve05] kvm: failed to find file '/usr/share/qemu-server/bootsplash.jpg'
> > > 2025-04-11 03:45:46 start remote tunnel
> > > 2025-04-11 03:45:46 ssh tunnel ver 1
> > > 2025-04-11 03:45:46 starting online/live migration on unix:/run/qemu-server/2282.migrate
> > > 2025-04-11 03:45:46 set migration capabilities
> > > 2025-04-11 03:45:46 migration downtime limit: 100 ms
> > > 2025-04-11 03:45:46 migration cachesize: 4.0 GiB
> > > 2025-04-11 03:45:46 set migration parameters
> > > 2025-04-11 03:45:46 start migrate command to unix:/run/qemu-server/2282.migrate
> > > 2025-04-11 03:45:47 migration active, transferred 152.2 MiB of 24.0 GiB VM-state, 162.1 MiB/s
> > > ...
> > > 2025-04-11 03:46:49 migration active, transferred 15.2 GiB of 24.0 GiB VM-state, 2.0 GiB/s
> > > 2025-04-11 03:46:50 migration status error: failed
> > > 2025-04-11 03:46:50 ERROR: online migrate failure - aborting
> > > 2025-04-11 03:46:50 aborting phase 2 - cleanup resources
> > > 2025-04-11 03:46:50 migrate_cancel
> > > 2025-04-11 03:46:52 ERROR: migration finished with problems (duration 00:01:11)
> > > TASK ERROR: migration problems
> >
> > okay, so no local disks involved.. not sure which process got killed
> > then? ;) the state transfer happens entirely within the Qemu process,
> > perl is just polling it to print the status, and that perl task worker
> > is not OOM killed since it continues to print all the error handling
> > messages..
> >
> > > > that has weird implications with regards to threads, so I don't
> > > > think that is a good idea..
> > > What do you mean by that? Are any threads involved?
> >
> > not intentionally, no. the issue is that the whole "pr_set_deathsig"
> > machinery works on the thread level, not the process level, for
> > historical reasons. so it actually would kill the child if the thread
> > that called pr_set_deathsig exits..
> >
> > I think we do want to improve how run_command handles the parent
> > disappearing. but it's not that straight-forward to implement in a
> > race-free fashion (in Perl).
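(For reference, the usual pattern, and the race being alluded to, looks
roughly like the sketch below. Assumptions: x86_64 Linux, where prctl(2) is
syscall 157 and PR_SET_PDEATHSIG is 1; 'sleep 3600' stands in for the real
child command.)

use strict;
use warnings;
use POSIX qw(SIGKILL);

# assumption: x86_64 Linux; prctl(2) is syscall 157, PR_SET_PDEATHSIG is 1
my $SYS_prctl = 157;
my $PR_SET_PDEATHSIG = 1;

my $parent = $$;
my $pid = fork() // die "fork failed: $!";
if ($pid == 0) {
    # ask the kernel to SIGKILL us when the parent goes away. caveat: this
    # fires when the creating *thread* exits, not only when the whole
    # parent process dies - the historical quirk mentioned above.
    syscall($SYS_prctl, $PR_SET_PDEATHSIG, SIGKILL) == 0
        or die "prctl(PR_SET_PDEATHSIG) failed: $!";
    # the race: the parent may already have died between fork() and
    # prctl(), in which case no signal will ever arrive - so re-check
    # the parent explicitly after arming the death signal.
    POSIX::_exit(1) if getppid() != $parent;
    exec('sleep', '3600') or POSIX::_exit(1);
}
waitpid($pid, 0);

Even with the getppid() re-check, the thread-level semantics mean this is
not automatically safe when the forking parent is (or may become) threaded,
which is presumably part of what makes a race-free run_command fix
non-trivial.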