> Denis Kanchev <denis.kanc...@storpool.com> wrote on 22.05.2025 08:55 CEST:
>
> The parent of the storage migration process gets killed.
>
> It seems that this is the desired behavior and, as far as I understand it,
> the child worker is detached from the parent and has nothing to do with it
> after spawning.

was this a remote migration or a regular migration? could you maybe post the
full task log?

for a regular migration, the storage migration just uses our "run_command"
helper. run_command uses open3 to spawn the command, and select for command
output handling.

basically the process tree would look like this:

API worker (one of X in pvedaemon)
-> task worker (executing the migration code)
--> storage migration command (xxx | ssh target_node xxx)

and it does seem like run_command doesn't properly forward the parent being
killed/terminated:

$ perl -e 'use strict; use warnings; use PVE::Tools; warn "parent pid: $$\n"; PVE::Tools::run_command([["bash", "-c", "sleep 10; sleep 20; echo after > /tmp/file"]]);'
parent pid: 204620
[1] 204618 terminated sudo perl -e

(sending SIGTERM from another shell to 204620). the bash command continues
executing, and also writes to /tmp/file after the sleeps are finished.

the same is also true for SIGKILL. SIGINT properly cleans up the child
though.

@Wolfgang: is this desired behaviour?

> Thanks for the information, it was very helpful.
>
> On 22.05.25 at 9:30, Fabian Grünbichler wrote:
> >> Denis Kanchev via pve-devel <pve-devel@lists.proxmox.com> wrote on
> >> 21.05.2025 15:13 CEST:
> >> Hello,
> >>
> >> We had an issue with a customer migrating a VM between nodes using our
> >> shared storage solution.
> >>
> >> On the target host the OOM killer killed the main migration process, but
> >> the child process (which actually performs the migration) kept on
> >> working, which we did not expect, and that caused some issues.
> > could you be more specific which process got killed?
> >
> > when you do a migration, a task worker is forked and its UPID is returned
> > to the caller for further querying.
> >
> > as part of the migration, other processes get spawned:
> > - ssh tunnel to the target node
> > - storage migration processes (on both nodes)
> > - VM state management CLI calls (on the target node)
> >
> > which of those is the "main migration process"? which is the child process?
> >
> >> This leads us to the broader question - after a request is submitted,
> >> the parent can be terminated, and not return a response to the client,
> >> while the work is being done, and the request can be wrongly retried or
> >> considered unfinished.
> > the parent should return almost immediately, as all it is doing at that
> > point is returning the UPID to the client (the process then continues to
> > do other work though, but that is no longer related to this task).
> >
> > the only exception is for "sync" task workers, like in a CLI context,
> > where the "parent" has no other work to do, so it waits for the child/task
> > to finish and prints its output while doing so, and some "bulk action"
> > style API calls that fork multiple task workers and poll them themselves.
> >
> >> Should the child processes terminate together with the parent to guard
> >> against this, or is this expected behavior?
> > the parent (API worker process) and child (task worker process) have no
> > direct relation after the task worker has been spawned.
> >
> >> Here is an example patch to do this:
> >>
> >> diff --git a/src/PVE/RESTEnvironment.pm b/src/PVE/RESTEnvironment.pm
> >> index bfde7e6..744fffc 100644
> >> --- a/src/PVE/RESTEnvironment.pm
> >> +++ b/src/PVE/RESTEnvironment.pm
> >> @@ -13,8 +13,9 @@ use Fcntl qw(:flock);
> >>  use IO::File;
> >>  use IO::Handle;
> >>  use IO::Select;
> >> -use POSIX qw(:sys_wait_h EINTR);
> >> +use POSIX qw(:sys_wait_h EINTR SIGKILL);
> >>  use AnyEvent;
> >> +use Linux::Prctl qw(set_pdeathsig);
> >>
> >>  use PVE::Exception qw(raise raise_perm_exc);
> >>  use PVE::INotify;
> >> @@ -549,6 +550,9 @@ sub fork_worker {
> >>  POSIX::setsid();
> >>  }
> >>
> >> + # The signal that the calling process will get when its parent dies
> >> + set_pdeathsig(SIGKILL);
> > that has weird implications with regards to threads, so I don't think that
> > is a good idea..
> >
> >> +
> >>  POSIX::close ($psync[0]);
> >>  POSIX::close ($ctrlfd[0]) if $sync;
> >>  POSIX::close ($csync[1]);

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
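
The effect discussed in this thread (a parent that dies from SIGTERM leaves its spawned command running unless it forwards the signal) can be reproduced without PVE::Tools. What follows is only a shell sketch of the general mechanism, not of how run_command works internally. The function names plain_parent, forwarding_parent and run_case, the marker files, and the sleep durations are all hypothetical and chosen for illustration.

```shell
#!/bin/sh
# Sketch: does a spawned command keep running after its parent gets SIGTERM?
# All function names and marker files below are illustrative assumptions.

workdir=$(mktemp -d)

# Parent with no signal handling: runs the child in the foreground.
plain_parent() {
    sh -c 'sleep 2; echo after > "$1"' child "$1"
    :   # trailing command so the shell does not exec the child directly
}

# Parent that forwards SIGTERM to its direct child before exiting.
# Only the direct child is signalled; that child's own "sleep" may linger
# briefly, but the "echo" that follows it never runs.
forwarding_parent() {
    sh -c 'sleep 2; echo after > "$1"' child "$1" &
    child=$!
    trap 'kill -TERM "$child" 2>/dev/null; exit 143' TERM
    wait "$child"
}

# $1 = parent function, $2 = marker file written by the child on completion.
run_case() {
    rm -f "$2"
    "$1" "$2" >/dev/null 2>&1 &
    parent=$!
    sleep 1                 # give the parent time to start its child
    kill -TERM "$parent"    # terminate the parent only
    sleep 2                 # outlast the child's own sleep
    if [ -e "$2" ]; then echo survived; else echo stopped; fi
}

plain_result=$(run_case plain_parent "$workdir/plain")
forward_result=$(run_case forwarding_parent "$workdir/forward")

echo "no forwarding:   child $plain_result"
echo "with forwarding: child $forward_result"

rm -rf "$workdir"
```

With a POSIX sh on Linux, the first case is expected to report that the child survived (matching the /tmp/file observation above) and the second that it was stopped. A handler like this cannot help with SIGKILL, which cannot be trapped; covering that case needs a kernel-side mechanism such as the parent-death signal used in the quoted patch.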