Happily, I see that Erlang/OTP switched to using a single process "erl_child_setup" to fork all open_port children for performance reasons, some time around OTP-13087. In theory this makes it simple to stop all children when the parent terminates because there's already an intermediate process which can clean up its own children before exit, and it means there's already some indirection of signals and file descriptors which I expect will simplify reasoning about potential edge cases.
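To make the "intermediate process cleans up its own children" idea concrete, here is a minimal sketch in C. It is not erl_child_setup's actual code, and I have not checked how erl_child_setup detects that the VM has gone away. The sketch simply assumes the intermediate holds a pipe or socket to its parent and treats end-of-file on it as "the parent is dead". The names `spawn_tracked`, `kill_all_children`, `wait_for_parent_then_cleanup` and the `MAX_CHILDREN` limit are all hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_CHILDREN 64  /* hypothetical fixed limit, to keep the sketch short */

static pid_t children[MAX_CHILDREN];
static int child_count = 0;

/* Fork and exec argv[0] (looked up in PATH), remembering the child's pid.
 * Returns the pid on success, -1 if the table is full or fork failed. */
pid_t spawn_tracked(char *const argv[]) {
    if (child_count >= MAX_CHILDREN) {
        return -1;
    }
    pid_t pid = fork();
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);  /* only reached if the exec failed */
    }
    if (pid > 0) {
        children[child_count++] = pid;
    }
    return pid;
}

/* SIGKILL and reap every tracked child. Returns how many were reaped. */
int kill_all_children(void) {
    int reaped = 0;
    for (int i = 0; i < child_count; i++) {
        kill(children[i], SIGKILL);
        if (waitpid(children[i], NULL, 0) == children[i]) {
            reaped++;
        }
    }
    child_count = 0;
    return reaped;
}

/* Block until the parent's end of parent_fd is closed (read returns 0),
 * then kill all tracked children. Assumption: the parent keeps the other
 * end open for its whole lifetime, so EOF means the parent has exited. */
int wait_for_parent_then_cleanup(int parent_fd) {
    char buf[64];
    for (;;) {
        ssize_t n = read(parent_fd, buf, sizeof buf);
        if (n > 0) {
            continue;  /* ignore any data the parent sends */
        }
        if (n < 0 && errno == EINTR) {
            continue;  /* interrupted by a signal, retry */
        }
        break;         /* 0 means EOF (parent gone); otherwise a hard error */
    }
    return kill_all_children();
}
```

Because the kernel closes the parent's file descriptors however the parent dies, this approach would also cover a `kill -9` of the parent, and it does not depend on the Linux-only `prctl(2)`. Children that fork their own descendants are not tracked here.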
Thanks, I'll follow this upstream! I've also submitted a small documentation patch to Elixir.

-Adam

On Sunday, February 16, 2025 at 10:00:57 AM UTC+1 José Valim wrote:

> Thank you for the proposal Adam. In this case, the proposal has to be sent
> upstream to Erlang, as Elixir simply delegates the Port functionality to
> the Erlang VM.
>
> *José Valim*
> https://dashbit.co/
>
> On Sun, Feb 16, 2025 at 5:46 AM Adam Wight <adam.m...@gmail.com> wrote:
>
>> While writing a library to integrate with an indivisibly long-running,
>> external program (rsync), I came across the problem described in
>> https://hexdocs.pm/elixir/Port.html#module-zombie-operating-system-processes
>> and I think there may be some fundamental mistakes in the advice given
>> there.
>>
>> Our analysis in the Port documentation says that a polite application
>> will detect when its stdio communication pipes are closed and will then
>> terminate itself. The fact that this is the case seems to be accidental,
>> and is based on an empirical observation that most applications do some
>> sort of I/O, so when one of the standard file descriptors closes the
>> application will encounter a read or write error and will stop. However,
>> there are plenty of applications which can and should continue beyond this
>> condition, and there's even a utility `nohup(1)` for exactly the purpose of
>> allowing applications to ignore problems with stdio, for example when
>> they're launched and backgrounded from an interactive terminal that will be
>> closed.
>>
>> An example of a utility which does no I/O and therefore ignores stdio
>> file descriptor statuses by default is `sleep(1)`, and I don't think it
>> would be correct to make it stop because stdio is closed. Running it under
>> elixir provides a good demonstration of the problem we're looking at here:
>>
>>     elixir -e 'System.cmd(System.find_executable("sleep"), ["60"])'
>>
>> Start that command and then kill the BEAM, and look for the sleep
>> process. It should still be running, in process state "Ss".
>>
>> This also demonstrates a second problem with the Port documentation:
>> the condition we're dealing with is an "orphan process", which is still
>> running but is now unassociated with a BEAM parent and can no longer be
>> controlled or communicated with by Elixir. Orphans are a bigger issue than
>> "zombie processes", which have already terminated and will show up in `ps`
>> output in state "Z", because an orphan can still cause side-effects and
>> consume resources.
>>
>> Some helpful Internet posts led me to what I believe is the correct way
>> to prevent an orphan child process, by calling it through an intermediate
>> application similar to the one suggested by Port docs but using `prctl(2)`
>> instead, which allows the intermediate to monitor the parent process (the
>> BEAM) and kill its child if the parent is terminated. The code below still
>> has a small race condition on launch, but I'll share it anyway:
>>
>> ```c
>> #define _XOPEN_SOURCE 700
>> #include <signal.h>
>> #include <stddef.h>
>> #include <stdlib.h>
>> #include <sys/prctl.h>
>> #include <sys/wait.h>
>> #include <unistd.h>
>>
>> pid_t child_pid;
>>
>> void handle_signal(int signum) {
>>   if (signum == SIGHUP && child_pid > 0) {
>>     kill(child_pid, SIGKILL);
>>   }
>> }
>>
>> int main(int argc, char* argv[]) {
>>   // Require at least the path of the command to run.
>>   if (argc < 2) {
>>     return 2;
>>   }
>>
>>   // Send this process a HUP if the parent BEAM VM dies.
>>   // FIXME: race condition until this line, if the parent is already dead.
>>   prctl(PR_SET_PDEATHSIG, SIGHUP);
>>
>>   // Listen for HUP and respond by killing the child process.
>>   struct sigaction action;
>>   action.sa_handler = handle_signal;
>>   action.sa_flags = 0;
>>   sigemptyset(&action.sa_mask);
>>   sigaction(SIGHUP, &action, NULL);
>>
>>   child_pid = fork();
>>   if (child_pid == 0) {
>>     // Shift argv left by one; argv[argc] is NULL, so the list stays terminated.
>>     const char* command = argv[1];
>>     for (int i = 0; i < argc; i++) {
>>       argv[i] = argv[i + 1];
>>     }
>>     execv(command, argv);
>>     // Only reached if execv failed.
>>     _exit(127);
>>   } else {
>>     waitpid(child_pid, NULL, 0);
>>   }
>>
>>   return 0;
>> }
>> ```
>>
>> To try it out, save as main.c and compile like so:
>>
>>     cc -g -O3 -std=c99 -pedantic -o parent-monitor main.c
>>
>> This tool shifts its argv by one to construct the child process, so from
>> the command line it would be called like `parent-monitor /usr/bin/sleep 60`
>> and to exercise a BEAM crash you can call it like so (after adjusting the
>> paths for your system):
>>
>>     elixir -e 'System.cmd("./parent-monitor", ["/usr/bin/sleep", "60"])'
>>
>> Now you can see the sleep process is killed as soon as the VM is stopped.
>>
>> Although it's a fringe issue since the BEAM is normally stopped only
>> during development or when deploying new code, I feel like it could be
>> useful to bundle the behavior into Elixir or Erlang itself. This could be
>> seen as an elegant extension of the OTP supervisor tree principle beyond
>> the VM boundary; it seems to have some real-world consequences, and it's an
>> obscure problem for an application developer to solve from scratch each
>> time.
>>
>> There's one more use case to mention, that such a behavior should
>> probably be made optional, maybe as a flag to Port. I would imagine the
>> default should be to use the wrapper, so the implicit option might look
>> like `allow_orphan: false`. The rare case where we omit the wrapper would
>> be when it's more useful to let the child continue than to maintain control
>> over it.
>>
>> Kind regards,
>> Adam Wight
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "elixir-lang-core" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to elixir-lang-co...@googlegroups.com.
>> To view this discussion visit
>> https://groups.google.com/d/msgid/elixir-lang-core/CAF56aJK-R_gyTSLBmYE%3DsWMOHMgZyujAJZBO6sHO4x2tekC41w%40mail.gmail.com

--
To view this discussion visit https://groups.google.com/d/msgid/elixir-lang-core/d8e674c4-1953-4449-b902-8029889f4507n%40googlegroups.com.