For visibility, here are the upstream discussion and patch:

* https://erlangforums.com/t/open-port-and-zombie-processes/3111/5
* https://github.com/erlang/otp/pull/9453
My biggest question for that group (and here as well!) is whether not terminating child processes was a bug or a feature :-D I feel that it's a big change in behavior, but very much in a corner case where the behavior was unspecified.

For comparison, the erlexec library always kills its spawned processes, and in fact it offers a range of killing methods: kill only the direct child process, kill the process group, kill with a custom command, or wait some amount of time between sending SIGTERM and SIGKILL. As far as I can see there is no option to leave the spawned processes alive. With 57M total downloads for erlexec and that many pro-kill options, this is a strong hint that nobody wants no-kill.

On Monday, February 17, 2025 at 9:47:48 AM UTC+1 Adam Wight wrote:

> Happily, I see that Erlang/OTP switched to using a single process
> "erl_child_setup" to fork all open_port children for performance reasons,
> some time around OTP-13087. In theory this makes it simple to stop all
> children when the parent terminates because there's already an intermediate
> process which can clean up its own children before exit, and it means
> there's already some indirection of signals and file descriptors which I
> expect will simplify reasoning about potential edge cases.
>
> Thanks, I'll follow this upstream! I've also submitted a small
> documentation patch to Elixir.
>
> -Adam
>
> On Sunday, February 16, 2025 at 10:00:57 AM UTC+1 José Valim wrote:
>
>> Thank you for the proposal Adam. In this case, the proposal has to be
>> sent upstream to Erlang, as Elixir simply delegates the Port functionality
>> to the Erlang VM.
>>
>>
>> *José Valim*
>> https://dashbit.co/
>>
>>
>> On Sun, Feb 16, 2025 at 5:46 AM Adam Wight <adam.m...@gmail.com> wrote:
>>
>>> While writing a library to integrate with an indivisibly long-running,
>>> external program (rsync), I came across the problem described in
>>> https://hexdocs.pm/elixir/Port.html#module-zombie-operating-system-processes
>>>
>>> and I think there may be some fundamental mistakes in the advice given
>>> there.
>>>
>>> Our analysis in the Port documentation says that a polite application
>>> will detect when its stdio communication pipes are closed and will then
>>> terminate itself. The fact that this is the case seems to be accidental,
>>> and is based on an empirical observation that most applications do some
>>> sort of I/O, so when one of the standard file descriptors closes the
>>> application will encounter a read or write error and will stop. However,
>>> there are plenty of applications which can and should continue beyond this
>>> condition, and there's even a utility `nohup(1)` for exactly the purpose of
>>> allowing applications to ignore problems with stdio, for example when
>>> they're launched and backgrounded from an interactive terminal that will be
>>> closed.
>>>
>>> An example of a utility which does no I/O and therefore ignores stdio
>>> file descriptor statuses by default is `sleep(1)`, and I don't think it
>>> would be correct to make it stop because stdio is closed. Running under
>>> Elixir provides a good demonstration of the problem we're looking at here:
>>>
>>>     elixir -e 'System.cmd(System.find_executable("sleep"), ["60"])'
>>>
>>> Start that command and then kill the BEAM, and look for the sleep
>>> process. It should still be running, in process state "Ss".
>>>
>>> This will also demonstrate a second problem with the Port documentation,
>>> that the condition we're dealing with is an "orphan process" which is still
>>> running but is now unassociated with a BEAM parent and can no longer be
>>> controlled or communicated with by Elixir. Orphans are a bigger issue than
>>> "zombie processes", which have already terminated and will show up in `ps`
>>> output in state "Z", because an orphan can still cause side-effects and
>>> consume resources.
>>>
>>> Some helpful Internet posts led me to what I believe is the correct way
>>> to prevent an orphan child process, by calling it through an intermediate
>>> application similar to the one suggested by Port docs but using `prctl(2)`
>>> instead, which allows the intermediate to monitor the parent process (the
>>> BEAM) and kill its child if the parent is terminated. The code below still
>>> has a small race condition on launch, but I'll share it anyway:
>>>
>>> ```c
>>> #define _XOPEN_SOURCE 700
>>> #include <signal.h>
>>> #include <stddef.h>
>>> #include <stdlib.h>
>>> #include <sys/prctl.h>
>>> #include <sys/wait.h>
>>> #include <unistd.h>
>>>
>>> pid_t child_pid;
>>>
>>> void handle_signal(int signum) {
>>>   if (signum == SIGHUP && child_pid > 0) {
>>>     kill(child_pid, SIGKILL);
>>>   }
>>> }
>>>
>>> int main(int argc, char* argv[]) {
>>>   // Nothing to run unless a command was given.
>>>   if (argc < 2) {
>>>     return EXIT_FAILURE;
>>>   }
>>>
>>>   // Send this process a HUP if the parent BEAM VM dies.
>>>   // FIXME: race condition until this line, if the parent is already dead.
>>>   prctl(PR_SET_PDEATHSIG, SIGHUP);
>>>
>>>   // Listen for HUP and respond by killing the child process.
>>>   struct sigaction action;
>>>   action.sa_handler = handle_signal;
>>>   action.sa_flags = 0;
>>>   sigemptyset(&action.sa_mask);
>>>   sigaction(SIGHUP, &action, NULL);
>>>
>>>   child_pid = fork();
>>>   if (child_pid == 0) {
>>>     // Shift argv left by one; argv[argc] is NULL and ends the new list.
>>>     const char* command = argv[1];
>>>     for (int i = 0; i < argc; i++) {
>>>       argv[i] = argv[i + 1];
>>>     }
>>>     execv(command, argv);
>>>     // Only reached if execv failed.
>>>     _exit(127);
>>>   } else {
>>>     waitpid(child_pid, NULL, 0);
>>>   }
>>>
>>>   return 0;
>>> }
>>> ```
>>>
>>> To try it out, save as main.c and compile like so:
>>>
>>>     cc -g -O3 -std=c99 -pedantic -o parent-monitor main.c
>>>
>>> This tool shifts its argv by one to construct the child process, so from
>>> the command line it would be called like `parent-monitor /usr/bin/sleep 60`
>>> and to exercise a BEAM crash you can call it like so (after adjusting the
>>> paths for your system):
>>>
>>>     elixir -e 'System.cmd("./parent-monitor", ["/usr/bin/sleep", "60"])'
>>>
>>> Now you can see the sleep process is killed as soon as the VM is stopped.
>>>
>>> Although it's a fringe issue since the BEAM is normally stopped only
>>> during development or when deploying new code, I feel like it could be
>>> useful to bundle the behavior into Elixir or Erlang itself. This could be
>>> seen as an elegant extension of the OTP supervisor tree principle beyond
>>> the VM boundary; it seems to have some real-world consequences, and it's an
>>> obscure problem for an application developer to solve from scratch each
>>> time.
>>>
>>> There's one more use case to mention, that such a behavior should
>>> probably be made optional, maybe as a flag to Port. I would imagine the
>>> default should be to use the wrapper, so the implicit option might look
>>> like `allow_orphan: false`. The rare case where we omit the wrapper would
>>> be when it's more useful to let the child continue than to maintain control
>>> over it.
>>>
>>> Kind regards,
>>> Adam Wight
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "elixir-lang-core" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to elixir-lang-co...@googlegroups.com.
>>> To view this discussion visit
>>> https://groups.google.com/d/msgid/elixir-lang-core/CAF56aJK-R_gyTSLBmYE%3DsWMOHMgZyujAJZBO6sHO4x2tekC41w%40mail.gmail.com
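A note on the launch race flagged by the FIXME in the quoted wrapper: a common way to narrow it is to compare `getppid()` against the expected parent pid right after the `prctl` call. If the two differ, the parent died before the death signal was armed. The sketch below is only an illustration of that check. The helper name `arm_parent_death_signal` is made up for this example, and how a wrapper started through exec would learn the expected pid is an assumption left to the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of the quoted wrapper.
 *
 * Asks the kernel to deliver `signum` to this process when its parent
 * dies, then checks that the parent is still the one the caller expected.
 *
 * Returns 0 when the death signal is armed and the parent is unchanged.
 * Returns -1 when prctl() failed, or when the parent no longer matches,
 * which means it exited before the request took effect and the caller
 * should clean up and exit on its own.
 */
int arm_parent_death_signal(pid_t expected_parent, int signum) {
    assert(signum > 0);

    if (prctl(PR_SET_PDEATHSIG, signum) == -1) {
        return -1;
    }

    /* After reparenting, getppid() reports the new parent instead. */
    if (getppid() != expected_parent) {
        return -1;
    }

    return 0;
}
```

In a fork-based launcher the parent would record `getpid()` before `fork()` and the child would pass that value in. A wrapper started through exec has no such value handed to it, so it would need to receive the pid some other way, for instance as an extra argument; that part is deliberately left open here.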