Fantastic work, thank you very much for carrying this torch!

*José Valim*
https://dashbit.co/


On Thu, Feb 20, 2025 at 8:18 AM Adam Wight <adam.m.wi...@gmail.com> wrote:

> For visibility, here are the upstream discussion and patch:
> * https://erlangforums.com/t/open-port-and-zombie-processes/3111/5
> * https://github.com/erlang/otp/pull/9453
>
> My biggest question for that group (and here as well!) is whether not
> terminating child processes was a bug or a feature :-D - I feel that it's a
> big change to behavior but very much in a corner case, where behavior was
> unspecified.
>
> For comparison, the erlexec library always kills its spawned processes and
> in fact it offers a range of killing methods: kill only the direct child
> process, kill the process group, kill with a custom command, wait some
> amount of time between SIGTERM and then SIGKILL. There is no option to not
> kill the spawned processes as far as I see, which at 57M total downloads
> for erlexec, and many pro-kill options, is a strong hint that nobody wants
> no-kill.
> On Monday, February 17, 2025 at 9:47:48 AM UTC+1 Adam Wight wrote:
>
>> Happily, I see that Erlang/OTP switched to using a single process
>> "erl_child_setup" to fork all open_port children for performance reasons,
>> some time around OTP-13087. In theory this makes it simple to stop all
>> children when the parent terminates because there's already an intermediate
>> process which can clean up its own children before exit, and it means
>> there's already some indirection of signals and file descriptors which I
>> expect will simplify reasoning about potential edge cases.
>>
>> Thanks, I'll follow this upstream! I've also submitted a small
>> documentation patch to Elixir.
>>
>> -Adam
>>
>> On Sunday, February 16, 2025 at 10:00:57 AM UTC+1 José Valim wrote:
>>
>>> Thank you for the proposal Adam. In this case, the proposal has to be
>>> sent upstream to Erlang, as Elixir simply delegates the Port functionality
>>> to the Erlang VM.
>>>
>>>
>>> *José Valim*
>>> https://dashbit.co/
>>>
>>>
>>> On Sun, Feb 16, 2025 at 5:46 AM Adam Wight <adam.m...@gmail.com> wrote:
>>>
>>>> While writing a library to integrate with an indivisibly long-running,
>>>> external program (rsync), I came across the problem described in
>>>> https://hexdocs.pm/elixir/Port.html#module-zombie-operating-system-processes
>>>> and I think there may be some fundamental mistakes in the advice given
>>>> there.
>>>>
>>>> Our analysis in the Port documentation says that a polite application
>>>> will detect when its stdio communication pipes are closed and will then
>>>> terminate itself. The fact that this is the case seems to be accidental,
>>>> and is based on an empirical observation that most applications do some
>>>> sort of I/O, so when one of the standard file descriptors closes the
>>>> application will encounter a read or write error and will stop. However,
>>>> there are plenty of applications which can and should continue beyond this
>>>> condition, and there's even a utility `nohup(1)` for exactly the purpose of
>>>> allowing applications to ignore problems with stdio, for example when
>>>> they're launched and backgrounded from an interactive terminal that will be
>>>> closed.
>>>>
>>>> An example of a utility which does no I/O and therefore ignores stdio
>>>> file descriptor statuses by default is `sleep(1)`, and I don't think it
>>>> would be correct to make it stop because stdio is closed. Running it under
>>>> Elixir provides a good demonstration of the problem we're looking at here:
>>>>
>>>>     elixir -e 'System.cmd(System.find_executable("sleep"), ["60"])'
>>>>
>>>> Start that command and then kill the BEAM, and look for the sleep
>>>> process. It should still be running, in process state "Ss".
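
For contrast with `sleep`, here is a minimal C sketch of the "polite" behavior that the Port documentation relies on: a program noticing that the other end of its stdin pipe has gone away. The helper name `peer_closed` is hypothetical and not part of any library. It assumes a POSIX system, and it assumes the caller does not mind pending input being read and discarded.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical helper: returns true once the peer of `fd` has closed it
 * and no unread data remains. Never blocks. Any input that is still
 * pending gets read and thrown away, which is only acceptable for a
 * program that does not otherwise use this descriptor. */
bool peer_closed(int fd) {
  assert(fd >= 0);

  struct pollfd pfd = { .fd = fd, .events = POLLIN };

  /* Timeout 0: report the current state without waiting. */
  if (poll(&pfd, 1, 0) <= 0) {
    return false; /* nothing to report, or poll itself failed */
  }

  if (pfd.revents & POLLIN) {
    char buf[256];
    /* A read of 0 bytes means end-of-file: the writer is gone. */
    return read(fd, buf, sizeof buf) == 0;
  }

  /* No data, but the descriptor was hung up or is in error. */
  return (pfd.revents & (POLLHUP | POLLERR)) != 0;
}
```

A long-running program would have to call something like `peer_closed(STDIN_FILENO)` periodically and exit when it returns true. `sleep` does nothing of the kind, which is why it outlives the BEAM in the demonstration above.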
>>>>
>>>> This will also demonstrate a second problem with the Port
>>>> documentation, that the condition we're dealing with is an "orphan process"
>>>> which is still running but is now unassociated with a BEAM parent and can
>>>> no longer be controlled or communicated with by Elixir. Orphans are a
>>>> bigger issue than "zombie processes", which have already terminated and
>>>> will show up in `ps` output in state "Z", because an orphan can still cause
>>>> side effects and consume resources.
>>>>
>>>> Some helpful Internet posts led me to what I believe is the correct way
>>>> to prevent an orphan child process, by calling it through an intermediate
>>>> application similar to the one suggested by Port docs but using `prctl(2)`
>>>> instead, which allows the intermediate to monitor the parent process (the
>>>> BEAM) and kill its child if the parent is terminated. The code below still
>>>> has a small race condition on launch, but I'll share it anyway:
>>>>
>>>> ```c
>>>> #define _XOPEN_SOURCE 700
>>>> #include <signal.h>
>>>> #include <stddef.h>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> #include <sys/prctl.h>
>>>> #include <sys/wait.h>
>>>> #include <unistd.h>
>>>>
>>>> // Written in main(), read in the signal handler.
>>>> static volatile pid_t child_pid = 0;
>>>>
>>>> void handle_signal(int signum) {
>>>>   if (signum == SIGHUP && child_pid > 0) {
>>>>     kill(child_pid, SIGKILL);
>>>>   }
>>>> }
>>>>
>>>> int main(int argc, char* argv[]) {
>>>>   if (argc < 2) {
>>>>     fprintf(stderr, "usage: parent-monitor command [args...]\n");
>>>>     return EXIT_FAILURE;
>>>>   }
>>>>
>>>>   // Send this process a HUP if the parent BEAM VM dies.
>>>>   // FIXME: race condition until this line, if the parent is already dead.
>>>>   prctl(PR_SET_PDEATHSIG, SIGHUP);
>>>>
>>>>   // Listen for HUP and respond by killing the child process.
>>>>   struct sigaction action;
>>>>   action.sa_handler = handle_signal;
>>>>   action.sa_flags = 0;
>>>>   sigemptyset(&action.sa_mask);
>>>>   sigaction(SIGHUP, &action, NULL);
>>>>
>>>>   child_pid = fork();
>>>>   if (child_pid < 0) {
>>>>     perror("fork");
>>>>     return EXIT_FAILURE;
>>>>   }
>>>>   if (child_pid == 0) {
>>>>     // Shift argv left by one so the command becomes argv[0]. argv[argc]
>>>>     // is NULL, so the shifted array stays NULL-terminated.
>>>>     const char* command = argv[1];
>>>>     for (int i = 0; i < argc; i++) {
>>>>       argv[i] = argv[i + 1];
>>>>     }
>>>>     execv(command, argv);
>>>>     // Only reached if execv failed.
>>>>     perror("execv");
>>>>     _exit(127);
>>>>   } else {
>>>>     waitpid(child_pid, NULL, 0);
>>>>   }
>>>>
>>>>   return 0;
>>>> }
>>>> ```
>>>>
>>>> To try it out, save as main.c and compile like so:
>>>>
>>>>     cc -g -O3 -std=c99 -pedantic -o parent-monitor main.c
>>>>
>>>> This tool shifts its argv by one to construct the child process, so
>>>> from the command line it would be called like `parent-monitor
>>>> /usr/bin/sleep 60` and to exercise a BEAM crash you can call it like so
>>>> (after adjusting the paths for your system):
>>>>
>>>>     elixir -e 'System.cmd("./parent-monitor", ["/usr/bin/sleep", "60"])'
>>>>
>>>> Now you can see the sleep process is killed as soon as the VM is
>>>> stopped.
>>>>
>>>> Although it's a fringe issue since the BEAM is normally stopped only
>>>> during development or when deploying new code, I feel like it could be
>>>> useful to bundle the behavior into Elixir or Erlang itself. This could be
>>>> seen as an elegant extension of the OTP supervisor tree principle beyond
>>>> the VM boundary, it seems to have some real-world consequences, and it's an
>>>> obscure problem for an application developer to solve from scratch each
>>>> time.
>>>>
>>>> There's one more use case to mention, that such a behavior should
>>>> probably be made optional, maybe as a flag to Port. I would imagine the
>>>> default should be to use the wrapper, so the implicit option might look
>>>> like `allow_orphan: false`. The rare case where we omit the wrapper would
>>>> be when it's more useful to let the child continue than to maintain control
>>>> over it.
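
On the race condition flagged in the FIXME above, one common mitigation is to re-check the parent immediately after arming the death signal: if the parent exited first, the wrapper has already been reparented and `getppid()` no longer matches. The sketch below is Linux-only, like `prctl(2)` itself. The function name `arm_parent_death_signal` and its `expected_parent` argument are hypothetical, and how the launcher's PID would reach the wrapper (for example as an extra argument) is an assumption left open here.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: asks the kernel to deliver `signum` to this process
 * when its parent dies, then verifies that the parent is still the one the
 * caller expected. Returns false if arming failed or if the expected
 * parent is no longer our parent, in which case the caller should treat
 * the parent as dead and clean up right away. */
bool arm_parent_death_signal(pid_t expected_parent, int signum) {
  /* A signal number of 0 would clear the setting instead of arming it. */
  assert(signum > 0);

  if (prctl(PR_SET_PDEATHSIG, signum) != 0) {
    return false;
  }

  /* If the parent died before the prctl call took effect, no signal will
   * ever arrive, but this process has been reparented by now. */
  return getppid() == expected_parent;
}
```

Comparing against a known PID rather than testing `getppid() == 1` is a deliberate choice in this sketch: when a subreaper is in use (see `PR_SET_CHILD_SUBREAPER`), an orphan is adopted by that process and not by PID 1.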
>>>>
>>>> Kind regards,
>>>> Adam Wight
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "elixir-lang-core" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to elixir-lang-co...@googlegroups.com.
>>>> To view this discussion visit
>>>> https://groups.google.com/d/msgid/elixir-lang-core/CAF56aJK-R_gyTSLBmYE%3DsWMOHMgZyujAJZBO6sHO4x2tekC41w%40mail.gmail.com