While writing a library to integrate with an indivisibly long-running, external program (rsync), I came across the problem described in https://hexdocs.pm/elixir/Port.html#module-zombie-operating-system-processes and I think there may be some fundamental mistakes in the advice given there.
The analysis in the Port documentation says that a polite application will detect when its stdio communication pipes are closed and will then terminate itself. That this often holds seems to be accidental: it rests on the empirical observation that most applications do some sort of I/O, so when one of the standard file descriptors closes, the application encounters a read or write error and stops. However, there are plenty of applications which can and should continue beyond this condition, and there's even a utility, `nohup(1)`, for exactly the purpose of allowing applications to ignore problems with stdio, for example when they're launched and backgrounded from an interactive terminal that will be closed. An example of a utility which does no I/O and therefore ignores the status of its stdio file descriptors by default is `sleep(1)`, and I don't think it would be correct to make it stop because stdio is closed.

Running it under Elixir provides a good demonstration of the problem we're looking at here:

    elixir -e 'System.cmd(System.find_executable("sleep"), ["60"])'

Start that command, then kill the BEAM and look for the sleep process. It should still be running, in process state "Ss".

This also demonstrates a second problem with the Port documentation: the condition we're dealing with is an "orphan process", one that is still running but is no longer associated with a BEAM parent and can no longer be controlled or communicated with by Elixir. Orphans are a bigger issue than "zombie processes", which have already terminated and show up in `ps` output in state "Z", because an orphan can still cause side effects and consume resources.
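To illustrate the first point above, that noticing a closed pipe takes deliberate effort from the program, here is a minimal C sketch of the kind of check a "polite" application would have to perform. The helper names `peer_hung_up` and `stdin_closed` are hypothetical, and the sketch assumes stdin is a pipe on Linux, where `poll(2)` reports `POLLHUP` on the read end once every writer has closed. Other descriptor types may behave differently.

```c
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: asks poll() whether the other end of a pipe is gone.
 * Returns 1 if a hang-up is reported on fd, 0 if not, -1 if poll() fails.
 * Note that a hang-up can be reported while unread data is still buffered.
 */
int peer_hung_up(int fd) {
  struct pollfd pfd;
  pfd.fd = fd;
  pfd.events = POLLIN;
  pfd.revents = 0;

  /* Timeout 0: report the current state without blocking. */
  if (poll(&pfd, 1, 0) < 0) {
    return -1;
  }
  return (pfd.revents & POLLHUP) ? 1 : 0;
}

/* Hypothetical convenience wrapper for the stdin case discussed above. */
int stdin_closed(void) {
  return peer_hung_up(STDIN_FILENO);
}
```

A program would have to call something like `stdin_closed()` periodically, or read from stdin and watch for end-of-file, to notice that the BEAM has gone away. A utility like sleep never looks at its stdio at all, which is why it keeps running.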
Some helpful Internet posts led me to what I believe is the correct way to prevent an orphan child process: call it through an intermediate application similar to the one suggested by the Port docs, but using `prctl(2)` instead. This lets the intermediate ask to be signaled when its parent process (the BEAM) terminates, so that it can then kill its child. The code below still has a small race condition on launch, but I'll share it anyway:

```c
#define _XOPEN_SOURCE 700

#include <signal.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/wait.h>
#include <unistd.h>

// Written in main() and read in the signal handler, hence volatile.
static volatile pid_t child_pid;

void handle_signal(int signum) {
  if (signum == SIGHUP && child_pid > 0) {
    kill(child_pid, SIGKILL);
  }
}

int main(int argc, char* argv[]) {
  if (argc < 2) {
    // No command to run.
    return EXIT_FAILURE;
  }

  // Send this process a HUP if the parent BEAM VM dies.
  // FIXME: race condition until this line, if the parent is already dead.
  prctl(PR_SET_PDEATHSIG, SIGHUP);

  // Listen for HUP and respond by killing the child process.
  struct sigaction action;
  action.sa_handler = handle_signal;
  action.sa_flags = 0;
  sigemptyset(&action.sa_mask);
  sigaction(SIGHUP, &action, NULL);

  child_pid = fork();
  if (child_pid < 0) {
    // fork failed, there is no child to wait for.
    return EXIT_FAILURE;
  }
  if (child_pid == 0) {
    // Shift argv by one so the child sees its own path as argv[0].
    // argv[argc] is NULL, so the shifted array is still NULL-terminated.
    execv(argv[1], &argv[1]);
    // Only reached if execv failed.
    _exit(127);
  } else {
    waitpid(child_pid, NULL, 0);
  }
  return 0;
}
```

To try it out, save as main.c and compile like so:

    cc -g -O3 -std=c99 -pedantic -o parent-monitor main.c

This tool shifts its argv by one to construct the child process, so from the command line it would be called like `parent-monitor /usr/bin/sleep 60`. To exercise a BEAM crash you can call it like so (after adjusting the paths for your system):

    elixir -e 'System.cmd("./parent-monitor", ["/usr/bin/sleep", "60"])'

Now you can see that the sleep process is killed as soon as the VM is stopped.
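On the race condition flagged in the FIXME above: one commonly used way to narrow the window is to record `getppid()` as early as possible and compare it again right after the `prctl` call. The sketch below is an assumption about how that could look, not something tested against the BEAM, and `arm_parent_death_signal` is a hypothetical helper name.

```c
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of the wrapper above.
 * Arms the parent-death signal, then checks that the parent is still the
 * process the caller saw earlier. If the parent exited in between, this
 * process has been reparented and getppid() no longer matches.
 * Returns 0 when the signal is armed and the parent is unchanged, else -1.
 */
int arm_parent_death_signal(pid_t expected_parent, int signum) {
  if (prctl(PR_SET_PDEATHSIG, (unsigned long)signum) != 0) {
    return -1;
  }
  if (getppid() != expected_parent) {
    /* The parent died before the signal was armed. */
    return -1;
  }
  return 0;
}
```

In the wrapper, this could be used by storing `getppid()` on the first line of `main` and calling `arm_parent_death_signal(parent, SIGHUP)` in place of the bare `prctl` call, exiting without forking if it returns -1. This only narrows the window rather than closing it: a parent that died before the first `getppid()` call would still go unnoticed.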
Although it's a fringe issue since the BEAM is normally stopped only during development or when deploying new code, I feel like it could be useful to bundle the behavior into Elixir or Erlang itself. This could be seen as an elegant extension of the OTP supervisor tree principle beyond the VM boundary, it seems to have some real-world consequences, and it's an obscure problem for an application developer to solve from scratch each time.

There's one more use case to mention, that such a behavior should probably be made optional, maybe as a flag to Port. I would imagine the default should be to use the wrapper, so the implicit option might look like `allow_orphan: false`. The rare case where we omit the wrapper would be when it's more useful to let the child continue than to maintain control over it.

Kind regards,
Adam Wight