For visibility, here are the upstream discussion and patch:
* https://erlangforums.com/t/open-port-and-zombie-processes/3111/5
* https://github.com/erlang/otp/pull/9453

My biggest question for that group (and here as well!) is whether not 
terminating child processes was a bug or a feature :-D  I feel that it's a 
big change in behavior, but very much in a corner case where the behavior 
was unspecified.

For comparison, the erlexec library always kills its spawned processes, and 
in fact it offers a range of killing methods: kill only the direct child 
process, kill the process group, kill with a custom command, or wait some 
amount of time between sending SIGTERM and SIGKILL.  As far as I can see 
there is no option to leave the spawned processes running, which, given 57M 
total downloads for erlexec and its many pro-kill options, is a strong hint 
that nobody wants no-kill.
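
As a rough illustration of the process-group kill and the SIGTERM-then-SIGKILL 
escalation listed above, here is a minimal C sketch.  It is not erlexec's 
implementation; `terminate_group` is a made-up name, and it assumes the child 
was placed in its own process group with setpgid() so that its pid doubles as 
the group id:

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper, not taken from erlexec or OTP.
 * Sends SIGTERM to the child's process group, polls for up to grace_ms
 * milliseconds, then falls back to SIGKILL.
 * Assumption: the child is the leader of its own process group.
 * Returns the signal that terminated the child, 0 if it exited on its own,
 * or -1 if the child could not be reaped. */
int terminate_group(pid_t child, int grace_ms) {
  const struct timespec tick = {0, 10 * 1000 * 1000}; /* poll every 10 ms */
  int status = 0;
  pid_t reaped = 0;

  kill(-child, SIGTERM); /* a negative pid targets the whole group */

  for (int waited = 0; reaped == 0 && waited < grace_ms; waited += 10) {
    reaped = waitpid(child, &status, WNOHANG);
    if (reaped == 0) {
      nanosleep(&tick, NULL);
    }
  }

  if (reaped == 0) {
    kill(-child, SIGKILL); /* grace period expired */
    reaped = waitpid(child, &status, 0);
  }

  if (reaped != child) {
    return -1;
  }
  return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

A caller would fork, call setpgid(pid, pid) in both parent and child to avoid 
a startup race, and later call `terminate_group(pid, 5000)` to allow a 
five-second grace period.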

On Monday, February 17, 2025 at 9:47:48 AM UTC+1 Adam Wight wrote:

> Happily, I see that Erlang/OTP switched to using a single process 
> "erl_child_setup" to fork all open_port children for performance reasons, 
> some time around OTP-13087.  In theory this makes it simple to stop all 
> children when the parent terminates because there's already an intermediate 
> process which can clean up its own children before exit, and it means 
> there's already some indirection of signals and file descriptors which I 
> expect will simplify reasoning about potential edge cases.
>
> Thanks, I'll follow this upstream!  I've also submitted a small 
> documentation patch to Elixir.
>
> -Adam
>
> On Sunday, February 16, 2025 at 10:00:57 AM UTC+1 José Valim wrote:
>
>> Thank you for the proposal Adam. In this case, the proposal has to be 
>> sent upstream to Erlang, as Elixir simply delegates the Port functionality 
>> to the Erlang VM.
>>
>>
>> *José Valim*
>> https://dashbit.co/
>>
>>
>> On Sun, Feb 16, 2025 at 5:46 AM Adam Wight <adam.m...@gmail.com> wrote:
>>
>>> While writing a library to integrate with an indivisibly long-running, 
>>> external program (rsync), I came across the problem described in 
>>> https://hexdocs.pm/elixir/Port.html#module-zombie-operating-system-processes 
>>> and I think there may be some fundamental mistakes in the advice given 
>>> there.
>>>
>>> Our analysis in the Port documentation says that a polite application 
>>> will detect when its stdio communication pipes are closed and will then 
>>> terminate itself.  The fact that this is the case seems to be accidental, 
>>> and is based on an empirical observation that most applications do some 
>>> sort of I/O, so when one of the standard file descriptors closes the 
>>> application will encounter a read or write error and will stop.  However, 
>>> there are plenty of applications which can and should continue beyond this 
>>> condition, and there's even a utility `nohup(1)` for exactly the purpose of 
>>> allowing applications to ignore problems with stdio, for example when 
>>> they're launched and backgrounded from an interactive terminal that will be 
>>> closed.
>>>
>>> An example of a utility which does no I/O and therefore ignores stdio 
>>> file descriptor statuses by default is `sleep(1)`, and I don't think it 
>>> would be correct to make it stop because stdio is closed.  Running it under 
>>> Elixir provides a good demonstration of the problem we're looking at here:
>>>
>>>     elixir -e 'System.cmd(System.find_executable("sleep"), ["60"])'
>>>
>>> Start that command and then kill the BEAM, and look for the sleep 
>>> process.  It should still be running, in process state "Ss".
>>>
>>> This also demonstrates a second problem with the Port documentation: 
>>> the condition we're dealing with is an "orphan process", which is still 
>>> running but is now unassociated with a BEAM parent and can no longer be 
>>> controlled or communicated with by Elixir.  Orphans are a bigger issue than 
>>> "zombie processes", which have already terminated and will show up in `ps` 
>>> output in state "Z", because an orphan can still cause side-effects and 
>>> consume resources.
>>>
>>> Some helpful Internet posts led me to what I believe is the correct way 
>>> to prevent an orphan child process, by calling it through an intermediate 
>>> application similar to the one suggested by Port docs but using `prctl(2)` 
>>> instead, which allows the intermediate to monitor the parent process (the 
>>> BEAM) and kill its child if the parent is terminated.  The code below still 
>>> has a small race condition on launch, but I'll share it anyway:
>>>
>>> ```c
>>> #define _XOPEN_SOURCE 700
>>> #include <signal.h>
>>> #include <stddef.h>
>>> #include <stdlib.h>
>>> #include <sys/prctl.h>
>>> #include <sys/wait.h>
>>> #include <unistd.h>
>>>
>>> pid_t child_pid;
>>>
>>> void handle_signal(int signum) {
>>>   if (signum == SIGHUP && child_pid > 0) {
>>>     kill(child_pid, SIGKILL);
>>>   }
>>> }
>>>
>>> int main(int argc, char* argv[]) {
>>>   // Send this process a HUP if the parent BEAM VM dies.
>>>   // FIXME: race condition until this line, if the parent is already dead.
>>>   prctl(PR_SET_PDEATHSIG, SIGHUP);
>>>
>>>   // Listen for HUP and respond by killing the child process.
>>>   struct sigaction action;
>>>   action.sa_handler = handle_signal;
>>>   action.sa_flags = 0;
>>>   sigemptyset(&action.sa_mask);
>>>   sigaction(SIGHUP, &action, NULL);
>>>
>>>   child_pid = fork();
>>>   if (child_pid == 0) {
>>>     const char* command = argv[1];
>>>     for (int i = 0; i < argc; i++) {
>>>       argv[i] = argv[i + 1];
>>>     }
>>>     execv(command, argv);
>>>     // Only reached if execv failed.
>>>     _exit(127);
>>>   } else {
>>>     waitpid(child_pid, NULL, 0);
>>>   }
>>>
>>>   return 0;
>>> }
>>> ```
>>>
>>> To try it out, save as main.c and compile like so:
>>>
>>>     cc -g -O3 -std=c99 -pedantic -o parent-monitor main.c 
>>>
>>> This tool shifts its argv by one to construct the child process, so from 
>>> the command line it would be called like `parent-monitor /usr/bin/sleep 60` 
>>> and to exercise a BEAM crash you can call it like so (after adjusting the 
>>> paths for your system):
>>>
>>>     elixir -e 'System.cmd("./parent-monitor", ["/usr/bin/sleep", "60"])'
>>>
>>> Now you can see the sleep process is killed as soon as the VM is stopped.
>>>
>>> Although it's a fringe issue since the BEAM is normally stopped only 
>>> during development or when deploying new code, I feel like it could be 
>>> useful to bundle the behavior into Elixir or Erlang itself.  This could be 
>>> seen as an elegant extension of the OTP supervisor tree principle beyond 
>>> the VM boundary, it seems to have some real-world consequences, and it's an 
>>> obscure problem for an application developer to solve from scratch each 
>>> time.
>>>
>>> There's one more use case to mention, that such a behavior should 
>>> probably be made optional, maybe as a flag to Port.  I would imagine the 
>>> default should be to use the wrapper, so the implicit option might look 
>>> like `allow_orphan: false`.  The rare case where we omit the wrapper would 
>>> be when it's more useful to let the child continue than to maintain control 
>>> over it.
>>>
>>> Kind regards,
>>> Adam Wight
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "elixir-lang-core" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to elixir-lang-co...@googlegroups.com.
>>> To view this discussion visit 
>>> https://groups.google.com/d/msgid/elixir-lang-core/CAF56aJK-R_gyTSLBmYE%3DsWMOHMgZyujAJZBO6sHO4x2tekC41w%40mail.gmail.com
>>>
>>

