Fantastic work, thank you very much for carrying this torch!

*José Valim*
https://dashbit.co/


On Thu, Feb 20, 2025 at 8:18 AM Adam Wight <adam.m.wi...@gmail.com> wrote:

> For visibility, here are the upstream discussion and patch:
> * https://erlangforums.com/t/open-port-and-zombie-processes/3111/5
> * https://github.com/erlang/otp/pull/9453
>
> My biggest question for that group (and here as well!) is whether not
> terminating child processes was a bug or a feature :-D — I feel that it's a
> big change to behavior but very much in a corner case, where behavior was
> unspecified.
>
> For comparison, the erlexec library always kills its spawned processes,
> and it offers a range of killing methods: kill only the direct child
> process, kill the process group, kill with a custom command, or wait some
> amount of time between SIGTERM and SIGKILL.  As far as I can see there is
> no option to leave the spawned processes running.  With 57M total
> downloads for erlexec and so many pro-kill options, that is a strong hint
> that nobody wants no-kill.
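The last of those methods, SIGTERM followed by SIGKILL after a grace period, can be sketched in C roughly as follows. This is a minimal illustration and not erlexec's implementation: `stop_child` and `spawn_pausing_child` are hypothetical names, and the 10 ms polling step is an arbitrary choice.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: send SIGTERM to a direct child, poll for up to
 * grace_ms milliseconds, then escalate to SIGKILL. The child is reaped
 * before returning. Returns SIGTERM if the child was gone within the grace
 * period, SIGKILL if the escalation was sent, or -1 on error. */
int stop_child(pid_t pid, int grace_ms) {
  if (pid <= 0) return -1;  /* never signal a process group by accident */
  if (kill(pid, SIGTERM) != 0) return -1;

  const struct timespec step = {0, 10 * 1000 * 1000}; /* 10 ms */
  for (int waited = 0; waited < grace_ms; waited += 10) {
    pid_t r = waitpid(pid, NULL, WNOHANG);
    if (r == pid) return SIGTERM;
    if (r < 0 && errno != EINTR) return -1;
    nanosleep(&step, NULL);
  }

  if (kill(pid, SIGKILL) != 0) return -1;
  while (waitpid(pid, NULL, 0) < 0) {
    if (errno != EINTR) return -1;
  }
  return SIGKILL;
}

/* Demo helper (also hypothetical): fork a child that only pauses. The
 * SIGTERM disposition is set before fork() so the child inherits it
 * without a race, then restored in the parent. */
pid_t spawn_pausing_child(int ignore_sigterm) {
  struct sigaction sa, old;
  sa.sa_handler = ignore_sigterm ? SIG_IGN : SIG_DFL;
  sa.sa_flags = 0;
  sigemptyset(&sa.sa_mask);
  sigaction(SIGTERM, &sa, &old);

  pid_t pid = fork();
  if (pid == 0) {
    for (;;) pause();
  }
  sigaction(SIGTERM, &old, NULL);
  return pid;
}
```

A child with the default SIGTERM disposition should be stopped by the first signal, while one that ignores SIGTERM is only stopped by the SIGKILL fallback.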
> On Monday, February 17, 2025 at 9:47:48 AM UTC+1 Adam Wight wrote:
>
>> Happily, I see that Erlang/OTP switched to using a single process
>> "erl_child_setup" to fork all open_port children for performance reasons,
>> some time around OTP-13087.  In theory this makes it simple to stop all
>> children when the parent terminates because there's already an intermediate
>> process which can clean up its own children before exit, and it means
>> there's already some indirection of signals and file descriptors which I
>> expect will simplify reasoning about potential edge cases.
>>
>> Thanks, I'll follow this upstream!  I've also submitted a small
>> documentation patch to Elixir.
>>
>> -Adam
>>
>> On Sunday, February 16, 2025 at 10:00:57 AM UTC+1 José Valim wrote:
>>
>>> Thank you for the proposal Adam. In this case, the proposal has to be
>>> sent upstream to Erlang, as Elixir simply delegates the Port functionality
>>> to the Erlang VM.
>>>
>>>
>>> *José Valim*
>>> https://dashbit.co/
>>>
>>>
>>> On Sun, Feb 16, 2025 at 5:46 AM Adam Wight <adam.m...@gmail.com> wrote:
>>>
>>>> While writing a library to integrate with an indivisibly long-running,
>>>> external program (rsync), I came across the problem described in
>>>> https://hexdocs.pm/elixir/Port.html#module-zombie-operating-system-processes
>>>> and I think there may be some fundamental mistakes in the advice given
>>>> there.
>>>>
>>>> Our analysis in the Port documentation says that a polite application
>>>> will detect when its stdio communication pipes are closed and will then
>>>> terminate itself.  The fact that this is the case seems to be accidental,
>>>> and is based on an empirical observation that most applications do some
>>>> sort of I/O, so when one of the standard file descriptors closes the
>>>> application will encounter a read or write error and will stop.  However,
>>>> there are plenty of applications which can and should continue beyond this
>>>> condition, and there's even a utility `nohup(1)` for exactly the purpose of
>>>> allowing applications to ignore problems with stdio, for example when
>>>> they're launched and backgrounded from an interactive terminal that will be
>>>> closed.
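The "polite application" behaviour described above amounts to a loop like the following sketch, which a program has to opt into by actually reading from stdin. `wait_for_eof` is a hypothetical helper name; a program that never performs such a read, or that ignores the result, will simply keep running after its pipes are closed.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: block until fd reports end-of-file, discarding any
 * data that arrives in the meantime. Returns 0 on EOF, -1 on a read error. */
int wait_for_eof(int fd) {
  char buf[256];
  for (;;) {
    ssize_t n = read(fd, buf, sizeof buf);
    if (n == 0) return 0;                    /* the writer closed the pipe */
    if (n < 0 && errno != EINTR) return -1;  /* real error, e.g. EBADF */
  }
}
```

A program would typically call `wait_for_eof(STDIN_FILENO)` from a dedicated thread or fold the read into its event loop, and exit once it returns.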
>>>>
>>>> An example of a utility which does no I/O and therefore ignores stdio
>>>> file descriptor statuses by default is `sleep(1)`, and I don't think it
>>>> would be correct to make it stop because stdio is closed.  Running it under
>>>> Elixir provides a good demonstration of the problem we're looking at here:
>>>>
>>>>     elixir -e 'System.cmd(System.find_executable("sleep"), ["60"])'
>>>>
>>>> Start that command and then kill the BEAM, and look for the sleep
>>>> process.  It should still be running, in process state "Ss".
>>>>
>>>> This will also demonstrate a second problem with the Port
>>>> documentation, that the condition we're dealing with is an "orphan process"
>>>> which is still running but is now unassociated with a BEAM parent and can
>>>> no longer be controlled or communicated with by Elixir.  Orphans are a
>>>> bigger issue than "zombie processes", which have already terminated and
>>>> will show up in `ps` output in state "Z", because an orphan can still cause
>>>> side-effects and consume resources.
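On Linux the distinction is visible in the state field of `/proc/<pid>/stat`. The sketch below reads that field and, for demonstration, creates a short-lived zombie. `process_state` and `state_of_unreaped_child` are hypothetical helper names, and the approach assumes a Linux procfs.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: return the one-letter state from /proc/<pid>/stat
 * ('R', 'S', 'Z', ...), or '?' if it cannot be read. */
char process_state(pid_t pid) {
  char path[64], buf[512];
  snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
  FILE *f = fopen(path, "r");
  if (!f) return '?';
  size_t n = fread(buf, 1, sizeof buf - 1, f);
  fclose(f);
  buf[n] = '\0';

  /* Format is "pid (comm) state ..."; comm may itself contain ')' so
   * search from the right. */
  char *close_paren = strrchr(buf, ')');
  if (!close_paren || close_paren[1] != ' ' || close_paren[2] == '\0') {
    return '?';
  }
  return close_paren[2];
}

/* Demo: fork a child that exits at once, wait for the exit without reaping
 * it (WNOWAIT), read its state, then reap it. */
char state_of_unreaped_child(void) {
  pid_t pid = fork();
  if (pid < 0) return '?';
  if (pid == 0) _exit(0);

  siginfo_t info;
  if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) != 0) return '?';
  char state = process_state(pid);
  waitpid(pid, NULL, 0);
  return state;
}
```

An exited but unreaped child reports `Z`. An orphan, by contrast, keeps an ordinary state such as `S` or `R`; only its parent pid changes.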
>>>>
>>>> Some helpful Internet posts led me to what I believe is the correct way
>>>> to prevent an orphan child process, by calling it through an intermediate
>>>> application similar to the one suggested by Port docs but using `prctl(2)`
>>>> instead, which allows the intermediate to monitor the parent process (the
>>>> BEAM) and kill its child if the parent is terminated.  The code below still
>>>> has a small race condition on launch, but I'll share it anyway:
>>>>
>>>> ```c
>>>> #define _XOPEN_SOURCE 700
>>>> #include <signal.h>
>>>> #include <stddef.h>
>>>> #include <stdlib.h>
>>>> #include <sys/prctl.h>
>>>> #include <sys/wait.h>
>>>> #include <unistd.h>
>>>>
>>>> pid_t child_pid;
>>>>
>>>> void handle_signal(int signum) {
>>>>   if (signum == SIGHUP && child_pid > 0) {
>>>>     kill(child_pid, SIGKILL);
>>>>   }
>>>> }
>>>>
>>>> int main(int argc, char* argv[]) {
>>>>   // Send this process a HUP if the parent BEAM VM dies.
>>>>   // FIXME: race condition until this line, if the parent is already dead.
>>>>   prctl(PR_SET_PDEATHSIG, SIGHUP);
>>>>
>>>>   // Listen for HUP and respond by killing the child process.
>>>>   struct sigaction action;
>>>>   action.sa_handler = handle_signal;
>>>>   action.sa_flags = 0;
>>>>   sigemptyset(&action.sa_mask);
>>>>   sigaction(SIGHUP, &action, NULL);
>>>>
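>>>>   if (argc < 2) {
>>>>     return 2;  // usage: parent-monitor /path/to/command [args...]
>>>>   }
>>>>
>>>>   child_pid = fork();
>>>>   if (child_pid < 0) {
>>>>     return 1;  // fork failed
>>>>   }
>>>>   if (child_pid == 0) {
>>>>     // Shift argv left by one so the command becomes argv[0];
>>>>     // argv[argc] is NULL, so the shifted array stays NULL-terminated.
>>>>     const char* command = argv[1];
>>>>     for (int i = 0; i < argc; i++) {
>>>>       argv[i] = argv[i + 1];
>>>>     }
>>>>     execv(command, argv);
>>>>     _exit(127);  // only reached if execv failed
>>>>   } else {
>>>>     waitpid(child_pid, NULL, 0);
>>>>   }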
>>>>
>>>>   return 0;
>>>> }
>>>> ```
>>>>
>>>> To try it out, save as main.c and compile like so:
>>>>
>>>>     cc -g -O3 -std=c99 -pedantic -o parent-monitor main.c
>>>>
>>>> This tool shifts its argv by one to construct the child process, so
>>>> from the command line it would be called like `parent-monitor
>>>> /usr/bin/sleep 60` and to exercise a BEAM crash you can call it like so
>>>> (after adjusting the paths for your system):
>>>>
>>>>     elixir -e 'System.cmd("./parent-monitor", ["/usr/bin/sleep", "60"])'
>>>>
>>>> Now you can see the sleep process is killed as soon as the VM is
>>>> stopped.
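One common way to narrow the launch race noted in the FIXME comment is to re-check the parent immediately after arming the signal. The following is a minimal sketch, assuming the launcher can somehow tell the wrapper which pid to expect as its parent (that hand-off is hypothetical and not shown). `arm_parent_death_signal` is an illustrative name, and `PR_SET_PDEATHSIG` is Linux-specific.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel to deliver signum when the parent
 * dies, then verify the parent is still the one we expected. If the
 * original parent exited before prctl() ran, this process has already been
 * reparented and getppid() no longer matches.
 * Returns 0 if armed and the parent matches, 1 if the parent has changed,
 * -1 if prctl() failed. */
int arm_parent_death_signal(int signum, pid_t expected_parent) {
  if (prctl(PR_SET_PDEATHSIG, (unsigned long)signum) != 0) {
    return -1;
  }
  if (getppid() != expected_parent) {
    return 1;  /* too late: the caller should clean up and exit */
  }
  return 0;
}
```

In the wrapper this would replace the bare `prctl` call, with a return value of 1 treated the same way as receiving the HUP.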
>>>>
>>>> Although it's a fringe issue since the BEAM is normally stopped only
>>>> during development or when deploying new code, I feel like it could be
>>>> useful to bundle the behavior into Elixir or Erlang itself.  This could be
>>>> seen as an elegant extension of the OTP supervisor tree principle beyond
>>>> the VM boundary, it seems to have some real-world consequences, and it's an
>>>> obscure problem for an application developer to solve from scratch each
>>>> time.
>>>>
>>>> There's one more use case to mention, that such a behavior should
>>>> probably be made optional, maybe as a flag to Port.  I would imagine the
>>>> default should be to use the wrapper, so the implicit option might look
>>>> like `allow_orphan: false`.  The rare case where we omit the wrapper would
>>>> be when it's more useful to let the child continue than to maintain control
>>>> over it.
>>>>
>>>> Kind regards,
>>>> Adam Wight
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "elixir-lang-core" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to elixir-lang-co...@googlegroups.com.
>>>> To view this discussion visit
>>>> https://groups.google.com/d/msgid/elixir-lang-core/CAF56aJK-R_gyTSLBmYE%3DsWMOHMgZyujAJZBO6sHO4x2tekC41w%40mail.gmail.com
>>>> <https://groups.google.com/d/msgid/elixir-lang-core/CAF56aJK-R_gyTSLBmYE%3DsWMOHMgZyujAJZBO6sHO4x2tekC41w%40mail.gmail.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>

