(this is the same issue discussed in
https://cygwin.com/pipermail/cygwin-patches/2024q1/012621.html)

On MSYS2, and only when running on Windows on ARM64, we've been plagued
by processes hanging, most often pacman while it is validating signatures
with gpgme. When a process hangs in this way, no debugger seems to be
able to attach properly.

After many months of off-and-on debugging, we've *finally* got an idea of
what behavior is causing this, and a standalone reproducer that runs on
Cygwin.

> A common symptom is that the hanging process has a command-line that is
> identical to its parent process' command-line (indicating that it has
> been fork()ed), and anecdotally, the hang occurs when _exit() calls
> proc_terminate() which is then blocked by a call to TerminateThread()
> with an invalid thread handle (for more details, see
> https://github.com/msys2/msys2-autobuild/issues/62#issuecomment-1951796327).
>
> In my tests, I found that the hanging process is spawned from
> _gpgme_io_spawn() which lets the child process immediately spawn another
> child. That seems like a fantastic way to find timing-related bugs in
> the MSYS2/Cygwin runtime.
>
> As a work-around, it does seem to help if we avoid that double-fork.

That led me to make the attached reproducer, which is based on the code
from _gpgme_io_spawn. I originally expected that this would require some
timing adjustment, hence the defines to change the binary and argument (I
expected to use /bin/sleep and different values). It turns out that this
reproduces readily with /bin/true.

I build this with `gcc -ggdb -o testfork testfork.c`, and this reproduces:
* on a Raspberry Pi 4 running Windows 10, with an i686 msys2 runtime
* on a QC710 running Windows 11 23H2, with x86_64 msys2 runtime (this
  seems to reproduce it most readily)
* on a Hyper-V virtual machine on Dev Kit 2023 running Windows 11 23H2,
  with x86_64 msys2 runtime or Cygwin 3.5.3. This seems to require
  running two instances of testfork.exe at the same time.

When attaching to the hung process, gdb shows:

    (gdb) i thr
      Id   Target Id            Frame
      1    Thread 6516.0xbe8    error return
    /cygdrive/d/a/scallywag/gdb/gdb-13.2-1.x86_64/src/gdb-13.2/gdb/windows-nat.c:748
    was 31: A device attached to the system is not functioning.
    0x0000000000000000 in ?? ()
      2    Thread 6516.0x1b28 "sig" 0x00007ff8051a8a64 in ?? ()
    * 3    Thread 6516.0x12b4   0x00007ff8051b4374 in ?? ()

Let me know if I can provide any additional info, or anything else we can
try to help debug this.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef BINARY
#define BINARY "/bin/true"
#endif
#ifndef ARG
#define ARG "0.1"
#endif

int main(int argc, char ** argv)
{
    while (1)
    {
        int pid;
        printf("Starting group of 100x " BINARY " " ARG "\n");
        for (int i = 0; i < 100; ++i)
        {
            pid = fork();
            if (pid == -1)
            {
                perror("fork error");
                return 1;
            }
            else if (pid == 0)
            {
                /* Intermediate child: fork again right away (the double
                   fork modelled on _gpgme_io_spawn). */
                if ((pid = fork()) == 0)
                {
                    /* Grandchild: exec the target binary. */
                    char * const args[] = {BINARY, ARG, NULL};
                    execv(BINARY, args);
                    perror("execv failed");
                    _exit(5);
                }
                if (pid == -1)
                {
                    perror("inner fork error");
                    _exit(1);
                }
                else
                {
                    /* Exit without waiting for the grandchild. */
                    _exit(0);
                }
            }
            else
            {
                /* Parent: wait only for the intermediate child. */
                int status;
                if (waitpid(pid, &status, 0) == -1)
                {
                    perror("waitpid error");
                    return 2;
                }
                else if (status != 0)
                {
                    fprintf(stderr, "subprocess exited non-zero: %d\n", status);
                    return WEXITSTATUS(status);
                }
            }
        }
    }
    return 0;
}