From: Jeremy Drake <cyg...@jdrake.com>

This addresses an extremely difficult to debug deadlock when running
under emulation on ARM64.

A relatively easy way to trigger this bug is to call `fork()`, then within the
child process immediately call another `fork()` and then `exit()` the
intermediate process.

It would seem that there is a "code emulation" lock on the wait thread at
this stage, and if the thread is terminated too early, that lock still exists
albeit without a thread, and nothing moves forward.

It seems that a `SuspendThread()` combined with a `GetThreadContext()`
(to force the thread to _actually_ be suspended, for more details see
https://devblogs.microsoft.com/oldnewthing/20150205-00/?p=44743)
makes sure the thread is "booted" from emulation before it is suspended.

Hopefully this means it won't be holding any locks or otherwise leave
emulation in a bad state when the thread is terminated.

Also, attempt to use `CancelSynchonousIo()` (as seen in `flock.cc`) to avoid
the need for `TerminateThread()` altogether.  This doesn't always work,
however, so was not a complete fix for the deadlock issue.

Addresses: https://cygwin.com/pipermail/cygwin-developers/2024-May/012694.html
Signed-off-by: Jeremy Drake <cyg...@jdrake.com>
---

v2: suppress output for a now-expected error in `proc_waiter` due to the
addition of the CancelSynchronousIo call in `proc_terminate`

v3: address review comments (add comments, change nested ifs to &&
conditons), and add CancelSynchronousIo call to try to avoid another
terminate_thread call that I happened to notice.

(I searched for other callers of terminate_thread after this, and the only
one left without CancelSynchronousIo is in ldap.cc and I'm pretty sure
that's a "can't happen" case.)


 winsup/cygwin/cygthread.cc | 14 ++++++++++++++
 winsup/cygwin/pinfo.cc     | 10 +++++++---
 winsup/cygwin/sigproc.cc   | 12 ++++++++++--
 3 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/winsup/cygwin/cygthread.cc b/winsup/cygwin/cygthread.cc
index 54918e7677..4f16097531 100644
--- a/winsup/cygwin/cygthread.cc
+++ b/winsup/cygwin/cygthread.cc
@@ -302,6 +302,20 @@ cygthread::terminate_thread ()
   if (!inuse)
     goto force_notterminated;

+  if (_my_tls._ctinfo != this)
+    {
+      CONTEXT context;
+      context.ContextFlags = CONTEXT_CONTROL;
+      /* SuspendThread makes sure a thread is "booted" from emulation before
+        it is suspended.  As such, the emulator hopefully won't be in a bad
+        state (aka, holding any locks) when the thread is terminated. */
+      SuspendThread (h);
+      /* We need to call GetThreadContext, even though we don't care about the
+        context, because SuspendThread is asynchronous and GetThreadContext
+        will make sure the thread is *really* suspended before returning */
+      GetThreadContext (h, &context);
+    }
+
   TerminateThread (h, 0);
   WaitForSingleObject (h, INFINITE);
   CloseHandle (h);
diff --git a/winsup/cygwin/pinfo.cc b/winsup/cygwin/pinfo.cc
index e31a67d8f4..f1db715951 100644
--- a/winsup/cygwin/pinfo.cc
+++ b/winsup/cygwin/pinfo.cc
@@ -1252,13 +1252,17 @@ proc_waiter (void *arg)

   for (;;)
     {
-      DWORD nb;
+      DWORD nb, err;
       char buf = '\0';

       if (!ReadFile (vchild.rd_proc_pipe, &buf, 1, &nb, NULL)
-         && GetLastError () != ERROR_BROKEN_PIPE)
+         && (err = GetLastError ()) != ERROR_BROKEN_PIPE)
        {
-         system_printf ("error on read of child wait pipe %p, %E", 
vchild.rd_proc_pipe);
+         /* ERROR_OPERATION_ABORTED is expected due to the possibility that
+            CancelSynchronousIo interruped the ReadFile call, so don't output
+            that error */
+         if (err != ERROR_OPERATION_ABORTED)
+           system_printf ("error on read of child wait pipe %p, %E", 
vchild.rd_proc_pipe);
          break;
        }

diff --git a/winsup/cygwin/sigproc.cc b/winsup/cygwin/sigproc.cc
index 81b6c31695..9e20ae6f71 100644
--- a/winsup/cygwin/sigproc.cc
+++ b/winsup/cygwin/sigproc.cc
@@ -409,7 +409,11 @@ proc_terminate ()
             to 1 iff it is a Cygwin process.  */
          if (!have_execed || !have_execed_cygwin)
            chld_procs[i]->ppid = 1;
-         if (chld_procs[i].wait_thread)
+         /* Attempt to exit the wait_thread cleanly via CancelSynchronousIo
+            before falling back to the (explicitly dangerous) cross-thread
+            termination */
+         if (chld_procs[i].wait_thread
+             && !CancelSynchronousIo (chld_procs[i].wait_thread->thread_handle 
()))
            chld_procs[i].wait_thread->terminate_thread ();
          /* Release memory associated with this process unless it is 'myself'.
             'myself' is only in the chld_procs table when we've execed.  We
@@ -1174,7 +1178,11 @@ remove_proc (int ci)
 {
   if (have_execed)
     {
-      if (_my_tls._ctinfo != chld_procs[ci].wait_thread)
+      /* Attempt to exit the wait_thread cleanly via CancelSynchronousIo
+        before falling back to the (explicitly dangerous) cross-thread
+        termination */
+      if (_my_tls._ctinfo != chld_procs[ci].wait_thread
+         && !CancelSynchronousIo (chld_procs[ci].wait_thread->thread_handle 
()))
        chld_procs[ci].wait_thread->terminate_thread ();
     }
   else if (chld_procs[ci] && chld_procs[ci]->exists ())
-- 
2.47.0.windows.2

Reply via email to