On Fri, 17 Jan 2025 18:52:41 +0900
Takashi Yano wrote:
> On Wed, 8 Jan 2025 18:05:53 -0800 (PST)
> Jeremy Drake wrote:
> > On Thu, 9 Jan 2025, Takashi Yano wrote:
> > 
> > > On Wed, 8 Jan 2025 16:48:41 +0100
> > > Corinna Vinschen wrote:
> > > > Does this patch fix Bruno's bash issue as well?
> > >
> > > I'm not sure because it is not reproducible as he said.
> > > I also could not reproduce that.
> > >
> > > However, at least this fixes the issue that Jeremy encountered:
> > > https://cygwin.com/pipermail/cygwin/2024-December/256977.html
> > >
> > > But, even with this patch, Jeremy reported another hang issue
> > > that also is not reproducible:
> > > https://cygwin.com/pipermail/cygwin/2024-December/256987.html
> > 
> > Yes, this patch helped the hangs I was seeing on Windows on ARM64.
> > However, there is still some hang issue in 3.5.5 (which occurs on
> > native x86_64) that is not there in 3.5.4.  Git for Windows' test suite
> > seems to be somewhat reliable at triggering this, but it's hardly a
> > minimal test case ;).
> > 
> > Because of this issue, MSYS2 has been keeping 3.5.5 in its 'staging' state
> > (rather than deploying it to normal users), and Git for Windows rolled
> > back to 3.5.4 before the release of the latest Git RC.
> 
> I might have successfully reproduced this issue. I tried building
> cygwin1.dll repeatedly for some of my machines, and one of them
> hung in fhandler_pipe::raw_read() as lazka's case:
> https://github.com/msys2/msys2-runtime/pull/251#issuecomment-2571338429
> 
> The call:
> L358:         waitret = cygwait (select_sem, select_sem_timeout);
> never returned even with select_sem_timeout == 1 until a signal
> (such as SIGTERM, SIGKILL) arrives. In this situation, attaching
> gdb to the process hanging and issuing 'si' command do not return.
> Something (stack?) seems to be completely broken.
> 
> I'll try to bisect which commit causes this issue. Please wait
> a while.

Done.

This issue also seems to be related to the commit:

commit d243e51ef1d30312ba1e21b4d25a1ca9a8dc1f63
Author: Takashi Yano <takashi.y...@nifty.ne.jp>
Date:   Mon Nov 25 19:51:53 2024 +0900

    Cygwin: signal: Fix deadlock between main thread and sig thread

    Previously, a deadlock happened if many SIGSTOP/SIGCONT signals were
    received rapidly. If the main thread sends __SIGFLUSH at the timing
    when SIGSTOP is handled by the sig thread, but not is handled by the
    main thread yet (sig_handle_tty_stop() not called yet), and if SIGCONT
    is received, the sig thread waits for cygtls::current_sig (is SIGSTOP
    now) cleared. However, the main thread waits for the pack.wakeup using
    WaitForSingleObject(), so the main thread cannot handle SIGSTOP. This
    is the mechanism of the deadlock. This patch uses cygwait() instead of
    WaitForSingleObject() to be able to handle the pending SIGSTOP.

    Addresses: https://cygwin.com/pipermail/cygwin/2024-November/256744.html
    Fixes: 7759daa979c4 ("(sig_send): Fill out sigpacket structure to send to 
signal thread rather than racily sending separate packets.")
    Reported-by: Christian Franke <christian.fra...@t-online.de>
    Reviewed-by: Corinna Vinschen <cori...@vinschen.de>
    Signed-off-by: Takashi Yano <takashi.y...@nifty.ne.jp>

Even though the reason why this issue happens is not clear at all,
I perhaps found the solution for that.

Applying the attached patch:
0003-Cygwin-signal-Do-not-handle-signal-when-__SIGFLUSHFA.patch
instead of previous v2 __SIGFLUSHFAST patch solves the both issues.

However, strangely enough, the similar patch:
ng-0003-Cygwin-signal-Do-not-handle-signal-when-__SIGFLUSHFA.patch
which uses cygwait() instead of WF[SM]O does not solve the issue
Jeremy reported.

The reason is also unclear. What is the difference between cygwait()
and WF[SM]O? I expected both patches work almost the same. The v2
__SIGFLUSHFAST patch also uses cygwait(), so the reason might be
the same (the reason why we should use WF[SM]O rather than cygwait()).

Corinna, any idea? I need some clue.


While debugging this problem, I encountered another hang issue,
which is fixed by:
0001-Cygwin-signal-Avoid-frequent-tls-lock-unlock-for-SIG.patch

If we are confident in the patch 0003, I think we should apply
0001-Cygwin-signal-Avoid-frequent-tls-lock-unlock-for-SIG.patch
0002-Revert-Cygwin-signal-Do-not-handle-signal-when-__SIG.patch
0003-Cygwin-signal-Do-not-handle-signal-when-__SIGFLUSHFA.patch
0004-Revert-Cygwin-signal-Fix-high-load-when-retrying-to-.patch
for main branch and
0001-Cygwin-signal-Avoid-frequent-tls-lock-unlock-for-SIG.patch
0003-Cygwin-signal-Do-not-handle-signal-when-__SIGFLUSHFA.patch
for cygwin-3_5-branch.

Corinna, what do you think?

Jeremy,
could you please apply the attached patches:
0001-Cygwin-signal-Avoid-frequent-tls-lock-unlock-for-SIG.patch
0003-Cygwin-signal-Do-not-handle-signal-when-__SIGFLUSHFA.patch
against cygwin-3_5-branch and test if these fix the issue?

-- 
Takashi Yano <takashi.y...@nifty.ne.jp>
From 0330b182621c0dac851c5c5269af2ccd58060f31 Mon Sep 17 00:00:00 2001
From: Takashi Yano <takashi.y...@nifty.ne.jp>
Date: Sat, 18 Jan 2025 19:03:23 +0900
Subject: [PATCH 1/4] Cygwin: signal: Avoid frequent tls lock/unlock for
 SIGCONT processing

It seems that current _cygtls::handle_SIGCONT() code sometimes falls
into a deadlock due to frequent TLS lock/unlock operation in the
yield() loop. With this patch, the yield() in the wait loop for an
existing signal is placed outside the TLS lock. Conversely, the
yield() in the wait loop for a self-issued SIGCONT is placed inside
the TLS lock and do not unlock every time yield() is called.

Fixes: 9ae51bcc51a7 ("Cygwin: signal: Fix another deadlock between main and sig 
thread")
Reviewed-by:
Signed-off-by: Takashi Yano <takashi.y...@nifty.ne.jp>
---
 winsup/cygwin/exceptions.cc           | 38 +++++++++++----------------
 winsup/cygwin/local_includes/cygtls.h |  2 +-
 2 files changed, 16 insertions(+), 24 deletions(-)

diff --git a/winsup/cygwin/exceptions.cc b/winsup/cygwin/exceptions.cc
index 4dc4be278..a36ef4998 100644
--- a/winsup/cygwin/exceptions.cc
+++ b/winsup/cygwin/exceptions.cc
@@ -1420,7 +1420,7 @@ api_fatal_debug ()
 
 /* Attempt to carefully handle SIGCONT when we are stopped. */
 void
-_cygtls::handle_SIGCONT (threadlist_t * &tl_entry)
+_cygtls::handle_SIGCONT ()
 {
   if (NOTSTATE (myself, PID_STOPPED))
     return;
@@ -1431,28 +1431,24 @@ _cygtls::handle_SIGCONT (threadlist_t * &tl_entry)
      Make sure that any pending signal is handled before trying to
      send a new one.  Then make sure that SIGCONT has been recognized
      before exiting the loop.  */
-  bool sigsent = false;
-  while (1)
-    if (current_sig)   /* Assume that it's ok to just test sig outside of a
-                          lock since setup_handler does it this way.  */
-      {
-       cygheap->unlock_tls (tl_entry);
-       yield ();       /* Attempt to schedule another thread.  */
-       tl_entry = cygheap->find_tls (_main_tls);
-      }
-    else if (sigsent)
-      break;           /* SIGCONT has been recognized by other thread */
-    else
-      {
-       current_sig = SIGCONT;
-       set_signal_arrived (); /* alert sig_handle_tty_stop */
-       sigsent = true;
-      }
+  while (current_sig)
+    yield ();
+
+  threadlist_t *tl_entry = cygheap->find_tls (this);
+
+  current_sig = SIGCONT;
+  set_signal_arrived (); /* alert sig_handle_tty_stop */
+
+  while (current_sig == SIGCONT)
+    yield ();
+
   /* Clear pending stop signals */
   sig_clear (SIGSTOP, false);
   sig_clear (SIGTSTP, false);
   sig_clear (SIGTTIN, false);
   sig_clear (SIGTTOU, false);
+
+  cygheap->unlock_tls (tl_entry);
 }
 
 int
@@ -1479,11 +1475,7 @@ sigpacket::process ()
   myself->rusage_self.ru_nsignals++;
 
   if (si.si_signo == SIGCONT)
-    {
-      tl_entry = cygheap->find_tls (_main_tls);
-      _main_tls->handle_SIGCONT (tl_entry);
-      cygheap->unlock_tls (tl_entry);
-    }
+    _main_tls->handle_SIGCONT ();
 
   /* SIGKILL is special.  It always goes through.  */
   if (si.si_signo == SIGKILL)
diff --git a/winsup/cygwin/local_includes/cygtls.h 
b/winsup/cygwin/local_includes/cygtls.h
index 2d490646a..da22e5d14 100644
--- a/winsup/cygwin/local_includes/cygtls.h
+++ b/winsup/cygwin/local_includes/cygtls.h
@@ -274,7 +274,7 @@ public: /* Do NOT remove this public: line, it's a marker 
for gentls_offsets. */
   {
     will_wait_for_signal = false;
   }
-  void handle_SIGCONT (threadlist_t * &);
+  void handle_SIGCONT ();
   static void cleanup_early(struct _reent *);
 private:
   void call2 (DWORD (*) (void *, void *), void *, void *);
-- 
2.45.1

From faea6b05f55541ce60275f7e5d8db1347d648e0d Mon Sep 17 00:00:00 2001
From: Takashi Yano <takashi.y...@nifty.ne.jp>
Date: Sat, 18 Jan 2025 19:19:56 +0900
Subject: [PATCH 2/4] Revert "Cygwin: signal: Do not handle signal when
 __SIGFLUSHFAST is sent"

This reverts commit a22a0ad7c4f0 to apply a new patch for the same
purpose.

Signed-off-by: Takashi Yano <takashi.y...@nifty.ne.jp>
---
 winsup/cygwin/release/3.5.6 |  3 ---
 winsup/cygwin/sigproc.cc    | 20 +++++---------------
 2 files changed, 5 insertions(+), 18 deletions(-)

diff --git a/winsup/cygwin/release/3.5.6 b/winsup/cygwin/release/3.5.6
index 0fff0de40..d17a6af53 100644
--- a/winsup/cygwin/release/3.5.6
+++ b/winsup/cygwin/release/3.5.6
@@ -7,6 +7,3 @@ Fixes:
 
 - Fix a regression since 3.5.0 which fails to use POSIX semantics in
   unlink/rename on NTFS.
-
-- Fix zsh hang at startup.
-  Addresses: https://cygwin.com/pipermail/cygwin/2024-December/256954.html
diff --git a/winsup/cygwin/sigproc.cc b/winsup/cygwin/sigproc.cc
index 35ec3e70e..ba7818a68 100644
--- a/winsup/cygwin/sigproc.cc
+++ b/winsup/cygwin/sigproc.cc
@@ -751,14 +751,10 @@ sig_send (_pinfo *p, siginfo_t& si, _cygtls *tls)
       res = WriteFile (sendsig, leader, packsize, &nb, NULL);
       if (!res || packsize == nb)
        break;
-      if (cygwait (NULL, 10, cw_sig_eintr) == WAIT_SIGNALED
-         && pack.si.si_signo != __SIGFLUSHFAST)
+      if (cygwait (NULL, 10, cw_sig_eintr) == WAIT_SIGNALED)
        _my_tls.call_signal_handler ();
       res = 0;
     }
-  /* Re-assert signal_arrived which has been cleared in cygwait(). */
-  if (_my_tls.current_sig)
-    _my_tls.set_signal_arrived ();
 
   if (!res)
     {
@@ -789,16 +785,7 @@ sig_send (_pinfo *p, siginfo_t& si, _cygtls *tls)
   if (wait_for_completion)
     {
       sigproc_printf ("Waiting for pack.wakeup %p", pack.wakeup);
-      do
-       {
-         rc = cygwait (pack.wakeup, WSSC, cw_sig_eintr);
-         if (rc == WAIT_SIGNALED && pack.si.si_signo != __SIGFLUSHFAST)
-           _my_tls.call_signal_handler ();
-       }
-      while (rc != WAIT_OBJECT_0 && rc != WAIT_TIMEOUT);
-      /* Re-assert signal_arrived which has been cleared in cygwait(). */
-      if (_my_tls.current_sig)
-       _my_tls.set_signal_arrived ();
+      rc = cygwait (pack.wakeup, WSSC);
       ForceCloseHandle (pack.wakeup);
     }
   else
@@ -819,6 +806,9 @@ sig_send (_pinfo *p, siginfo_t& si, _cygtls *tls)
       rc = -1;
     }
 
+  if (wait_for_completion && si.si_signo != __SIGFLUSHFAST)
+    _my_tls.call_signal_handler ();
+
 out:
   if (communing && rc)
     {
-- 
2.45.1

From 72044bae562c40f59f20ec69ce8ec695aaff7a5d Mon Sep 17 00:00:00 2001
From: Takashi Yano <takashi.y...@nifty.ne.jp>
Date: Sat, 18 Jan 2025 19:21:51 +0900
Subject: [PATCH 3/4] Cygwin: signal: Do not handle signal when __SIGFLUSHFAST
 is sent

The commit a22a0ad7c4f0 was not exactly the correct thing. Even with
the patch, some hangs still happen. This patch overrides the previous
commit to fix these hangs.

Addresses: https://cygwin.com/pipermail/cygwin/2024-December/256987.html
Fixes: a22a0ad7c4f0 ("Cygwin: signal: Do not handle signal when __SIGFLUSHFAST 
is sent")
Reported-by: Jeremy Drake <cyg...@jdrake.com>
Reviewed-by:
Signed-off-by: Takashi Yano <takashi.y...@nifty.ne.jp>
---
 winsup/cygwin/sigproc.cc | 33 +++++++++++++++++++++++++++------
 1 file changed, 27 insertions(+), 6 deletions(-)

diff --git a/winsup/cygwin/sigproc.cc b/winsup/cygwin/sigproc.cc
index ba7818a68..4dcdd94de 100644
--- a/winsup/cygwin/sigproc.cc
+++ b/winsup/cygwin/sigproc.cc
@@ -751,8 +751,19 @@ sig_send (_pinfo *p, siginfo_t& si, _cygtls *tls)
       res = WriteFile (sendsig, leader, packsize, &nb, NULL);
       if (!res || packsize == nb)
        break;
-      if (cygwait (NULL, 10, cw_sig_eintr) == WAIT_SIGNALED)
-       _my_tls.call_signal_handler ();
+      if (pack.si.si_signo == __SIGFLUSHFAST)
+       Sleep (10);
+      else /* Handle signals */
+       do
+         {
+           DWORD rc = WaitForSingleObject (_my_tls.get_signal_arrived (), 10);
+           if (rc == WAIT_OBJECT_0)
+             {
+               _my_tls.call_signal_handler ();
+               continue;
+             }
+         }
+       while (false);
       res = 0;
     }
 
@@ -785,7 +796,20 @@ sig_send (_pinfo *p, siginfo_t& si, _cygtls *tls)
   if (wait_for_completion)
     {
       sigproc_printf ("Waiting for pack.wakeup %p", pack.wakeup);
-      rc = cygwait (pack.wakeup, WSSC);
+      if (pack.si.si_signo == __SIGFLUSHFAST)
+       rc = WaitForSingleObject (pack.wakeup, WSSC);
+      else /* Handle signals */
+       do
+         {
+           HANDLE w[2] = {pack.wakeup, _my_tls.get_signal_arrived ()};
+           rc = WaitForMultipleObjects (2, w, FALSE, WSSC);
+           if (rc == WAIT_OBJECT_0 + 1) /* signal arrived */
+             {
+               _my_tls.call_signal_handler ();
+               continue;
+             }
+         }
+       while (false);
       ForceCloseHandle (pack.wakeup);
     }
   else
@@ -806,9 +830,6 @@ sig_send (_pinfo *p, siginfo_t& si, _cygtls *tls)
       rc = -1;
     }
 
-  if (wait_for_completion && si.si_signo != __SIGFLUSHFAST)
-    _my_tls.call_signal_handler ();
-
 out:
   if (communing && rc)
     {
-- 
2.45.1

From d8e6b1f33979b9d7eb0960eca8dbcbf7d2c1f5d1 Mon Sep 17 00:00:00 2001
From: Takashi Yano <takashi.y...@nifty.ne.jp>
Date: Sat, 18 Jan 2025 19:51:27 +0900
Subject: [PATCH 4/4] Revert "Cygwin: signal: Fix high load when retrying to
 process pending signal"

This reverts commit 1d1451ccd2a6 because this does not seem to
necessary anymore. After the commit "Cygwin: signal: Do not handle
signal when __SIGFLUSHFAST is sent", the problem does not happen
even without this patch.

Signed-off-by: Takashi Yano <takashi.y...@nifty.ne.jp>
---
 winsup/cygwin/sigproc.cc | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/winsup/cygwin/sigproc.cc b/winsup/cygwin/sigproc.cc
index 4dcdd94de..68f84a848 100644
--- a/winsup/cygwin/sigproc.cc
+++ b/winsup/cygwin/sigproc.cc
@@ -1366,12 +1366,6 @@ wait_sig (VOID *)
 
   hntdll = GetModuleHandle ("ntdll.dll");
 
-  /* GetTickCount() here is enough because GetTickCount() - t0 does
-     not overflow until 49 days psss. Even if GetTickCount() overflows,
-     GetTickCount() - t0 returns correct value, since underflow in
-     unsigned wraps correctly. Pending a signal for more than 49
-     days makes no sense. */
-  DWORD t0 = GetTickCount ();
   for (;;)
     {
       DWORD nb;
@@ -1381,7 +1375,7 @@ wait_sig (VOID *)
       else if (sigq.start.next
               && PeekNamedPipe (my_readsig, NULL, 0, NULL, &nb, NULL) && !nb)
        {
-         Sleep (GetTickCount () - t0 > 10 ? 1 : 0);
+         yield ();
          pack.si.si_signo = __SIGFLUSH;
        }
       else if (!ReadFile (my_readsig, &pack, sizeof (pack), &nb, NULL))
@@ -1391,8 +1385,6 @@ wait_sig (VOID *)
          system_printf ("garbled signal pipe data nb %u, sig %d", nb, 
pack.si.si_signo);
          continue;
        }
-      if (pack.si.si_signo != __SIGFLUSH)
-       t0 = GetTickCount ();
 
       sigq.retry = false;
       /* Don't process signals when we start exiting */
-- 
2.45.1

From 70a69950b21cb5f1f73e5750e25f43aa25b8bf18 Mon Sep 17 00:00:00 2001
From: Takashi Yano <takashi.y...@nifty.ne.jp>
Date: Sat, 18 Jan 2025 19:39:26 +0900
Subject: [PATCH 3/3] Cygwin: signal: Do not handle signal when __SIGFLUSHFAST
 is sent

THIS PATCH DOES NOT FIX THE ISSUE THAT JEREMY REPORTED IN:
https://cygwin.com/pipermail/cygwin/2024-December/256987.html
FOR UNKNOWN REASON.

Signed-off-by: Takashi Yano <takashi.y...@nifty.ne.jp>
---
 winsup/cygwin/sigproc.cc | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/winsup/cygwin/sigproc.cc b/winsup/cygwin/sigproc.cc
index ba7818a68..41854b114 100644
--- a/winsup/cygwin/sigproc.cc
+++ b/winsup/cygwin/sigproc.cc
@@ -751,8 +751,10 @@ sig_send (_pinfo *p, siginfo_t& si, _cygtls *tls)
       res = WriteFile (sendsig, leader, packsize, &nb, NULL);
       if (!res || packsize == nb)
        break;
-      if (cygwait (NULL, 10, cw_sig_eintr) == WAIT_SIGNALED)
-       _my_tls.call_signal_handler ();
+      if (pack.si.si_signo == __SIGFLUSHFAST)
+       Sleep (10);
+      else /* Handle signals in cygwait() */
+       cygwait (NULL, 10, cw_sig_restart);
       res = 0;
     }
 
@@ -785,7 +787,10 @@ sig_send (_pinfo *p, siginfo_t& si, _cygtls *tls)
   if (wait_for_completion)
     {
       sigproc_printf ("Waiting for pack.wakeup %p", pack.wakeup);
-      rc = cygwait (pack.wakeup, WSSC);
+      if (pack.si.si_signo == __SIGFLUSHFAST)
+       rc = WaitForSingleObject (pack.wakeup, WSSC);
+      else /* Handle signals in cygwait() */
+       rc = cygwait (pack.wakeup, WSSC, cw_sig_restart);
       ForceCloseHandle (pack.wakeup);
     }
   else
@@ -806,9 +811,6 @@ sig_send (_pinfo *p, siginfo_t& si, _cygtls *tls)
       rc = -1;
     }
 
-  if (wait_for_completion && si.si_signo != __SIGFLUSHFAST)
-    _my_tls.call_signal_handler ();
-
 out:
   if (communing && rc)
     {
-- 
2.45.1

Reply via email to