On Sat, 18 Jan 2025 20:41:37 +0900
Takashi Yano wrote:
> On Fri, 17 Jan 2025 18:52:41 +0900
> Takashi Yano wrote:
> > On Wed, 8 Jan 2025 18:05:53 -0800 (PST)
> > Jeremy Drake wrote:
> > > On Thu, 9 Jan 2025, Takashi Yano wrote:
> > > 
> > > > On Wed, 8 Jan 2025 16:48:41 +0100
> > > > Corinna Vinschen wrote:
> > > > > Does this patch fix Bruno's bash issue as well?
> > > >
> > > > I'm not sure because it is not reproducible as he said.
> > > > I also could not reproduce that.
> > > >
> > > > However, at least this fixes the issue that Jeremy encountered:
> > > > https://cygwin.com/pipermail/cygwin/2024-December/256977.html
> > > >
> > > > But, even with this patch, Jeremy reported another hang issue
> > > > that also is not reproducible:
> > > > https://cygwin.com/pipermail/cygwin/2024-December/256987.html
> > > 
> > > Yes, this patch helped the hangs I was seeing on Windows on ARM64.
> > > However, there is still some hang issue in 3.5.5 (which occurs on
> > > native x86_64) that is not there in 3.5.4.  Git for Windows' test suite
> > > seems to be somewhat reliable at triggering this, but it's hardly a
> > > minimal test case ;).
> > > 
> > > Because of this issue, MSYS2 has been keeping 3.5.5 in its 'staging' state
> > > (rather than deploying it to normal users), and Git for Windows rolled
> > > back to 3.5.4 before the release of the latest Git RC.
> > 
> > I might have successfully reproduced this issue. I tried building
> > cygwin1.dll repeatedly for some of my machines, and one of them
> > hung in fhandler_pipe::raw_read() as lazka's case:
> > https://github.com/msys2/msys2-runtime/pull/251#issuecomment-2571338429
> > 
> > The call:
> > L358:         waitret = cygwait (select_sem, select_sem_timeout);
> > never returned even with select_sem_timeout == 1 until a signal
> > (such as SIGTERM, SIGKILL) arrives. In this situation, attaching
> > gdb to the process hanging and issuing 'si' command do not return.
> > Something (stack?) seems to be completely broken.
> > 
> > I'll try to bisect which commit causes this issue. Please wait
> > a while.
> 
> Done.
> 
> This issue also seems to be related to the commit:
> 
> commit d243e51ef1d30312ba1e21b4d25a1ca9a8dc1f63
> Author: Takashi Yano <takashi.y...@nifty.ne.jp>
> Date:   Mon Nov 25 19:51:53 2024 +0900
> 
>     Cygwin: signal: Fix deadlock between main thread and sig thread
> 
>     Previously, a deadlock happened if many SIGSTOP/SIGCONT signals were
>     received rapidly. If the main thread sends __SIGFLUSH at the timing
>     when SIGSTOP is handled by the sig thread, but not is handled by the
>     main thread yet (sig_handle_tty_stop() not called yet), and if SIGCONT
>     is received, the sig thread waits for cygtls::current_sig (is SIGSTOP
>     now) cleared. However, the main thread waits for the pack.wakeup using
>     WaitForSingleObject(), so the main thread cannot handle SIGSTOP. This
>     is the mechanism of the deadlock. This patch uses cygwait() instead of
>     WaitForSingleObject() to be able to handle the pending SIGSTOP.
> 
>     Addresses: https://cygwin.com/pipermail/cygwin/2024-November/256744.html
>     Fixes: 7759daa979c4 ("(sig_send): Fill out sigpacket structure to send to 
> signal thread rather than racily sending separate packets.")
>     Reported-by: Christian Franke <christian.fra...@t-online.de>
>     Reviewed-by: Corinna Vinschen <cori...@vinschen.de>
>     Signed-off-by: Takashi Yano <takashi.y...@nifty.ne.jp>
> 
> Even though the reason why this issue happens is not clear at all,
> I perhaps found the solution for that.
> 
> Applying the attached patch:
> 0003-Cygwin-signal-Do-not-handle-signal-when-__SIGFLUSHFA.patch
> instead of previous v2 __SIGFLUSHFAST patch solves the both issues.
> 
> However, strangely enough, the similar patch:
> ng-0003-Cygwin-signal-Do-not-handle-signal-when-__SIGFLUSHFA.patch
> which uses cygwait() instead of WF[SM]O does not solve the issue
> Jeremy reported.
> 
> The reason is also unclear. What is the difference between cygwait()
> and WF[SM]O? I expected both patches work almost the same. The v2
> __SIGFLUSHFAST patch also uses cygwait(), so the reason might be
> the same (the reason why we should use WF[SM]O rather than cygwait()).
> 
> Corinna, any idea? I need some clue.
> 
> 
> While debugging this problem, I encountered another hang issue,
> which is fixed by:
> 0001-Cygwin-signal-Avoid-frequent-tls-lock-unlock-for-SIG.patch
> 
> If we are confident in the patch 0003, I think we should apply
> 0001-Cygwin-signal-Avoid-frequent-tls-lock-unlock-for-SIG.patch
> 0002-Revert-Cygwin-signal-Do-not-handle-signal-when-__SIG.patch
> 0003-Cygwin-signal-Do-not-handle-signal-when-__SIGFLUSHFA.patch
> 0004-Revert-Cygwin-signal-Fix-high-load-when-retrying-to-.patch
> for main branch and
> 0001-Cygwin-signal-Avoid-frequent-tls-lock-unlock-for-SIG.patch
> 0003-Cygwin-signal-Do-not-handle-signal-when-__SIGFLUSHFA.patch
> for cygwin-3_5-branch.
> 
> Corinna, what do you think?
> 
> Jeremy,
> could you please apply the attached patches:
> 0001-Cygwin-signal-Avoid-frequent-tls-lock-unlock-for-SIG.patch
> 0003-Cygwin-signal-Do-not-handle-signal-when-__SIGFLUSHFA.patch
> against cygwin-3_5-branch and test if these fix the issue?

The patch 0001 still seems to have hang problem. I revised it
as attached.

Jeremy, could you please replace 0001 patch with attached v2 patch?

-- 
Takashi Yano <takashi.y...@nifty.ne.jp>
From 390c6f1a0165171066e04504540831fd902f590a Mon Sep 17 00:00:00 2001
From: Takashi Yano <takashi.y...@nifty.ne.jp>
Date: Sat, 18 Jan 2025 19:03:23 +0900
Subject: [PATCH v2] Cygwin: signal: Avoid frequent tls lock/unlock for SIGCONT
 processing

It seems that current _cygtls::handle_SIGCONT() code sometimes falls
into a deadlock due to frequent TLS lock/unlock operation in the
yield() loop. With this patch, the yield() in the wait loop is placed
placed outside the TLS lock to avoid frequent TLS lock/unlock.

Fixes: 9ae51bcc51a7 ("Cygwin: signal: Fix another deadlock between main and sig 
thread")
Reviewed-by:
Signed-off-by: Takashi Yano <takashi.y...@nifty.ne.jp>
---
 winsup/cygwin/exceptions.cc           | 36 ++++++++++-----------------
 winsup/cygwin/local_includes/cygtls.h |  2 +-
 2 files changed, 14 insertions(+), 24 deletions(-)

diff --git a/winsup/cygwin/exceptions.cc b/winsup/cygwin/exceptions.cc
index 4dc4be278..09529061f 100644
--- a/winsup/cygwin/exceptions.cc
+++ b/winsup/cygwin/exceptions.cc
@@ -1420,7 +1420,7 @@ api_fatal_debug ()
 
 /* Attempt to carefully handle SIGCONT when we are stopped. */
 void
-_cygtls::handle_SIGCONT (threadlist_t * &tl_entry)
+_cygtls::handle_SIGCONT ()
 {
   if (NOTSTATE (myself, PID_STOPPED))
     return;
@@ -1431,23 +1431,17 @@ _cygtls::handle_SIGCONT (threadlist_t * &tl_entry)
      Make sure that any pending signal is handled before trying to
      send a new one.  Then make sure that SIGCONT has been recognized
      before exiting the loop.  */
-  bool sigsent = false;
-  while (1)
-    if (current_sig)   /* Assume that it's ok to just test sig outside of a
-                          lock since setup_handler does it this way.  */
-      {
-       cygheap->unlock_tls (tl_entry);
-       yield ();       /* Attempt to schedule another thread.  */
-       tl_entry = cygheap->find_tls (_main_tls);
-      }
-    else if (sigsent)
-      break;           /* SIGCONT has been recognized by other thread */
-    else
-      {
-       current_sig = SIGCONT;
-       set_signal_arrived (); /* alert sig_handle_tty_stop */
-       sigsent = true;
-      }
+  while (current_sig)
+    yield ();
+
+  threadlist_t *tl_entry = cygheap->find_tls (this);
+  current_sig = SIGCONT;
+  set_signal_arrived (); /* alert sig_handle_tty_stop */
+  cygheap->unlock_tls (tl_entry);
+
+  while (current_sig == SIGCONT)
+    yield ();
+
   /* Clear pending stop signals */
   sig_clear (SIGSTOP, false);
   sig_clear (SIGTSTP, false);
@@ -1479,11 +1473,7 @@ sigpacket::process ()
   myself->rusage_self.ru_nsignals++;
 
   if (si.si_signo == SIGCONT)
-    {
-      tl_entry = cygheap->find_tls (_main_tls);
-      _main_tls->handle_SIGCONT (tl_entry);
-      cygheap->unlock_tls (tl_entry);
-    }
+    _main_tls->handle_SIGCONT ();
 
   /* SIGKILL is special.  It always goes through.  */
   if (si.si_signo == SIGKILL)
diff --git a/winsup/cygwin/local_includes/cygtls.h 
b/winsup/cygwin/local_includes/cygtls.h
index 2d490646a..da22e5d14 100644
--- a/winsup/cygwin/local_includes/cygtls.h
+++ b/winsup/cygwin/local_includes/cygtls.h
@@ -274,7 +274,7 @@ public: /* Do NOT remove this public: line, it's a marker 
for gentls_offsets. */
   {
     will_wait_for_signal = false;
   }
-  void handle_SIGCONT (threadlist_t * &);
+  void handle_SIGCONT ();
   static void cleanup_early(struct _reent *);
 private:
   void call2 (DWORD (*) (void *, void *), void *, void *);
-- 
2.45.1

Reply via email to