On Sat, 18 Jan 2025 20:41:37 +0900 Takashi Yano wrote: > On Fri, 17 Jan 2025 18:52:41 +0900 > Takashi Yano wrote: > > On Wed, 8 Jan 2025 18:05:53 -0800 (PST) > > Jeremy Drake wrote: > > > On Thu, 9 Jan 2025, Takashi Yano wrote: > > > > > > > On Wed, 8 Jan 2025 16:48:41 +0100 > > > > Corinna Vinschen wrote: > > > > > Does this patch fix Bruno's bash issue as well? > > > > > > > > I'm not sure because it is not reproducible as he said. > > > > I also could not reproduce that. > > > > > > > > However, at least this fixes the issue that Jeremy encountered: > > > > https://cygwin.com/pipermail/cygwin/2024-December/256977.html > > > > > > > > But, even with this patch, Jeremy reported another hang issue > > > > that also is not reproducible: > > > > https://cygwin.com/pipermail/cygwin/2024-December/256987.html > > > > > > Yes, this patch helped the hangs I was seeing on Windows on ARM64. > > > However, there is still some hang issue in 3.5.5 (which occurs on > > > native x86_64) that is not there in 3.5.4. Git for Windows' test suite > > > seems to be somewhat reliable at triggering this, but it's hardly a > > > minimal test case ;). > > > > > > Because of this issue, MSYS2 has been keeping 3.5.5 in its 'staging' state > > > (rather than deploying it to normal users), and Git for Windows rolled > > > back to 3.5.4 before the release of the latest Git RC. > > > > I might have successfully reproduced this issue. I tried building > > cygwin1.dll repeatedly for some of my machines, and one of them > > hung in fhandler_pipe::raw_read() as lazka's case: > > https://github.com/msys2/msys2-runtime/pull/251#issuecomment-2571338429 > > > > The call: > > L358: waitret = cygwait (select_sem, select_sem_timeout); > > never returned even with select_sem_timeout == 1 until a signal > > (such as SIGTERM, SIGKILL) arrives. In this situation, attaching > > gdb to the process hanging and issuing 'si' command do not return. > > Something (stack?) seems to be completely broken. > > > > I'll try to bisect which commit causes this issue. Please wait > > a while. > > Done. > > This issue also seems to be related to the commit: > > commit d243e51ef1d30312ba1e21b4d25a1ca9a8dc1f63 > Author: Takashi Yano <takashi.y...@nifty.ne.jp> > Date: Mon Nov 25 19:51:53 2024 +0900 > > Cygwin: signal: Fix deadlock between main thread and sig thread > > Previously, a deadlock happened if many SIGSTOP/SIGCONT signals were > received rapidly. If the main thread sends __SIGFLUSH at the timing > when SIGSTOP is handled by the sig thread, but not is handled by the > main thread yet (sig_handle_tty_stop() not called yet), and if SIGCONT > is received, the sig thread waits for cygtls::current_sig (is SIGSTOP > now) cleared. However, the main thread waits for the pack.wakeup using > WaitForSingleObject(), so the main thread cannot handle SIGSTOP. This > is the mechanism of the deadlock. This patch uses cygwait() instead of > WaitForSingleObject() to be able to handle the pending SIGSTOP. > > Addresses: https://cygwin.com/pipermail/cygwin/2024-November/256744.html > Fixes: 7759daa979c4 ("(sig_send): Fill out sigpacket structure to send to > signal thread rather than racily sending separate packets.") > Reported-by: Christian Franke <christian.fra...@t-online.de> > Reviewed-by: Corinna Vinschen <cori...@vinschen.de> > Signed-off-by: Takashi Yano <takashi.y...@nifty.ne.jp> > > Even though the reason why this issue happens is not clear at all, > I perhaps found the solution for that. > > Applying the attached patch: > 0003-Cygwin-signal-Do-not-handle-signal-when-__SIGFLUSHFA.patch > instead of previous v2 __SIGFLUSHFAST patch solves the both issues. > > However, strangely enough, the similar patch: > ng-0003-Cygwin-signal-Do-not-handle-signal-when-__SIGFLUSHFA.patch > which uses cygwait() instead of WF[SM]O does not solve the issue > Jeremy reported. > > The reason is also unclear. What is the difference between cygwait() > and WF[SM]O? I expected both patches work almost the same. The v2 > __SIGFLUSHFAST patch also uses cygwait(), so the reason might be > the same (the reason why we should use WF[SM]O rather than cygwait()). > > Corinna, any idea? I need some clue. > > > While debugging this problem, I encountered another hang issue, > which is fixed by: > 0001-Cygwin-signal-Avoid-frequent-tls-lock-unlock-for-SIG.patch > > If we are confident in the patch 0003, I think we should apply > 0001-Cygwin-signal-Avoid-frequent-tls-lock-unlock-for-SIG.patch > 0002-Revert-Cygwin-signal-Do-not-handle-signal-when-__SIG.patch > 0003-Cygwin-signal-Do-not-handle-signal-when-__SIGFLUSHFA.patch > 0004-Revert-Cygwin-signal-Fix-high-load-when-retrying-to-.patch > for main branch and > 0001-Cygwin-signal-Avoid-frequent-tls-lock-unlock-for-SIG.patch > 0003-Cygwin-signal-Do-not-handle-signal-when-__SIGFLUSHFA.patch > for cygwin-3_5-branch. > > Corinna, what do you think? > > Jeremy, > could you please apply the attached patches: > 0001-Cygwin-signal-Avoid-frequent-tls-lock-unlock-for-SIG.patch > 0003-Cygwin-signal-Do-not-handle-signal-when-__SIGFLUSHFA.patch > against cygwin-3_5-branch and test if these fix the issue?
The patch 0001 still seems to have hang problem. I revised it as attached. Jeremy, could you please replace 0001 patch with attached v2 patch? -- Takashi Yano <takashi.y...@nifty.ne.jp>
From 390c6f1a0165171066e04504540831fd902f590a Mon Sep 17 00:00:00 2001 From: Takashi Yano <takashi.y...@nifty.ne.jp> Date: Sat, 18 Jan 2025 19:03:23 +0900 Subject: [PATCH v2] Cygwin: signal: Avoid frequent tls lock/unlock for SIGCONT processing It seems that current _cygtls::handle_SIGCONT() code sometimes falls into a deadlock due to frequent TLS lock/unlock operation in the yield() loop. With this patch, the yield() in the wait loop is placed placed outside the TLS lock to avoid frequent TLS lock/unlock. Fixes: 9ae51bcc51a7 ("Cygwin: signal: Fix another deadlock between main and sig thread") Reviewed-by: Signed-off-by: Takashi Yano <takashi.y...@nifty.ne.jp> --- winsup/cygwin/exceptions.cc | 36 ++++++++++----------------- winsup/cygwin/local_includes/cygtls.h | 2 +- 2 files changed, 14 insertions(+), 24 deletions(-) diff --git a/winsup/cygwin/exceptions.cc b/winsup/cygwin/exceptions.cc index 4dc4be278..09529061f 100644 --- a/winsup/cygwin/exceptions.cc +++ b/winsup/cygwin/exceptions.cc @@ -1420,7 +1420,7 @@ api_fatal_debug () /* Attempt to carefully handle SIGCONT when we are stopped. */ void -_cygtls::handle_SIGCONT (threadlist_t * &tl_entry) +_cygtls::handle_SIGCONT () { if (NOTSTATE (myself, PID_STOPPED)) return; @@ -1431,23 +1431,17 @@ _cygtls::handle_SIGCONT (threadlist_t * &tl_entry) Make sure that any pending signal is handled before trying to send a new one. Then make sure that SIGCONT has been recognized before exiting the loop. */ - bool sigsent = false; - while (1) - if (current_sig) /* Assume that it's ok to just test sig outside of a - lock since setup_handler does it this way. */ - { - cygheap->unlock_tls (tl_entry); - yield (); /* Attempt to schedule another thread. */ - tl_entry = cygheap->find_tls (_main_tls); - } - else if (sigsent) - break; /* SIGCONT has been recognized by other thread */ - else - { - current_sig = SIGCONT; - set_signal_arrived (); /* alert sig_handle_tty_stop */ - sigsent = true; - } + while (current_sig) + yield (); + + threadlist_t *tl_entry = cygheap->find_tls (this); + current_sig = SIGCONT; + set_signal_arrived (); /* alert sig_handle_tty_stop */ + cygheap->unlock_tls (tl_entry); + + while (current_sig == SIGCONT) + yield (); + /* Clear pending stop signals */ sig_clear (SIGSTOP, false); sig_clear (SIGTSTP, false); @@ -1479,11 +1473,7 @@ sigpacket::process () myself->rusage_self.ru_nsignals++; if (si.si_signo == SIGCONT) - { - tl_entry = cygheap->find_tls (_main_tls); - _main_tls->handle_SIGCONT (tl_entry); - cygheap->unlock_tls (tl_entry); - } + _main_tls->handle_SIGCONT (); /* SIGKILL is special. It always goes through. */ if (si.si_signo == SIGKILL) diff --git a/winsup/cygwin/local_includes/cygtls.h b/winsup/cygwin/local_includes/cygtls.h index 2d490646a..da22e5d14 100644 --- a/winsup/cygwin/local_includes/cygtls.h +++ b/winsup/cygwin/local_includes/cygtls.h @@ -274,7 +274,7 @@ public: /* Do NOT remove this public: line, it's a marker for gentls_offsets. */ { will_wait_for_signal = false; } - void handle_SIGCONT (threadlist_t * &); + void handle_SIGCONT (); static void cleanup_early(struct _reent *); private: void call2 (DWORD (*) (void *, void *), void *, void *); -- 2.45.1