On Sun, 24 Nov 2024 01:15:09 +0900 Takashi Yano wrote:
> On Sat, 23 Nov 2024 16:53:21 +0100
> Christian Franke wrote:
> > Takashi Yano via Cygwin wrote:
> > > On Wed, 20 Nov 2024 22:43:08 +0900
> > > Takashi Yano wrote:
> > >> On Tue, 19 Nov 2024 18:21:52 +0900
> > >> Takashi Yano wrote:
> > >>> On Tue, 12 Nov 2024 10:53:58 +0100
> > >>> Christian Franke wrote:
> > >>>> Found with 'stress-ng --cpu-sched' from current stress-ng upstream
> > >>>> HEAD:
> > >>>>
> > >>>> Testcase (attached):
> > >>>>
> > >>>> $ gcc -O2 -o manysignals manysignals.c
> > >>>>
> > >>>> $ ./manysignals
> > >>>> fork() = 1833
> > >>>> ...
> > >>>> fork() = 1848
> > >>>> ...
> > >>>> kill(1833, 17)
> > >>>> ...
> > >>>> kill(1848, 17)
> > >>>> kill(1833, 9)
> > >>>> ...
> > >>>> kill(1848, 9)
> > >>>> waitpid(1833, ., 0)
> > >>>>
> > >>>>
> > >>>> Run this in second terminal:
> > >>>>
> > >>>> $ watch "ps | sed -n '1p;/manysignals/{/sed/d;p}'"
> > >>>>
> > >>>> If 'S' appear in the first column, the child processes likely reached
> > >>>> the final SIGSTOP state. This takes some time. The parent process may
> > >>>> still hang in first waitpid() but should not.
> > >>>>
> > >>>> If the parent process is aborted with ^C, child processes may be
> > >>>> stopped
> > >>>> or left behind. Occasionally a child process that can not be stopped by
> > >>>> Cygwin (kill -9) is left behind.
> > >>>>
> > >>>> Tested with ancient (i7-2600K) and more recent (i7-14700K) CPU :-)
> > >>>>
> > >>>>
> > >>>> Unrelated to the above, but related to 'stress-ng --cpu-sched' which
> > >>>> uses sched_get/setscheduler():
> > >>>>
> > >>>> - sched_getscheduler() always returns SCHED_FIFO. As far as I
> > >>>> understand
> > >>>> Linux sched(7), this is a non-preemptive real-time policy. The
> > >>>> preemptive SCHED_RR would possibly a more reasonable value.
> > >>>> Unfortunately SCHED_OTHER cannot be used because it would require to
> > >>>> ignore the priority.
> > >>>>
> > >>>> - sched_setscheduler() always fails with ENOSYS. It IMO should allow to
> > >>>> set 'param->sched_priority' if 'policy' is equal to the value returned
> > >>>> by sched_getscheduler().
> > >>> Thanks for the report and the test case. I'm now looking into
> > >>> the issue. Please wait a while.
> > >> Hopefully, I have found the cause.
> > >>
> > >> The deadlock happens between main thread and wait_sig thread.
> > >> The main thread is waiting for the wait_sig thread triggering
> > >> wakeup event while the wait_sig thread is waiting previous
> > >> signal being processed by main thread.
> > >>
> > >> Let me consider how to fix that.
> > > I'd like to report my progress for this issue.
> > >
> > > The patch attached almost solves the problem. ...
> >
> > Compile error if applied to current git main (3dbc8c3):
> >
> > ../../../../winsup/cygwin/exceptions.cc:1487:21: error: struct
> > _cygtls has no member named sig
> > 1487 | while (_main_tls->sig)
> >      | ^~~
>
> This is because the latest Corinna's commit changes the name 'sig'
> to 'current_sig'.
>
> commit 3dbc8c3fbdc99d3f0f68fab8ba2a814ecdc27e17
> Cygwin: cygtls: rename sig to current_sig
>
> > > However, your test
> > > case is paused for tens of seconds, then ends normally.
> >
> > I guess this is as expected. The processing of the
> > SIGSTOP/SIGCONT/.../SIGSTOP/SIGKILL sequence of each child process take
> > some time because all are locked to a single core.
>
> I feel it's too slow even if 16 processes (with wait_sig threads) are
> executed in one CPU core.
>
> > > If the code:
> > >   cpu_set_t cpus; CPU_ZERO(&cpus);
> > >   CPU_SET(0, &cpus);
> > >   if (sched_setaffinity(getpid(), sizeof(cpus), &cpus))
> > >     perror("setaffinity");
> > >
> > >   for (;;)
> > >     sched_yield();
> > > is changed to just:
> > >   for (;;) sleep(1);
> > > the test case runs without pause.
> >
> > The pause will possibly reappear if the number of child processes is
> > increased to some multiple of the available cores.
>
> I tested with np = 16*32 without sched_setaffinity() call, the pause
> does not happen. My CPU is Threadripper 1950X 16-core 32-thread.
>
> > > I think there still is a bug in the signal handling.

I have just submitted 6 patches for this issue. With these patches,
the reported problem no longer occurs in my environment.

--
Takashi Yano <takashi.y...@nifty.ne.jp>
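
For readers who do not have the attachment at hand, the pattern discussed in this thread (children pinned to CPU 0 calling sched_yield(), a parent sending a SIGSTOP/SIGCONT/.../SIGSTOP/SIGKILL sequence and then calling waitpid()) can be roughly sketched as below. This is NOT the attached manysignals.c; the function names child_spin() and stop_cont_kill(), the limit of 64 children and the return value are assumptions made only for illustration.

```c
#define _GNU_SOURCE             /* for sched_setaffinity() and CPU_* macros */
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical child body: pin to CPU 0 and yield forever,
   mirroring the snippet quoted in the thread. Never returns. */
void child_spin(void)
{
  cpu_set_t cpus;
  CPU_ZERO(&cpus);
  CPU_SET(0, &cpus);
  if (sched_setaffinity(getpid(), sizeof(cpus), &cpus))
    perror("setaffinity");
  for (;;)
    sched_yield();
}

/* Hypothetical driver: fork nproc children, send each one `rounds`
   SIGSTOP/SIGCONT pairs, then a final SIGSTOP and SIGKILL, then reap.
   Returns the number of children that waitpid() reports as terminated
   by SIGKILL, or -1 on bad arguments or fork() failure. */
int stop_cont_kill(int nproc, int rounds)
{
  pid_t pids[64];
  int killed = 0;

  if (nproc < 1 || nproc > 64 || rounds < 0)
    return -1;

  for (int i = 0; i < nproc; i++) {
    pids[i] = fork();
    if (pids[i] == 0) {
      child_spin();
      _exit(0);                 /* not reached */
    }
    if (pids[i] < 0) {          /* clean up children created so far */
      perror("fork");
      for (int j = 0; j < i; j++) {
        kill(pids[j], SIGKILL);
        waitpid(pids[j], NULL, 0);
      }
      return -1;
    }
  }

  for (int r = 0; r < rounds; r++)
    for (int i = 0; i < nproc; i++) {
      kill(pids[i], SIGSTOP);
      kill(pids[i], SIGCONT);
    }

  for (int i = 0; i < nproc; i++)
    kill(pids[i], SIGSTOP);
  for (int i = 0; i < nproc; i++)
    kill(pids[i], SIGKILL);

  /* The hang reported in this thread was observed in the first waitpid(). */
  for (int i = 0; i < nproc; i++) {
    int status;
    if (waitpid(pids[i], &status, 0) == pids[i]
        && WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
      killed++;
  }
  return killed;
}
```

Calling stop_cont_kill(16, 100) from main() and checking that it returns 16 without a long pause would be one way to exercise the same signal sequence; the counts are arbitrary.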