Re: SIGKILL may no longer work after many SIGCONT/SIGSTOP signals

Christian Franke via Cygwin Sat, 23 Nov 2024 08:01:19 -0800

Takashi Yano via Cygwin wrote:

On Wed, 20 Nov 2024 22:43:08 +0900
Takashi Yano wrote:

On Tue, 19 Nov 2024 18:21:52 +0900
Takashi Yano wrote:

On Tue, 12 Nov 2024 10:53:58 +0100
Christian Franke wrote:

Found with 'stress-ng --cpu-sched' from current stress-ng upstream HEAD:


Testcase (attached):

$ gcc -O2 -o manysignals manysignals.c

$ ./manysignals
fork() = 1833
...
fork() = 1848
...
kill(1833, 17)
...
kill(1848, 17)
kill(1833, 9)
...
kill(1848, 9)
waitpid(1833, ., 0)


Run this in second terminal:

$ watch "ps | sed -n '1p;/manysignals/{/sed/d;p}'"

If 'S' appears in the first column, the child processes likely reached
the final SIGSTOP state. This takes some time. The parent process may
still hang in the first waitpid() but should not.

If the parent process is aborted with ^C, child processes may be stopped
or left behind. Occasionally a child process that can not be stopped by
Cygwin (kill -9) is left behind.

Tested with ancient (i7-2600K) and more recent (i7-14700K) CPU :-)


Unrelated to the above, but related to 'stress-ng --cpu-sched' which
uses sched_get/setscheduler():

- sched_getscheduler() always returns SCHED_FIFO. As far as I understand
Linux sched(7), this is a non-preemptive real-time policy. The
preemptive SCHED_RR would possibly be a more reasonable value.
Unfortunately SCHED_OTHER cannot be used because it would require
ignoring the priority.

- sched_setscheduler() always fails with ENOSYS. IMO it should allow
setting 'param->sched_priority' if 'policy' is equal to the value
returned by sched_getscheduler().

Thanks for the report and the test case. I'm now looking into
the issue. Please wait a while.

Hopefully, I have found the cause.

The deadlock happens between the main thread and the wait_sig thread.
The main thread is waiting for the wait_sig thread to trigger the
wakeup event, while the wait_sig thread is waiting for the previous
signal to be processed by the main thread.

Let me consider how to fix that.

I'd like to report my progress for this issue.

The patch attached almost solves the problem. ...


Compile error if applied to current git main (3dbc8c3):

../../../../winsup/cygwin/exceptions.cc:1487:21: error: struct_cygtls has no member named sig

  1487 |   while (_main_tls->sig)
       |                     ^~~

However, your test case is paused for tens of seconds, then ends normally.

I guess this is as expected. The processing of the
SIGSTOP/SIGCONT/.../SIGSTOP/SIGKILL sequence of each child process takes
some time because all are locked to a single core.

If the code:
       cpu_set_t cpus; CPU_ZERO(&cpus);
       CPU_SET(0, &cpus);
       if (sched_setaffinity(getpid(), sizeof(cpus), &cpus))
         perror("setaffinity");

       for (;;)
         sched_yield();
is changed to just:
       for (;;) sleep(1);
the test case runs without pause.

The pause will possibly reappear if the number of child processes is
increased to some multiple of the available cores.

I think there still is a bug in the signal handling.


Possibly related:
https://sourceware.org/pipermail/cygwin/2024-November/256808.html

--
Regards,
Christian


--
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple
