I experienced, with the Shepherd 0.10.0, a situation where ‘herd restart’ would get stuck after the end-of-grace-period expiration (restarting nscd in this case):
--8<---------------cut here---------------start------------->8--- May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Status of nscd: May 26 08:44:33 localhost shepherd[1]: [NetworkManager] It is running since 08:43:57 (36 seconds ago). May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Running value is 23753. May 26 08:44:33 localhost shepherd[1]: [NetworkManager] It is enabled. May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Provides (nscd). May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Requires (user-processes syslogd). May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Will be respawned. May 26 08:44:33 localhost shepherd[1]: Stopping service nscd... [...] May 26 08:44:38 localhost shepherd[1]: Grace period of 5 seconds is over; sending -23753 SIGKILL. --8<---------------cut here---------------end--------------->8--- The ‘herd restart’ process is indeed waiting, and the nscd process is still around as zombie, meaning that shepherd didn’t call waitpid(2): --8<---------------cut here---------------start------------->8--- $ sudo herd status nscd Status of nscd: It is being stopped. It is enabled. Provides (nscd). Requires (user-processes syslogd). Will be respawned. $ ps aux|grep nscd root 23753 0.0 0.0 0 0 ? Zs 08:43 0:00 [nscd] <defunct> root 23870 0.0 0.1 49968 18088 ? Sl 08:44 0:00 /gnu/store/4gvgcfdiz67wv04ihqfa8pqwzsb0qpv5-guile-3.0.9/bin/guile --no-auto-compile /run/current-system/profile/bin/herd restart nscd ludo 25280 33.3 0.0 6112 2304 pts/0 S+ 09:32 0:00 grep --color=auto nscd --8<---------------cut here---------------end--------------->8--- At that point, if I pick another service process and kill it, shepherd again fails to react to SIGCHLD (or doesn’t receive it): --8<---------------cut here---------------start------------->8--- $ sudo herd status ntpd Status of ntpd: It is running since 08:44:31 AM (52 minutes ago). Running value is 23814. It is enabled. Provides (ntpd). Requires (user-processes networking). Will be respawned. $ sudo kill 23814 $ sudo herd status ntpd Status of ntpd: It is running since 08:44:31 AM (52 minutes ago). Running value is 23814. It is enabled. Provides (ntpd). Requires (user-processes networking). Will be respawned. ludo@ribbon ~/src/guix$ ps 23814 PID TTY STAT TIME COMMAND 23814 ? Zs 0:00 [ntpd] <defunct> --8<---------------cut here---------------end--------------->8--- If I repeat the same experiment while stracing shepherd, here’s what I see: --8<---------------cut here---------------start------------->8--- $ sudo herd status guix-publish Status of guix-publish: It is running since 08:44:32 AM (55 minutes ago). Running value is 23822. It is enabled. Provides (guix-publish). Requires (user-processes guix-daemon avahi-daemon). Will be respawned. $ sudo kill 23822 $ ps 23822 PID TTY STAT TIME COMMAND 23822 ? Zs 0:02 [guix publish] <defunct> --8<---------------cut here---------------end--------------->8--- … and the strace: --8<---------------cut here---------------start------------->8--- 1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0 1 epoll_wait(13, [{events=EPOLLIN, data={u32=14, u64=14}}], 8, -1) = 1 1 accept4(14, {sa_family=AF_UNIX}, [112 => 2], SOCK_CLOEXEC|SOCK_NONBLOCK) = 25 1 accept4(14, 0x7ffe6f8f0be0, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) 1 epoll_ctl(13, EPOLL_CTL_MOD, 14, {events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=14, u64=14}}) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 read(25, "(shepherd-command (version 0) (action status) (service guix-publish) (arguments ()) (directory \"/home/ludo/src/guix\"))", 1024) = 118 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 write(25, "(reply (version 0) (result ((service (version 0) (provides (guix-publish)) (requires (user-processes guix-daemon avahi-daemon)) (respawn? #t) (docstring \"[No documentation.]\") (enabled? #t) (running 23822) (conflicts ()) (last-respawns ()) (status-changes ((running . 1685083472) (starting . 1685083472) (stopped . 1685083469) (stopping . 1685083469) (running . 1684924319) (starting . 1684924319) (stopped . 1684924319) (stopping . 1684924319) (running . 1684853459) (starting . 1684853459) (stopped"..., 686) = 686 1 close(25) = 0 1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0 1 epoll_wait(13, [{events=EPOLLHUP, data={u32=30, u64=30}}], 8, -1) = 1 1 read(30, "", 4096) = 0 1 close(30) = 0 1 close(36) = 0 1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0 1 epoll_wait(13, --8<---------------cut here---------------end--------------->8--- The signal FD still exists but we didn’t see any activity on it: --8<---------------cut here---------------start------------->8--- $ sudo ls -lrt /proc/1/fd |grep signalfd lrwx------ 1 root root 64 May 26 09:41 10 -> anon_inode:[signalfd] --8<---------------cut here---------------end--------------->8--- Here’s the signal handling status (notice two signals queued): --8<---------------cut here---------------start------------->8--- $ sudo cat /proc/1/status|grep -i Sig SigQ: 2/63467 SigPnd: 0000000000000000 SigBlk: 0000000000014003 SigIgn: 0000000000001000 SigCgt: 0000000120814423 --8<---------------cut here---------------end--------------->8--- With: (define (signal-mask->list mask) (let loop ((sig 32) (lst '())) (if (zero? sig) lst (loop (- sig 1) (if (bit-set? (- sig 1) mask) (cons sig lst) lst))))) … we see that the block mask is equal to ‘%precious-signals’ (good): --8<---------------cut here---------------start------------->8--- scheme@(shepherd system)> (signal-mask->list #x0000000000014003) $33 = (1 2 15 17) --8<---------------cut here---------------end--------------->8--- … while the ignored mask lists SIGPIPE: --8<---------------cut here---------------start------------->8--- scheme@(shepherd system)> (signal-mask->list #x0000000000001000) $32 = (13) --8<---------------cut here---------------end--------------->8--- However, the process also has 3 GC marker threads and 1 finalization thread: --8<---------------cut here---------------start------------->8--- (gdb) info threads Id Target Id Frame * 1 Thread 0x7fdbd530f380 (LWP 1) "shepherd" 0x00007fdbd5415626 in epoll_wait (epfd=13, events=0x7fdbd203a1a0, maxevents=8, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 2 Thread 0x7fdbd4cd9640 (LWP 118) "GC-marker-0" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 <mark_cv+40>) at futex-internal.c:57 3 Thread 0x7fdbd44d8640 (LWP 119) "GC-marker-1" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 <mark_cv+40>) at futex-internal.c:57 4 Thread 0x7fdbd3cd7640 (LWP 120) "GC-marker-2" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 <mark_cv+40>) at futex-internal.c:57 5 Thread 0x7fdbd3389640 (LWP 123) "shepherd" __GI___libc_read (nbytes=1, buf=0x7fdbd3388620, fd=6) at ../sysdeps/unix/sysv/linux/read.c:26 --8<---------------cut here---------------end--------------->8--- These threads seem to be blocking every signal, which is good: --8<---------------cut here---------------start------------->8--- $ sudo cat /proc/118/status |grep ^Sig SigQ: 2/63467 SigPnd: 0000000000000000 SigBlk: fffffffe5f7bfeff SigIgn: 0000000000001000 SigCgt: 0000000120814423 $ sudo cat /proc/119/status |grep ^Sig SigQ: 2/63467 SigPnd: 0000000000000000 SigBlk: fffffffe5f7bfeff SigIgn: 0000000000001000 SigCgt: 0000000120814423 $ sudo cat /proc/120/status |grep ^Sig SigQ: 2/63467 SigPnd: 0000000000000000 SigBlk: fffffffe5f7bfeff SigIgn: 0000000000001000 SigCgt: 0000000120814423 $ sudo cat /proc/123/status |grep ^Sig SigQ: 2/63467 SigPnd: 0000000000000000 SigBlk: fffffffe5ffbfeff SigIgn: 0000000000001000 SigCgt: 0000000120814423 --8<---------------cut here---------------end--------------->8--- Then all of a sudden, as I’m conducting this experiment, I see: --8<---------------cut here---------------start------------->8--- May 26 09:36:37 localhost ntpd[23814]: ntpd exiting on signal 15 (Terminated) […] May 26 10:52:10 localhost shepherd[1]: Respawning nscd. May 26 10:52:10 localhost shepherd[1]: Service guix-publish (PID 23822) terminated with signal 15. May 26 10:52:10 localhost shepherd[1]: Respawning ntpd. May 26 10:52:10 localhost shepherd[1]: Respawning guix-publish. May 26 10:52:10 localhost shepherd[1]: Starting service nscd... May 26 10:52:10 localhost shepherd[1]: Starting service ntpd... May 26 10:52:10 localhost shepherd[1]: Service ntpd has been started. May 26 10:52:10 localhost shepherd[1]: Service ntpd started. May 26 10:52:10 localhost shepherd[1]: Service ntpd running with value 29148. --8<---------------cut here---------------end--------------->8--- 10:52 is about the time I attached gdb to shepherd… Interestingly, we get a “terminated with signal 15” message from shepherd for guix-publish, but not for ntpd. Long story short: there seems to be a problem with signal delivery. Most likely, the initial grace period expiration above when stopping nscd is a symptom of shepherd no longer receiving/processing SIGCHLD rather than the cause. To be continued… Ludo’.