Hi,

I've tried to determine why a workaround in the jami-service-type is
required in the 'stop' slot to avoid failures in 'herd restart jami',
and haven't quite found the culprit, but it appears to me that:

1. waitpid is only called in one place in Shepherd, which is in the
   handle-SIGCHLD procedure in (shepherd service), which does not
   specifically wait for an exact PID but rather does:

   (waitpid* WAIT_ANY WNOHANG), which is waitpid with some special
   handling in the case a system-error exception is thrown with an
   ECHILD or EINTR error number.

   This doesn't strike me as a strong guarantee that waitpid occurs
   when stop is called, because:

   1. It requires the handler to be installed for each process, with
      something like:

--8<---------------cut here---------------start------------->8---
(unless %sigchld-handler-installed?
  (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
  (set! %sigchld-handler-installed? #t))
--8<---------------cut here---------------end--------------->8---

      Done for fork+exec-command and make-inetd-forkexec-constructor,
      but not for make-forkexec-constructor/container, AFAICT;

   2. it has the WNOHANG flag, which means that stop simply does a
      kill, and then the signal handler only weakly (because of
      WNOHANG) waits on it, which means the start may begin before
      the process was actually completely terminated.

Here's a small reproducer to apply on our code base:

--8<---------------cut here---------------start------------->8---
modified   gnu/services/telephony.scm
@@ -685,13 +685,7 @@ (define (archive-name->username archive)
                        ;; Finally, return the PID of the daemon process.
                        daemon-pid))
-                   (stop
-                    #~(lambda (pid . args)
-                        (kill pid SIGKILL)
-                        ;; Wait for the process to exit; this prevents overlapping
-                        ;; processes when issuing 'herd restart'.
-                        (waitpid pid)
-                        #f))))))))
+                   (stop #~(make-kill-destructor))))))))
 
 (define jami-service-type
   (service-type
--8<---------------cut here---------------end--------------->8---

Then run 'make check-system TESTS=jami-provisioning' to see new
failures, or, if you want to investigate the system manually:

--8<---------------cut here---------------start------------->8---
$ ./pre-inst-env guix system vm --no-grafts --no-offload --no-graphic \
  -e '(@@ (gnu tests telephony) %jami-os-provisioning)'
$ /gnu/store/rxi7c14hga62qslb0sr6nac9qnkxr0nn-run-vm.sh -m 1G -smp 4 \
  -nic user,model=virtio-net-pci,hostfwd=tcp::10022-:22

# Connect to the QEMU VM:
$ ssh root@localhost -p10022

root@jami ~# herd restart jami
Service jami has been stopped.
herd: exception caught while executing 'start' on service 'jami':
dbus "method failed with error" "org.freedesktop.DBus.Error.NoReply" ("Message recipient disconnected from message bus without replying")
root@jami ~# herd status jami
Status of jami:
  It is stopped.
  It is enabled.
  Provides (jami).
  Requires (jami-dbus-session).
  Conflicts with ().
  Will be respawned.
root@jami ~# pgrep jami
--8<---------------cut here---------------end--------------->8---

Thanks,

Maxim
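
P.S. To illustrate the WNOHANG point outside of Shepherd, here is a
minimal sketch in plain Guile. It is not Shepherd code: the names
kill-and-reap, child, nonblocking-result and term-signal are made up
for the example, and it says nothing about how an explicit waitpid
interacts with Shepherd's own SIGCHLD handler. It only shows that a
WNOHANG call returns immediately without reaping anything while the
child is alive, whereas a blocking waitpid on the exact PID returns
only once the process is really gone.

```scheme
;; Sketch only: plain Guile, hypothetical names, not Shepherd code.

(define (kill-and-reap pid)
  "Send SIGKILL to PID and block until it has been reaped.  Return the
number of the signal that terminated it, or #f if it exited normally."
  (kill pid SIGKILL)
  (let ((status (cdr (waitpid pid))))   ; blocking call: no WNOHANG
    (status:term-sig status)))

;; Fork a child that just sleeps, standing in for a long-running daemon.
(define child
  (let ((pid (primitive-fork)))
    (if (zero? pid)
        (begin (sleep 60) (primitive-exit 0))   ; child process
        pid)))                                  ; parent: keep the PID

;; Non-blocking check while the child is still running: waitpid returns
;; right away with 0 in the car, meaning no child was reaped.
(define nonblocking-result (car (waitpid child WNOHANG)))

;; Blocking variant: returns only after the child has terminated.
(define term-signal (kill-and-reap child))
```

After this runs, nonblocking-result is 0 and term-signal is equal to
SIGKILL, which is the ordering guarantee the explicit (waitpid pid) in
the current 'stop' slot provides.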