While on a quest for flaky tests in the Shepherd, I found a genuine bug that would manifest with this ‘tests/basic.sh’ failure:
--8<---------------cut here---------------start------------->8---
+ herd -s t-socket-21679 status test-run-from-nonexistent-directory
+ sleep 0.5
+ herd -s t-socket-21679 status test-run-from-nonexistent-directory
+ grep 'exited with code 127'
+ sleep 0.5
+ herd -s t-socket-21679 status test-run-from-nonexistent-directory
+ grep 'exited with code 127'

[…]

2025-03-06 14:06:36 Service test-run-from-nonexistent-directory started.
2025-03-06 14:06:36 Failed to run "/gnu/store/3bg5qfsmjw6p7bh1xadarbaq246zis0d-coreutils-9.1/bin/pwd": In procedure chdir: No such file or directory
2025-03-06 14:06:36 Service test-run-from-nonexistent-directory running with value #<<process> id: 22431 command: ("/gnu/store/3bg5qfsmjw6p7bh1xadarbaq246zis0d-coreutils-9.1/bin/pwd")>.
2025-03-06 14:06:36 Service test-run-from-nonexistent-directory has been started.
2025-03-06 14:06:36 Service test-run-from-nonexistent-directory has been disabled.
2025-03-06 14:11:51 Stopping service root...
--8<---------------cut here---------------end--------------->8---

What happens is that the service is not marked as “exited with code 127”; instead, it is marked as having exited with code 0:

--8<---------------cut here---------------start------------->8---
● Status of test-run-from-nonexistent-directory:
  It is stopped since 14:06:36 (37 seconds ago).
  Process exited successfully.
  It is disabled.
  Provides: test-run-from-nonexistent-directory
  Will not be respawned.
--8<---------------cut here---------------end--------------->8---

This is due to a race condition: the process terminates before its service goes from ‘starting’ to ‘running’. By the time the service controller calls ‘monitor-service-process’, the process has already terminated, so the process monitor replies 0 to the 'await request because that process no longer exists.

Attached is a test that reproduces the problem.

Ludo’.
# GNU Shepherd --- Handling termination of a process before 'start' completes.
# Copyright © 2025 Ludovic Courtès <l...@gnu.org>
#
# This file is part of the GNU Shepherd.
#
# The GNU Shepherd is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or (at
# your option) any later version.
#
# The GNU Shepherd is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with the GNU Shepherd.  If not, see <http://www.gnu.org/licenses/>.

shepherd --version
herd --version

socket="t-socket-$$"
conf="t-conf-$$"
log="t-log-$$"
pid="t-pid-$$"

herd="herd -s $socket"

trap "cat $log || true; rm -f $socket $conf $log; test -f $pid && kill \`cat $pid\` || true; rm -f $pid" EXIT

cat > "$conf" <<EOF
(register-services
 (list (service
         '(stops-early)
         #:start (lambda ()
                   (let ((pid (fork+exec-command
                               '("$SHELL" "-c" "echo done; exit 42"))))
                     (format #t "got PID ~a; sleeping~%" pid)
                     ;; Artificially wait until PID is gone for sure.
                     (let loop ()
                       (when (false-if-exception (begin (kill pid 0) #t))
                         (sleep 0.5)
                         (loop)))
                     pid))
         #:stop (make-kill-destructor)
         #:respawn? #f)))
EOF

rm -f "$pid"
shepherd -I -s "$socket" -c "$conf" --pid="$pid" --log="$log" &

# Wait till it's ready.
until test -f "$pid"; do sleep 0.3; done

$herd status
$herd start stops-early
$herd status stops-early
$herd status stops-early | grep stopped
$herd status stops-early | grep 'exited with code 42'
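
For readers who want to see the shape of the race outside the Shepherd, here is a rough shell sketch. It is only a toy model and not Shepherd code: the function name late_await and the variable names are made up for illustration. The assumption that an observer who finds the process gone answers 0 mirrors the behaviour described in the report, not the actual implementation. The exit status is only available to whoever reaps the child; anyone who looks up the PID afterwards finds nothing.

```shell
#!/bin/sh
# Toy model of the race; NOT Shepherd code.  All names are hypothetical.

# A late observer that only knows a PID.  If the process is already
# gone, it has no exit status to report and (by assumption, mirroring
# the behaviour described in the report) answers 0.
late_await() {
    if kill -0 "$1" 2>/dev/null; then
        echo "still running"
    else
        echo 0
    fi
}

# Short-lived process that fails, as in the attached test.
sh -c 'exit 42' &
pid=$!

# The parent reaps the child; this is the only point where the real
# exit status is available.
real_status=0
wait "$pid" || real_status=$?

# Asking about the PID after it has been reaped loses the status.
late_status=$(late_await "$pid")

echo "real exit status: $real_status"
echo "late observer reports: $late_status"
```

In this model the fix direction would be to register interest in the PID before the process can terminate, or to keep the status around for late requests; which of these suits the Shepherd is left open here.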