While on a quest for flaky tests in the Shepherd, I found a genuine bug
that would manifest with this ‘tests/basic.sh’ failure:

--8<---------------cut here---------------start------------->8---
+ herd -s t-socket-21679 status test-run-from-nonexistent-directory
+ sleep 0.5
+ herd -s t-socket-21679 status test-run-from-nonexistent-directory
+ grep 'exited with code 127'
+ sleep 0.5
+ herd -s t-socket-21679 status test-run-from-nonexistent-directory
+ grep 'exited with code 127'
[…]
2025-03-06 14:06:36 Service test-run-from-nonexistent-directory started.
2025-03-06 14:06:36 Failed to run "/gnu/store/3bg5qfsmjw6p7bh1xadarbaq246zis0d-coreutils-9.1/bin/pwd": In procedure chdir: No such file or directory
2025-03-06 14:06:36 Service test-run-from-nonexistent-directory running with value #<<process> id: 22431 command: ("/gnu/store/3bg5qfsmjw6p7bh1xadarbaq246zis0d-coreutils-9.1/bin/pwd")>.
2025-03-06 14:06:36 Service test-run-from-nonexistent-directory has been started.
2025-03-06 14:06:36 Service test-run-from-nonexistent-directory has been disabled.
2025-03-06 14:11:51 Stopping service root...
--8<---------------cut here---------------end--------------->8---

The service’s status never reports “exited with code 127”; instead, the
service is marked as having exited successfully, with code 0:

--8<---------------cut here---------------start------------->8---
● Status of test-run-from-nonexistent-directory:
  It is stopped since 14:06:36 (37 seconds ago).
  Process exited successfully.
  It is disabled.
  Provides: test-run-from-nonexistent-directory
  Will not be respawned.
--8<---------------cut here---------------end--------------->8---

This is due to a race condition: the process terminates before its
service goes from ‘starting’ to ‘running’.

By the time the service controller calls ‘monitor-service-process’, the
process has already terminated.  Since that process no longer exists, the
process monitor replies 0 to the 'await request, and the actual exit
status (127 in the log above) is lost.
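
To illustrate the kind of race involved, here is a minimal shell analogy.
It is not Shepherd code: the variable names ‘child’, ‘recorded_status’,
and ‘late_view’ are made up for the example, and it only assumes a POSIX
shell.  It shows that once a short-lived child has been reaped, an
observer that arrives later with nothing but the PID cannot recover the
exit status unless someone recorded it at reaping time.

```shell
#!/bin/sh
# Illustrative analogy only; this is not how the Shepherd is implemented.

# Start a short-lived child, much like the process of a service that
# terminates before its 'start' method has completed.
sh -c 'exit 42' &
child=$!

# The parent reaps the child and records its real exit status.
recorded_status=0
wait "$child" || recorded_status=$?

# A late observer that only knows the PID probes it with signal 0.
# The process is gone, so nothing about its exit status can be learned
# from the PID alone; falling back to a default such as 0 would be wrong.
if kill -0 "$child" 2>/dev/null; then
    late_view="still running"
else
    late_view="gone; exit status unknown unless it was recorded"
fi

echo "recorded status: $recorded_status"
echo "late observer sees: $late_view"
```

In this sketch the status survives only because it was saved when the
child was reaped; an observer relying on the PID alone has no way to tell
a failure from a success.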

Attached is a test that reproduces the problem.

Ludo’.

# GNU Shepherd --- Handling termination of a process before 'start' completes.
# Copyright © 2025 Ludovic Courtès <l...@gnu.org>
#
# This file is part of the GNU Shepherd.
#
# The GNU Shepherd is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or (at
# your option) any later version.
#
# The GNU Shepherd is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with the GNU Shepherd.  If not, see <http://www.gnu.org/licenses/>.

shepherd --version
herd --version

socket="t-socket-$$"
conf="t-conf-$$"
log="t-log-$$"
pid="t-pid-$$"

herd="herd -s $socket"

trap "cat $log || true; rm -f $socket $conf $log;
      test -f $pid && kill \`cat $pid\` || true; rm -f $pid" EXIT

cat > "$conf" <<EOF
(register-services
  (list (service
          '(stops-early)
          #:start (lambda ()
                    (let ((pid (fork+exec-command
                                '("$SHELL" "-c" "echo done; exit 42"))))
                      (format #t "got PID ~a; sleeping~%" pid)

                      ;; Artificially wait until PID is gone for sure.
                      (let loop ()
                        (when (false-if-exception (begin (kill pid 0) #t))
                          (sleep 0.5)
                          (loop)))
                      pid))
          #:stop (make-kill-destructor)
          #:respawn? #f)))
EOF

rm -f "$pid"
shepherd -I -s "$socket" -c "$conf" --pid="$pid" --log="$log" &

# Wait till it's ready.
until test -f "$pid"; do sleep 0.3; done

$herd status
$herd start stops-early
$herd status stops-early
$herd status stops-early | grep stopped
$herd status stops-early | grep 'exited with code 42'
