Hello Reepca,

So glad to get your message.  :-)

Reepca Russelstein <ree...@russelstein.xyz> skribis:

> I assume you're not using cached build failures, right?  It defaults to
> off, so it should only be in use if you explicitly passed
> --cache-failures to guix-daemon (or had your client pass a value of
> "true" for "build-cache-failure", but I don't see that string anywhere
> in the guile side of guix).

Right, not using cached build failures on these machines.

> One detail that worries me is that SubstitutionGoal.outputLock is a
> std::shared_ptr<PathLocks> instead of a PathLocks.  I cannot for the
> life of me figure out why this is the case, since no sharing of
> ownership of the underlying PathLocks seems to ever be done.
> outputLock.reset() seems to be used as if it were a synonym for
> outputLock.unlock() in many places, so whoever wrote those uses probably
> thought the same.  As far as I can tell, it *should* always cause
> PathLocks::unlock to be called, so it shouldn't be an issue, it just
> worries me that I don't understand why it's done that way.

Yes, I had the same thoughts.  I checked the “shared_ptr<PathLocks>
outputLock;” member is here since at least 2005, when the file was
renamed to build.cc.

> Ahh, I think I may have just found another place in
> DerivationGoal::tryToBuild that might be the source of your issues.
> First, note the comment:
>
> /* Obtain locks on all output paths.  The locks are automatically
>    released when we exit this function or the client crashes. ... */
>
> Well, as I noticed earlier in the cached build failures case, this isn't
> exactly accurate, and there's probably a reason why.  The comment seems
> to believe specifically that leaving the scope of tryToBuild will
> automatically release the locks, which suggests to me that at one point
> outputLocks was a local variable, rather than a field of DerivationGoal.
> This theorized older version would be consistent with what we see in
> this (current) snippet:
>
>     if (buildMode != bmCheck && validPaths.size() == drv.outputs.size()) {
>         debug(format("skipping build of derivation `%1%', someone beat us to 
> it") % drvPath);
>         outputLocks.setDeletion(true);
>         done(BuildResult::AlreadyValid);
>         return;
>     }

Uh!  Sounds like a plausible problem.

> If leaving the scope of tryToBuild would cause outputLocks.unlock to
> run, then outputLocks.setDeletion is all that would need to be called.
> But since that isn't the case, this will cause the lock to be held for
> as long as the DerivationGoal exists.

Yep.

[...]

> From 2030f198892f972ee6bc2fb8529cbd6afb6bc9ad Mon Sep 17 00:00:00 2001
> From: Reepca Russelstein <ree...@russelstein.xyz>
> Date: Tue, 24 Dec 2024 05:40:58 -0600
> Subject: [PATCH] nix: build.cc: explicitly unlock in the has-become-valid
>  case.
>
> (hopefully) fixes #31785.  Similar to Nix issue #178, fixed in
> https://github.com/NixOS/nix/commit/29cde917fe6b8f2e669c8bf10b38f640045c83b8.
>
> We can't rely on Goal deletion to release our locks in a timely manner.  In
> the case in which multiple guix-daemon processes simultaneously try producing
> an output path path1, the one that gets there first (P1) will get the lock,
> and the second one (P2) will continue trying to acquire the lock until it is
> released.  Once it has acquired the lock, it checks to see whether the path
> has already become valid in the meantime, and if so it reports success to
> those Goals waiting on its completion and finishes.  Unfortunately, it fails
> to release the locks it holds first, so those stay held until that Goal gets
> deleted.
>
> Suppose we have the following store path dependency graph:
>
>           path4
>       /     |     \
>    path1   path2   path3
>
> P2 is now sitting on path1's lock, and will continue to do so until path4 is
> completed.  Suppose there is also a P3, and it has been blocked while P1
> builds path2.  Now P3 is sitting on path2's lock, and can't acquire path1's
> lock to determine that it has been completed.  Likewise P2 is sitting on
> path1's lock, and now can't acquire path2's lock to determine that it has been
> completed.  Finally, P3 completes path3 while P2 is blocked.
>
> Now:
>
> - P1 knows that path1 and path2 are complete, and holds no locks, but can't
>   determine that path3 is complete
> - P2 knows that path1 and path3 are complete, and holds locks on path1 and
>   path3, but can't determine that path2 is complete
> - P3 knows that path2 and path3 are complete, and holds a lock on path2, but
>   can't determine that path1 is complete
>
> And none of these locks will be released until path4 is complete.  Thus, we
> have a deadlock.
>
> To resolve this, we should explicitly release these locks as soon as they
> should be released.
>
> * nix/libstore/build.cc
>   (DerivationGoal::tryToBuild, SubstitutionGoal::tryToRun):
>   Explicitly release locks in the has-become-valid case.
>
> Change-Id: Ie510f84828892315fe6776c830db33d0f70bcef8

I’m very much convinced by the patch.  Yet it bothers me that I cannot
reproduce the problem.  I tried first with this test, which attempts to
reproduce what you describe in the commit log above:

--8<---------------cut here---------------start------------->8---
;; https://issues.guix.gnu.org/31785

(use-modules (guix)
             ((gnu packages) #:select (specification->package))
             (srfi srfi-1)
             (ice-9 threads)
             (ice-9 match)
             (ice-9 textual-ports))

(define (nonce)
  (logxor (car (gettimeofday)) (cdr (gettimeofday))
          (getpid)))

(define drv1
  (computed-file "drv1" #~(begin #$(nonce) (sleep 3) (mkdir #$output))))
(define drv2
  (computed-file "drv2" #~(begin #$(nonce) (sleep 3) (mkdir #$output))))
(define drv3
  (computed-file "drv3" #~(begin #$(nonce) (sleep 3) (mkdir #$output))))
(define drv4
  (computed-file "drv4" #~(begin #$(nonce) (sleep 3)
                                 (pk 'deps: #$drv1 #$drv2 #$drv3)
                                 (mkdir #$output))))

(%graft? #f)

(define (log-port prefix)
  "Return a logging port that prefixes each line with PREFIX."
  (make-custom-textual-output-port
   "log"
   (lambda (str start count)
     (let ((str (substring str start (+ start count))))
       (format (current-error-port) "~a: ~a" prefix str)
       count))
   #f #f #f))

(setvbuf (current-error-port) 'line)
(set-port-encoding! (current-error-port) "UTF-8")
(let* ((builder (lambda (name lst)
                  (call-with-new-thread
                   (lambda ()
                     (parameterize ((current-build-output-port
                                     (log-port name)))
                      (with-store store
                        (set-build-options store #:verbosity 4)
                        (run-with-store store
                          (mlet %store-monad ((lst (mapm %store-monad
                                                         lower-object lst)))
                            (built-derivations lst)))))))))
       (thread1 (builder 'thread1 (list drv4)))
       (thread2 (builder 'thread2 (list drv4 drv1 drv2)))
       (thread3 (builder 'thread3 (list drv4 drv3 drv2))))
  (join-thread thread1)
  (join-thread thread2)
  (join-thread thread3))
--8<---------------cut here---------------end--------------->8---

But there’s no deadlock, and I think that’s because the problem we’re
seeing has to do with substitute goals, and there’s no such goal here.

So then I tried this:

diff --git a/tests/store.scm b/tests/store.scm
index 45948f4f43..224b9867c4 100644
--- a/tests/store.scm
+++ b/tests/store.scm
@@ -1,5 +1,5 @@
 ;;; GNU Guix --- Functional package management for GNU
-;;; Copyright © 2012-2021, 2023 Ludovic Courtès <l...@gnu.org>
+;;; Copyright © 2012-2021, 2023, 2024 Ludovic Courtès <l...@gnu.org>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -35,6 +35,7 @@ (define-module (test-store)
   #:use-module (gnu packages bootstrap)
   #:use-module (ice-9 match)
   #:use-module (ice-9 regex)
+  #:use-module (ice-9 threads)
   #:use-module (rnrs bytevectors)
   #:use-module (rnrs io ports)
   #:use-module (web uri)
@@ -1042,6 +1043,52 @@ (define %shell
       (ensure-path s item)
       (path-info-nar-size (query-path-info s item)))))
 
+(test-assert "substitute deadlock"
+  (with-store s
+    (let* ((guile (package-derivation s %bootstrap-guile (%current-system)))
+           (c (random-text))                      ;contents of the output
+           (drv1 (build-expression->derivation
+                  s "substitute-me1"
+                  `(begin ,c (exit 1))            ;would actually fail
+                  #:guile-for-build guile))
+           (drv2 (build-expression->derivation
+                  s "substitute-me2"
+                  `(begin ,c (exit 1))            ;would actually fail
+                  #:guile-for-build guile))
+           (drv3 (build-expression->derivation
+                  s "depends-on-substitutable1"
+                  `(call-with-output-file %output
+                     (lambda (p)
+                       (display ,c p)))
+                  #:inputs `(("drv1" ,drv1))
+                  #:guile-for-build guile))
+           (drv4 (build-expression->derivation
+                  s "depends-on-substitutable2"
+                  `(call-with-output-file %output
+                     (lambda (p)
+                       (display ,c p)))
+                  #:inputs `(("drv2" ,drv2))
+                  #:guile-for-build guile)))
+      (with-derivation-substitute drv1 c
+        (with-derivation-substitute drv2 c
+          (system* "ls" "-l" "/tmp/subst")
+          (let* ((builder (lambda (drv)
+                            (call-with-new-thread
+                             (lambda ()
+                               (with-store store
+                                 (set-build-options store
+                                                    #:use-substitutes? #t
+                                                    #:substitute-urls
+                                                    (%test-substitute-urls))
+                                 (build-things store
+                                               (list (derivation-file-name drv))))))))
+                 (thread1 (builder drv3))
+                 (thread2 (builder drv4)))
+            (join-thread thread1)
+            (join-thread thread2)
+            (and (valid-path? s (derivation->output-path drv3))
+                 (valid-path? s (derivation->output-path drv4)))))))))
+
 (test-assert "export/import several paths"
   (let* ((texts (unfold (cut >= <> 10)
                         (lambda _ (random-text))
But I think we’d have to be lucky for it to trigger the deadlock (lucky
because the two substitutes should be fetched at the same instant,
roughly).

Given that it’s a nasty issue, I’d prefer to have an (ideally automated)
test.

Thoughts? Ideas?

Anyway, thanks for thinking through it and for explaining your reasoning
and coming up with a patch!

And: merry end-of-year holidays too.  :-)

Ludo’.

Reply via email to