Hello Reepca, So glad to get your message. :-)
Reepca Russelstein <ree...@russelstein.xyz> skribis: > I assume you're not using cached build failures, right? It defaults to > off, so it should only be in use if you explicitly passed > --cache-failures to guix-daemon (or had your client pass a value of > "true" for "build-cache-failure", but I don't see that string anywhere > in the guile side of guix). Right, not using cached build failures on these machines. > One detail that worries me is that SubstitutionGoal.outputLock is a > std::shared_ptr<PathLocks> instead of a PathLocks. I cannot for the > life of me figure out why this is the case, since no sharing of > ownership of the underlying PathLocks seems to ever be done. > outputLock.reset() seems to be used as if it were a synonym for > outputLock.unlock() in many places, so whoever wrote those uses probably > thought the same. As far as I can tell, it *should* always cause > PathLocks::unlock to be called, so it shouldn't be an issue, it just > worries me that I don't understand why it's done that way. Yes, I had the same thoughts. I checked the “shared_ptr<PathLocks> outputLock;” member is here since at least 2005, when the file was renamed to build.cc. > Ahh, I think I may have just found another place in > DerivationGoal::tryToBuild that might be the source of your issues. > First, note the comment: > > /* Obtain locks on all output paths. The locks are automatically > released when we exit this function or the client crashes. ... */ > > Well, as I noticed earlier in the cached build failures case, this isn't > exactly accurate, and there's probably a reason why. The comment seems > to believe specifically that leaving the scope of tryToBuild will > automatically release the locks, which suggests to me that at one point > outputLocks was a local variable, rather than a field of DerivationGoal. > This theorized older version would be consistent with what we see in > this (current) snippet: > > if (buildMode != bmCheck && validPaths.size() == drv.outputs.size()) { > debug(format("skipping build of derivation `%1%', someone beat us to > it") % drvPath); > outputLocks.setDeletion(true); > done(BuildResult::AlreadyValid); > return; > } Uh! Sounds like a plausible problem. > If leaving the scope of tryToBuild would cause outputLocks.unlock to > run, then outputLocks.setDeletion is all that would need to be called. > But since that isn't the case, this will cause the lock to be held for > as long as the DerivationGoal exists. Yep. [...] > From 2030f198892f972ee6bc2fb8529cbd6afb6bc9ad Mon Sep 17 00:00:00 2001 > From: Reepca Russelstein <ree...@russelstein.xyz> > Date: Tue, 24 Dec 2024 05:40:58 -0600 > Subject: [PATCH] nix: build.cc: explicitly unlock in the has-become-valid > case. > > (hopefully) fixes #31785. Similar to Nix issue #178, fixed in > https://github.com/NixOS/nix/commit/29cde917fe6b8f2e669c8bf10b38f640045c83b8. > > We can't rely on Goal deletion to release our locks in a timely manner. In > the case in which multiple guix-daemon processes simultaneously try producing > an output path path1, the one that gets there first (P1) will get the lock, > and the second one (P2) will continue trying to acquire the lock until it is > released. Once it has acquired the lock, it checks to see whether the path > has already become valid in the meantime, and if so it reports success to > those Goals waiting on its completion and finishes. Unfortunately, it fails > to release the locks it holds first, so those stay held until that Goal gets > deleted. > > Suppose we have the following store path dependency graph: > > path4 > / | \ > path1 path2 path3 > > P2 is now sitting on path1's lock, and will continue to do so until path4 is > completed. Suppose there is also a P3, and it has been blocked while P1 > builds path2. Now P3 is sitting on path2's lock, and can't acquire path1's > lock to determine that it has been completed. Likewise P2 is sitting on > path1's lock, and now can't acquire path2's lock to determine that it has been > completed. Finally, P3 completes path3 while P2 is blocked. > > Now: > > - P1 knows that path1 and path2 are complete, and holds no locks, but can't > determine that path3 is complete > - P2 knows that path1 and path3 are complete, and holds locks on path1 and > path3, but can't determine that path2 is complete > - P3 knows that path2 and path3 are complete, and holds a lock on path2, but > can't determine that path1 is complete > > And none of these locks will be released until path4 is complete. Thus, we > have a deadlock. > > To resolve this, we should explicitly release these locks as soon as they > should be released. > > * nix/libstore/build.cc > (DerivationGoal::tryToBuild, SubstitutionGoal::tryToRun): > Explicitly release locks in the has-become-valid case. > > Change-Id: Ie510f84828892315fe6776c830db33d0f70bcef8 I’m very much convinced by the patch. Yet it bothers me that I cannot reproduce the problem. I tried first with this test, which attempts to reproduce what you describe in the commit log above: --8<---------------cut here---------------start------------->8--- ;; https://issues.guix.gnu.org/31785 (use-modules (guix) ((gnu packages) #:select (specification->package)) (srfi srfi-1) (ice-9 threads) (ice-9 match) (ice-9 textual-ports)) (define (nonce) (logxor (car (gettimeofday)) (cdr (gettimeofday)) (getpid))) (define drv1 (computed-file "drv1" #~(begin #$(nonce) (sleep 3) (mkdir #$output)))) (define drv2 (computed-file "drv2" #~(begin #$(nonce) (sleep 3) (mkdir #$output)))) (define drv3 (computed-file "drv3" #~(begin #$(nonce) (sleep 3) (mkdir #$output)))) (define drv4 (computed-file "drv4" #~(begin #$(nonce) (sleep 3) (pk 'deps: #$drv1 #$drv2 #$drv3) (mkdir #$output)))) (%graft? #f) (define (log-port prefix) "Return a logging port that prefixes each line with PREFIX." (make-custom-textual-output-port "log" (lambda (str start count) (let ((str (substring str start (+ start count)))) (format (current-error-port) "~a: ~a" prefix str) count)) #f #f #f)) (setvbuf (current-error-port) 'line) (set-port-encoding! (current-error-port) "UTF-8") (let* ((builder (lambda (name lst) (call-with-new-thread (lambda () (parameterize ((current-build-output-port (log-port name))) (with-store store (set-build-options store #:verbosity 4) (run-with-store store (mlet %store-monad ((lst (mapm %store-monad lower-object lst))) (built-derivations lst))))))))) (thread1 (builder 'thread1 (list drv4))) (thread2 (builder 'thread2 (list drv4 drv1 drv2))) (thread3 (builder 'thread3 (list drv4 drv3 drv2)))) (join-thread thread1) (join-thread thread2) (join-thread thread3)) --8<---------------cut here---------------end--------------->8--- But there’s no deadlock, and I think that’s because the problem we’re seeing has to do with substitute goals, and there’s no such goal here. So then I tried this:
diff --git a/tests/store.scm b/tests/store.scm index 45948f4f43..224b9867c4 100644 --- a/tests/store.scm +++ b/tests/store.scm @@ -1,5 +1,5 @@ ;;; GNU Guix --- Functional package management for GNU -;;; Copyright © 2012-2021, 2023 Ludovic Courtès <l...@gnu.org> +;;; Copyright © 2012-2021, 2023, 2024 Ludovic Courtès <l...@gnu.org> ;;; ;;; This file is part of GNU Guix. ;;; @@ -35,6 +35,7 @@ (define-module (test-store) #:use-module (gnu packages bootstrap) #:use-module (ice-9 match) #:use-module (ice-9 regex) + #:use-module (ice-9 threads) #:use-module (rnrs bytevectors) #:use-module (rnrs io ports) #:use-module (web uri) @@ -1042,6 +1043,52 @@ (define %shell (ensure-path s item) (path-info-nar-size (query-path-info s item))))) +(test-assert "substitute deadlock" + (with-store s + (let* ((guile (package-derivation s %bootstrap-guile (%current-system))) + (c (random-text)) ;contents of the output + (drv1 (build-expression->derivation + s "substitute-me1" + `(begin ,c (exit 1)) ;would actually fail + #:guile-for-build guile)) + (drv2 (build-expression->derivation + s "substitute-me2" + `(begin ,c (exit 1)) ;would actually fail + #:guile-for-build guile)) + (drv3 (build-expression->derivation + s "depends-on-substitutable1" + `(call-with-output-file %output + (lambda (p) + (display ,c p))) + #:inputs `(("drv1" ,drv1)) + #:guile-for-build guile)) + (drv4 (build-expression->derivation + s "depends-on-substitutable2" + `(call-with-output-file %output + (lambda (p) + (display ,c p))) + #:inputs `(("drv2" ,drv2)) + #:guile-for-build guile))) + (with-derivation-substitute drv1 c + (with-derivation-substitute drv2 c + (system* "ls" "-l" "/tmp/subst") + (let* ((builder (lambda (drv) + (call-with-new-thread + (lambda () + (with-store store + (set-build-options store + #:use-substitutes? #t + #:substitute-urls + (%test-substitute-urls)) + (build-things store + (list (derivation-file-name drv)))))))) + (thread1 (builder drv3)) + (thread2 (builder drv4))) + (join-thread thread1) + (join-thread thread2) + (and (valid-path? s (derivation->output-path drv3)) + (valid-path? s (derivation->output-path drv4))))))))) + (test-assert "export/import several paths" (let* ((texts (unfold (cut >= <> 10) (lambda _ (random-text))
But I think we’d have to be lucky for it to trigger the deadlock (lucky because the two substitutes should be fetched at the same instant, roughly). Given that it’s a nasty issue, I’d prefer to have an (ideally automated) test. Thoughts? Ideas? Anyway, thanks for thinking through it and for explaining your reasoning and coming up with a patch! And: merry end-of-year holidays too. :-) Ludo’.