l...@gnu.org (Ludovic Courtès) skribis: > On the build machine side, the guile process that forwards data between > the sshd and guix-daemon¹ is stuck on: > > read(0, …)
> ¹ > https://git.savannah.gnu.org/cgit/guix.git/tree/guix/ssh.scm?id=0362e5820ab6a1eb8eaf33bc47e592857c25f765#n102 I’ve been able to strace this Guile process. When “things work normally”, its loop does things like this: --8<---------------cut here---------------start------------->8--- 20456 10:52:17 read(9, "gmlo\0\0\0\0\201\0\0\0\0\0\0\0make[2]: Entering directory '/tmp/guix-build-make-4.2.1.drv-0/make-4.2.1/doc'\nmake[2]: Nothing to be done for 'install-exec-am'.\n\0\0\0\0\0\0\0", 65536) = 152 20456 10:52:17 write(1, "gmlo\0\0\0\0\201\0\0\0\0\0\0\0make[2]: Entering directory '/tmp/guix-build-make-4.2.1.drv-0/make-4.2.1/doc'\nmake[2]: Nothing to be done for 'install-exec-am'.\n\0\0\0\0\0\0\0", 152) = 152 20456 10:52:17 rt_sigprocmask(SIG_BLOCK, NULL, <unfinished ...> 20456 10:52:17 <... rt_sigprocmask resumed> [], 8) = 0 20456 10:52:17 pselect6(10, [0 3 9], [], [], NULL, NULL <unfinished ...> 20456 10:52:17 <... pselect6 resumed> ) = 1 (in [9]) 20456 10:52:17 read(9, "gmlo\0\0\0\0\223\0\0\0\0\0\0\0 /gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/mkdir -p '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0\0", 65536) = 168 20456 10:52:17 write(1, "gmlo\0\0\0\0\223\0\0\0\0\0\0\0 /gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/mkdir -p '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0\0", 168) = 168 20456 10:52:17 rt_sigprocmask(SIG_BLOCK, NULL, <unfinished ...> 20456 10:52:17 <... rt_sigprocmask resumed> [], 8) = 0 20456 10:52:17 pselect6(10, [0 3 9], [], [], NULL, NULL <unfinished ...> 20456 10:52:17 <... pselect6 resumed> ) = 1 (in [9]) 20456 10:52:17 read(9, "gmlo\0\0\0\0\304\0\0\0\0\0\0\0 /gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/install -c -m 644 ./make.info ./make.info-1 ./make.info-2 '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0", 65536) = 216 20456 10:52:17 write(1, "gmlo\0\0\0\0\304\0\0\0\0\0\0\0 /gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/install -c -m 644 ./make.info ./make.info-1 ./make.info-2 '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0", 216) = 216 --8<---------------cut here---------------end--------------->8--- Here FD 9 is the guix-daemon socket and 1 is stdout, which is connected to the remote ‘guix offload’ process. We then see a slightly different sequence leading to the read(0, …) call that hangs: --8<---------------cut here---------------start------------->8--- 20456 10:52:18 pselect6(10, [0 3 9], [], [], NULL, NULL <unfinished ...> 20456 10:52:18 <... pselect6 resumed> ) = 1 (in [9]) 20456 10:52:18 read(9, "gmlo\0\0\0\0005\0\0\0\0\0\0\0phase `validate-runpath' succeeded after 0.0 seconds\n\0\0\0", 65536) = 72 20456 10:52:18 write(1, "gmlo\0\0\0\0005\0\0\0\0\0\0\0phase `validate-runpath' succeeded after 0.0 seconds\n\0\0\0", 72) = 72 20456 10:52:18 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=261868144}) = 0 20456 10:52:18 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 20456 10:52:18 madvise(0xffffb0025000, 4096, MADV_DONTNEED <unfinished ...> 20456 10:52:18 <... madvise resumed> ) = 0 20456 10:52:18 write(6, "\0", 1 <unfinished ...> 20456 10:52:18 <... write resumed> ) = 1 20456 10:52:18 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, <unfinished ...> 20456 10:52:18 <... clock_gettime resumed> {tv_sec=0, tv_nsec=268384968}) = 0 20456 10:52:18 read(0, <unfinished ...> --8<---------------cut here---------------end--------------->8--- Notice that there’s no pselect showing up here! The two clock_gettime correspond to ‘get_internal_run_time’ in libguile, and ‘madvise’ corresponds to ‘return_unused_stack_to_os’; IOW, this part corresponds to garbage collection. So what we have here is that the Scheme procedure ‘select’ returned stdin as “ready for reading”. How did that happen? I believe this is due to <https://bugs.gnu.org/30368>: ‘scm_i_prepare_to_wait_on_fd’ returns 1, so ‘select’ returns EINTR but it does so without clearing the FD sets. Ludo’.