Hello,

Maxim Cournoyer <maxim.courno...@gmail.com> writes:

> Hi Marius,
>
> Marius Bakke <mar...@gnu.org> writes:
>
>> Maxim Cournoyer <maxim.courno...@gmail.com> writes:
>>
>>>> Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on
>>>> berlin enough to reproduce the issue?  If so, we could monitor/strace
>>>> sshd on overdrive1 to get a better understanding of what’s going on.
>>>
>>> It's actually difficult to trigger it; it seems to happen mostly on the
>>> first try after a long time without connecting to the machine; on the
>>> 2nd and later tries, everything is smooth.  Waiting a few minutes is not
>>> enough to re-trigger the problem.
>>>
>>> I've managed to see the problem a few lucky times with:
>>>
>>> --8<---------------cut here---------------start------------->8---
>>> while true; do guix offload test /etc/guix/machines.scm overdrive1; done
>>> --8<---------------cut here---------------end--------------->8---
>>
>> I used to be able to reproduce it by inducing a high load on the target
>> machine and just let Guix keep trying to connect.  But now I did that,
>> and set overload threshold to 0.0 for good measure, and Guix has been
>> waiting patiently for two hours without failure.
>>
>> So AFAICT this bug has been fixed.  Perhaps Berlin or the Overdrive
>> simply needs to be updated?
>
> Ah!  Do you have root access to overdrive1?  It'd be interesting to
> reconfigure it to update the guix-daemon and see if the problem
> vanishes.

Good news: this seems resolved with the newer Guile-SSH 0.15.1, where
long delays before the remote side returns output no longer trigger an
EOF response (the client now keeps waiting instead).  I believe it was
fixed by this commit [0].

Many thanks to Artyom Poptsov for fixing it!

Closing.

Maxim

[0] https://github.com/artyom-poptsov/guile-ssh/commit/fefaab9e925d015b01abc7c76ea4017c373ad895