On Fri, Apr 28, 2023 at 7:54 AM Sergey Bugaev <buga...@gmail.com> wrote:
> Hello! > > Here comes yet another bug description and a change proposal; > hopefully not too long this time. (Update after having written like > half of it: apparently it *is* going to be long.) > > My system regularly experiences network deadlocks: the system itself > still works if I access it through the console, but all my SSH > sessions just hang, as do any other connections. This can happen soon > after startup (in like 20% of times), and later during run time at > arbitrary moments too; but in this second case it seems related to how > much network activity there is. This is one of the reasons preventing > me from running the glibc testsuite, since that spits out _a lot_ of > output, and will almost surely lead to a networking deadlock at some > point. I don't know why nobody else seems to be complaining about > this; based on the analysis below the root cause should be fairly > reproducible. > > Shutting down the system normally in this case does not work, since > the dhcp client attempts to release its lease, but since the network > stack is deadlocked it never manages to. And since (on Debian > GNU/Hurd) the shutdown process is apparently synchronous and there are > no timeouts and sigkills in place, it just hangs there forever, with > no apparent way to even Ctrl-C it (which sort of makes sense since > it's not an interactive shell). If I approach it calmly, I do syncfs > --sync / (just in case) and then reboot-hurd, which works. If, > however, if I have already managed to type in 'reboot', I am left with > no choice but to forcefully power off the machine and risk corrupting > the file system; and the consequences range from minor complaints from > fsck on the next boot-up to the installation breaking so badly that it > does not boot. > > So we can agree that this is slightly suboptimal. > > Yesterday I got one of these hangs again, and decided to bite the > bullet & look into what's going on. There are two things that could be > locking up: either pfinet or netdde; quick inspection revealed that > pfinet is the guilty party. There is apparently a single global_lock > in pfinet; it makes sense that if something manages to hold it > indefinitely then everything else will hang. That something turned out > to be the timer thread (see pfinet/timer-emul.c:timer_function). It > grabs the lock after sleeping and proceeds to run the handlers, and > apparently never exits from that loop; so nothing else can make any > progress. > > But why doesn't it exit the loop? Surely there's only a finite, > hopefully small, number of timers overall? Well, yes, but they > re-schedule themselves from inside the handlers. As an aside, whoever > wrote add_timer () in the same file obviously didn't care about > releasing the mach_thread_self () right; currently a reference is > leaked on each invocation (yay). It should either deallocate the right > or use hurd_thread_self (). > > The offending function that reschedules itself over and over is > pfinet/linux-src/net/ipv4/tcp_timer.c:tcp_sltimer_handler. I found out > (which wasn't super easy when everything's <optimized out>, in > addition to me being unable to use my terminal emulator of choice and > having to squint at the Mach console which is tiny because it's not > getting appropriately scaled for my HiDPI screen [0]) that the 'now + > next' addition overflows and results in a small timestamp (number of > jiffies), which is then getting taken by timer-emul to be a timestamp > in the past, so it immediately runs the handler again, thus the > busy-loop. > > [0]: https://gitlab.gnome.org/GNOME/gnome-boxes/-/issues/635 > > Why does it overflow? Because 'now' is huge. 'now' is set from > 'jiffies' at the beginning of tcp_sltimer_handler. 'jiffies' is > #defined to fetch_jiffies () from pfinet/mapped-time.h; which uses > maptime_read () to read the current time from the Mach time device, > and then subtracts 'root_jiffies' from it. 'root_jiffies' is filled > from the Mach time device on pfinet startup. So clearly 'j - > root_jiffies' itself overflows, resulting in a huge 'jiffies' value; I > added an 'assert (j >= root_jiffies)' and sure enough, it fails at > about the same rate that the network would lock up. > > In conclusion, what happens is pfinet reads a timestamp from the Mach > time device, and then reads another timestamp, and the second one > happens to be smaller than the first one, which confuses it into > busy-looping and deadlocking. > > I then grepped gnumach to see how the time device is set up and > realized that the time device is supposed to be CLOCK_REALTIME, not > something like CLOCK_BOOTTIME or CLOCK_MONOTONIC. So it only makes > sense that if *something* adjusts the clock with host_set_time (), the > time as seen through the time device will experience a discontinuous > jump, possibly into the past. And that *something* appears to be > /usr/sbin/ntpdate. > This never happened to me. Probably because I don't have NTP installed in my Hurd box? > > Now, how do we fix this? > > Clearly pfinet needs to implement 'jiffies' differently. A stopgap > solution would be to detect the case of j < root_jiffies and just set > root_jiffies = j; this will probably break all sort of timeouts, but > at least will solve the busy-loop / deadlock. > > But also, Mach needs to expose a CLOCK_BOOTTIME-like clock in addition > to the CLOCK_REALTIME one. There is already internal tracking for > this, see kern/mach_clock.c:clock_boottime_offset; it only needs to be > exposed to the userland. > > Since we have already broken ABI compatibility with the time device on > other Mach versions (if it ever existed) by adding the *64 variants to > the struct mapped_time_value, maybe we could just place the > clock_boottime_offset into the same struct? Then the userspace would > be able to pick it up and use it to calculate the boottime-relative > time. > User land can still use the old time structures but can now be updated to read the *64 variants. > We could even attempt to use this to implement some support for > CLOCK_MONOTONIC in glibc, which I imagine would be a welcome addition. > Maybe backcompat is not that much of a concern either: the kernel > allocates a whole page for the mapped_time_value and zero-fills it; if > we see that clock_boottime_offset is zero that clearly means that > either the system was booted up on 1 January, 1970, or the Mach > version in use doesn't support clock_boottime_offset. > > What do you think? > I think the approach makes sense to me. We can update the clock_boottime_offset in the structure whenever it is updated by the kernel and then provide a maptime_read(clockid_t clock, struct mapped_time_value *mtime, struct timeval *tv) routine to read from the mapped memory and return either a time using CLOCK_REALTIME or CLOCK_MONOTONIC semantics. We could also have an RPC host_get_time that is parameterized by clockid_t to have some symmetry with the routine above. > Sergey > Flavio