On Fri, Apr 28, 2023 at 7:54 AM Sergey Bugaev <buga...@gmail.com> wrote:

> Hello!
>
> Here comes yet another bug description and a change proposal;
> hopefully not too long this time. (Update after having written like
> half of it: apparently it *is* going to be long.)
>
> My system regularly experiences network deadlocks: the system itself
> still works if I access it through the console, but all my SSH
> sessions just hang, as do any other connections. This can happen soon
> after startup (in like 20% of times), and later during run time at
> arbitrary moments too; but in this second case it seems related to how
> much network activity there is. This is one of the reasons preventing
> me from running the glibc testsuite, since that spits out _a lot_ of
> output, and will almost surely lead to a networking deadlock at some
> point. I don't know why nobody else seems to be complaining about
> this; based on the analysis below the root cause should be fairly
> reproducible.
>
> Shutting down the system normally in this case does not work, since
> the dhcp client attempts to release its lease, but since the network
> stack is deadlocked it never manages to. And since (on Debian
> GNU/Hurd) the shutdown process is apparently synchronous and there are
> no timeouts and sigkills in place, it just hangs there forever, with
> no apparent way to even Ctrl-C it (which sort of makes sense since
> it's not an interactive shell). If I approach it calmly, I do syncfs
> --sync / (just in case) and then reboot-hurd, which works. If,
> however, if I have already managed to type in 'reboot', I am left with
> no choice but to forcefully power off the machine and risk corrupting
> the file system; and the consequences range from minor complaints from
> fsck on the next boot-up to the installation breaking so badly that it
> does not boot.
>
> So we can agree that this is slightly suboptimal.
>
> Yesterday I got one of these hangs again, and decided to bite the
> bullet & look into what's going on. There are two things that could be
> locking up: either pfinet or netdde; quick inspection revealed that
> pfinet is the guilty party. There is apparently a single global_lock
> in pfinet; it makes sense that if something manages to hold it
> indefinitely then everything else will hang. That something turned out
> to be the timer thread (see pfinet/timer-emul.c:timer_function). It
> grabs the lock after sleeping and proceeds to run the handlers, and
> apparently never exits from that loop; so nothing else can make any
> progress.
>
> But why doesn't it exit the loop? Surely there's only a finite,
> hopefully small, number of timers overall? Well, yes, but they
> re-schedule themselves from inside the handlers. As an aside, whoever
> wrote add_timer () in the same file obviously didn't care about
> releasing the mach_thread_self () right; currently a reference is
> leaked on each invocation (yay). It should either deallocate the right
> or use hurd_thread_self ().
>
> The offending function that reschedules itself over and over is
> pfinet/linux-src/net/ipv4/tcp_timer.c:tcp_sltimer_handler. I found out
> (which wasn't super easy when everything's <optimized out>, in
> addition to me being unable to use my terminal emulator of choice and
> having to squint at the Mach console which is tiny because it's not
> getting appropriately scaled for my HiDPI screen [0]) that the 'now +
> next' addition overflows and results in a small timestamp (number of
> jiffies), which is then getting taken by timer-emul to be a timestamp
> in the past, so it immediately runs the handler again, thus the
> busy-loop.
>
> [0]: https://gitlab.gnome.org/GNOME/gnome-boxes/-/issues/635
>
> Why does it overflow? Because 'now' is huge. 'now' is set from
> 'jiffies' at the beginning of tcp_sltimer_handler. 'jiffies' is
> #defined to fetch_jiffies () from pfinet/mapped-time.h; which uses
> maptime_read () to read the current time from the Mach time device,
> and then subtracts 'root_jiffies' from it. 'root_jiffies' is filled
> from the Mach time device on pfinet startup. So clearly 'j -
> root_jiffies' itself overflows, resulting in a huge 'jiffies' value; I
> added an 'assert (j >= root_jiffies)' and sure enough, it fails at
> about the same rate that the network would lock up.
>
> In conclusion, what happens is pfinet reads a timestamp from the Mach
> time device, and then reads another timestamp, and the second one
> happens to be smaller than the first one, which confuses it into
> busy-looping and deadlocking.
>
> I then grepped gnumach to see how the time device is set up and
> realized that the time device is supposed to be CLOCK_REALTIME, not
> something like CLOCK_BOOTTIME or CLOCK_MONOTONIC. So it only makes
> sense that if *something* adjusts the clock with host_set_time (), the
> time as seen through the time device will experience a discontinuous
> jump, possibly into the past. And that *something* appears to be
> /usr/sbin/ntpdate.
>

This never happened to me. Probably because I don't have NTP installed
in my Hurd box?


>
> Now, how do we fix this?
>
> Clearly pfinet needs to implement 'jiffies' differently. A stopgap
> solution would be to detect the case of j < root_jiffies and just set
> root_jiffies = j; this will probably break all sort of timeouts, but
> at least will solve the busy-loop / deadlock.
>
> But also, Mach needs to expose a CLOCK_BOOTTIME-like clock in addition
> to the CLOCK_REALTIME one. There is already internal tracking for
> this, see kern/mach_clock.c:clock_boottime_offset; it only needs to be
> exposed to the userland.
>
> Since we have already broken ABI compatibility with the time device on
> other Mach versions (if it ever existed) by adding the *64 variants to
> the struct mapped_time_value, maybe we could just place the
> clock_boottime_offset into the same struct? Then the userspace would
> be able to pick it up and use it to calculate the boottime-relative
> time.
>

User land can still use the old time structures but can now be updated to
read the
*64 variants.


> We could even attempt to use this to implement some support for
> CLOCK_MONOTONIC in glibc, which I imagine would be a welcome addition.
> Maybe backcompat is not that much of a concern either: the kernel
> allocates a whole page for the mapped_time_value and zero-fills it; if
> we see that clock_boottime_offset is zero that clearly means that
> either the system was booted up on 1 January, 1970, or the Mach
> version in use doesn't support clock_boottime_offset.
>
> What do you think?
>

I think the approach makes sense to me. We can update the
clock_boottime_offset
in the structure whenever it is updated by the kernel and then provide a
maptime_read(clockid_t clock, struct mapped_time_value *mtime, struct
timeval *tv)
routine to read from the mapped memory and return either a time using
CLOCK_REALTIME
or CLOCK_MONOTONIC semantics.

We could also have an RPC host_get_time that is parameterized by clockid_t
to have some
symmetry with the routine above.


> Sergey
>

Flavio

Reply via email to