So,

On 12/24/18 1:32 AM, Hans van Kranenburg wrote:
>
> On 12/21/18 6:54 PM, Hans van Kranenburg wrote:
>>
>> We've been tracking down a live migration bug during the last three
>> days here at work, and here's what we found so far.
>>
>> 1. Xen version and dom0 linux kernel version don't matter.
>> 2. DomU kernel is >= Linux 4.13.
>>
>> When using live migrate to another dom0, this often happens:
>>
>> [   37.511305] Freezing user space processes ... (elapsed 0.001 seconds) done.
>> [   37.513316] OOM killer disabled.
>> [   37.513323] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
>> [   37.514837] suspending xenstore...
>> [   37.515142] xen:grant_table: Grant tables using version 1 layout
>> [18446744002.593711] OOM killer enabled.
>> [18446744002.593726] Restarting tasks ... done.
>> [18446744002.604527] Setting capacity to 6291456
>
> Tonight, I've been through 29 bisect steps to figure out a bit more. A
> make defconfig with Xen PV enabled for the domU already reproduces the
> problem, so a complete compile-and-test cycle only took about 7 minutes.
>
> So, it appears that this 18 gazillion seconds of uptime started
> happening earlier than the TCP situation. All of the test scenarios
> resulted in these huge uptime numbers in dmesg. Not all of them
> resulted in TCP connections hanging.
>
>> As a side effect, all open TCP connections stall, because the
>> timestamp counters of packets sent to the outside world are affected:
>>
>> https://syrinx.knorrie.org/~knorrie/tmp/tcp-stall.png
>
> This is happening since:
>
> commit 9a568de4818dea9a05af141046bd3e589245ab83
> Author: Eric Dumazet <eduma...@google.com>
> Date:   Tue May 16 14:00:14 2017 -0700
>
>     tcp: switch TCP TS option (RFC 7323) to 1ms clock
>
> [...]
>
>> [...]
>>
>> 3. Since this is related to time and clocks, the last thing we tried
>> today was, instead of using default settings, putting "clocksource=tsc
>> tsc=stable:socket" on the Xen command line and "clocksource=tsc" on
>> the domU Linux kernel command line. What we observed after doing this
>> is that the failure happens less often, but it still happens.
>> Everything else still applies.
>
> Actually, it seems that the important thing is that the uptimes of the
> dom0s are not very close to each other. After rebooting all four of
> them back without the tsc options, and then rebooting one of them again
> a few hours later, I could easily reproduce the problem again when live
> migrating to the later rebooted server.
>
>> Additional question:
>>
>> It's 2018; should we have these "clocksource=tsc tsc=stable:socket"
>> on Xen and "clocksource=tsc" anyway now, for Xen 4.11 and Linux 4.19
>> domUs? All our hardware has 'TscInvariant = true'.
>>
>> Related: https://news.ycombinator.com/item?id=13813079
>
> This is still interesting.
>
> ---- >8 ----
>
> Now, the next question is... is 9a568de481 bad, or shouldn't there be
> 18 gazillion whatever uptime already... In Linux 4.9 this doesn't
> happen, so the next task will be to find out where that started.
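(A side note on the 18 gazillion seconds first: 18446744002 is
suspiciously close to 2^64 nanoseconds expressed in seconds, which makes
it look like the clock briefly went slightly negative and got printed as
an unsigned 64-bit nanosecond count. A quick back-of-the-envelope check
in Python, nothing more than arithmetic:

    # 2**64 ns expressed in seconds, compared to the dmesg timestamp
    print(2**64 / 1e9)                       # ~18446744073.7
    print(2**64 / 1e9 - 18446744002.593711)  # ~71.1

So the numbers are at least consistent with the guest clock ending up
roughly 71 seconds "before boot" right after the migration and wrapping
around.)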
And the commit where that started turns out to be:

commit f94c8d116997597fc00f0812b0ab9256e7b0c58f
Author: Peter Zijlstra <pet...@infradead.org>
Date:   Wed Mar 1 15:53:38 2017 +0100

    sched/clock, x86/tsc: Rework the x86 'unstable' sched_clock() interface

a.k.a. v4.11-rc2~30^2

Before this commit, the time listed in dmesg seems to follow the uptime
of the domU. After it, the time in dmesg jumps around, up and down, when
live migrating to different dom0s, with the occasional/frequent jump to
a number above 18000000000, which then also shows the TCP timestamp
breakage since 9a568de4.

So, the next question is... what now? Any ideas appreciated. Can anyone
else reproduce this? I have super-common HP DL360 hardware and mostly
default settings, so it shouldn't be that hard. Should I mail some other
mailing list with a question? Which one? Do any of you Xen developers
have more experience with the timekeeping code?

Regards,
Hans
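P.S. For what it's worth, my rough mental model of why a jumping
timestamp clock stalls established connections: since 9a568de4 the TCP
timestamp option is generated from a 1ms clock that apparently makes the
same jump (that is at least what the bisect results point at), so the
TSval we put on the wire jumps along with the dmesg time. The receiving
side then runs the RFC 7323 PAWS check against the last timestamp it saw
from us, and segments whose TSval compares as "older" are discarded. A
small Python illustration of just that comparison (not the kernel code;
the numbers are made up apart from the 18446744002 s from the dmesg
output above):

    # Wrap-safe 32-bit comparison from RFC 7323 (PAWS): a segment whose
    # TSval is "before" the last validated TS.Recent is treated as an
    # old duplicate and discarded by the receiver.

    def ts_before(a, b):
        """True if 32-bit timestamp a is 'before' b, modulo 2**32."""
        return ((a - b) & 0xffffffff) > 0x7fffffff

    # Pretend TSval is simply milliseconds of uptime, truncated to 32 bits.
    ts_recent = 37_500 & 0xffffffff              # ~37.5 s uptime before migration
    jumped    = 18_446_744_002_593 & 0xffffffff  # ~18446744002 s "uptime" after the jump

    print(ts_before(jumped, ts_recent))
    # True: modulo 2**32 the new timestamps look roughly 39 hours old to
    # the peer, so it drops our segments and the connection stalls.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel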