Hi,

On 12/21/18 6:54 PM, Hans van Kranenburg wrote:
> Hi,
>
> We've been tracking down a live migration bug during the last three days
> here at work, and here's what we found so far.
>
> 1. Xen version and dom0 linux kernel version don't matter.
> 2. DomU kernel is >= Linux 4.13.
>
> When using live migrate to another dom0, this often happens:
>
> [   37.511305] Freezing user space processes ... (elapsed 0.001 seconds)
> done.
> [   37.513316] OOM killer disabled.
> [   37.513323] Freezing remaining freezable tasks ... (elapsed 0.001
> seconds) done.
> [   37.514837] suspending xenstore...
> [   37.515142] xen:grant_table: Grant tables using version 1 layout
> [18446744002.593711] OOM killer enabled.
> [18446744002.593726] Restarting tasks ... done.
> [18446744002.604527] Setting capacity to 6291456
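A quick sanity check on that huge number: 2^64 nanoseconds is 18446744073.709551616 seconds, and the first timestamp after resume sits just below that. One possible reading, which is an assumption and not something confirmed in this thread, is that the timestamp is a 64-bit nanosecond value that went slightly negative and is printed as unsigned. The sketch below only does that arithmetic; the helper name as_signed_ns is made up for illustration.

```python
from decimal import Decimal

NS_PER_SEC = 10**9
U64_MODULUS = 2**64


def as_signed_ns(printed_seconds: str) -> int:
    """Reinterpret a printed dmesg timestamp as a signed 64-bit ns count.

    Assumption: the printed value is an unsigned 64-bit nanosecond counter.
    """
    # dmesg prints microsecond precision; Decimal avoids float rounding.
    ns = int(Decimal(printed_seconds) * NS_PER_SEC)
    # Values at or above 2**63 are negative numbers in two's complement.
    return ns - U64_MODULUS if ns >= 2**63 else ns


before = as_signed_ns("37.515142")
after = as_signed_ns("18446744002.593711")

print(f"last timestamp before resume: {before / NS_PER_SEC:.6f} s")
print(f"first timestamp after resume:  {after / NS_PER_SEC:.6f} s")
```

Under that assumption the first post-resume timestamp is roughly 71 seconds below zero, i.e. the clock appears to have jumped backwards past zero rather than forward by centuries.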
Tonight, I've been through 29 bisect steps to figure out a bit more. A
make defconfig with enabling Xen PV for domU reproduces the problem
already, so a complete cycle with compiling and testing had only to take
about 7 minutes.

So, it appears that this 18 gazillion seconds of uptime is a thing that
started happening earlier than the TCP situation already. All of the
test scenarios resulted in these huge uptime numbers in dmesg. Not all
of them result in TCP connections hanging.

> As a side effect, all open TCP connections stall, because the timestamp
> counters of packets sent to the outside world are affected:
>
> https://syrinx.knorrie.org/~knorrie/tmp/tcp-stall.png

This is happening since:

commit 9a568de4818dea9a05af141046bd3e589245ab83
Author: Eric Dumazet <eduma...@google.com>
Date:   Tue May 16 14:00:14 2017 -0700

    tcp: switch TCP TS option (RFC 7323) to 1ms clock

In order to find out, the first 13 bisect steps were to figure out that
live migration was totally broken between...

commit bf22ff45bed664aefb5c4e43029057a199b7070c
Author: Jeffy Chen <jeffy.c...@rock-chips.com>
Date:   Mon Jun 26 19:33:34 2017 +0800

    genirq: Avoid unnecessary low level irq function calls

...and...

commit bb68cfe2f5a7f43058aed299fdbb73eb281734ed
Author: Thomas Gleixner <t...@linutronix.de>
Date:   Mon Jul 31 22:07:09 2017 +0200

    x86/hpet: Cure interface abuse in the resume path

In between are 12k+ commits. So, I restarted bisect and used either
revert of the first commit or cherry-pick of the fix to get a working
test case every single time.

http://paste.debian.net/plainh/be91aabd

> [...]
>
> 3. Since this is related to time and clocks, the last thing today we
> tried was, instead of using default settings, put "clocksource=tsc
> tsc=stable:socket" on the xen command line and "clocksource=tsc" on the
> domU linux kernel line. What we observed after doing this, is that the
> failure happens less often, but still happens. Everything else applies.
Actually, it seems that the important thing is that uptime of the dom0s
is not very close to each other. After rebooting all four back without
tsc options, and then a few hours later rebooting one of them again, I
could easily reproduce again when live migrating to the later rebooted
server.

> Additional question:
>
> It's 2018, should we have these "clocksource=tsc tsc=stable:socket" on
> Xen and "clocksource=tsc" anyways now, for Xen 4.11 and Linux 4.19
> domUs? All our hardware has 'TscInvariant = true'.
>
> Related: https://news.ycombinator.com/item?id=13813079

This is still interesting.

---- >8 ----

Now, the next question is... is 9a568de481 bad, or shouldn't there be 18
gazillion whatever uptime already... In Linux 4.9, this doesn't happen,
so next task will be to find out where that started.

to be continued...

Hans

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel