Hi,

On 12/21/18 6:54 PM, Hans van Kranenburg wrote:
> Hi,
> 
> We've been tracking down a live migration bug during the last three days
> here at work, and here's what we found so far.
> 
> 1. Xen version and dom0 linux kernel version don't matter.
> 2. DomU kernel is >= Linux 4.13.
> 
> When using live migrate to another dom0, this often happens:
> 
> [   37.511305] Freezing user space processes ... (elapsed 0.001 seconds)
> done.
> [   37.513316] OOM killer disabled.
> [   37.513323] Freezing remaining freezable tasks ... (elapsed 0.001
> seconds) done.
> [   37.514837] suspending xenstore...
> [   37.515142] xen:grant_table: Grant tables using version 1 layout
> [18446744002.593711] OOM killer enabled.
> [18446744002.593726] Restarting tasks ... done.
> [18446744002.604527] Setting capacity to 6291456

Tonight, I've been through 29 bisect steps to figure out a bit more. A
make defconfig build with Xen PV domU support enabled already reproduces
the problem, so a complete compile-and-test cycle took only about 7
minutes.

So, it appears that this 18 gazillion seconds of uptime started
happening before the TCP problem did. All of the test scenarios resulted
in these huge uptime numbers in dmesg, but not all of them resulted in
TCP connections hanging.
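
A quick sanity check on that number, as a sketch only. It assumes the
dmesg timestamp comes from a 64-bit nanosecond counter; the helper names
below are made up for illustration. Read that way, the value after the
migration sits just below the point where an unsigned 64-bit nanosecond
counter wraps, which is what a small negative value looks like when it is
printed as unsigned.

```python
from decimal import Decimal

NSEC_PER_SEC = 10**9
U64_WRAP = 2**64  # an unsigned 64-bit counter wraps around here


def dmesg_to_ns(stamp: str) -> int:
    """Convert a dmesg 'seconds.microseconds' stamp to nanoseconds."""
    return int(Decimal(stamp) * NSEC_PER_SEC)


def as_signed64(ns: int) -> int:
    """Reinterpret an unsigned 64-bit value as two's-complement signed."""
    return ns - U64_WRAP if ns >= 2**63 else ns


# First timestamp printed after the migration in the log above.
after_ns = dmesg_to_ns("18446744002.593711")
signed_ns = as_signed64(after_ns)

print("u64 ns counter wraps at (s):", U64_WRAP / NSEC_PER_SEC)
print("stamp read as signed (s):   ", signed_ns / NSEC_PER_SEC)
```

Under that assumption the stamp corresponds to roughly minus 71 seconds,
so the clock seems to have jumped backwards past zero instead of forward
by centuries. This says nothing yet about where the jump comes from.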

> As a side effect, all open TCP connections stall, because the timestamp
> counters of packets sent to the outside world are affected:
> 
> https://syrinx.knorrie.org/~knorrie/tmp/tcp-stall.png

This is happening since:

commit 9a568de4818dea9a05af141046bd3e589245ab83
Author: Eric Dumazet <eduma...@google.com>
Date:   Tue May 16 14:00:14 2017 -0700

    tcp: switch TCP TS option (RFC 7323) to 1ms clock
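
To illustrate why a clock jump could matter for a 1ms TCP timestamp
clock, here is a rough sketch. It assumes that the TSval is derived by
dividing a 64-bit nanosecond clock down to milliseconds and truncating to
32 bits, and that this clock shows the same jump as the dmesg timestamps.
Both are assumptions, and the actual kernel code may differ in detail.
The comparison is the usual modulo 2**32 ordering a receiver applies to
timestamps (RFC 7323); the function names are hypothetical.

```python
NSEC_PER_SEC = 10**9
TCP_TS_HZ = 1000            # 1 ms granularity, as in the commit subject
U32_MASK = 0xFFFFFFFF       # TSval is a 32-bit field (RFC 7323)
U64_MASK = 2**64 - 1


def tsval_from_ns(clock_ns: int) -> int:
    """Derive a 32-bit millisecond TSval from a 64-bit nanosecond clock."""
    ns_per_tick = NSEC_PER_SEC // TCP_TS_HZ
    return ((clock_ns & U64_MASK) // ns_per_tick) & U32_MASK


def is_older(new: int, last: int) -> bool:
    """True if 'new' is behind 'last' when compared modulo 2**32."""
    return ((new - last) & U32_MASK) >= 2**31


# Clock values taken from the dmesg timestamps around the migration.
before = tsval_from_ns(37_514_837_000)               # [   37.514837]
after = tsval_from_ns(18_446_744_002_593_711_000)    # [18446744002.593711]

print("TSval before:", before)
print("TSval after: ", after)
print("peer would see it as older:", is_older(after, before))
```

With these inputs the TSval after the jump compares as older than the one
sent before it, so a peer checking timestamps this way would have reason
to discard the new segments. That would fit the stalled connections, but
it is not proof that this is the actual mechanism.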

To get there, the first 13 bisect steps were needed to figure out that
live migration was totally broken between...

commit bf22ff45bed664aefb5c4e43029057a199b7070c
Author: Jeffy Chen <jeffy.c...@rock-chips.com>
Date:   Mon Jun 26 19:33:34 2017 +0800

    genirq: Avoid unnecessary low level irq function calls

...and...

commit bb68cfe2f5a7f43058aed299fdbb73eb281734ed
Author: Thomas Gleixner <t...@linutronix.de>
Date:   Mon Jul 31 22:07:09 2017 +0200

    x86/hpet: Cure interface abuse in the resume path

In between are 12k+ commits. So, I restarted the bisect and used either a
revert of the first commit or a cherry-pick of the fix to get a working
test case every single time.

http://paste.debian.net/plainh/be91aabd
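
For anyone who wants to repeat this, the decision "does this bisect step
need the workaround?" could be sketched as below. This is not the exact
procedure from the paste, just one way to do it, and the function names
are hypothetical. It relies on `git merge-base --is-ancestor`, which
exits with status 0 when the first commit is an ancestor of the second.

```python
import subprocess

BREAKING = "bf22ff45bed664aefb5c4e43029057a199b7070c"  # genirq commit
FIX = "bb68cfe2f5a7f43058aed299fdbb73eb281734ed"       # x86/hpet fix


def git_is_ancestor(ancestor: str, descendant: str) -> bool:
    """True if 'ancestor' is reachable from 'descendant' (exit status 0)."""
    result = subprocess.run(
        ["git", "merge-base", "--is-ancestor", ancestor, descendant]
    )
    return result.returncode == 0


def needs_workaround(head: str, is_ancestor=git_is_ancestor) -> bool:
    """True if 'head' contains the breaking commit but not yet its fix."""
    return is_ancestor(BREAKING, head) and not is_ancestor(FIX, head)


if __name__ == "__main__":
    # Run inside a kernel tree while a bisect is in progress.
    print("workaround needed:", needs_workaround("HEAD"))
```

When this returns true for the commit under test, one could for example
apply the fix with `git cherry-pick --no-commit` before building and
clean up with `git reset --hard` afterwards, so that the bisect itself
keeps seeing the original history.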

> [...]
> 
> 3. Since this is related to time and clocks, the last thing today we
> tried was, instead of using default settings, put "clocksource=tsc
> tsc=stable:socket" on the xen command line and "clocksource=tsc" on the
> domU linux kernel line. What we observed after doing this, is that the
> failure happens less often, but still happens. Everything else applies.

Actually, it seems that the important thing is that the uptimes of the
dom0s are not very close to each other. After rebooting all four without
the tsc options, and then rebooting one of them again a few hours later,
I could easily reproduce the problem when live migrating to the server
that was rebooted later.

> Additional question:
> 
> It's 2018, should we have these "clocksource=tsc tsc=stable:socket" on
> Xen and "clocksource=tsc" anyways now, for Xen 4.11 and Linux 4.19
> domUs? All our hardware has 'TscInvariant = true'.
> 
> Related: https://news.ycombinator.com/item?id=13813079

This is still interesting.

---- >8 ----

Now, the next question is... is 9a568de481 bad, or should the 18
gazillion uptime not have been there in the first place... In Linux 4.9,
this doesn't happen, so the next task will be to find out where that
started.

to be continued...

Hans

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
