Bug#860236: xen pv domU crash with 3.16 kernel and xen 4.8

Ben Hutchings Wed, 19 Apr 2017 12:42:31 -0700

On Fri, 2017-04-14 at 11:18 +0200, Vincent Legout wrote:
[...]
> Could cpu hotplug be buggy in 3.16? And Xen triggers this bug after 5
> minutes even without doing any 'xl vcpu-set'?


The MCE polling timer for each CPU runs every 5 minutes, so this is
presumably the first time it runs.  Perhaps this domain is configured
such that CPUs are hot-removed shortly after boot?

In the first crash, it looks like the timer for some CPU x != 0 is being
run on CPU 0.  In general this can happen if CPU x is hot-removed: its
pending timers are migrated to another CPU.  This should *not* be possible
with the MCE timer, as there is a hotplug callback that removes the
timer when a CPU is removed.  There is a check for the timer having
been migrated anyway, which triggers the WARNING.  The timer function
then tries to re-add the timer for the current CPU, but that's still
pending, which triggers the BUG.  Either the hotplug callback was not
called, or the timer was migrated before the callback removed it, which
would be a race condition.
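
To make that sequence concrete, here is a toy user-space model of it in
Python.  It is only a sketch of the reasoning above, not kernel code:
ToyTimer, ToyMachine, hot_remove, fire and the remove_timer_on_unplug
flag are all hypothetical names invented for illustration, and the
model assumes a migrated timer always lands on CPU 0.

```python
class ToyTimer:
    """One per-CPU polling timer in the toy model (hypothetical)."""

    def __init__(self, owner):
        self.owner = owner       # CPU this timer was set up for
        self.queued_on = owner   # CPU whose list holds it; None = not pending


class ToyMachine:
    """Toy model of per-CPU timers and CPU hot-remove (hypothetical)."""

    def __init__(self, ncpus, remove_timer_on_unplug=True):
        self.timers = {cpu: ToyTimer(cpu) for cpu in range(ncpus)}
        # True models the hotplug callback doing its job before migration.
        self.remove_timer_on_unplug = remove_timer_on_unplug
        self.log = []

    def hot_remove(self, cpu, target=0):
        timer = self.timers[cpu]
        if self.remove_timer_on_unplug:
            timer.queued_on = None      # callback deleted the timer
        elif timer.queued_on == cpu:
            timer.queued_on = target    # still pending, so it gets migrated

    def fire(self, owner):
        """Run the timer function for owner's timer on the CPU holding it."""
        timer = self.timers[owner]
        current = timer.queued_on
        if current is None:
            return                      # not pending, nothing runs
        timer.queued_on = None          # dequeued before its function runs
        if current != owner:
            self.log.append("WARNING")  # the "was I migrated?" check
        # The function re-arms the timer of the CPU it is running on.
        mine = self.timers[current]
        if mine.queued_on is not None:
            self.log.append("BUG")      # re-adding a timer that is pending
            return
        mine.queued_on = current


# Callback missed (or lost the race): CPU 1's timer ends up on CPU 0.
broken = ToyMachine(2, remove_timer_on_unplug=False)
broken.hot_remove(1)
broken.fire(1)
print(broken.log)  # ['WARNING', 'BUG']
```

With remove_timer_on_unplug left at True, the same hot_remove(1) and
fire(1) calls log nothing, which is the behaviour the hotplug callback
is meant to guarantee.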

> With "maxvcpus" set larger "vcpus", xl vcpu-set seems to work most of
> the time (between 1 and 16 vcpus), but after several tries, I got the
> attached trace.

I'm not sure what's going on in this crash, but as it's a null
dereference in migrate_timer_list it seems somewhat related.

I didn't find any changes that would explain how this was fixed between
4.0 and 4.2.  I suggest you work around it by adding 'nomce' to the
kernel command line, as I would expect Xen or dom0 to handle MCEs.

Ben.

-- 
Ben Hutchings
Man invented language to satisfy his deep need to complain. - Lily
Tomlin
