On Wed, Apr 19, 2017 at 08:39:05PM +0100, Ben Hutchings wrote:
> On Fri, 2017-04-14 at 11:18 +0200, Vincent Legout wrote:
> [...]
> > Could cpu hotplug be buggy in 3.16? And Xen triggers this bug after 5
> > minutes even without doing any 'xl vcpu-set'?
>
> The MCE polling timer for each CPU runs every 5 minutes, so this is
> presumably the first time it runs. Perhaps this domain is configured
> such that CPUs are hot-removed shortly after boot?

I didn't explicitly set anything like that, but I guess it could also be
a default configuration in Xen.

> In the first crash, it looks like the timer for CPU x!=0 is being
> called on CPU 0. In general this can happen if CPU x is hot-removed;
> its timers are migrated to another CPU. This should *not* be possible
> with the MCE timer, as there is a hotplug callback that removes the
> timer when a CPU is removed. There is a check for the timer having
> been migrated anyway, which triggers the WARNING. The timer function
> then tries to re-add the timer for the current CPU, but that's still
> pending, which triggers the BUG. Either the hotplug callback was not
> called, or the timer was migrated before being removed resulting in a
> race condition.
>
> > With "maxvcpus" set larger "vcpus", xl vcpu-set seems to work most of
> > the time (between 1 and 16 vcpus), but after several tries, I got the
> > attached trace.
>
> I'm not sure what's going on in this crash, but as it's a null
> dereference in migrate_timer_list it seems somewhat related.
>
> I didn't find any changes that would explain how this was fixed between
> 4.0 and 4.2. I suggest you work around it by adding 'nomce' to the
> kernel command line as I would expect Xen or dom0 to handle MCEs.

Thanks a lot Ben, I can't reproduce the issue with 'nomce'.

Thanks,
Vincent