On Fri, 2017-04-14 at 11:18 +0200, Vincent Legout wrote:
[...]
> Could cpu hotplug be buggy in 3.16? And Xen triggers this bug after 5
> minutes even without doing any 'xl vcpu-set'?

The MCE polling timer for each CPU runs every 5 minutes, so this is
presumably the first time it runs. Perhaps this domain is configured
such that CPUs are hot-removed shortly after boot?

In the first crash, it looks like the timer for CPU x!=0 is being
called on CPU 0. In general this can happen if CPU x is hot-removed;
its timers are migrated to another CPU. This should *not* be possible
with the MCE timer, as there is a hotplug callback that removes the
timer when a CPU is removed.

There is a check for the timer having been migrated anyway, which
triggers the WARNING. The timer function then tries to re-add the
timer for the current CPU, but that's still pending, which triggers
the BUG.

Either the hotplug callback was not called, or the timer was migrated
before being removed resulting in a race condition.

> With "maxvcpus" set larger "vcpus", xl vcpu-set seems to work most of
> the time (between 1 and 16 vcpus), but after several tries, I got the
> attached trace.

I'm not sure what's going on in this crash, but as it's a null
dereference in migrate_timer_list it seems somewhat related.

I didn't find any changes that would explain how this was fixed
between 4.0 and 4.2.

I suggest you work around it by adding 'nomce' to the kernel command
line as I would expect Xen or dom0 to handle MCEs.

Ben.

-- 
Ben Hutchings
Man invented language to satisfy his deep need to complain.
                                                          - Lily Tomlin
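
For illustration only, the failure sequence described above can be
reproduced with a toy model. This is a minimal sketch in Python, not
kernel code: the Timer class and the hot_remove, fire and simulate
helpers are hypothetical names made up for this example, and
add_timer_on here only imitates the rule that a still-pending timer
must not be added again. It assumes two CPUs, with CPU 1 hot-removed
and its timers migrated to CPU 0.

```python
class Timer:
    """Toy stand-in for a per-CPU polling timer (not the kernel's timer_list)."""

    def __init__(self, owner_cpu):
        self.owner_cpu = owner_cpu  # CPU this timer was set up for
        self.pending = False


def add_timer_on(timer, cpu, queues):
    # Model of the rule that a timer which is still pending must not be re-added.
    if timer.pending:
        raise RuntimeError("BUG: timer for CPU %d is still pending" % timer.owner_cpu)
    timer.pending = True
    queues[cpu].append(timer)


def hot_remove(cpu, target, timers, queues, run_callback):
    if run_callback:
        # Expected path: the hotplug callback deletes the CPU's polling timer.
        timer = timers[cpu]
        if timer.pending:
            queues[cpu].remove(timer)
            timer.pending = False
    # Whatever is still queued on the removed CPU migrates to the target CPU.
    queues[target].extend(queues[cpu])
    queues[cpu] = []


def fire(timer, cpu, timers, queues, log):
    # The timer expires on `cpu`: it leaves the queue, then its function runs.
    queues[cpu].remove(timer)
    timer.pending = False
    if timer.owner_cpu != cpu:
        log.append("WARNING: timer for CPU %d ran on CPU %d" % (timer.owner_cpu, cpu))
    # The timer function re-arms the timer of the CPU it is running on.
    add_timer_on(timers[cpu], cpu, queues)


def simulate(run_callback):
    timers = {0: Timer(0), 1: Timer(1)}
    queues = {0: [], 1: []}
    for cpu, timer in timers.items():
        add_timer_on(timer, cpu, queues)
    hot_remove(1, 0, timers, queues, run_callback)
    log = []
    try:
        for timer in list(queues[0]):  # snapshot: firing mutates the queue
            fire(timer, 0, timers, queues, log)
    except RuntimeError as exc:
        log.append(str(exc))
    return log


print(simulate(run_callback=True))
print(simulate(run_callback=False))
```

When the callback runs, the model's log stays empty. When it is
skipped, the migrated timer fires on CPU 0, logs the warning, and then
fails while re-adding CPU 0's own timer, which is still pending. The
model does not cover the second possibility, where the timer is
migrated before the callback gets to remove it.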