On Fri, Dec 15, 2017 at 06:21:08PM +1000, Jonathan Matthew wrote:
> On Tue, Dec 12, 2017 at 09:11:40PM +1000, Jonathan Matthew wrote:
> > On Mon, Dec 11, 2017 at 09:34:00AM +0100, Landry Breuil wrote:
> > > On Mon, Dec 11, 2017 at 06:21:01PM +1000, Jonathan Matthew wrote:
> > > > On 10/12/17 03:26, Landry Breuil wrote:
> > > > > On Sat, Dec 09, 2017 at 04:33:28PM +0100, Juan Francisco Cantero
> > > > > Hurtado wrote:
> > > > > > On Thu, Dec 07, 2017 at 02:27:29PM +0100, Landry Breuil wrote:
> > > > > > > On Thu, Dec 07, 2017 at 11:52:46AM +0100, Martin Pieuchot wrote:
> > > > > > > > On 07/12/17(Thu) 08:34, Landry Breuil wrote:
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I've been having KVM VMs running 6.2 hardlocking/deadlocking
> > > > > > > > > for a while, all of them running on Proxmox 5.1 using Linux
> > > > > > > > > 4.13.8 & qemu-kvm 2.9.1. There were hardlocks upon reboot
> > > > > > > > > which were 'solved' by disabling x2apic emulation in kvm
> > > > > > > > > (args: -cpu=kvm64,-x2apic) or giving the host cpu flags to
> > > > > > > > > the vm (args: -cpu host), but deadlocks still occur during
> > > > > > > > > normal operation.
> > > > > > > > >
> > > > > > > > > I'm now running a kernel with MP_LOCKDEBUG, so I'm collecting
> > > > > > > > > traces in the vain hope that it might help someone interested
> > > > > > > > > in locking issues. Here's the latest one:
> > > > > > > >
> > > > > > > > Let me add that when you had x2apic enabled, the kernel
> > > > > > > > 'froze' inside x2apic_readreg; trace below:
> > > > > > > >
> > > > > > > > ddb{0}> tr
> > > > > > > > x2apic_readreg(10) at x2apic_readreg+0xf
> > > > > > > > lapic_delay(ffff800022136900) at lapic_delay+0x5c
> > > > > > > > rtcput(ffff800022136960) at rtcput+0x65
> > > > > > > > resettodr() at resettodr+0x1d6
> > > > > > > > perform_resettodr(ffffffff81769b29) at perform_resettodr+0x9
> > > > > > > > taskq_thread(0) at taskq_thread+0x67
> > > > > > > > end trace frame: 0x0, count: -6
> > > > > > > >
> > > > > > > > What you're seeing with an MP_LOCKDEBUG kernel is just a
> > > > > > > > symptom. A CPU enters DDB because another one is 'frozen'
> > > > > > > > while holding the KERNEL_LOCK(). What's interesting is that
> > > > > > > > in both cases the frozen CPU is trying to execute apic-related
> > > > > > > > code:
> > > > > > > > - x2apic_readreg
> > > > > > > > - lapic_delay
> > > > > > > >
> > > > > > > > I believe this issue should be reported to the KVM developers
> > > > > > > > as well.
> > > > > > >
> > > > > > > *Very* interestingly, I hit a new lockup while running bsd.sp,
> > > > > > > so I think that rules out the OpenBSD MP code.
> > > > > > >
> > > > > > > ddb> tr
> > > > > > > i82489_readreg(0) at i82489_readreg+0xd
> > > > > > > lapic_delay(ffffffff81a84090) at lapic_delay+0x5c
> > > > > > > rtcget(ffffffff81a84090) at rtcget+0x1a
> > > > > > > resettodr() at resettodr+0x3a
> > > > > > > perform_resettodr(ffffffff81659e99) at perform_resettodr+0x9
> > > > > > > taskq_thread(0) at taskq_thread+0x57
> > > > > > > end trace frame: 0x0, count: -6
> > > > > >
> > > > > > Try running with "-machine q35". It changes the emulated machine to
> > > > > > a modern platform.
> > > > >
> > > > > Right, I suppose that matches https://wiki.qemu.org/Features/Q35,
> > > > > interesting. Will definitely try; I mailed the kvm mailing list but
> > > > > have had no feedback so far.
> > > >
> > > > I've been seeing this for a while too, on VMs that are already run
> > > > with -machine q35 and -cpu host. I was blaming my (still unfinished)
> > > > pvclock code, but now I can fairly easily trigger it on single-cpu
> > > > VMs without that, mostly by running kernel compiles in a loop in a
> > > > couple of different guests. I'm using a Fedora 25
> > > > (4.10.15-200.fc25.x86_64) kernel.
> > > >
> > > > Adding some debug output to lapic_delay, it appears the KVM
> > > > virtualized lapic counter hits zero and doesn't reset, so the
> > > > lapic_delay loop in the guest never terminates. KVM has several
> > > > different ways it can provide the lapic counter and I'm not sure
> > > > which one I'm using yet.
> > > >
> > > > I just tried making lapic_delay give up after a million zero reads,
> > > > and it seems to recover after a minute or so. I'll leave it running
> > > > to see if it happens again.
> > >
> > > Hah, interesting. So I know I can try q35 just for the sake of
> > > emulating a newer hw platform, but that won't fix this issue.
> > >
> > > I see this thread/patchset upstream: https://lkml.org/lkml/2017/9/28/773
> > > but I'm not sure that's the same issue.
> > > https://patchwork.kernel.org/patch/9036161/ might be related too?
> > >
> > > When you say you modified lapic_delay, I guess that's on the guest side?
> >
> > yeah, I haven't changed anything on the host side yet.
> >
> > I have a sort of working theory now. It looks like polling the lapic
> > counter in a tight loop starves whatever bit of kvm is responsible for
> > resetting the counter. If the counter hits 0 and then we enter
> > lapic_delay before kvm gets a chance to reset it, it'll never get reset
> > and lapic_delay will never finish.
> >
> > The gross hack below makes lapic_delay issue a hlt instruction when it
> > looks like the lapic counter is stuck, which seems to be enough for kvm
> > to reset it. To make it easier to hit, I also reduced RESETTODR_PERIOD
> > (in sys/kern/kern_time.c) to 30, since it always seems to be the RTC
> > read/write operations that get stuck. Maybe there aren't any other
> > delay()s happening.
> >
> > Next I guess I'm going to dig through the kvm code to see how the lapic
> > counter reset works.
>
> So, for me at least, adding 'kvm-intel.preemption_timer=0' to the Linux
> kernel command line (on the host) fixes this. This option disables kvm's
> use of the VMX preemption timer to schedule lapic counter events. I think
> there's some problem in how kvm calculates the TSC deadline, or in the
> relationship between that and the time value it uses for lapic counter
> reads, but I'm not sure what exactly.
>
I'm also running with this option now; I'll bang the VMs with some heavy
workload and keep an eye on it. Thanks for the time you spent on this! :)
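[Editor's note: to make the host-side workaround persistent, a sketch for a GRUB-based Linux host such as Proxmox follows; file paths and the update-grub command may differ on other distributions, so verify against your setup.]

```shell
# /etc/default/grub -- append the option to the host kernel command line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet kvm-intel.preemption_timer=0"

# Then regenerate the grub config and reboot the host:
#   update-grub && reboot

# After reboot, check whether the parameter took effect
# (should print N when the preemption timer is disabled):
#   cat /sys/module/kvm_intel/parameters/preemption_timer
```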