On Mon, Feb 05, 2018 at 01:36:00PM +0000, Mark Rutland wrote:
> On Fri, Feb 02, 2018 at 10:07:26PM +0000, Mark Rutland wrote:
> > On Fri, Feb 02, 2018 at 08:55:06PM +0100, Peter Zijlstra wrote:
> > > On Fri, Feb 02, 2018 at 07:27:04PM +0000, Mark Rutland wrote:
> > > > ... in some cases, owner_cpu is -1, so I guess we're racing with an
> > > > unlock. I only ever see this on the runqueue locks in wake up
> > > > functions.
> > >
> > > So runqueue locks are special in that the owner changes over a context
> > > switch, maybe something goes funny there?
> >
> > Aha! I think that's it!
> >
> > In finish_lock_switch() we do:
> >
> > 	smp_store_release(&prev->on_cpu, 0);
> > 	...
> > 	rq->lock.owner = current;
> >
> > As soon as we update prev->on_cpu, prev can be scheduled on another
> > CPU, and can thus see a stale value for rq->lock.owner (e.g. if it
> > tries to wake up another task on that rq).
>
> I hacked in a forced vCPU preemption between the two using a sled of WFE
> instructions, and now I can trigger the problem in seconds rather than
> hours.
>
> With the patch below applied, things seem to be fine so far.
>
> So I'm pretty sure this is it. I'll clean up the patch text and resend
> that in a bit.
Also try to send it against an up-to-date scheduler tree; we just moved some stuff around right about there.
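For reference, the race described in the quoted mail can be sketched as follows. This is an illustrative kernel-style fragment, not the actual patch from the thread (which is elided above); the function bodies here only show the two stores under discussion, with all other finish_lock_switch() work omitted:

```c
/*
 * Sketch of the problematic ordering (names taken from the thread,
 * surrounding context-switch code elided):
 */
static inline void finish_lock_switch_buggy(struct rq *rq,
					    struct task_struct *prev)
{
	/*
	 * Once on_cpu is cleared, 'prev' can be picked up and run on
	 * another CPU immediately...
	 */
	smp_store_release(&prev->on_cpu, 0);
	/* ... */
	/*
	 * ... and if it takes a wakeup path that inspects rq->lock.owner
	 * before this store lands, it sees a stale owner (hence the
	 * observed owner_cpu == -1 races).
	 */
	rq->lock.owner = current;
}

/*
 * One possible repair (an assumption here, not necessarily the posted
 * patch): publish the new owner before releasing on_cpu, so 'prev'
 * cannot run anywhere else until rq->lock.owner is up to date.
 */
static inline void finish_lock_switch_fixed(struct rq *rq,
					    struct task_struct *prev)
{
	rq->lock.owner = current;
	smp_store_release(&prev->on_cpu, 0);
}
```

The key property is that smp_store_release() orders all prior stores before the release of prev->on_cpu, so moving the owner update ahead of it makes the new owner visible to any CPU that subsequently observes on_cpu == 0.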