Mark Kettenis <mark.kette...@xs4all.nl> wrote:

> > 0(fffffd85fe50a7e0,2,35d87c,4505,ffff80000025c000,fffffd85fe50a7e0) at 0
> > scsi_done(fffffd85fe50a7e0) at scsi_done+0x31
> > nvme_q_complete(ffff800000255000, ffff800002c79a80) at nvme_q_complete+0x134
> > nvme_intr(ffff800000255000) at nvme_intr+0x2b
> > intr_handler(ffff800049e24990, ffff800000254200) at intr_handler+0x91
> > Xintr_ioapic_edge28_untramp() at Xintr_ioapic_edge28_untramp+0x18f
> > acpicpu_idle() at acpicpu_idle+0x131
> > sched_idle(ffffffff82770ff0) at sched_idle+0x298
> > 
> > end trace frame: 0x0, count: 8
> 
> I think this is a bug in nvme(4).  For some reason it gets a
> (spurious?)  interrupt while in the suspended state with stuff torn
> down and dereferences a stale pointer.  We probably need to do a
> better job quiescing the thing when we suspend.

No kidding.
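
For the record, the shape of the quiescing Mark is talking about, as
an untested userland model rather than a diff (every name in it is
invented; it is not nvme.c): mark the softc down before anything is
torn apart, so a late interrupt taken on another cpu bails out
instead of chasing stale pointers.

/*
 * Untested model; every name is invented (it is not nvme.c).
 */
#include <stdatomic.h>
#include <stdio.h>

#define SC_RUNNING	0
#define SC_SUSPENDED	1

struct model_softc {
	atomic_int	sc_state;	/* SC_RUNNING or SC_SUSPENDED */
	int		sc_inflight;	/* commands the hw still owns */
};

/* stand-in for masking the interrupt and deleting the queues */
static void
model_teardown(struct model_softc *sc)
{
	printf("mask interrupts, delete queues\n");
}

static void
model_suspend(struct model_softc *sc)
{
	/*
	 * Flag the controller down *before* tearing anything apart,
	 * so a late interrupt taken on another cpu sees the flag.
	 */
	atomic_store(&sc->sc_state, SC_SUSPENDED);

	/* drain commands the hardware still owns */
	while (sc->sc_inflight > 0)
		;	/* real code would sleep on completions */

	model_teardown(sc);
}

/* interrupt handler: bail before touching any queue state */
static int
model_intr(void *arg)
{
	struct model_softc *sc = arg;

	if (atomic_load(&sc->sc_state) != SC_RUNNING)
		return 0;	/* not ours; queues may be gone */

	/* normal completion processing would go here */
	return 1;
}

int
main(void)
{
	struct model_softc sc = { SC_RUNNING, 0 };

	model_suspend(&sc);
	printf("late intr claimed: %d\n", model_intr(&sc));
	return 0;
}

The ordering is the whole point: flag first, then drain, then
teardown; the handler checks the flag before it touches any queue
state.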

dv, did you get anywhere with your various diffs?  Greg, can you try
out the diffs he sent?  It's a mishmash of solutions, not yet
entirely decided.

The nvme driver doesn't seem to have any soft state variable that
would indicate it is "down".  Comparing against ahci: it has no such
variable either, but inspecting an ahci port will never show work to
do.
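
To make the ahci comparison concrete, a minimal model (names
invented) of why ahci gets away without such a variable: its handler
asks the hardware what completed, and a quiesced port answers 0, so
doing nothing falls out naturally.  An nvme handler asks host memory,
which answers whatever the ring happens to contain, so it needs its
own notion of "down".

/*
 * Sketch only; all names are invented.
 */
#include <stdint.h>
#include <stdio.h>

#define CQE_PHASE	0x0001

struct model_cqe {
	uint16_t	flags;		/* status + phase bit */
};

/* ahci-style: a device register that reads 0 once the port is idle */
static uint32_t
ahci_style_pending(volatile uint32_t *is_reg)
{
	return *is_reg;
}

/* nvme-style: phase-bit compare against a ring in host memory */
static int
nvme_style_pending(struct model_cqe *ring, unsigned int head,
    uint16_t phase)
{
	return ((ring[head].flags & CQE_PHASE) == phase);
}

int
main(void)
{
	volatile uint32_t stopped_port = 0;	/* quiesced ahci port */
	struct model_cqe stale_ring[4] = { { 0x0001 } };	/* junk */

	printf("ahci sees work: %u\n", ahci_style_pending(&stopped_port));
	printf("nvme sees work: %d\n",
	    nvme_style_pending(stale_ring, 0, CQE_PHASE));
	return 0;
}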

It is curious that nvme_q_complete() finds anything to do inside a
ring.  There is no way a scsi transaction should still be sitting on
a queue; the bufq layer has ensured there are no transactions.  I
think the ring contains garbage for some reason.  dlg / jmatthew, any
thoughts?
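
A small model (again with invented names, not nvme.c) of how a
phase-bit ring turns garbage into "work": the driver consumes entries
whose phase bit matches the phase it currently expects, so leftover
or reused memory that happens to carry the right bit is
indistinguishable from a real completion, stale cid and all.

/*
 * Untested model of an nvme-style completion ring; names invented.
 */
#include <stdint.h>
#include <stdio.h>

#define CQE_PHASE	0x0001
#define RING_ENTRIES	8

struct model_cqe {
	uint16_t	cid;	/* command id, used to find the ccb */
	uint16_t	flags;	/* status + phase bit */
};

/* consume entries whose phase bit matches the expected phase */
static int
model_q_complete(struct model_cqe *ring, unsigned int *head,
    uint16_t *phase)
{
	struct model_cqe *cqe;
	int found = 0;

	for (;;) {
		cqe = &ring[*head];
		if ((cqe->flags & CQE_PHASE) != *phase)
			break;		/* no (more) work */

		/* real code would look the ccb up by cid: stale here */
		printf("completing cid %#x\n", cqe->cid);
		found = 1;

		if (++(*head) >= RING_ENTRIES) {
			*head = 0;
			*phase ^= CQE_PHASE;
		}
	}
	return found;
}

int
main(void)
{
	struct model_cqe ring[RING_ENTRIES];
	unsigned int head = 0, i;
	uint16_t phase = CQE_PHASE;

	/* fill the ring with junk, as if the memory had been reused */
	for (i = 0; i < RING_ENTRIES; i++) {
		ring[i].cid = 0xdead;		/* no such command */
		ring[i].flags = (i % 2 == 0) ? CQE_PHASE : 0;
	}

	if (model_q_complete(ring, &head, &phase))
		printf("garbage ring produced a completion\n");
	return 0;
}

That would line up with the trace above: a bogus cid leads to a bogus
ccb, and eventually a call through a null function pointer under
scsi_done().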
