https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=283189

--- Comment #1 from Jason A. Harmening <j...@freebsd.org> ---
Reverting from nda(4) to nvd(4) didn't resolve this issue; if anything, it
appeared to make it slightly worse.  I did enable NVMe verbose command logging,
which yielded error logs like the following:

nvme0: WRITE sqid:14 cid:120 nsid:1 lba:475732976 len:96
DMAR4: nvme0: pci7:0:0 sid 700 fault acc 1 adt 0x0 reason 0x6 addr a5243000
nvme0: nsid:0x1 rsvd2:0 rsvd3:0 mptr:0 prp1:0xa5243000 prp2:0xcf800
nvme0: cdw10: 0x1c5b1bf0 cdw11:0 cdw12:0x5f cdw13:0 cdw14:0 cdw15:0
nvme0: DATA TRANSFER ERROR (00/04) crd:0 m:1 dnr:1 p:0 sqid:14 cid:120 cdw0:0

The errors continue to occur only on NVMe writes (i.e. DMA read accesses by the
controller).  I still haven't seen these faults for any device besides nvme,
and all of them show the same DMAR fault code and similarly small transfer
sizes.
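
As a cross-check on the log above, the raw command dwords can be decoded and
compared with the lba/len fields on the WRITE line.  This is only a minimal
Python sketch, assuming the standard NVMe Write layout (CDW10/CDW11 hold the
starting LBA, and the low 16 bits of CDW12 hold a zero-based block count).
The helper name decode_rw is made up for illustration, and the 512-byte block
size is an assumption, since the log does not show the namespace's LBA format.

```python
def decode_rw(cdw10, cdw11, cdw12):
    """Decode starting LBA and block count from NVMe read/write dwords."""
    slba = (cdw11 << 32) | cdw10        # 64-bit starting LBA
    nblocks = (cdw12 & 0xFFFF) + 1      # NLB field is zero-based
    return slba, nblocks


# Values taken from the logged command above.
slba, nblocks = decode_rw(0x1c5b1bf0, 0, 0x5f)
print(slba, nblocks)                    # 475732976 96

# Assumed block size; not shown in the log.
ASSUMED_LBA_BYTES = 512
print(nblocks * ASSUMED_LBA_BYTES)      # 49152 bytes under that assumption
```

The decoded values agree with lba:475732976 len:96 in the first log line.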

Interestingly, all of the errors I've seen so far (about 15 of them since
enabling verbose logging) show the DMAR fault being taken against the buffer in
PRP1, even in cases in which PRP2 is populated.  So it seems the NVMe access
that triggers the fault is always at the beginning of the region mapped by the
NVMe command.
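
Checking each fault against PRP1 by hand gets tedious, so something like the
following could be run over the logs.  It is a rough Python sketch that only
assumes the two line formats shown in the excerpt above; the function name
fault_hits_prp1 and the 4 KiB page size are assumptions for illustration, not
anything taken from the driver.

```python
import re

# Patterns match only the line formats shown in the log excerpt.
FAULT_RE = re.compile(r"DMAR\d+: .* reason (0x[0-9a-f]+) addr ([0-9a-f]+)")
PRP_RE = re.compile(r"prp1:(0x[0-9a-f]+) prp2:(0x[0-9a-f]+)")


def fault_hits_prp1(fault_line, prp_line, page_size=4096):
    """Return True if the DMAR fault address lies in the page named by PRP1.

    Returns None if either line does not match the expected format.
    page_size is an assumed 4 KiB page.
    """
    fault = FAULT_RE.search(fault_line)
    prp = PRP_RE.search(prp_line)
    if fault is None or prp is None:
        return None
    fault_addr = int(fault.group(2), 16)
    prp1 = int(prp.group(1), 16)
    mask = ~(page_size - 1)
    return (fault_addr & mask) == (prp1 & mask)


fault_line = ("DMAR4: nvme0: pci7:0:0 sid 700 fault acc 1 adt 0x0 "
              "reason 0x6 addr a5243000")
prp_line = ("nvme0: nsid:0x1 rsvd2:0 rsvd3:0 mptr:0 "
            "prp1:0xa5243000 prp2:0xcf800")
print(fault_hits_prp1(fault_line, prp_line))   # True
```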

This "smells" like the sort of issue I'm used to seeing at $work on
weakly-ordered arm64 devices when there is a missing barrier between a page
table modification and a memory access that has an implicit dependency on the
page table modification.  In this case the page table modification would be the
DMAR PTE write that maps the PRP1 buffer, while the memory access would be the
NVMe controller read triggered by appending the write command to the submission
queue.  But I would be surprised if that kind of issue is at play here given
the stronger ordering of the x86 memory model.
