Hi,

I would like to draw on the list participants' know-how and experience to resolve the following issue. I tried in vain to get NVIDIA's support in the past, then gave up for a long time in the hope that it would get fixed as a matter of course. Coming back to it half a year later (and multiple kernel and driver versions later), I see it still persists. The original post is at https://devtalk.nvidia.com/default/topic/996091/peer-to-peer-dma-issue-/ and I am copying it below.

The bug makes the use of multiple GTX 1080s impossible when I turn on the IOMMU in Linux (tried kernels 4.8 and 4.13, using either standard iommu=on, or iommu=on,igfx_off, or iommu=pt for passthrough mode) on an X99 board. It can be triggered by running any peer-to-peer memory transfer. For example, running the CUDA 8.0 Samples code 1_Utilities/p2pBandwidthLatencyTest from the terminal triggers it: the video driver (and as a result the X server) crashes immediately, and after multiple Ctrl-C's and a wait of tens of seconds the server eventually restarts and I am presented with the X login prompt.

The relevant kernel error messages are (there are thousands of these lines, just a snippet below):

[ 51.691440] DMAR: DRHD: handling fault status reg 2
[ 51.691450] DMAR: [DMA Write] Request device [04:00.0] fault addr f8139000 [fault reason 05] PTE Write access is not set
[ 51.691457] DMAR: [DMA Write] Request device [04:00.0] fault addr f8139000 [fault reason 05] PTE Write access is not set
[ 51.691462] DMAR: [DMA Write] Request device [04:00.0] fault addr f8139000 [fault reason 05] PTE Write access is not set
[ 51.691465] DMAR: [DMA Write] Request device [04:00.0] fault addr f8139000 [fault reason 05] PTE Write access is not set
[ 51.691470] DMAR: DRHD: handling fault status reg 400
[ 51.740674] DMAR: DRHD: handling fault status reg 402
[ 51.740683] DMAR: [DMA Write] Request device [04:00.0] fault addr f8139000 [fault reason 05] PTE Write access is not set
[ 51.740688] DMAR: [DMA Write] Request device [04:00.0] fault addr f8139000 [fault reason 05] PTE Write access is not set
[ 51.740693] DMAR: [DMA Write] Request device [04:00.0] fault addr f8139000 [fault reason 05] PTE Write access is not set

Clearly the above suggests that the CUDA driver is attempting DMA to an address for which the write flag of the corresponding IOMMU page table entry is not set, presumably because the driver has not properly registered/requested access via the general dma_map() kernel
interface (https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt).

Scouting the net reveals a bug report (https://bugzilla.kernel.org/show_bug.cgi?id=188271) for exactly the same problem on totally different hardware (a Supermicro dual-socket board) using Pascal Titan Xs, so cards of the same architecture as mine. Interestingly enough, the kernel error messages in that report show an unauthorized access to *exactly* the same memory address (f8139000):

[16193.666976] DMAR: [DMA Write] Request device [82:00.0] fault addr f8139000 [fault reason 05] PTE Write access is not set

So this looks like a red flag that the indirection afforded by the IOMMU is somehow bypassed and the driver is using hardcoded DMA addresses.

Please note that the author of the bug report claims that setting iommu=igfx_off somehow solves this, but igfx_off per se should be irrelevant here unless IOMMU support is turned on first, with something like iommu=on,igfx_off. Most likely iommu=igfx_off, as opposed to iommu=on, just leaves the IOMMU off altogether, which allows the DMA to succeed. This is exactly what happens on my system too. In other words, the bug report merely states that turning off the IOMMU allows peer-to-peer transfers to work. Still, his detailed log files should be very useful as an independent manifestation of the same issue. My log files are attached to the original thread linked at the start of this post.

I am using an ASRock X99 board (X99E-ITX/ac) with the latest firmware, an Intel i7-6800K, dual Asus GTX 1080 Founders Edition cards, 32 GB RAM and Ubuntu 16.10 (or 17.10 now) with all updates applied (kernel 4.8.0-37, or 4.13 now) and driver 378.13 or 384.69.

Have you come across this while trying to virtualize NVIDIA GPUs?

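For anyone who wants to compare their own logs against the fault lines quoted above, here is a minimal sketch in Python that tallies DMAR faults per device and fault address. It assumes the exact message layout shown in this post (newer kernels may print these lines differently), and the names FAULT_RE and summarize_dmar_faults are just illustrative, not part of any existing tool.

```python
import re
from collections import Counter

# Matches fault lines in the layout quoted in this post, e.g.:
# DMAR: [DMA Write] Request device [04:00.0] fault addr f8139000 [fault reason 05] ...
# Assumption: other kernel versions may format these messages differently.
FAULT_RE = re.compile(
    r"DMAR: \[DMA (?P<op>Read|Write)\] Request device \[(?P<dev>[0-9a-fA-F:.]+)\] "
    r"fault addr (?P<addr>[0-9a-fA-F]+) \[fault reason (?P<reason>\d+)\]"
)


def summarize_dmar_faults(log_text):
    """Count DMAR faults keyed by (device, operation, address, reason)."""
    counts = Counter()
    for match in FAULT_RE.finditer(log_text):
        key = (
            match.group("dev"),
            match.group("op"),
            int(match.group("addr"), 16),   # fault address as an integer
            int(match.group("reason")),     # "05" -> 5
        )
        counts[key] += 1
    return counts


def format_summary(counts):
    """Render the counts as one human-readable line per distinct fault."""
    return [
        f"{dev} DMA {op} addr=0x{addr:x} reason={reason:02d} count={n}"
        for (dev, op, addr, reason), n in counts.most_common()
    ]
```

Feeding it the saved output of dmesg from two different machines makes it easy to see whether the same fault address (f8139000 here) shows up on both.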
Given that the Linux driver forum at NVIDIA refuses to display bug posts by users (they remain "hidden"), and given that NVIDIA would much rather have you buy Quadros and Teslas instead, the conspiracy theorist in me is inclined to believe that VT-d is intentionally disabled in consumer versions of the hardware...

Thanks for any input/solutions!

_______________________________________________
vfio-users mailing list
vfio-users@redhat.com
https://www.redhat.com/mailman/listinfo/vfio-users