On 5/16/25 1:02 AM, Ankit Soni wrote:
On Wed, May 14, 2025 at 04:08:09PM -0400, Alejandro Jimenez wrote:


On 5/14/25 11:54 AM, Jason Gunthorpe wrote:
On Wed, May 14, 2025 at 09:23:49AM +0000, Ankit Soni wrote:
I am experiencing a system hang with a 5-level v2 page table mode, on boot.
The NVMe boot drive is not initializing.
Below are the relevant dmesg logs with some prints I had added:

[    6.386439] AMD-Vi v2 domain init
[    6.390132] AMD-Vi v2 pt init
[    6.390133] AMD-Vi aperture end last va ffffffffffffff
...
[   10.315372] AMD-Vi gen pt MAP PAGES iova ffffffffffffe000 paddr 19351b000
...
[   72.171930] nvme nvme0: I/O tag 0 (0000) QID 0 timeout, disable controller
[   72.179618] nvme nvme1: I/O tag 24 (0018) QID 0 timeout, disable controller
[   72.197176] nvme nvme0: Identify Controller failed (-4)
[   72.203063] nvme nvme1: Identify Controller failed (-4)
[   72.209237] nvme 0000:05:00.0: probe with driver nvme failed with error -5
[   72.209336] nvme 0000:44:00.0: probe with driver nvme failed with error -5
...
Timed out waiting for the udev queue to be empty.

According to the dmesg logs above, the IOVA for the v2 page table appears
incorrect: it exceeds domain->geometry.aperture_end. Fixing this
requires adding domain->geometry.force_aperture = true; at the
appropriate location. Probably here!

Thank you for pointing out this issue and its cause. I originally tested on
a host with SCSI storage, and after your report I tried but couldn't
reproduce the hang on a Zen4 host with an nvme boot drive. I wanted to see
if it was a pattern common to NVMe, but I suppose it depends on the DMA mask
chosen by the specific driver.

Alejandro


Hi,
Can you try with the command line below?
"amd_iommu=pgtbl_v2 iommu.forcedac=1"

Yes, I can reproduce the hang when booting with the combination of: "amd_iommu=pgtbl_v2 iommu.passthrough=0 iommu.forcedac=1"

[   72.763105] nvme nvme0: I/O tag 8 (0008) QID 0 timeout, disable controller
[   72.772093] nvme nvme0: Device not ready; aborting shutdown, CSTS=0x1
[   72.796372] nvme nvme0: Identify Controller failed (-4)
[   72.802603] nvme 0000:01:00.0: probe with driver nvme failed with error -5

It also triggers failures for the Mellanox driver:

[  134.342120] mlx5_core 0000:61:00.0: wait_func:1185:(pid 3235): ENABLE_HCA(0x104) timeout. Will cause a leak of a command resource
[  134.355465] mlx5_core 0000:61:00.0: mlx5_function_enable:1215:(pid 3235): enable hca failed
[  134.366570] mlx5_core 0000:61:00.0: probe_one:2003:(pid 3235): mlx5_init_one failed with error code -110
[  134.386593] mlx5_core 0000:61:00.0: probe with driver mlx5_core failed with error -110

Setting force_aperture = true in pt_iommu_init_domain() solves the issue for the AMD v2 format where dynamic top is not available.

Thank you,
Alejandro

Indeed, it depends on the DMA mask chosen by the nvme driver. If force_aperture is
not true, the iommu driver will use dma_mask in place of aperture_end.

-Ankit
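
The behaviour Ankit describes can be illustrated with a small standalone
sketch. This is not the kernel code: struct geometry, iova_limit() and
iova_in_aperture() are hypothetical names that only mirror the aperture_end
and force_aperture fields mentioned in the thread, and the 64-bit DMA mask
used in the note below is an assumption that happens to be consistent with
the IOVA in the log.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the geometry fields named in the thread. */
struct geometry {
	uint64_t aperture_end;   /* last IOVA the page table can translate */
	bool force_aperture;     /* clamp IOVA allocation to the aperture? */
};

/*
 * Highest IOVA an allocator may hand out for a device with the given
 * DMA mask. Without force_aperture only the device mask bounds it.
 */
uint64_t iova_limit(const struct geometry *geo, uint64_t dma_mask)
{
	uint64_t limit = dma_mask;

	if (geo->force_aperture && geo->aperture_end < limit)
		limit = geo->aperture_end;
	return limit;
}

/* True if the page table described by geo can translate this IOVA. */
bool iova_in_aperture(const struct geometry *geo, uint64_t iova)
{
	return iova <= geo->aperture_end;
}
```

With the values from the log (aperture_end 0xffffffffffffff and IOVA
0xffffffffffffe000), iova_in_aperture() returns false. With an assumed
64-bit mask, iova_limit() returns the full mask when force_aperture is
false and aperture_end when it is true, so an address like the logged one
would no longer be handed out.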



Yes! It got lost, thanks a lot!

Jason


