On 7/21/23 10:55, Michel Dänzer wrote:
> On 7/20/23 22:48, Philip Yang wrote:
>> On 2023-07-20 06:46, Michel Dänzer wrote:
>>> On 7/17/23 15:09, Michel Dänzer wrote:
>>>> On 5/10/23 23:23, Alex Deucher wrote:
>>>>> From: Philip Yang <philip.y...@amd.com>
>>>>>
>>>>> Rename smv_migrate_init to a better name kgd2kfd_init_zone_device
>>>>> because it setup zone devive pgmap for page migration and keep it in
>>>>> kfd_migrate.c to access static functions svm_migrate_pgmap_ops. Call it
>>>>> only once in amdgpu_device_ip_init after adev ip blocks are initialized,
>>>>> but before amdgpu_amdkfd_device_init initialize kfd nodes which enable
>>>>> SVM support based on pgmap.
>>>>>
>>>>> svm_range_set_max_pages is called by kgd2kfd_device_init everytime after
>>>>> switching compute partition mode.
>>>>>
>>>>> Signed-off-by: Philip Yang <philip.y...@amd.com>
>>>>> Reviewed-by: Felix Kuehling <felix.kuehl...@amd.com>
>>>>> Signed-off-by: Alex Deucher <alexander.deuc...@amd.com>
>>>> I bisected a regression to this commit, which broke HW acceleration on
>>>> this ThinkPad E595 with Picasso APU.
>>> Actually, it doesn't seem to break HW acceleration completely. GDM
>>> eventually comes up with HW acceleration, it takes a long time (~30s or so)
>>> to start up though.
>>>
>>> Later, the same messages as described in
>>> https://gitlab.freedesktop.org/drm/amd/-/issues/2659 appear.
>>>
>>> Reverting this commit fixes all of the above symptoms.
>>>
>>>
>>> I reproduced all of the above symptoms with amd-staging-drm-next commit
>>> 75515acf4b60 ("i2c: nvidia-gpu: Add ACPI property to align with
>>> device-tree") as well.
>>>
>>>
>>> For full disclosure, I use these kernel command line arguments:
>>>
>>> fbcon=font:10x18 drm_kms_helper.drm_fbdev_overalloc=112 amdgpu.noretry=1
>>> amdgpu.mcbp=1
>>
>> Thanks for the issue report and full disclosure, but I am not able to
>> reproduce this issue, with both drm-next branch and amd-staging-drm-next
>> branch tip on gitlab.
>> The test system has same device id, running Ubuntu
>> 22.04, latest linux-firmware-20230625.tar.gz, and same BIOS version.
>
> FWIW, your system has PCI revision ID 0xC2, while mine has 0xC1.
>
> Also, I'm currently using linux-firmware 20230515. AFAICT there are no
> relevant changes in 20230625, but I'm attaching the contents of
> /sys/kernel/debug/dri/0/amdgpu_firmware_info just in case.
>
>
>> I attached full dmesg log, could you help check if there is other
>> difference, maybe kernel config, gcc version... it is hard to guess what
>> could cause the basic driver gfx ring IB test timeout.
>
> I suspect the IOMMU page faults logged in my dmesg might be relevant:
>
> amdgpu: Topology: Add APU node [0x15d8:0x1002]
> amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000
> address=0x122201800 flags=0x0070]
> amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000
> address=0x1125fe380 flags=0x0070]
> kfd kfd: amdgpu: added device 1002:15d8

Maybe I should mention my IOMMU related kernel build configuration:

CONFIG_IRQ_MSI_IOMMU=y
CONFIG_GART_IOMMU=y
CONFIG_VFIO_IOMMU_TYPE1=m
# CONFIG_VFIO_NOIOMMU is not set
CONFIG_IOMMU_IOVA=y
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y
CONFIG_IOMMU_IO_PGTABLE=y
# CONFIG_IOMMU_DEFAULT_DMA_STRICT is not set
CONFIG_IOMMU_DEFAULT_DMA_LAZY=y
# CONFIG_IOMMU_DEFAULT_PASSTHROUGH is not set
CONFIG_IOMMU_DMA=y
CONFIG_IOMMU_SVA=y
CONFIG_AMD_IOMMU=y
CONFIG_AMD_IOMMU_V2=y
# CONFIG_IOMMUFD is not set
CONFIG_IOMMU_HELPER=y
# CONFIG_IOMMU_DEBUG is not set

> There are no such page faults with the commit reverted.
>
> Other than that and the IB test failure messages, our dmesg outputs are
> mostly identical indeed.
>

-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer