amdkfd: Refactor migrate init to support partition switch

Michel Dänzer Fri, 21 Jul 2023 03:09:26 -0700

On 7/21/23 10:55, Michel Dänzer wrote:
> On 7/20/23 22:48, Philip Yang wrote:
>> On 2023-07-20 06:46, Michel Dänzer wrote:
>>> On 7/17/23 15:09, Michel Dänzer wrote:
>>>> On 5/10/23 23:23, Alex Deucher wrote:
>>>>> From: Philip Yang <philip.y...@amd.com>
>>>>>
>>>>> Rename svm_migrate_init to the better name kgd2kfd_init_zone_device
>>>>> because it sets up the zone device pgmap for page migration, and keep it
>>>>> in kfd_migrate.c so it can access the static svm_migrate_pgmap_ops. Call
>>>>> it only once in amdgpu_device_ip_init, after the adev IP blocks are
>>>>> initialized but before amdgpu_amdkfd_device_init initializes the kfd
>>>>> nodes, which enable SVM support based on pgmap.
>>>>>
>>>>> svm_range_set_max_pages is called by kgd2kfd_device_init every time
>>>>> after switching compute partition mode.
>>>>>
>>>>> Signed-off-by: Philip Yang <philip.y...@amd.com>
>>>>> Reviewed-by: Felix Kuehling <felix.kuehl...@amd.com>
>>>>> Signed-off-by: Alex Deucher <alexander.deuc...@amd.com>
>>>> I bisected a regression to this commit, which broke HW acceleration on 
>>>> this ThinkPad E595 with Picasso APU.
>>> Actually, it doesn't seem to break HW acceleration completely. GDM 
>>> eventually comes up with HW acceleration, but it takes a long time 
>>> (~30s or so) to start up.
>>>
>>> Later, the same messages as described in 
>>> https://gitlab.freedesktop.org/drm/amd/-/issues/2659 appear.
>>>
>>> Reverting this commit fixes all of the above symptoms.
>>>
>>>
>>> I reproduced all of the above symptoms with amd-staging-drm-next commit 
>>> 75515acf4b60 ("i2c: nvidia-gpu: Add ACPI property to align with 
>>> device-tree") as well.
>>>
>>>
>>> For full disclosure, I use these kernel command line arguments:
>>>
>>>  fbcon=font:10x18 drm_kms_helper.drm_fbdev_overalloc=112 amdgpu.noretry=1 amdgpu.mcbp=1
>>
>> Thanks for the issue report and full disclosure, but I am not able to 
>> reproduce this issue with either the drm-next branch or the 
>> amd-staging-drm-next branch tip on gitlab. The test system has the same 
>> device ID, runs Ubuntu 22.04 with the latest 
>> linux-firmware-20230625.tar.gz, and has the same BIOS version.
> 
> FWIW, your system has PCI revision ID 0xC2, while mine has 0xC1.
> 
> Also, I'm currently using linux-firmware 20230515. AFAICT there are no 
> relevant changes in 20230625, but I'm attaching the contents of 
> /sys/kernel/debug/dri/0/amdgpu_firmware_info just in case.
> 
> 
>> I attached the full dmesg log. Could you help check whether there are 
>> other differences, maybe kernel config, gcc version...? It is hard to 
>> guess what could cause the basic driver gfx ring IB test timeout.
> 
> I suspect the IOMMU page faults logged in my dmesg might be relevant:
> 
>  amdgpu: Topology: Add APU node [0x15d8:0x1002]
>  amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x122201800 flags=0x0070]
>  amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x1125fe380 flags=0x0070]
>  kfd kfd: amdgpu: added device 1002:15d8


Maybe I should mention my IOMMU-related kernel build configuration:

CONFIG_IRQ_MSI_IOMMU=y
CONFIG_GART_IOMMU=y
CONFIG_VFIO_IOMMU_TYPE1=m
# CONFIG_VFIO_NOIOMMU is not set
CONFIG_IOMMU_IOVA=y
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y
CONFIG_IOMMU_IO_PGTABLE=y
# CONFIG_IOMMU_DEFAULT_DMA_STRICT is not set
CONFIG_IOMMU_DEFAULT_DMA_LAZY=y
# CONFIG_IOMMU_DEFAULT_PASSTHROUGH is not set
CONFIG_IOMMU_DMA=y
CONFIG_IOMMU_SVA=y
CONFIG_AMD_IOMMU=y
CONFIG_AMD_IOMMU_V2=y
# CONFIG_IOMMUFD is not set
CONFIG_IOMMU_HELPER=y
# CONFIG_IOMMU_DEBUG is not set
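
For comparing configurations between systems, a listing like the one above can be pulled out of a kernel config file with a few lines of Python. This is only a minimal sketch: the helper name iommu_options is hypothetical, and where the config text comes from (for example /boot/config-<release>, or /proc/config.gz on kernels built with that option) depends on the distribution.

```python
import re

# Hypothetical helper, not part of any kernel tooling: it only filters text.
# Matches both "CONFIG_..._IOMMU...=value" and "# CONFIG_...IOMMU... is not set".
_IOMMU_OPTION = re.compile(r"^(?:# )?CONFIG_\w*IOMMU\w*(?:=| is not set)")


def iommu_options(config_text):
    """Return the kernel config lines that name an IOMMU option, set or unset."""
    return [line for line in config_text.splitlines()
            if _IOMMU_OPTION.match(line)]
```

Running it on the config of both machines and diffing the two resulting lists would show whether the IOMMU setup differs between them.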


> There are no such page faults with the commit reverted.
> 
> Other than that and the IB test failure messages, our dmesg outputs are 
> mostly identical indeed.
> 
> 

-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer

Re: [PATCH 28/29] drm/amdkfd: Refactor migrate init to support partition switch
