On 2025/5/7 20:56, Christian König wrote: > On 5/7/25 14:49, Sam wrote: >> On 2025/5/7 20:21, Christian König wrote: >>> On 5/7/25 13:03, Sam wrote: >>>> On 2025/5/7 18:03, Lazar, Lijo wrote: >>>>> On 5/7/2025 11:52 AM, Zhang, GuoQing (Sam) wrote: >>>>>> [AMD Official Use Only - AMD Internal Distribution Only] >>>>>> >>>>>> >>>>>> >>>>>>> Please keep in mind that this is not the only scenario addressed by the >>>>>>> driver - for ex: a resume sequence is executed after a device reset. >>>>>>> This patch itself introduces unwanted sequences for other commonly used >>>>>>> usecases. Please rework on the series without breaking existing >>>>>>> usecases. >>>>>>> Thanks, >>>>>>> Lijo >>>>>> Hi @Lazar, Lijo<mailto:lijo.la...@amd.com>, Thank you for the >>>>>> feedback. >>>>>> >>>>>> I also think the new code should be inside a check so that new code is >>>>>> executed only on resume with different VF and do not break existing >>>>>> usecases. Following is the implementation of this approach I can think >>>>>> of. >>>>>> >>>>>> - introduce new field `prev_physical_node_id ` in `struct amdgpu_xgmi `. >>>>>> update the fields on resume. >>>>>> >>>>>> - put new code inside code block `if (prev_physical_node_id != >>>>>> physical_node_id )` >>>>>> >>>>>> >>>>> Can this happen only with XGMI under this condition? Any other method >>>>> possible like preparing a 'unique signature' and matching it to identify >>>>> if it resumed on an identically configured system? >>>> Yes, this hibernate-resume with different VF feature is only for devices >>>> with XGMI. Detecting XGMI node id change is the only way I can think of to >>>> identify the case. It's also a very simple way. >>>> >>>> @Koenig, Christian<mailto:christian.koe...@amd.com> Are you OK with this >>>> approach, adding a check for the new code sequence? >>> Well you still need to avoid calling gmc_v9_0_mc_init() since that is most >>> likely incorrect. >> Yes, I will change it to >> >> if (amdgpu_xmgi_is_node_changed(adev)) >> gmc_v9_0_vram_gtt_location(adev, &adev->gmc); > Even that is incorrect. The VRAM and GTT location can't change on resume. > > What changes are the XGMI node ID and with it where inside the FB aperture > our VRAM PDB0 should point to.
2 updates in `gmc_v9_0_vram_gtt_location()` is needed: 1. `vm_manager.vram_base_offset` is changed with new `xgmi.physical_node_id` at the end of `gmc_v9_0_vram_gtt_location()`. 2. `fb_start` and `fb_end` got reset in `mmhub_v1_8_get_fb_location()` called by the new `amdgpu_bo_fb_aper_addr()`. It needs to be updated again. For update 1, I can inline it in `gmc_v9_0_resume()`. For update 2, I can disable reset `fb_start/fb_end` in `mmhub_v1_8_get_fb_location()` when pdb0 is enabled. Is this OK? @Koenig, Christian <mailto:christian.koe...@amd.com> Regards Sam > Regards, > Christian. > >> And remove the change of `refresh`. >> >> Regards >> Sam >> >> >>> Regards, >>> Christian. >>> >>>>> Regardless, instead of having a direct check, better to wrap it inside >>>>> something like >>>>> if (amdgpu_virt_need_migration()) or something more appropriate. >>>> Yes, I will do that. Thank you! >>>> >>>> Regards >>>> Sam >>>> >>>>> Thanks, >>>>> Lijo >>>>> >>>>>> Is this approach acceptable? If not, can you suggest a better approach? >>>>>> @Lazar, Lijo<mailto:lijo.la...@amd.com> @Koenig, Christian >>>>>> <mailto:christian.koe...@amd.com> Thank you! >>>>>> >>>>>> Regards >>>>>> >>>>>> Sam >>>>>> >>>>>> *From: *Lazar, Lijo<lijo.la...@amd.com> >>>>>> *Date: *Tuesday, May 6, 2025 at 19:55 >>>>>> *To: *Zhang, GuoQing (Sam)<guoqing.zh...@amd.com>, amd- >>>>>> g...@lists.freedesktop.org <amd-gfx@lists.freedesktop.org> >>>>>> *Cc: *Zhao, Victor<victor.z...@amd.com>, Chang, HaiJun >>>>>> <haijun.ch...@amd.com>, Koenig, Christian<christian.koe...@amd.com>, >>>>>> Deucher, Alexander<alexander.deuc...@amd.com>, Zhang, Owen(SRDC) >>>>>> <owen.zha...@amd.com>, Ma, Qing (Mark)<qing...@amd.com>, Jiang Liu >>>>>> <ge...@linux.alibaba.com> >>>>>> *Subject: *Re: [PATCH v3 1/7] drm/amdgpu: update XGMI physical node id >>>>>> and GMC configs on resume >>>>>> >>>>>> >>>>>> >>>>>> On 5/6/2025 3:06 PM, Samuel Zhang wrote: >>>>>>> For virtual machine with vGPUs in SRIOV single device mode and XGMI >>>>>>> is enabled, XGMI physical node ids may change when waking up from >>>>>>> hiberation with different vGPU devices. So update XGMI physical node >>>>>>> ids on resume. >>>>>>> >>>>>> Please keep in mind that this is not the only scenario addressed by the >>>>>> driver - for ex: a resume sequence is executed after a device reset. >>>>>> This patch itself introduces unwanted sequences for other commonly used >>>>>> usecases. Please rework on the series without breaking existing usecases. >>>>>> >>>>>> Thanks, >>>>>> Lijo >>>>>> >>>>>>> Update GPU memory controller configuration on resume if XGMI physical >>>>>>> node ids are changed. >>>>>>> >>>>>>> Signed-off-by: Jiang Liu<ge...@linux.alibaba.com> >>>>>>> Signed-off-by: Samuel Zhang<guoqing.zh...@amd.com> >>>>>>> --- >>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 >>>>>>> ++++++++++++++++++++++ >>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 3 +-- >>>>>>> drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 4 ++++ >>>>>>> 3 files changed, 29 insertions(+), 2 deletions(-) >>>>>>> >>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/ >>>>>> drm/amd/amdgpu/amdgpu_device.c >>>>>>> index d477a901af84..e795af5067e5 100644 >>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>> @@ -5040,6 +5040,27 @@ int amdgpu_device_suspend(struct drm_device >>>>>> *dev, bool notify_clients) >>>>>>> return 0; >>>>>>> } >>>>>>> +static int amdgpu_device_update_xgmi_info(struct amdgpu_device >>>>>>> *adev) >>>>>>> +{ >>>>>>> + int r; >>>>>>> + unsigned int prev_physical_node_id; >>>>>>> + >>>>>>> + /* Get xgmi info again for sriov to detect device changes */ >>>>>>> + if (amdgpu_sriov_vf(adev) && >>>>>>> + !(adev->flags & AMD_IS_APU) && >>>>>>> + adev->gmc.xgmi.supported && >>>>>>> + !adev->gmc.xgmi.connected_to_cpu) { >>>>>>> + prev_physical_node_id = adev->gmc.xgmi.physical_node_id; >>>>>>> + r = adev->gfxhub.funcs->get_xgmi_info(adev); >>>>>>> + if (r) >>>>>>> + return r; >>>>>>> + >>>>>>> + dev_info(adev->dev, "xgmi node, old id %d, new id %d\n", >>>>>>> + prev_physical_node_id, adev- >>>>>>> gmc.xgmi.physical_node_id); >>>>>>> + } >>>>>>> + return 0; >>>>>>> +} >>>>>>> + >>>>>>> /** >>>>>>> * amdgpu_device_resume - initiate device resume >>>>>>> * >>>>>>> @@ -5059,6 +5080,9 @@ int amdgpu_device_resume(struct drm_device *dev, >>>>>> bool notify_clients) >>>>>>> r = amdgpu_virt_request_full_gpu(adev, true); >>>>>>> if (r) >>>>>>> return r; >>>>>>> + r = amdgpu_device_update_xgmi_info(adev); >>>>>>> + if (r) >>>>>>> + return r; >>>>>>> } >>>>>>> if (dev->switch_power_state == DRM_SWITCH_POWER_OFF) >>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/ >>>>>> drm/amd/amdgpu/amdgpu_gmc.c >>>>>>> index d1fa5e8e3937..a2abddf3c110 100644 >>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c >>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c >>>>>>> @@ -1298,8 +1298,7 @@ int amdgpu_gmc_get_nps_memranges(struct >>>>>> amdgpu_device *adev, >>>>>>> if (!mem_ranges || !exp_ranges) >>>>>>> return -EINVAL; >>>>>>> - refresh = (adev->init_lvl->level != >>>>>> AMDGPU_INIT_LEVEL_MINIMAL_XGMI) && >>>>>>> - (adev->gmc.reset_flags & AMDGPU_GMC_INIT_RESET_NPS); >>>>>>> + refresh = true; >>>>>>> ret = amdgpu_discovery_get_nps_info(adev, &nps_type, &ranges, >>>>>>> &range_cnt, refresh); >>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c >>>>>>> b/drivers/gpu/drm/ >>>>>> amd/amdgpu/gmc_v9_0.c >>>>>>> index 59385da80185..1eb451a3743b 100644 >>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c >>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c >>>>>>> @@ -2533,6 +2533,10 @@ static int gmc_v9_0_resume(struct >>>>>> amdgpu_ip_block *ip_block) >>>>>>> struct amdgpu_device *adev = ip_block->adev; >>>>>>> int r; >>>>>>> + r = gmc_v9_0_mc_init(adev); >>>>>>> + if (r) >>>>>>> + return r; >>>>>>> + >>>>>>> /* If a reset is done for NPS mode switch, read the memory >>>>>>> range >>>>>>> * information again. >>>>>>> */