On 2025/5/21 20:00, Christian König wrote:
> On 5/21/25 13:55, Zhang, GuoQing (Sam) wrote:
>> On 2025/5/21 16:06, Christian König wrote:
>>> On 5/20/25 07:10, Zhang, GuoQing (Sam) wrote:
>>>>>> +	if (amdgpu_virt_xgmi_migrate_enabled(adev)) {
>>>>>> +		/* set mc->vram_start to 0 to switch the returned GPU address of
>>>>>> +		 * amdgpu_bo_create_reserved() from FB aperture to GART aperture.
>>>>>> +		 */
>>>>>> +		amdgpu_gmc_vram_location(adev, mc, 0);
>>>>> This function does a lot more than just setting mc->vram_start and
>>>>> mc->vram_end.
>>>>>
>>>>> You should probably just update the two setting and not call
>>>>> amdgpu_gmc_vram_location() at all.
>>>> I tried only setting mc->vram_start and mc->vram_end. But KMD load will
>>>> fail with following error logs.
>>>>
>>>> [ 329.314346] amdgpu 0000:09:00.0: amdgpu: VRAM: 196288M 0x0000000000000000 - 0x0000002FEBFFFFFF (196288M used)
>>>> [ 329.314348] amdgpu 0000:09:00.0: amdgpu: GART: 512M 0x0000018000000000 - 0x000001801FFFFFFF
>>>> [ 329.314385] [drm] Detected VRAM RAM=196288M, BAR=262144M
>>>> [ 329.314386] [drm] RAM width 8192bits HBM
>>>> [ 329.314546] amdgpu 0000:09:00.0: amdgpu: (-22) failed to allocate kernel bo
>>>> [ 329.315013] [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block <gmc_v9_0> failed -22
>>>> [ 329.315846] amdgpu 0000:09:00.0: amdgpu: amdgpu_device_ip_init failed
>>>>
>>>>
>>>> It seems like setting mc->visible_vram_size and mc->visible_vram_size
>>>> fields are also needed. In this case call amdgpu_gmc_vram_location() is
>>>> better than inline the logic, I think.
>>> Yeah, exactly that is not a good idea.
>>>
>>> The mc->visible_vram_size and mc->real_vram_size should have been
>>> initialized by gmc_v9_0_mc_init(). Why didn't that happen?
>>
>> [Sam] visible_vram_size is set to 0x4000000000 (256G) from
>> `pci_resource_len(adev->pdev, 0)` in `gmc_v9_0_mc_init()`.
>> It is set to real_vram_size 0x2fec000000(192G) in
>> amdgpu_gmc_vram_location().
>>
>> Should I update the 3 variables inline and not call
>> amdgpu_gmc_vram_location()?
>>
>> mc->vram_start = 0;
>> mc->vram_end = mc->vram_start + mc->mc_vram_size - 1;
>> if (mc->real_vram_size < mc->visible_vram_size)
>> 	mc->visible_vram_size = mc->real_vram_size;
> Yeah that seems to make sense.
>
>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>>>>>> index 84cde1239ee4..18e80aa78aff 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>>>>>> @@ -45,8 +45,10 @@ static u64 mmhub_v1_8_get_fb_location(struct amdgpu_device *adev)
>>>>>> 	top &= MC_VM_FB_LOCATION_TOP__FB_TOP_MASK;
>>>>>> 	top <<= 24;
>>>>>>
>>>>>> -	adev->gmc.fb_start = base;
>>>>>> -	adev->gmc.fb_end = top;
>>>>>> +	if (!amdgpu_virt_xgmi_migrate_enabled(adev)) {
>>>>>> +		adev->gmc.fb_start = base;
>>>>>> +		adev->gmc.fb_end = top;
>>>>>> +	}
>>>>> We should probably avoid calling this in the first place.
>>>>>
>>>>> The function gmc_v9_0_vram_gtt_location() should probably be adjusted.
>>>> mmhub_v1_8_get_fb_location() is called by the new
>>>> amdgpu_bo_fb_aper_addr() as well, not just gmc_v9_0_vram_gtt_location().
>>> Oh, that is probably a bad idea. The function amdgpu_bo_fb_aper_addr()
>>> should only rely on cached data.
>>
>> [Sam] Can I add new `fb_base` field in `struct amdgpu_gmc` to cache the
>> value of `get_fb_location()`?
> No, please try to avoid that.

OK. So "amdgpu_bo_fb_aper_addr() should only rely on cached data." is not
required, and I don't need to change the current amdgpu_bo_fb_aper_addr()
implementation, right?

>
>> Using this approach, we don't need to set fb_start and fb_end on resume
>> any more, since the reset of the 2 field is caused by
>> mmhub_v1_8_get_fb_location() calls from amdgpu_bo_fb_aper_addr().
>> Please see the code change below.
> What is wrong with setting fb_start and fb_end on resume?

It works. I have updated the patch in this way.

>>>> mmhub_v1_8_get_fb_location() is supposed to be a query api according to
>>>> its name. having such side effect is very surprising.
>>>>
>>>> Another approach is set the right fb_start and fb_end in the new
>>>> amdgpu_virt_resume(), like updating vram_base_offset.
>>> That is probably better. And skip setting fb_start and fb_end in
>>> amdgpu_gmc_sysvm_location() for this use case.

Setting fb_start and fb_end in amdgpu_gmc_sysvm_location() is needed for
normal KMD load, since amdgpu_virt_resume() is not called on normal KMD load.

I have sent out the v7 patch series. Please take another look. Thank you!

Mail titles:
[PATCH v7 0/4] enable xgmi node migration support for hibernate on SRIOV
[PATCH v7 1/4] drm/amdgpu: update xgmi info and vram_base_offset on resume
[PATCH v7 2/4] drm/amdgpu: update GPU addresses for SMU and PSP
[PATCH v7 3/4] drm/amdgpu: enable pdb0 for hibernation on SRIOV
[PATCH v7 4/4] drm/amdgpu: fix fence fallback timer expired error

Changes:
- remove pdb0_enabled and add gmc_v9_0_is_pdb0_enabled()
- remove amdgpu_gmc_vram_location() call in amdgpu_gmc_sysvm_location()
- remove check in mmhub_v1_8_get_fb_location() and update fb_start/fb_end
  on resume

Regards
Sam

>>>
>>> That was done only because we re-program those registers on bare metal.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Which approach do you prefer? Or any better suggestions? Thank you.
>>>>
>>>>
>>>> Regards
>>>> Sam
>>>>
>>>>
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>> 	return base;
>>>>>> }
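
For reference, the three-field inline update discussed in this thread (vram_start, vram_end, and the visible_vram_size clamp) can be modeled outside the kernel. This is only a minimal sketch: `struct gmc_sketch` and `gmc_sketch_place_vram_at_zero()` are hypothetical names made up for illustration, the struct holds only the fields mentioned in the thread rather than the real `struct amdgpu_gmc`, and it is not the code from the v7 patches.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the few struct amdgpu_gmc fields named in this
 * thread; the real structure has many more members.
 */
struct gmc_sketch {
	uint64_t vram_start;
	uint64_t vram_end;
	uint64_t mc_vram_size;
	uint64_t real_vram_size;
	uint64_t visible_vram_size;
};

/* Hypothetical helper: place VRAM at MC address 0 and clamp the visible
 * size to the real size, mirroring the three-field update quoted in the
 * thread. It deliberately does nothing else.
 */
static void gmc_sketch_place_vram_at_zero(struct gmc_sketch *mc)
{
	/* vram_end is inclusive, so a zero size would underflow below. */
	assert(mc->mc_vram_size > 0);

	mc->vram_start = 0;
	mc->vram_end = mc->vram_start + mc->mc_vram_size - 1;
	if (mc->real_vram_size < mc->visible_vram_size)
		mc->visible_vram_size = mc->real_vram_size;
}
```

Assuming mc_vram_size equals the real_vram_size of 0x2fec000000 reported above (an assumption, the thread does not state mc_vram_size), and visible_vram_size starts at the BAR size 0x4000000000, the helper leaves vram_end at 0x2FEBFFFFFF, which is the end of the VRAM range in the quoted log, and clamps visible_vram_size to 0x2fec000000.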