On 2025/5/21 16:06, Christian König wrote: > On 5/20/25 07:10, Zhang, GuoQing (Sam) wrote: >>>> + if (amdgpu_virt_xgmi_migrate_enabled(adev)) { >>>> + /* set mc->vram_start to 0 to switch the returned GPU address >>>> of >>>> + * amdgpu_bo_create_reserved() from FB aperture to GART >>>> aperture. >>>> + */ >>>> + amdgpu_gmc_vram_location(adev, mc, 0); >>> This function does a lot more than just setting mc->vram_start and >>> mc->vram_end. >>> >>> You should probably just update the two setting and not call >>> amdgpu_gmc_vram_location() at all. >> I tried only setting mc->vram_start and mc->vram_end. But KMD load will >> fail with following error logs. >> >> [ 329.314346] amdgpu 0000:09:00.0: amdgpu: VRAM: 196288M >> 0x0000000000000000 - 0x0000002FEBFFFFFF (196288M used) >> [ 329.314348] amdgpu 0000:09:00.0: amdgpu: GART: 512M >> 0x0000018000000000 - 0x000001801FFFFFFF >> [ 329.314385] [drm] Detected VRAM RAM=196288M, BAR=262144M >> [ 329.314386] [drm] RAM width 8192bits HBM >> [ 329.314546] amdgpu 0000:09:00.0: amdgpu: (-22) failed to allocate >> kernel bo >> [ 329.315013] [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP >> block <gmc_v9_0> failed -22 >> [ 329.315846] amdgpu 0000:09:00.0: amdgpu: amdgpu_device_ip_init failed >> >> >> It seems like setting mc->visible_vram_size and mc->visible_vram_size >> fields are also needed. In this case call amdgpu_gmc_vram_location() is >> better than inline the logic, I think. > Yeah, exactly that is not a good idea. > > The mc->visible_vram_size and mc->real_vram_size should have been initialized > by gmc_v9_0_mc_init(). Why didn't that happen?
[Sam] visible_vram_size is set to 0x4000000000 (256G) from `pci_resource_len(adev->pdev, 0)` in `gmc_v9_0_mc_init()`. It is set to real_vram_size 0x2fec000000(192G) in amdgpu_gmc_vram_location(). Should I update the 3 variables inline and not call amdgpu_gmc_vram_location()? mc->vram_start = 0; mc->vram_end = mc->vram_start + mc->mc_vram_size - 1; if (mc->real_vram_size < mc->visible_vram_size) mc->visible_vram_size = mc->real_vram_size; > >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c >>>> b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c >>>> index 84cde1239ee4..18e80aa78aff 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c >>>> @@ -45,8 +45,10 @@ static u64 mmhub_v1_8_get_fb_location(struct >>>> amdgpu_device *adev) >>>> top &= MC_VM_FB_LOCATION_TOP__FB_TOP_MASK; >>>> top <<= 24; >>>> >>>> - adev->gmc.fb_start = base; >>>> - adev->gmc.fb_end = top; >>>> + if (!amdgpu_virt_xgmi_migrate_enabled(adev)) { >>>> + adev->gmc.fb_start = base; >>>> + adev->gmc.fb_end = top; >>>> + } >>> We should probably avoid calling this in the first place. >>> >>> The function gmc_v9_0_vram_gtt_location() should probably be adjusted. >> mmhub_v1_8_get_fb_location() is called by the new >> amdgpu_bo_fb_aper_addr() as well, not just gmc_v9_0_vram_gtt_location(). > Oh, that is probably a bad idea. The function amdgpu_bo_fb_aper_addr() should > only rely on cached data. [Sam] Can I add new `fb_base` field in `struct amdgpu_gmc` to cache the value of `get_fb_location()`? Using this approach, we don't need to set fb_start and fb_end on resume any more, since the reset of the 2 field is caused by mmhub_v1_8_get_fb_location() calls from amdgpu_bo_fb_aper_addr(). Please see the code change below. --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h @@ -259,6 +259,7 @@ struct amdgpu_gmc { */ u64 fb_start; u64 fb_end; + u64 fb_base; unsigned vram_width; u64 real_vram_size; int vram_mtrr; --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c @@ -1527,7 +1527,7 @@ u64 amdgpu_bo_fb_aper_addr(struct amdgpu_bo *bo) WARN_ON_ONCE(bo->tbo.resource->mem_type != TTM_PL_VRAM); - fb_base = adev->mmhub.funcs->get_fb_location(adev); + fb_base = adev->gmc.fb_base; fb_base += adev->gmc.xgmi.physical_node_id * adev->gmc.xgmi.node_segment_size; offset = (bo->tbo.resource->start << PAGE_SHIFT) + fb_base; return amdgpu_gmc_sign_extend(offset); --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c @@ -1728,6 +1728,7 @@ static void gmc_v9_0_vram_gtt_location(struct amdgpu_device *adev, struct amdgpu_gmc *mc) { u64 base = adev->mmhub.funcs->get_fb_location(adev); + mc->fb_base = base; /* add the xgmi offset of the physical node */ base += adev->gmc.xgmi.physical_node_id * adev->gmc.xgmi.node_segment_size; --- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c +++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c @@ -45,10 +45,8 @@ static u64 mmhub_v1_8_get_fb_location(struct amdgpu_device *adev) top &= MC_VM_FB_LOCATION_TOP__FB_TOP_MASK; top <<= 24; - if (!amdgpu_virt_xgmi_migrate_enabled(adev)) { - adev->gmc.fb_start = base; - adev->gmc.fb_end = top; - } + adev->gmc.fb_start = base; + adev->gmc.fb_end = top; Regards Sam > >> mmhub_v1_8_get_fb_location() is supposed to be a query api according to >> its name. having such side effect is very surprising. >> >> Another approach is set the right fb_start and fb_end in the new >> amdgpu_virt_resume(), like updating vram_base_offset. > That is probably better. And skip setting fb_start and fb_end in > amdgpu_gmc_sysvm_location() for this use case. > > That was done only because we re-program those registers on bare metal. > > Regards, > Christian. > >> Which approach do you prefer? Or any better suggestions? Thank you. >> >> >> Regards >> Sam >> >> >> >>> Regards, >>> Christian. >>> >>>> >>>> return base; >>>> }