Hang on a second. If there are production GPUs that only work with HSA_OVERRIDE_GFX_VERSION right now, then we should make those GPUs properly supported. I thought this was only used internally for bring-up or maybe externally as a short-term solution before we upstream proper support for new GPUs.
Regards, Felix On 2024-08-11 22:10, Zhang, Yifan wrote: > [AMD Official Use Only - AMD Internal Distribution Only] > > I agree that adding exp_hw_support is a safer approach. My concern is that > HSA_OVERRIDE_GFX_VERSION has been used for a while and has become a status > quo for running ROCm on unsupported APUs. I'm not sure if this approach will > be a burden for APU end users. Adding driver load parameters is more > complicated than simply adding an environment variable on consumer PCs. > > Best Regards, > Yifan > > -----Original Message----- > From: Kuehling, Felix <felix.kuehl...@amd.com> > Sent: Saturday, August 10, 2024 7:37 AM > To: Zhang, Yifan <yifan1.zh...@amd.com>; Kasiviswanathan, Harish > <harish.kasiviswanat...@amd.com>; amd-gfx@lists.freedesktop.org > Cc: Yang, Philip <philip.y...@amd.com> > Subject: Re: [PATCH] drm/amdkfd: keep create queue success if cwsr save area > doesn't match > > Maybe we can turn this check into a warnings if, and only if the > exp_hw_support module param is set. That way we don't water down the checks > on the production code path but allow experimental setups to run without a > seat belt. > > Regards, > Felix > > > On 2024-08-09 01:39, Zhang, Yifan wrote: >> [AMD Official Use Only - AMD Internal Distribution Only] >> >> Yes, I think we need that change for a normal code path, but this case is >> introduced only with the HSA_OVERRIDE_GFX_VERSION environment setting, which >> implies that "the override ASIC is compatible with the real ASIC." It is >> intended for experimental purposes. When a user is using >> HSA_OVERRIDE_GFX_VERSION, they should be aware of the potential risks it may >> bring. Usually, HSA_OVERRIDE_GFX_VERSION is used to force an unsupported APU >> to be recognized as a ROCm-supported high-end dGPU, which has a large cwsr >> save area, making the operation safe. This check was added to KFD two weeks >> ago, the HSA_OVERRIDE_GFX_VERSION environment had been working fine before >> that. >> >> Best Regards, >> Yifan >> >> -----Original Message----- >> From: Kasiviswanathan, Harish <harish.kasiviswanat...@amd.com> >> Sent: Thursday, August 8, 2024 10:46 PM >> To: Zhang, Yifan <yifan1.zh...@amd.com>; amd-gfx@lists.freedesktop.org >> Cc: Kuehling, Felix <felix.kuehl...@amd.com>; Yang, Philip >> <philip.y...@amd.com>; Zhang, Yifan <yifan1.zh...@amd.com> >> Subject: RE: [PATCH] drm/amdkfd: keep create queue success if cwsr >> save area doesn't match >> >> [AMD Official Use Only - AMD Internal Distribution Only] >> >> In this case, shouldn't larger of two sizes be used. Also, we should have an >> upper bound check. >> >> -----Original Message----- >> From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of >> Yifan Zhang >> Sent: Thursday, August 8, 2024 4:44 AM >> To: amd-gfx@lists.freedesktop.org >> Cc: Kuehling, Felix <felix.kuehl...@amd.com>; Yang, Philip >> <philip.y...@amd.com>; Zhang, Yifan <yifan1.zh...@amd.com> >> Subject: [PATCH] drm/amdkfd: keep create queue success if cwsr save >> area doesn't match >> >> If HSA_OVERRIDE_GFX_VERSION is used in ROCm workload, user space and kernel >> use different spec to calculate cwsr save area, current check may fail >> create queue ioctl. Change error to warn to make create queue succeed in >> that case. >> >> Signed-off-by: Yifan Zhang <yifan1.zh...@amd.com> >> --- >> drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 4 +--- >> 1 file changed, 1 insertion(+), 3 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c >> b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c >> index e0a073ae4a49..9f283aff057a 100644 >> --- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c >> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c >> @@ -295,11 +295,9 @@ int kfd_queue_acquire_buffers(struct kfd_process_device >> *pdd, struct queue_prope >> } >> >> if (properties->ctx_save_restore_area_size != >> topo_dev->node_props.cwsr_size) { >> - pr_debug("queue cwsr size 0x%x not equal to node cwsr size >> 0x%x\n", >> + pr_warn("queue cwsr size 0x%x not equal to node cwsr >> + size 0x%x\n", >> properties->ctx_save_restore_area_size, >> topo_dev->node_props.cwsr_size); >> - err = -EINVAL; >> - goto out_err_unreserve; >> } >> >> total_cwsr_size = (topo_dev->node_props.cwsr_size + >> topo_dev->node_props.debug_memory_size) >> -- >> 2.37.3 >> >>