On 11/18/25 15:43, Alex Deucher wrote: > On Tue, Nov 18, 2025 at 5:49 AM Christian König > <[email protected]> wrote: >> >> Hi Chong, >> >> yeah and exactly that argumentation is not correct. >> >> We have to guarantee a min minimum response time and that is what the >> timeout is all about. >> >> And it doesn't matter if the available HW time is split between 1,2,4 or 8 >> virtual functions. The minimum reponse time we need to guarantee is still >> the same, it's just that the available HW time is lowered. >> >> So as long as we don't have an explicit customer request which asks for >> longer default timeouts this change is rejected. > > I think the change makes sense. It needs to be longer to compensate > for the world switch latency. 0.5 seconds of runtime is probably too > short for many larger workloads.
The calculation that you have a 0.5 second timeout because the HW is shared among 4 VF doesn't make sense. It's just that each VF has 1/4 of the calculation power of the PF, but we also don't increase the timeout for slower physical HW either. The timeout is there to guarantee a certain response time and that comes from both customer requirements as well as requirements of the Linux memory management. And at least the memory management doesn't care if you have a virtual function or physical hardware, you have to complete all operations in a certain time or otherwise the system can run into a panic during low memory situations. What could be is that we have customers who don't care about OOM situations and want this longer timeout, but that is then on their own risk. We certainly shouldn't increase the timeout to something unreasonable just because some CI system is failing. Regards, Christian. > > Alex > >> >> Regards, >> Christian. >> >> On 11/18/25 11:08, Li, Chong(Alan) wrote: >>> [AMD Official Use Only - AMD Internal Distribution Only] >>> >>> Hi, Christian. >>> >>> what I mean is: >>> in sriov mode, when customer need limit timeout value , >>> they should set the "lockup_timeout" according to the vf number they load. >>> >>> >>> Why: >>> >>> The real timeout value in sriov for each vf is " lockup_timeout / >>> vf_number", >>> >>> As you said they want to limit the timeout to 2s, >>> when customer load 8vf, they should set the "lockup_timeout" to 16s, 4vf >>> need set "lockup_timeout" to 8s. >>> >>> >>> (After we test, when value "lockup_timeout" set to 2s, the 4vf mode can't >>> work as each vf only get 0.5s) >>> >>> >>> >>> >>> >>> Thanks, >>> Chong. >>> >>> >>> >>> -----Original Message----- >>> From: Koenig, Christian <[email protected]> >>> Sent: Tuesday, November 18, 2025 5:31 PM >>> To: Li, Chong(Alan) <[email protected]>; [email protected] >>> Cc: Chen, Horace <[email protected]> >>> Subject: Re: [PATCH] drm/amdgpu: in sriov multiple vf mode, 2 seconds >>> timeout is not enough for sdma job >>> >>> Hi Chong, >>> >>> that is not a valid justification. >>> >>> What customer requirement is causing this SDMA timeout? >>> >>> When it is just your CI system then change the CI system. >>> >>> As long as you can't come up with a customer requirement this change is >>> rejected. >>> >>> Regards, >>> Christian. >>> >>> On 11/18/25 10:29, Li, Chong(Alan) wrote: >>>> [AMD Official Use Only - AMD Internal Distribution Only] >>>> >>>> Hi, Christian. >>>> >>>> In multiple vf mode( in our CI environment the vf number is 4), the >>>> timeout value is shared across all vfs. >>>> After timeout value change to 2s, each vf only get 0.5s, cause sdma ring >>>> timeout and trigger gpu reset. >>>> >>>> >>>> Thanks, >>>> Chong. >>>> >>>> -----Original Message----- >>>> From: Koenig, Christian <[email protected]> >>>> Sent: Tuesday, November 18, 2025 4:34 PM >>>> To: Li, Chong(Alan) <[email protected]>; [email protected] >>>> Subject: Re: [PATCH] drm/amdgpu: in sriov multiple vf mode, 2 seconds >>>> timeout is not enough for sdma job >>>> >>>> Clear NAK to this patch. >>>> >>>> It is explicitely requested by customers that we only have a 2 second >>>> timeout. >>>> >>>> So you need a very good explanation to have that changed for SRIOV. >>>> >>>> Regards, >>>> Christian. >>>> >>>> On 11/17/25 07:53, chong li wrote: >>>>> Signed-off-by: chong li <[email protected]> >>>>> --- >>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++-- >>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 4 ++-- >>>>> 2 files changed, 9 insertions(+), 4 deletions(-) >>>>> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>> index 69c29f47212d..4ab755eb5ec1 100644 >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>> @@ -4341,10 +4341,15 @@ static int >>>>> amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev) >>>>> int index = 0; >>>>> long timeout; >>>>> int ret = 0; >>>>> + long timeout_default; >>>>> >>>>> - /* By default timeout for all queues is 2 sec */ >>>>> + if (amdgpu_sriov_vf(adev)) >>>>> + timeout_default = msecs_to_jiffies(10000); >>>>> + else >>>>> + timeout_default = msecs_to_jiffies(2000); >>>>> + /* By default timeout for all queues is 10 sec in sriov, 2 sec not >>>>> in sriov*/ >>>>> adev->gfx_timeout = adev->compute_timeout = adev->sdma_timeout = >>>>> - adev->video_timeout = msecs_to_jiffies(2000); >>>>> + adev->video_timeout = timeout_default; >>>>> >>>>> if (!strnlen(input, AMDGPU_MAX_TIMEOUT_PARAM_LENGTH)) >>>>> return 0; >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>>>> index f508c1a9fa2c..43bdd6c1bec2 100644 >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>>>> @@ -358,10 +358,10 @@ module_param_named(svm_default_granularity, >>>>> amdgpu_svm_default_granularity, uint >>>>> * [GFX,Compute,SDMA,Video] to set individual timeouts. >>>>> * Negative values mean infinity. >>>>> * >>>>> - * By default(with no lockup_timeout settings), the timeout for all >>>>> queues is 2000. >>>>> + * By default(with no lockup_timeout settings), the timeout for all >>>>> queues is 10000 in sriov, 2000 not in sriov. >>>>> */ >>>>> MODULE_PARM_DESC(lockup_timeout, >>>>> - "GPU lockup timeout in ms (default: 2000. 0: keep default >>>>> value. negative: infinity timeout), format: [single value for all] or >>>>> [GFX,Compute,SDMA,Video]."); >>>>> + "GPU lockup timeout in ms (default: 10000 in sriov, 2000 >>>>> not in sriov. 0: keep default value. negative: infinity timeout), format: >>>>> [single value for all] or [GFX,Compute,SDMA,Video]."); >>>>> module_param_string(lockup_timeout, amdgpu_lockup_timeout, >>>>> sizeof(amdgpu_lockup_timeout), 0444); >>>>> >>>> >>> >>
