On Tue, Nov 18, 2025 at 9:57 AM Christian König <[email protected]> wrote:
>
> On 11/18/25 15:43, Alex Deucher wrote:
> > On Tue, Nov 18, 2025 at 5:49 AM Christian König
> > <[email protected]> wrote:
> >>
> >> Hi Chong,
> >>
> >> yeah, and exactly that argumentation is not correct.
> >>
> >> We have to guarantee a minimum response time, and that is what the
> >> timeout is all about.
> >>
> >> And it doesn't matter if the available HW time is split between 1, 2, 4 or 8
> >> virtual functions. The minimum response time we need to guarantee is still
> >> the same; it's just that the available HW time is lower.
> >>
> >> So as long as we don't have an explicit customer request which asks for
> >> longer default timeouts, this change is rejected.
> >
> > I think the change makes sense. It needs to be longer to compensate
> > for the world switch latency. 0.5 seconds of runtime is probably too
> > short for many larger workloads.
>
> The calculation that you have a 0.5 second timeout because the HW is shared
> among 4 VFs doesn't make sense.
>
> It's just that each VF has 1/4 of the compute power of the PF, and we
> don't increase the timeout for slower physical HW either.
>
But in this case it doesn't actually reflect the GPU time spent on the job; it
only reflects a percentage of it, so it's not really an apples-to-apples
comparison with non-SR-IOV. That said, from a wall-clock perspective it does
reflect the time spent, even if the job only got a percentage of that time.

Alex

> The timeout is there to guarantee a certain response time, and that comes from
> both customer requirements as well as requirements of the Linux memory
> management.
>
> And at least the memory management doesn't care whether you have a virtual
> function or physical hardware; you have to complete all operations within a
> certain time, or otherwise the system can run into a panic during low-memory
> situations.
>
> It could be that we have customers who don't care about OOM situations
> and want this longer timeout, but that is then at their own risk.
>
> We certainly shouldn't increase the timeout to something unreasonable just
> because some CI system is failing.
>
> Regards,
> Christian.
>
> >
> > Alex
> >
> >>
> >> Regards,
> >> Christian.
> >>
> >> On 11/18/25 11:08, Li, Chong(Alan) wrote:
> >>> [AMD Official Use Only - AMD Internal Distribution Only]
> >>>
> >>> Hi, Christian.
> >>>
> >>> What I mean is:
> >>> in SR-IOV mode, when customers need to limit the timeout value,
> >>> they should set "lockup_timeout" according to the number of VFs they load.
> >>>
> >>> Why:
> >>>
> >>> The real timeout value in SR-IOV for each VF is "lockup_timeout /
> >>> vf_number".
> >>>
> >>> As you said, they want to limit the timeout to 2s:
> >>> when a customer loads 8 VFs, they should set "lockup_timeout" to 16s;
> >>> with 4 VFs they need to set "lockup_timeout" to 8s.
> >>>
> >>> (In our testing, with "lockup_timeout" set to 2s, the 4 VF mode can't
> >>> work, as each VF only gets 0.5s.)
> >>>
> >>> Thanks,
> >>> Chong.
> >>>
> >>> -----Original Message-----
> >>> From: Koenig, Christian <[email protected]>
> >>> Sent: Tuesday, November 18, 2025 5:31 PM
> >>> To: Li, Chong(Alan) <[email protected]>; [email protected]
> >>> Cc: Chen, Horace <[email protected]>
> >>> Subject: Re: [PATCH] drm/amdgpu: in sriov multiple vf mode, 2 seconds
> >>> timeout is not enough for sdma job
> >>>
> >>> Hi Chong,
> >>>
> >>> that is not a valid justification.
> >>>
> >>> What customer requirement is causing this SDMA timeout?
> >>>
> >>> If it is just your CI system, then change the CI system.
> >>>
> >>> As long as you can't come up with a customer requirement, this change is
> >>> rejected.
> >>>
> >>> Regards,
> >>> Christian.
> >>>
> >>> On 11/18/25 10:29, Li, Chong(Alan) wrote:
> >>>> [AMD Official Use Only - AMD Internal Distribution Only]
> >>>>
> >>>> Hi, Christian.
> >>>>
> >>>> In multiple VF mode (in our CI environment the VF number is 4), the
> >>>> timeout value is shared across all VFs.
> >>>> After the timeout value was changed to 2s, each VF only gets 0.5s,
> >>>> causing an SDMA ring timeout and triggering a GPU reset.
> >>>>
> >>>> Thanks,
> >>>> Chong.
> >>>>
> >>>> -----Original Message-----
> >>>> From: Koenig, Christian <[email protected]>
> >>>> Sent: Tuesday, November 18, 2025 4:34 PM
> >>>> To: Li, Chong(Alan) <[email protected]>; [email protected]
> >>>> Subject: Re: [PATCH] drm/amdgpu: in sriov multiple vf mode, 2 seconds
> >>>> timeout is not enough for sdma job
> >>>>
> >>>> Clear NAK to this patch.
> >>>>
> >>>> It is explicitly requested by customers that we only have a 2 second
> >>>> timeout.
> >>>>
> >>>> So you need a very good explanation to have that changed for SR-IOV.
> >>>>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>> On 11/17/25 07:53, chong li wrote:
> >>>>> Signed-off-by: chong li <[email protected]>
> >>>>> ---
> >>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++--
> >>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 4 ++--
> >>>>>  2 files changed, 9 insertions(+), 4 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>> index 69c29f47212d..4ab755eb5ec1 100644
> >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>> @@ -4341,10 +4341,15 @@ static int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
> >>>>>      int index = 0;
> >>>>>      long timeout;
> >>>>>      int ret = 0;
> >>>>> +    long timeout_default;
> >>>>>
> >>>>> -    /* By default timeout for all queues is 2 sec */
> >>>>> +    if (amdgpu_sriov_vf(adev))
> >>>>> +        timeout_default = msecs_to_jiffies(10000);
> >>>>> +    else
> >>>>> +        timeout_default = msecs_to_jiffies(2000);
> >>>>> +    /* By default timeout for all queues is 10 sec in sriov, 2 sec not in sriov */
> >>>>>      adev->gfx_timeout = adev->compute_timeout = adev->sdma_timeout =
> >>>>> -        adev->video_timeout = msecs_to_jiffies(2000);
> >>>>> +        adev->video_timeout = timeout_default;
> >>>>>
> >>>>>      if (!strnlen(input, AMDGPU_MAX_TIMEOUT_PARAM_LENGTH))
> >>>>>          return 0;
> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>>> index f508c1a9fa2c..43bdd6c1bec2 100644
> >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>>> @@ -358,10 +358,10 @@ module_param_named(svm_default_granularity, amdgpu_svm_default_granularity, uint
> >>>>>   * [GFX,Compute,SDMA,Video] to set individual timeouts.
> >>>>>   * Negative values mean infinity.
> >>>>>   *
> >>>>> - * By default(with no lockup_timeout settings), the timeout for all queues is 2000.
> >>>>> + * By default(with no lockup_timeout settings), the timeout for all queues is 10000 in sriov, 2000 not in sriov.
> >>>>>   */
> >>>>>  MODULE_PARM_DESC(lockup_timeout,
> >>>>> -    "GPU lockup timeout in ms (default: 2000. 0: keep default value. negative: infinity timeout), format: [single value for all] or [GFX,Compute,SDMA,Video].");
> >>>>> +    "GPU lockup timeout in ms (default: 10000 in sriov, 2000 not in sriov. 0: keep default value. negative: infinity timeout), format: [single value for all] or [GFX,Compute,SDMA,Video].");
> >>>>>  module_param_string(lockup_timeout, amdgpu_lockup_timeout, sizeof(amdgpu_lockup_timeout), 0444);
> >>>>>
> >>>>
> >>>
> >>
>
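
For reference, a minimal sketch of the workaround Chong describes (raising
lockup_timeout according to the loaded VF count rather than changing the
driver default). The modprobe.d file name is hypothetical; the parameter name,
its [GFX,Compute,SDMA,Video] format and the 8s/16s figures come from the
thread above:

    # /etc/modprobe.d/amdgpu-timeout.conf  (example values only)
    # 4 VFs sharing the engine: 8000 ms wall clock ~= 2000 ms of GPU time per VF
    options amdgpu lockup_timeout=8000
    # 8 VFs would need 16000 ms for the same 2000 ms of GPU time per VF:
    # options amdgpu lockup_timeout=16000
    # Per-queue form ([GFX,Compute,SDMA,Video]):
    # options amdgpu lockup_timeout=8000,8000,8000,8000

The same value can also be passed on the kernel command line as
amdgpu.lockup_timeout=8000.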
