On Tue, Nov 18, 2025 at 5:49 AM Christian König
<[email protected]> wrote:
>
> Hi Chong,
>
> yeah and exactly that argumentation is not correct.
>
> We have to guarantee a minimum response time, and that is what the timeout
> is all about.
>
> And it doesn't matter if the available HW time is split between 1, 2, 4 or 8
> virtual functions. The minimum response time we need to guarantee is still the
> same, it's just that the available HW time is lowered.
>
> So as long as we don't have an explicit customer request which asks for 
> longer default timeouts this change is rejected.

I think the change makes sense.  It needs to be longer to compensate
for the world switch latency.  0.5 seconds of runtime is probably too
short for many larger workloads.
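
As a rough illustration of the arithmetic discussed in this thread, the
sketch below assumes the configured timeout is split evenly among the VFs,
as Chong describes, and ignores world switch overhead. The helper names are
hypothetical and do not exist in amdgpu.

```c
/*
 * Hypothetical helpers, not part of amdgpu. They model the claim in this
 * thread that each VF effectively gets lockup_timeout / vf_number of runtime.
 */

/* Effective runtime per VF for a given configured timeout. */
static long per_vf_runtime_ms(long lockup_timeout_ms, int num_vfs)
{
	if (num_vfs <= 0)
		return lockup_timeout_ms; /* treat as bare metal: nothing is shared */
	return lockup_timeout_ms / num_vfs;
}

/* Inverse: the value to configure so that each VF gets target_ms. */
static long required_lockup_timeout_ms(long target_ms, int num_vfs)
{
	if (num_vfs <= 0)
		return target_ms;
	return target_ms * num_vfs;
}
```

Under that assumption, per_vf_runtime_ms(2000, 4) gives the 0.5 seconds
mentioned above, and required_lockup_timeout_ms(2000, 8) gives the 16s
figure from Chong's mail.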

Alex

>
> Regards,
> Christian.
>
> On 11/18/25 11:08, Li, Chong(Alan) wrote:
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> > Hi, Christian.
> >
> > what I mean is:
> > In SR-IOV mode, when a customer needs to limit the timeout value,
> > they should set "lockup_timeout" according to the number of VFs they load.
> >
> >
> > Why:
> >
> > The real timeout value in SR-IOV for each VF is "lockup_timeout /
> > vf_number".
> >
> > As you said, they want to limit the timeout to 2s, so
> > when a customer loads 8 VFs, they should set "lockup_timeout" to 16s; with
> > 4 VFs they need to set "lockup_timeout" to 8s.
> >
> >
> > (In our testing, when "lockup_timeout" is set to 2s, 4-VF mode does not
> > work because each VF only gets 0.5s.)
> >
> >
> >
> >
> >
> > Thanks,
> > Chong.
> >
> >
> >
> > -----Original Message-----
> > From: Koenig, Christian <[email protected]>
> > Sent: Tuesday, November 18, 2025 5:31 PM
> > To: Li, Chong(Alan) <[email protected]>; [email protected]
> > Cc: Chen, Horace <[email protected]>
> > Subject: Re: [PATCH] drm/amdgpu: in sriov multiple vf mode, 2 seconds 
> > timeout is not enough for sdma job
> >
> > Hi Chong,
> >
> > that is not a valid justification.
> >
> > What customer requirement is causing this SDMA timeout?
> >
> > If it is just your CI system, then change the CI system.
> >
> > As long as you can't come up with a customer requirement this change is 
> > rejected.
> >
> > Regards,
> > Christian.
> >
> > On 11/18/25 10:29, Li, Chong(Alan) wrote:
> >> [AMD Official Use Only - AMD Internal Distribution Only]
> >>
> >> Hi, Christian.
> >>
> >> In multiple-VF mode (in our CI environment the number of VFs is 4), the
> >> timeout value is shared across all VFs.
> >> After the timeout value was changed to 2s, each VF only gets 0.5s, which
> >> causes SDMA ring timeouts and triggers a GPU reset.
> >>
> >>
> >> Thanks,
> >> Chong.
> >>
> >> -----Original Message-----
> >> From: Koenig, Christian <[email protected]>
> >> Sent: Tuesday, November 18, 2025 4:34 PM
> >> To: Li, Chong(Alan) <[email protected]>; [email protected]
> >> Subject: Re: [PATCH] drm/amdgpu: in sriov multiple vf mode, 2 seconds 
> >> timeout is not enough for sdma job
> >>
> >> Clear NAK to this patch.
> >>
> >> It is explicitly requested by customers that we only have a 2 second
> >> timeout.
> >>
> >> So you need a very good explanation to have that changed for SRIOV.
> >>
> >> Regards,
> >> Christian.
> >>
> >> On 11/17/25 07:53, chong li wrote:
> >>> Signed-off-by: chong li <[email protected]>
> >>> ---
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++--
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 4 ++--
> >>>  2 files changed, 9 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> index 69c29f47212d..4ab755eb5ec1 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> @@ -4341,10 +4341,15 @@ static int 
> >>> amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
> >>>       int index = 0;
> >>>       long timeout;
> >>>       int ret = 0;
> >>> +     long timeout_default;
> >>>
> >>> -     /* By default timeout for all queues is 2 sec */
> >>> +     if (amdgpu_sriov_vf(adev))
> >>> +             timeout_default = msecs_to_jiffies(10000);
> >>> +     else
> >>> +             timeout_default = msecs_to_jiffies(2000);
> >>> +     /* By default timeout for all queues is 10 sec in sriov, 2 sec not 
> >>> in sriov*/
> >>>       adev->gfx_timeout = adev->compute_timeout = adev->sdma_timeout =
> >>> -             adev->video_timeout = msecs_to_jiffies(2000);
> >>> +             adev->video_timeout = timeout_default;
> >>>
> >>>       if (!strnlen(input, AMDGPU_MAX_TIMEOUT_PARAM_LENGTH))
> >>>               return 0;
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>> index f508c1a9fa2c..43bdd6c1bec2 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>> @@ -358,10 +358,10 @@ module_param_named(svm_default_granularity, 
> >>> amdgpu_svm_default_granularity, uint
> >>>   * [GFX,Compute,SDMA,Video] to set individual timeouts.
> >>>   * Negative values mean infinity.
> >>>   *
> >>> - * By default(with no lockup_timeout settings), the timeout for all 
> >>> queues is 2000.
> >>> + * By default(with no lockup_timeout settings), the timeout for all 
> >>> queues is 10000 in sriov, 2000 not in sriov.
> >>>   */
> >>>  MODULE_PARM_DESC(lockup_timeout,
> >>> -              "GPU lockup timeout in ms (default: 2000. 0: keep default 
> >>> value. negative: infinity timeout), format: [single value for all] or 
> >>> [GFX,Compute,SDMA,Video].");
> >>> +              "GPU lockup timeout in ms (default: 10000 in sriov, 2000 
> >>> not in sriov. 0: keep default value. negative: infinity timeout), format: 
> >>> [single value for all] or [GFX,Compute,SDMA,Video].");
> >>>  module_param_string(lockup_timeout, amdgpu_lockup_timeout,
> >>>                   sizeof(amdgpu_lockup_timeout), 0444);
> >>>
> >>
> >
>
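
For reference, the lockup_timeout format described in the patch's
MODULE_PARM_DESC (a single value for all queues, or four comma-separated
values for GFX, Compute, SDMA and Video; 0 keeps the default; negative means
no timeout) can be sketched in userspace C as follows. This is not the
driver's code: parse_lockup_timeout, struct timeouts_ms and TIMEOUT_INFINITE
are hypothetical names, and strtol stands in for whatever the kernel uses.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define TIMEOUT_INFINITE LONG_MAX /* hypothetical stand-in for "no timeout" */
#define NUM_QUEUES 4              /* GFX, Compute, SDMA, Video */

struct timeouts_ms {
	long q[NUM_QUEUES];
};

/* Returns 0 on success, -1 on malformed input. Values are in ms. */
static int parse_lockup_timeout(const char *input, long default_ms,
				struct timeouts_ms *out)
{
	char buf[64];
	char *cursor, *end;
	int i, count = 0;

	/* Start from the default for every queue. */
	for (i = 0; i < NUM_QUEUES; i++)
		out->q[i] = default_ms;

	/* An empty parameter leaves all defaults in place. */
	if (input == NULL || input[0] == '\0')
		return 0;
	if (strlen(input) >= sizeof(buf))
		return -1;
	strcpy(buf, input);

	cursor = buf;
	while (count < NUM_QUEUES) {
		long v = strtol(cursor, &end, 10);

		if (end == cursor)
			return -1; /* empty or non-numeric field */
		if (v < 0)
			out->q[count] = TIMEOUT_INFINITE;
		else if (v > 0)
			out->q[count] = v;
		/* v == 0: keep the default for this queue */
		count++;

		if (*end == '\0')
			break;
		if (*end != ',')
			return -1;
		cursor = end + 1;
	}
	if (*end != '\0')
		return -1; /* more than NUM_QUEUES fields */

	/* A single value applies to all queues. */
	if (count == 1)
		for (i = 1; i < NUM_QUEUES; i++)
			out->q[i] = out->q[0];
	return 0;
}
```

With this sketch, "8000" sets all four queues to 8000 ms, while
"0,-1,16000,0" keeps the default for GFX and Video, disables the timeout for
Compute and sets SDMA to 16000 ms. The default_ms argument is where a 2000
versus 10000 choice, as proposed in the patch, would be passed in.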
