[AMD Official Use Only - AMD Internal Distribution Only]

Hi, Christian.

"It is explicitly requested by customers that we only have a 2 second
timeout."


In which scenarios are these customers using the GPU: SR-IOV or non-SR-IOV?

After changing the default timeout value to 2 sec:

        bare-metal or pass-through mode:
                these customers have a 2 sec timeout limit, so this is OK.

        SR-IOV mode:
                The AMD GPU can only work in 1 VF mode (2 sec timeout per VF)
                and 2 VF mode (1 sec timeout per VF).
                Many customers use 12 VF mode (NV32 VDI) and 8 VF mode
                (Alibaba MI308); after they update the driver with your
                patch, they will hit repeated GPU resets.
                We should fix this issue before customers complain.
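
To put numbers on this, here is a small sketch of the per-VF budget. It
assumes the timeout is split evenly across VFs, i.e. each VF effectively
gets lockup_timeout / vf_num; the helper name is made up for illustration
and is not part of amdgpu or gim.

```python
# Hypothetical helper for illustration only; not part of amdgpu or gim.
DEFAULT_LOCKUP_TIMEOUT_MS = 2000  # the 2 sec default under discussion


def effective_timeout_ms(vf_num, lockup_timeout_ms=DEFAULT_LOCKUP_TIMEOUT_MS):
    """Per-VF timeout in ms, assuming an even split across all VFs."""
    if vf_num < 1:
        raise ValueError("vf_num must be at least 1")
    return lockup_timeout_ms / vf_num


if __name__ == "__main__":
    # VF counts mentioned in this thread: 1, 2, 4 (CI), 8 and 12.
    for vf_num in (1, 2, 4, 8, 12):
        print(f"{vf_num:2d} VF: {effective_timeout_ms(vf_num):7.1f} ms per VF")
```

With the 2 sec default, 4 VFs leave 500 ms per VF, which is the case that
already fails in our tests, and 8 or 12 VFs leave even less.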

Solution:
If the customers you are in contact with need to use the GPU in SR-IOV mode,
please guide them to configure the timeout value correctly:

Get the VF number used when loading the gim driver and calculate
"timeout_value" as 2 * vf_num seconds. Since lockup_timeout is given in ms,
that is 2000 * vf_num. Then load the driver with
"modprobe amdgpu lockup_timeout=timeout_value".
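
A minimal sketch of that calculation, assuming the timeout budget is shared
evenly across VFs and that lockup_timeout takes milliseconds (as stated in
MODULE_PARM_DESC). The function names are hypothetical and only build the
command string; they do not run modprobe.

```python
# Hypothetical helpers for illustration only; not part of amdgpu or gim.
PER_VF_TIMEOUT_MS = 2000  # the 2 second limit the customers asked for


def lockup_timeout_ms(vf_num, per_vf_timeout_ms=PER_VF_TIMEOUT_MS):
    """lockup_timeout (ms) such that each VF still gets per_vf_timeout_ms.

    Assumes the timeout is shared evenly across all VFs.
    """
    if vf_num < 1:
        raise ValueError("vf_num must be at least 1")
    return per_vf_timeout_ms * vf_num


def modprobe_command(vf_num):
    """Build the modprobe line for the given VF count."""
    return f"modprobe amdgpu lockup_timeout={lockup_timeout_ms(vf_num)}"


if __name__ == "__main__":
    for vf_num in (1, 2, 4, 8, 12):
        print(modprobe_command(vf_num))
```

For example, 8 VFs give lockup_timeout=16000 and 4 VFs give
lockup_timeout=8000, matching the 16s and 8s values from my earlier mail.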



Thanks,
Chong.



-----Original Message-----
From: Koenig, Christian <[email protected]>
Sent: Tuesday, November 18, 2025 6:41 PM
To: Li, Chong(Alan) <[email protected]>; [email protected]
Cc: Chen, Horace <[email protected]>
Subject: Re: [PATCH] drm/amdgpu: in sriov multiple vf mode, 2 seconds timeout 
is not enough for sdma job

Hi Chong,

yeah, and exactly that argument is not correct.

We have to guarantee a minimum response time and that is what the timeout
is all about.

And it doesn't matter if the available HW time is split between 1, 2, 4 or 8
virtual functions. The minimum response time we need to guarantee is still the
same, it's just that the available HW time is lowered.

So as long as we don't have an explicit customer request which asks for longer 
default timeouts this change is rejected.

Regards,
Christian.

On 11/18/25 11:08, Li, Chong(Alan) wrote:
> Hi, Christian.
>
> What I mean is:
> in SR-IOV mode, when customers need to limit the timeout value, they should
> set "lockup_timeout" according to the number of VFs they load.
>
>
> Why:
>
> The real timeout value in SR-IOV for each VF is "lockup_timeout /
> vf_number".
>
> As you said, they want to limit the timeout to 2s, so when customers load
> 8 VFs they should set "lockup_timeout" to 16s, and with 4 VFs they need to
> set "lockup_timeout" to 8s.
>
>
> (In our tests, with "lockup_timeout" set to 2s, 4 VF mode does not
> work because each VF only gets 0.5s.)
>
>
>
>
>
> Thanks,
> Chong.
>
>
>
> -----Original Message-----
> From: Koenig, Christian <[email protected]>
> Sent: Tuesday, November 18, 2025 5:31 PM
> To: Li, Chong(Alan) <[email protected]>; [email protected]
> Cc: Chen, Horace <[email protected]>
> Subject: Re: [PATCH] drm/amdgpu: in sriov multiple vf mode, 2 seconds
> timeout is not enough for sdma job
>
> Hi Chong,
>
> that is not a valid justification.
>
> What customer requirement is causing this SDMA timeout?
>
> If it is just your CI system, then change the CI system.
>
> As long as you can't come up with a customer requirement this change is 
> rejected.
>
> Regards,
> Christian.
>
> On 11/18/25 10:29, Li, Chong(Alan) wrote:
>> Hi, Christian.
>>
>> In multi-VF mode (in our CI environment the VF number is 4), the timeout
>> value is shared across all VFs.
>> After the timeout value changed to 2s, each VF only gets 0.5s, which causes
>> SDMA ring timeouts and triggers a GPU reset.
>>
>>
>> Thanks,
>> Chong.
>>
>> -----Original Message-----
>> From: Koenig, Christian <[email protected]>
>> Sent: Tuesday, November 18, 2025 4:34 PM
>> To: Li, Chong(Alan) <[email protected]>; [email protected]
>> Subject: Re: [PATCH] drm/amdgpu: in sriov multiple vf mode, 2 seconds
>> timeout is not enough for sdma job
>>
>> Clear NAK to this patch.
>>
>> It is explicitly requested by customers that we only have a 2 second
>> timeout.
>>
>> So you need a very good explanation to have that changed for SRIOV.
>>
>> Regards,
>> Christian.
>>
>> On 11/17/25 07:53, chong li wrote:
>>> Signed-off-by: chong li <[email protected]>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++--
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 4 ++--
>>>  2 files changed, 9 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 69c29f47212d..4ab755eb5ec1 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -4341,10 +4341,15 @@ static int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
>>>       int index = 0;
>>>       long timeout;
>>>       int ret = 0;
>>> +     long timeout_default;
>>>
>>> -     /* By default timeout for all queues is 2 sec */
>>> +     if (amdgpu_sriov_vf(adev))
>>> +             timeout_default = msecs_to_jiffies(10000);
>>> +     else
>>> +             timeout_default = msecs_to_jiffies(2000);
>>> +     /* By default timeout for all queues is 10 sec in sriov, 2 sec not in sriov*/
>>>       adev->gfx_timeout = adev->compute_timeout = adev->sdma_timeout =
>>> -             adev->video_timeout = msecs_to_jiffies(2000);
>>> +             adev->video_timeout = timeout_default;
>>>
>>>       if (!strnlen(input, AMDGPU_MAX_TIMEOUT_PARAM_LENGTH))
>>>               return 0;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> index f508c1a9fa2c..43bdd6c1bec2 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> @@ -358,10 +358,10 @@ module_param_named(svm_default_granularity, amdgpu_svm_default_granularity, uint
>>>   * [GFX,Compute,SDMA,Video] to set individual timeouts.
>>>   * Negative values mean infinity.
>>>   *
>>> - * By default(with no lockup_timeout settings), the timeout for all queues is 2000.
>>> + * By default(with no lockup_timeout settings), the timeout for all queues is 10000 in sriov, 2000 not in sriov.
>>>   */
>>>  MODULE_PARM_DESC(lockup_timeout,
>>> -              "GPU lockup timeout in ms (default: 2000. 0: keep default value. negative: infinity timeout), format: [single value for all] or [GFX,Compute,SDMA,Video].");
>>> +              "GPU lockup timeout in ms (default: 10000 in sriov, 2000 not in sriov. 0: keep default value. negative: infinity timeout), format: [single value for all] or [GFX,Compute,SDMA,Video].");
>>>  module_param_string(lockup_timeout, amdgpu_lockup_timeout,
>>>                   sizeof(amdgpu_lockup_timeout), 0444);
>>>
>>
>
