On Tue, Nov 18, 2025 at 9:57 AM Christian König <[email protected]> wrote:
>
> On 11/18/25 15:43, Alex Deucher wrote:
> > On Tue, Nov 18, 2025 at 5:49 AM Christian König
> > <[email protected]> wrote:
> >>
> >> Hi Chong,
> >>
> >> yeah, and exactly that argumentation is not correct.
> >>
> >> We have to guarantee a minimum response time, and that is what the
> >> timeout is all about.
> >>
> >> And it doesn't matter if the available HW time is split between 1, 2, 4 or 8
> >> virtual functions. The minimum response time we need to guarantee is still
> >> the same; it's just that the available HW time is lower.
> >>
> >> So as long as we don't have an explicit customer request which asks for
> >> longer default timeouts, this change is rejected.
> >
> > I think the change makes sense. It needs to be longer to compensate
> > for the world switch latency. 0.5 seconds of runtime is probably too
> > short for many larger workloads.
>
> The calculation that you have a 0.5 second timeout because the HW is shared
> among 4 VFs doesn't make sense.
>
> It's just that each VF has 1/4 of the compute power of the PF, and we
> don't increase the timeout for slower physical HW either.
>
But in this case it doesn't actually reflect the GPU time spent on the job; it
only reflects a percentage of it, so it's not really an apples-to-apples
comparison with non-SR-IOV. That said, from a wall-clock perspective it does
reflect the time spent, even if the job only got a percentage of that time.

Alex

> The timeout is there to guarantee a certain response time, and that comes from
> both customer requirements as well as requirements of the Linux memory
> management.
>
> And at least the memory management doesn't care whether you have a virtual
> function or physical hardware; you have to complete all operations within a
> certain time, or otherwise the system can run into a panic during low-memory
> situations.
>
> It could be that we have customers who don't care about OOM situations
> and want this longer timeout, but that is then at their own risk.
>
> We certainly shouldn't increase the timeout to something unreasonable just
> because some CI system is failing.
>
> Regards,
> Christian.
>
> >
> > Alex
> >
> >>
> >> Regards,
> >> Christian.
> >>
> >> On 11/18/25 11:08, Li, Chong(Alan) wrote:
> >>> [AMD Official Use Only - AMD Internal Distribution Only]
> >>>
> >>> Hi, Christian.
> >>>
> >>> What I mean is:
> >>> in SR-IOV mode, when customers need to limit the timeout value,
> >>> they should set "lockup_timeout" according to the number of VFs they load.
> >>>
> >>> Why:
> >>>
> >>> The real timeout value in SR-IOV for each VF is "lockup_timeout /
> >>> vf_number".
> >>>
> >>> As you said, they want to limit the timeout to 2s:
> >>> when a customer loads 8 VFs, they should set "lockup_timeout" to 16s;
> >>> with 4 VFs they need to set "lockup_timeout" to 8s.
> >>>
> >>> (In our testing, with "lockup_timeout" set to 2s, the 4 VF mode can't
> >>> work, as each VF only gets 0.5s.)
> >>>
> >>> Thanks,
> >>> Chong.
> >>>
> >>> -----Original Message-----
> >>> From: Koenig, Christian <[email protected]>
> >>> Sent: Tuesday, November 18, 2025 5:31 PM
> >>> To: Li, Chong(Alan) <[email protected]>; [email protected]
> >>> Cc: Chen, Horace <[email protected]>
> >>> Subject: Re: [PATCH] drm/amdgpu: in sriov multiple vf mode, 2 seconds
> >>> timeout is not enough for sdma job
> >>>
> >>> Hi Chong,
> >>>
> >>> that is not a valid justification.
> >>>
> >>> What customer requirement is causing this SDMA timeout?
> >>>
> >>> If it is just your CI system, then change the CI system.
> >>>
> >>> As long as you can't come up with a customer requirement, this change is
> >>> rejected.
> >>>
> >>> Regards,
> >>> Christian.
> >>>
> >>> On 11/18/25 10:29, Li, Chong(Alan) wrote:
> >>>> [AMD Official Use Only - AMD Internal Distribution Only]
> >>>>
> >>>> Hi, Christian.
> >>>>
> >>>> In multiple VF mode (in our CI environment the VF number is 4), the
> >>>> timeout value is shared across all VFs.
> >>>> After the timeout value was changed to 2s, each VF only gets 0.5s,
> >>>> causing an SDMA ring timeout and triggering a GPU reset.
> >>>>
> >>>> Thanks,
> >>>> Chong.
> >>>>
> >>>> -----Original Message-----
> >>>> From: Koenig, Christian <[email protected]>
> >>>> Sent: Tuesday, November 18, 2025 4:34 PM
> >>>> To: Li, Chong(Alan) <[email protected]>; [email protected]
> >>>> Subject: Re: [PATCH] drm/amdgpu: in sriov multiple vf mode, 2 seconds
> >>>> timeout is not enough for sdma job
> >>>>
> >>>> Clear NAK to this patch.
> >>>>
> >>>> It is explicitly requested by customers that we only have a 2 second
> >>>> timeout.
> >>>>
> >>>> So you need a very good explanation to have that changed for SR-IOV.
> >>>>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>> On 11/17/25 07:53, chong li wrote:
> >>>>> Signed-off-by: chong li <[email protected]>
> >>>>> ---
> >>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++--
> >>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 4 ++--
> >>>>>  2 files changed, 9 insertions(+), 4 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>> index 69c29f47212d..4ab755eb5ec1 100644
> >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>> @@ -4341,10 +4341,15 @@ static int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
> >>>>>      int index = 0;
> >>>>>      long timeout;
> >>>>>      int ret = 0;
> >>>>> +    long timeout_default;
> >>>>>
> >>>>> -    /* By default timeout for all queues is 2 sec */
> >>>>> +    if (amdgpu_sriov_vf(adev))
> >>>>> +        timeout_default = msecs_to_jiffies(10000);
> >>>>> +    else
> >>>>> +        timeout_default = msecs_to_jiffies(2000);
> >>>>> +    /* By default timeout for all queues is 10 sec in sriov, 2 sec not in sriov */
> >>>>>      adev->gfx_timeout = adev->compute_timeout = adev->sdma_timeout =
> >>>>> -        adev->video_timeout = msecs_to_jiffies(2000);
> >>>>> +        adev->video_timeout = timeout_default;
> >>>>>
> >>>>>      if (!strnlen(input, AMDGPU_MAX_TIMEOUT_PARAM_LENGTH))
> >>>>>          return 0;
> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>>> index f508c1a9fa2c..43bdd6c1bec2 100644
> >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>>> @@ -358,10 +358,10 @@ module_param_named(svm_default_granularity, amdgpu_svm_default_granularity, uint
> >>>>>   * [GFX,Compute,SDMA,Video] to set individual timeouts.
> >>>>>   * Negative values mean infinity.
> >>>>>   *
> >>>>> - * By default(with no lockup_timeout settings), the timeout for all queues is 2000.
> >>>>> + * By default(with no lockup_timeout settings), the timeout for all queues is 10000 in sriov, 2000 not in sriov.
> >>>>>   */
> >>>>>  MODULE_PARM_DESC(lockup_timeout,
> >>>>> -    "GPU lockup timeout in ms (default: 2000. 0: keep default value. negative: infinity timeout), format: [single value for all] or [GFX,Compute,SDMA,Video].");
> >>>>> +    "GPU lockup timeout in ms (default: 10000 in sriov, 2000 not in sriov. 0: keep default value. negative: infinity timeout), format: [single value for all] or [GFX,Compute,SDMA,Video].");
> >>>>>  module_param_string(lockup_timeout, amdgpu_lockup_timeout, sizeof(amdgpu_lockup_timeout), 0444);
> >>>>>
> >>>>
> >>>
> >>
>
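
For reference, a minimal sketch of the workaround Chong describes (raising
lockup_timeout according to the loaded VF count rather than changing the
driver default). The modprobe.d file name is hypothetical; the parameter name,
its [GFX,Compute,SDMA,Video] format and the 8s/16s figures come from the
thread above:

    # /etc/modprobe.d/amdgpu-timeout.conf  (example values only)
    # 4 VFs sharing the engine: 8000 ms wall clock ~= 2000 ms of GPU time per VF
    options amdgpu lockup_timeout=8000
    # 8 VFs would need 16000 ms for the same 2000 ms of GPU time per VF:
    # options amdgpu lockup_timeout=16000
    # Per-queue form ([GFX,Compute,SDMA,Video]):
    # options amdgpu lockup_timeout=8000,8000,8000,8000

The same value can also be passed on the kernel command line as
amdgpu.lockup_timeout=8000.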
