Re: AW: [PATCH] drm/amdgpu: Enable runtime modification of gpu_recovery parameter with validation

2025-01-06 Thread Shuai Xue
occurs, the AMD GPU driver automatically resets all GPUs and all jobs are terminated. My ultimate goal is to provide error isolation between independent jobs that use different GPUs. Any suggestions? Thank you. Best Regards, From: Shuai Xue Sent: Monday, 30

Re: [PATCH] drm/amdgpu: Enable runtime modification of gpu_recovery parameter with validation

2024-12-30 Thread Shuai Xue
On 2024/12/30 04:11, Christian König wrote: On 28.12.24 at 07:32, Shuai Xue wrote: It's observed that most GPU jobs utilize less than one server, typically with each GPU being used by an independent job. If a job consumes poisoned data, a SIGBUS signal is sent to terminate it. Mean

[PATCH] drm/amdgpu: Enable runtime modification of gpu_recovery parameter with validation

2024-12-29 Thread Shuai Xue
ments or conditions.
Signed-off-by: Shuai Xue
---
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 26 -
1 file changed, 25 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 38686203bea6..03dd902