occurs, the AMD GPU driver automatically resets all GPUs and all jobs
are terminated. My ultimate goal is to provide error isolation between
independent jobs that each use a different GPU. Any suggestions?
Thank you.
Best Regards,
From: Shuai Xue
Sent: Monday, 30
On 2024/12/30 04:11, Christian König wrote:
On 28.12.24 at 07:32, Shuai Xue wrote:
It's observed that most GPU jobs utilize less than one server, typically
with each GPU being used by an independent job. If a job consumes poisoned
data, a SIGBUS signal is sent to terminate it. Mean
ments or conditions.
Signed-off-by: Shuai Xue
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 26 +++++++++++++++++++++++++-
1 file changed, 25 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 38686203bea6..03dd902