[PATCH 0/2] Improve the dev coredump

2024-08-19 Thread Trigger.Huang
From: Trigger Huang The current dev coredump implementation sometimes cannot fully satisfy customers' requirements because: 1. The dev coredump is called in the GPU reset function, so if GPU reset is disabled, the dev coredump is also disabled. 2. When a job timeout happens, the dump GPU status will be ha

[PATCH 2/2] drm/amdgpu: Do core dump immediately when job tmo

2024-08-19 Thread Trigger.Huang
From: Trigger Huang Do the coredump immediately after a job timeout to get a closer representation of the GPU's error status. V2: Skip printing vram_lost, as the GPU reset has not happened yet (Alex). V3: Unconditionally call the core dump, as we care about all the reset functions (soft-recove

[PATCH 1/2] drm/amdgpu: skip printing vram_lost if needed

2024-08-19 Thread Trigger.Huang
From: Trigger Huang The vram_lost status can only be obtained after a GPU reset occurs, but sometimes a dev core dump can happen before the GPU reset. So a new argument is added to tell the dev core dump implementation whether to skip printing the vram_lost status in the dump. And this patch is al
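
The idea of the extra argument can be illustrated with a minimal user-space sketch. It is not the driver's code: the struct, the function name `format_coredump`, and the output strings are hypothetical, chosen only to show how a "skip" flag suppresses the vram_lost line when the dump is taken before any reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the dump state; not the driver's real struct. */
struct coredump_info {
    bool skip_vram_check; /* true when the dump is taken before a GPU reset */
    bool vram_lost;       /* only meaningful once a reset has actually run */
};

/*
 * Format the dump text into buf.
 * The vram_lost line is printed only when the caller did not ask to skip it.
 * Returns the number of characters written, or a negative value on error.
 */
static int format_coredump(const struct coredump_info *info, char *buf, size_t len)
{
    int n, m;

    assert(info != NULL && buf != NULL && len > 0);

    n = snprintf(buf, len, "device coredump\n");
    if (n < 0 || (size_t)n >= len)
        return n;

    if (!info->skip_vram_check && info->vram_lost) {
        m = snprintf(buf + n, len - (size_t)n, "vram_lost: yes\n");
        if (m < 0)
            return m;
        n += m;
    }
    return n;
}
```

With `skip_vram_check` set, the output contains only the header line, regardless of the value stored in `vram_lost`.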

[PATCH v4 0/2] Improve the dev coredump for gfx job timeout scenario

2024-08-21 Thread Trigger.Huang
From: Trigger Huang The current dev coredump implementation sometimes cannot fully satisfy customers' requirements because: 1. The dev coredump is called in the GPU reset function, so if GPU reset is disabled, the dev coredump is also disabled. 2. When a job timeout happens, the dump GPU status will be ha

[PATCH v4 1/2] drm/amdgpu: skip printing vram_lost if needed

2024-08-21 Thread Trigger.Huang
From: Trigger Huang The vram_lost status can only be obtained after a GPU reset occurs, but sometimes a dev core dump can happen before the GPU reset. So a new argument is added to tell the dev core dump implementation whether to skip printing the vram_lost status in the dump. And this patch is al

[PATCH v4 2/2] drm/amdgpu: Do core dump immediately when job tmo

2024-08-21 Thread Trigger.Huang
From: Trigger Huang Do the coredump immediately after a job timeout to get a closer representation of the GPU's error status. V2: Skip printing vram_lost, as the GPU reset has not happened yet (Alex). V3: Unconditionally call the core dump, as we care about all the reset functions (soft-recove

[PATCH] drm/amdgpu: save the funcs of gfx software rings

2024-08-01 Thread Trigger.Huang
From: Trigger Huang Currently, the funcs variable of a gfx software ring is not set, so any code that accesses it will execute erroneous logic. For example, if we want to call a callback in the funcs of a gfx software ring, such as per-ring reset, it will fail because funcs is NUL
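
The failure mode described here can be sketched in plain C. The types and names below (`ring_funcs`, `sw_ring_init`, `ring_try_reset`) are hypothetical and are not the amdgpu definitions; the sketch only shows why a callback table that is never saved makes every callback unreachable, and how saving it at init time fixes that.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback table; the real driver's table has many more entries. */
struct ring_funcs {
    int (*reset)(int ring_id);
};

/* Hypothetical ring object holding a pointer to its callback table. */
struct ring {
    int id;
    const struct ring_funcs *funcs;
};

/* Example callback: pretend the per-ring reset always succeeds. */
static int demo_reset(int ring_id)
{
    (void)ring_id;
    return 0;
}

static const struct ring_funcs demo_funcs = { .reset = demo_reset };

/* Initialise a software ring and save the callback table it should use. */
static void sw_ring_init(struct ring *ring, int id, const struct ring_funcs *funcs)
{
    assert(ring != NULL);
    ring->id = id;
    ring->funcs = funcs;
}

/* Returns -1 when no reset callback is reachable, else the callback's result. */
static int ring_try_reset(const struct ring *ring)
{
    if (ring->funcs == NULL || ring->funcs->reset == NULL)
        return -1;
    return ring->funcs->reset(ring->id);
}
```

A ring whose `funcs` was left as NULL makes `ring_try_reset` return -1, while a ring initialised through `sw_ring_init` with `&demo_funcs` reaches the callback.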

[PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter

2024-08-15 Thread Trigger.Huang
From: Trigger Huang Add a new, separate parameter to control the GPU coredump procedure. This can be used to decouple the coredump procedure from the GPU recovery procedure. Signed-off-by: Trigger Huang --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8
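
The decoupling can be pictured with a small sketch, assuming two independent integer switches. The struct and helper names are hypothetical, and the real module parameters may accept other values than the 0/1 used here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the two module parameters (0 = off, 1 = on). */
struct dump_params {
    int gpu_recovery;
    int gpu_coredump;
};

/* The coredump decision depends only on gpu_coredump. */
static bool should_coredump(const struct dump_params *p)
{
    assert(p != NULL);
    return p->gpu_coredump != 0;
}

/* The reset decision depends only on gpu_recovery. */
static bool should_reset(const struct dump_params *p)
{
    assert(p != NULL);
    return p->gpu_recovery != 0;
}
```

With `gpu_recovery = 0` and `gpu_coredump = 1`, a dump is still produced even though no reset is attempted, which is the scenario the cover letter describes.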

[PATCH 2/3] drm/amdgpu: introduce new API for GPU core dump

2024-08-15 Thread Trigger.Huang
From: Trigger Huang Put the IP dump and the registration with dev_coredumpm together. Signed-off-by: Trigger Huang --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 + drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 73 ++ 2 files changed, 75 insertions(+) diff --git a/drivers/gpu/drm/amd

[PATCH 3/3] drm/amdgpu: Change the timing of doing coredump

2024-08-15 Thread Trigger.Huang
From: Trigger Huang Do the coredump immediately after a job timeout to get a closer representation of the GPU's error status. For other code paths that need to do the coredump, keep the original logic unchanged, except: 1. All coredump operations will be under the control of the parameter amdgpu_gpu_c

[PATCH 0/3] Improve the dev coredump

2024-08-15 Thread Trigger.Huang
From: Trigger Huang The current dev coredump implementation sometimes cannot fully satisfy customers' requirements because: 1. The enablement of the dev coredump is under the control of gpu_recovery, so customers cannot do a dev coredump with gpu_recovery disabled. 2. When a job timeout happens, the dump G

[PATCH 0/4] Improve the dev coredump

2024-08-16 Thread Trigger.Huang
From: Trigger Huang The current dev coredump implementation sometimes cannot fully satisfy customers' requirements because: 1. The dev coredump is under the control of gpu_recovery. Consider the following application scenarios: 1) A customer may need to do the core dump with gpu_recover

[PATCH 1/4] drm/amdgpu: Add gpu_coredump parameter

2024-08-16 Thread Trigger.Huang
From: Trigger Huang Add a new, separate parameter to control the GPU coredump procedure. This can be used to decouple the coredump procedure from the GPU recovery procedure. V2: Enable gpu_coredump by default (Alex). Signed-off-by: Trigger Huang --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + drivers/g

[PATCH 2/4] drm/amdgpu: Use gpu_coredump to control core dump

2024-08-16 Thread Trigger.Huang
From: Trigger Huang Do the dev core dump if gpu_coredump is enabled. Signed-off-by: Trigger Huang --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgp

[PATCH 4/4] drm/amdgpu: Do core dump immediately when job tmo

2024-08-16 Thread Trigger.Huang
From: Trigger Huang Do the coredump immediately after a job timeout to get a closer representation of the GPU's error status. V2: Skip printing vram_lost, as the GPU reset has not happened yet (Alex). Signed-off-by: Trigger Huang --- drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 64 +++

[PATCH 3/4] drm/amdgpu: skip printing vram_lost if needed

2024-08-16 Thread Trigger.Huang
From: Trigger Huang The vram_lost status can only be obtained after a GPU reset occurs, but sometimes a dev core dump can happen before the GPU reset. So a new argument is added to tell the dev core dump implementation whether to skip printing the vram_lost status in the dump. And this patch is al