[PATCH] drm/amdgpu: Add support for dpc to the product

2025-08-21 Thread Ce Sun
Add support for dpc to the product Signed-off-by: Ce Sun --- drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c

[PATCH] drm/amdgpu: Add support for dpc to a series of products

2025-08-20 Thread Ce Sun
Add support for dpc to a series of products Signed-off-by: Ce Sun --- drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu13

[PATCH 4/4 v5] drm/amdgpu: Correct the loss of aca bank reg info

2025-08-18 Thread Ce Sun
creation interruption occurs at this time, bank reg info will be lost. (Thomas) v5: each cycle is delayed by 5ms. (Tao) Signed-off-by: Ce Sun --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 74 ++--- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 5 +- drivers/gpu/drm/amd/amdgpu
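
The polling approach this series describes (check the bank count, wait, retry, and make sure the loop can exit) can be sketched in userspace C. This is only an illustration under stated assumptions, not the driver's code: `poll_bank_count`, `bank_count_fn`, `sleep_ms`, and the `fake_source` test double are hypothetical names, and `nanosleep` stands in for a kernel delay helper. The 5 ms delay is passed in by the caller to mirror the v5 note.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback: returns how many banks are currently visible. */
typedef int (*bank_count_fn)(void *ctx);

/* Userspace stand-in for a kernel sleep helper. */
static void sleep_ms(long ms)
{
    struct timespec ts = { .tv_sec = ms / 1000,
                           .tv_nsec = (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/*
 * Poll until get_count() reports at least one bank or the retry budget
 * is used up. Returns the bank count, or 0 on timeout. The retry cap is
 * what guarantees the loop terminates.
 */
static int poll_bank_count(bank_count_fn get_count, void *ctx,
                           int max_retries, long delay_ms)
{
    for (int i = 0; i < max_retries; i++) {
        int count = get_count(ctx);

        if (count > 0)
            return count;
        sleep_ms(delay_ms);
    }
    return 0;
}

/* Test double: reports zero banks until called `ready_after` times. */
struct fake_source {
    int calls;
    int ready_after;
    int banks;
};

static int fake_get_count(void *ctx)
{
    struct fake_source *src = ctx;

    src->calls++;
    return src->calls >= src->ready_after ? src->banks : 0;
}
```

A caller would pass its own query function and pick `max_retries` so that `max_retries * delay_ms` bounds the total wait.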

[PATCH 3/4 v5] drm/amdgpu: Add a mutex lock to protect poison injection

2025-08-18 Thread Ce Sun
When poison is triggered multiple times, a race condition will occur. Add a mutex lock to protect poison injection Signed-off-by: Ce Sun Reviewed-by: Yang Wang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 3 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 3 +++ 2 files changed, 6 insertions
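
As a rough illustration of serializing repeated poison injections with a mutex, here is a userspace sketch using POSIX mutexes. It assumes nothing about the driver's actual structures: `struct poison_ctx`, `poison_ctx_init`, and `inject_poison` are hypothetical names, and the kernel patch would use the kernel's own mutex API rather than pthreads.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical state touched by an injection; not the driver's real type. */
struct poison_ctx {
    pthread_mutex_t lock; /* serializes injections */
    int in_flight;        /* 1 while an injection is running */
    int completed;        /* number of finished injections */
    int overlaps;         /* times an injection started while one was running */
};

static void poison_ctx_init(struct poison_ctx *ctx)
{
    pthread_mutex_init(&ctx->lock, NULL);
    ctx->in_flight = 0;
    ctx->completed = 0;
    ctx->overlaps = 0;
}

/* Run one injection; the lock ensures only one caller is inside at a time. */
static void inject_poison(struct poison_ctx *ctx)
{
    pthread_mutex_lock(&ctx->lock);

    if (ctx->in_flight)
        ctx->overlaps++;  /* cannot happen while the lock is held */
    ctx->in_flight = 1;

    /* ... the actual injection work would go here ... */

    ctx->completed++;
    ctx->in_flight = 0;

    pthread_mutex_unlock(&ctx->lock);
}
```

Because every access to `in_flight` happens under `lock`, concurrent callers queue up instead of interleaving, so `overlaps` stays at zero.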

[PATCH 2/4 v5] drm/amdgpu: Add functions to get bank count

2025-08-18 Thread Ce Sun
Add the amdgpu_aca_get_bank_count Signed-off-by: Ce Sun Signed-off-by: Xiang Liu Reviewed-by: Yang Wang --- drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 10 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_aca.h | 2 ++ 2 files changed, 12 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu

[PATCH 1/4 v5] drm/amdgpu: Correct the counts of nr_banks and nr_errors

2025-08-18 Thread Ce Sun
Correct the counts of nr_banks and nr_errors Signed-off-by: Ce Sun Reviewed-by: Yang Wang --- drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c index cbc40cad581b

[PATCH 4/4 v4] drm/amdgpu: Correct the loss of aca bank reg info

2025-08-17 Thread Ce Sun
Poll the ACA bank count to ensure that valid ACA bank reg info can be obtained v2: add corresponding delay before sending msg to SMU to query mca bank info. (Stanley) v3: the loop cannot exit. (Thomas) v4: remove amdgpu_aca_clear_bank_count. (Kevin) Signed-off-by: Ce Sun --- drivers/gpu

[PATCH 3/4 v4] drm/amdgpu: Add a mutex lock to protect poison injection

2025-08-17 Thread Ce Sun
When poison is triggered multiple times, a race condition will occur. Add a mutex lock to protect poison injection Signed-off-by: Ce Sun --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 3 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 3 +++ 2 files changed, 6 insertions(+) diff --git a/drivers/gpu

[PATCH 2/4 v4] drm/amdgpu: Add functions to get/clear bank count

2025-08-17 Thread Ce Sun
Add the amdgpu_aca_get_bank_count/amdgpu_aca_clear_bank_count interface Signed-off-by: Ce Sun Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 10 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_aca.h | 2 ++ 2 files changed, 12 insertions(+) diff --git a/drivers/gpu/drm

[PATCH 1/4] drm/amdgpu: Correct the counts of nr_banks and nr_errors

2025-08-17 Thread Ce Sun
Correct the counts of nr_banks and nr_errors Signed-off-by: Ce Sun --- drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c index cbc40cad581b..090bf6cf1b91 100644 --- a

[PATCH 4/4 v3] drm/amdgpu: Correct the loss of aca bank reg info

2025-08-14 Thread Ce Sun
Poll the ACA bank count to ensure that valid ACA bank reg info can be obtained v2: add corresponding delay before sending msg to SMU to query mca bank info. (Stanley) v3: the loop cannot exit. (Thomas) Signed-off-by: Ce Sun --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 65

[PATCH 3/4 v3] drm/amdgpu: Add a mutex lock to protect poison injection

2025-08-14 Thread Ce Sun
When poison is triggered multiple times, a race condition will occur. Add a mutex lock to protect poison injection Signed-off-by: Ce Sun --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 3 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 3 +++ 2 files changed, 6 insertions(+) diff --git a/drivers/gpu

[PATCH 2/4 v3] drm/amdgpu: Add functions to get/clear bank count

2025-08-14 Thread Ce Sun
Add the amdgpu_aca_get_bank_count/amdgpu_aca_clear_bank_count interface Signed-off-by: Ce Sun Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 14 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_aca.h | 3 +++ 2 files changed, 17 insertions(+) diff --git a/drivers/gpu

[PATCH 1/4] drm/amdgpu: Correct the counts of nr_banks and nr_errors

2025-08-14 Thread Ce Sun
Correct the counts of nr_banks and nr_errors Signed-off-by: Ce Sun --- drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c index cbc40cad581b..090bf6cf1b91 100644 --- a

[PATCH 3/3 v2] drm/amdgpu: Correct the loss of aca bank reg info

2025-08-13 Thread Ce Sun
Poll the ACA bank count to ensure that valid ACA bank reg info can be obtained v2: add corresponding delay before sending msg to SMU to query mca bank info. (Stanley) Signed-off-by: Ce Sun --- drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 44

[PATCH 2/3 v2] drm/amdgpu: Add functions to get/clear bank count

2025-08-13 Thread Ce Sun
Add the amdgpu_aca_get_bank_count/amdgpu_aca_clear_bank_count interface Signed-off-by: Ce Sun Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 14 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_aca.h | 3 +++ 2 files changed, 17 insertions(+) diff --git a/drivers/gpu

[PATCH 1/3 v2] drm/amdgpu: Correct the counts of nr_banks and nr_errors

2025-08-13 Thread Ce Sun
Correct the counts of nr_banks and nr_errors Signed-off-by: Ce Sun --- drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c index cbc40cad581b..090bf6cf1b91 100644 --- a

[PATCH 3/3] drm/amdgpu: Correct the loss of aca bank reg info

2025-08-12 Thread Ce Sun
Poll the ACA bank count to ensure that valid ACA bank reg info can be obtained Signed-off-by: Ce Sun --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 46 +++-- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 2 -- drivers/gpu/drm/amd/amdgpu/umc_v12_0.c | 7 3 files

[PATCH 2/3] drm/amdgpu: Add functions to get/clear bank count

2025-08-12 Thread Ce Sun
Add the amdgpu_aca_get_bank_count/amdgpu_aca_clear_bank_count interface Signed-off-by: Ce Sun Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 14 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_aca.h | 3 +++ 2 files changed, 17 insertions(+) diff --git a/drivers/gpu

[PATCH 1/3] drm/amdgpu: Correct the counts of nr_banks and nr_errors

2025-08-12 Thread Ce Sun
Correct the counts of nr_banks and nr_errors Signed-off-by: Ce Sun --- drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c index d1e431818212..d14dee8d6632 100644 --- a

[PATCH] drm/amdgpu: Effective health check before reset

2025-07-29 Thread Ce Sun
held by the dpc thread, but dpc thread has not released the reset domain lock. In the dpc callback slot_reset, to obtain the hive lock, the hive lock is held by the gpu recover thread at this time. So a deadlock occurred Signed-off-by: Ce Sun --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26

[PATCH] drm/amdgpu: Effective health check before reset

2025-07-28 Thread Ce Sun
Move amdgpu_device_health_check into amdgpu_device_gpu_recover to ensure that device presence can be checked before reset Signed-off-by: Ce Sun --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 25 +++--- 1 file changed, 8 insertions(+), 17 deletions(-) diff --git a

[PATCH] drm/amdgpu: Avoid rma causes GPU duplicate reset

2025-07-28 Thread Ce Sun
Try to ensure poison creation handle is completed in time to set device rma value. Signed-off-by: Ce Sun Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 17 ++--- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 1 + 2 files changed, 11 insertions(+), 7

[PATCH] drm/amdgpu: The interrupt source was not released

2025-07-11 Thread Ce Sun
When the driver is unloaded, the interrupt source of the rma device is not released, resulting in the failure of hw_init when loading again using bad_page_threshold. Signed-off-by: Ce Sun --- drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff

[PATCH] drm/amdgpu: Fix code style issue

2025-06-29 Thread Ce Sun
cocci warnings: (new ones prefixed by >>) >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6088:8-9: Unneeded variable: "r". >> Return "0" on line 6141 Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202506281925.hhipxio7

[PATCH] drm/amdgpu: Fix code style issue

2025-06-29 Thread Ce Sun
cocci warnings: (new ones prefixed by >>) >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6088:8-9: Unneeded variable: "r". >> Return "0" on line 6141 Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202506281925.hhipxio7

[PATCH] drm/amdgpu: Release reset locks during failures

2025-06-06 Thread Ce Sun
From: Lijo Lazar Make sure to release reset domain lock in case of failures. Signed-off-by: Lijo Lazar Signed-off-by: Ce Sun Fixes: 0f936e23cf9d ("drm/amdgpu: refactor amdgpu_device_gpu_recover") --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 80 +++--- 1 file c

[PATCH v2] drm/amdgpu: Fix the gpu recover deadlock issue in abnormal situations

2025-06-05 Thread Ce Sun
rk+0x2f/0x40 [ 630.636413] ? __sched_group_set_shares+0x160/0x160 [ 630.647232] ret_from_fork_asm+0x11/0x20 Signed-off-by: Ce Sun --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++-- 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/

[PATCH] drm/amdgpu: Fix the gpu recover deadlock issue in abnormal situations

2025-06-05 Thread Ce Sun
rk+0x2f/0x40 [ 630.636413] ? __sched_group_set_shares+0x160/0x160 [ 630.647232] ret_from_fork_asm+0x11/0x20 Signed-off-by: Ce Sun --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/

[PATCH] drm/amdgpu: Fix the gpu recover deadlock issue in abnormal situations

2025-06-05 Thread Ce Sun
rk+0x2f/0x40 [ 630.636413] ? __sched_group_set_shares+0x160/0x160 [ 630.647232] ret_from_fork_asm+0x11/0x20 Signed-off-by: Ce Sun --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 - 1 file changed, 5 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/am

[PATCH] drm/amdgpu: Modify the count method of defer error

2025-05-12 Thread Ce Sun
The number of newly added de counts and the number of newly added error addresses remain consistent Signed-off-by: Ce Sun --- drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h | 1 + drivers/gpu/drm/amd/amdgpu/umc_v12_0.c | 8 ++-- 3 files changed, 8

[PATCH] drm/amdgpu: Fix the kernel panic caused by RAS records exceed threshold

2025-05-11 Thread Ce Sun
0011 [ 5139.304690] R10: 000a R11: 0246 R12: 55ce8b8f9a70 [ 5139.304691] R13: 55ce8b8f2ec0 R14: 55ce8b8f2ab0 R15: 55ce8b8f9aa0 [ 5139.304692] [ 5139.304693] ---[ end trace 8536b052f7883003 ]--- Signed-off-by: Ce Sun --- drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c | 5 +++

[PATCH] drm/amdgpu: Fix the kernel panic caused by RAS records exceed threshold

2025-05-11 Thread Ce Sun
0011 [ 5139.304690] R10: 000a R11: 0246 R12: 55ce8b8f9a70 [ 5139.304691] R13: 55ce8b8f2ec0 R14: 55ce8b8f2ab0 R15: 55ce8b8f9aa0 [ 5139.304692] [ 5139.304693] ---[ end trace 8536b052f7883003 ]--- Signed-off-by: Ce Sun --- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |

[PATCH] drm/amdgpu: Fix the kernel panic caused by RAS records exceed threshold

2025-05-08 Thread Ce Sun
6 R12: 55ce8b8f9a70 [ 5139.304691] R13: 55ce8b8f2ec0 R14: 55ce8b8f2ab0 R15: 55ce8b8f9aa0 [ 5139.304692] [ 5139.304693] ---[ end trace 8536b052f7883003 ]--- Signed-off-by: Ce Sun --- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 +++- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 2 ++ dri

[PATCH] drm/amdgpu: Modify the count method of defer error

2025-05-06 Thread Ce Sun
The number of newly added de counts and the number of newly added error addresses remain consistent Signed-off-by: Ce Sun --- drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h | 1 + drivers/gpu/drm/amd/amdgpu/umc_v12_0.c | 11 +-- 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a

[PATCH] drm/amdgpu: Modify the count method of defer error

2025-05-06 Thread Ce Sun
The number of newly added de counts and the number of newly added error addresses remain consistent --- drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h | 1 + drivers/gpu/drm/amd/amdgpu/umc_v12_0.c | 11 +-- 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdg

[PATCH] drm/amdgpu: Replace tmp_adev with hive in amdgpu_pci_slot_reset

2025-04-09 Thread Ce Sun
Checking hive is more readable. The following smatch warning: drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6820 amdgpu_pci_slot_reset() warn: iterator used outside loop: 'tmp_adev' Fixes: 8ba904f54148 ("drm/amdgpu: Multi-GPU DPC recovery support") Reported-by: Dan Carpenter S

[PATCH] drm/amdgpu: Replace tmp_adev with hive in amdgpu_pci_slot_reset

2025-04-09 Thread Ce Sun
Checking hive is more readable. The following smatch warning: drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6820 amdgpu_pci_slot_reset() warn: iterator used outside loop: 'tmp_adev' Fixes: 8ba904f54148 ("drm/amdgpu: Multi-GPU DPC recovery support") Reported-by: Dan Carpenter S

[PATCH v1 1/1] drm/amdgpu: fix a smatch static checker warning in amdgpu_pci_slot_reset

2025-04-09 Thread Ce Sun
Fixes smatch warning: drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6820 amdgpu_pci_slot_reset() warn: iterator used outside loop: 'tmp_adev' Fixes: 8ba904f54148 ("drm/amdgpu: Multi-GPU DPC recovery support") Signed-off-by: Ce Sun --- drivers/gpu/drm/amd/amdgpu/amdgpu_dev
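
The "iterator used outside loop" warning arises because, in a circular list with a sentinel head, the loop cursor is never NULL after the loop ends, so testing it says nothing about whether a match was found. One common way to avoid the pattern is to record the match in a separate pointer. The sketch below shows that alternative in userspace C; it is not what the patch above does (that patch checks `hive` instead), and `struct dev`, `dev_of`, and `find_dev` are hypothetical names with a simplified singly linked circular list standing in for the kernel's list API.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular list link; the head is a sentinel with no payload. */
struct list_head {
    struct list_head *next;
};

/* Hypothetical device entry embedded in the list. */
struct dev {
    int id;
    struct list_head node;
};

/* Recover the containing struct dev from its embedded list node. */
#define dev_of(ptr) \
    ((struct dev *)((char *)(ptr) - offsetof(struct dev, node)))

/*
 * Return the entry with the given id, or NULL if none matches.
 * The loop cursor `pos` is never used after the loop; only `found`
 * escapes, and it is non-NULL only when a real entry matched.
 */
static struct dev *find_dev(struct list_head *head, int id)
{
    struct dev *found = NULL;

    for (struct list_head *pos = head->next; pos != head; pos = pos->next) {
        struct dev *d = dev_of(pos);

        if (d->id == id) {
            found = d;
            break;
        }
    }
    return found;
}
```

With this shape, callers can test the return value directly, and the cursor never leaves the loop's scope.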

[PATCH v4 2/4] drm/amdgpu: refactor amdgpu_device_gpu_recover

2025-03-28 Thread Ce Sun
Split amdgpu_device_gpu_recover into the following stages: halt activities, asic reset, schedule resume and amdgpu resume. The reason is that the subsequent addition of dpc recover code will have a high similarity with gpu reset Signed-off-by: Ce Sun Reviewed-by: Hawking Zhang --- drivers/gpu

[PATCH v4 4/4] drm/amdgpu/vcn: during dpc recovery will corrupt VCPU buffer

2025-03-25 Thread Ce Sun
err_event_athub and dpc recovery will corrupt VCPU buffer, so we need to restore fw data and clear buffer in amdgpu_vcn_resume() Signed-off-by: Ce Sun Reviewed-by: Hawking Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a

[PATCH v4 3/4] drm/amdgpu: Multi-GPU DPC recovery support

2025-03-20 Thread Ce Sun
Add support for DPC recover based on refactored code Signed-off-by: Ce Sun Reviewed-by: Hawking Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu.h| 5 + drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 172 ++--- drivers/gpu/drm/amd/amdgpu/soc15.c | 5 + 3 files

[PATCH v4 1/4] drm/amd/pm: Add link reset for SMU 13.0.6

2025-03-20 Thread Ce Sun
Add link reset implementation Signed-off-by: Ce Sun Reviewed-by: Hawking Zhang --- drivers/gpu/drm/amd/pm/amdgpu_dpm.c | 28 +++ drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h | 2 ++ drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 26 + drivers/gpu

[PATCH v4 0/4] Support for multi-GPU interconnection to trigger dpc recovery

2025-03-20 Thread Ce Sun
:38:00.0: amdgpu: RW: 0x0 Solved by patch-4 Ce Sun (4): drm/amd/pm: Add link reset for SMU 13.0.6 drm/amdgpu: refactor amdgpu_device_gpu_recover drm/amdgpu: Multi-GPU DPC recovery support drm/amdgpu/vcn: during dpc recovery will corrupt VCPU buffer drivers/gpu/drm/amd/amdgpu