[v6 4/5] drm/amdgpu: enhance error handling in function amdgpu_pci_probe()

2025-01-23 Thread Jiang Liu
Enhance error handling in function amdgpu_pci_probe() to avoid possible resource leakage. Signed-off-by: Jiang Liu Reviewed-by: Mario Limonciello --- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd

[v6 1/5] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()

2025-01-23 Thread Jiang Liu
Introduce a new interface, amdgpu_xcp_drm_dev_free(), to free a specific drm_device created by amdgpu_xcp_drm_dev_alloc(), which will be used for error recovery. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 63 + drivers/gpu/drm/amd/amdxcp

[v6 3/5] drm/amdgpu: fix invalid memory access in amdgpu_xcp_cfg_sysfs_fini()

2025-01-23 Thread Jiang Liu
a/0x30 [ 90.092742] [ 90.252277] ---[ end trace ]--- Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c index 272954

[v6 0/5] Fix several bugs in error handling during device probe

2025-01-23 Thread Jiang Liu
is unnecessary. 3) add amdgpu_xcp_drm_dev_free() in patch 0003 to enhance amdxcp driver to better support device remove and error handling. 4) reworked patch 0005 to fix it in amdgpu instead of drm core. Jiang Liu (5): drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free() drm/amdgpu: fix

[v6 5/5] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()

2025-01-23 Thread Jiang Liu
Introduce amdgpu_device_fini_schedulers() to clean scheduler related resources, and avoid possible invalid memory access. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 29 +++--- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 9 --- 2 files changed

[v6 2/5] drm/amdgpu: fix use after free bug related to amdgpu_driver_release_kms()

2025-01-23 Thread Jiang Liu
+0x176/0x310 [16002.344324] do_syscall_64+0x5d/0x170 [16002.348858] entry_SYSCALL_64_after_hwframe+0x76/0x7e [16002.354956] RIP: 0033:0x7f2736a620cb-12-26 Fix it by removing the xcp drm devices when probing the GPU device fails. Signed-off-by: Jiang Liu Tested-by: Shuo Liu Reviewed-by: Lijo Lazar

[v5 4/5] drm/amdgpu: enhance error handling in function amdgpu_pci_probe()

2025-01-16 Thread Jiang Liu
Enhance error handling in function amdgpu_pci_probe() to avoid possible resource leakage. Signed-off-by: Jiang Liu Reviewed-by: Mario Limonciello --- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd

[v5 3/5] drm/amdgpu: fix invalid memory access in amdgpu_xcp_cfg_sysfs_fini()

2025-01-16 Thread Jiang Liu
a/0x30 [ 90.092742] [ 90.252277] ---[ end trace ]--- Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c index 272954

[v5 5/5] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()

2025-01-16 Thread Jiang Liu
Introduce amdgpu_device_fini_schedulers() to clean scheduler related resources, and avoid possible invalid memory access. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 35 +++--- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 9 -- 2 files changed

[v5 0/5] Fix several bugs in error handling during device probe

2025-01-16 Thread Jiang Liu
amdgpu_xcp_drm_dev_free() in patch 0003 to enhance amdxcp driver to better support device remove and error handling. 4) reworked patch 0005 to fix it in amdgpu instead of drm core. Jiang Liu (5): drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free() drm/amdgpu: fix use after free bug related to

[v5 1/5] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()

2025-01-16 Thread Jiang Liu
Introduce a new interface, amdgpu_xcp_drm_dev_free(), to free a specific drm_device created by amdgpu_xcp_drm_dev_alloc(), which will be used for error recovery. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 63 + drivers/gpu/drm/amd/amdxcp

[v5 2/5] drm/amdgpu: fix use after free bug related to amdgpu_driver_release_kms()

2025-01-16 Thread Jiang Liu
+0x176/0x310 [16002.344324] do_syscall_64+0x5d/0x170 [16002.348858] entry_SYSCALL_64_after_hwframe+0x76/0x7e [16002.354956] RIP: 0033:0x7f2736a620cb-12-26 Fix it by removing the xcp drm devices when probing the GPU device fails. Signed-off-by: Jiang Liu Tested-by: Shuo Liu Reviewed-by: Lijo Lazar

[RFC v1 2/2] drm/amdgpu: introduce helper amdgpu_bo_get_pinned_gpu_addr()

2025-01-14 Thread Jiang Liu
Introduce helper amdgpu_bo_get_pinned_gpu_addr(), which will be used to update GPU address of pinned kernel BO during resume. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 9 + drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 + drivers/gpu/drm/amd/amdgpu

[RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs

2025-01-14 Thread Jiang Liu
so we can't test our hypothesis. We are also not sure whether other blockers remain for resuming with different AMD SR-IOV vGPUs. Help is needed to identify the remaining work items for enabling resume with different AMD SR-IOV vGPUs :) Jiang Liu (2): drm/amdgpu: update cached vram base addres

[RFC v1 1/2] drm/amdgpu: update cached vram base addresses on resume

2025-01-14 Thread Jiang Liu
When resuming on a different SR-IOV vGPU device, the VRAM base addresses may have changed, so we need to update the cached addresses. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h| 6 -- drivers/gpu

[PATCH 1/1] amdgpu/soc15: enable asic reset for dGPU in case of suspend abort

2025-01-12 Thread Jiang Liu
: amdgpu_device_ip_resume failed (-110). [ 555.126965] PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -110 [ 555.126966] PM: Device :0a:00.0 failed to resume async: error -110 This fix has been tested on Mi308X. Signed-off-by: Jiang Liu Tested-by: Shuo Liu --- drivers/gpu/drm/amd/amdgpu/soc15.c

[RFC v2 11/15] drm/amdgpu: convert ip block bool flags into an enum

2025-01-12 Thread Jiang Liu
() AMDGPU_IP_STATE_SW .sw_fini() AMDGPU_IP_STATE_EARLY .late_fini() AMDGPU_IP_STATE_INVALID Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/aldebaran.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 48 ++- drivers/gpu/drm/amd/amdgpu

[RFC v2 02/15] drm/amdgpu: add a flag to track ras debugfs creation status

2025-01-12 Thread Jiang Liu
Add a flag to track ras debugfs creation status, to avoid possible incorrect reference count management for ras block object in function amdgpu_ras_aca_is_supported(). Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9

[RFC v2 00/15] Enhance device state machine to better support suspend/resume

2025-01-12 Thread Jiang Liu
- refine the way to define status markers - split amdgpu_dm related change into a dedicated patch - add patch 13 to walk ip blocks in reverse order when shutdown Jiang Liu (15): drm/amdgpu: add helper functions to track status for ras manager drm/amdgpu: add a flag to track ras debugfs creation

[RFC v2 06/15] drm/amdgpu: enhance amdgpu_ras_pre_fini() to better support SR

2025-01-12 Thread Jiang Liu
d-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c| 44 +- 2 files changed, 31 insertions(+), 19 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_

[RFC v2 10/15] drm/admgpu: make device state machine work in stack like way

2025-01-12 Thread Jiang Liu
` callback. 4) call amdgpu_ras_fini() before invoking ip_blocks[i].late_fini. There's one more task left to analyze GPU reset related state machine transitions. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 -- 1 file changed, 20 inser

[RFC v2 12/15] drm/amdgpu: introduce IP block iterators to reduce duplicated code

2025-01-12 Thread Jiang Liu
Introduce the following IP block iterators to reduce duplicated code: - amdgpu_for_each_ip_block - amdgpu_for_each_ip_block_reverse - amdgpu_for_each_ip_block_valid - amdgpu_for_each_ip_block_valid_reverse Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/aldebaran.c| 46
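
The snippet above names four iterators but is cut off before their definitions. As a rough illustration only, the sketch below shows one common way such for-each macros can be built over an array of blocks with a `valid` flag. The `iter_demo_*` names, struct layout and macro arguments are all hypothetical and are not taken from the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; the real amdgpu structures are not shown in the snippet. */
struct iter_demo_block { int type; bool valid; };
struct iter_demo_device { struct iter_demo_block blocks[8]; int num_blocks; };

/* Forward walk: i is the index, blk is set to the current block on every pass. */
#define iter_demo_for_each(dev, i, blk) \
	for ((i) = 0; (i) < (dev)->num_blocks && ((blk) = &(dev)->blocks[(i)], 1); (i)++)

/* Reverse walk, as used on teardown paths. */
#define iter_demo_for_each_reverse(dev, i, blk) \
	for ((i) = (dev)->num_blocks - 1; (i) >= 0 && ((blk) = &(dev)->blocks[(i)], 1); (i)--)

/* "valid" variants skip blocks whose valid flag is clear. */
#define iter_demo_for_each_valid(dev, i, blk) \
	iter_demo_for_each(dev, i, blk) if (!(blk)->valid) continue; else

#define iter_demo_for_each_valid_reverse(dev, i, blk) \
	iter_demo_for_each_reverse(dev, i, blk) if (!(blk)->valid) continue; else

/* Count the valid blocks with the forward iterator. */
int iter_demo_count_valid(const struct iter_demo_device *dev)
{
	const struct iter_demo_block *blk;
	int i, n = 0;

	iter_demo_for_each_valid(dev, i, blk)
		n++;
	return n;
}

/* Return the type of the last valid block, or -1 if none is valid. */
int iter_demo_last_valid_type(const struct iter_demo_device *dev)
{
	const struct iter_demo_block *blk;
	int i;

	iter_demo_for_each_valid_reverse(dev, i, blk)
		return blk->type;
	return -1;
}
```

Hiding the index bounds and the validity check inside the macro is what removes the duplicated loop headers that the patch description refers to.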

[RFC v2 04/15] drm/amdgpu: introduce a flag to track refcount held for features

2025-01-12 Thread Jiang Liu
Currently we track the refcount on ras block object for features by checking `if (obj && amdgpu_ras_is_feature_enabled(adev, head))`, which is a little unreliable. So introduce a dedicated flag to track the reference count. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/

[RFC v2 14/15] drm/amdgpu/nbio: improve the way to manage irq reference count

2025-01-12 Thread Jiang Liu
amdgpu_nbio_ras_early_fini() to undo work done by amdgpu_nbio_ras_late_init(). 2) remove call of amdgpu_irq_put in _hw_fini(). 3) record the status where reference count is held for specific irq. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.c | 16 +++- drivers/gpu/drm/amd

[RFC v2 13/15] drm/amdgpu: walk IP blocks in reverse order when shutdown

2025-01-12 Thread Jiang Liu
Walk IP blocks in reverse order in functions amdgpu_device_ip_fini_early and amdgpu_device_smu_fini_early, to stay consistent with the other fini functions. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff

[RFC v2 15/15] drm/amdgpu/asic: make ip block operations symmetric by .early_fini()

2025-01-12 Thread Jiang Liu
(). 3) call xgpu_nv_mailbox_put_irq() for nv.c to avoid possible resource leakage. 4) use flags to track irq reference count usage. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/nv.c| 14 +++- drivers/gpu/drm/amd/amdgpu/soc15.c | 22 +++ drivers/gpu/drm/amd

[RFC v2 07/15] drm/admgpu: rename amdgpu_ras_pre_fini() to amdgpu_ras_early_fini()

2025-01-12 Thread Jiang Liu
Rename amdgpu_ras_pre_fini() to amdgpu_ras_early_fini(), to match the naming style of the other code. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 +++--- drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c| 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c | 2 +- drivers/gpu

[RFC v2 08/15] drm/amdgpu: make IP block state machine works in stack like way

2025-01-12 Thread Jiang Liu
true sw_init:sw = true hw_init:hw = true late_init: late_initialized = true early_fini: late_initialized = false hw_fini:hw = false sw_fini:sw = false late_fini: valid = false Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.
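
The flag list above describes a stack-like lifecycle: each init callback sets one flag, and each fini callback clears the matching flag in the opposite order. A minimal sketch of that idea follows. The `state_demo_*` names are hypothetical, and the step that sets `valid` is an assumption, because the snippet is truncated before naming it.

```c
#include <stdbool.h>

/* Hypothetical mirror of the four status flags listed in the snippet. */
struct state_demo_status {
	bool valid;
	bool sw;
	bool hw;
	bool late_initialized;
};

/* Bring-up: run the first `steps` init stages (0..4), each setting one flag. */
void state_demo_bring_up(struct state_demo_status *s, int steps)
{
	if (steps >= 1)
		s->valid = true;            /* assumed first stage; elided in the snippet */
	if (steps >= 2)
		s->sw = true;               /* sw_init */
	if (steps >= 3)
		s->hw = true;               /* hw_init */
	if (steps >= 4)
		s->late_initialized = true; /* late_init */
}

/*
 * Tear-down: undo in reverse order and skip any stage that was never reached,
 * so a partially initialized block is unwound safely.
 * Returns the number of stages undone.
 */
int state_demo_tear_down(struct state_demo_status *s)
{
	int undone = 0;

	if (s->late_initialized) { s->late_initialized = false; undone++; } /* early_fini */
	if (s->hw)               { s->hw = false;               undone++; } /* hw_fini */
	if (s->sw)               { s->sw = false;               undone++; } /* sw_fini */
	if (s->valid)            { s->valid = false;            undone++; } /* late_fini */
	return undone;
}
```

Because every fini stage checks its own flag, the same tear-down routine can serve both a full shutdown and an error path that stopped partway through bring-up.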

[RFC v2 09/15] drm/amdgpu_dm: enhance amdgpu_dm_early_fini() for PM ops

2025-01-12 Thread Jiang Liu
Enhance amdgpu_dm_early_fini() so it can be called in power management operations. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd

[RFC v2 03/15] drm/amdgpu: free all resources on error recovery path of amdgpu_ras_init()

2025-01-12 Thread Jiang Liu
Free all allocated resources on error recovery path in function amdgpu_ras_init(). Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 19 ++- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu

[RFC v2 05/15] drm/amdgpu: enhance amdgpu_ras_block_late_fini()

2025-01-12 Thread Jiang Liu
Enhance amdgpu_ras_block_late_fini() to revert what has been done by amdgpu_ras_block_late_init(), and fix a possible resource leakage in function amdgpu_ras_block_late_init(). Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 16 ++-- 1 file changed, 10

[RFC v2 01/15] drm/amdgpu: add helper functions to track status for ras manager

2025-01-12 Thread Jiang Liu
Add helper functions to track status for ras manager and ip blocks. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 38 + drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 37 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 10 +++ 3

[v4 5/5] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()

2025-01-09 Thread Jiang Liu
Function detects initialization status by checking sched->ops, so set sched->ops to non-NULL just before return in functions amdgpu_fence_driver_sw_fini() and amdgpu_device_init_schedulers() to avoid possible invalid memory access on the error recovery path. Signed-off-by: Jiang Liu --- drive

[v4 1/5] drm/amdgpu: clear adev->in_suspend flag when fails to suspend

2025-01-09 Thread Jiang Liu
0 [ 1802.213878] [ 1802.213879] ---[ end trace ]--- Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 + 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu

[v4 3/5] drm/amdgpu: fix use after free bug related to amdgpu_driver_release_kms()

2025-01-09 Thread Jiang Liu
+0x176/0x310 [16002.344324] do_syscall_64+0x5d/0x170 [16002.348858] entry_SYSCALL_64_after_hwframe+0x76/0x7e [16002.354956] RIP: 0033:0x7f2736a620cb-12-26 Fix it by removing the xcp drm devices when probing the GPU device fails. Signed-off-by: Jiang Liu Tested-by: Shuo Liu Reviewed-by: Lijo Lazar

[v4 2/5] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()

2025-01-09 Thread Jiang Liu
Introduce a new interface, amdgpu_xcp_drm_dev_free(), to free a specific drm_device created by amdgpu_xcp_drm_dev_alloc(), which will be used for error recovery. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 65 + drivers/gpu/drm/amd/amdxcp

[v4 4/5] drm/amdgpu: enhance error handling in function amdgpu_pci_probe()

2025-01-09 Thread Jiang Liu
Enhance error handling in function amdgpu_pci_probe() to avoid possible resource leakage. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu

[v4 0/6] Fix several bugs in error handling during device probe

2025-01-09 Thread Jiang Liu
drm core. Jiang Liu (5): drm/amdgpu: clear adev->in_suspend flag when fails to suspend drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free() drm/amdgpu: fix use after free bug related to amdgpu_driver_release_kms() drm/amdgpu: enhance error handling in function amdgpu_pci_pr

[RFC PATCH 12/13] drm/amdgpu/nbio: improve the way to manage irq reference count

2025-01-08 Thread Jiang Liu
amdgpu_nbio_ras_early_fini() to undo work done by amdgpu_nbio_ras_late_init(). 2) remove call of amdgpu_irq_put in _hw_fini(). 3) record the status where reference count is held for specific irq. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.c | 16 +++- drivers/gpu/drm/amd

[RFC PATCH 02/13] drm/admgpu: add helper functions to track status for ras manager

2025-01-08 Thread Jiang Liu
Add helper functions to track status for ras manager and ip blocks. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 38 + drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 38 + drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 10

[RFC PATCH 07/13] drm/amdgpu: enhance amdgpu_ras_pre_fini() to better support SR

2025-01-08 Thread Jiang Liu
d-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c| 45 ++ 2 files changed, 32 insertions(+), 19 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_

[no subject]

2025-01-08 Thread Jiang Liu
etc, to follow the new design. So far we have only taken nbio and asic as examples to show the proposed changes. Once we have confirmed that this is the right way to go, we will handle the remaining subsystems. This is at an early stage and we are requesting comments; any comments and suggestions

[RFC PATCH 06/13] drm/amdgpu: enhance amdgpu_ras_block_late_fini()

2025-01-08 Thread Jiang Liu
Enhance amdgpu_ras_block_late_fini() to revert what has been done by amdgpu_ras_block_late_init(), and fix a possible resource leakage in function amdgpu_ras_block_late_init(). Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 16 ++-- 1 file changed, 10

[RFC PATCH 08/13] drm/admgpu: rename amdgpu_ras_pre_fini() to amdgpu_ras_early_fini()

2025-01-08 Thread Jiang Liu
Rename amdgpu_ras_pre_fini() to amdgpu_ras_early_fini(), to match the naming style of the other code. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 +++--- drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c| 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c | 2 +- drivers/gpu

[RFC PATCH 03/13] drm/amdgpu: add a flag to track ras debugfs creation status

2025-01-08 Thread Jiang Liu
Add a flag to track ras debugfs creation status, to avoid possible incorrect reference count management for ras block object in function amdgpu_ras_aca_is_supported(). Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9

[RFC PATCH 09/13] drm/amdgpu: make IP block state machine works in stack like way

2025-01-08 Thread Jiang Liu
_init:sw = true hw_init:hw = true late_init: late_initialized = true early_fini: late_initialized = false hw_fini:hw = false sw_fini:sw = false late_fini: valid = false Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 -

[RFC PATCH 05/13] drm/amdgpu: introduce a flag to track refcount held for features

2025-01-08 Thread Jiang Liu
Currently we track the refcount on ras block object for features by checking `if (obj && amdgpu_ras_is_feature_enabled(adev, head))`, which is a little unreliable. So introduce a dedicated flag to track the reference count. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/

[RFC PATCH 13/13] drm/amdgpu/asic: make ip block operations symmetric by .early_fini()

2025-01-08 Thread Jiang Liu
(). 3) call xgpu_nv_mailbox_put_irq() for nv.c to avoid possible resource leakage. 4) use flags to track irq reference count usage. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/nv.c| 14 +++- drivers/gpu/drm/amd/amdgpu/soc15.c | 22 +++ drivers/gpu/drm/amd

[RFC PATCH 11/13] drm/amdgpu/sdma: improve the way to manage irq reference count

2025-01-08 Thread Jiang Liu
_init(amdgpu_irq_get), but sdma_v4_4_2_xcp_suspend() invokes amdgpu_irq_put(), which leaves the irq reference count unbalanced. Fix it by calling amdgpu_irq_get() in function sdma_v4_4_2_xcp_resume(). Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 +- drivers/gpu/drm/amd/a
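
To make the imbalance concrete, here is a small self-contained model of a reference-counted interrupt source. It is only a sketch: the `irq_demo_*` functions are hypothetical stand-ins and do not reproduce the real amdgpu_irq_get()/amdgpu_irq_put() behaviour. It counts how many "put" calls hit a zero refcount across repeated suspend/resume cycles, with and without a matching "get" in resume.

```c
#include <stdbool.h>

/* Hypothetical model of one interrupt source with a reference count. */
struct irq_demo_src { int refcount; };

void irq_demo_get(struct irq_demo_src *s)
{
	s->refcount++;
}

/* Returns -1 on an unbalanced put (refcount already zero), 0 otherwise. */
int irq_demo_put(struct irq_demo_src *s)
{
	if (s->refcount == 0)
		return -1;
	s->refcount--;
	return 0;
}

/*
 * Run `cycles` suspend/resume rounds after a single get at init time.
 * Suspend always drops one reference; resume re-takes it only when
 * `resume_takes_ref` is true. Returns the number of unbalanced puts.
 */
int irq_demo_run_cycles(int cycles, bool resume_takes_ref)
{
	struct irq_demo_src src = { 0 };
	int bad_puts = 0;

	irq_demo_get(&src);                 /* init path takes the reference */
	for (int i = 0; i < cycles; i++) {
		if (irq_demo_put(&src) < 0) /* suspend path drops it */
			bad_puts++;
		if (resume_takes_ref)
			irq_demo_get(&src); /* resume path restores it */
	}
	return bad_puts;
}
```

With `resume_takes_ref` set, every suspend finds a reference to drop. Without it, only the first suspend does, and every later one is an unbalanced put.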

[RFC PATCH 01/13] amdgpu: wrong array index to get ip block for PSP

2025-01-08 Thread Jiang Liu
The adev->ip_blocks array is not indexed by AMD_IP_BLOCK_TYPE_xxx; instead we should call amdgpu_device_ip_get_ip_block() to get the corresponding IP block object. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deleti
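
The bug class described above, using a type enum as an array index when the array is ordered by registration instead, can be illustrated with the sketch below. The `lookup_demo_*` names and the enum values are hypothetical; only the idea of searching by type rather than indexing by type comes from the patch description.

```c
#include <stddef.h>

/* Hypothetical block types; the numeric values deliberately differ from array positions. */
enum lookup_demo_type {
	LOOKUP_DEMO_COMMON,
	LOOKUP_DEMO_GMC,
	LOOKUP_DEMO_IH,
	LOOKUP_DEMO_PSP,
};

struct lookup_demo_block { enum lookup_demo_type type; int version; };

/* Blocks are stored in registration order, not in enum order. */
struct lookup_demo_device { struct lookup_demo_block blocks[4]; int num_blocks; };

/* Search by type instead of indexing by type; returns NULL if the block is absent. */
struct lookup_demo_block *lookup_demo_get_block(struct lookup_demo_device *dev,
						enum lookup_demo_type type)
{
	for (int i = 0; i < dev->num_blocks; i++) {
		if (dev->blocks[i].type == type)
			return &dev->blocks[i];
	}
	return NULL;
}
```

If a device registers its blocks as COMMON, PSP, GMC, then `blocks[LOOKUP_DEMO_PSP]` would read slot 3, which is past the three registered entries, while the search returns slot 1.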

[RFC PATCH 10/13] drm/admgpu: make device state machine work in stack like way

2025-01-08 Thread Jiang Liu
_fini. There's one more task left to analyze GPU reset related state machine transitions. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c| 22 +-- .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +++ 2 files changed, 23 insertions(+), 2 deletion

[RFC PATCH 04/13] drm/amdgpu: free all resources on error recovery path of amdgpu_ras_init()

2025-01-08 Thread Jiang Liu
Free all allocated resources on error recovery path in function amdgpu_ras_init(). Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 19 ++- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu

[v3 6/6] drm/amdgpu: get rid of false warnings caused by amdgpu_irq_put()

2025-01-08 Thread Jiang Liu
301408] ret_from_fork+0x1f/0x30 [ 1209.301410] ---[ end trace 733f120fe2ab13e5 ]--- [ 1209.301418] [ cut here ] Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 9 +++-- drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 1 + 2 files changed, 8 insertions

[v3 4/6] drm/amdgpu: enhance error handling in function amdgpu_pci_probe()

2025-01-08 Thread Jiang Liu
Enhance error handling in function amdgpu_pci_probe() to avoid possible resource leakage. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu

[v3 5/6] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()

2025-01-08 Thread Jiang Liu
Function detects initialization status by checking sched->ops, so set sched->ops to non-NULL just before return in functions amdgpu_fence_driver_sw_fini() and amdgpu_device_init_schedulers() to avoid possible invalid memory access on the error recovery path. Signed-off-by: Jiang Liu --- drive

[v3 3/6] drm/amdgpu: fix use after free bug related to amdgpu_driver_release_kms()

2025-01-08 Thread Jiang Liu
348858] entry_SYSCALL_64_after_hwframe+0x76/0x7e 2024-12-26 16:17:46 [16002.354956] RIP: 0033:0x7f2736a620cb-12-26 Fix it by removing the xcp drm devices when probing the GPU device fails. Signed-off-by: Jiang Liu Tested-by: Shuo Liu Reviewed-by: Lijo Lazar --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +- drive

[v3 2/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()

2025-01-08 Thread Jiang Liu
Introduce a new interface, amdgpu_xcp_drm_dev_free(), to free a specific drm_device created by amdgpu_xcp_drm_dev_alloc(), which will be used for error recovery. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 76 + drivers/gpu/drm/amd/amdxcp

[PATCH v3 0/6] Fix several bugs in error handling during device probe

2025-01-08 Thread Jiang Liu
amdxcp driver to better support device remove and error handling. 4) reworked patch 0005 to fix it in amdgpu instead of drm core. Jiang Liu (6): drm/amdgpu: clear adev->in_suspend flag when fails to suspend drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free() drm/amdgpu: fix use af

[v3 1/6] drm/amdgpu: clear adev->in_suspend flag when fails to suspend

2025-01-08 Thread Jiang Liu
0 [ 1802.213878] [ 1802.213879] ---[ end trace ]--- Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 + 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu

[PATCH v2 4/6] amdgpu: fix use after free bug related to amdgpu_driver_release_kms()

2025-01-05 Thread Jiang Liu
348858] entry_SYSCALL_64_after_hwframe+0x76/0x7e 2024-12-26 16:17:46 [16002.354956] RIP: 0033:0x7f2736a620cb-12-26 Fix it by unplugging the xcp drm devices when probing the GPU device fails. Signed-off-by: Jiang Liu Tested-by: Shuo Liu Reviewed-by: Lijo Lazar --- drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 4 +++-

[PATCH v2 3/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()

2025-01-05 Thread Jiang Liu
Introduce a new interface, amdgpu_xcp_drm_dev_free(), to free a specific drm_device created by amdgpu_xcp_drm_dev_alloc(), which will be used for error recovery. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c | 11 +++- drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h | 1

[PATCH v2 6/6] amdgpu: get rid of false warnings caused by amdgpu_irq_put()

2025-01-05 Thread Jiang Liu
301408] ret_from_fork+0x1f/0x30 [ 1209.301410] ---[ end trace 733f120fe2ab13e5 ]--- [ 1209.301418] [ cut here ] Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 9 +++-- drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 1 + 2 files changed, 8 insertions

[PATCH v2 5/6] amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()

2025-01-05 Thread Jiang Liu
Function detects initialization status by checking sched->ops, so set sched->ops to non-NULL just before return in function drm_sched_init() to avoid possible invalid memory access on the error recovery path. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 + drive

[PATCH v2 2/6] amdgpu: clear adev->in_suspend flag when fails to suspend

2025-01-05 Thread Jiang Liu
0 [ 1802.213878] [ 1802.213879] ---[ end trace ]--- Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 + 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu

[PATCH v2 0/6] Fix several bugs in error handling during device probe

2025-01-05 Thread Jiang Liu
amd-staging-drm-next. 2) removed the first patch, which is unnecessary. 3) add amdgpu_xcp_drm_dev_free() in patch 0003 to enhance amdxcp driver to better support device remove and error handling. 4) reworked patch 0005 to fix it in amdgpu instead of drm core. Jiang Liu (6): amdgpu: fix invalid

[PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes()

2025-01-05 Thread Jiang Liu
Fix possible resource leakage on error recovery path in function kgd2kfd_device_init(). Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdkfd/kfd_device.c | 9 + 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm

[PATCH 5/6] amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()

2025-01-02 Thread Jiang Liu
Function detects initialization status by checking sched->ops, so set sched->ops to non-NULL just before return in function drm_sched_init() to avoid possible invalid memory access on the error recovery path. Signed-off-by: Jiang Liu --- drivers/gpu/drm/scheduler/sched_main.c | 3 +++ 1 file c

[PATCH 2/6] amdgpu: fix invalid memory access in kfd_cleanup_nodes()

2025-01-02 Thread Jiang Liu
01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 6d 19 00 f7 d8 64 89 01 48 Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdkfd/kfd_device.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --g

[PATCH 3/6] amdgpu: clear adev->in_suspend flag when fails to suspend

2025-01-02 Thread Jiang Liu
0 [ 1802.213878] [ 1802.213879] ---[ end trace ]--- Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 + 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu

[PATCH 0/6] Fix several bugs in error handling during device

2025-01-02 Thread Jiang Liu
This patchset tries to fix several memory leaks and invalid memory accesses on the error handling paths during GPU driver loading/unloading. The patches apply to: https://github.com/ROCm/ROCK-Kernel-Driver/tree/master/drivers Jiang Liu (6): amdgpu: add flags to track sysfs initialization status amdgpu

[PATCH 4/6] amdgpu: fix use after free bug related to amdgpu_driver_release_kms()

2025-01-02 Thread Jiang Liu
348858] entry_SYSCALL_64_after_hwframe+0x76/0x7e 2024-12-26 16:17:46 [16002.354956] RIP: 0033:0x7f2736a620cb-12-26 Fix it by unplugging the xcp drm devices when probing the GPU device fails. Signed-off-by: Jiang Liu Tested-by: Shuo Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 4 +++- drivers/gpu/drm/amd/

[PATCH 6/6] amdgpu: get rid of false warnings caused by amdgpu_irq_put()

2025-01-02 Thread Jiang Liu
301408] ret_from_fork+0x1f/0x30 [ 1209.301410] ---[ end trace 733f120fe2ab13e5 ]--- [ 1209.301418] [ cut here ] Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 9 +++-- drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 1 + 2 files changed, 8 insertions

[PATCH 1/6] amdgpu: add flags to track sysfs initialization status

2025-01-02 Thread Jiang Liu
Add flags to track sysfs initialization status, so we can correctly clean them up on error recovery paths. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu.h| 3 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 34 +- 2 files changed, 30 insertions(+), 7
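
The general pattern behind this patch, recording what was actually created so that cleanup removes exactly that and nothing more, might look roughly like the sketch below. Everything here is hypothetical (the `sysfs_demo_*` names, the two groups, the `live_groups` counter standing in for real sysfs objects). The patch's actual flag names are not visible in the snippet.

```c
#include <stdbool.h>

/* Hypothetical device state: one flag per sysfs group that was successfully created. */
struct sysfs_demo_dev {
	bool group_a_ready;
	bool group_b_ready;
	int live_groups; /* stand-in for objects currently registered */
};

/* Stand-in for a create call that may fail. */
int sysfs_demo_create(struct sysfs_demo_dev *d, bool fail)
{
	if (fail)
		return -1;
	d->live_groups++;
	return 0;
}

void sysfs_demo_remove(struct sysfs_demo_dev *d)
{
	d->live_groups--;
}

/* Init sets each flag only after the matching create call succeeded. */
int sysfs_demo_init(struct sysfs_demo_dev *d, bool fail_second)
{
	if (sysfs_demo_create(d, false))
		return -1;
	d->group_a_ready = true;

	if (sysfs_demo_create(d, fail_second))
		return -1; /* caller runs sysfs_demo_fini(), which consults the flags */
	d->group_b_ready = true;
	return 0;
}

/* Fini removes only what the flags say exists, so it is safe after a partial init. */
void sysfs_demo_fini(struct sysfs_demo_dev *d)
{
	if (d->group_b_ready) {
		sysfs_demo_remove(d);
		d->group_b_ready = false;
	}
	if (d->group_a_ready) {
		sysfs_demo_remove(d);
		d->group_a_ready = false;
	}
}
```

Clearing each flag during cleanup also makes a second call to the fini routine harmless, which matters when several error paths can reach it.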