348858] entry_SYSCALL_64_after_hwframe+0x76/0x7e
2024-12-26 16:17:46 [16002.354956] RIP: 0033:0x7f2736a620cb-12-26
Fix it by removing xcp drm devices when GPU device probing fails.
Signed-off-by: Jiang Liu
Tested-by: Shuo Liu
Reviewed-by: Lijo Lazar
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
drive
301408] ret_from_fork+0x1f/0x30
[ 1209.301410] ---[ end trace 733f120fe2ab13e5 ]---
[ 1209.301418] [ cut here ]
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 9 +++--
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 1 +
2 files changed, 8 insertions
Enhance error handling in function amdgpu_pci_probe() to avoid
possible resource leakage.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 12 +---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
b/drivers/gpu
Function detects initialization status by checking sched->ops, so set
sched->ops to non-NULL just before return in function
amdgpu_fence_driver_sw_fini() and amdgpu_device_init_schedulers()
to avoid possible invalid memory access on the error recovery path.
Signed-off-by: Jiang Liu
---
drive
0
[ 1802.213878]
[ 1802.213879] ---[ end trace ]---
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 +
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu
amdxcp
driver to better support device remove and error handling.
4) reworked patch 0005 to fix it in amdgpu instead of drm core.
Jiang Liu (6):
drm/amdgpu: clear adev->in_suspend flag when fails to suspend
drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()
drm/amdgpu: fix use af
Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific
drm_device created by amdgpu_xcp_drm_dev_alloc(), which will be used
to do error recovery.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 76 +
drivers/gpu/drm/amd/amdxcp
Function psp_init_cap_microcode() should bail out when it fails to
load firmware, otherwise it may cause invalid memory access.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd
Enhance psp_ta_init_shared_buf() to check whether the shared buffer has
already been allocated, and return success if it has, so callers
don't need to check the initialized flag.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 53 ++-
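The idempotent-init idea described in the psp_ta_init_shared_buf() snippet above can be sketched in userspace C. This is only an illustration under assumptions: struct shared_buf and shared_buf_init() are hypothetical names, and calloc() stands in for whatever allocator the driver actually uses.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a TA shared buffer; not the driver's struct. */
struct shared_buf {
    void   *addr;  /* NULL until the buffer has been allocated */
    size_t  size;
};

/*
 * Allocate the buffer on first use. If it is already allocated, report
 * success without touching it, so callers need no separate
 * "initialized" flag. Returns 0 on success, -1 on allocation failure.
 */
int shared_buf_init(struct shared_buf *b, size_t size)
{
    if (b->addr)
        return 0;               /* already allocated: nothing to do */

    b->addr = calloc(1, size);
    if (!b->addr)
        return -1;
    b->size = size;
    return 0;
}
```

Calling shared_buf_init() twice on the same object leaves the first allocation in place, which is what lets the callers drop their own checks.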
Fix some bugs in error handling path in psp subsystem:
1) fix possible bugs in error handling path in psp_sw_init()
2) fix a bug in error handling path in psp_init_cap_microcode()
3) reduce duplicated code related to psp_ta_init_shared_buf()
Jiang Liu (4):
drm/amdgpu: reset psp->cmd to N
Enhance error handling in function psp_sw_init() by:
1) bail out when failed to allocate memory
2) release allocated resource on error
3) introduce helper function psp_bo_init()
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 84 -
1 file changed
Reset psp->cmd to NULL after releasing the buffer in function psp_sw_fini().
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
b/drivers/gpu/drm/amd/amd
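Why resetting the pointer matters can be shown with a small userspace sketch. The names below (struct cmd_ctx, cmd_ctx_init(), cmd_ctx_fini()) are hypothetical and only mirror the pattern described for psp_sw_fini(); free() stands in for the kernel's release call.

```c
#include <stdlib.h>

/* Hypothetical context holding one heap-allocated command buffer. */
struct cmd_ctx {
    void *cmd;
};

/* Allocate the command buffer. Returns 0 on success, -1 on failure. */
int cmd_ctx_init(struct cmd_ctx *c)
{
    c->cmd = calloc(1, 64);
    return c->cmd ? 0 : -1;
}

/*
 * Release the buffer and clear the pointer. With the pointer cleared,
 * a second call (for example from an error path that runs fini again)
 * only does free(NULL), which is a no-op, instead of a double free.
 */
void cmd_ctx_fini(struct cmd_ctx *c)
{
    free(c->cmd);
    c->cmd = NULL;
}
```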
Minor code style enhancement for smu.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c| 2 +-
drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
Fix several bugs in smu subsystem:
1) a buffer overflow bug in function smu_sys_set_pp_table()
2) tune logic of is_vcn_enabled()
3) enhance handling of gfx_off_entrycount in function smu_suspend()
Jiang Liu (4):
drm/amdgpu: avoid buffer overflow attack in smu_sys_set_pp_table()
drm/amdgpu
As pwfw resets entrycount when the device is suspended, we should
accumulate the gfx_off_entrycount value instead of saving only its
last value.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers
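The accumulation described in the gfx_off_entrycount snippet above can be sketched as follows. This is a minimal model, not the driver's code: struct gfxoff_stats and both helpers are hypothetical, and fw_count stands for the counter value read back from firmware.

```c
#include <stdint.h>

/* Hypothetical bookkeeping kept by the driver across suspend cycles. */
struct gfxoff_stats {
    uint64_t saved_entrycount;  /* sum of counts from earlier cycles */
};

/*
 * Called at suspend time. The firmware counter restarts from zero after
 * suspend, so the current reading is added to the saved total rather
 * than overwriting it.
 */
void gfxoff_on_suspend(struct gfxoff_stats *s, uint32_t fw_count)
{
    s->saved_entrycount += fw_count;
}

/* Total entries so far: saved cycles plus the live firmware reading. */
uint64_t gfxoff_total(const struct gfxoff_stats *s, uint32_t fw_count)
{
    return s->saved_entrycount + fw_count;
}
```

With plain assignment in gfxoff_on_suspend(), the second suspend would discard the count recorded at the first one.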
If a malicious user provides a small pptable through sysfs and then
a bigger pptable, it may cause a buffer overflow in function
smu_sys_set_pp_table().
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git
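The overflow scenario in the smu_sys_set_pp_table() snippet above (a small table followed by a bigger one) can be illustrated with a userspace sketch. Everything here is an assumption for illustration: struct pp_table_state and pp_table_store() are hypothetical, and the actual patch may guard the copy differently.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the driver's table state; not the real struct. */
struct pp_table_state {
    void   *buf;   /* cached copy of the user-supplied table */
    size_t  size;  /* size of the allocation behind buf */
};

/*
 * Store a user-supplied table. The buffer is reallocated whenever the
 * incoming size differs from the cached one, so a large table is never
 * copied into an allocation sized for an earlier, smaller table.
 * Returns 0 on success, -1 on bad input or allocation failure.
 */
int pp_table_store(struct pp_table_state *st, const void *data, size_t size)
{
    if (!st || !data || size == 0)
        return -1;

    if (!st->buf || st->size != size) {
        void *nbuf = malloc(size);

        if (!nbuf)
            return -1;          /* keep the old table on failure */
        free(st->buf);
        st->buf = nbuf;
        st->size = size;
    }

    memcpy(st->buf, data, size);
    return 0;
}
```

The bug pattern is the version without the size comparison: the buffer is allocated once for the first table and reused for every later write regardless of length.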
Function is_vcn_enabled() returns false if either the VCN or JPEG IP
block is disabled, which is unreasonable. It should return true
when either VCN or JPEG is enabled.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 6 +++---
1 file changed, 3 insertions(+), 3
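The intended "either one is enough" logic from the is_vcn_enabled() snippet above can be sketched like this. The enum, struct, and function names are hypothetical stand-ins, not the driver's definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical IP block types and descriptor. */
enum ip_type { IP_GFX, IP_VCN, IP_JPEG };

struct ip_block {
    enum ip_type type;
    bool enabled;
};

/*
 * Return true as soon as an enabled VCN or JPEG block is found.
 * Return false only when no such block is enabled.
 */
bool media_block_enabled(const struct ip_block *blocks, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        bool is_media = blocks[i].type == IP_VCN ||
                        blocks[i].type == IP_JPEG;

        if (is_media && blocks[i].enabled)
            return true;
    }
    return false;
}
```

The old behaviour described in the snippet corresponds to returning false from inside the loop at the first disabled VCN or JPEG block, which hides an enabled sibling.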
Add helper functions to track status for ras manager and ip blocks.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 38 +
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 38 +
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 10
amdgpu_nbio_ras_early_fini() to undo work done by
amdgpu_nbio_ras_late_init().
2) remove call of amdgpu_irq_put in _hw_fini().
3) record the status where reference count is held for specific irq.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.c | 16 +++-
drivers/gpu/drm/amd
The adev->ip_blocks array is not indexed by AMD_IP_BLOCK_TYPE_xxx,
instead we should call amdgpu_device_ip_get_ip_block() to get the
corresponding IP block object.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 8 ++--
1 file changed, 6 insertions(+), 2 deleti
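The difference between indexing the array by type and searching it by type, as described in the adev->ip_blocks snippet above, can be sketched as follows. The enum values, struct, and find_ip_block() are hypothetical; in the driver the lookup is done by amdgpu_device_ip_get_ip_block(), whose exact signature is not shown in this listing.

```c
#include <stddef.h>

/* Hypothetical block types; the values are not array positions. */
enum block_type { BLOCK_COMMON, BLOCK_GMC, BLOCK_PSP, BLOCK_SMC };

struct ip_block {
    enum block_type type;
    int version;
};

/*
 * Search the array for a block of the requested type. The array is
 * ordered by initialization sequence, so blocks[type] may be a
 * different block or out of range. Returns NULL when not present.
 */
const struct ip_block *find_ip_block(const struct ip_block *blocks,
                                     size_t n, enum block_type type)
{
    for (size_t i = 0; i < n; i++) {
        if (blocks[i].type == type)
            return &blocks[i];
    }
    return NULL;
}
```

For an array ordered { PSP, COMMON, GMC }, blocks[BLOCK_PSP] would be the GMC entry, while find_ip_block() returns the first element.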
_fini.
There's one more task left to analyze GPU reset related state machine
transitions.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c| 22 +--
.../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +++
2 files changed, 23 insertions(+), 2 deletion
_init(amdgpu_irq_get), but
sdma_v4_4_2_xcp_suspend() invokes amdgpu_irq_put(), thus causes
unbalanced irq reference count. Fix it by calling amdgpu_irq_get()
in function sdma_v4_4_2_xcp_resume().
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 +-
drivers/gpu/drm/amd/a
().
3) call xgpu_nv_mailbox_put_irq() for nv.c to avoid possible resource
leakage.
4) use flags to track irq reference count usage.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/nv.c| 14 +++-
drivers/gpu/drm/amd/amdgpu/soc15.c | 22 +++
drivers/gpu/drm/amd
d-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++-
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c| 45 ++
2 files changed, 32 insertions(+), 19 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_
Enhance amdgpu_ras_block_late_fini() to revert what has been done
by amdgpu_ras_block_late_init(), and fix a possible resource leakage
in function amdgpu_ras_block_late_init().
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 16 ++--
1 file changed, 10
Rename amdgpu_ras_pre_fini() to amdgpu_ras_early_fini(), to keep the
same style as other code.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 +++---
drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c| 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c | 2 +-
drivers/gpu
etc, to follow the new design. Currently we have only taken the
nbio and asic as examples to show the proposed changes. Once we have
confirmed that's the right way to go, we will handle the remaining
subsystems.
This is at an early stage and is a request for comments; any comments and
suggestions
Free all allocated resources on error recovery path in function
amdgpu_ras_init().
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 19 ++-
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gpu
_init:sw = true
hw_init:hw = true
late_init: late_initialized = true
early_fini: late_initialized = false
hw_fini:hw = false
sw_fini:sw = false
late_fini: valid = false
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 -
Currently we track the refcount on ras block object for features by
checking `if (obj && amdgpu_ras_is_feature_enabled(adev, head))`,
which is a little unreliable. So introduce a dedicated flag to track
the reference count.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/
Add a flag to track ras debugfs creation status, to avoid possible
incorrect reference count management for ras block object in function
amdgpu_ras_aca_is_supported().
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9
Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific
drm_device created by amdgpu_xcp_drm_dev_alloc(), which will be used
to do error recovery.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 65 +
drivers/gpu/drm/amd/amdxcp
Enhance error handling in function amdgpu_pci_probe() to avoid
possible resource leakage.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 12 +---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
b/drivers/gpu
+0x176/0x310
[16002.344324] do_syscall_64+0x5d/0x170
[16002.348858] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[16002.354956] RIP: 0033:0x7f2736a620cb-12-26
Fix it by removing xcp drm devices when GPU device probing fails.
Signed-off-by: Jiang Liu
Tested-by: Shuo Liu
Reviewed-by: Lijo Lazar
0
[ 1802.213878]
[ 1802.213879] ---[ end trace ]---
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 +
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu
Function detects initialization status by checking sched->ops, so set
sched->ops to non-NULL just before return in function
amdgpu_fence_driver_sw_fini() and amdgpu_device_init_schedulers()
to avoid possible invalid memory access on the error recovery path.
Signed-off-by: Jiang Liu
---
drive
drm core.
Jiang Liu (5):
drm/amdgpu: clear adev->in_suspend flag when fails to suspend
drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()
drm/amdgpu: fix use after free bug related to
amdgpu_driver_release_kms()
drm/amdgpu: enhance error handling in function amdgpu_pci_pr
+0x176/0x310
[16002.344324] do_syscall_64+0x5d/0x170
[16002.348858] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[16002.354956] RIP: 0033:0x7f2736a620cb-12-26
Fix it by removing xcp drm devices when GPU device probing fails.
Signed-off-by: Jiang Liu
Tested-by: Shuo Liu
Reviewed-by: Lijo Lazar
Introduce amdgpu_device_fini_schedulers() to clean scheduler related
resources, and avoid possible invalid memory access.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 29 +++---
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 9 ---
2 files changed
is unnecessary.
3) add amdgpu_xcp_drm_dev_free() in patch 0003 to enhance amdxcp
driver to better support device remove and error handling.
4) reworked patch 0005 to fix it in amdgpu instead of drm core.
Jiang Liu (5):
drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()
drm/amdgpu: fix
a/0x30
[ 90.092742]
[ 90.252277] ---[ end trace ]---
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
index 272954
Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific
drm_device created by amdgpu_xcp_drm_dev_alloc(), which will be used
to do error recovery.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 63 +
drivers/gpu/drm/amd/amdxcp
Enhance error handling in function amdgpu_pci_probe() to avoid
possible resource leakage.
Signed-off-by: Jiang Liu
Reviewed-by: Mario Limonciello
---
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 12 +---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd
: amdgpu_device_ip_resume failed
(-110).
[ 555.126965] PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -110
[ 555.126966] PM: Device :0a:00.0 failed to resume async: error -110
This fix has been tested on Mi308X.
Signed-off-by: Jiang Liu
Tested-by: Shuo Liu
---
drivers/gpu/drm/amd/amdgpu/soc15.c
amdgpu_nbio_ras_early_fini() to undo work done by
amdgpu_nbio_ras_late_init().
2) remove call of amdgpu_irq_put in _hw_fini().
3) record the status where reference count is held for specific irq.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.c | 16 +++-
drivers/gpu/drm/amd
Walk IP blocks in reverse order in functions amdgpu_device_ip_fini_early()
and amdgpu_device_smu_fini_early(), to stay consistent with the other
fini functions.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff
Currently we track the refcount on ras block object for features by
checking `if (obj && amdgpu_ras_is_feature_enabled(adev, head))`,
which is a little unreliable. So introduce a dedicated flag to track
the reference count.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/
` callback.
4) call amdgpu_ras_fini() before invoking ip_blocks[i].late_fini.
There's one more task left to analyze GPU reset related state machine
transitions.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 --
1 file changed, 20 inser
Introduce following IP block iterators to reduce duplicated code:
- amdgpu_for_each_ip_block
- amdgpu_for_each_ip_block_reverse
- amdgpu_for_each_ip_block_valid
- amdgpu_for_each_ip_block_valid_reverse
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/aldebaran.c| 46
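How iterator macros of this kind cut duplicated loops can be sketched in plain C. The macro and struct names below are hypothetical and deliberately differ from the amdgpu_for_each_ip_block family named above, whose real definitions are not visible in this listing.

```c
#include <stdbool.h>

/* Hypothetical device with a small fixed array of IP blocks. */
struct ip_block {
    bool valid;
    int  id;
};

struct device {
    struct ip_block blocks[8];
    int num_blocks;
};

/* Forward walk: i is a signed int index, blk points at blocks[i]. */
#define for_each_block(dev, i, blk) \
    for ((i) = 0; \
         (i) < (dev)->num_blocks && ((blk) = &(dev)->blocks[(i)], 1); \
         (i)++)

/* Reverse walk, used for teardown in the opposite order of init. */
#define for_each_block_reverse(dev, i, blk) \
    for ((i) = (dev)->num_blocks - 1; \
         (i) >= 0 && ((blk) = &(dev)->blocks[(i)], 1); \
         (i)--)

/* Variants that skip blocks whose valid flag is clear. */
#define for_each_block_valid(dev, i, blk) \
    for_each_block(dev, i, blk) \
        if (!(blk)->valid) {} else

#define for_each_block_valid_reverse(dev, i, blk) \
    for_each_block_reverse(dev, i, blk) \
        if (!(blk)->valid) {} else

/* Count valid blocks using the forward iterator. */
int count_valid(struct device *dev)
{
    struct ip_block *blk;
    int i, n = 0;

    for_each_block_valid(dev, i, blk)
        n++;
    return n;
}

/* Id of the last valid block, found by walking in reverse; -1 if none. */
int last_valid_id(struct device *dev)
{
    struct ip_block *blk;
    int i;

    for_each_block_valid_reverse(dev, i, blk)
        return blk->id;
    return -1;
}
```

The "if (...) {} else" tail lets the macro be followed by a single statement or a braced body, the same way a plain for loop is used.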
() AMDGPU_IP_STATE_SW
.sw_fini() AMDGPU_IP_STATE_EARLY
.late_fini() AMDGPU_IP_STATE_INVALID
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/aldebaran.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 48 ++-
drivers/gpu/drm/amd/amdgpu
true
sw_init:sw = true
hw_init:hw = true
late_init: late_initialized = true
early_fini: late_initialized = false
hw_fini:hw = false
sw_fini:sw = false
late_fini: valid = false
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.
Rename amdgpu_ras_pre_fini() to amdgpu_ras_early_fini(), to keep the
same style as other code.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 +++---
drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c| 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c | 2 +-
drivers/gpu
().
3) call xgpu_nv_mailbox_put_irq() for nv.c to avoid possible resource
leakage.
4) use flags to track irq reference count usage.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/nv.c| 14 +++-
drivers/gpu/drm/amd/amdgpu/soc15.c | 22 +++
drivers/gpu/drm/amd
- refine the way to define status markers
- split amdgpu_dm related change into a dedicated patch
- add patch 13 to walk ip blocks in reverse order when shutdown
Jiang Liu (15):
drm/amdgpu: add helper functions to track status for ras manager
drm/amdgpu: add a flag to track ras debugfs creation
d-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++-
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c| 44 +-
2 files changed, 31 insertions(+), 19 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_
Enhance amdgpu_ras_block_late_fini() to revert what has been done
by amdgpu_ras_block_late_init(), and fix a possible resource leakage
in function amdgpu_ras_block_late_init().
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 16 ++--
1 file changed, 10
Add helper functions to track status for ras manager and ip blocks.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 38 +
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 37
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 10 +++
3
Free all allocated resources on error recovery path in function
amdgpu_ras_init().
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 19 ++-
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gpu
Enhance amdgpu_dm_early_fini() so it can be called in power
management operations.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
b/drivers/gpu/drm/amd
Add a flag to track ras debugfs creation status, to avoid possible
incorrect reference count management for ras block object in function
amdgpu_ras_aca_is_supported().
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9
When resuming on a different SR-IOV vGPU device, the VRAM base addresses
may have changed, so we need to update those cached addresses.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +++
drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h| 6 --
drivers/gpu
so we can't test
our hypothesis. And we are not sure whether there are still other
blockers to enabling resume with different AMD SR-IOV vGPUs.
Help is needed to identify more task items to enable resume with
different AMD SR-IOV vGPUs:)
Jiang Liu (2):
drm/amdgpu: update cached vram base addres
Introduce helper amdgpu_bo_get_pinned_gpu_addr(), which will be
used to update GPU address of pinned kernel BO during resume.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 9 +
drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 +
drivers/gpu/drm/amd/amdgpu
0
[ 1802.213878]
[ 1802.213879] ---[ end trace ]---
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 +
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu
Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific
drm_device created by amdgpu_xcp_drm_dev_alloc(), which will be used
to do error recovery.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c | 11 +++-
drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h | 1
348858] entry_SYSCALL_64_after_hwframe+0x76/0x7e
2024-12-26 16:17:46 [16002.354956] RIP: 0033:0x7f2736a620cb-12-26
Fix it by unplugging xcp drm devices when GPU device probing fails.
Signed-off-by: Jiang Liu
Tested-by: Shuo Liu
Reviewed-by: Lijo Lazar
---
drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 4 +++-
301408] ret_from_fork+0x1f/0x30
[ 1209.301410] ---[ end trace 733f120fe2ab13e5 ]---
[ 1209.301418] [ cut here ]
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 9 +++--
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 1 +
2 files changed, 8 insertions
Function detects initialization status by checking sched->ops, so set
sched->ops to non-NULL just before return in function drm_sched_init()
to avoid possible invalid memory access on the error recovery path.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
drive
Fix possible resource leakage on error recovery path in function
kgd2kfd_device_init().
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdkfd/kfd_device.c | 9 +
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
b/drivers/gpu/drm
amd-staging-drm-next.
2) removed the first patch, which is unnecessary.
3) add amdgpu_xcp_drm_dev_free() in patch 0003 to enhance amdxcp
driver to better support device remove and error handling.
4) reworked patch 0005 to fix it in amdgpu instead of drm core.
Jiang Liu (6):
amdgpu: fix invalid
+0x176/0x310
[16002.344324] do_syscall_64+0x5d/0x170
[16002.348858] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[16002.354956] RIP: 0033:0x7f2736a620cb-12-26
Fix it by removing xcp drm devices when GPU device probing fails.
Signed-off-by: Jiang Liu
Tested-by: Shuo Liu
Reviewed-by: Lijo Lazar
amdgpu_xcp_drm_dev_free() in patch 0003 to enhance amdxcp
driver to better support device remove and error handling.
4) reworked patch 0005 to fix it in amdgpu instead of drm core.
Jiang Liu (5):
drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()
drm/amdgpu: fix use after free bug related to
Introduce amdgpu_device_fini_schedulers() to clean scheduler related
resources, and avoid possible invalid memory access.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 35 +++---
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 9 --
2 files changed
a/0x30
[ 90.092742]
[ 90.252277] ---[ end trace ]---
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
index 272954
Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific
drm_device created by amdgpu_xcp_drm_dev_alloc(), which will be used
to do error recovery.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 63 +
drivers/gpu/drm/amd/amdxcp
Enhance error handling in function amdgpu_pci_probe() to avoid
possible resource leakage.
Signed-off-by: Jiang Liu
Reviewed-by: Mario Limonciello
---
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 12 +---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd
Function detects initialization status by checking sched->ops, so set
sched->ops to non-NULL just before return in function drm_sched_init()
to avoid possible invalid memory access on the error recovery path.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/scheduler/sched_main.c | 3 +++
1 file c
Add flags to track sysfs initialization status, so we can correctly
clean them up on error recovery paths.
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h| 3 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 34 +-
2 files changed, 30 insertions(+), 7
301408] ret_from_fork+0x1f/0x30
[ 1209.301410] ---[ end trace 733f120fe2ab13e5 ]---
[ 1209.301418] [ cut here ]
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 9 +++--
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 1 +
2 files changed, 8 insertions
348858] entry_SYSCALL_64_after_hwframe+0x76/0x7e
2024-12-26 16:17:46 [16002.354956] RIP: 0033:0x7f2736a620cb-12-26
Fix it by unplugging xcp drm devices when GPU device probing fails.
Signed-off-by: Jiang Liu
Tested-by: Shuo Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 4 +++-
drivers/gpu/drm/amd/
0
[ 1802.213878]
[ 1802.213879] ---[ end trace ]---
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 +
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu
This patchset tries to fix several memory leaks and invalid memory
accesses on error handling paths during GPU driver loading/unloading.
The patches apply to:
https://github.com/ROCm/ROCK-Kernel-Driver/tree/master/drivers
Jiang Liu (6):
amdgpu: add flags to track sysfs initialization status
amdgpu
01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0
00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 6d 19 00 f7 d8 64 89 01
48
Signed-off-by: Jiang Liu
---
drivers/gpu/drm/amd/amdkfd/kfd_device.c | 7 ++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --g