> 2025年1月3日 07:08,Chen, Xiaogang <xiaogang.c...@amd.com> 写道: > > > > On 1/1/2025 11:36 PM, Jiang Liu wrote: >> On error recover path during device probe, it may trigger invalid >> memory access as below: >> 024-12-25 12:00:53 [ 2703.773040] general protection fault, probably for >> non-canonical address 0x52445f4749464e4f: 0000 [#1] SMP NOPTI >> 2024-12-25 12:00:53 [ 2703.785199] CPU: 157 PID: 151951 Comm: rmmod Kdump: >> loaded Tainted: G W OE 5.10.134-17.2.al8.x86_64 #1 >> 2024-12-25 12:00:53 [ 2703.797515] Hardware name: Alibaba Alibaba Cloud >> ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024 >> 2024-12-25 12:00:53 [ 2703.809188] RIP: 0010:kgd2kfd_device_exit+0x6/0x60 >> [amdgpu] >> 2024-12-25 12:00:53 [ 2703.816136] Code: ff 48 c7 83 38 01 00 00 80 45 e4 c0 >> c7 83 40 01 00 00 08 0f 00 00 e9 cd fa ff ff 66 0f 1f 84 00 00 00 00 00 0f >> 1f 44 00 00 55 <80> bf 28 01 00 00 00 48 89 fd 75 09 48 89 ef 5d e9 65 df 9d >> f4 8b >> 2024-12-25 12:00:54 [ 2703.838622] RSP: 0018:ffffb5313df07e10 EFLAGS: >> 00010202 >> 2024-12-25 12:00:54 [ 2703.845216] RAX: 0000000000000000 RBX: >> ffff97ad689a3ff0 RCX: 0000000080400014 >> 2024-12-25 12:00:54 [ 2703.853935] RDX: 0000000080400015 RSI: >> ffff97ad627e93d8 RDI: 52445f4749464e4f >> 2024-12-25 12:00:54 [ 2703.862652] RBP: ffff97ad689a3ff0 R08: >> 0000000000000000 R09: ffffffffb5814c00 >> 2024-12-25 12:00:54 [ 2703.871368] R10: ffff97ad627e9280 R11: >> 0000000000000001 R12: ffffb5313df07e98 >> 2024-12-25 12:00:54 [ 2703.880068] R13: ffff97ad689a1810 R14: >> 0000000000000001 R15: 0000000000000000 >> 2024-12-25 12:00:54 [ 2703.888768] FS: 00007fa4db81e740(0000) >> GS:ffff98a93ec80000(0000) knlGS:0000000000000000 >> 2024-12-25 12:00:54 [ 2703.898547] CS: 0010 DS: 0000 ES: 0000 CR0: >> 0000000080050033 >> 2024-12-25 12:00:54 [ 2703.905684] CR2: 00007f4502deca00 CR3: >> 000001008fc50001 CR4: 0000000002770ee0 >> 2024-12-25 12:00:54 [ 2703.914382] DR0: 0000000000000000 DR1: >> 0000000000000000 DR2: 0000000000000000 >> 2024-12-25 12:00:54 [ 2703.923066] DR3: 0000000000000000 DR6: >> 00000000fffe07f0 DR7: 0000000000000400 >> 2024-12-25 12:00:54 [ 2703.931746] PKRU: 55555554 >> 2024-12-25 12:00:54 [ 2703.935444] Call Trace: >> 2024-12-25 12:00:54 [ 2703.938962] amdgpu_amdkfd_device_fini_sw+0x1a/0x40 >> [amdgpu] >> 2024-12-25 12:00:54 [ 2703.946080] amdgpu_device_ip_fini.isra.0+0x3d/0x1b0 >> [amdgpu] >> 2024-12-25 12:00:54 [ 2703.953278] amdgpu_device_fini_sw+0x2a/0x240 [amdgpu] >> 2024-12-25 12:00:54 [ 2703.959789] amdgpu_driver_release_kms+0x12/0x30 >> [amdgpu] >> 2024-12-25 12:00:54 [ 2703.966501] devm_drm_dev_init_release+0x42/0x70 [drm] >> 2024-12-25 12:00:54 [ 2703.972891] release_nodes+0x6e/0xb0 >> 2024-12-25 12:00:54 [ 2703.977522] amdgpu_xcp_drv_release+0x38/0x80 [amdxcp] >> 2024-12-25 12:00:54 [ 2703.983906] >> __do_sys_delete_module.constprop.0+0x138/0x2a0 >> 2024-12-25 12:00:54 [ 2703.990775] ? exit_to_user_mode_loop+0xd6/0x120 >> 2024-12-25 12:00:54 [ 2703.996563] do_syscall_64+0x2e/0x50 >> 2024-12-25 12:00:54 [ 2704.001166] entry_SYSCALL_64_after_hwframe+0x62/0xc7 >> 2024-12-25 12:00:54 [ 2704.007432] RIP: 0033:0x7fa4db2620cb >> 2024-12-25 12:00:54 [ 2704.012025] Code: 73 01 c3 48 8b 0d a5 6d 19 00 f7 d8 >> 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 >> 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 6d 19 00 f7 d8 64 89 >> 01 48 >> >> Signed-off-by: Jiang Liu <ge...@linux.alibaba.com> >> <mailto:ge...@linux.alibaba.com> >> --- >> drivers/gpu/drm/amd/amdkfd/kfd_device.c | 7 ++++++- >> 1 file changed, 6 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c >> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c >> index b6c5ffd4630b..58c1b5fcf785 100644 >> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c >> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c >> @@ -663,6 +663,8 @@ static void kfd_cleanup_nodes(struct kfd_dev *kfd, >> unsigned int num_nodes) >> >> for (i = 0; i < num_nodes; i++) { >> knode = kfd->nodes[i]; >> + if (!knode) >> + continue; > I wonder how knode can be null here? kfd_cleanup_nodes is called by > kgd2kfd_device_exit under condition if (kfd->init_complete). > kfd->init_complete is true only after all kfd node got initialized at > kgd2kfd_device_init. If kfd driver init fail kfd->init_complete would be > false, then kfd_cleanup_node would not get called. > Hi Xiaogang, It may get triggered on error handling path of ‘kid2kfd_device_init()`, when it jump to label `node_alloc_error` and then call `kfd_cleanup_nodes()`.
Thanks, Gerry > > Regards > > Xiaogang > >> device_queue_manager_uninit(knode->dqm); >> kfd_interrupt_exit(knode); >> kfd_topology_remove_device(knode); >> @@ -954,7 +956,10 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd) >> kfd_doorbell_fini(kfd); >> ida_destroy(&kfd->doorbell_ida); >> kfd_gtt_sa_fini(kfd); >> - amdgpu_amdkfd_free_gtt_mem(kfd->adev, &kfd->gtt_mem); >> + if (kfd->gtt_mem) { >> + amdgpu_amdkfd_free_gtt_mem(kfd->adev, &kfd->gtt_mem); >> + kfd->gtt_mem = NULL; >> + } >> } >> >> kfree(kfd);