[AMD Official Use Only - AMD Internal Distribution Only]

Hi @Yang, Philip,

Here is the log:
[ 2 17 16:41:58 2026 < 24.787659>] [drm:amddrm_sched_entity_push_job [amd_sched]] *ERROR* Trying to push to a killed entity
[ 2 17 16:42:43 2026 < 0.000000>] amdgpu 0000:00:08.0: clean up the vf2pf work item
[ 2 17 16:42:51 2026 < 7.951077>] amdgpu 0000:00:08.0: ring sdma0 timeout, signaled seq=2567, emitted seq=2568
[ 2 17 16:42:51 2026 < 0.001734>] amdgpu 0000:00:08.0: Process quark pid 8325 thread quark pid 8328
[ 2 17 16:42:51 2026 < 0.001475>] amdgpu 0000:00:08.0: GPU reset begin!. Source: 1
[ 2 17 16:42:51 2026 < 0.000026>] amdgpu 0000:00:08.0: Suspending all queues failed
[ 2 17 16:42:51 2026 < 0.008520>] amdgpu 0000:00:08.0: [drm] PCIE GART of 512M enabled (table at 0x000000800D300000).
[ 2 17 16:42:51 2026 < 0.204760>] amdgpu 0000:00:08.0: GPU reset(6) succeeded!
[ 2 17 16:42:51 2026 < 0.000010>] amdgpu 0000:00:08.0: [drm] device wedged, but recovered through reset
[ 2 17 16:44:02 2026 < 0.000000>] INFO: task kworker/10:3:7194 blocked for more than 122 seconds.
[ 2 17 16:44:02 2026 < 0.001502>] Tainted: G OE 6.8.0-90-generic #91~22.04.1-Ubuntu
[ 2 17 16:44:02 2026 < 0.001533>] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2 17 16:44:02 2026 < 0.001597>] task:kworker/10:3 state:D stack:0 pid:7194 tgid:7194 ppid:2 flags:0x00004000
[ 2 17 16:44:02 2026 < 0.000006>] Workqueue: events_freezable svm_range_restore_work [amdgpu]
[ 2 17 16:44:02 2026 < 0.000282>] Call Trace:
[ 2 17 16:44:02 2026 < 0.000002>]  <TASK>
[ 2 17 16:44:02 2026 < 0.000004>]  __schedule+0x27c/0x6a0
[ 2 17 16:44:02 2026 < 0.000008>]  schedule+0x33/0x110
[ 2 17 16:44:02 2026 < 0.000003>]  schedule_timeout+0x157/0x170
[ 2 17 16:44:02 2026 < 0.000005>]  dma_fence_default_wait+0x13d/0x210
[ 2 17 16:44:02 2026 < 0.000004>]  ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 2 17 16:44:02 2026 < 0.000003>]  dma_fence_wait_timeout+0x116/0x140
[ 2 17 16:44:02 2026 < 0.000003>]  svm_range_validate_and_map+0xf7c/0x19c0 [amdgpu]
[ 2 17 16:44:02 2026 < 0.000218>]  svm_range_restore_work+0xe5/0x340 [amdgpu]
[ 2 17 16:44:02 2026 < 0.000197>]  process_one_work+0x181/0x3a0
[ 2 17 16:44:02 2026 < 0.000005>]  worker_thread+0x306/0x440
[ 2 17 16:44:02 2026 < 0.000003>]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 2 17 16:44:02 2026 < 0.000004>]  ? _raw_spin_lock_irqsave+0xe/0x20
[ 2 17 16:44:02 2026 < 0.000003>]  ? __pfx_worker_thread+0x10/0x10
[ 2 17 16:44:02 2026 < 0.000002>]  kthread+0xef/0x120
[ 2 17 16:44:02 2026 < 0.000005>]  ? __pfx_kthread+0x10/0x10
[ 2 17 16:44:02 2026 < 0.000003>]  ret_from_fork+0x44/0x70
[ 2 17 16:44:02 2026 < 0.000004>]  ? __pfx_kthread+0x10/0x10
[ 2 17 16:44:02 2026 < 0.000003>]  ret_from_fork_asm+0x1b/0x30
[ 2 17 16:44:02 2026 < 0.000005>]  </TASK>
[ 2 17 16:46:05 2026 <122.874606>] INFO: task kworker/10:3:7194 blocked for more than 245 seconds.

The drm sched entity can be destroyed by amdgpu_flush if the process is killed forcibly, even while the vm refcount is not zero.
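The scenario above, where the sched entity is torn down by amdgpu_flush while the vm refcount is still non-zero so a later svm worker update hits a killed entity and then blocks on a fence that never signals, can be sketched as a minimal standalone model. Every type, name, and return code below is a simplified stand-in for illustration, not the real amdgpu API:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins; not the real amdgpu types. */
struct entity { bool killed; };
struct vm_model {
	int refcount;           /* still non-zero at flush time */
	struct entity *delayed; /* the vm's sched entity */
};

enum { UPDATE_OK = 0, UPDATE_BLOCKED = 1, UPDATE_EINVAL = -22 };

/* Models amdgpu_flush on a forced kill: the sched entity is marked
 * killed even though the vm refcount has not dropped to zero. */
void forced_kill_flush(struct vm_model *vm)
{
	vm->delayed->killed = true;
}

/* Models the unguarded page-table update: pushing a job to a killed
 * entity yields a fence that never signals, so the worker blocks
 * (the hung kworker in the log above). */
int worker_update(struct vm_model *vm)
{
	if (vm->delayed->killed)
		return UPDATE_BLOCKED;
	return UPDATE_OK;
}

/* Models the guard the patch proposes: bail out with -EINVAL before
 * touching the entity, instead of blocking forever. */
int guarded_worker_update(struct vm_model *vm)
{
	if (vm->delayed->killed)
		return UPDATE_EINVAL;
	return worker_update(vm);
}
```

The point of the guard is not that the update succeeds, but that the worker fails fast instead of sleeping in dma_fence_default_wait until the hung-task watchdog fires.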
Thanks
River

From: Yang, Philip <[email protected]>
Sent: Thursday, April 9, 2026 11:33 PM
To: Zhang, Tiantian (Celine) <[email protected]>; YuanShang Mao (River) <[email protected]>; Yang, Philip <[email protected]>; Koenig, Christian <[email protected]>
Cc: [email protected]; Liu, JennyJing (Jenny Jing) <[email protected]>
Subject: Re: [PATCH] drm/amdkfd: check if vm ready in svm map and unmap to gpu

On 2026-04-07 03:45, Zhang, Tiantian (Celine) wrote:

Hi @Yang, Philip,

Could you please help to review this patch? Thanks a lot.

Best Regards,
Celine Zhang

-----Original Message-----
From: YuanShang Mao (River) <[email protected]>
Sent: Wednesday, April 1, 2026 5:56 PM
To: Yang, Philip <[email protected]>
Cc: Koenig, Christian <[email protected]>; [email protected]; Zhang, Tiantian (Celine) <[email protected]>
Subject: RE: [PATCH] drm/amdkfd: check if vm ready in svm map and unmap to gpu

Hi @Yang, Philip,

Could you help review this patch?

Thanks
River

-----Original Message-----
From: Koenig, Christian <[email protected]>
Sent: Tuesday, March 31, 2026 7:32 PM
To: YuanShang Mao (River) <[email protected]>; [email protected]; Yang, Philip <[email protected]>
Subject: Re: [PATCH] drm/amdkfd: check if vm ready in svm map and unmap to gpu

On 3/26/26 11:36, YuanShang wrote:
> Don't map or unmap svm range to gpu if vm is not ready for updates.
>
> Why: the DRM entity may already be killed when the svm worker tries to
> update the gpu vm.
>
> Signed-off-by: YuanShang <[email protected]>

Looks correct to me, but I think somebody else already added those checks.

@Philip is that correct?
If not, please help review the patch.

Thanks,
Christian.

> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index 8167fe642341..7f905a7805fa 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -1366,6 +1366,12 @@ svm_range_unmap_from_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>
>  	pr_debug("CPU[0x%llx 0x%llx] -> GPU[0x%llx 0x%llx]\n", start, last,
>  		 gpu_start, gpu_end);
> +
> +	if (!amdgpu_vm_ready(vm)) {
> +		pr_debug("VM not ready, canceling unmap\n");
> +		return -EINVAL;
> +	}
> +

The change looks fine, but it is unnecessary after checking the details of amdgpu_vm_ready. It should be impossible that "the DRM entity may already be killed when the svm worker tries to update the gpu vm". Assuming the svm worker is p->svms.restore_work: svm_range_list_fini cancels the work or waits for it to finish, and kfd_process_wq_release calls svm_range_list_fini first, then fput(pdd->drm_file) to drop the vm refcount, which in turn calls amdgpu_vm_fini to destroy the drm sched entity.

If you are seeing a real issue, please post the dmesg log to help understand it.

Regards,
Philip

>  	return amdgpu_vm_update_range(adev, vm, false, true, true, false, NULL,
>  				      gpu_start, gpu_end, init_pte_value,
>  				      0, 0, NULL, NULL, fence);
> @@ -1443,6 +1449,11 @@ svm_range_map_to_gpu(struct kfd_process_device *pdd, struct svm_range *prange,
>  	pr_debug("svms 0x%p [0x%lx 0x%lx] readonly %d\n", prange->svms,
>  		 last_start, last_start + npages - 1, readonly);
>
> +	if (!amdgpu_vm_ready(vm)) {
> +		pr_debug("VM not ready, canceling map\n");
> +		return -EINVAL;
> +	}
> +
>  	for (i = offset; i < offset + npages; i++) {
>  		uint64_t gpu_start;
>  		uint64_t gpu_end;
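Philip's release ordering can also be sketched as a standalone stub model: svm_range_list_fini cancels or waits out the restore worker strictly before the fput-triggered amdgpu_vm_fini destroys the entity, so under this ordering the worker should never observe a killed entity. The function names below mirror the kernel ones cited above, but the bodies are hypothetical stubs, not the real code:

```c
#include <assert.h>
#include <stdbool.h>

/* Stub model of the teardown ordering; not the real kernel code. */
struct proc_model {
	bool worker_pending;   /* p->svms.restore_work still queued */
	int  step;             /* global ordering counter */
	int  fini_step;        /* when svm_range_list_fini ran */
	int  vm_fini_step;     /* when amdgpu_vm_fini ran */
};

/* The restore work is cancelled or waited on (cancel_work_sync). */
void svm_range_list_fini(struct proc_model *p)
{
	p->worker_pending = false;
	p->fini_step = ++p->step;
}

/* Destroys the drm sched entity. */
void amdgpu_vm_fini(struct proc_model *p)
{
	p->vm_fini_step = ++p->step;
}

/* kfd_process_wq_release: svm list fini first, then fput(pdd->drm_file)
 * drops the vm refcount, which leads to amdgpu_vm_fini. */
void kfd_process_wq_release(struct proc_model *p)
{
	svm_range_list_fini(p);
	/* fput(pdd->drm_file) -> vm refcount hits zero -> ... */
	amdgpu_vm_fini(p);
}
```

River's dmesg log at the top of the thread suggests a teardown path (the forced-kill amdgpu_flush) that bypasses this ordering, which is the open question in the discussion.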
