On 2026-04-10 01:51, YuanShang Mao (River) wrote:
[AMD Official Use Only - AMD Internal Distribution Only]
Hi @Yang, Philip,
Here is the log:
[ 2 17 16:41:58 2026 < 24.787659>] [drm:amddrm_sched_entity_push_job [amd_sched]] *ERROR* Trying to push to a killed entity
[ 2 17 16:42:43 2026 <  0.000000>] amdgpu 0000:00:08.0: clean up the vf2pf work item
[ 2 17 16:42:51 2026 <  7.951077>] amdgpu 0000:00:08.0: ring sdma0 timeout, signaled seq=2567, emitted seq=2568
[ 2 17 16:42:51 2026 <  0.001734>] amdgpu 0000:00:08.0: Process quark pid 8325 thread quark pid 8328
[ 2 17 16:42:51 2026 <  0.001475>] amdgpu 0000:00:08.0: GPU reset begin!. Source: 1
[ 2 17 16:42:51 2026 <  0.000026>] amdgpu 0000:00:08.0: Suspending all queues failed
[ 2 17 16:42:51 2026 <  0.008520>] amdgpu 0000:00:08.0: [drm] PCIE GART of 512M enabled (table at 0x000000800D300000).
[ 2 17 16:42:51 2026 <  0.204760>] amdgpu 0000:00:08.0: GPU reset(6) succeeded!
[ 2 17 16:42:51 2026 <  0.000010>] amdgpu 0000:00:08.0: [drm] device wedged, but recovered through reset
[ 2 17 16:44:02 2026 <  0.000000>] INFO: task kworker/10:3:7194 blocked for more than 122 seconds.
[ 2 17 16:44:02 2026 <  0.001502>] Tainted: G OE 6.8.0-90-generic #91~22.04.1-Ubuntu
[ 2 17 16:44:02 2026 <  0.001533>] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2 17 16:44:02 2026 <  0.001597>] task:kworker/10:3 state:D stack:0 pid:7194 tgid:7194 ppid:2 flags:0x00004000
[ 2 17 16:44:02 2026 <  0.000006>] Workqueue: events_freezable svm_range_restore_work [amdgpu]
[ 2 17 16:44:02 2026 <  0.000282>] Call Trace:
[ 2 17 16:44:02 2026 <  0.000002>] <TASK>
[ 2 17 16:44:02 2026 <  0.000004>] __schedule+0x27c/0x6a0
[ 2 17 16:44:02 2026 <  0.000008>] schedule+0x33/0x110
[ 2 17 16:44:02 2026 <  0.000003>] schedule_timeout+0x157/0x170
[ 2 17 16:44:02 2026 <  0.000005>] dma_fence_default_wait+0x13d/0x210
[ 2 17 16:44:02 2026 <  0.000004>] ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 2 17 16:44:02 2026 <  0.000003>] dma_fence_wait_timeout+0x116/0x140
[ 2 17 16:44:02 2026 <  0.000003>] svm_range_validate_and_map+0xf7c/0x19c0 [amdgpu]
[ 2 17 16:44:02 2026 <  0.000218>] svm_range_restore_work+0xe5/0x340 [amdgpu]
[ 2 17 16:44:02 2026 <  0.000197>] process_one_work+0x181/0x3a0
[ 2 17 16:44:02 2026 <  0.000005>] worker_thread+0x306/0x440
[ 2 17 16:44:02 2026 <  0.000003>] ? srso_alias_return_thunk+0x5/0xfbef5
[ 2 17 16:44:02 2026 <  0.000004>] ? _raw_spin_lock_irqsave+0xe/0x20
[ 2 17 16:44:02 2026 <  0.000003>] ? __pfx_worker_thread+0x10/0x10
[ 2 17 16:44:02 2026 <  0.000002>] kthread+0xef/0x120
[ 2 17 16:44:02 2026 <  0.000005>] ? __pfx_kthread+0x10/0x10
[ 2 17 16:44:02 2026 <  0.000003>] ret_from_fork+0x44/0x70
[ 2 17 16:44:02 2026 <  0.000004>] ? __pfx_kthread+0x10/0x10
[ 2 17 16:44:02 2026 <  0.000003>] ret_from_fork_asm+0x1b/0x30
[ 2 17 16:44:02 2026 <  0.000005>] </TASK>
[ 2 17 16:46:05 2026 < 122.874606>] INFO: task kworker/10:3:7194 blocked for more than 245 seconds.
The drm sched entity could be destroyed by *amdgpu_flush* if the process is killed forcibly, even when the vm refcount is not zero.
I see, the patch "drm/amdkfd: Don't clear PT after process killed" fixed one path; this patch fixes a different path.
Thanks, this patch is
Reviewed-by: Philip Yang <[email protected]>
Thanks
River
From: Yang, Philip <[email protected]>
Sent: Thursday, April 9, 2026 11:33 PM
To: Zhang, Tiantian (Celine) <[email protected]>; YuanShang Mao (River) <[email protected]>; Yang, Philip <[email protected]>; Koenig, Christian <[email protected]>
Cc: [email protected]; Liu, JennyJing (Jenny Jing) <[email protected]>
Subject: Re: [PATCH] drm/amdkfd: check if vm ready in svm map and unmap to gpu
On 2026-04-07 03:45, Zhang, Tiantian (Celine) wrote:
[AMD Official Use Only - AMD Internal Distribution Only]
Hi @Yang, Philip,
Could you please help to review this patch, thanks a lot~
Best Regards,
Celine Zhang
-----Original Message-----
From: YuanShang Mao (River) <[email protected]>
Sent: Wednesday, April 1, 2026 5:56 PM
To: Yang, Philip <[email protected]>
Cc: Koenig, Christian <[email protected]>; [email protected]; Zhang, Tiantian (Celine) <[email protected]>
Subject: RE: [PATCH] drm/amdkfd: check if vm ready in svm map and
unmap to gpu
[AMD Official Use Only - AMD Internal Distribution Only]
Hi @Yang, Philip
Could you help review this patch?
Thanks
River
-----Original Message-----
From: Koenig, Christian <[email protected]>
Sent: Tuesday, March 31, 2026 7:32 PM
To: YuanShang Mao (River) <[email protected]>; [email protected]; Yang, Philip <[email protected]>
Subject: Re: [PATCH] drm/amdkfd: check if vm ready in svm map and
unmap to gpu
On 3/26/26 11:36, YuanShang wrote:
> Don't map or unmap svm range to gpu if vm is not ready for updates.
>
> Why: the DRM entity may already be killed when the svm worker tries to
> update the gpu vm.
>
> Signed-off-by: YuanShang <[email protected]>
Looks correct to me, but I think somebody else already added those
checks.
@Philip is that correct? If not please help reviewing the patch.
Thanks,
Christian.
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index 8167fe642341..7f905a7805fa 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -1366,6 +1366,12 @@ svm_range_unmap_from_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>
>  	pr_debug("CPU[0x%llx 0x%llx] -> GPU[0x%llx 0x%llx]\n", start, last,
>  		 gpu_start, gpu_end);
> +
> + if (!amdgpu_vm_ready(vm)) {
> + pr_debug("VM not ready, canceling unmap\n");
> + return -EINVAL;
> + }
> +
The change looks fine, but it seems unnecessary after checking the details of amdgpu_vm_ready.
It should be impossible that "the DRM entity may already be killed when the svm worker tries to update the gpu vm". Guessing the svm worker is p->svms.restore_work: svm_range_list_fini cancels the work or waits for it to finish. kfd_process_wq_release does svm_range_list_fini first, then fput(pdd->drm_file) to drop the vm refcount, and only then calls amdgpu_vm_fini to destroy the drm sched entity.
If you are seeing a real issue, please post the dmesg log to help us understand it.
Regards,
Philip
>  	return amdgpu_vm_update_range(adev, vm, false, true, true, false, NULL, gpu_start,
>  				      gpu_end, init_pte_value, 0, 0, NULL, NULL, fence);
> @@ -1443,6 +1449,11 @@ svm_range_map_to_gpu(struct kfd_process_device *pdd, struct svm_range *prange,
>  	pr_debug("svms 0x%p [0x%lx 0x%lx] readonly %d\n", prange->svms,
>  		 last_start, last_start + npages - 1, readonly);
>
> + if (!amdgpu_vm_ready(vm)) {
> + pr_debug("VM not ready, canceling map\n");
> + return -EINVAL;
> + }
> +
> for (i = offset; i < offset + npages; i++) {
> uint64_t gpu_start;
> uint64_t gpu_end;