[AMD Official Use Only - AMD Internal Distribution Only]

Hi @Yang, Philip
Here is the log:

[ 2 17 16:41:58 2026 <   24.787659>] [drm:amddrm_sched_entity_push_job [amd_sched]] *ERROR* Trying to push to a killed entity
[ 2 17 16:42:43 2026 <    0.000000>] amdgpu 0000:00:08.0: clean up the vf2pf work item
[ 2 17 16:42:51 2026 <    7.951077>] amdgpu 0000:00:08.0: ring sdma0 timeout, signaled seq=2567, emitted seq=2568
[ 2 17 16:42:51 2026 <    0.001734>] amdgpu 0000:00:08.0:  Process quark pid 8325 thread quark pid 8328
[ 2 17 16:42:51 2026 <    0.001475>] amdgpu 0000:00:08.0: GPU reset begin!. Source:  1
[ 2 17 16:42:51 2026 <    0.000026>] amdgpu 0000:00:08.0: Suspending all queues failed
[ 2 17 16:42:51 2026 <    0.008520>] amdgpu 0000:00:08.0: [drm] PCIE GART of 512M enabled (table at 0x000000800D300000).
[ 2 17 16:42:51 2026 <    0.204760>] amdgpu 0000:00:08.0: GPU reset(6) succeeded!
[ 2 17 16:42:51 2026 <    0.000010>] amdgpu 0000:00:08.0: [drm] device wedged, but recovered through reset
[ 2 17 16:44:02 2026 <    0.000000>] INFO: task kworker/10:3:7194 blocked for more than 122 seconds.
[ 2 17 16:44:02 2026 <    0.001502>]       Tainted: G           OE      6.8.0-90-generic #91~22.04.1-Ubuntu
[ 2 17 16:44:02 2026 <    0.001533>] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2 17 16:44:02 2026 <    0.001597>] task:kworker/10:3    state:D stack:0     pid:7194  tgid:7194  ppid:2      flags:0x00004000
[ 2 17 16:44:02 2026 <    0.000006>] Workqueue: events_freezable svm_range_restore_work [amdgpu]
[ 2 17 16:44:02 2026 <    0.000282>] Call Trace:
[ 2 17 16:44:02 2026 <    0.000002>]  <TASK>
[ 2 17 16:44:02 2026 <    0.000004>]  __schedule+0x27c/0x6a0
[ 2 17 16:44:02 2026 <    0.000008>]  schedule+0x33/0x110
[ 2 17 16:44:02 2026 <    0.000003>]  schedule_timeout+0x157/0x170
[ 2 17 16:44:02 2026 <    0.000005>]  dma_fence_default_wait+0x13d/0x210
[ 2 17 16:44:02 2026 <    0.000004>]  ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 2 17 16:44:02 2026 <    0.000003>]  dma_fence_wait_timeout+0x116/0x140
[ 2 17 16:44:02 2026 <    0.000003>]  svm_range_validate_and_map+0xf7c/0x19c0 [amdgpu]
[ 2 17 16:44:02 2026 <    0.000218>]  svm_range_restore_work+0xe5/0x340 [amdgpu]
[ 2 17 16:44:02 2026 <    0.000197>]  process_one_work+0x181/0x3a0
[ 2 17 16:44:02 2026 <    0.000005>]  worker_thread+0x306/0x440
[ 2 17 16:44:02 2026 <    0.000003>]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 2 17 16:44:02 2026 <    0.000004>]  ? _raw_spin_lock_irqsave+0xe/0x20
[ 2 17 16:44:02 2026 <    0.000003>]  ? __pfx_worker_thread+0x10/0x10
[ 2 17 16:44:02 2026 <    0.000002>]  kthread+0xef/0x120
[ 2 17 16:44:02 2026 <    0.000005>]  ? __pfx_kthread+0x10/0x10
[ 2 17 16:44:02 2026 <    0.000003>]  ret_from_fork+0x44/0x70
[ 2 17 16:44:02 2026 <    0.000004>]  ? __pfx_kthread+0x10/0x10
[ 2 17 16:44:02 2026 <    0.000003>]  ret_from_fork_asm+0x1b/0x30
[ 2 17 16:44:02 2026 <    0.000005>]  </TASK>
[ 2 17 16:46:05 2026 <  122.874606>] INFO: task kworker/10:3:7194 blocked for more than 245 seconds.


The drm sched entity could be destroyed by amdgpu_flush if the process is
killed forcibly, even while the vm refcount is not zero.
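The blocked worker in the trace is the SVM restore worker; the path as I read
it is roughly the following (simplified call sketch reconstructed from the
trace above; the exact submit step inside amdgpu_vm_update_range is my reading
of the code, please correct me if it differs in your branch):

    svm_range_restore_work()
      svm_range_validate_and_map()
        svm_range_map_to_gpu()
          amdgpu_vm_update_range()        /* queues a page table update job */
            drm_sched_entity_push_job()   /* "Trying to push to a killed entity" */

which would be consistent with the restore worker then blocking forever in
dma_fence_wait_timeout above.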

Thanks
River
From: Yang, Philip <[email protected]>
Sent: Thursday, April 9, 2026 11:33 PM
To: Zhang, Tiantian (Celine) <[email protected]>; YuanShang Mao (River) 
<[email protected]>; Yang, Philip <[email protected]>; Koenig, Christian 
<[email protected]>
Cc: [email protected]; Liu, JennyJing (Jenny Jing) 
<[email protected]>
Subject: Re: [PATCH] drm/amdkfd: check if vm ready in svm map and unmap to gpu


On 2026-04-07 03:45, Zhang, Tiantian (Celine) wrote:

Hi @Yang, Philip,

Could you please help to review this patch? Thanks a lot~

Best Regards,
Celine Zhang

-----Original Message-----
From: YuanShang Mao (River) <[email protected]>
Sent: Wednesday, April 1, 2026 5:56 PM
To: Yang, Philip <[email protected]>
Cc: Koenig, Christian <[email protected]>; [email protected]; Zhang, Tiantian (Celine) <[email protected]>
Subject: RE: [PATCH] drm/amdkfd: check if vm ready in svm map and unmap to gpu



Hi @Yang, Philip,

Could you help review this patch?

Thanks
River



-----Original Message-----

From: Koenig, Christian <[email protected]>
Sent: Tuesday, March 31, 2026 7:32 PM
To: YuanShang Mao (River) <[email protected]>; [email protected]; Yang, Philip <[email protected]>

Subject: Re: [PATCH] drm/amdkfd: check if vm ready in svm map and unmap to gpu



On 3/26/26 11:36, YuanShang wrote:
> Don't map or unmap svm range to gpu if vm is not ready for updates.
>
> Why: DRM entity may already be killed when the svm worker try to
> update gpu vm.
>
> Signed-off-by: YuanShang <[email protected]>

Looks correct to me, but I think somebody else already added those checks.

@Philip is that correct? If not please help reviewing the patch.

Thanks,
Christian.

> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index 8167fe642341..7f905a7805fa 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -1366,6 +1366,12 @@ svm_range_unmap_from_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>
>       pr_debug("CPU[0x%llx 0x%llx] -> GPU[0x%llx 0x%llx]\n", start, last,
>               gpu_start, gpu_end);
> +
> +     if (!amdgpu_vm_ready(vm)) {
> +             pr_debug("VM not ready, canceling unmap\n");
> +             return -EINVAL;
> +     }
> +

The change looks fine, but it is unnecessary after checking the details of
amdgpu_vm_ready.
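For reference, amdgpu_vm_ready() boils down to an eviction-state check,
roughly like this (simplified sketch from my reading of amdgpu_vm.c; the
locking details may differ in your branch):

    bool amdgpu_vm_ready(struct amdgpu_vm *vm)
    {
            bool empty, ready;

            /* vm is not usable for updates while it is being evicted */
            amdgpu_vm_eviction_lock(vm);
            ready = !vm->evicting;
            amdgpu_vm_eviction_unlock(vm);

            /* or while any of its BOs are still on the evicted list */
            spin_lock(&vm->status_lock);
            empty = list_empty(&vm->evicted);
            spin_unlock(&vm->status_lock);

            return ready && empty;
    }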

It is impossible that "DRM entity may already be killed when the svm worker try
to update gpu vm". Guessing the svm worker is p->svms.restore_work:
svm_range_list_fini cancels the work or waits for it to finish.
kfd_process_wq_release does svm_range_list_fini first, then fput(pdd->drm_file)
to drop the vm refcount, then calls amdgpu_vm_fini to destroy the drm sched
entity.
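In other words the release path ordering is roughly (simplified outline of the
path described above, not the exact code):

    kfd_process_wq_release()
      svm_range_list_fini(p)      - cancels restore_work or waits for it
      ...
      fput(pdd->drm_file)         - drops the vm refcount
        -> on the last reference amdgpu_vm_fini() runs, and only then are
           the vm's page table update sched entities destroyed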

If you are seeing a real issue, please post the dmesg log to help understand it.

Regards,
Philip




>       return amdgpu_vm_update_range(adev, vm, false, true, true, false, NULL, gpu_start,
>                                     gpu_end, init_pte_value, 0, 0, NULL, NULL,
>                                     fence);
> @@ -1443,6 +1449,11 @@ svm_range_map_to_gpu(struct kfd_process_device *pdd, struct svm_range *prange,
>       pr_debug("svms 0x%p [0x%lx 0x%lx] readonly %d\n", prange->svms,
>                last_start, last_start + npages - 1, readonly);
>
> +     if (!amdgpu_vm_ready(vm)) {
> +             pr_debug("VM not ready, canceling map\n");
> +             return -EINVAL;
> +     }
> +
>       for (i = offset; i < offset + npages; i++) {
>               uint64_t gpu_start;
>               uint64_t gpu_end;




