On 2026-04-10 01:51, YuanShang Mao (River) wrote:

[AMD Official Use Only - AMD Internal Distribution Only]


Hi @Yang, Philip,
Here is the log:

[ 2 17 16:41:58 2026 <   24.787659>] [drm:amddrm_sched_entity_push_job [amd_sched]] *ERROR* Trying to push to a killed entity
[ 2 17 16:42:43 2026 <    0.000000>] amdgpu 0000:00:08.0: clean up the vf2pf work item
[ 2 17 16:42:51 2026 <    7.951077>] amdgpu 0000:00:08.0: ring sdma0 timeout, signaled seq=2567, emitted seq=2568
[ 2 17 16:42:51 2026 <    0.001734>] amdgpu 0000:00:08.0:  Process quark pid 8325 thread quark pid 8328
[ 2 17 16:42:51 2026 <    0.001475>] amdgpu 0000:00:08.0: GPU reset begin!. Source:  1
[ 2 17 16:42:51 2026 <    0.000026>] amdgpu 0000:00:08.0: Suspending all queues failed
[ 2 17 16:42:51 2026 <    0.008520>] amdgpu 0000:00:08.0: [drm] PCIE GART of 512M enabled (table at 0x000000800D300000).
[ 2 17 16:42:51 2026 <    0.204760>] amdgpu 0000:00:08.0: GPU reset(6) succeeded!
[ 2 17 16:42:51 2026 <    0.000010>] amdgpu 0000:00:08.0: [drm] device wedged, but recovered through reset
[ 2 17 16:44:02 2026 <    0.000000>] INFO: task kworker/10:3:7194 blocked for more than 122 seconds.
[ 2 17 16:44:02 2026 <    0.001502>]       Tainted: G           OE      6.8.0-90-generic #91~22.04.1-Ubuntu
[ 2 17 16:44:02 2026 <    0.001533>] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2 17 16:44:02 2026 <    0.001597>] task:kworker/10:3    state:D stack:0     pid:7194  tgid:7194  ppid:2      flags:0x00004000
[ 2 17 16:44:02 2026 <    0.000006>] Workqueue: events_freezable svm_range_restore_work [amdgpu]
[ 2 17 16:44:02 2026 <    0.000282>] Call Trace:
[ 2 17 16:44:02 2026 <    0.000002>]  <TASK>
[ 2 17 16:44:02 2026 <    0.000004>]  __schedule+0x27c/0x6a0
[ 2 17 16:44:02 2026 <    0.000008>]  schedule+0x33/0x110
[ 2 17 16:44:02 2026 <    0.000003>]  schedule_timeout+0x157/0x170
[ 2 17 16:44:02 2026 <    0.000005>]  dma_fence_default_wait+0x13d/0x210
[ 2 17 16:44:02 2026 <    0.000004>]  ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 2 17 16:44:02 2026 <    0.000003>]  dma_fence_wait_timeout+0x116/0x140
[ 2 17 16:44:02 2026 <    0.000003>]  svm_range_validate_and_map+0xf7c/0x19c0 [amdgpu]
[ 2 17 16:44:02 2026 <    0.000218>]  svm_range_restore_work+0xe5/0x340 [amdgpu]
[ 2 17 16:44:02 2026 <    0.000197>]  process_one_work+0x181/0x3a0
[ 2 17 16:44:02 2026 <    0.000005>]  worker_thread+0x306/0x440
[ 2 17 16:44:02 2026 <    0.000003>]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 2 17 16:44:02 2026 <    0.000004>]  ? _raw_spin_lock_irqsave+0xe/0x20
[ 2 17 16:44:02 2026 <    0.000003>]  ? __pfx_worker_thread+0x10/0x10
[ 2 17 16:44:02 2026 <    0.000002>]  kthread+0xef/0x120
[ 2 17 16:44:02 2026 <    0.000005>]  ? __pfx_kthread+0x10/0x10
[ 2 17 16:44:02 2026 <    0.000003>]  ret_from_fork+0x44/0x70
[ 2 17 16:44:02 2026 <    0.000004>]  ? __pfx_kthread+0x10/0x10
[ 2 17 16:44:02 2026 <    0.000003>]  ret_from_fork_asm+0x1b/0x30
[ 2 17 16:44:02 2026 <    0.000005>]  </TASK>
[ 2 17 16:46:05 2026 <  122.874606>] INFO: task kworker/10:3:7194 blocked for more than 245 seconds.


The drm sched entity could be destroyed by *amdgpu_flush* if the process is killed forcibly, even while the vm refcount is not zero.

I see, the patch "drm/amdkfd: Don't clear PT after process killed" fixed one path; this patch fixes a different path.

Thanks, this patch is

Reviewed-by: Philip Yang <[email protected]>


Thanks
River

*From:* Yang, Philip <[email protected]>
*Sent:* Thursday, April 9, 2026 11:33 PM
*To:* Zhang, Tiantian (Celine) <[email protected]>; YuanShang Mao (River) <[email protected]>; Yang, Philip <[email protected]>; Koenig, Christian <[email protected]>
*Cc:* [email protected]; Liu, JennyJing (Jenny Jing) <[email protected]>
*Subject:* Re: [PATCH] drm/amdkfd: check if vm ready in svm map and unmap to gpu

On 2026-04-07 03:45, Zhang, Tiantian (Celine) wrote:

    [AMD Official Use Only - AMD Internal Distribution Only]

    Hi @Yang, Philip,

    Could you please help to review this patch, thanks a lot~

    Best Regards,

    Celine Zhang

    -----Original Message-----
    From: YuanShang Mao (River) <[email protected]>
    Sent: Wednesday, April 1, 2026 5:56 PM
    To: Yang, Philip <[email protected]>
    Cc: Koenig, Christian <[email protected]>; [email protected]; Zhang, Tiantian (Celine) <[email protected]>
    Subject: RE: [PATCH] drm/amdkfd: check if vm ready in svm map and unmap to gpu

    [AMD Official Use Only - AMD Internal Distribution Only]

    Hi @Yang, Philip

    Could you help review this patch?

    Thanks

    River

    -----Original Message-----

    From: Koenig, Christian <[email protected]>
    Sent: Tuesday, March 31, 2026 7:32 PM
    To: YuanShang Mao (River) <[email protected]>; [email protected]; Yang, Philip <[email protected]>
    Subject: Re: [PATCH] drm/amdkfd: check if vm ready in svm map and unmap to gpu

    On 3/26/26 11:36, YuanShang wrote:

    > Don't map or unmap svm range to gpu if vm is not ready for updates.
    >
    > Why: DRM entity may already be killed when the svm worker try to
    > update gpu vm.
    >
    > Signed-off-by: YuanShang <[email protected]>

    Looks correct to me, but I think somebody else already added those
    checks.

    @Philip is that correct? If not please help reviewing the patch.

    Thanks,

    Christian.

    > ---
    >  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 11 +++++++++++
    >  1 file changed, 11 insertions(+)
    >
    > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
    > index 8167fe642341..7f905a7805fa 100644
    > --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
    > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
    > @@ -1366,6 +1366,12 @@ svm_range_unmap_from_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm,
    >
    >       pr_debug("CPU[0x%llx 0x%llx] -> GPU[0x%llx 0x%llx]\n", start, last,
    >               gpu_start, gpu_end);
    > +
    > +     if (!amdgpu_vm_ready(vm)) {
    > +             pr_debug("VM not ready, canceling unmap\n");
    > +             return -EINVAL;
    > +     }
    > +

The change looks fine, but after checking the details of amdgpu_vm_ready it seems unnecessary.

It should be impossible that the "DRM entity may already be killed when the svm worker try to update gpu vm". Guessing the svm worker is p->svms.restore_work: svm_range_list_fini cancels the work or waits for it to finish. kfd_process_wq_release does svm_range_list_fini first, then fput(pdd->drm_file) to drop the vm refcount, and then calls amdgpu_vm_fini to destroy the drm sched entity.

If you do see a real issue, please post the dmesg log to help us understand it.

Regards,
Philip



    >       return amdgpu_vm_update_range(adev, vm, false, true, true, false, NULL, gpu_start,
    >                       gpu_end, init_pte_value, 0, 0, NULL, NULL, fence);
    > @@ -1443,6 +1449,11 @@ svm_range_map_to_gpu(struct kfd_process_device *pdd, struct svm_range *prange,
    >       pr_debug("svms 0x%p [0x%lx 0x%lx] readonly %d\n", prange->svms,
    >                last_start, last_start + npages - 1, readonly);
    >
    > +     if (!amdgpu_vm_ready(vm)) {
    > +             pr_debug("VM not ready, canceling map\n");
    > +             return -EINVAL;
    > +     }
    > +
    >
    >       for (i = offset; i < offset + npages; i++) {
    >               uint64_t gpu_start;
    >               uint64_t gpu_end;
