[AMD Official Use Only - AMD Internal Distribution Only]

@Yang, Philip

>I notice KFD has a different issue with fclose -> amdgpu_flush:
>fork evicts the parent process queues when the child process closes the
>inherited drm node file handle. amdgpu_flush signals the parent process's
>KFD eviction fence added to the vm root bo resv, which causes a performance
>drop if a python application uses lots of popen.

Yes. Closing the inherited drm node file handle will evict the parent process queues, since drm shares the VM with KFD.

>function amdgpu_ctx_mgr_entity_flush is only called by amdgpu_flush, so it
>can be removed too.

Sure, if we decide to remove amdgpu_flush.

@Koenig, Christian @Deucher, Alexander, do you have any concerns about removing amdgpu_flush?

Thanks
River

-----Original Message-----
From: Yang, Philip <philip.y...@amd.com>
Sent: Friday, June 27, 2025 10:44 PM
To: YuanShang Mao (River) <yuanshang....@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Yin, ZhenGuo (Chris) <zhenguo....@amd.com>; cao, lin <lin....@amd.com>; Deng, Emily <emily.d...@amd.com>; Deucher, Alexander <alexander.deuc...@amd.com>
Subject: Re: [PATCH] drm/amdgpu: delete function amdgpu_flush

On 2025-06-27 01:20, YuanShang Mao (River) wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Currently, amdgpu_flush is used to prevent new jobs from being submitted in
> the same context when a file descriptor is closed and to wait for existing
> jobs to complete. Additionally, if the current process is in an exit state
> and the latest job of the entity was submitted by this process, the entity is
> terminated.
>
> There is an issue where, if drm scheduler is not woken up for some reason,
> the amdgpu_flush will remain hung, and another process holding this file
> cannot submit a job to wake up the drm scheduler.

I notice KFD has a different issue with fclose -> amdgpu_flush: fork evicts the parent process queues when the child process closes the inherited drm node file handle. amdgpu_flush signals the parent process's KFD eviction fence added to the vm root bo resv, which causes a performance drop if a python application uses lots of popen.

[677852.634569] amdkfd_fence_enable_signaling+0x56/0x70 [amdgpu]
[677852.634814] __dma_fence_enable_signaling+0x3e/0xe0
[677852.634820] dma_fence_wait_timeout+0x3a/0x140
[677852.634825] amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
[677852.634831] amdgpu_vm_wait_idle+0x2d/0x60 [amdgpu]
[677852.635026] amdgpu_flush+0x34/0x50 [amdgpu]
[677852.635208] filp_flush+0x38/0x90
[677852.635213] filp_close+0x14/0x30
[677852.635216] do_close_on_exec+0xdd/0x130
[677852.635221] begin_new_exec+0x1da/0x490
[677852.635225] load_elf_binary+0x307/0xea0
[677852.635231] ? srso_alias_return_thunk+0x5/0xfbef5
[677852.635235] ? ima_bprm_check+0xa2/0xd0
[677852.635240] search_binary_handler+0xda/0x260
[677852.635245] exec_binprm+0x58/0x1a0
[677852.635249] bprm_execve.part.0+0x16f/0x210
[677852.635254] bprm_execve+0x45/0x80
[677852.635257] do_execveat_common.isra.0+0x190/0x200

>
> The intended purpose of the flush operation in Linux is to flush the content
> written by the current process to the hardware, rather than shutting down
> related services upon the process's exit, which would prevent other processes
> from using them. Now, amdgpu_flush cannot execute concurrently with command
> submission ioctl, which also leads to performance degradation.

fclose -> filp_flush -> fput: if fput releases the last reference to the drm node file handle, it calls amdgpu_driver_postclose_kms -> amdgpu_ctx_mgr_fini, which will flush the entities, so amdgpu_flush is not needed.

I thought of adding a workaround for KFD to skip amdgpu_flush if (vm->task_info->tgid != current->group_leader->pid), but this patch will fix both gfx and KFD, two birds with one stone.

The function amdgpu_ctx_mgr_entity_flush is only called by amdgpu_flush, so it can be removed too.

Regards,
Philip

>
> An example of a shared DRM file is when systemd stops the display manager;
> systemd will close the file descriptor of Xorg that it holds.
>
> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu: amdgpu_ctx_get: locked by other task times 8811
> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu: owner stack:
> Jun 11 16:24:24 ubuntu2404-2 kernel: task:(sd-rmrf) state:D stack:0 pid:3407 tgid:3407 ppid:1 flags:0x00004002
> Jun 11 16:24:24 ubuntu2404-2 kernel: Call Trace:
> Jun 11 16:24:24 ubuntu2404-2 kernel: <TASK>
> Jun 11 16:24:24 ubuntu2404-2 kernel: __schedule+0x279/0x6b0
> Jun 11 16:24:24 ubuntu2404-2 kernel: schedule+0x29/0xd0
> Jun 11 16:24:24 ubuntu2404-2 kernel: amddrm_sched_entity_flush+0x13e/0x270 [amd_sched]
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __pfx_autoremove_wake_function+0x10/0x10
> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu_ctx_mgr_entity_flush+0xd6/0x200 [amdgpu]
> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu_flush+0x29/0x50 [amdgpu]
> Jun 11 16:24:24 ubuntu2404-2 kernel: filp_flush+0x38/0x90
> Jun 11 16:24:24 ubuntu2404-2 kernel: filp_close+0x14/0x30
> Jun 11 16:24:24 ubuntu2404-2 kernel: __close_range+0x1b0/0x230
> Jun 11 16:24:24 ubuntu2404-2 kernel: __x64_sys_close_range+0x17/0x30
> Jun 11 16:24:24 ubuntu2404-2 kernel: x64_sys_call+0x1e0f/0x25f0
> Jun 11 16:24:24 ubuntu2404-2 kernel: do_syscall_64+0x7e/0x170
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __count_memcg_events+0x86/0x160
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? count_memcg_events.constprop.0+0x2a/0x50
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? handle_mm_fault+0x1df/0x2d0
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? do_user_addr_fault+0x5d5/0x870
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? irqentry_exit_to_user_mode+0x43/0x250
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? irqentry_exit+0x43/0x50
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? exc_page_fault+0x96/0x1c0
> Jun 11 16:24:24 ubuntu2404-2 kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
> Jun 11 16:24:24 ubuntu2404-2 kernel: RIP: 0033:0x762b6df1677b
> Jun 11 16:24:24 ubuntu2404-2 kernel: RSP: 002b:00007ffdb20ad718 EFLAGS: 00000246 ORIG_RAX: 00000000000001b4
> Jun 11 16:24:24 ubuntu2404-2 kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000762b6df1677b
> Jun 11 16:24:24 ubuntu2404-2 kernel: RDX: 0000000000000000 RSI: 000000007fffffff RDI: 0000000000000003
> Jun 11 16:24:24 ubuntu2404-2 kernel: RBP: 00007ffdb20ad730 R08: 0000000000000000 R09: 0000000000000000
> Jun 11 16:24:24 ubuntu2404-2 kernel: R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000007
> Jun 11 16:24:24 ubuntu2404-2 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> Jun 11 16:24:24 ubuntu2404-2 kernel: </TASK>
> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu: current stack:
> Jun 11 16:24:24 ubuntu2404-2 kernel: task:Xorg state:R running task stack:0 pid:2343 tgid:2343 ppid:2341 flags:0x00000008
> Jun 11 16:24:24 ubuntu2404-2 kernel: Call Trace:
> Jun 11 16:24:24 ubuntu2404-2 kernel: <TASK>
> Jun 11 16:24:24 ubuntu2404-2 kernel: sched_show_task+0x122/0x180
> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu_ctx_get+0xf6/0x120 [amdgpu]
> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu_cs_ioctl+0xb6/0x2110 [amdgpu]
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? update_cfs_group+0x111/0x120
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? enqueue_entity+0x3a6/0x550
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
> Jun 11 16:24:24 ubuntu2404-2 kernel: drm_ioctl_kernel+0xbc/0x120
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: drm_ioctl+0x2f6/0x5b0
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu_drm_ioctl+0x4e/0x90 [amdgpu]
> Jun 11 16:24:24 ubuntu2404-2 kernel: __x64_sys_ioctl+0xa3/0xf0
> Jun 11 16:24:24 ubuntu2404-2 kernel: x64_sys_call+0x11ad/0x25f0
> Jun 11 16:24:24 ubuntu2404-2 kernel: do_syscall_64+0x7e/0x170
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? ksys_read+0xe6/0x100
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? idr_find+0xf/0x20
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? drm_syncobj_array_free+0x5a/0x80
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? drm_syncobj_reset_ioctl+0xbd/0xd0
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __pfx_drm_syncobj_reset_ioctl+0x10/0x10
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? drm_ioctl_kernel+0xbc/0x120
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __check_object_size.part.0+0x3a/0x150
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? _copy_to_user+0x41/0x60
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? drm_ioctl+0x326/0x5b0
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __pfx_drm_syncobj_reset_ioctl+0x10/0x10
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? kvm_clock_get_cycles+0x18/0x40
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __pm_runtime_suspend+0x7b/0xd0
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? amdgpu_drm_ioctl+0x70/0x90 [amdgpu]
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __x64_sys_ioctl+0xbb/0xf0
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? syscall_exit_to_user_mode+0x4e/0x250
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? do_syscall_64+0x8a/0x170
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? do_syscall_64+0x8a/0x170
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? syscall_exit_to_user_mode+0x4e/0x250
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? do_syscall_64+0x8a/0x170
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? do_syscall_64+0x8a/0x170
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel: ? do_syscall_64+0x8a/0x170
> Jun 11 16:24:24 ubuntu2404-2 kernel: ?
> sysvec_apic_timer_interrupt+0x57/0xc0
> Jun 11 16:24:24 ubuntu2404-2 kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
> Jun 11 16:24:24 ubuntu2404-2 kernel: RIP: 0033:0x7156c3524ded
> Jun 11 16:24:24 ubuntu2404-2 kernel: Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
> Jun 11 16:24:24 ubuntu2404-2 kernel: RSP: 002b:00007ffe4afcc410 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> Jun 11 16:24:24 ubuntu2404-2 kernel: RAX: ffffffffffffffda RBX: 0000578954b74cf8 RCX: 00007156c3524ded
> Jun 11 16:24:24 ubuntu2404-2 kernel: RDX: 00007ffe4afcc4f0 RSI: 00000000c0186444 RDI: 0000000000000012
> Jun 11 16:24:24 ubuntu2404-2 kernel: RBP: 00007ffe4afcc460 R08: 00007ffe4afcc7a0 R09: 00007ffe4afcc4b0
> Jun 11 16:24:24 ubuntu2404-2 kernel: R10: 0000578954b862f0 R11: 0000000000000246 R12: 00000000c0186444
> Jun 11 16:24:24 ubuntu2404-2 kernel: R13: 0000000000000012 R14: 0000000000000060 R15: 0000578954b46380
> Jun 11 16:24:24 ubuntu2404-2 kernel: </TASK>
>
> Signed-off-by: YuanShang <yuanshang....@amd.com>
>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 13 -------------
>  1 file changed, 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 2bb02fe9c880..ee6b59bfd798 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2947,22 +2947,9 @@ static const struct dev_pm_ops amdgpu_pm_ops = {
>  	.runtime_idle = amdgpu_pmops_runtime_idle,
>  };
>
> -static int amdgpu_flush(struct file *f, fl_owner_t id)
> -{
> -	struct drm_file *file_priv = f->private_data;
> -	struct amdgpu_fpriv *fpriv = file_priv->driver_priv;
> -	long timeout = MAX_WAIT_SCHED_ENTITY_Q_EMPTY;
> -
> -	timeout = amdgpu_ctx_mgr_entity_flush(&fpriv->ctx_mgr, timeout);
> -	timeout = amdgpu_vm_wait_idle(&fpriv->vm, timeout);
> -
> -	return timeout >= 0 ? 0 : timeout;
> -}
> -
>  static const struct file_operations amdgpu_driver_kms_fops = {
>  	.owner = THIS_MODULE,
>  	.open = drm_open,
> -	.flush = amdgpu_flush,
>  	.release = drm_release,
>  	.unlocked_ioctl = amdgpu_drm_ioctl,
>  	.mmap = drm_gem_mmap,
> --
> 2.25.1
>
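
For readers following the thread, the difference between the ->flush and ->release hooks that the discussion relies on can be sketched with a small userspace model. This is an illustrative sketch only, not kernel code: struct model_file, model_fork, model_close and model_popen_loop are hypothetical names invented here. The model assumes only the documented VFS behaviour that ->flush runs on every close() of a descriptor, while ->release runs once, when the last reference to the open file is dropped.

```c
#include <assert.h>

/* Hypothetical userspace model of one open file; not a kernel structure. */
struct model_file {
	int refcount;      /* descriptors (across processes) referring to this file */
	int flush_calls;   /* how many times the driver's ->flush hook would run */
	int release_calls; /* how many times the driver's ->release hook would run */
};

/* Models fork(): the child inherits the descriptor, adding one reference. */
void model_fork(struct model_file *f)
{
	assert(f->refcount > 0);
	f->refcount++;
}

/*
 * Models close(): ->flush (if the driver provides one) runs on every close,
 * ->release runs only when the last reference goes away.
 */
void model_close(struct model_file *f, int driver_has_flush)
{
	assert(f->refcount > 0);
	if (driver_has_flush)
		f->flush_calls++;
	if (--f->refcount == 0)
		f->release_calls++;
}

/*
 * Models a parent that opens the node once, forks 'children' times (each
 * child closes the inherited descriptor, as close-on-exec would), and then
 * closes its own descriptor at the end.
 */
struct model_file model_popen_loop(int children, int driver_has_flush)
{
	struct model_file f = { .refcount = 1, .flush_calls = 0, .release_calls = 0 };
	int i;

	for (i = 0; i < children; i++) {
		model_fork(&f);
		model_close(&f, driver_has_flush); /* child side */
	}
	model_close(&f, driver_has_flush); /* parent side, last reference */
	return f;
}
```

Under these assumptions, model_popen_loop(3, 1) counts four flush calls and one release, while model_popen_loop(3, 0) counts no flush calls and still one release. That is consistent with the argument above that cleanup at the final release remains in place once the .flush hook is dropped.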