On 1/12/26 12:29 PM, Mario Limonciello wrote:
When a surprise unplug occurs while a process has active KFD queues,
userspace never gets a chance to call kfd_ioctl_destroy_queue() to
properly clean them up. This leads to a WARN_ON in uninitialize()
complaining about active_queue_count or processes_count being non-zero.
The issue is that during surprise unplug:
1. amdgpu_device_fini_hw() checks drm_dev_is_unplugged()
2. It calls amdgpu_amdkfd_device_fini_sw()
3. This leads to kfd_cleanup_nodes() -> device_queue_manager_uninit()
4. uninitialize() has: WARN_ON(dqm->active_queue_count > 0 ||
dqm->processes_count > 0)
The warning triggers because the queues were never destroyed - userspace
had no opportunity to clean them up before the device disappeared.
Fix this by checking for device unplug in kfd_cleanup_nodes() and
calling process_termination for each affected process before
uninitializing the DQM. This mirrors what happens during normal process
shutdown (kfd_process_notifier_release_internal), ensuring queues are
properly cleaned up even during surprise removal.
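
For readers following the thread, the ordering problem can be sketched with a small user-space model. This is only an illustration under stated assumptions: every toy_* type and function below is hypothetical and stands in for the amdkfd structures named above. It is not the driver code, and it models only the two counters that the WARN_ON checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the two DQM counters checked by the WARN_ON. */
struct toy_dqm {
	int active_queue_count;
	int processes_count;
};

/* Hypothetical stand-in for a process that still owns queues on the device. */
struct toy_process {
	int pid;
	int queue_count;
	bool already_dequeued;
};

/* Models process_termination: drop all queues owned by one process. */
static void toy_process_termination(struct toy_dqm *dqm, struct toy_process *p)
{
	if (p->already_dequeued)
		return;

	assert(dqm->processes_count > 0);
	dqm->active_queue_count -= p->queue_count;
	dqm->processes_count--;
	p->queue_count = 0;
	p->already_dequeued = true;
}

/* Models uninitialize(): returns true if the WARN_ON condition would fire. */
static bool toy_uninitialize(const struct toy_dqm *dqm)
{
	bool warn = dqm->active_queue_count > 0 || dqm->processes_count > 0;

	if (warn)
		fprintf(stderr, "WARN: %d queue(s), %d process(es) still registered\n",
			dqm->active_queue_count, dqm->processes_count);
	return warn;
}

/*
 * Simulates teardown after a surprise unplug with two processes holding
 * three queues in total. Returns 1 if the warning would fire, 0 otherwise.
 */
static int toy_unplug_teardown(bool terminate_first)
{
	struct toy_process procs[] = {
		{ .pid = 100, .queue_count = 2, .already_dequeued = false },
		{ .pid = 200, .queue_count = 1, .already_dequeued = false },
	};
	struct toy_dqm dqm = { .active_queue_count = 3, .processes_count = 2 };
	size_t i;

	if (terminate_first)
		for (i = 0; i < sizeof(procs) / sizeof(procs[0]); i++)
			toy_process_termination(&dqm, &procs[i]);

	return toy_uninitialize(&dqm) ? 1 : 0;
}
```

In this model, toy_unplug_teardown(false) corresponds to the current behaviour, where the counters are still non-zero at uninitialize time, and toy_unplug_teardown(true) corresponds to the ordering proposed here, where each process is terminated first.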
Cc: Felix Kuehling <[email protected]>
Cc: Kent Russell <[email protected]>
Cc: [email protected]
Signed-off-by: Mario Limonciello <[email protected]>
Ping?
---
drivers/gpu/drm/amd/amdkfd/kfd_device.c | 32 ++++++++++++++++++++++++
1 file changed, 32 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index e9cfb80bd436..7727b66e6afb 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -664,6 +664,38 @@ static void kfd_cleanup_nodes(struct kfd_dev *kfd, unsigned int num_nodes)
flush_workqueue(kfd->ih_wq);
destroy_workqueue(kfd->ih_wq);
+ /*
+ * For surprise unplugs with running processes, we need to clean up
+ * queues before uninitializing the DQM to avoid WARN in uninitialize.
+ * This handles the case where userspace can't destroy queues normally.
+ */
+ if (drm_dev_is_unplugged(adev_to_drm(kfd->adev))) {
+ struct kfd_process *p;
+ unsigned int temp;
+ int idx;
+
+ idx = srcu_read_lock(&kfd_processes_srcu);
+ hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+ int j;
+
+ for (j = 0; j < p->n_pdds; j++) {
+ struct kfd_process_device *pdd = p->pdds[j];
+
+ if (pdd->dev->kfd != kfd)
+ continue;
+
+ dev_info(kfd_device,
+ "Terminating queues for process %d on unplugged device\n",
+ p->lead_thread->pid);
+
+ pdd->dev->dqm->ops.process_termination(pdd->dev->dqm,
+ &pdd->qpd);
+ pdd->already_dequeued = true;
+ }
+ }
+ srcu_read_unlock(&kfd_processes_srcu, idx);
+ }
+
for (i = 0; i < num_nodes; i++) {
knode = kfd->nodes[i];
device_queue_manager_uninit(knode->dqm);