Hi,
Le 16/04/2025 à 12:01, Khatri, Sunil a écrit :
On 4/16/2025 12:56 PM, Tvrtko Ursulin wrote:
On 15/04/2025 19:43, Sunil Khatri wrote:
add process and pid information in the userqueue error
logging to make it more useful in resolving the error
by logs.
Sample log:
[ 42.444297] [drm:amdgpu_userqueue_wait_for_signal [amdgpu]] *ERROR* Timed out waiting for
fence f=000000001c74d978 for comm:Xwayland pid:3427
[ 42.444669] [drm:amdgpu_userqueue_suspend [amdgpu]] *ERROR* Not suspending userqueue, timeout
waiting for comm:Xwayland pid:3427
[ 42.824729] [drm:amdgpu_userqueue_wait_for_signal [amdgpu]] *ERROR* Timed out waiting for
fence f=0000000074407d3e for comm:systemd-logind pid:1058
[ 42.825082] [drm:amdgpu_userqueue_suspend [amdgpu]] *ERROR* Not suspending userqueue, timeout
waiting for comm:systemd-logind pid:1058
Signed-off-by: Sunil Khatri <sunil.kha...@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c b/drivers/gpu/drm/amd/amdgpu/
amdgpu_userqueue.c
index 1867520ba258..05c1ee27a319 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c
@@ -43,7 +43,7 @@ amdgpu_userqueue_cleanup(struct amdgpu_userq_mgr *uq_mgr,
if (f && !dma_fence_is_signaled(f)) {
ret = dma_fence_wait_timeout(f, true, msecs_to_jiffies(100));
if (ret <= 0) {
- DRM_ERROR("Timed out waiting for fence f=%p\n", f);
+ drm_file_err(uq_mgr->file, "Timed out waiting for fence f=%p\n",
f);
You decided to leave %p after all?
Yes we are printing the fence ptr here to see which fence is timing out. Anyways right now intention
of this patch is to add additional process information along with existing information like fence here.
I agree with Tvrtko, "fence=%llu:%llu" would be better to identify "which fence is
timing out".
Pierre-Eric
regards
Sunil
return;
}
}
@@ -440,7 +440,8 @@ amdgpu_userqueue_resume_all(struct amdgpu_userq_mgr *uq_mgr)
}
if (ret)
- DRM_ERROR("Failed to map all the queues\n");
+ drm_file_err(uq_mgr->file, "Failed to map all the queue\n");
You lost the plural by accident.
Yes i will add 's'. Noted.
I am also not sure "all the queues" makes sense in this context versus "all queues" but it's
inconsequential really.
Regards
Sunil
Yes it all queues from a uq_mgr.
+
return ret;
}
@@ -598,7 +599,8 @@ amdgpu_userqueue_suspend_all(struct amdgpu_userq_mgr
*uq_mgr)
}
if (ret)
- DRM_ERROR("Couldn't unmap all the queues\n");
+ drm_file_err(uq_mgr->file, "Couldn't unmap all the queues\n");
+
return ret;
}
@@ -615,7 +617,7 @@ amdgpu_userqueue_wait_for_signal(struct amdgpu_userq_mgr
*uq_mgr)
continue;
ret = dma_fence_wait_timeout(f, true, msecs_to_jiffies(100));
if (ret <= 0) {
- DRM_ERROR("Timed out waiting for fence f=%p\n", f);
+ drm_file_err(uq_mgr->file, "Timed out waiting for fence f=%p\n",
f);
return -ETIMEDOUT;
}
}
@@ -634,13 +636,13 @@ amdgpu_userqueue_suspend(struct amdgpu_userq_mgr *uq_mgr,
/* Wait for any pending userqueue fence work to finish */
ret = amdgpu_userqueue_wait_for_signal(uq_mgr);
if (ret) {
- DRM_ERROR("Not suspending userqueue, timeout waiting for work\n");
+ drm_file_err(uq_mgr->file, "Not suspending userqueue, timeout
waiting\n");
return;
}
ret = amdgpu_userqueue_suspend_all(uq_mgr);
if (ret) {
- DRM_ERROR("Failed to evict userqueue\n");
+ drm_file_err(uq_mgr->file, "Failed to evict userqueue\n");
return;
It is pre-existing but strikes me as odd that failure to amdgpu_userqueue_suspend_all() logs a
failure to *evict* instead of suspend (as the previous log does). Anyway, I did not look at the
surrounding code so just thinking out loud.
Yes suspend failed as all the fences were not evicted and thats why suspend failed. Anyways there
are already alex patches which will change this to unmap as a code reorganisation for suspend/resume
is in pipeline.
regards
Sunil
Regards,
Tvrtko
}