Re: [PATCH v3 7/8] drm/xe: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset

2025-06-23 Thread Philipp Stanner
On Wed, 2025-06-18 at 11:47 -0300, Maíra Canal wrote: > Xe can skip the reset if TDR has fired before the free job worker and > can > also re-arm the timeout timer in some scenarios. Instead of > manipulating > scheduler's internals, inform the scheduler that the job did not > actually > timeout an

Re: [PATCH] drm/sched/tests: Make timedout_job callback a better role model

2025-06-23 Thread Philipp Stanner
On Thu, 2025-06-05 at 15:41 +0200, Philipp Stanner wrote: > Since the drm_mock_scheduler does not have real users in userspace, > nor > does it have real hardware or firmware rings, it's not necessary to > signal timedout fences nor free jobs - from a functional standpoint.

Re: [PATCH] drm/sched/tests: Make timedout_job callback a better role model

2025-06-16 Thread Philipp Stanner
On Mon, 2025-06-16 at 09:49 -0300, Maíra Canal wrote: > Hi Danilo, > > On 16/06/25 08:14, Danilo Krummrich wrote: > > On Mon, Jun 16, 2025 at 11:57:47AM +0100, Tvrtko Ursulin wrote: > > > Code looks fine, but currently nothing is broken and I disagree > > > with the > > > goal that the _mock_^1 co

Re: [RFC PATCH 1/6] drm/sched: Avoid memory leaks with cancel_job() callback

2025-06-16 Thread Philipp Stanner
On Mon, 2025-06-16 at 10:27 +0100, Tvrtko Ursulin wrote: > > On 12/06/2025 15:20, Philipp Stanner wrote: > > On Thu, 2025-06-12 at 15:17 +0100, Tvrtko Ursulin wrote: > > > > > > On 03/06/2025 10:31, Philipp Stanner wrote: > > > > Since its inception

Re: [PATCH v1] drm/amdgpu: give each kernel job a unique id

2025-06-13 Thread Philipp Stanner
On Fri, 2025-06-13 at 10:23 +0200, Christian König wrote: > On 6/13/25 01:48, Danilo Krummrich wrote: > > On Thu, Jun 12, 2025 at 09:00:34AM +0200, Christian König wrote: > > > On 6/11/25 17:11, Danilo Krummrich wrote: > > > > > > > Mhm, reiterating our internal discussion on the mailing > > > > >

[PATCH v2] drm/sched: Clarify scenarios for separate workqueues

2025-06-12 Thread Philipp Stanner
about pitfalls. Co-authored-by: Danilo Krummrich Signed-off-by: Philipp Stanner --- Changes in v2: - Add new docu section for concurrency in the scheduler. (Sima) - Document what an ordered workqueue passed to the scheduler can be useful for. (Christian, Sima) - Warn more detailed about pote

Re: [RFC PATCH 1/6] drm/sched: Avoid memory leaks with cancel_job() callback

2025-06-12 Thread Philipp Stanner
On Thu, 2025-06-12 at 15:17 +0100, Tvrtko Ursulin wrote: > > On 03/06/2025 10:31, Philipp Stanner wrote: > > Since its inception, the GPU scheduler can leak memory if the > > driver > > calls drm_sched_fini() while there are still jobs in flight. > > > >

[PATCH] drm/sched/tests: Make timedout_job callback a better role model

2025-06-05 Thread Philipp Stanner
r new scheduler users. Therefore, they should approximate the canonical usage as much as possible. Make sure timed out hardware fences get signaled with the appropriate error code. Signed-off-by: Philipp Stanner --- .../gpu/drm/scheduler/tests/mock_scheduler.c | 26 ++- 1

Re: [PATCH] drm/sched: Discourage usage of separate workqueues

2025-06-05 Thread Philipp Stanner
On Wed, 2025-06-04 at 17:07 +0200, Simona Vetter wrote: > On Wed, Jun 04, 2025 at 11:41:25AM +0200, Christian König wrote: > > On 6/4/25 10:16, Philipp Stanner wrote: > > > struct drm_sched_init_args provides the possibility of letting > > > the > > > sche

[PATCH] drm/sched: Discourage usage of separate workqueues

2025-06-04 Thread Philipp Stanner
n the documentation. Suggested-by: Danilo Krummrich Signed-off-by: Philipp Stanner --- include/drm/gpu_scheduler.h | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h index 81dcbfc8c223..11740d745223 100644 --- a/includ

Re: [RFC PATCH 0/6] drm/sched: Avoid memory leaks by canceling job-by-job

2025-06-03 Thread Philipp Stanner
On Tue, 2025-06-03 at 13:27 +0100, Tvrtko Ursulin wrote: > > On 03/06/2025 10:31, Philipp Stanner wrote: > > An alternative version to [1], based on Tvrtko's suggestion from > > [2]. > > > > I tested this for Nouveau. Works. > > > > I'm having

[RFC PATCH 6/6] drm/nouveau: Remove waitque for sched teardown

2025-06-03 Thread Philipp Stanner
nouveau_sched_fence_context_kill() the waitque is not necessary anymore. Remove the waitque. Signed-off-by: Philipp Stanner --- drivers/gpu/drm/nouveau/nouveau_sched.c | 20 +++- drivers/gpu/drm/nouveau/nouveau_sched.h | 9 +++-- drivers/gpu/drm/nouveau/nouveau_uvmm.c | 8 3 files

[RFC PATCH 1/6] drm/sched: Avoid memory leaks with cancel_job() callback

2025-06-03 Thread Philipp Stanner
the hardware fence associated with the job. Afterwards, the scheduler can safely use the established free_job() callback for freeing the job. Implement the new backend_ops callback cancel_job(). Suggested-by: Tvrtko Ursulin Signed-off-by: Philipp Stanner --- drivers/gpu/drm/scheduler

[RFC PATCH 3/6] drm/sched: Warn if pending list is not empty

2025-06-03 Thread Philipp Stanner
drm_sched_fini() can leak jobs under certain circumstances. Warn if that happens. Signed-off-by: Philipp Stanner --- drivers/gpu/drm/scheduler/sched_main.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c

[RFC PATCH 2/6] drm/sched/tests: Implement cancel_job()

2025-06-03 Thread Philipp Stanner
hardware fence. That should be repaired and cleaned up, but it's probably better to do that in a separate series. Signed-off-by: Philipp Stanner --- .../gpu/drm/scheduler/tests/mock_scheduler.c | 71 +++ drivers/gpu/drm/scheduler/tests/sched_tests.h | 4 +- 2 files change

[RFC PATCH 5/6] drm/nouveau: Add new callback for scheduler teardown

2025-06-03 Thread Philipp Stanner
There is a new callback for always tearing the scheduler down in a leak-free, deadlock-free manner. Port Nouveau as its first user by providing the scheduler with a callback that ensures the fence context gets killed in drm_sched_fini(). Signed-off-by: Philipp Stanner --- drivers/gpu/drm

[RFC PATCH 4/6] drm/nouveau: Make fence container helper usable driver-wide

2025-06-03 Thread Philipp Stanner
: Philipp Stanner --- drivers/gpu/drm/nouveau/nouveau_fence.c | 20 +++- drivers/gpu/drm/nouveau/nouveau_fence.h | 6 ++ 2 files changed, 13 insertions(+), 13 deletions(-) diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c index

[RFC PATCH 0/6] drm/sched: Avoid memory leaks by canceling job-by-job

2025-06-03 Thread Philipp Stanner
ps://lore.kernel.org/dri-devel/20250418113211.69956-1-tvrtko.ursu...@igalia.com/ Philipp Stanner (6): drm/sched: Avoid memory leaks with cancel_job() callback drm/sched/tests: Implement cancel_job() drm/sched: Warn if pending list is not empty drm/nouveau: Make fence container helper usable driver-wide

Re: [PATCH] drm/etnaviv: Protect the scheduler's pending list with its lock

2025-06-02 Thread Philipp Stanner
es: 704d3d60fec4 ("drm/etnaviv: don't block scheduler when GPU is > still active") Could also contain a "Closes: " with the link to the appropriate message from thread [1] from below. You might also include "Reported-by: Philipp" since I technically first describ

Re: [PATCH v2 6/8] drm/etnaviv: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset

2025-06-02 Thread Philipp Stanner
On Mon, 2025-06-02 at 08:36 -0300, Maíra Canal wrote: > Hi Philipp, > > On 02/06/25 04:28, Philipp Stanner wrote: > > On Fri, 2025-05-30 at 11:01 -0300, Maíra Canal wrote: > > [...] > > > > diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c > > >

Re: [PATCH v2] drm/sched/tests: Use one lock for fence context

2025-06-02 Thread Philipp Stanner
On Tue, 2025-05-27 at 12:10 +0200, Philipp Stanner wrote: > There is no need for separate locks for single jobs and the entire > scheduler. The dma_fence context can be protected by the scheduler > lock, > allowing for removing the jobs' locks. This simplifies things and > re

Re: [PATCH v2 3/8] drm/sched: Reduce scheduler's timeout for timeout tests

2025-06-02 Thread Philipp Stanner
I'd call that patch sth like "Make timeout unit tests faster". Makes more obvious what it's about. P. On Fri, 2025-05-30 at 11:01 -0300, Maíra Canal wrote: > As more KUnit tests are introduced to evaluate the basic capabilities > of > the `timedout_job()` hook, the test suite will continue to inc

Re: [PATCH v2 7/8] drm/xe: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset

2025-06-02 Thread Philipp Stanner
On Fri, 2025-05-30 at 11:01 -0300, Maíra Canal wrote: > Xe can skip the reset if TDR has fired before the free job worker and > can > also re-arm the timeout timer in some scenarios. Instead of using the > scheduler internals to add the job to the pending list, use the > DRM_GPU_SCHED_STAT_NO_HANG

Re: [PATCH v2 6/8] drm/etnaviv: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset

2025-06-02 Thread Philipp Stanner
On Fri, 2025-05-30 at 11:01 -0300, Maíra Canal wrote: > Etnaviv can skip a hardware reset in two situations: > >   1. TDR has fired before the free-job worker and the timeout is > spurious. >   2. The GPU is still making progress on the front-end and we can > give > the job a chance to comple

Re: [PATCH v2 5/8] drm/v3d: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset

2025-06-02 Thread Philipp Stanner
On Fri, 2025-05-30 at 11:01 -0300, Maíra Canal wrote: > When a CL/CSD job times out, we check if the GPU has made any > progress > since the last timeout. If so, instead of resetting the hardware, we > skip > the reset and allow the timer to be rearmed. This gives long-running > jobs > a chance to

Re: [PATCH v2 2/8] drm/sched: Allow drivers to skip the reset and keep on running

2025-06-02 Thread Philipp Stanner
Hi, thx for the update. Seems to be developing nicely. Some comments below. On Fri, 2025-05-30 at 11:01 -0300, Maíra Canal wrote: > When the DRM scheduler times out, it's possible that the GPU isn't > hung; > instead, a job may still be running, and there may be no valid reason > to > reset the h

Re: [PATCH v11 00/10] Improve gpu_scheduler trace events + UAPI

2025-05-28 Thread Philipp Stanner
On Mon, 2025-05-26 at 14:54 +0200, Pierre-Eric Pelloux-Prayer wrote: > Hi, > > The initial goal of this series was to improve the drm and amdgpu > trace events to be able to expose more of the inner workings of > the scheduler and drivers to developers via tools. > > Then, the series evolved to b

[PATCH v2] drm/sched/tests: Use one lock for fence context

2025-05-27 Thread Philipp Stanner
scheduler lock. Signed-off-by: Philipp Stanner --- Changes in v2: - Make commit message more neutral by stating it's about simplifying the code. (Tvrtko) --- drivers/gpu/drm/scheduler/tests/mock_scheduler.c | 5 ++--- drivers/gpu/drm/scheduler/tests/sched_tests.h | 1 - 2 files change

Re: [PATCH 1/4] drm/sched: optimize drm_sched_job_add_dependency

2025-05-26 Thread Philipp Stanner
On Mon, 2025-05-26 at 13:16 +0200, Christian König wrote: > On 5/26/25 11:34, Philipp Stanner wrote: > > On Mon, 2025-05-26 at 11:25 +0200, Christian König wrote: > > > On 5/23/25 16:16, Danilo Krummrich wrote: > > > > On Fri, May 23, 2025 at 04:11:39PM +0200,

Re: [PATCH 1/4] drm/sched: optimize drm_sched_job_add_dependency

2025-05-26 Thread Philipp Stanner
On Fri, 2025-05-23 at 14:56 +0200, Christian König wrote: > It turned out that we can actually massively optimize here. > > The previous code was horrible inefficient since it constantly > released > and re-acquired the lock of the xarray and started each iteration > from the > base of the array t

Re: [PATCH 1/4] drm/sched: optimize drm_sched_job_add_dependency

2025-05-26 Thread Philipp Stanner
On Mon, 2025-05-26 at 11:25 +0200, Christian König wrote: > On 5/23/25 16:16, Danilo Krummrich wrote: > > On Fri, May 23, 2025 at 04:11:39PM +0200, Danilo Krummrich wrote: > > > On Fri, May 23, 2025 at 02:56:40PM +0200, Christian König wrote: > > > > It turned out that we can actually massively opt

Re: [PATCH 1/4] drm/sched: optimize drm_sched_job_add_dependency a bit

2025-05-26 Thread Philipp Stanner
+Cc Matthew, again :) On Thu, 2025-05-22 at 18:19 +0200, Christian König wrote: > On 5/22/25 16:27, Tvrtko Ursulin wrote: > > > > On 22/05/2025 14:41, Christian König wrote: > > > Since we already iterated over the xarray we know at which index > > > the new > > > entry should be stored. So inste

Re: [PATCH v3 1/5] drm/sched: Fix teardown leaks with waitqueue

2025-05-22 Thread Philipp Stanner
On Thu, 2025-05-22 at 14:37 +0100, Tvrtko Ursulin wrote: > > On 22/05/2025 09:27, Philipp Stanner wrote: > > From: Philipp Stanner > > > > The GPU scheduler currently does not ensure that its pending_list > > is > > empty before performing various other

Re: [PATCH v3 2/5] drm/sched/tests: Port tests to new cleanup method

2025-05-22 Thread Philipp Stanner
On Thu, 2025-05-22 at 15:06 +0100, Tvrtko Ursulin wrote: > > On 22/05/2025 09:27, Philipp Stanner wrote: > > The drm_gpu_scheduler now supports a callback to help > > drm_sched_fini() > > avoid memory leaks. This callback instructs the driver to signal > > a

Re: [PATCH] drm/sched/tests: Use one lock for fence context

2025-05-22 Thread Philipp Stanner
On Wed, 2025-05-21 at 11:24 +0100, Tvrtko Ursulin wrote: > > On 21/05/2025 11:04, Philipp Stanner wrote: > > When the unit tests were implemented, each scheduler job got its > > own, > > distinct lock. This is not how dma_fence context locking rules are > > t

Re: [PATCH 2/2] drm/nouveau: Don't signal when killing the fence context

2025-05-22 Thread Philipp Stanner
On Thu, 2025-05-22 at 15:24 +0200, Christian König wrote: > On 5/22/25 15:16, Philipp Stanner wrote: > > On Thu, 2025-05-22 at 15:09 +0200, Christian König wrote: > > > On 5/22/25 14:59, Danilo Krummrich wrote: > > > > On Thu, May 22, 2025 at 02:34:33PM +0200,

Re: [PATCH 2/2] drm/nouveau: Don't signal when killing the fence context

2025-05-22 Thread Philipp Stanner
On Thu, 2025-05-22 at 15:09 +0200, Christian König wrote: > On 5/22/25 14:59, Danilo Krummrich wrote: > > On Thu, May 22, 2025 at 02:34:33PM +0200, Christian König wrote: > > > See all the functions inside include/linux/dma-fence.h can be > > > used by everybody. It's basically the public interface

Re: [PATCH 2/2] drm/nouveau: Don't signal when killing the fence context

2025-05-22 Thread Philipp Stanner
On Thu, 2025-05-22 at 14:34 +0200, Christian König wrote: > On 5/22/25 14:20, Philipp Stanner wrote: > > On Thu, 2025-05-22 at 14:06 +0200, Christian König wrote: > > > On 5/22/25 13:25, Philipp Stanner wrote: > > > > dma_fence_is_signa

Re: [PATCH 2/2] drm/nouveau: Don't signal when killing the fence context

2025-05-22 Thread Philipp Stanner
On Thu, 2025-05-22 at 14:06 +0200, Christian König wrote: > On 5/22/25 13:25, Philipp Stanner wrote: > > dma_fence_is_signaled_locked(), which is used in > > nouveau_fence_context_kill(), can signal fences below the surface > > through a callback. > > > > The

[PATCH 1/2] dma-buf: Add __dma_fence_is_signaled()

2025-05-22 Thread Philipp Stanner
ed. Use it internally. Suggested-by: Tvrtko Ursulin Signed-off-by: Philipp Stanner --- include/linux/dma-fence.h | 24 ++-- 1 file changed, 22 insertions(+), 2 deletions(-) diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h index 48b5202c531d..ac951a54a007 10

[PATCH 2/2] drm/nouveau: Don't signal when killing the fence context

2025-05-22 Thread Philipp Stanner
which only checks, never signals. Signed-off-by: Philipp Stanner --- drivers/gpu/drm/nouveau/nouveau_fence.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c index d5654e26d5bc..993b3dcb5db0

[PATCH v3 5/5] drm/nouveau: Remove waitque for sched teardown

2025-05-22 Thread Philipp Stanner
nouveau_sched_fence_context_kill() the waitque is not necessary anymore. Remove the waitque. Signed-off-by: Philipp Stanner --- drivers/gpu/drm/nouveau/nouveau_sched.c | 20 +++- drivers/gpu/drm/nouveau/nouveau_sched.h | 9 +++-- drivers/gpu/drm/nouveau/nouveau_uvmm.c | 8 3 files

[PATCH v3 4/5] drm/nouveau: Add new callback for scheduler teardown

2025-05-22 Thread Philipp Stanner
There is a new callback for always tearing the scheduler down in a leak-free, deadlock-free manner. Port Nouveau as its first user by providing the scheduler with a callback that ensures the fence context gets killed in drm_sched_fini(). Signed-off-by: Philipp Stanner --- drivers/gpu/drm

[PATCH v3 0/5] Fix memory leaks in drm_sched_fini()

2025-05-22 Thread Philipp Stanner
ovide users with a more reliable, clean scheduler API. Philipp Philipp Stanner (5): drm/sched: Fix teardown leaks with waitqueue drm/sched/tests: Port tests to new cleanup method drm/sched: Warn if pending list is not empty drm/nouveau: Add new callback for scheduler teardown drm/nouveau: Remove

[PATCH v3 3/5] drm/sched: Warn if pending list is not empty

2025-05-22 Thread Philipp Stanner
drm_sched_fini() can leak jobs under certain circumstances. Warn if that happens. Signed-off-by: Philipp Stanner --- drivers/gpu/drm/scheduler/sched_main.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c

[PATCH v3 2/5] drm/sched/tests: Port tests to new cleanup method

2025-05-22 Thread Philipp Stanner
a new error field for the fence error. Keep the job status as DRM_MOCK_SCHED_JOB_DONE for now, since there is no party for which checking for a CANCELED status would be useful currently. Signed-off-by: Philipp Stanner --- .../gpu/drm/scheduler/tests/mock_scheduler.c | 67

[PATCH v3 1/5] drm/sched: Fix teardown leaks with waitqueue

2025-05-22 Thread Philipp Stanner
From: Philipp Stanner The GPU scheduler currently does not ensure that its pending_list is empty before performing various other teardown tasks in drm_sched_fini(). If there are still jobs in the pending_list, this is problematic because after scheduler teardown, no one will call

[PATCH] drm/sched/tests: Use one lock for fence context

2025-05-21 Thread Philipp Stanner
dma_fence rules, e.g., ensuring that only one fence gets signaled at a time. Use the fence context (scheduler) lock for the jobs. Signed-off-by: Philipp Stanner --- drivers/gpu/drm/scheduler/tests/mock_scheduler.c | 5 ++--- drivers/gpu/drm/scheduler/tests/sched_tests.h | 1 - 2 files changed

Re: [PATCH 1/3] drm/sched: add drm_sched_prealloc_dependency_slots v3

2025-05-21 Thread Philipp Stanner
On Tue, 2025-05-20 at 17:15 +0100, Tvrtko Ursulin wrote: > > On 19/05/2025 10:04, Philipp Stanner wrote: > > On Mon, 2025-05-19 at 09:51 +0100, Tvrtko Ursulin wrote: > > > > > > On 16/05/2025 18:16, Philipp Stanner wrote: > > > > On Fri, 2025-

Re: [PATCH v9 02/10] drm/sched: store the drm client_id in drm_sched_fence

2025-05-19 Thread Philipp Stanner
On Mon, 2025-05-19 at 13:02 +0200, Pierre-Eric Pelloux-Prayer wrote: > > > Le 15/05/2025 à 08:53, Pierre-Eric Pelloux-Prayer a écrit : > > Hi, > > > > Le 14/05/2025 à 14:44, Philipp Stanner a écrit : > > > On Thu, 2025-04-24 at 10:38 +0200, Pierre-Eric Pell

Re: [PATCH 1/3] drm/sched: add drm_sched_prealloc_dependency_slots v3

2025-05-19 Thread Philipp Stanner
On Mon, 2025-05-19 at 09:51 +0100, Tvrtko Ursulin wrote: > > On 16/05/2025 18:16, Philipp Stanner wrote: > > On Fri, 2025-05-16 at 15:30 +0100, Tvrtko Ursulin wrote: > > > > > > On 16/05/2025 14:38, Philipp Stanner wrote: > > > > On Fri, 2025-

Re: [PATCH 1/3] drm/sched: add drm_sched_prealloc_dependency_slots v3

2025-05-16 Thread Philipp Stanner
On Fri, 2025-05-16 at 15:30 +0100, Tvrtko Ursulin wrote: > > On 16/05/2025 14:38, Philipp Stanner wrote: > > On Fri, 2025-05-16 at 13:10 +0100, Tvrtko Ursulin wrote: > > > > > > On 16/05/2025 12:53, Tvrtko Ursulin wrote: > > > > > > > > On

Re: [PATCH 1/3] drm/sched: add drm_sched_prealloc_dependency_slots v3

2025-05-16 Thread Philipp Stanner
On Fri, 2025-05-16 at 13:10 +0100, Tvrtko Ursulin wrote: > > On 16/05/2025 12:53, Tvrtko Ursulin wrote: > > > > On 16/05/2025 08:28, Philipp Stanner wrote: > > > On Thu, 2025-05-15 at 17:17 +0100, Tvrtko Ursulin wrote: > > > > > >

Re: [PATCH v2 2/6] drm/sched: Prevent teardown waitque from blocking too long

2025-05-16 Thread Philipp Stanner
On Fri, 2025-05-16 at 10:33 +0100, Tvrtko Ursulin wrote: > > On 24/04/2025 10:55, Philipp Stanner wrote: > > The waitqueue that ensures that drm_sched_fini() blocks until the > > pending_list has become empty could theoretically cause that > > function to > > bl

Re: [PATCH] drm/scheduler: signal scheduled fence when kill job

2025-05-16 Thread Philipp Stanner
that will never be resolved. Fix this issue by ensuring > that   > scheduled fences are properly signaled when an entity is killed, > allowing   > dependent applications to continue execution. That sounds perfect, yes, Thx. Reviewed-by: Philipp Stanner P. > > Thanks, >

Re: [PATCH 1/3] drm/sched: add drm_sched_prealloc_dependency_slots v3

2025-05-16 Thread Philipp Stanner
On Thu, 2025-05-15 at 17:17 +0100, Tvrtko Ursulin wrote: > > On 15/05/2025 16:00, Christian König wrote: > > Sometimes drivers need to be able to submit multiple jobs which > > depend on > > each other to different schedulers at the same time, but using > > drm_sched_job_add_dependency() can't fai

Re: [PATCH v4 04/40] drm/sched: Add enqueue credit limit

2025-05-15 Thread Philipp Stanner
Hello, On Wed, 2025-05-14 at 09:59 -0700, Rob Clark wrote: > From: Rob Clark > > Similar to the existing credit limit mechanism, but applying to jobs > enqueued to the scheduler but not yet run. > > The use case is to put an upper bound on preallocated, and > potentially > unneeded, pgtable pag

Re: [PATCH] drm/scheduler: signal scheduled fence when kill job

2025-05-15 Thread Philipp Stanner
ssue simply that the fence might be dropped unsignaled, being a bug by definition? Needs to be written down. Grammar is also a bit too broken. And running the unit tests before pushing is probably also a good idea. > > > > Signed-off-by: Lin.Cao Acked-by: Philipp Stanner > > Revie

Re: [PATCH v9 09/10] drm/doc: document some tracepoints as uAPI

2025-05-14 Thread Philipp Stanner
On Thu, 2025-04-24 at 10:38 +0200, Pierre-Eric Pelloux-Prayer wrote: > This commit adds a document section in drm-uapi.rst about > tracepoints, > and mark the events gpu_scheduler_trace.h as stable uAPI. > > The goal is to explicitly state that tools can rely on the fields, > formats and semantics

Re: [PATCH v9 08/10] drm: get rid of drm_sched_job::id

2025-05-14 Thread Philipp Stanner
On Thu, 2025-04-24 at 10:38 +0200, Pierre-Eric Pelloux-Prayer wrote: > Its only purpose was for trace events, but jobs can already be > uniquely identified using their fence. > > The downside of using the fence is that it's only available > after 'drm_sched_job_arm' was called which is true for al

Re: [PATCH v9 05/10] drm/sched: trace dependencies for gpu jobs

2025-05-14 Thread Philipp Stanner
nit: title: s/gpu/GPU We also mostly start with an upper case letter after the :, but JFYI, it's not a big deal. P. On Thu, 2025-04-24 at 10:38 +0200, Pierre-Eric Pelloux-Prayer wrote: > We can't trace dependencies from drm_sched_job_add_dependency > because when it's called the job's fence is

Re: [PATCH v9 02/10] drm/sched: store the drm client_id in drm_sched_fence

2025-05-14 Thread Philipp Stanner
On Thu, 2025-04-24 at 10:38 +0200, Pierre-Eric Pelloux-Prayer wrote: > This will be used in a later commit to trace the drm client_id in > some of the gpu_scheduler trace events. > > This requires changing all the users of drm_sched_job_init to > add an extra parameter. > > The newly added drm_cl

Re: [PATCH v2 6/6] drm/sched: Port unit tests to new cleanup design

2025-05-14 Thread Philipp Stanner
On Wed, 2025-05-14 at 09:30 +0100, Tvrtko Ursulin wrote: > > On 12/05/2025 09:00, Philipp Stanner wrote: > > On Thu, 2025-05-08 at 13:51 +0100, Tvrtko Ursulin wrote: > > > > > > Hi Philipp, > > > > > > On 08/05/2025 12:03, Philipp Stanner

[PATCH v3] drm/vmwgfx: Use non-hybrid PCI devres API

2025-05-14 Thread Philipp Stanner
-managed pcim_request_all_regions(). Signed-off-by: Philipp Stanner Reviewed-by: Zack Rusin --- Changes in v3: - Use the correct driver name in the commit message. (Zack) Changes in v2: - Fix unused variable error. --- drivers/gpu/drm/vmwgfx/vmwgfx_drv.c | 14 +++--- 1 file changed, 3

Re: [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running

2025-05-13 Thread Philipp Stanner
On Sat, 2025-05-03 at 17:59 -0300, Maíra Canal wrote: > When the DRM scheduler times out, it's possible that the GPU isn't > hung; > instead, a job may still be running, and there may be no valid reason > to > reset the hardware. This can occur in two situations: > >   1. The GPU exposes some mech

Re: [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running

2025-05-12 Thread Philipp Stanner
On Mon, 2025-05-12 at 16:09 +0200, Philipp Stanner wrote: > On Mon, 2025-05-12 at 11:04 -0300, Maíra Canal wrote: > > Hi Philipp, > > > > On 12/05/25 08:13, Philipp Stanner wrote: > > > On Tue, 2025-05-06 at 07:32 -0700, Matthew Brost wrote: > > > >

Re: [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running

2025-05-12 Thread Philipp Stanner
On Mon, 2025-05-12 at 11:04 -0300, Maíra Canal wrote: > Hi Philipp, > > On 12/05/25 08:13, Philipp Stanner wrote: > > On Tue, 2025-05-06 at 07:32 -0700, Matthew Brost wrote: > > > On Mon, May 05, 2025 at 07:41:09PM -0700, Matthew Brost wrote: > > > > On S

Re: [RFC v4 16/16] drm/sched: Embed run queue singleton into the scheduler

2025-05-12 Thread Philipp Stanner
o Ursulin > Cc: Christian König > Cc: Danilo Krummrich > Cc: Matthew Brost > Cc: Philipp Stanner > --- >  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  |  6 ++-- >  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  6 ++-- >  drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  5 +++- >

Re: [RFC v4 12/16] drm/sched: Remove idle entity from tree

2025-05-12 Thread Philipp Stanner
every > popped job. That there is no need to do so doesn't imply that you can't keep them around. The commit message doesn't make the motivation clear > > Signed-off-by: Tvrtko Ursulin > Cc: Christian König > Cc: Danilo Krummrich > Cc: Matthew Brost > C

Re: [RFC v4 10/16] drm/sched: Free all finished jobs at once

2025-05-12 Thread Philipp Stanner
> completed jobs as soon as possible so the metric is most up to date > when > view from the submission side of things. > > Signed-off-by: Tvrtko Ursulin > Cc: Christian König > Cc: Danilo Krummrich > Cc: Matthew Brost > Cc: Philipp Stanner > ---

Re: [RFC v4 05/16] drm/sched: Consolidate drm_sched_job_timedout

2025-05-12 Thread Philipp Stanner
he function. Same here, that's a good candidate for a separate patch / series. P. > > Signed-off-by: Tvrtko Ursulin > Cc: Christian König > Cc: Danilo Krummrich > Cc: Matthew Brost > Cc: Philipp Stanner > --- >  drivers/gpu/drm/scheduler/sched_main.c | 37 +++

Re: [RFC v4 04/16] drm/sched: Avoid double re-lock on the job free path

2025-05-12 Thread Philipp Stanner
heduling policy, not general other improvements. P. > > Signed-off-by: Tvrtko Ursulin > Cc: Christian König > Cc: Danilo Krummrich > Cc: Matthew Brost > Cc: Philipp Stanner > --- >  drivers/gpu/drm/scheduler/sched_main.c | 39 +++- > -- >  1

Re: [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running

2025-05-12 Thread Philipp Stanner
On Tue, 2025-05-06 at 07:32 -0700, Matthew Brost wrote: > On Mon, May 05, 2025 at 07:41:09PM -0700, Matthew Brost wrote: > > On Sat, May 03, 2025 at 05:59:52PM -0300, Maíra Canal wrote: > > > When the DRM scheduler times out, it's possible that the GPU > > > isn't hung; > > > instead, a job may sti

Re: [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running

2025-05-12 Thread Philipp Stanner
On Sat, 2025-05-03 at 17:59 -0300, Maíra Canal wrote: > When the DRM scheduler times out, it's possible that the GPU isn't > hung; > instead, a job may still be running, and there may be no valid reason > to > reset the hardware. This can occur in two situations: > >   1. The GPU exposes some mech

Re: [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running

2025-05-12 Thread Philipp Stanner
On Wed, 2025-05-07 at 13:50 +0100, Tvrtko Ursulin wrote: > > On 07/05/2025 13:33, Maíra Canal wrote: > > Hi Tvrtko, > > > > Thanks for the review! > > > > On 06/05/25 08:32, Tvrtko Ursulin wrote: > > > > > > On 03/05/2025 21:59, Maíra Canal wrote: > > > > When the DRM scheduler times out, it's

Re: [PATCH v2 6/6] drm/sched: Port unit tests to new cleanup design

2025-05-12 Thread Philipp Stanner
On Thu, 2025-05-08 at 13:51 +0100, Tvrtko Ursulin wrote: > > Hi Philipp, > > On 08/05/2025 12:03, Philipp Stanner wrote: > > On Thu, 2025-04-24 at 11:55 +0200, Philipp Stanner wrote: > > > The unit tests so far took care manually of avoiding memory leaks > > >

Re: [PATCH] drm/sched: Fix UAF in drm_sched_fence_get_timeline_name()

2025-05-12 Thread Philipp Stanner
Hi, On Fri, 2025-05-09 at 14:29 -0700, Rob Clark wrote: > From: Rob Clark > > The fence can outlive the sched, so it is not safe to dereference the > sched in drm_sched_fence_get_timeline_name() Thx for the fix. Looks correct to me. Some nits > > Signed-off-by: Rob Clark This is clearly a b

Re: [PATCH v2] drm/vmgfx: Use non-hybrid PCI devres API

2025-05-09 Thread Philipp Stanner
On Thu, 2025-05-08 at 11:39 -0400, Zack Rusin wrote: > On Thu, May 8, 2025 at 6:40 AM Philipp Stanner > wrote: > > > > On Wed, 2025-04-23 at 14:06 +0200, Philipp Stanner wrote: > > > vmgfx enables its PCI device with pcim_enable_device(). This, > >

Re: [PATCH] drm/cirrus: Use non-hybrid PCI devres API

2025-05-09 Thread Philipp Stanner
On Thu, 2025-05-08 at 12:44 +0200, Javier Martinez Canillas wrote: > Philipp Stanner writes: > > Hello Philipp, > > > On Tue, 2025-04-22 at 23:51 +0200, Javier Martinez Canillas wrote: > > > Philipp Stanner writes: > > > > > > Hello Philipp, >

Re: [PATCH v2] drm/vmgfx: Use non-hybrid PCI devres API

2025-05-08 Thread Philipp Stanner
On Thu, 2025-05-08 at 11:39 -0400, Zack Rusin wrote: > On Thu, May 8, 2025 at 6:40 AM Philipp Stanner > wrote: > > > > On Wed, 2025-04-23 at 14:06 +0200, Philipp Stanner wrote: > > > vmgfx enables its PCI device with pcim_enable_device(). This, > >

Re: [PATCH v2 6/6] drm/sched: Port unit tests to new cleanup design

2025-05-08 Thread Philipp Stanner
On Thu, 2025-04-24 at 11:55 +0200, Philipp Stanner wrote: > The unit tests so far took care manually of avoiding memory leaks > that > might have occurred when calling drm_sched_fini(). > > The scheduler now takes care by itself of avoiding memory leaks if > the > driver

Re: [PATCH v2] drm/vmgfx: Use non-hybrid PCI devres API

2025-05-08 Thread Philipp Stanner
On Wed, 2025-04-23 at 14:06 +0200, Philipp Stanner wrote: > vmgfx enables its PCI device with pcim_enable_device(). This, > implicitly, switches the function pci_request_regions() into managed > mode, where it becomes a devres function. > > The PCI subsystem wants to remove thi

Re: [PATCH] drm/cirrus: Use non-hybrid PCI devres API

2025-05-08 Thread Philipp Stanner
On Tue, 2025-04-22 at 23:51 +0200, Javier Martinez Canillas wrote: > Philipp Stanner writes: > > Hello Philipp, > > > cirrus enables its PCI device with pcim_enable_device(). This, > > implicitly, switches the function pci_request_regions() into > > managed >

Re: [PATCH 4/4] drm/nouveau: Check dma_fence in canonical way

2025-05-08 Thread Philipp Stanner
On Mon, 2025-04-28 at 16:45 +0200, Christian König wrote: > On 4/24/25 15:02, Philipp Stanner wrote: > > In nouveau_fence_done(), a fence is checked for being signaled by > > manually evaluating the base fence's bits. This can be done in a > > canonical manner thr

Re: [PATCH v2] drm/sched: fix the warning in drm_sched_job_done

2025-05-08 Thread Philipp Stanner
> > -Original Message- > From: Koenig, Christian > Sent: Tuesday, April 29, 2025 12:49 PM > To: Khatri, Sunil ; > dri-devel@lists.freedesktop.org; Danilo Krummrich ; > Philipp Stanner > Cc: Deucher, Alexander ; Tvrtko Ursulin > ; Pelloux-Prayer, Pierre-Eric >

[PATCH 2/4] drm/nouveau: Simplify calls to nvif_event_block()

2025-04-24 Thread Philipp Stanner
nouveau_fence_signal() returns a de-facto boolean to indicate when nvif_event_block() shall be called. The code can be made more compact and readable by calling nvif_event_block() in nouveau_fence_update() directly. Make those calls in nouveau_fence.c more canonical. Signed-off-by: Philipp

[PATCH 1/4] drm/nouveau: nouveau_fence: Standardize list iterations

2025-04-24 Thread Philipp Stanner
nouveau_fence.c iterates over lists in a non-canonical way. Since the operations done are just basic for-each-loops and list-empty checks, they should be written in the standard form. Use standard list operations. Signed-off-by: Philipp Stanner --- drivers/gpu/drm/nouveau/nouveau_fence.c | 21

Re: [PATCH 4/4] drm/nouveau: Check dma_fence in canonical way

2025-04-24 Thread Philipp Stanner
On Thu, 2025-04-24 at 15:24 +0200, Danilo Krummrich wrote: > On 4/24/25 3:02 PM, Philipp Stanner wrote: > > In nouveau_fence_done(), a fence is checked for being signaled by > > manually evaluating the base fence's bits. This can be done in a > > canonical manner thr

[PATCH 3/4] drm/nouveau: Simplify nouveau_fence_done()

2025-04-24 Thread Philipp Stanner
nouveau_fence_done() contains an if branch that checks whether a nouveau_fence has either of the two existing nouveau_fence backend ops, which will always evaluate to true. Remove the surplus check. Signed-off-by: Philipp Stanner --- drivers/gpu/drm/nouveau/nouveau_fence.c | 24

[PATCH 4/4] drm/nouveau: Check dma_fence in canonical way

2025-04-24 Thread Philipp Stanner
In nouveau_fence_done(), a fence is checked for being signaled by manually evaluating the base fence's bits. This can be done in a canonical manner through dma_fence_is_signaled(). Replace the bit-check with dma_fence_is_signaled(). Signed-off-by: Philipp Stanner --- drivers/gpu/drm/no

[PATCH 0/4] drm/nouveau: Simplify nouveau_fence.c

2025-04-24 Thread Philipp Stanner
/ Philipp Stanner (4): drm/nouveau: nouveau_fence: Standardize list iterations drm/nouveau: Simplify calls to nvif_event_block() drm/nouveau: Simplify nouveau_fence_done() drm/nouveau: Check dma_fence in canonical way drivers/gpu/drm/nouveau/nouveau_fence.c | 72 +++-- 1 file

[PATCH v2 4/6] drm/nouveau: Add new callback for scheduler teardown

2025-04-24 Thread Philipp Stanner
There is a new callback for always tearing the scheduler down in a leak-free, deadlock-free manner. Port Nouveau as its first user by providing the scheduler with a callback that ensures the fence context gets killed in drm_sched_fini(). Signed-off-by: Philipp Stanner --- drivers/gpu/drm

[PATCH v2 6/6] drm/sched: Port unit tests to new cleanup design

2025-04-24 Thread Philipp Stanner
the unit tests. Remove the manual cleanup code. Signed-off-by: Philipp Stanner --- .../gpu/drm/scheduler/tests/mock_scheduler.c | 34 --- 1 file changed, 21 insertions(+), 13 deletions(-) diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler

[PATCH v2 3/6] drm/sched: Warn if pending list is not empty

2025-04-24 Thread Philipp Stanner
drm_sched_fini() can leak jobs under certain circumstances. Warn if that happens. Signed-off-by: Philipp Stanner --- drivers/gpu/drm/scheduler/sched_main.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c

[PATCH v2 2/6] drm/sched: Prevent teardown waitque from blocking too long

2025-04-24 Thread Philipp Stanner
callback is not implemented. Suggested-by: Danilo Krummrich Signed-off-by: Philipp Stanner --- drivers/gpu/drm/scheduler/sched_main.c | 47 +- include/drm/gpu_scheduler.h| 11 ++ 2 files changed, 42 insertions(+), 16 deletions(-) diff --git a/drivers/gpu

[PATCH v2 5/6] drm/nouveau: Remove waitque for sched teardown

2025-04-24 Thread Philipp Stanner
nouveau_sched_fence_context_kill() the waitque is not necessary anymore. Remove the waitque. Signed-off-by: Philipp Stanner --- drivers/gpu/drm/nouveau/nouveau_sched.c | 20 +++- drivers/gpu/drm/nouveau/nouveau_sched.h | 9 +++-- drivers/gpu/drm/nouveau/nouveau_uvmm.c | 8 3 files

[PATCH v2 1/6] drm/sched: Fix teardown leaks with waitqueue

2025-04-24 Thread Philipp Stanner
From: Philipp Stanner The GPU scheduler currently does not ensure that its pending_list is empty before performing various other teardown tasks in drm_sched_fini(). If there are still jobs in the pending_list, this is problematic because after scheduler teardown, no one will call

[PATCH v2 0/6] drm/sched: Fix memory leaks in drm_sched_fini()

2025-04-24 Thread Philipp Stanner
ks fine and solves the problem (though we did discover an unrelated problem inside Nouveau in the process). It also works with the unit tests. I'm looking forward to your input and feedback. I really hope we can work this RFC into something that can provide users with a more reliable, clean

[PATCH v2] drm/vmgfx: Use non-hybrid PCI devres API

2025-04-23 Thread Philipp Stanner
-managed pcim_request_all_regions(). Signed-off-by: Philipp Stanner --- Changes in v2: - Fix unused variable error. --- drivers/gpu/drm/vmwgfx/vmwgfx_drv.c | 14 +++--- 1 file changed, 3 insertions(+), 11 deletions(-) diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.c b/drivers/gpu/drm

Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty

2025-04-22 Thread Philipp Stanner
On Tue, 2025-04-22 at 16:08 +0200, Danilo Krummrich wrote: > On Tue, Apr 22, 2025 at 02:39:21PM +0100, Tvrtko Ursulin wrote: > > > > On 22/04/2025 13:32, Danilo Krummrich wrote: > > > On Tue, Apr 22, 2025 at 01:07:47PM +0100, Tvrtko Ursulin wrote: > > > > > > > > On 22/04/2025 12:13, Danilo Krumm
