On 06/05/2025 13:46, Maíra Canal wrote:
Hi Tvrtko,

Thanks for your review!

On 06/05/25 08:49, Tvrtko Ursulin wrote:

On 03/05/2025 21:59, Maíra Canal wrote:
Currently, if we add the assertions presented in this commit to the mock
scheduler, we will see the following output:

[15:47:08] ============== [PASSED] drm_sched_basic_tests ==============
[15:47:08] ======== drm_sched_basic_timeout_tests (1 subtest) =========
[15:47:08] # drm_sched_basic_timeout: ASSERTION FAILED at drivers/ gpu/ drm/scheduler/tests/tests_basic.c:246 [15:47:08] Expected list_empty(&sched->job_list) to be true, but is false
[15:47:08] [FAILED] drm_sched_basic_timeout
[15:47:08] # module: drm_sched_tests

This occurs because `mock_sched_timedout_job()` doesn't properly handle
the hang. From the DRM sched documentation, `drm_sched_stop()` and
`drm_sched_start()` are typically used for reset recovery. If these
functions are not used, the offending job won't be freed and should be
freed by the caller.

Currently, the mock scheduler doesn't use the functions provided by the
API, nor does it handle the freeing of the job. As a result, the job isn't
removed from the job list.

For the record the job does gets freed via the kunit managed allocation.

Sorry, I didn't express myself correctly. Indeed, it is. I meant that
the DRM scheduler didn't free the job.


It was a design choice for this test to be a *strict* unit test which tests only a _single_ thing. And that is that the timedout_job() hook gets called. As such the hook was implemented to satisfy that single requirement only.


What do you think about checking that `sched->job_list` won't be empty?

I wanted to add such assertion to make sure that the behavior of the
timeout won't change in future (e.g. a patch makes a change that calls
`free_job()` for the guilty job at timeout). Does it make sense to you?

Where would that assert be?

But I also do not oppose making it test multiple things in one test per se.

This commit mocks a GPU reset by stopping the scheduler affected by the
reset, waiting a couple of microseconds to mimic a hardware reset, and
then restart the affected scheduler.

Signed-off-by: Maíra Canal <mca...@igalia.com>
---
  drivers/gpu/drm/scheduler/tests/mock_scheduler.c | 10 ++++++++++
  drivers/gpu/drm/scheduler/tests/tests_basic.c    |  3 +++
  2 files changed, 13 insertions(+)


[...]

diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/drivers/ gpu/drm/scheduler/tests/tests_basic.c index 7230057e0594c6246f02608f07fcb1f8d738ac75..8f960f0fd31d0af7873f410ceba2d636f58a5474 100644
--- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
+++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
@@ -241,6 +241,9 @@ static void drm_sched_basic_timeout(struct kunit *test)
              job->flags & DRM_MOCK_SCHED_JOB_TIMEDOUT,
              DRM_MOCK_SCHED_JOB_TIMEDOUT);
+    KUNIT_ASSERT_TRUE(test, list_empty(&sched->job_list));

Hmm I think this assert could be racy because it appears to rely on the free worker to run and cleanup the "finished" job in the window between drm_mock_sched_job_wait_finished() (or drm_sched_start(), depends how you look at it) and here. Am I missing something?

 From what I understand, the job is freed by the timeout worker [1] after
`drm_sched_stop()` marked the job as guilty.

Therefore, if the timeout was called (and we asserted that through
`job->flags`), we can be sure that the job was freed.

[1] https://gitlab.freedesktop.org/drm/misc/kernel/-/blob/drm-misc-next/ drivers/gpu/drm/scheduler/sched_main.c#L568

Hm I thought it would end up on the dma_fence_remove_callback() == true branch in drm_sched_stop().

I gave it a quick spin locally and that indeed appears to be the case. So AFAICT it does rely on the free worker to have had executed before the assert.

Regards,

Tvrtko


Best Regards,
- Maíra


Regards,

Tvrtko

+    KUNIT_ASSERT_TRUE(test, list_empty(&sched->done_list));
 > +>       drm_mock_sched_entity_free(entity);
  }




Reply via email to