On Thu, 2025-03-06 at 17:10 +0000, Tvrtko Ursulin wrote:
>
> On 06/03/2025 15:18, Tvrtko Ursulin wrote:
> >
> > On 06/03/2025 14:56, Tvrtko Ursulin wrote:
> > >
> > > On 06/03/2025 12:37, Philipp Stanner wrote:
> > > > On Tue, 2025-03-04 at 13:10 +0000, Tvrtko Ursulin wrote:
> > > > > There has repeatedly been quite a bit of apprehension when any
> > > > > change to the DRM scheduler is proposed, with two main reasons
> > > > > being code base is considered fragile, not well understood and
> > > > > not very well documented, and secondly the lack of systematic
> > > > > testing outside the vendor specific tests suites and/or test
> > > > > farms.
> > > > >
> > > > > This series is an attempt to dislodge this status quo by adding
> > > > > some unit tests using the kunit framework.
> > > > >
> > > > > General approach is that there is a mock "hardware" backend
> > > > > which can be controlled from tests, which in turn allows
> > > > > exercising various scheduler code paths.
> > > > >
> > > > > Only some simple basic tests get added in the series and
> > > > > hopefully it is easy to understand what tests are doing.
> > > > >
> > > > > An obligatory "screenshot" for reference:
> > > > >
> > > > > [14:29:37] ============ drm_sched_basic_tests (3 subtests) ============
> > > > > [14:29:38] [PASSED] drm_sched_basic_submit
> > > > > [14:29:38] ================== drm_sched_basic_test ===================
> > > > > [14:29:38] [PASSED] A queue of jobs in a single entity
> > > > > [14:29:38] [PASSED] A chain of dependent jobs across multiple entities
> > > > > [14:29:38] [PASSED] Multiple independent job queues
> > > > > [14:29:38] [PASSED] Multiple inter-dependent job queues
> > > > > [14:29:38] ============== [PASSED] drm_sched_basic_test ===============
> > > > > [14:29:38] [PASSED] drm_sched_basic_entity_cleanup
> > > > > [14:29:38] ============== [PASSED] drm_sched_basic_tests ==============
> > > > > [14:29:38] ======== drm_sched_basic_timeout_tests (1 subtest) =========
> > > > > [14:29:40] [PASSED] drm_sched_basic_timeout
> > > > > [14:29:40] ========== [PASSED] drm_sched_basic_timeout_tests ==========
> > > > > [14:29:40] ======= drm_sched_basic_priority_tests (2 subtests) ========
> > > > > [14:29:42] [PASSED] drm_sched_priorities
> > > > > [14:29:42] [PASSED] drm_sched_change_priority
> > > > > [14:29:42] ========= [PASSED] drm_sched_basic_priority_tests ==========
> > > > > [14:29:43] ====== drm_sched_basic_modify_sched_tests (1 subtest) ======
> > > > > [14:29:43] [PASSED] drm_sched_test_modify_sched
> > > > > [14:29:43] ======= [PASSED] drm_sched_basic_modify_sched_tests ========
> > > > > [14:29:43] ============================================================
> > > > > [14:29:43] Testing complete. Ran 10 tests: passed: 10
> > > > > [14:29:43] Elapsed time: 13.330s total, 0.001s configuring, 4.005s building, 9.276s running
> > > >
> > > > Yo,
> > > >
> > > > so I tried to test this all this in QEMU and I am encountering
> > > > some explosions when I activate the scheduler tests. Just DRM
> > > > tests boot fine.
> > > >
> > > > I'm using a kernel on relatively current drm-misc-next:
> > > > 44d2f310f008
> > > >
> > > > I apply your series, then
> > > > make defconfig
> > > > make menuconfig # switch on kunit framework and scheduler tests
> > > > install everything + initramfs
> > > >
> > > > Boot then causes errors as below. Just using the DRM kunit tests
> > > > works fine.
> > > >
> > > > Excerpt of the first fault:
> > > >
> > > > [ 1.040513] # kunit_device: pass:3 fail:0 skip:0 total:3
> > > > [ 1.040867] # Totals: pass:3 fail:0 skip:0 total:3
> > > > [ 1.041296] ok 7 kunit_device
> > > > [ 1.041936] KTAP version 1
> > > > [ 1.042186] # Subtest: kunit_fault
> > > > [ 1.042517] # module: kunit_test
> > > > [ 1.042517] 1..1
> > > > [ 1.043147] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > > > [ 1.043765] #PF: supervisor write access in kernel mode
> > > > [ 1.044189] #PF: error_code(0x0002) - not-present page
> > > > [ 1.044617] PGD 0 P4D 0
> > > > [ 1.044818] Oops: Oops: 0002 [#1] PREEMPT SMP PTI
> > > > [ 1.045380] CPU: 7 UID: 0 PID: 214 Comm: kunit_try_catch Tainted: G N 6.14.0-rc4-00387-g33e4632926a0 #8
> > > > [ 1.046262] Tainted: [N]=TEST
> > > > [ 1.046521] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
> > > > [ 1.047224] RIP: 0010:kunit_test_null_dereference+0x37/0x80
> > > > [ 1.047706] Code: 80 b5 49 c7 c0 50 7f 56 b4 ba 01 00 00 00 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 48 8d 4c 24 07 48 c7 c6 80 8a 26 b5 <c7> 04 25 00 00 00 00 00 00 00 00 48 c7 87 70 01 00 00 a6 e9 8c b5
> > > > [ 1.049204] RSP: 0000:ffffa609807c7ec8 EFLAGS: 00010246
> > > > [ 1.049642] RAX: 0000000000000000 RBX: ffff91d982623000 RCX: ffffa609807c7ecf
> > > > [ 1.050213] RDX: 0000000000000001 RSI: ffffffffb5268a80 RDI: ffffa60980013c68
> > > > [ 1.050799] RBP: ffff91d98105afc0 R08: ffffffffb4567f50 R09: ffffffffb5807ce8
> > > > [ 1.051375] R10: 0000000000000000 R11: 0000000000000001 R12: ffff91d98105afc0
> > > > [ 1.051941] R13: ffff91d983c749c0 R14: ffffffffb45685e0 R15: ffff91d982623000
> > > > [ 1.052543] FS: 0000000000000000(0000) GS:ffff91e48f9c0000(0000) knlGS:0000000000000000
> > > > [ 1.053187] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [ 1.053649] CR2: 0000000000000000 CR3: 00000004cee30000 CR4: 00000000000006f0
> > > > [ 1.054214] Call Trace:
> > > > [ 1.054427] <TASK>
> > > > [ 1.054597] ? __die+0x1e/0x60
> > > > [ 1.054844] ? page_fault_oops+0x17b/0x4a0
> > > > [ 1.055174] ? search_extable+0x26/0x30
> > > > [ 1.055482] ? kunit_test_null_dereference+0x37/0x80
> > > > [ 1.055888] ? search_module_extables+0x14/0x50
> > > > [ 1.056255] ? exc_page_fault+0x6b/0x150
> > > > [ 1.056571] ? asm_exc_page_fault+0x26/0x30
> > > > [ 1.056898] ? __pfx_kunit_generic_run_threadfn_adapter+0x10/0x10
> > > > [ 1.057387] ? __pfx_kunit_fail_assert_format+0x10/0x10
> > > > [ 1.057799] ? kunit_test_null_dereference+0x37/0x80
> > > > [ 1.058195] ? __kthread_parkme+0x33/0x80
> > > > [ 1.058523] kunit_generic_run_threadfn_adapter+0x1c/0x40
> > > > [ 1.058949] kthread+0xe9/0x1f0
> > > > [ 1.059206] ? __pfx_kthread+0x10/0x10
> > > > [ 1.059513] ret_from_fork+0x2f/0x50
> > > > [ 1.059798] ? __pfx_kthread+0x10/0x10
> > > > [ 1.060095] ret_from_fork_asm+0x1a/0x30
> > > > [ 1.060421] </TASK>
> > > > [ 1.060597] Modules linked in:
> > > > [ 1.060841] CR2: 0000000000000000
> > > > [ 1.061104] ---[ end trace 0000000000000000 ]---
> > > > [ 1.061481] RIP: 0010:kunit_test_null_dereference+0x37/0x80
> > > >
> > > >
> > > > I attach my kernel config and the full log file.
> > > >
> > > > What's awkward is that it does not seem to be related directly to
> > > > sched, but only faults with sched.
> > > >
> > > >
> > > > Could you try to reproduce this, Tvrtko?
> > >
> > > Any chance that between the two runs you somehow manage to enable
> > > CONFIG_KUNIT_FAULT_TEST?
> >
> > Hmm in the sea of kunit_test_null_dereference there was a drm sched
> > related fail too. Investigating.
>
> Well this was quite embarrassing - I had an use after free due relying
> on scheduler fences for querying job status. That's what I get for
> over-relying on KASAN...

Interesting, so KASAN had a false negative with those?

>
> I've fixed it by tracking the job completion status in the mock job
> object directly and sent v4 out.
>
> Also interestingly, for me testing under qemu failed to catch it. Only
> one out of two real hw test machines hit it.

That's very awkward. How can my QEMU be that different from your QEMU?
Did you use the commit I provided above as the base of your kernel,
too?

> Excellent that you gave it a spin and caught it before merge, thanks
> for that!

No worries, that's our mission. Cool that you found it so quickly!

P.

>
> Regards,
>
> Tvrtko
>
> >
> > Regards,
> >
> > Tvrtko
> >
> > >
> > > >
> > > > Thanks
> > > > P.
> > > >
> > > >
> > > > >
> > > > > v2:
> > > > >  * Parameterize a bunch of similar tests.
> > > > >  * Improve test commentary.
> > > > >  * Rename TDR test to timeout. (Christian)
> > > > >  * Improve quality and consistency of naming. (Philipp)
> > > > >
> > > > > RFC v2 -> series v1:
> > > > >  * Rebased for drm_sched_init changes.
> > > > >  * Fixed modular build.
> > > > >  * Added some comments.
> > > > >  * Filename renames. (Philipp)
> > > > >
> > > > > v2:
> > > > >  * Dealt with a bunch of checkpatch warnings.
> > > > >
> > > > > v3:
> > > > >  * Some mock API renames, kerneldoc grammar fixes and indentation
> > > > >    fixes.
> > > > >
> > > > > Cc: Christian König <christian.koe...@amd.com>
> > > > > Cc: Danilo Krummrich <d...@kernel.org>
> > > > > Cc: Matthew Brost <matthew.br...@intel.com>
> > > > > Cc: Philipp Stanner <pha...@kernel.org>
> > > > >
> > > > > Tvrtko Ursulin (5):
> > > > >   drm: Move some options to separate new Kconfig
> > > > >   drm/scheduler: Add scheduler unit testing infrastructure and some basic tests
> > > > >   drm/scheduler: Add a simple timeout test
> > > > >   drm/scheduler: Add basic priority tests
> > > > >   drm/scheduler: Add a basic test for modifying entities scheduler list
> > > > >
> > > > >  drivers/gpu/drm/Kconfig                       | 109 +----
> > > > >  drivers/gpu/drm/Kconfig.debug                 | 115 +++++
> > > > >  drivers/gpu/drm/scheduler/.kunitconfig        |  12 +
> > > > >  drivers/gpu/drm/scheduler/Makefile            |   2 +
> > > > >  drivers/gpu/drm/scheduler/tests/Makefile      |   7 +
> > > > >  .../gpu/drm/scheduler/tests/mock_scheduler.c  | 323 +++++++++++++
> > > > >  drivers/gpu/drm/scheduler/tests/sched_tests.h | 223 +++++++++
> > > > >  drivers/gpu/drm/scheduler/tests/tests_basic.c | 426 ++++++++++++++++++
> > > > >  8 files changed, 1113 insertions(+), 104 deletions(-)
> > > > >  create mode 100644 drivers/gpu/drm/Kconfig.debug
> > > > >  create mode 100644 drivers/gpu/drm/scheduler/.kunitconfig
> > > > >  create mode 100644 drivers/gpu/drm/scheduler/tests/Makefile
> > > > >  create mode 100644 drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > > >  create mode 100644 drivers/gpu/drm/scheduler/tests/sched_tests.h
> > > > >  create mode 100644 drivers/gpu/drm/scheduler/tests/tests_basic.c
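
A userspace sketch of the class of bug discussed in this thread, for
illustration only. The fix is described above as tracking the job
completion status in the mock job object directly instead of querying
scheduler fences. The code below is not taken from the series: all
names (fake_fence, fake_job, fake_job_complete, fake_job_is_done) are
hypothetical, and the refcount handling is a simplified assumption
about how a fence can go away before a test queries it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for a refcounted fence whose lifetime is
 * controlled by the scheduler. This is NOT the real scheduler fence.
 */
struct fake_fence {
	int refcount;
	bool signaled;
};

/* Hypothetical mock job; names are made up for this sketch. */
struct fake_job {
	struct fake_fence *fence;	/* may be released on completion */
	bool done;			/* status owned by the job itself */
};

static struct fake_job *fake_job_create(void)
{
	struct fake_job *job = calloc(1, sizeof(*job));

	if (!job)
		return NULL;

	job->fence = calloc(1, sizeof(*job->fence));
	if (!job->fence) {
		free(job);
		return NULL;
	}
	job->fence->refcount = 1;

	return job;
}

/* Called by the mock "hardware" when the job finishes. */
static void fake_job_complete(struct fake_job *job)
{
	assert(job->fence != NULL);	/* a job completes only once */

	/* Record the status in the job before the fence can go away. */
	job->done = true;

	job->fence->signaled = true;
	if (--job->fence->refcount == 0) {
		free(job->fence);
		job->fence = NULL;
	}
}

/*
 * Safe status query: reads only memory owned by the job. Reading
 * job->fence->signaled here instead would dereference a fence that
 * no longer exists once fake_job_complete() dropped the last
 * reference.
 */
static bool fake_job_is_done(const struct fake_job *job)
{
	return job->done;
}

static void fake_job_destroy(struct fake_job *job)
{
	free(job->fence);	/* free(NULL) is a no-op */
	free(job);
}
```

In this sketch a test would call fake_job_is_done() after completion
and never touch job->fence, so the result does not depend on when the
fence is released.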