On 13/03/2025 08:34, Philipp Stanner wrote:
On Wed, 2025-03-12 at 14:33 +0000, Tvrtko Ursulin wrote:

On 12/03/2025 13:49, Philipp Stanner wrote:

[snip]

There I see a UAF. Do you have an idea what that might be? I
would only
expect memory leaks and the test assertions failing.

It is expected all hell to break loose in aspirational mode.
There
the
mock scheduler shutdown relies solely on drm_sched_fini. So all
tests
which do not fully idle the mock scheduler themselves, whether
implicity
or explicitly, will explode in various ways.

To clarify I made it like that because that is what I understood
was
your vision for drm_sched_fini. That it would be able to clean
everything up etc.

If you meant something different we need to adjust it. Although
it
could
be done later as well.

My primary intention was to use those tests to see whether my
memory
leak fix works or not.

Okay here I don't know exactly what are you changing and how you are
testing to comment.

Thus I'd need something like in v4, where the tests run as intended
but
leak the jobs in the pending_list. Would indeed be neat if they
also
check for list_len(pending_list) so that you don't have to run
kmemleak
manually.

And since we shouldn't have such leaks in production tests the idea
of
optionally enabling them arose.

More generally speaking, I think that the tests should behave as we
expect drivers to behave *currently*, while we have the option for
the
tests to purposefully misbehave. In my case this means: just leak
the
jobs and optionally tell about the leaks.

When used as intended, the scheduler currently AFAIK doesn't have
UAF
or nullptrderefs, so they shouldn't occur in tests that use it as
intended. Now the leaks are special because we never intended them.

So what surprised me is that, when compared to v4, this v7 here
actually now also has UAF. Besides not cleaning up leaks, what are
you
doing in aspiritional mode?

Let's talk it through before you invest your time into v8

1)

"Normal" mode is fine, do you agree? No leaks, no UAF. Ie. it is
exercising the scheduler how drivers are supposed to use it today.

Yes, it is fine.


2)

Aspirational mode I added on your request and you can define how you
want it to work.

My personal intention is simple: have the jobs leaked like in v4,
without UAF or faults and the like.


Lets have a look in drm_mock_sched_fini() and see that it is pretty
straightforward. First I'll pre-process and annotate the normal
version:

void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
{
        struct drm_mock_sched_job *job, *next;
        struct kunit  *test = sched->test;
        unsigned long flags;
        LIST_HEAD(signal);

        // Stop the scheduler workqueues so nothing further gets
scheduled
        drm_sched_wqueue_stop(&sched->base);

        // Idle the mock scheduler in-flight jobs:

        // First move them to a local list so any in parallel
"hardware"
activity does not see them any more

        spin_lock_irqsave(&sched->lock, flags);
        list_for_each_entry_safe(job, next, &sched->job_list, link)
                list_move_tail(&job->link, &signal);
        spin_unlock_irqrestore(&sched->lock, flags);

        // Now for jobs which "hardware" hasn't processed yet we do:
        //
        // 1. Cancel the completion timer
        // 2. Mark the job as complete (signal the hw fence, wake up
waiters)
        // 3. Release the hardware fence (list had a reference)
        // 4. Free the job itself

        list_for_each_entry(job, &signal, link) {
                hrtimer_cancel(&job->timer);
                drm_mock_sched_job_complete(job);
                dma_fence_put(&job->hw_fence);
                drm_sched_job_cleanup(&job->base);
        }

        // Finally drm_sched_fini
        drm_sched_fini(&sched->base);
}

In aspirational mode it becomes this:

void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
{
        struct kunit  *test = sched->test;

        drm_sched_fini(&sched->base);

        KUNIT_ASSERT_TRUE(test, list_empty(&sched->job_list));
        KUNIT_ASSERT_TRUE(test, list_empty(&sched-
base.pending_list));
}

So relies on drm_sched_fini to cleanup _everything_. Including the
driver (mock scheduler) state. Which is what I understood you wanted.

Thx for the detailed explanation. _Everything_ seems a lot, though.
Even in this version, you still stop submission to your entities, don't
you?

IOW: is drm_sched_fini(), in aspirational mode, responsible for more
than preventing the leaks (and for what it is already doing in mainline
linux)?

Yes it is, it would need to idle and free all driver state. Again, this is I understood you wanted to work towards, but as this thread will conclude as dropping it for now it doesn't matter.

For example drm_sched_basic_entity_cleanup test case will exit (kill
the
entities and tear down the scheduler) with half of the submitted jobs
still in flight. That's one example of what will be caught here by
the
asserts and also UAF since kunit test will just exit and free up
memory
regardless.

Hmm. OK, so kunit does the free. But who does the access?
drm_sched_fini() still stops all the work items.

In-flight jobs. Completion timers will fire, simulating hardware interrupts firing with a real driver.

If this isn't what you had in mind you can either a) tell me what you
want maybe I can tweak it*, or b) we can drop the aspirational patch
for
now.


So, here's the bad news. I had a chat with Danilo yesterday who
reminded me of the fact that there are testers out there (like the
Intel bots) which do random kconfigs - so avoiding that with another
kconfig option does not solve the problem.

Thus, I think the way to go is to indeed drop aspirational mode and
whoever works on Schrödinger-Problems (like myself) has to tweak the
tests locally.

That's probably even more productive process-wise, otherwise we'd have
to remove broken aspects of (or the entire) aspirational mode every
time that we solved a known problem.

Unless you have a better proposal, let's drop that mode.


Sorry for the zig-zag, but that wasn't on my / our radar

No worries. Can you review the v7 as is or you want me to send v8? I am pretty sure aspirational mode patch is standalone and could be simply dropped when merging.

Regards,

Tvrtko

*)
For example it would be possible to avoid UAFs by moving away from
kunit
managed allocations and just leak memory. Plus, it would also be
required to cancel the timers or so.

For running in the VM or UML cases, which I thought were typical, it
wouldn't be a benefit, but if you are worried about running on the
host
kernel and expect no crashes, then that aspect would need to change.

Regards,

Tvrtko



Reply via email to