On 14/08/2025 12:49, Philipp Stanner wrote:
On Thu, 2025-08-14 at 12:45 +0100, Tvrtko Ursulin wrote:
On 14/08/2025 11:42, Tvrtko Ursulin wrote:
On 21/07/2025 08:52, Philipp Stanner wrote:
+Cc Tvrtko, who's currently reworking FIFO and RR.
On Sun, 2025-07-20 at 16:56 -0700, James Flowers wrote:
Fixes an issue where entities are added to the run queue in
drm_sched_rq_update_fifo_locked after being killed, causing a
slab-use-after-free error.
Signed-off-by: James Flowers <bold.zone2...@fastmail.com>
---
This issue was detected by syzkaller running on a Steam Deck OLED.
Unfortunately I don't have a reproducer for it. I've
Well, now that's kind of an issue – if you don't have a reproducer, how
can you know that your patch is correct? How can we?
It would certainly be good to know what the fuzz testing framework
does.
included the KASAN reports below:
Anyways, KASAN reports look interesting. But those might be many
different issues. Again, would be good to know what the fuzzer has been
testing. Can you maybe split this fuzz test into sub-tests? I suspsect
those might be different faults.
Anyways, taking a first look…
[SNIP]
==================================================================
If this is a race, then this patch here is broken, too, because you're
checking the 'stopped' boolean as the callers of that function do, too
– just later. :O
Could still race, just less likely.
The proper way to fix it would then be to address the issue where the
locking is supposed to happen. Let's look at, for example,
drm_sched_entity_push_job():
void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
{
        (Bla bla bla)
        …………

        /* first job wakes up scheduler */
        if (first) {
                struct drm_gpu_scheduler *sched;
                struct drm_sched_rq *rq;

                /* Add the entity to the run queue */
                spin_lock(&entity->lock);
                if (entity->stopped) {                  <---- Aha!
                        spin_unlock(&entity->lock);
                        DRM_ERROR("Trying to push to a killed entity\n");
                        return;
                }

                rq = entity->rq;
                sched = rq->sched;

                spin_lock(&rq->lock);
                drm_sched_rq_add_entity(rq, entity);
                if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
                        drm_sched_rq_update_fifo_locked(entity, rq, submit_ts);
                                                        <---- bumm!

                spin_unlock(&rq->lock);
                spin_unlock(&entity->lock);
But the locks are still being held. So that "shouldn't be happening"(tm).
Interesting. AFAICS only drm_sched_entity_kill() and drm_sched_fini()
stop entities. The former holds appropriate locks, but drm_sched_fini()
doesn't. So that looks like a hot candidate to me. Opinions?
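For reference, the relevant part of drm_sched_fini() looks roughly like
this today (paraphrased from the upstream scheduler, details may differ
between kernel versions):

        /* ... */
        for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
                struct drm_sched_rq *rq = sched->sched_rq[i];

                spin_lock(&rq->lock);
                list_for_each_entry(s_entity, &rq->entities, list)
                        /* Only rq->lock is held here, not entity->lock. */
                        s_entity->stopped = true;
                spin_unlock(&rq->lock);
                kfree(sched->sched_rq[i]);
        }

So it marks the entities stopped under rq->lock only, while
drm_sched_entity_push_job() above checks entity->stopped under
entity->lock.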
On the other hand, aren't drivers prohibited from calling
drm_sched_entity_push_job() after calling drm_sched_fini()? If the
fuzzer does that, then it's not the scheduler's fault.
Could you test adding spin_lock(&entity->lock) to drm_sched_fini()?
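A minimal sketch of that experiment, assuming the fini loop shown above
(hypothetical, untested):

        spin_lock(&rq->lock);
        list_for_each_entry(s_entity, &rq->entities, list) {
                /* Take the entity lock before flipping ->stopped. */
                spin_lock(&s_entity->lock);
                s_entity->stopped = true;
                spin_unlock(&s_entity->lock);
        }
        spin_unlock(&rq->lock);

Note the lock order in drm_sched_entity_push_job() is entity->lock
outer, rq->lock inner, so nesting them the other way around here would
need a closer look to rule out an ABBA inversion.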
Would be cool if Tvrtko and Christian take a look. Maybe we even have a
fundamental design issue.
It would be nice to have a reproducer; from this thread I did not
manage to figure out whether the syzkaller snippet James posted was it,
or not quite it.
In either case, I think one race I see relates to the early-exit
!entity->rq check before setting entity->stopped in drm_sched_entity_kill().
If nothing has been submitted on the entity yet (at the time of process
exit / entity kill), entity->stopped will therefore not be set. A
parallel job submit can then re-add the entity to the tree while process
exit / file close / entity kill continues and is about to kfree the
entity (in the case of the amdgpu report there are two entities embedded
in file_priv).
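For context, the check I mean is the one at the top of
drm_sched_entity_kill(), roughly (from memory, the exact code may
differ):

        static void drm_sched_entity_kill(struct drm_sched_entity *entity)
        {
                if (!entity->rq)        /* <---- early exit, ->stopped never set */
                        return;

                spin_lock(&entity->lock);
                entity->stopped = true;
                drm_sched_rq_remove_entity(entity->rq, entity);
                spin_unlock(&entity->lock);
                ...
        }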
One way to make this more robust is to make the entity->rq check in
drm_sched_entity_kill() stronger. Or actually to remove it altogether.
But I think it also requires checking for entity->stopped in
drm_sched_entity_select_rq() and propagating the error code all the way
out from drm_sched_job_arm().
That way entity->stopped is properly serialized and acted upon early
enough to avoid dereferencing a freed entity and to avoid creating jobs
not attached to anything (instead of only getting a warning from push_job).
Disclaimer: I haven't tried to experiment with this yet, so I may be
missing something. At least writing a reproducer for the race I
described sounds easy, so unless someone shouts that I am talking
nonsense I can do that and also sketch out a fix. *If* the theory holds
water after I write the test case.
Nah I was talking nonsense. Forgot entity->rq is assigned on entity init
and jobs cannot be created unless it is set.
Okay, I have no theories as to what bug syzkaller found.
I was just about to answer.
I agree that the rq check should be fine.
As you can see in the thread, I suspect that this is a race between
drm_sched_entity_push_job() and drm_sched_fini().
See here:
https://lore.kernel.org/dri-devel/20250813085654.102504-2-pha...@kernel.org/
Yeah I read it.
The problem with the amdgpu angle and this KASAN report is that, to me,
it looked like the UAF is about the two VM update entities embedded in
struct file_priv. And the schedulers used to initialize those are not
torn down until driver unload. So I didn't think syzkaller would have
hit that and was looking for alternative ideas.
Regards,
Tvrtko
I think as long as there's no reproducer there is not much to do for us
here. A long term goal, though, is to enforce the lifetime rules:
entities must be torn down before their scheduler. Checking this for
all drivers will be quite some work.
P.
Regards,
Tvrtko