On Mon Jan 13, 2025 at 1:55 AM PST, Christian König wrote:
> On 13.01.25 at 09:43, Philipp Stanner wrote:
>> [SNIP]
>>>> The handling of NULL values is half-baked.
>>>>
>>>> In my opinion, you should define if drm_sched_pick_best() may put a
>>>> NULL into rq. If your answer is yes, it might put a NULL there, then
>>>> there should be a BUG_ON(!entity->rq) after the invocation of
>>>> drm_sched_entity_select_rq().
>>>> If your answer is no, the BUG_ON() should be in
>>>> drm_sched_pick_best().
>>> Yeah good point.
>>>
>>> We might not want a BUG_ON(), that is only justified when we prevent
>>> further damage (e.g. random data corruption or similar).
>>>
>>> I suggest using a WARN(!sched, "Submission without activated
>>> scheduler!").
>>> This way the system has at least a chance of survival should the
>>> scheduler become ready later on.
>>>
>>> On the other hand the BUG_ON() or the NULL pointer deref should only
>>> kill the application thread which is submitting something before the
>>> driver is resumed. So that might help to pinpoint where the actual
>>> issue is.
>> As I see it the BUG_ON() would just be a prettier NULL pointer
>> deref. If we agree that this is effectively a misuse of the scheduler
>> API we probably want to add it to make it prettier, though?
>
> The only alternative I can see is that the scheduler API gracefully
> handles submits to non-ready schedulers. E.g. that
> drm_sched_entity_push_job() detects this condition and instead of
> pushing the job sets an error code and signals the fences.
>
> But that might not be a good idea.
>
> It just moves the crash from one place to another and in general I fully
> agree the driver is misusing the scheduler API to do something which
> won't work and could potentially crash the whole system.
>
>> @Philipp:
>> BTW, I only just discovered this thread by coincidence. Please use
>> get_maintainer. The scheduler currently has 4 maintainers, and none of
>> them is on CC.
>
> Oh, good point. I was already wondering why nobody else commented and
> didn't realize that nobody was on CC.
>
> Thanks,
> Christian.

I'm only seeing this mail exchange months after the fact, because someone
on IRC linked me to it, so take this as a wild guess.

Could this sleep/wake issue be caused by something similar to the panics
and SMU hangs I was experiencing with my own issue? That issue is known to
have the same workaround for both 6000 and 7000 series users, and a
specific kernel commit seems to affect it as well.

Could you test whether you can still reproduce the error after disabling
GFXOFF states with the following kernel command-line override, and report
back?

    amdgpu.ppfeaturemask=0xfff73fff

Unless this is already long solved? Since this particular thread died back
in January, I guess nothing has happened since?

>
>>
>> Thanks,
>> P.
>>
>>> Regards,
>>> Christian.
>>>
>>>> That helps guys with zero domain knowledge, like me, to figure out
>>>> how this is all supposed to work.
>>>>
>>>> best regards,
>>>> Philipp
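
P.S. For anyone wondering where the ppfeaturemask value above comes from,
here is a minimal sketch of the arithmetic. It assumes two things that are
not stated anywhere in this thread: that the driver's default
ppfeaturemask is 0xfff7bfff, and that GFXOFF is controlled by bit 15
(0x8000, PP_GFXOFF_MASK in the amdgpu headers, as far as I can tell).
Please check both against your own kernel source before relying on them.
The helper name is made up for illustration.

```python
# Assumption: default value of amdgpu.ppfeaturemask in the driver.
DEFAULT_PPFEATUREMASK = 0xFFF7BFFF

# Assumption: bit that enables GFXOFF (PP_GFXOFF_MASK in the amdgpu headers).
PP_GFXOFF_MASK = 0x8000


def without_gfxoff(mask: int) -> int:
    """Return the feature mask with the (assumed) GFXOFF bit cleared.

    The result is truncated to 32 bits because the module parameter
    is an unsigned 32-bit value.
    """
    return mask & ~PP_GFXOFF_MASK & 0xFFFFFFFF


def as_kernel_parameter(mask: int) -> str:
    """Format a mask as it would appear on the kernel command line."""
    return f"amdgpu.ppfeaturemask=0x{mask:08x}"


if __name__ == "__main__":
    print(as_kernel_parameter(without_gfxoff(DEFAULT_PPFEATUREMASK)))
```

If your distribution ships a different default mask, clear the same bit in
that value instead of copying mine verbatim.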