On Thu, Apr 10, 2025 at 4:04 PM Christian König <christian.koe...@amd.com> wrote:
>
> On 10.04.25 at 15:56, Qiang Yu wrote:
> >>>> This prevents applications with multiple contexts from running into a
> >>>> race condition between running tasks and context destroy when
> >>>> terminating.
> >>>>
> > If implementing flush callback fix the problem, it must not be when SIGKILL.
>
> SIGKILL also calls flush (again eventually), but we can detect that in the
> scheduler code.
>
> See the code and comment in drm_sched_entity_flush(). We could potentially
> improve the comment cause it's not 100% correct, but it should work under all
> cases.
>
> > Could you describe the problem and how this fix solves it? As I don't
> > understand
> > how the above difference could fix a race bug.
>
> That was the point I wasn't sure about either. It should *not* fix any race,
> it's just gives a nicer SIGKILL experience.
Thanks for this feedback. As mentioned in the initial letter, I'm also trying
to understand whether this is the correct fix. The problem is unfortunately
not trivial to reproduce: so far only one application can trigger it, and it
is a complicated one with multiple contexts and relatively lengthy jobs.

What I have observed so far is that, as things stand, a context can be
destroyed while a job is still running (e.g. by killing the application at
the right time), and the context destroy appears to release buffers that are
still in use by the running job, which is what causes the read faults.

With the added flush, the jobs always finished before the context destroy was
called, which prevented the issue for this application in my attempts. But I
suppose it might also just be that the jobs had some more time to finish.

If this is not the correct fix, then perhaps there is still some
synchronization missing between a running job and context destroy in lima?

Thanks
Erico