On Fri, Apr 11, 2025 at 09:52:30AM -0400, Alyssa Rosenzweig wrote:
> > 2. Device Lost
> > --------------
> >
> > At this point we're left with no other choice than to kill the context.
> > And userspace should be able to cope with VK_DEVICE_LOST (hopefully zink
> > does), but it will probably not cope well with an entire storm of these
> > just to get the first frame out.
> >
> > Here comes the horrible trick:
> >
> > We'll keep rendering the entire frame by just smashing one single reserve
> > page (per context) into the pte every time there's a fault. It will result
> > in total garbage, and we probably want to shoot the context the moment the
> > VS stages have finished, but it allows us to collect an accurate estimate
> > of how much memory we'd have needed. We need to pass that to the vulkan
> > driver as part of the device lost processing, so that it can keep that as
> > the starting point for the userspace dynamic memory requirement
> > guesstimate as a lower bound. Together with the (scaled to that
> > requirement) gpu driver memory pool and the core mm watermarks, that
> > should allow us to not hit a device lost again hopefully.
>
> This doesn't work if vertex stages are allowed to have side effects
> (which is required for adult-level APIs and can effectively get hit with
> streamout on panfrost). Once you have anything involving side effects,
> you can't replay work, there's no way to cope with that. No magic Zink
> can do either.

Yeah, no attempts at replay, it's just standard gl error handling. So
either tossing the context and reporting that through arb_robustness, or
tossing the context, "transparently" creating a new one, and relying on a
mix of recreating some internal driver objects and thoughts&prayers to give
the new context the best chances possible.

You really want all the tricks in step 1 and the quirks in 3 to make sure
this doesn't ever happen. Or at most once per app, hopefully.

I promised terrible after all :-P

Cheers, Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch