On 7/25/23 04:55, André Almeida wrote:
> Hi everyone,
>
> It's not clear what we should do about non-robust OpenGL apps after GPU
> resets, so I'll try to summarize the topic, show some options and my proposal
> to move forward on that.
>
> On 27/06/2023 10:23, André Almeida wrote:
>> +Robustness
>> +----------
>> +
>> +The only way to try to keep an application working after a reset is if it
>> +complies with the robustness aspects of the graphical API that it is using.
>> +
>> +Graphical APIs provide ways to applications to deal with device resets. However,
>> +there is no guarantee that the app will use such features correctly, and the
>> +UMD can implement policies to close the app if it is a repeating offender,
>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>> +the user interface from being correctly displayed. This should be done even if
>> +the app is correct but happens to trigger some bug in the hardware/driver.
>> +
> Depending on the OpenGL version, there are different robustness APIs
> available:
>
> - OpenGL ARB extension [0]
> - OpenGL KHR extension [1]
> - OpenGL ES extension [2]
>
> Apps written in OpenGL should use whatever version is available for them to
> make the app robust for GPU resets. That usually means calling
> GetGraphicsResetStatusARB() and checking the status: anything other than
> NO_ERROR means that a reset has happened, the context is considered lost
> and should be recreated. If an app follows this, it will likely succeed in
> recovering from a reset.
>
> What should non-robust apps do then? They certainly will not be notified
> if a reset happens, and thus can't recover if their context is lost. The
> OpenGL specification does not explicitly define what should be done in such
> situations[3], and I believe that usually when the spec mandates to close the
> app, it would explicitly note it.
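
For reference, the robust-app pattern described above can be sketched roughly
as follows. This is only an illustration, not code from any real application:
the enum values are the ones listed in ARB_robustness [0], while struct app,
recreate_context() and the fake_query() stub are hypothetical stand-ins so the
snippet compiles without a GL context. A real app would call
glGetGraphicsResetStatusARB() on a context created with the
LOSE_CONTEXT_ON_RESET_ARB notification strategy.

```c
#include <assert.h>

/* Values from ARB_robustness [0]; normally provided by <GL/glext.h>. */
#define GL_NO_ERROR                   0
#define GL_GUILTY_CONTEXT_RESET_ARB   0x8253
#define GL_INNOCENT_CONTEXT_RESET_ARB 0x8254
#define GL_UNKNOWN_CONTEXT_RESET_ARB  0x8255

/* Hypothetical stand-in for glGetGraphicsResetStatusARB(). */
typedef unsigned int (*reset_query_fn)(void);

/* Hypothetical app state: counts how often the context was rebuilt. */
struct app {
    reset_query_fn query_reset_status;
    int context_generation;
};

/* Hypothetical: destroy the lost context, create a new one and
 * re-upload all GL objects. Here it only bumps a counter. */
static void recreate_context(struct app *app)
{
    app->context_generation++;
}

/* Call once per frame. Returns 1 if the context had to be recreated. */
static int check_for_reset(struct app *app)
{
    assert(app && app->query_reset_status);

    unsigned int status = app->query_reset_status();
    if (status == GL_NO_ERROR)
        return 0;

    /* Guilty, innocent or unknown: in every case the context is lost. */
    recreate_context(app);
    return 1;
}

/* Stub driver for illustration: reports one innocent reset on the
 * third query and NO_ERROR otherwise. */
static unsigned int fake_query(void)
{
    static int calls;
    return (++calls == 3) ? GL_INNOCENT_CONTEXT_RESET_ARB : GL_NO_ERROR;
}
```

An app would initialize struct app with its real query function and call
check_for_reset() once per frame, before submitting new work.
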
>
> However, in reality there are different types of device resets, causing
> different results. A reset can be precise enough to damage only the guilty
> context, and keep others alive.
>
> Given that, I believe drivers have the following options:
>
> a) Kill all non-robust apps after a reset. This may lead to lost work in
> innocent applications.
>
> b) Ignore all OpenGL calls from non-robust apps. That means that applications
> would still be alive, but the user interface would be frozen. The user would
> need to close it manually anyway, but in some corner cases, the app could
> autosave some work or the user might be able to interact with it using some
> alternative method (command line?).
>
> c) Kill just the affected non-robust applications. To do that, the driver
> needs to be 100% sure of the impact of its resets.
>
> RadeonSI currently implements a), as can be seen at [4], while Iris
> implements what I think is c)[5].
>
> From the user experience point of view, c) is clearly the best option, but
> it's the hardest to achieve. There's not much gain in having b) over a);
> perhaps it could be an optional env var for such corner case applications.
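
Read literally, the three options above amount to the decision sketched
below. All names are hypothetical, and this is not what RadeonSI or Iris
actually do (see [4] and [5] for that). The sketch also takes the "affected"
flag as a given, which is exactly the part that is hard to get right for c).

```c
#include <assert.h>
#include <stdbool.h>

/* The three options from the list above; names are hypothetical. */
enum reset_policy {
    POLICY_KILL_ALL_NON_ROBUST,      /* a) */
    POLICY_IGNORE_NON_ROBUST_CALLS,  /* b) */
    POLICY_KILL_AFFECTED_NON_ROBUST  /* c) */
};

enum reset_action {
    ACTION_NOTIFY,       /* robust context: app learns of the reset via the query */
    ACTION_CONTINUE,     /* context untouched by the reset, keeps working */
    ACTION_IGNORE_CALLS, /* process stays alive, its GL calls are ignored */
    ACTION_KILL          /* terminate the process */
};

struct gl_context_info {
    bool robust;   /* app opted in to reset notification */
    bool affected; /* driver is certain the reset damaged this context */
};

/* What happens to one context after a reset under a given policy. */
static enum reset_action decide(enum reset_policy policy,
                                const struct gl_context_info *ctx)
{
    assert(ctx);

    /* Robust apps are expected to recover on their own in all cases. */
    if (ctx->robust)
        return ACTION_NOTIFY;

    switch (policy) {
    case POLICY_KILL_ALL_NON_ROBUST:
        return ACTION_KILL;
    case POLICY_IGNORE_NON_ROBUST_CALLS:
        return ACTION_IGNORE_CALLS;
    case POLICY_KILL_AFFECTED_NON_ROBUST:
        return ctx->affected ? ACTION_KILL : ACTION_CONTINUE;
    }
    return ACTION_KILL; /* not reached for valid policies */
}
```

Note that in a) and c) the action applies to the whole process, not just to
the thread(s) using OpenGL.
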

I disagree on these conclusions.

c) is certainly better than a), but it's not "clearly the best" in all cases.
The OpenGL UMD is not a privileged/special component and is in no position to
decide whether the process as a whole (only some thread(s) of which may use
OpenGL at all) gets to continue running.


> [0] https://registry.khronos.org/OpenGL/extensions/ARB/ARB_robustness.txt
> [1] https://registry.khronos.org/OpenGL/extensions/KHR/KHR_robustness.txt
> [2] https://registry.khronos.org/OpenGL/extensions/EXT/EXT_robustness.txt
> [3] https://registry.khronos.org/OpenGL/specs/gl/glspec46.core.pdf
> [4] https://gitlab.freedesktop.org/mesa/mesa/-/blob/23.1/src/gallium/winsys/amdgpu/drm/amdgpu_cs.c#L1657
> [5] https://gitlab.freedesktop.org/mesa/mesa/-/blob/23.1/src/gallium/drivers/iris/iris_batch.c#L842

-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer