On 7/5/23 08:30, Marek Olšák wrote:
> On Tue, Jul 4, 2023, 03:55 Michel Dänzer <michel.daen...@mailbox.org> wrote:
>
>     On 7/4/23 04:34, Marek Olšák wrote:
>     > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daen...@mailbox.org> wrote:
>     >
>     >     On 6/30/23 22:32, Marek Olšák wrote:
>     >     > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daen...@mailbox.org> wrote:
>     >     >> On 6/30/23 16:59, Alex Deucher wrote:
>     >     >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick <sebastian.w...@redhat.com> wrote:
>     >     >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealm...@igalia.com> wrote:
>     >     >>>>>
>     >     >>>>> +Robustness
>     >     >>>>> +----------
>     >     >>>>> +
>     >     >>>>> +The only way to try to keep an application working after a reset is if it
>     >     >>>>> +complies with the robustness aspects of the graphical API that it is using.
>     >     >>>>> +
>     >     >>>>> +Graphical APIs provide ways to applications to deal with device resets. However,
>     >     >>>>> +there is no guarantee that the app will use such features correctly, and the
>     >     >>>>> +UMD can implement policies to close the app if it is a repeating offender,
>     >     >>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>     >     >>>>> +the user interface from being correctly displayed. This should be done even if
>     >     >>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
>     >     >>>>
>     >     >>>> I still don't think it's good to let the kernel arbitrarily kill
>     >     >>>> processes that it thinks are not well-behaved based on some heuristics
>     >     >>>> and policy.
>     >     >>>>
>     >     >>>> Can't this be outsourced to user space? Expose the information about
>     >     >>>> processes causing a device and let e.g. systemd deal with coming up
>     >     >>>> with a policy and with killing stuff.
>     >     >>>
>     >     >>> I don't think it's the kernel doing the killing, it would be the UMD.
>     >     >>> E.g., if the app is guilty and doesn't support robustness the UMD can
>     >     >>> just call exit().
>     >     >>
>     >     >> It would be safer to just ignore API calls[0], similarly to what is done
>     >     >> until the application destroys the context with robustness. Calling exit()
>     >     >> likely results in losing any unsaved work, whereas at least some
>     >     >> applications might otherwise allow saving the work by other means.
>     >     >
>     >     > That's a terrible idea. Ignoring API calls would be identical to a freeze.
>     >     > You might as well disable GPU recovery because the result would be the same.
>     >
>     >     No GPU recovery would affect everything using the GPU, whereas this
>     >     affects only non-robust applications.
>     >
>     > which is currently the majority.
>
>     Not sure where you're going with this. Applications need to use robustness
>     to be able to recover from a GPU hang, and the GPU needs to be reset for
>     that. So disabling GPU reset is not the same as what we're discussing here.
>
>     >     > - non-robust contexts: call exit(1) immediately, which is the best
>     >     > way to recover
>     >
>     >     That's not the UMD's call to make.
>     >
>     > That's absolutely the UMD's call to make because that's mandated by the
>     > hw and API design
>
>     Can you point us to a spec which mandates that the process must be killed
>     in this case?
>
>     > and only driver devs know this, which this thread is a proof of. The
>     > default behavior is to skip all command submission if a non-robust
>     > context is lost, which looks like a freeze. That's required to prevent
>     > infinite hangs from the same context and can be caused by the side
>     > effects of the GPU reset itself, not by the cause of the previous hang.
>     > The only way out of that is killing the process.
>
>     The UMD killing the process is not the only way out of that, and doing so
>     is overreach on its part. The UMD is but one out of many components in a
>     process, not the main one or a special one. It doesn't get to decide when
>     the process must die, certainly not under circumstances where it must be
>     able to continue while ignoring API calls (that's required for robustness).
>
> You're mixing things up. Robust apps don't any special action from a UMD.
> Only non-robust apps need to be killed for proper recovery with the only
> other alternative being not updating the window/screen,

I'm saying they don't "need" to be killed, since the UMD must be able to keep
going while ignoring API calls (or it couldn't support robustness). It's a
choice, one which is not for the UMD to make.

> Also it's already used and required by our customers on Android because
> killing a process returns the user to the desktop screen and can generate a
> crash dump instead of keeping the app output frozen, and they agree that this
> is the best user experience given the circumstances.

Then some appropriate Android component needs to make that call. The UMD is
not it.

>     >     >> [0] Possibly accompanied by a one-time message to stderr along the
>     >     >> lines of "GPU reset detected but robustness not enabled in context,
>     >     >> ignoring OpenGL API calls".

-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer