On Wed, Jul 5, 2023 at 3:32 AM Michel Dänzer <michel.daen...@mailbox.org> wrote:
>
> On 7/5/23 08:30, Marek Olšák wrote:
> > On Tue, Jul 4, 2023, 03:55 Michel Dänzer <michel.daen...@mailbox.org> wrote:
> >
> > On 7/4/23 04:34, Marek Olšák wrote:
> > > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daen...@mailbox.org> wrote:
> > > On 6/30/23 22:32, Marek Olšák wrote:
> > > > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daen...@mailbox.org> wrote:
> > > >> On 6/30/23 16:59, Alex Deucher wrote:
> > > >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick <sebastian.w...@redhat.com> wrote:
> > > >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealm...@igalia.com> wrote:
> > > >>>>>
> > > >>>>> +Robustness
> > > >>>>> +----------
> > > >>>>> +
> > > >>>>> +The only way to try to keep an application working after a reset is if it
> > > >>>>> +complies with the robustness aspects of the graphical API that it is using.
> > > >>>>> +
> > > >>>>> +Graphical APIs provide ways to applications to deal with device resets. However,
> > > >>>>> +there is no guarantee that the app will use such features correctly, and the
> > > >>>>> +UMD can implement policies to close the app if it is a repeating offender,
> > > >>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
> > > >>>>> +the user interface from being correctly displayed. This should be done even if
> > > >>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
> > > >>>>
> > > >>>> I still don't think it's good to let the kernel arbitrarily kill
> > > >>>> processes that it thinks are not well-behaved based on some heuristics
> > > >>>> and policy.
> > > >>>>
> > > >>>> Can't this be outsourced to user space? Expose the information about
> > > >>>> processes causing a device and let e.g. systemd deal with coming up
> > > >>>> with a policy and with killing stuff.
> > > >>>
> > > >>> I don't think it's the kernel doing the killing, it would be the UMD.
> > > >>> E.g., if the app is guilty and doesn't support robustness the UMD can
> > > >>> just call exit().
> > > >>
> > > >> It would be safer to just ignore API calls[0], similarly to what is
> > > >> done until the application destroys the context with robustness.
> > > >> Calling exit() likely results in losing any unsaved work, whereas at
> > > >> least some applications might otherwise allow saving the work by
> > > >> other means.
> > > >
> > > > That's a terrible idea. Ignoring API calls would be identical to a
> > > > freeze. You might as well disable GPU recovery because the result
> > > > would be the same.
> > >
> > > No GPU recovery would affect everything using the GPU, whereas this
> > > affects only non-robust applications.
> > >
> > > which is currently the majority.
> >
> > Not sure where you're going with this. Applications need to use
> > robustness to be able to recover from a GPU hang, and the GPU needs to
> > be reset for that. So disabling GPU reset is not the same as what we're
> > discussing here.
> >
> >
> > > > - non-robust contexts: call exit(1) immediately, which is the best
> > > > way to recover
> > >
> > > That's not the UMD's call to make.
> > >
> > > That's absolutely the UMD's call to make because that's mandated by
> > > the hw and API design
> >
> > Can you point us to a spec which mandates that the process must be
> > killed in this case?
> >
> >
> > > and only driver devs know this, which this thread is a proof of. The
> > > default behavior is to skip all command submission if a non-robust
> > > context is lost, which looks like a freeze. That's required to prevent
> > > infinite hangs from the same context and can be caused by the side
> > > effects of the GPU reset itself, not by the cause of the previous
> > > hang. The only way out of that is killing the process.
> >
> > The UMD killing the process is not the only way out of that, and doing
> > so is overreach on its part. The UMD is but one out of many components
> > in a process, not the main one or a special one. It doesn't get to
> > decide when the process must die, certainly not under circumstances
> > where it must be able to continue while ignoring API calls (that's
> > required for robustness).
> >
> >
> > You're mixing things up. Robust apps don't any special action from a
> > UMD. Only non-robust apps need to be killed for proper recovery with
> > the only other alternative being not updating the window/screen,
>
> I'm saying they don't "need" to be killed, since the UMD must be able to
> keep going while ignoring API calls (or it couldn't support robustness).
> It's a choice, one which is not for the UMD to make.
>
>
> > Also it's already used and required by our customers on Android because
> > killing a process returns the user to the desktop screen and can
> > generate a crash dump instead of keeping the app output frozen, and
> > they agree that this is the best user experience given the
> > circumstances.
>
> Then some appropriate Android component needs to make that call. The UMD
> is not it.

We can change it once Android and Linux have a better way to handle
non-robust apps. Until then, generating a core dump after a GPU crash
produces the best outcome for users and developers.

Marek