doc: Document DRM device reset expectations

Marek Olšák Wed, 05 Jul 2023 08:54:40 -0700

On Wed, Jul 5, 2023 at 3:32 AM Michel Dänzer <michel.daen...@mailbox.org> wrote:
>
> On 7/5/23 08:30, Marek Olšák wrote:
> > On Tue, Jul 4, 2023, 03:55 Michel Dänzer <michel.daen...@mailbox.org> wrote:
> >     On 7/4/23 04:34, Marek Olšák wrote:
> >     > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daen...@mailbox.org> wrote:
> >     >     On 6/30/23 22:32, Marek Olšák wrote:
> >     >     > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daen...@mailbox.org> wrote:
> >     >     >> On 6/30/23 16:59, Alex Deucher wrote:
> >     >     >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick <sebastian.w...@redhat.com> wrote:
> >     >     >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealm...@igalia.com> wrote:
> >     >     >>>>>
> >     >     >>>>> +Robustness
> >     >     >>>>> +----------
> >     >     >>>>> +
> >     >     >>>>> +The only way to try to keep an application working after a reset is if it
> >     >     >>>>> +complies with the robustness aspects of the graphical API that it is using.
> >     >     >>>>> +
> >     >     >>>>> +Graphical APIs provide ways to applications to deal with device resets. However,
> >     >     >>>>> +there is no guarantee that the app will use such features correctly, and the
> >     >     >>>>> +UMD can implement policies to close the app if it is a repeating offender,
> >     >     >>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
> >     >     >>>>> +the user interface from being correctly displayed. This should be done even if
> >     >     >>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
> >     >     >>>>
> >     >     >>>> I still don't think it's good to let the kernel arbitrarily kill
> >     >     >>>> processes that it thinks are not well-behaved based on some heuristics
> >     >     >>>> and policy.
> >     >     >>>>
> >     >     >>>> Can't this be outsourced to user space? Expose the information about
> >     >     >>>> processes causing a device reset and let e.g. systemd deal with coming up
> >     >     >>>> with a policy and with killing stuff.
> >     >     >>>
> >     >     >>> I don't think it's the kernel doing the killing, it would be the UMD.
> >     >     >>> E.g., if the app is guilty and doesn't support robustness the UMD can
> >     >     >>> just call exit().
> >     >     >>
> >     >     >> It would be safer to just ignore API calls[0], similarly to what is
> >     >     >> done until the application destroys the context with robustness.
> >     >     >> Calling exit() likely results in losing any unsaved work, whereas at
> >     >     >> least some applications might otherwise allow saving the work by
> >     >     >> other means.
> >     >     >
> >     >     > That's a terrible idea. Ignoring API calls would be identical to a
> >     >     > freeze. You might as well disable GPU recovery because the result
> >     >     > would be the same.
> >     >
> >     >     No GPU recovery would affect everything using the GPU, whereas this
> >     >     affects only non-robust applications.
> >     >
> >     > which is currently the majority.
> >
> >     Not sure where you're going with this. Applications need to use
> >     robustness to be able to recover from a GPU hang, and the GPU needs to be
> >     reset for that. So disabling GPU reset is not the same as what we're
> >     discussing here.
> >
> >
> >     >     > - non-robust contexts: call exit(1) immediately, which is the best
> >     >     > way to recover
> >     >
> >     >     That's not the UMD's call to make.
> >     >
> >     > That's absolutely the UMD's call to make because that's mandated by the
> >     > hw and API design
> >
> >     Can you point us to a spec which mandates that the process must be killed
> >     in this case?
> >
> >
> >     > and only driver devs know this, which this thread is a proof of. The
> >     > default behavior is to skip all command submission if a non-robust
> >     > context is lost, which looks like a freeze. That's required to prevent
> >     > infinite hangs from the same context and can be caused by the side
> >     > effects of the GPU reset itself, not by the cause of the previous hang.
> >     > The only way out of that is killing the process.
> >
> >     The UMD killing the process is not the only way out of that, and doing so
> >     is overreach on its part. The UMD is but one out of many components in a
> >     process, not the main one or a special one. It doesn't get to decide when
> >     the process must die, certainly not under circumstances where it must be
> >     able to continue while ignoring API calls (that's required for robustness).
> >
> >
> > You're mixing things up. Robust apps don't need any special action from a UMD.
> > Only non-robust apps need to be killed for proper recovery with the only
> > other alternative being not updating the window/screen,
>
> I'm saying they don't "need" to be killed, since the UMD must be able to keep 
> going while ignoring API calls (or it couldn't support robustness). It's a 
> choice, one which is not for the UMD to make.
>
>
> > Also it's already used and required by our customers on Android because 
> > killing a process returns the user to the desktop screen and can generate a 
> > crash dump instead of keeping the app output frozen, and they agree that 
> > this is the best user experience given the circumstances.
>
> Then some appropriate Android component needs to make that call. The UMD is 
> not it.


We can change it once Android and Linux have a better way to handle
non-robust apps. Until then, generating a core dump after a GPU crash
produces the best outcome for users and developers.

Marek
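
For readers following the thread, the choice being argued over for a lost
context can be sketched roughly as below. This is an illustration only, under
stated assumptions: every name in it (gpu_context, context_submit, the policy
and result enums) is hypothetical and is not taken from Mesa, amdgpu, or any
other real UMD. The function returns a decision instead of calling exit(1)
itself, since which component should act on that decision is the point in
dispute.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names here are hypothetical; this is not code from any real driver. */

/* What to do with a non-robust context once it has been lost. */
enum lost_ctx_policy {
    POLICY_SKIP_SUBMISSIONS, /* keep the process alive, drop all GPU work */
    POLICY_EXIT_PROCESS      /* end the process (e.g. to get a crash dump) */
};

enum submit_result {
    SUBMIT_OK,          /* context is healthy, work goes to the GPU */
    SUBMIT_REPORT_LOST, /* robust app: report the reset through the API */
    SUBMIT_SKIPPED,     /* non-robust app: the call is silently ignored */
    SUBMIT_TERMINATE    /* non-robust app: the process should be ended */
};

struct gpu_context {
    bool robust; /* app opted in to the API's robustness feature */
    bool lost;   /* context was invalidated by a device reset */
};

/* Record that a device reset invalidated this context. */
void context_mark_lost(struct gpu_context *ctx)
{
    assert(ctx != NULL);
    ctx->lost = true;
}

/* Decide what happens to a submission on this context. */
enum submit_result context_submit(const struct gpu_context *ctx,
                                  enum lost_ctx_policy policy)
{
    assert(ctx != NULL);

    if (!ctx->lost)
        return SUBMIT_OK;

    /* Robust apps are told about the loss and recreate their context. */
    if (ctx->robust)
        return SUBMIT_REPORT_LOST;

    /* Non-robust apps cannot recover; only the policy differs. */
    return policy == POLICY_EXIT_PROCESS ? SUBMIT_TERMINATE : SUBMIT_SKIPPED;
}
```

With POLICY_SKIP_SUBMISSIONS a lost non-robust context keeps returning
SUBMIT_SKIPPED, which is the frozen-output case described above; with
POLICY_EXIT_PROCESS the same context yields SUBMIT_TERMINATE.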

Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
