doc: Document DRM device reset expectations

Michel Dänzer Wed, 05 Jul 2023 00:32:25 -0700

On 7/5/23 08:30, Marek Olšák wrote:
> On Tue, Jul 4, 2023, 03:55 Michel Dänzer <michel.daen...@mailbox.org> wrote:
>     On 7/4/23 04:34, Marek Olšák wrote:
>     > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daen...@mailbox.org> wrote:
>     >     On 6/30/23 22:32, Marek Olšák wrote:
>     >     > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer 
> <michel.daen...@mailbox.org> wrote:
>     >     >> On 6/30/23 16:59, Alex Deucher wrote:
>     >     >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
>     >     >>> <sebastian.w...@redhat.com> wrote:
>     >     >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida 
> <andrealm...@igalia.com> wrote:
>     >     >>>>>
>     >     >>>>> +Robustness
>     >     >>>>> +----------
>     >     >>>>> +
>     >     >>>>> +The only way to try to keep an application working after a 
> reset is if it
>     >     >>>>> +complies with the robustness aspects of the graphical API 
> that it is using.
>     >     >>>>> +
>     >     >>>>> +Graphical APIs provide ways to applications to deal with 
> device resets. However,
>     >     >>>>> +there is no guarantee that the app will use such features 
> correctly, and the
>     >     >>>>> +UMD can implement policies to close the app if it is a 
> repeating offender,
>     >     >>>>> +likely in a broken loop. This is done to ensure that it does 
> not keep blocking
>     >     >>>>> +the user interface from being correctly displayed. This 
> should be done even if
>     >     >>>>> +the app is correct but happens to trigger some bug in the 
> hardware/driver.
>     >     >>>>
>     >     >>>> I still don't think it's good to let the kernel arbitrarily 
> kill
>     >     >>>> processes that it thinks are not well-behaved based on some 
> heuristics
>     >     >>>> and policy.
>     >     >>>>
>     >     >>>> Can't this be outsourced to user space? Expose the information 
> about
>     >     >>>> processes causing a device reset and let e.g. systemd deal with 
> coming up
>     >     >>>> with a policy and with killing stuff.
>     >     >>>
>     >     >>> I don't think it's the kernel doing the killing, it would be 
> the UMD.
>     >     >>> E.g., if the app is guilty and doesn't support robustness the 
> UMD can
>     >     >>> just call exit().
>     >     >>
>     >     >> It would be safer to just ignore API calls[0], similarly to what 
> is done until the application destroys the context with robustness. Calling 
> exit() likely results in losing any unsaved work, whereas at least some 
> applications might otherwise allow saving the work by other means.
>     >     >
>     >     > That's a terrible idea. Ignoring API calls would be identical to 
> a freeze. You might as well disable GPU recovery because the result would be 
> the same.
>     >
>     >     No GPU recovery would affect everything using the GPU, whereas this 
> affects only non-robust applications.
>     >
>     > which is currently the majority.
> 
>     Not sure where you're going with this. Applications need to use 
> robustness to be able to recover from a GPU hang, and the GPU needs to be 
> reset for that. So disabling GPU reset is not the same as what we're 
> discussing here.
> 
> 
>     >     > - non-robust contexts: call exit(1) immediately, which is the 
> best way to recover
>     >
>     >     That's not the UMD's call to make.
>     >
>     > That's absolutely the UMD's call to make because that's mandated by the 
> hw and API design
> 
>     Can you point us to a spec which mandates that the process must be killed 
> in this case?
> 
> 
>     > and only driver devs know this, which this thread is a proof of. The 
> default behavior is to skip all command submission if a non-robust context is 
> lost, which looks like a freeze. That's required to prevent infinite hangs 
> from the same context and can be caused by the side effects of the GPU reset 
> itself, not by the cause of the previous hang. The only way out of that is 
> killing the process.
> 
>     The UMD killing the process is not the only way out of that, and doing so 
> is overreach on its part. The UMD is but one out of many components in a 
> process, not the main one or a special one. It doesn't get to decide when the 
> process must die, certainly not under circumstances where it must be able to 
> continue while ignoring API calls (that's required for robustness).
> 
> 
> You're mixing things up. Robust apps don't need any special action from a UMD. 
> Only non-robust apps need to be killed for proper recovery with the only 
> other alternative being not updating the window/screen,


I'm saying they don't "need" to be killed, since the UMD must be able to keep 
going while ignoring API calls (or it couldn't support robustness). It's a 
choice, one which is not for the UMD to make.


> Also it's already used and required by our customers on Android because 
> killing a process returns the user to the desktop screen and can generate a 
> crash dump instead of keeping the app output frozen, and they agree that this 
> is the best user experience given the circumstances.

Then some appropriate Android component needs to make that call. The UMD is not 
it.


>     >     >>     [0] Possibly accompanied by a one-time message to stderr 
> along the lines of "GPU reset detected but robustness not enabled in context, 
> ignoring OpenGL API calls".
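
To make the footnote above concrete, here is a rough sketch of what "ignore
API calls and print a one-time message" could look like inside a UMD. It is
only an illustration: struct sketch_context and the sketch_* functions are
hypothetical names invented for this example, not Mesa code or any driver's
actual implementation, and how a real driver learns about a reset is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-context state; NOT any real driver's data structure. */
struct sketch_context {
    bool robust;       /* context was created with robustness enabled */
    bool lost;         /* a GPU reset invalidated this context */
    bool warned;       /* the one-time message was already printed */
    unsigned skipped;  /* API calls dropped since the context was lost */
};

/* Assumed to be called from whatever path detects the reset. */
static void sketch_mark_lost(struct sketch_context *ctx)
{
    assert(ctx != NULL);
    ctx->lost = true;
}

/*
 * Returns true if the current API call should be dropped.
 * Calls on a lost context are dropped whether or not it is robust;
 * a non-robust context additionally gets a single message on stderr.
 */
static bool sketch_should_skip_call(struct sketch_context *ctx)
{
    assert(ctx != NULL);
    if (!ctx->lost)
        return false;

    if (!ctx->robust && !ctx->warned) {
        fprintf(stderr, "GPU reset detected but robustness not enabled "
                        "in context, ignoring OpenGL API calls\n");
        ctx->warned = true;
    }
    ctx->skipped++;
    return true;
}

/* Hypothetical API entry point showing where the check would sit. */
static void sketch_draw(struct sketch_context *ctx)
{
    if (sketch_should_skip_call(ctx))
        return; /* the process keeps running; nothing is submitted */
    /* ... normal command submission would go here ... */
}
```

The sketch never calls exit(): whether the process should die is left to
whatever component owns that policy, and the application keeps the chance to
save its work by other means.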


-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer

Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
