I don't know what iris does, but I would guess that the same problems as with AMD GPUs apply, making GPU resets very fragile.

Marek

On Tue., Mar. 29, 2022, 08:14 Christian König, <christian.koe...@amd.com> wrote:

> My main question is what does the iris driver do better than radeonsi
> when the client doesn't support the robustness extension?
>
> From Daniel's description it sounds like they have at least a partial
> recovery mechanism in place.
>
> Apart from that I completely agree with what you said below.
>
> Christian.
>
> On 26.03.22 at 01:53, Olsak, Marek wrote:
>
> [AMD Official Use Only]
>
> amdgpu has 2 resets: soft reset and hard reset.
>
> The soft reset is able to recover from an infinite loop and even some
> GPU hangs due to bad shaders or bad states. The soft reset uses a signal
> that kills all currently-running shaders of a certain process (VM
> context), which unblocks the graphics pipeline, so draws and command
> buffers finish but are not executed correctly. This can then cause a
> hard hang if the shader was supposed to signal work completion through a
> shader store instruction and a non-shader consumer is waiting for it
> (skipping the store instruction by killing the shader won't signal the
> work, and thus the consumer will be stuck, requiring a hard reset).
>
> The hard reset can recover from other hangs, which is great, but it may
> use a PCI reset, which erases VRAM on dGPUs. APUs don't lose memory
> contents, but we should assume that any process that had running jobs on
> the GPU during a GPU reset has its memory resources in an inconsistent
> state, and thus following command buffers can cause another GPU hang.
> The shader store example above is enough to cause another hard hang due
> to incorrect content in memory resources, which can contain
> synchronization primitives that are used internally by the hardware.
>
> Asking the driver to replay a command buffer that caused a hang is a
> sure way to hang it again. Unrelated processes can be affected due to
> lost VRAM or the misfortune of using the GPU while the GPU hang
> occurred. The window system should recreate GPU resources and redraw
> everything without affecting applications. If apps use GL, they should
> do the same. Processes that can't recover by redrawing content can be
> terminated or left alone, but they shouldn't be allowed to submit work
> to the GPU anymore.
>
> dEQP only exercises the soft reset. I think WebGL is only able to
> trigger a soft reset at this point, but Vulkan can also trigger a hard
> reset.
>
> Marek
> ------------------------------
> *From:* Koenig, Christian <christian.koe...@amd.com>
> *Sent:* March 23, 2022 11:25
> *To:* Daniel Vetter <dan...@ffwll.ch>; Daniel Stone
> <dan...@fooishbar.org>; Olsak, Marek <marek.ol...@amd.com>; Grodzovsky,
> Andrey <andrey.grodzov...@amd.com>
> *Cc:* Rob Clark <robdcl...@gmail.com>; Rob Clark
> <robdcl...@chromium.org>; Sharma, Shashank <shashank.sha...@amd.com>;
> Christian König <ckoenig.leichtzumer...@gmail.com>; Somalapuram,
> Amaranath <amaranath.somalapu...@amd.com>; Abhinav Kumar
> <quic_abhin...@quicinc.com>; dri-devel
> <dri-devel@lists.freedesktop.org>; amd-gfx list
> <amd-...@lists.freedesktop.org>; Deucher, Alexander
> <alexander.deuc...@amd.com>; Shashank Sharma
> <contactshashanksha...@gmail.com>
> *Subject:* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
>
> [Adding Marek and Andrey as well]
>
> On 23.03.22 at 16:14, Daniel Vetter wrote:
> > On Wed, 23 Mar 2022 at 15:07, Daniel Stone <dan...@fooishbar.org> wrote:
> >> Hi,
> >>
> >> On Mon, 21 Mar 2022 at 16:02, Rob Clark <robdcl...@gmail.com> wrote:
> >>> On Mon, Mar 21, 2022 at 2:30 AM Christian König
> >>> <christian.koe...@amd.com> wrote:
> >>>> Well you can, it just means that their contexts are lost as well.
> >>> Which is rather inconvenient when deqp-egl reset tests, for example,
> >>> take down your compositor ;-)
> >> Yeah. Or anything WebGL.
> >>
> >> System-wide collateral damage is definitely a non-starter. If that
> >> means that the userspace driver has to do what iris does and ensure
> >> everything's recreated and resubmitted, that works too, just as long
> >> as the response to 'my adblocker didn't detect a crypto miner ad' is
> >> something better than 'shoot the entire user session'.
> > Not sure where that idea came from, I thought at least I made it clear
> > that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> > which should die without recovery attempt.
> >
> > The entire discussion here is who should be responsible for replay and
> > at least if you can decide the uapi, then punting that entirely to
> > userspace is a good approach.
>
> Yes, completely agree. We have the approach of re-submitting things in
> the kernel and that failed quite miserably.
>
> In other words currently a GPU reset has something like a 99% chance to
> take down your whole desktop.
>
> Daniel can you briefly explain what exactly iris does when a lost
> context is detected without gl robustness?
>
> It sounds like you guys got that working quite well.
>
> Thanks,
> Christian.
>
> >
> > Ofc it'd be nice if the collateral damage is limited, i.e. requests
> > not currently on the gpu, or on different engines and all that
> > shouldn't be nuked, if possible.
> >
> > Also ofc since msm uapi is that the kernel tries to recover there's
> > not much we can do there, contexts cannot be shot. But still trying to
> > replay them as much as possible feels a bit like overkill.
> > -Daniel
> >
> >> Cheers,
> >> Daniel
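
For reference, the per-context policy discussed in this thread could be sketched roughly as below. This is only an illustration of the rules as stated in the mails (legacy GL has to recover, Vulkan and robustness-enabled GL just get a lost context, clients that can't redraw are blocked from further submissions). It is not how amdgpu, iris, or any kernel driver implements it, and every name here (Api, Action, Context, action_after_reset, the can_redraw flag) is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Api(Enum):
    LEGACY_GL = auto()   # GL context created without robustness
    ROBUST_GL = auto()   # GL context using the robustness extension
    VULKAN = auto()


class Action(Enum):
    UNAFFECTED = auto()            # keep running as before
    REPORT_CONTEXT_LOST = auto()   # tell the client, no recovery attempt
    RECREATE_AND_REDRAW = auto()   # userspace rebuilds resources and redraws
    BLOCK_SUBMISSIONS = auto()     # cannot recover: no more GPU work allowed


@dataclass
class Context:
    name: str
    api: Api
    had_jobs_in_flight: bool   # was running on the GPU during the reset
    can_redraw: bool = True    # hypothetical flag: client can rebuild its state


def action_after_reset(ctx: Context, vram_lost: bool) -> Action:
    """Pick what should happen to one context after a GPU reset.

    vram_lost models a hard reset on a dGPU that erased VRAM; in that
    case even contexts that were idle during the hang are affected.
    """
    affected = ctx.had_jobs_in_flight or vram_lost
    if not affected:
        return Action.UNAFFECTED
    # Vulkan and robust GL clients are told the context is lost and
    # handle it themselves; nothing is replayed on their behalf.
    if ctx.api in (Api.ROBUST_GL, Api.VULKAN):
        return Action.REPORT_CONTEXT_LOST
    # Legacy GL has to recover: recreate resources and redraw, never
    # replay the command buffer that caused the hang.
    if ctx.can_redraw:
        return Action.RECREATE_AND_REDRAW
    return Action.BLOCK_SUBMISSIONS


if __name__ == "__main__":
    contexts = [
        Context("compositor", Api.LEGACY_GL, had_jobs_in_flight=True),
        Context("webgl-tab", Api.ROBUST_GL, had_jobs_in_flight=True),
        Context("idle-app", Api.LEGACY_GL, had_jobs_in_flight=False),
    ]
    for ctx in contexts:
        print(ctx.name, action_after_reset(ctx, vram_lost=False).name)
```

The point of keeping this a pure function of (context, vram_lost) is that nothing gets replayed: the decision only says who needs to rebuild state and who gets cut off.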