On 03/01/2017 12:38, Chris Wilson wrote:
> On Tue, Jan 03, 2017 at 12:34:16PM +0000, Tvrtko Ursulin wrote:
>>
>> On 03/01/2017 12:13, Chris Wilson wrote:
>>> On Tue, Jan 03, 2017 at 11:57:44AM +0000, Tvrtko Ursulin wrote:
>>>>
>>>> On 03/01/2017 11:46, Chris Wilson wrote:
>>>>> On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> On 03/01/2017 11:05, Chris Wilson wrote:
>>>>>>> The struct dma_fence carries a status field exposed to userspace by
>>>>>>> sync_file. This is inspected after the fence is signaled and can convey
>>>>>>> whether or not the request completed successfully, or in our case if we
>>>>>>> detected a hang during the request (signaled via -EIO in
>>>>>>> SYNC_IOC_FILE_INFO).
>>>>>>>
>>>>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>>>>> Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
>>>>>>> ---
>>>>>>>  drivers/gpu/drm/i915/i915_gem.c | 6 ++++--
>>>>>>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>>>>>> index 204c4a673bf3..bc99c0e292d8 100644
>>>>>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>>>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>>>>>> @@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine)
>>>>>>>  		ring_hung = false;
>>>>>>>  	}
>>>>>>>
>>>>>>> -	if (ring_hung)
>>>>>>> +	if (ring_hung) {
>>>>>>>  		i915_gem_context_mark_guilty(request->ctx);
>>>>>>> -	else
>>>>>>> +		request->fence.status = -EIO;
>>>>>>> +	} else {
>>>>>>>  		i915_gem_context_mark_innocent(request->ctx);
>>>>>>> +	}
>>>>>>>
>>>>>>>  	if (!ring_hung)
>>>>>>>  		return;
>>>>>>>
>>>>>>
>>>>>> Reading what happens later in this function, should we set the
>>>>>> status of all the other requests we are about to clear?
>>>>>>
>>>>>> However, one thing I don't understand is how this scheme interacts
>>>>>> with the current userspace. We will clear/no-op some of the
>>>>>> submitted requests since the state is corrupt. But how will
>>>>>> userspace notice this before it submits more requests?
>>>>>
>>>>> There is no mechanism currently for userspace to detect a hung
>>>>> request. (It can use the uevent for async notification of the
>>>>> hang/reset, but that will not tell you who caused the hang.) Userspace
>>>>> can track the number of hangs it caused, but the delay makes any
>>>>> roundtripping impractical (i.e. you have to synchronise before all
>>>>> rendering if you must detect the event immediately). Note also that we
>>>>> do not want to give out interprocess information (i.e. to allow one
>>>>> client to spy on another), which makes things harder to get right.
>>>>
>>>> So the idea is to clear already submitted requests _if_ the userspace is
>>>> synchronising before all rendering and looking at reset stats, to
>>>> make it theoretically possible to detect the corrupt state?
>>>
>>> No, I just don't see a way that userspace can detect the hang without
>>> testing after seeing the request signaled (either by waiting on the
>>> batch or by waiting on the fence), i.e. by being completely synchronous
>>> (or at least choosing its synchronous points very carefully, such as
>>> around IPC). It can either poll reset-count or use sync_file (which
>>> requires fence exporting).
>>>
>>> The current robustness interface is a basic query on whether any reset
>>> occurred within the context, not when.
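
Right - so for a client that does export a fence for its batch, checking
for the hang after the fence has signaled would be roughly like the
below. (Just a sketch so we are on the same page about the sync_file
side; it assumes the fence fd has already been obtained somehow, e.g.
from an execbuf out-fence, which is not part of this patch.)

	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/sync_file.h>

	/*
	 * Returns 0 if the fence signaled cleanly (or is still active),
	 * the negative fence error (-EIO after a hang with this patch),
	 * or -1 if the query itself failed.
	 */
	static int batch_fence_error(int fence_fd)
	{
		struct sync_file_info info;

		memset(&info, 0, sizeof(info));
		if (ioctl(fence_fd, SYNC_IOC_FILE_INFO, &info) < 0)
			return -1;

		/* info.status: 0 = active, 1 = signaled OK, <0 = error */
		return info.status < 0 ? info.status : 0;
	}

Which of course only tells the client anything after it has waited for
the fence, i.e. at whatever synchronisation points it chooses.
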
>>
>> Why do we bother with clearing the submitted requests then?
>
> The same reason we ban processes from submitting new requests if they
> cause repeated hangs. If before we ban that client, it has already
> submitted 1000 hanging requests, it has successfully locked the machine
> up for a couple of hours.
So we would need to gate the clearing on the transition to the banned
state, I think, because currently it is done unconditionally.

Regards,

Tvrtko