On 03/01/2017 12:13, Chris Wilson wrote:
> On Tue, Jan 03, 2017 at 11:57:44AM +0000, Tvrtko Ursulin wrote:
>>
>> On 03/01/2017 11:46, Chris Wilson wrote:
>>> On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote:
>>>>
>>>> On 03/01/2017 11:05, Chris Wilson wrote:
>>>>> The struct dma_fence carries a status field exposed to userspace by
>>>>> sync_file. This is inspected after the fence is signaled and can convey
>>>>> whether or not the request completed successfully, or in our case if we
>>>>> detected a hang during the request (signaled via -EIO in
>>>>> SYNC_IOC_FILE_INFO).
>>>>>
>>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>>> Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
>>>>> ---
>>>>>  drivers/gpu/drm/i915/i915_gem.c | 6 ++++--
>>>>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>>>> index 204c4a673bf3..bc99c0e292d8 100644
>>>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>>>> @@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine)
>>>>>                  ring_hung = false;
>>>>>          }
>>>>>
>>>>> -        if (ring_hung)
>>>>> +        if (ring_hung) {
>>>>>                  i915_gem_context_mark_guilty(request->ctx);
>>>>> -        else
>>>>> +                request->fence.status = -EIO;
>>>>> +        } else {
>>>>>                  i915_gem_context_mark_innocent(request->ctx);
>>>>> +        }
>>>>>
>>>>>          if (!ring_hung)
>>>>>                  return;
>>>>>
>>>>
>>>> Reading what happens later in this function, should we set the
>>>> status of all the other requests we are about to clear?
>>>>
>>>> However, one thing I don't understand is how this scheme interacts
>>>> with the current userspace. We will clear/no-op some of the
>>>> submitted requests since the state is corrupt. But how will
>>>> userspace notice this before it submits more requests?
>>>
>>> There is no mechanism currently for userspace to be able to detect a
>>> hung request. (It can use the uevent for async notification of the
>>> hang/reset, but that will not tell you who caused the hang.) Userspace
>>> can track the number of hangs it caused, but the delay makes any
>>> roundtripping impractical (i.e. you have to synchronise before all
>>> rendering if you must detect the event immediately). Note also that we
>>> do not want to give out interprocess information (i.e. to allow one
>>> client to spy on another), which makes things harder to get right.
>>
>> So the idea is to clear already submitted requests _if_ the userspace is
>> synchronising before all rendering and looking at reset stats, to
>> make it theoretically possible to detect the corrupt state?
>
> No, I just don't see a way that userspace can detect the hang without
> testing after seeing the request signaled (either by waiting on the
> batch or by waiting on the fence), i.e. by being completely synchronous
> (or at least choosing its synchronous points very carefully, such as
> around IPC). It can either poll reset-count or use sync_file (which
> requires fence exporting).
>
> The current robustness interface is a basic query on whether any reset
> occurred within the context, not when.
Why do we bother with clearing the submitted requests then?

Regards,

Tvrtko
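
[Editor's note: for reference, a minimal userspace sketch of the detection path
Chris describes above, i.e. export the batch's fence as a sync_file, wait for it
to signal, then read the status back via SYNC_IOC_FILE_INFO. The fence_fd
variable and the helper name are hypothetical; struct sync_file_info and the
ioctl come from the uapi sync_file interface, and with the patch above a hung
request would report -EIO in info.status.]

#include <poll.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/sync_file.h>

/* Returns >0 on success, 0 if still unsignaled, negative errno (e.g. -EIO)
 * if the fence signaled with an error, or -1 on ioctl/poll failure. */
static int check_fence_status(int fence_fd)
{
        struct pollfd pfd = { .fd = fence_fd, .events = POLLIN };
        struct sync_file_info info;

        /* Wait for the fence to signal. */
        if (poll(&pfd, 1, -1) <= 0)
                return -1;

        memset(&info, 0, sizeof(info));
        if (ioctl(fence_fd, SYNC_IOC_FILE_INFO, &info) < 0)
                return -1;

        /* 1: signaled OK, 0: still active, negative: error status. */
        return info.status;
}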