On 11 September 2013 07:50, Mika Kuoppala <mika.kuopp...@linux.intel.com> wrote:
> Paul Berry <stereotype...@gmail.com> writes:
>
> > On 10 September 2013 06:16, Mika Kuoppala <mika.kuopp...@linux.intel.com> wrote:
>
> >> Current policy is to ban a context if it manages to hang the gpu within a certain time window. Paul Berry asked if a stricter policy could be made available for use cases where the application doesn't know whether the rendering command stream sent to the gpu is valid or not.
>
> >> Provide an option, a flag at context creation time, to let userspace set a stricter policy for handling gpu hangs for this context. If a context with this flag set ever hangs the gpu, it will be permanently banned from accessing the GPU. All subsequent batch submissions will return -EIO.
>
> >> Requested-by: Paul Berry <stereotype...@gmail.com>
> >> Cc: Paul Berry <stereotype...@gmail.com>
> >> Cc: Ben Widawsky <b...@bwidawsk.net>
> >> Signed-off-by: Mika Kuoppala <mika.kuopp...@intel.com>
>
> > (Cc-ing Ian since this impacts ARB_robustness, which he's been working on.)
>
> > To clarify my reasons for requesting this feature, it's not necessarily for use cases where the application doesn't know if the rendering command stream is valid. Rather, it's for any case where there is a risk of a GPU hang (this might happen even if the command stream is valid, for example because of an infinite loop in a WebGL shader). Since the user mode application (Mesa in my example) assumes that each batch buffer runs to completion before the next batch buffer runs, it frequently includes commands in batch buffer N that rely on state established by commands in batch buffer N-1. If batch buffer N-1 was interrupted due to a GPU hang, then some of its state updates may not have completed, resulting in a sizeable risk that batch buffer N (and a potentially unlimited number of subsequent batches) will produce a GPU hang as well. The only reliable way to recover from this situation is for Mesa to send a new batch buffer that sets up the GPU state from scratch rather than relying on state established in previous batch buffers.
>
> Thanks for the clarification. I have updated the commit message.
>
> > Since Mesa doesn't wait for batch buffer N-1 to complete before submitting batch buffer N, once a GPU hang occurs the kernel must regard any subsequent buffers as suspect, until it receives some notification from Mesa that the next batch is going to set up the GPU state from scratch. When we met in June, we decided that the notification mechanism would be for Mesa to stop using the context that caused the GPU hang, and create a new context. The first batch buffer sent to the new context would (of necessity) set up the GPU state from scratch. Consequently, all the kernel needs to do to implement the new policy is to permanently ban any context involved in a GPU hang.
>
> Involved as in guilty of the hang, or ban every context that had batches pending?
>
> We could also add an I915_CONTEXT_BAN_ON_PENDING flag, and with it all contexts that were affected would get -EIO on their next batch submission after the hang.
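To make the Mesa side of this concrete, the recovery path I have in mind once a context is banned (i.e. once submissions start returning -EIO) is roughly the sketch below. The context create/destroy ioctls are the existing ones from i915_drm.h; submit_batch() and reemit_all_state() are hypothetical stand-ins for our real execbuffer and state-upload paths, assumed here to report errors as negative errno values:

#include <errno.h>
#include <stdint.h>
#include <xf86drm.h>
#include <i915_drm.h>

/* Stand-ins for Mesa's real execbuffer and state-upload code; assumed to
 * return 0 on success or a negative errno on failure. */
int submit_batch(int fd, uint32_t ctx_id, void *batch);
void reemit_all_state(int fd, uint32_t ctx_id);

static uint32_t create_hw_context(int fd)
{
	struct drm_i915_gem_context_create create = { 0 };

	/* If the flag proposed in this patch lands, this is also where Mesa
	 * would opt in to the strict ban policy. */
	drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &create);
	return create.ctx_id;
}

static int submit_with_recovery(int fd, uint32_t *ctx_id, void *batch)
{
	int ret = submit_batch(fd, *ctx_id, batch);

	if (ret == -EIO) {
		struct drm_i915_gem_context_destroy destroy = { .ctx_id = *ctx_id };

		/* The context has been banned: drop it, create a fresh one,
		 * and re-establish all GPU state before submitting any more
		 * work, since nothing from the old context can be relied on. */
		drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_DESTROY, &destroy);
		*ctx_id = create_hw_context(fd);
		reemit_all_state(fd, *ctx_id);
		ret = submit_batch(fd, *ctx_id, batch);
	}
	return ret;
}

The only thing that really matters here is that the first batch submitted on the replacement context establishes all GPU state from scratch, without relying on anything the banned context may or may not have completed.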
> > Question, since I'm not terribly familiar with the kernel code: is it possible for the ring buffer to contain batches belonging to multiple contexts at a time?
>
> Yes.
>
> > If so, then what happens if a GPU hang occurs? For instance, let's say that the ring contains batch A1 from context A followed by batch B2 from context B. What happens if a GPU hang occurs while executing batch A1? Ideally the kernel would consider only context A to have been involved in the GPU hang, and automatically re-submit batch B2 so that context B is not affected by the hang. Less ideally, but still ok, would be for the kernel to consider both contexts A and B to be involved in the GPU hang, and apply both contexts' banning policies. If, however, the kernel considered only context A to be involved in the GPU hang, but failed to re-submit batch B2, then that would risk future GPU hangs from context B, since a future batch B3 from context B would likely rely on state that should have been established by batch B2.
>
> This patch will only ban the offending context. Other contexts will lose the batches that were pending, as the request queue will be cleared on reset following the hang. As things are now, the kernel won't re-submit anything by itself.

Thanks for the clarification. The important thing from Mesa's point of view is to make sure that batch N submitted to context C will only be executed if batch N-1 has run to completion. We would like this invariant to hold even if other contexts cause GPU hangs. Under the current state of affairs, where a hang on context A can cause a batch belonging to context B to be lost, we would need the I915_CONTEXT_BAN_ON_PENDING flag in order to achieve that invariant. But if the kernel ever got changed in the future so that it automatically re-submitted pending batches upon recovery from a GPU hang* (a change I would advocate), then we wouldn't need the I915_CONTEXT_BAN_ON_PENDING flag anymore, and in fact setting it would be counterproductive.

(*Of course, in order to avoid cascading GPU hangs, the kernel should only re-submit pending batches from contexts other than the offending context.)

So I would be in favor of adding an I915_CONTEXT_BAN_ON_PENDING flag, but I'd suggest renaming it to something like I915_CONTEXT_BAN_ON_BATCH_LOSS. That way, if in the future we add the ability for the kernel to re-submit pending batches upon recovery from a GPU hang, then it will be clear that I915_CONTEXT_BAN_ON_BATCH_LOSS doesn't apply to contexts that had their batches automatically re-submitted.

> I have also been working on an ioctl (get_reset_stats, for the ARB robustness extension) which allows an application to sort out which contexts were affected by a hang. Here is the planned ioctl for the ARB robustness extension:
>
> https://github.com/mkuoppal/linux/commit/698a413472edaec78852b8ca9849961cbdc40d78
>
> This then allows applications to detect which contexts need to resubmit their state, and will also tell them whether the context had a batch active or pending when the gpu hang happened.

That ioctl seems reasonable to me. My only comment is that we might want to consider renaming the "batch_pending" field in drm_i915_reset_stats to "batch_loss", for similar reasons to what I stated above.
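To illustrate how I'd expect Mesa to consume that ioctl, here is a rough sketch. The struct and ioctl names (drm_i915_reset_stats, ctx_id, batch_active, batch_pending) are taken from this discussion and my reading of the linked commit, so treat them as provisional rather than the final interface; it obviously needs headers with the patch applied:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <xf86drm.h>
#include <i915_drm.h>	/* assumes a header with the proposed ioctl applied */

/* Returns true if ctx_id needs its GPU state re-established from scratch:
 * either it was guilty of a hang (batch_active) or it lost queued batches
 * when some other context hung the GPU (batch_pending, or "batch_loss").
 * 'last' holds the counts observed the previous time we asked. */
static bool context_needs_recovery(int fd, uint32_t ctx_id,
				   struct drm_i915_reset_stats *last)
{
	struct drm_i915_reset_stats stats;
	bool dirty;

	memset(&stats, 0, sizeof(stats));
	stats.ctx_id = ctx_id;

	/* Ioctl and struct names per the linked commit; may change on merge. */
	if (drmIoctl(fd, DRM_IOCTL_I915_GET_RESET_STATS, &stats))
		return false;	/* treat a failed query as "no new information" */

	dirty = stats.batch_active != last->batch_active ||
		stats.batch_pending != last->batch_pending;
	*last = stats;
	return dirty;
}

A context showing a new batch_active count is presumably about to be banned (given the flag in this patch) and would need replacing outright; a new batch_pending / batch_loss count just means the next batch on that context has to re-establish all state from scratch.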