On 11 September 2013 07:50, Mika Kuoppala <mika.kuopp...@linux.intel.com> wrote:
> Paul Berry <stereotype...@gmail.com> writes:
>
> > On 10 September 2013 06:16, Mika Kuoppala <mika.kuopp...@linux.intel.com> wrote:
>
> >> Current policy is to ban a context if it manages to hang the gpu within a certain time window. Paul Berry asked if a stricter policy could be made available for use cases where the application doesn't know whether the rendering command stream sent to the gpu is valid or not.
>
> >> Provide an option, a flag at context creation time, to let userspace set a stricter policy for handling gpu hangs for this context. If a context with this flag set ever hangs the gpu, it will be permanently banned from accessing the GPU. All subsequent batch submissions will return -EIO.
>
> >> Requested-by: Paul Berry <stereotype...@gmail.com>
> >> Cc: Paul Berry <stereotype...@gmail.com>
> >> Cc: Ben Widawsky <b...@bwidawsk.net>
> >> Signed-off-by: Mika Kuoppala <mika.kuopp...@intel.com>
>
> > (Cc-ing Ian since this impacts ARB_robustness, which he's been working on.)
>
> > To clarify my reasons for requesting this feature, it's not necessarily for use cases where the application doesn't know if the rendering command stream is valid. Rather, it's for any case where there is a risk of a GPU hang (this might happen even if the command stream is valid, for example because of an infinite loop in a WebGL shader). Since the user mode application (Mesa in my example) assumes that each batch buffer runs to completion before the next batch buffer runs, it frequently includes commands in batch buffer N that rely on state established by commands in batch buffer N-1. If batch buffer N-1 was interrupted due to a GPU hang, then some of its state updates may not have completed, resulting in a sizeable risk that batch buffer N (and a potentially unlimited number of subsequent batches) will produce a GPU hang as well. The only reliable way to recover from this situation is for Mesa to send a new batch buffer that sets up the GPU state from scratch rather than relying on state established in previous batch buffers.
>
> Thanks for the clarification. I have updated the commit message.
>
> > Since Mesa doesn't wait for batch buffer N-1 to complete before submitting batch buffer N, once a GPU hang occurs the kernel must regard any subsequent buffers as suspect, until it receives some notification from Mesa that the next batch is going to set up the GPU state from scratch. When we met in June, we decided that the notification mechanism would be for Mesa to stop using the context that caused the GPU hang, and create a new context. The first batch buffer sent to the new context would (of necessity) set up the GPU state from scratch. Consequently, all the kernel needs to do to implement the new policy is to permanently ban any context involved in a GPU hang.
>
> Involved as in guilty of the hang, or ban every context that had batches pending?
>
> We could also add an I915_CONTEXT_BAN_ON_PENDING flag, and with it all contexts that were affected would get -EIO on their next batch submission after the hang.
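To make the Mesa side of this concrete, the recovery path I have in mind once a context is banned (i.e. once submissions start returning -EIO) is roughly the sketch below. The context create/destroy ioctls are the existing ones from i915_drm.h; submit_batch() and reemit_all_state() are hypothetical stand-ins for our real execbuffer and state-upload paths, assumed here to report errors as negative errno values:

#include <errno.h>
#include <stdint.h>
#include <xf86drm.h>
#include <i915_drm.h>

/* Stand-ins for Mesa's real execbuffer and state-upload code; assumed to
 * return 0 on success or a negative errno on failure. */
int submit_batch(int fd, uint32_t ctx_id, void *batch);
void reemit_all_state(int fd, uint32_t ctx_id);

static uint32_t create_hw_context(int fd)
{
	struct drm_i915_gem_context_create create = { 0 };

	/* If the flag proposed in this patch lands, this is also where Mesa
	 * would opt in to the strict ban policy. */
	drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &create);
	return create.ctx_id;
}

static int submit_with_recovery(int fd, uint32_t *ctx_id, void *batch)
{
	int ret = submit_batch(fd, *ctx_id, batch);

	if (ret == -EIO) {
		struct drm_i915_gem_context_destroy destroy = { .ctx_id = *ctx_id };

		/* The context has been banned: drop it, create a fresh one,
		 * and re-establish all GPU state before submitting any more
		 * work, since nothing from the old context can be relied on. */
		drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_DESTROY, &destroy);
		*ctx_id = create_hw_context(fd);
		reemit_all_state(fd, *ctx_id);
		ret = submit_batch(fd, *ctx_id, batch);
	}
	return ret;
}

The only thing that really matters here is that the first batch submitted on the replacement context establishes all GPU state from scratch, without relying on anything the banned context may or may not have completed.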
> > Question, since I'm not terribly familiar with the kernel code: is it possible for the ring buffer to contain batches belonging to multiple contexts at a time?
>
> Yes.
>
> > If so, then what happens if a GPU hang occurs? For instance, let's say that the ring contains batch A1 from context A followed by batch B2 from context B. What happens if a GPU hang occurs while executing batch A1? Ideally the kernel would consider only context A to have been involved in the GPU hang, and automatically re-submit batch B2 so that context B is not affected by the hang. Less ideally, but still ok, would be for the kernel to consider both contexts A and B to be involved in the GPU hang, and apply both contexts' banning policies. If, however, the kernel considered only context A to be involved in the GPU hang, but failed to re-submit batch B2, then that would risk future GPU hangs from context B, since a future batch B3 from context B would likely rely on state that should have been established by batch B2.
>
> This patch will only ban the offending context. Other contexts will lose the batches that were pending, as the request queue will be cleared on reset following the hang. As things are now, the kernel won't re-submit anything by itself.

Thanks for the clarification. The important thing from Mesa's point of view is to make sure that batch N submitted to context C will only be executed if batch N-1 has run to completion. We would like this invariant to hold even if other contexts cause GPU hangs. Under the current state of affairs, where a hang on context A can cause a batch belonging to context B to be lost, we would need the I915_CONTEXT_BAN_ON_PENDING flag in order to achieve that invariant. But if the kernel ever got changed in the future so that it automatically re-submitted pending batches upon recovery from a GPU hang* (a change I would advocate), then we wouldn't need the I915_CONTEXT_BAN_ON_PENDING flag anymore, and in fact setting it would be counterproductive.

(*Of course, in order to avoid cascading GPU hangs, the kernel should only re-submit pending batches from contexts other than the offending context.)

So I would be in favor of adding an I915_CONTEXT_BAN_ON_PENDING flag, but I'd suggest renaming it to something like I915_CONTEXT_BAN_ON_BATCH_LOSS. That way, if in the future we add the ability for the kernel to re-submit pending batches upon recovery from a GPU hang, then it will be clear that I915_CONTEXT_BAN_ON_BATCH_LOSS doesn't apply to contexts that had their batches automatically re-submitted.

> I have also been working on an ioctl (get_reset_stats, for the ARB robustness extension) which allows an application to sort out which contexts were affected by a hang. Here is the planned ioctl for the ARB robustness extension:
>
> https://github.com/mkuoppal/linux/commit/698a413472edaec78852b8ca9849961cbdc40d78
>
> This then allows applications to detect which contexts need to resubmit their state, and will also tell them whether the context had a batch active or pending when the gpu hang happened.

That ioctl seems reasonable to me. My only comment is that we might want to consider renaming the "batch_pending" field in drm_i915_reset_stats to "batch_loss", for similar reasons to what I stated above.
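To illustrate how I'd expect Mesa to consume that ioctl, here is a rough sketch. The struct and ioctl names (drm_i915_reset_stats, ctx_id, batch_active, batch_pending) are taken from this discussion and my reading of the linked commit, so treat them as provisional rather than the final interface; it obviously needs headers with the patch applied:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <xf86drm.h>
#include <i915_drm.h>	/* assumes a header with the proposed ioctl applied */

/* Returns true if ctx_id needs its GPU state re-established from scratch:
 * either it was guilty of a hang (batch_active) or it lost queued batches
 * when some other context hung the GPU (batch_pending, or "batch_loss").
 * 'last' holds the counts observed the previous time we asked. */
static bool context_needs_recovery(int fd, uint32_t ctx_id,
				   struct drm_i915_reset_stats *last)
{
	struct drm_i915_reset_stats stats;
	bool dirty;

	memset(&stats, 0, sizeof(stats));
	stats.ctx_id = ctx_id;

	/* Ioctl and struct names per the linked commit; may change on merge. */
	if (drmIoctl(fd, DRM_IOCTL_I915_GET_RESET_STATS, &stats))
		return false;	/* treat a failed query as "no new information" */

	dirty = stats.batch_active != last->batch_active ||
		stats.batch_pending != last->batch_pending;
	*last = stats;
	return dirty;
}

A context showing a new batch_active count is presumably about to be banned (given the flag in this patch) and would need replacing outright; a new batch_pending / batch_loss count just means the next batch on that context has to re-establish all state from scratch.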