On Thu, Jan 06, 2022 at 04:31:43PM -0800, john.c.harri...@intel.com wrote:
> From: John Harrison <john.c.harri...@intel.com>
> 
> There is a race (already documented in the code) whereby a context can
> be (re-)queued for submission at the same time as it is being banned
> due to a hang and reset. That leads to a hang/reset report from GuC
> for a context which i915 thinks is already banned.
> 

I think there are 2 issues here.

1. Banning of context (e.g. user closes a non-persistent context)
results in an context reset. In this case we will receive a G2H
indicating a context reset and we want to convert the context reset to a
nop.

2. A GT reset races with a context reset result in the context getting
banned before the G2H is processed. Again we want to convert the context
reset to a nop. This race should be sealed when we can flush the G2H
handler in the reset path. Flushing G2H handler depends on the error
capture not allocating memory in non-sleeping contexts. Thomas H had a
patch for this.

In both cases we shouldn't print an error.

> While the race is indented to be fixed in a future GuC update, there
> is no actual harm beyond the wasted execution time of that new hang
> detection period. The context has already been banned for bad
> behaviour so a fresh hang is hardly surprising and certainly isn't
> going to be losing any work that wouldn't already have been lost if
> there was no race.
>

See above, I think you are confusing the issues here. This won't be
fixed by an updated GuC firmware.

> So don't treat this situation as an error. The error message is seen
> by the CI system as something fatal and causes test failures. Instead,
> just print an informational so the user at least knows a context reset
> occurred (given that the error capture is being skipped).
> 
> Signed-off-by: John Harrison <john.c.harri...@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 9989d121127d..e8a32a7e7daf 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -3978,6 +3978,10 @@ static void guc_handle_context_reset(struct intel_guc 
> *guc,
>                  !context_blocked(ce))) {
>               capture_error_state(guc, ce);
>               guc_context_replay(ce);
> +     } else if (intel_context_is_banned(ce)) {
> +             drm_info(&guc_to_gt(guc)->i915->drm,
> +                      "Reset notificaion for banned context 0x%04X on %s",
> +                      ce->guc_id.id, ce->engine->name);

The context being blocking isn't an error either. I think real fix is
changing the below drm_err to drm_info and call it a day.

Matt

>       } else {
>               drm_err(&guc_to_gt(guc)->i915->drm,
>                       "Invalid GuC engine reset notificaion for 0x%04X on %s: 
> banned = %d, blocked = %d",
> -- 
> 2.25.1
> 

Reply via email to