[AMD Official Use Only - Internal Distribution Only]

Hi Andrey,

The scenario I hit here is not the one you mentioned; let me explain it in more 
detail with another, easier to understand example:

Consider that you have a job1 on the KCQ ring, but the timeout of the KCQ is 60 
seconds (just for example)
You also have a job2 on the GFX ring, and the timeout of the GFX ring is 2 seconds

We submit job1 first, and assume job1 has a bug that will cause a shader hang 
very soon
After 10 seconds we submit job2; since the KCQ has 60 seconds before it reports 
a TDR, the SW still knows nothing about the engine already being hung
After 2 more seconds we get a TDR report for job2 on the GFX ring; 
drm_sched_job_timedout() thinks the leading job of the GFX ring is the black 
sheep, so it is deleted from the mirror list
But in fact this job (job2) is innocent, and we should insert it back after 
recovery; because it was already deleted, this innocent job's context/process 
is really harmed
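
To make the timing concrete, here is a rough timeline of the example above 
(the timeout values are just the illustrative ones from this example):

  t =  0s : job1 is submitted on the KCQ (compute timeout = 60s); the shader hangs
  t = 10s : job2 is submitted on the GFX ring (gfx timeout = 2s); it cannot make
            progress because the engine is already hung by job1
  t = 12s : the GFX TDR fires first and drm_sched_job_timedout() removes job2,
            the innocent head job of the GFX ring, from its mirror list
  t = 60s : only at this point would the KCQ TDR fire for the real culprit, job1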

Hope the example above helps

Thanks

From: Grodzovsky, Andrey <andrey.grodzov...@amd.com>
Sent: February 27, 2021 0:50
To: Liu, Monk <monk....@amd.com>; Koenig, Christian 
<christian.koe...@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Zhang, Andy <andy.zh...@amd.com>; Chen, Horace <horace.c...@amd.com>; 
Zhang, Jack (Jian) <jack.zha...@amd.com>
Subject: Re: [RFC] a new approach to detect which ring is the real black sheep upon 
TDR reported



On 2021-02-26 6:54 a.m., Liu, Monk wrote:

[AMD Official Use Only - Internal Distribution Only]

See inline

Thanks

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

From: Koenig, Christian <christian.koe...@amd.com>
Sent: Friday, February 26, 2021 3:58 PM
To: Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Zhang, Andy <andy.zh...@amd.com>; Chen, Horace <horace.c...@amd.com>; 
Zhang, Jack (Jian) <jack.zha...@amd.com>
Subject: Re: [RFC] a new approach to detect which ring is the real black sheep 
upon TDR reported

Hi Monk,

in general an interesting idea, but I see two major problems with that:

1. It would make the reset take much longer.

2. Things often get stuck because of timing issues, so a guilty job might pass 
perfectly fine when run a second time.
[ML] But the innocent ring already reported a TDR, and the drm sched logic 
already deleted this "sched_job" from its mirror list, so you don't have a chance 
to re-submit it after reset; that's the major problem here.



Just to confirm I understand correctly: Monk reports a scenario where the 
second TDR, the one reported for the innocent job, bails out BEFORE it has a 
chance to run drm_sched_stop for that scheduler, which would have reinserted 
the job back into the mirror list (because the first TDR handling is still in 
progress and hence amdgpu_device_lock_adev fails for the second TDR), and so 
the innocent job which was extracted from the mirror list in 
drm_sched_job_timedout is now lost.
If so, and as a possible quick fix until we overhaul the entire design as 
suggested in this thread, maybe we can modify the 
drm_sched_backend_ops.timedout_job callback to report back premature 
termination BEFORE drm_sched_stop had a chance to run, and then reinsert the 
job back into the mirror list from within drm_sched_job_timedout? There is no 
problem with racing against a concurrent drm_sched_get_cleanup_job once we 
reinsert there, as we don't reference the job pointer anymore after this point, 
so it's fine if it has already signaled and is freed right away.
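
Something like this minimal sketch of the idea, purely for illustration: the 
status enum and its values are invented here (today the timedout_job callback 
returns void), and the body only approximates the current 
drm_sched_job_timedout(), it is not a patch:

enum drm_sched_timedout_status {
	DRM_SCHED_TIMEDOUT_HANDLED,	/* driver ran drm_sched_stop(), job already reinserted */
	DRM_SCHED_TIMEDOUT_BAILED,	/* driver bailed out early, e.g. amdgpu_device_lock_adev failed */
};

static void drm_sched_job_timedout(struct work_struct *work)
{
	struct drm_gpu_scheduler *sched =
		container_of(work, struct drm_gpu_scheduler, work_tdr.work);
	struct drm_sched_job *job;
	enum drm_sched_timedout_status status;

	spin_lock(&sched->job_list_lock);
	job = list_first_entry_or_null(&sched->ring_mirror_list,
				       struct drm_sched_job, node);
	if (!job) {
		spin_unlock(&sched->job_list_lock);
		return;
	}

	/* Remove the job so a concurrent drm_sched_get_cleanup_job()
	 * cannot free it while the driver callback runs. */
	list_del_init(&job->node);
	spin_unlock(&sched->job_list_lock);

	status = job->sched->ops->timedout_job(job);

	if (status == DRM_SCHED_TIMEDOUT_BAILED) {
		/* The driver never reached drm_sched_stop(), so nothing has
		 * reinserted the job; put it back ourselves.  We don't touch
		 * the job pointer afterwards, so it is fine if it signals
		 * and gets freed right away. */
		spin_lock(&sched->job_list_lock);
		list_add(&job->node, &sched->ring_mirror_list);
		spin_unlock(&sched->job_list_lock);
	}
}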

Andrey



Apart from that the whole ring mirror list turned out to be a really bad idea. 
E.g. we still struggle with object lifetime because the concept doesn't fit 
into the object model of the GPU scheduler under Linux.

We should probably work on this separately and straighten up the job 
destruction once more and keep the recovery information in the fence instead.
[ML] We promise our customers that no innocent process will be dropped or 
cancelled. Our current logic works most of the time; only when different 
processes are running on the gfx/compute rings do we run into the tricky 
situation I described here, and the proposal is the only way I can figure out 
so far. Do you have a better solution or idea we could review as another 
candidate RFC? Note that we raised this proposal because we really do hit this 
trouble and we do need to resolve it …. So even an imperfect solution is still 
better than just cancelling the innocent job (and its context/process)
Thanks !

Regards,
Christian.
On 26.02.21 at 06:58, Liu, Monk wrote:

[AMD Public Use]

Hi all

The NAVI2X project has hit a really hard-to-solve issue, and it turned out to 
be a general headache with our TDR mechanism. Check the scenario below:


  1.  There is a job1 running on the compute1 ring at some timestamp
  2.  There is a job2 running on the gfx ring at some timestamp
  3.  Job1 is the guilty one, and job1/job2 were scheduled to their rings at 
almost the same timestamp
  4.  After 2 seconds we receive two TDR reports, from both the GFX ring and 
the compute ring
  5.  The current scheme is that in the drm scheduler the head jobs of both 
rings are considered "bad jobs" and taken away from the mirror list
  6.  The result is that both the real guilty job (job1) and the innocent job 
(job2) are deleted from the mirror list, and their corresponding contexts are 
also treated as guilty (so keeping the innocent process running is not ensured)


But ideally the TDR mechanism would detect which ring is the guilty one, and 
the innocent ring could resubmit all its pending jobs:

  1.  Job1 is deleted from the compute1 ring's mirror list
  2.  Job2 is kept and resubmitted later, and its owning process/context is 
not even aware of this TDR at all


Here I have a proposal intended to achieve the above goal; its rough procedure is:

  1.  Once any ring reports a TDR, the head job is *not* treated as a "bad job", 
and it is *not* deleted from the mirror list in the drm sched functions
  2.  In the vendor's function (our amdgpu driver here):
     *   reset the GPU
     *   repeat the actions below on each RING, *one by one*:
         1. take the head job and submit it on this ring
         2. see if it completes; if not, then this job is the real "bad job"
         3. take it away from the mirror list if this head job is the "bad job"
     *   after the above iteration over all RINGS, we have already cleared all 
the bad job(s)
  3.  Resubmit all jobs from each mirror list to their corresponding rings 
(this is the existing logic)

The idea is to re-run and re-check each head job of each RING in a "serial" 
way, in order to pick out the real black sheep and its guilty context; a rough 
sketch of this loop is shown below.
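
To make the procedure concrete, here is a purely illustrative sketch of step 2; 
amdgpu_probe_guilty_jobs(), resubmit_and_wait() and amdgpu_full_gpu_reset() are 
hypothetical names, and the list handling just reuses the scheduler's 
mirror-list fields:

/* Hypothetical sketch of the serial per-ring re-check; not existing code. */
static void amdgpu_probe_guilty_jobs(struct amdgpu_device *adev)
{
	int i;

	amdgpu_full_gpu_reset(adev);		/* hypothetical reset helper */

	for (i = 0; i < adev->num_rings; ++i) {
		struct amdgpu_ring *ring = adev->rings[i];
		struct drm_gpu_scheduler *sched;
		struct drm_sched_job *head;

		if (!ring)
			continue;
		sched = &ring->sched;

		/* Per step 1, the head job was left on the mirror list. */
		head = list_first_entry_or_null(&sched->ring_mirror_list,
						struct drm_sched_job, node);
		if (!head)
			continue;

		/* Re-run only this head job and wait up to this ring's own
		 * timeout (hypothetical helper). */
		if (resubmit_and_wait(ring, head, sched->timeout))
			continue;	/* it completed, this ring is innocent */

		/* It hung again: this is the real bad job.  Drop it, mark its
		 * context guilty and reset again before probing the next ring. */
		list_del_init(&head->node);
		drm_sched_increase_karma(head);
		amdgpu_full_gpu_reset(adev);
	}

	/* From here the existing logic resubmits all remaining jobs of every
	 * mirror list to their rings. */
}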

P.S.: we can use this approach only when a GFX/KCQ ring reports the TDR, since 
those rings mutually affect each other. An SDMA ring timeout definitely proves 
that the head job on the SDMA ring is really guilty.

Thanks

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------





_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx