On 2021-02-26 6:54 a.m., Liu, Monk wrote:
[AMD Official Use Only - Internal Distribution Only]
See inline.
Thanks
------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------
*From:* Koenig, Christian <christian.koe...@amd.com>
*Sent:* Friday, February 26, 2021 3:58 PM
*To:* Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org
*Cc:* Zhang, Andy <andy.zh...@amd.com>; Chen, Horace
<horace.c...@amd.com>; Zhang, Jack (Jian) <jack.zha...@amd.com>
*Subject:* Re: [RFC] a new approach to detect which ring is the real
black sheep upon TDR reported
Hi Monk,
in general an interesting idea, but I see two major problems with that:
1. It would make the reset take much longer.
2. Things often get stuck because of timing issues, so a guilty job
might pass perfectly fine when run a second time.
[ML] But the innocent ring has already reported a TDR, and the drm sched
logic has already deleted this “sched_job” from its mirror list, so you
don’t get a chance to re-submit it again after the reset. That is the
major problem here.
Just to confirm I understand correctly: Monk reports a scenario where
the second TDR, reported by the innocent job, bails out BEFORE having a
chance to run drm_sched_stop for that scheduler, which would have
reinserted the job back into the mirror list (because the first TDR run
is still in progress and hence amdgpu_device_lock_adev fails for the
second TDR), and so the innocent job that was extracted from the mirror
list in drm_sched_job_timedout is now lost.
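For reference, the path I mean in drm_sched_job_timedout() looks roughly
like the abridged sketch below (exact names may differ a little between
kernel versions):

static void drm_sched_job_timedout(struct work_struct *work)
{
	struct drm_gpu_scheduler *sched =
		container_of(work, struct drm_gpu_scheduler, work_tdr.work);
	struct drm_sched_job *job;

	spin_lock(&sched->job_list_lock);
	job = list_first_entry_or_null(&sched->ring_mirror_list,
				       struct drm_sched_job, node);
	if (job) {
		/* Unlinked here so a concurrent drm_sched_get_cleanup_job()
		 * cannot free it; it only gets reinserted later from
		 * drm_sched_stop(), which a bailing TDR never reaches. */
		list_del_init(&job->node);
		spin_unlock(&sched->job_list_lock);

		job->sched->ops->timedout_job(job);
	} else {
		spin_unlock(&sched->job_list_lock);
	}

	/* ... re-arm the timeout work ... */
}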
If so, then as a possible quick fix until we overhaul the entire design
as suggested in this thread, maybe we can modify the
drm_sched_backend_ops.timedout_job callback to report back premature
termination, i.e. termination BEFORE drm_sched_stop had a chance to run,
and then reinsert the job back into the mirror list from within
drm_sched_job_timedout? There is no problem of racing against a
concurrent drm_sched_get_cleanup_job once we reinsert there, as we don't
reference the job pointer anymore after that point, so it is ok even if
the job is already signaled and freed right away.
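Something like the minimal sketch below is what I have in mind. The enum
and the status names are purely illustrative, not a final interface:

/* Hypothetical sketch: let the driver callback report that it bailed out
 * before drm_sched_stop() ran, and reinsert the job in that case.
 * Names are illustrative only. */
enum drm_sched_timedout_status {
	DRM_SCHED_TIMEDOUT_HANDLED,	/* normal path, drm_sched_stop() ran */
	DRM_SCHED_TIMEDOUT_BAILED,	/* another reset in progress, bailed early */
};

static void drm_sched_job_timedout(struct work_struct *work)
{
	struct drm_gpu_scheduler *sched =
		container_of(work, struct drm_gpu_scheduler, work_tdr.work);
	struct drm_sched_job *job;
	enum drm_sched_timedout_status status;

	spin_lock(&sched->job_list_lock);
	job = list_first_entry_or_null(&sched->ring_mirror_list,
				       struct drm_sched_job, node);
	if (!job) {
		spin_unlock(&sched->job_list_lock);
		return;
	}
	list_del_init(&job->node);
	spin_unlock(&sched->job_list_lock);

	/* timedout_job would be changed to return the status */
	status = job->sched->ops->timedout_job(job);

	if (status == DRM_SCHED_TIMEDOUT_BAILED) {
		/* The driver never reached drm_sched_stop(), so put the job
		 * back ourselves. Racing with drm_sched_get_cleanup_job() is
		 * fine because we do not dereference the job pointer after
		 * this point. */
		spin_lock(&sched->job_list_lock);
		list_add(&job->node, &sched->ring_mirror_list);
		spin_unlock(&sched->job_list_lock);
	}
}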
Andrey
Apart from that, the whole ring mirror list turned out to be a really
bad idea. E.g. we still struggle with object lifetime because the
concept doesn't fit into the object model of the GPU scheduler under
Linux.
We should probably work on this separately and straighten up the job
destruction once more and keep the recovery information in the fence
instead.
[ML] We claim to our customers that no innocent process will be dropped
or cancelled, and our current logic works most of the time; but when
different processes are running on the gfx/compute rings we run into
the tricky situation I stated here, and this proposal is the only way I
can figure out so far. Do you have a better solution or idea we can
review as another candidate RFC? Please note that we raised this
proposal because we did hit this trouble and we do need to resolve it ….
So even an imperfect solution is still better than just cancelling the
innocent job (and its context/process).
Thanks!
Regards,
Christian.
Am 26.02.21 um 06:58 schrieb Liu, Monk:
[AMD Public Use]
Hi all
The NAVI2X project has hit a really hard-to-solve issue, and it turns
out to be a general headache for our TDR mechanism. Check the scenario
below:
1. A job, job1, is running on the compute1 ring.
2. Another job, job2, is running on the gfx ring.
3. Job1 is the guilty one, and job1/job2 were scheduled to their
rings at almost the same timestamp.
4. After 2 seconds we receive two TDR reports, one from the GFX
ring and one from the compute ring.
5. *The current scheme is that in the drm scheduler the head jobs of
both rings are considered “bad jobs” and taken away from the
mirror list.*
6. The result is that both the real guilty job (job1) and the
innocent job (job2) are deleted from their mirror lists, and their
corresponding contexts are also treated as guilty *(so keeping the
innocent process running is not guaranteed)*.
But ideally the TDR mechanism would detect which ring is the guilty
one, and the innocent ring would resubmit all of its pending jobs:
1. Job1 is deleted from the compute1 ring’s mirror list.
2. Job2 is kept and resubmitted later, and its owning
process/context is not even aware of this TDR at all.
Here I have a proposal intended to achieve the above goal; the rough
procedure is:
1. Once any ring reports a TDR, the head job is **not** treated
as a “bad job”, and it is **not** deleted from the mirror list
in the drm sched functions.
2. In the vendor’s function (our amdgpu driver here):
   * reset the GPU
   * repeat the actions below on each ring, *one by one*:
     1. take the head job and submit it on this ring
     2. see if it completes; if not, then this job is the real “bad job”
     3. take it away from the mirror list if this head job is a “bad job”
   * after the above iteration over all rings, all the bad job(s)
     have been cleared
3. Resubmit all jobs from each mirror list to their corresponding
rings (this is the existing logic).
The idea is to use a “serial” way to re-run and re-check the head job
of each ring, in order to take out the real black sheep and its guilty
context. A rough driver-side sketch of this loop is shown below.
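Very roughly, the driver-side part could look like the sketch below.
The helper names (resubmit_head_job, wait_head_job, drop_head_job) and
the reset call are placeholders for the per-ring steps described above,
not existing functions:

/* Sketch of step 2 of the proposal inside the amdgpu reset path
 * (would live in amdgpu with the usual amdgpu.h includes).
 * resubmit_head_job()/wait_head_job()/drop_head_job()/do_full_gpu_reset()
 * are hypothetical placeholders for the actions listed above. */
static void amdgpu_tdr_find_bad_jobs(struct amdgpu_device *adev)
{
	int i;

	/* placeholder for the real reset entry point */
	do_full_gpu_reset(adev);

	/* Serially re-run the head job of every ring, one ring at a time,
	 * so that a hang can be attributed to exactly one ring. */
	for (i = 0; i < adev->num_rings; i++) {
		struct amdgpu_ring *ring = adev->rings[i];

		if (!ring || !ring->sched.thread)
			continue;

		/* 1. take the head job and submit it on this ring */
		resubmit_head_job(ring);

		/* 2. see if it completes within the ring's timeout */
		if (!wait_head_job(ring, ring->sched.timeout)) {
			/* 3. it did not complete: this is the real "bad job";
			 * take it off the mirror list and mark its context
			 * guilty. Presumably the GPU then needs another reset
			 * before testing the next ring. */
			drop_head_job(ring);
			do_full_gpu_reset(adev);
		}
	}

	/* Step 3 of the proposal: the existing logic then resubmits whatever
	 * is still left on each ring's mirror list. */
}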
P.S.: we can use this approach only when a GFX/KCQ ring reports a TDR,
since those rings affect each other. For an SDMA ring timeout it is
already certain that the head job on that SDMA ring is really guilty.
Thanks
------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx