On 27.02.21 at 04:50, Liu, Monk wrote:
[AMD Official Use Only - Internal Distribution Only]
the code I pasted illustrates why the innocent job is already
taken out of the mirror list, so my suggested proposal won't work
unless we stop deleting the job in the sched_job_timeout() routine. The
problem you stated is, to my understanding, also related to
my suggested solution – removing the job from the list should be handled
by the driver instead of the scheduler.
Yes, that's exactly my thinking as well.
let's make the scheduler's duty clear and simple: sched_job_timeout()
only gets a notification when a sched_job times out, but it doesn't judge
whether the leading job in the mirror list should be blamed; all that
checking should be left to the driver.
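For illustration, a minimal sketch of what that notification-only variant could look like (hypothetical, not upstream code; note it deliberately leaves the job on the mirror list, which is exactly the lifetime question the current list_del_init() dance works around):

static void drm_sched_job_timedout(struct work_struct *work)
{
        struct drm_gpu_scheduler *sched;
        struct drm_sched_job *job;

        sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);

        /* Peek at the head job only; do NOT unlink it here.
         * Judging who is guilty and removing the job from the
         * mirror list becomes the driver's responsibility.
         */
        spin_lock(&sched->job_list_lock);
        job = list_first_entry_or_null(&sched->ring_mirror_list,
                                       struct drm_sched_job, node);
        spin_unlock(&sched->job_list_lock);

        if (job)
                job->sched->ops->timedout_job(job);
}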
I need to take a detailed look, but it sounds correct as well.
>>If we do this we should probably make it configurable as a module
parameter.
That's OK. Maybe we can reuse the existing parameter “gpu_recovery” and
extend it with:
0 – no recovery initiated after job timeout
1 – legacy TDR behavior
2 – enhanced TDR behavior (the one suggested here)
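For example, a sketch of how that could look on top of the existing amdgpu parameter (the value 2 is the hypothetical new mode; the current upstream values are only -1/0/1):

/* 0 = no recovery after job timeout, 1 = legacy TDR,
 * 2 = enhanced TDR as proposed here, -1 = auto
 */
int amdgpu_gpu_recovery = -1;
module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
MODULE_PARM_DESC(gpu_recovery,
                 "Enable GPU recovery mechanism (2 = enhanced TDR, 1 = legacy, 0 = disabled, -1 = auto)");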
Yes, something like that should work. The key point is that a couple of
people have already suggested optimizing the reset routine so that it
doesn't take so long.
So far I have pushed back on this because the reset routine isn't something I
would optimize for speed. But when it starts to take something like 10
seconds instead of half a second because you had an extra long-running
compute job, we will certainly see complaints.
Regards,
Christian.
*From:* Koenig, Christian <christian.koe...@amd.com>
*Sent:* Friday, February 26, 2021 8:05 PM
*To:* Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org
*Cc:* Zhang, Andy <andy.zh...@amd.com>; Chen, Horace
<horace.c...@amd.com>; Zhang, Jack (Jian) <jack.zha...@amd.com>
*Subject:* Re: [RFC] a new approach to detect which ring is the real black
sheep upon TDR reported
Yeah, that is exactly the stuff which doesn't work at all. We got
feedback from multiple people that this whole approach of tying the job
to the TDR was not a good idea at all.
What we should do instead is have a pointer in the scheduler fence
to the job it belongs to. Freeing up the job when the scheduler fence
is signaled is then the job of the driver and not the scheduler any more.
The scheduler then hands the scheduler fence to the driver when a
timeout is detected, and the driver can do the rest of the handling
by itself.
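As a rough sketch of that ownership change (the "job" member and the timedout_fence() callback below are hypothetical names, only meant to illustrate the idea):

/* HYPOTHETICAL: give the scheduler fence a back-pointer to its job. */
struct drm_sched_fence {
        /* ... existing members (scheduled/finished fences, ...) ... */
        struct drm_sched_job *job;      /* the job this fence belongs to */
};

/* On a timeout the scheduler reports only the fence: */
sched->ops->timedout_fence(fence);

/* The driver resolves fence->job, does the recovery, and frees the
 * job once the fence signals; the scheduler never frees jobs itself.
 */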
But this problem is orthogonal to the suggested solution here.
Do you have a better solution or idea that we could review as another
candidate RFC?
I don't see many other options either. We could do something like only
allowing one application at a time to use the gfx/compute block, but
that would be even worse.
If we do this we should probably make it configurable as a module
parameter.
Regards,
Christian.
On 26.02.21 at 12:57, Liu, Monk wrote:
[AMD Official Use Only - Internal Distribution Only]
static void drm_sched_job_timedout(struct work_struct *work)
{
        struct drm_gpu_scheduler *sched;
        struct drm_sched_job *job;

        sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);

        /* Protects against concurrent deletion in drm_sched_get_cleanup_job */
        spin_lock(&sched->job_list_lock);
        job = list_first_entry_or_null(&sched->ring_mirror_list,
                                       struct drm_sched_job, node);

        if (job) {
                /*
                 * Remove the bad job so it cannot be freed by concurrent
                 * drm_sched_cleanup_jobs. It will be reinserted back after
                 * sched->thread is parked at which point it's safe.
                 */
                list_del_init(&job->node);
                spin_unlock(&sched->job_list_lock);

                job->sched->ops->timedout_job(job);
Thanks
------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------
*From:* Liu, Monk
*Sent:* Friday, February 26, 2021 7:54 PM
*To:* Koenig, Christian <christian.koe...@amd.com>;
amd-gfx@lists.freedesktop.org
*Cc:* Zhang, Andy <andy.zh...@amd.com>; Chen, Horace
<horace.c...@amd.com>; Zhang, Jack (Jian) <jack.zha...@amd.com>
*Subject:* RE: [RFC] a new approach to detect which ring is the
real black sheep upon TDR reported
[AMD Official Use Only - Internal Distribution Only]
See inline.
Thanks
------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------
*From:* Koenig, Christian <christian.koe...@amd.com>
*Sent:* Friday, February 26, 2021 3:58 PM
*To:* Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org
*Cc:* Zhang, Andy <andy.zh...@amd.com>; Chen, Horace
<horace.c...@amd.com>; Zhang, Jack (Jian) <jack.zha...@amd.com>
*Subject:* Re: [RFC] a new approach to detect which ring is the
real black sheep upon TDR reported
Hi Monk,
in general an interesting idea, but I see two major problems with
that:
1. It would make the reset take much longer.
2. Things often get stuck because of timing issues, so a guilty
job might pass perfectly when run a second time.
[ML] But the innocent ring already reported a TDR, and the drm
sched logic already deleted this “sched_job” from its mirror list,
so you have no chance to re-submit it after reset;
that's the major problem here.
Apart from that, the whole ring mirror list turned out to be a
really bad idea. E.g. we still struggle with object lifetime
because the concept doesn't fit into the object model of the GPU
scheduler under Linux.
We should probably work on this separately and straighten up the
job destruction once more and keep the recovery information in the
fence instead.
[ML] We promise our customers that no innocent process will be
dropped or cancelled, and our current logic works most of the
time; it is only when different processes are running on the
gfx/compute rings that we run into the tricky situation I
stated here, and the proposal is the only way I can figure out so
far. Do you have a better solution or idea that we could review as
another candidate RFC? Note that we raised this proposal because we
do hit this trouble and we do need to resolve it …. So even an
imperfect solution is still better than just cancelling the innocent job
(and its context/process).
Thanks!
Regards,
Christian.
On 26.02.21 at 06:58, Liu, Monk wrote:
[AMD Public Use]
Hi all
The NAVI2X project has hit a really hard-to-solve issue, and it
turned out to be a general headache of our TDR mechanism;
check the scenario below:
1. There is a job1 running on the compute1 ring at a given timestamp
2. There is a job2 running on the gfx ring at the same timestamp
3. Job1 is the guilty one, and job1/job2 were scheduled to
their rings at almost the same timestamp
4. After 2 seconds we receive two TDR reports, from both the GFX
ring and the compute ring
5. *The current scheme is that in the drm scheduler all the head jobs
of those two rings are considered “bad jobs” and taken away
from the mirror list*
6. The result is that both the real guilty job (job1) and the
innocent job (job2) are deleted from the mirror list, and
their corresponding contexts are also treated as
guilty *(so keeping the innocent process running is not
assured)*
But what we wish for, the ideal case, is that the TDR mechanism can
detect which ring is the guilty one so the innocent ring can
resubmit all its pending jobs:
1. Job1 is deleted from the compute1 ring's mirror list
2. Job2 is kept and resubmitted later, and its owning
process/context is not even aware of this TDR at all
Here I have a proposal intended to achieve the above goal; its rough
procedure is:
1. Once any ring reports a TDR, the head job is **not**
treated as a “bad job”, and it is **not** deleted from the
mirror list in the drm sched functions
2. In the vendor's function (our amdgpu driver here):
    * reset the GPU
    * repeat the actions below on each ring, *one by one*:
      1. take the head job and submit it on this ring
      2. see if it completes; if not, then this job is the real “bad job”
      3. take it away from the mirror list if this head job is a “bad job”
    * after the above iteration over all rings, we have already
      cleared all the bad job(s)
3. Resubmit all jobs from each mirror list to their
corresponding rings (this is the existing logic)
The idea is to re-run and re-check the head job of each ring in a
“serial” way, in order to take out the real
black sheep and its guilty context.
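A rough sketch of how that serial re-check loop could sit in our amdgpu timeout handling (every name below – amdgpu_gpu_reset(), peek_head_job(), resubmit_head_job(), wait_head_job(), mark_guilty() – is a hypothetical placeholder for the real reset/resubmit/fence-wait paths, not an existing function):

static void amdgpu_tdr_serial_recheck(struct amdgpu_device *adev)
{
        int i;

        amdgpu_gpu_reset(adev);         /* step 1: reset once up front */

        /* Re-run the head job of every ring, one ring at a time. */
        for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
                struct amdgpu_ring *ring = adev->rings[i];
                struct drm_sched_job *job;

                if (!ring)
                        continue;

                job = peek_head_job(&ring->sched);
                if (!job)
                        continue;

                resubmit_head_job(ring, job);           /* run it alone */
                if (!wait_head_job(ring, job)) {        /* did it complete? */
                        /* the real black sheep: drop it and blame its context */
                        mark_guilty(job);
                        list_del_init(&job->node);
                        amdgpu_gpu_reset(adev);         /* clean slate for next ring */
                }
        }

        /* Finally resubmit all remaining jobs from each mirror list
         * to their rings – the existing recovery logic.
         */
}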
P.S.: we can use this approach only when a GFX/KCQ ring reports a
TDR, since those rings mutually affect each
other. For an SDMA ring timeout, it definitely proves that the head job
on the SDMA ring is really guilty.
Thanks
------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx