amdgpu: Avoid HW reset if guilty job already signaled.

Christian König Mon, 29 Apr 2019 12:05:09 -0700

I would clean them up further, but that's only moving code around sofeel free to add my rb to those.


Christian.


Am 29.04.19 um 16:14 schrieb Grodzovsky, Andrey:

Thanks David, with that only patches 5 and 6 are left for the seriesto be reviewed.


Christian, any more comments on those patches ?

Andrey

On 4/27/19 10:56 PM, Zhou, David(ChunMing) wrote:

Sorry, I only can put my Acked-by: Chunming Zhou<david1.z...@amd.com> on patch#3.


I cannot fully judge patch #4, #5, #6.

-David

*From:*amd-gfx <amd-gfx-boun...@lists.freedesktop.org> *On Behalf Of*Grodzovsky, Andrey

*Sent:* Friday, April 26, 2019 10:09 PM

*To:* Koenig, Christian <christian.koe...@amd.com>; Zhou,David(ChunMing) <david1.z...@amd.com>;dri-devel@lists.freedesktop.org; amd-...@lists.freedesktop.org;e...@anholt.net; etna...@lists.freedesktop.org*Cc:* Kazlauskas, Nicholas <nicholas.kazlaus...@amd.com>; Liu, Monk<monk....@amd.com>*Subject:* Re: [PATCH v5 6/6] drm/amdgpu: Avoid HW reset if guiltyjob already signaled.


Ping (mostly David and Monk).

Andrey

On 4/24/19 3:09 AM, Christian König wrote:

    Am 24.04.19 um 05:02 schrieb Zhou, David(ChunMing):

        >> - drm_sched_stop(&ring->sched, &job->base);
        >> -
        >>               /* after all hw jobs are reset, hw fence is
        meaningless, so force_completion */
        >> amdgpu_fence_driver_force_completion(ring);
        >>       }

        HW fence are already forced completion, then we can just
        disable irq fence process and ignore hw fence signal when we
        are trying to do GPU reset, I think. Otherwise which will
        make the logic much more complex.

        If this situation happens because of long time execution, we
        can increase timeout of reset detection.


    You are not thinking widely enough, forcing the hw fence to
    complete can trigger other to start other activity in the system.

    We first need to stop everything and make sure that we don't do
    any processing any more and then start with our reset procedure
    including forcing all hw fences to complete.

    Christian.


        -David

        *From:*amd-gfx <amd-gfx-boun...@lists.freedesktop.org>
        <mailto:amd-gfx-boun...@lists.freedesktop.org> *On Behalf Of
        *Grodzovsky, Andrey
        *Sent:* Wednesday, April 24, 2019 12:00 AM
        *To:* Zhou, David(ChunMing) <david1.z...@amd.com>
        <mailto:david1.z...@amd.com>; dri-devel@lists.freedesktop.org
        <mailto:dri-devel@lists.freedesktop.org>;
        amd-...@lists.freedesktop.org
        <mailto:amd-...@lists.freedesktop.org>; e...@anholt.net
        <mailto:e...@anholt.net>; etna...@lists.freedesktop.org
        <mailto:etna...@lists.freedesktop.org>;
        ckoenig.leichtzumer...@gmail.com
        <mailto:ckoenig.leichtzumer...@gmail.com>
        *Cc:* Kazlauskas, Nicholas <nicholas.kazlaus...@amd.com>
        <mailto:nicholas.kazlaus...@amd.com>; Liu, Monk
        <monk....@amd.com> <mailto:monk....@amd.com>
        *Subject:* Re: [PATCH v5 6/6] drm/amdgpu: Avoid HW reset if
        guilty job already signaled.

        No, i mean the actual HW fence which signals when the job
        finished execution on the HW.

        Andrey

        On 4/23/19 11:19 AM, Zhou, David(ChunMing) wrote:

            do you mean fence timer? why not stop it as well when
            stopping sched for the reason of hw reset?

            -------- Original Message --------
            Subject: Re: [PATCH v5 6/6] drm/amdgpu: Avoid HW reset if
            guilty job already signaled.
            From: "Grodzovsky, Andrey"
            To: "Zhou, David(ChunMing)"
            
,dri-devel@lists.freedesktop.org,amd-...@lists.freedesktop.org,e...@anholt.net,etna...@lists.freedesktop.org,ckoenig.leichtzumer...@gmail.com
            
<mailto:dri-devel@lists.freedesktop.org,amd-...@lists.freedesktop.org,e...@anholt.net,etna...@lists.freedesktop.org,ckoenig.leichtzumer...@gmail.com>
            CC: "Kazlauskas, Nicholas" ,"Liu, Monk"


            On 4/22/19 9:09 AM, Zhou, David(ChunMing) wrote:
            > +Monk.
            >
            > GPU reset is used widely in SRIOV, so need
            virtulizatino guy take a look.
            >
            > But out of curious, why guilty job can signal more if
            the job is already
            > set to guilty? set it wrongly?
            >
            >
            > -David


            It's possible that the job does completes at a later time
            then it's
            timeout handler started processing so in this patch we
            try to protect
            against this by rechecking the HW fence after stopping
            all SW
            schedulers. We do it BEFORE marking guilty on the job's
            sched_entity so
            at the point we check the guilty flag is not set yet.

            Andrey


            >
            > 在 2019/4/18 23:00, Andrey Grodzovsky 写道:
            >> Also reject TDRs if another one already running.
            >>
            >> v2:
            >> Stop all schedulers across device and entire XGMI hive
            before
            >> force signaling HW fences.
            >> Avoid passing job_signaled to helper fnctions to keep
            all the decision
            >> making about skipping HW reset in one place.
            >>
            >> v3:
            >> Fix SW sched. hang after non HW reset.
            sched.hw_rq_count has to be balanced
            >> against it's decrement in drm_sched_stop in non HW
            reset case.
            >> v4: rebase
            >> v5: Revert v3 as we do it now in sceduler code.
            >>
            >> Signed-off-by: Andrey Grodzovsky
            <andrey.grodzov...@amd.com>
            <mailto:andrey.grodzov...@amd.com>
            >> ---
            >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 143
            +++++++++++++++++++----------
            >>    1 file changed, 95 insertions(+), 48 deletions(-)
            >>
            >> diff --git
            a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
            b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
            >> index a0e165c..85f8792 100644
            >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
            >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
            >> @@ -3334,8 +3334,6 @@ static int
            amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
            >>               if (!ring || !ring->sched.thread)
            >>                       continue;
            >>
            >> - drm_sched_stop(&ring->sched, &job->base);
            >> -
            >>               /* after all hw jobs are reset, hw fence
            is meaningless, so force_completion */
            >> amdgpu_fence_driver_force_completion(ring);
            >>       }
            >> @@ -3343,6 +3341,7 @@ static int
            amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
            >>       if(job)
            >> drm_sched_increase_karma(&job->base);
            >>
            >> +    /* Don't suspend on bare metal if we are not
            going to HW reset the ASIC */
            >>       if (!amdgpu_sriov_vf(adev)) {
            >>
            >>               if (!need_full_reset)
            >> @@ -3480,37 +3479,21 @@ static int
            amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
            >>       return r;
            >>    }
            >>
            >> -static void amdgpu_device_post_asic_reset(struct
            amdgpu_device *adev)
            >> +static bool amdgpu_device_lock_adev(struct
            amdgpu_device *adev, bool trylock)
            >>    {
            >> -    int i;
            >> -
            >> -    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
            >> -            struct amdgpu_ring *ring = adev->rings[i];
            >> -
            >> -            if (!ring || !ring->sched.thread)
            >> -                    continue;
            >> -
            >> -            if (!adev->asic_reset_res)
            >> - drm_sched_resubmit_jobs(&ring->sched);
            >> +    if (trylock) {
            >> +            if (!mutex_trylock(&adev->lock_reset))
            >> +                    return false;
            >> +    } else
            >> + mutex_lock(&adev->lock_reset);
            >>
            >> - drm_sched_start(&ring->sched, !adev->asic_reset_res);
            >> -    }
            >> -
            >> -    if (!amdgpu_device_has_dc_support(adev)) {
            >> - drm_helper_resume_force_mode(adev->ddev);
            >> -    }
            >> -
            >> -    adev->asic_reset_res = 0;
            >> -}
            >> -
            >> -static void amdgpu_device_lock_adev(struct
            amdgpu_device *adev)
            >> -{
            >> - mutex_lock(&adev->lock_reset);
            >> atomic_inc(&adev->gpu_reset_counter);
            >>       adev->in_gpu_reset = 1;
            >>       /* Block kfd: SRIOV would do it separately */
            >>       if (!amdgpu_sriov_vf(adev))
            >> amdgpu_amdkfd_pre_reset(adev);
            >> +
            >> +    return true;
            >>    }
            >>
            >>    static void amdgpu_device_unlock_adev(struct
            amdgpu_device *adev)
            >> @@ -3538,40 +3521,42 @@ static void
            amdgpu_device_unlock_adev(struct amdgpu_device *adev)
            >>    int amdgpu_device_gpu_recover(struct amdgpu_device
            *adev,
            >>                             struct amdgpu_job *job)
            >>    {
            >> -    int r;
            >> +    struct list_head device_list, *device_list_handle
            =  NULL;
            >> +    bool need_full_reset, job_signaled;
            >>       struct amdgpu_hive_info *hive = NULL;
            >> -    bool need_full_reset = false;
            >>       struct amdgpu_device *tmp_adev = NULL;
            >> -    struct list_head device_list, *device_list_handle
            =  NULL;
            >> +    int i, r = 0;
            >>
            >> +    need_full_reset = job_signaled = false;
            >>       INIT_LIST_HEAD(&device_list);
            >>
            >>       dev_info(adev->dev, "GPU reset begin!\n");
            >>
            >> +    hive = amdgpu_get_xgmi_hive(adev, false);
            >> +
            >>       /*
            >> -     * In case of XGMI hive disallow concurrent
            resets to be triggered
            >> -     * by different nodes. No point also since the
            one node already executing
            >> -     * reset will also reset all the other nodes in
            the hive.
            >> +     * Here we trylock to avoid chain of resets
            executing from
            >> +     * either trigger by jobs on different adevs in
            XGMI hive or jobs on
            >> +     * different schedulers for same device while
            this TO handler is running.
            >> +     * We always reset all schedulers for device and
            all devices for XGMI
            >> +     * hive so that should take care of them too.
            >>        */
            >> -    hive = amdgpu_get_xgmi_hive(adev, 0);
            >> -    if (hive && adev->gmc.xgmi.num_physical_nodes > 1 &&
            >> - !mutex_trylock(&hive->reset_lock))
            >> +
            >> +    if (hive && !mutex_trylock(&hive->reset_lock)) {
            >> +            DRM_INFO("Bailing on TDR for s_job:%llx,
            hive: %llx as another already in progress",
            >> +                     job->base.id, hive->hive_id);
            >>               return 0;
            >> +    }
            >>
            >>       /* Start with adev pre asic reset first for soft
            reset check.*/
            >> -    amdgpu_device_lock_adev(adev);
            >> -    r = amdgpu_device_pre_asic_reset(adev,
            >> - job,
            >> - &need_full_reset);
            >> -    if (r) {
            >> -            /*TODO Should we stop ?*/
            >> -            DRM_ERROR("GPU pre asic reset failed with
            err, %d for drm dev, %s ",
            >> -                      r, adev->ddev->unique);
            >> -            adev->asic_reset_res = r;
            >> +    if (!amdgpu_device_lock_adev(adev, !hive)) {
            >> +            DRM_INFO("Bailing on TDR for s_job:%llx,
            as another already in progress",
            >> + job->base.id);
            >> +            return 0;
            >>       }
            >>
            >>       /* Build list of devices to reset */
            >> -    if  (need_full_reset &&
            adev->gmc.xgmi.num_physical_nodes > 1) {
            >> +    if (adev->gmc.xgmi.num_physical_nodes > 1) {
            >>               if (!hive) {
            >> amdgpu_device_unlock_adev(adev);
            >>                       return -ENODEV;
            >> @@ -3588,13 +3573,56 @@ int
            amdgpu_device_gpu_recover(struct amdgpu_device *adev,
            >>               device_list_handle = &device_list;
            >>       }
            >>
            >> +    /* block all schedulers and reset given job's ring */
            >> +    list_for_each_entry(tmp_adev, device_list_handle,
            gmc.xgmi.head) {
            >> +            for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
            >> +                    struct amdgpu_ring *ring =
            tmp_adev->rings[i];
            >> +
            >> +                    if (!ring || !ring->sched.thread)
            >> +                            continue;
            >> +
            >> + drm_sched_stop(&ring->sched, &job->base);
            >> +            }
            >> +    }
            >> +
            >> +
            >> +    /*
            >> +     * Must check guilty signal here since after this
            point all old
            >> +     * HW fences are force signaled.
            >> +     *
            >> +     * job->base holds a reference to parent fence
            >> +     */
            >> +    if (job && job->base.s_fence->parent &&
            >> + dma_fence_is_signaled(job->base.s_fence->parent))
            >> +            job_signaled = true;
            >> +
            >> +    if (!amdgpu_device_ip_need_full_reset(adev))
            >> +            device_list_handle = &device_list;
            >> +
            >> +    if (job_signaled) {
            >> +            dev_info(adev->dev, "Guilty job already
            signaled, skipping HW reset");
            >> +            goto skip_hw_reset;
            >> +    }
            >> +
            >> +
            >> +    /* Guilty job will be freed after this*/
            >> +    r = amdgpu_device_pre_asic_reset(adev,
            >> + job,
            >> + &need_full_reset);
            >> +    if (r) {
            >> +            /*TODO Should we stop ?*/
            >> +            DRM_ERROR("GPU pre asic reset failed with
            err, %d for drm dev, %s ",
            >> +                      r, adev->ddev->unique);
            >> +            adev->asic_reset_res = r;
            >> +    }
            >> +
            >>    retry:    /* Rest of adevs pre asic reset from XGMI
            hive. */
            >>       list_for_each_entry(tmp_adev,
            device_list_handle, gmc.xgmi.head) {
            >>
            >>               if (tmp_adev == adev)
            >>                       continue;
            >>
            >> - amdgpu_device_lock_adev(tmp_adev);
            >> + amdgpu_device_lock_adev(tmp_adev, false);
            >>               r = amdgpu_device_pre_asic_reset(tmp_adev,
            >>                                                NULL,
            >> &need_full_reset);
            >> @@ -3618,9 +3646,28 @@ int
            amdgpu_device_gpu_recover(struct amdgpu_device *adev,
            >>                       goto retry;
            >>       }
            >>
            >> +skip_hw_reset:
            >> +
            >>       /* Post ASIC reset for all devs .*/
            >>       list_for_each_entry(tmp_adev,
            device_list_handle, gmc.xgmi.head) {
            >> - amdgpu_device_post_asic_reset(tmp_adev);
            >> +            for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
            >> +                    struct amdgpu_ring *ring =
            tmp_adev->rings[i];
            >> +
            >> +                    if (!ring || !ring->sched.thread)
            >> +                            continue;
            >> +
            >> +                    /* No point to resubmit jobs if
            we didn't HW reset*/
            >> +                    if (!tmp_adev->asic_reset_res &&
            !job_signaled)
            >> + drm_sched_resubmit_jobs(&ring->sched);
            >> +
            >> + drm_sched_start(&ring->sched,
            !tmp_adev->asic_reset_res);
            >> +            }
            >> +
            >> +            if
            (!amdgpu_device_has_dc_support(tmp_adev) && !job_signaled) {
            >> + drm_helper_resume_force_mode(tmp_adev->ddev);
            >> +            }
            >> +
            >> +            tmp_adev->asic_reset_res = 0;
            >>
            >>               if (r) {
            >>                       /* bad news, how to tell it to
            userspace ? */
            >> @@ -3633,7 +3680,7 @@ int
            amdgpu_device_gpu_recover(struct amdgpu_device *adev,
            >> amdgpu_device_unlock_adev(tmp_adev);
            >>       }
            >>
            >> -    if (hive && adev->gmc.xgmi.num_physical_nodes > 1)
            >> +    if (hive)
            >> mutex_unlock(&hive->reset_lock);
            >>
            >>       if (r)
            _______________________________________________
            amd-gfx mailing list
            amd-...@lists.freedesktop.org
            <mailto:amd-...@lists.freedesktop.org>
            https://lists.freedesktop.org/mailman/listinfo/amd-gfx


_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH v5 6/6] drm/amdgpu: Avoid HW reset if guilty job already signaled.

Reply via email to