[AMD Public Use]

Hi, Christian,

I made some changes in the V3 patch that insert a dma_fence_wait for the first
jobs after resubmitting the jobs.
It seems simpler than the V2 patch. Is this what you originally had in mind?
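
The gist of it, per ring, is roughly the following. This is only a sketch of the
idea, not the actual V3 diff; the flow and the variable names (ring, tmp_adev,
s_job, ret) just follow the V2 patch:

	struct drm_sched_job *s_job;
	long ret;

	/* resubmit, then synchronously wait on the first job's hw fence
	 * before restarting the scheduler
	 */
	drm_sched_resubmit_jobs(&ring->sched);

	s_job = list_first_entry_or_null(&ring->sched.ring_mirror_list,
					 struct drm_sched_job, node);
	if (s_job && s_job->s_fence->parent) {
		ret = dma_fence_wait_timeout(s_job->s_fence->parent, false,
					     ring->sched.timeout);
		if (ret == 0)
			/* the first job timed out -> real guilty one */
			drm_sched_increase_karma(s_job);
	}

	drm_sched_start(&ring->sched, !tmp_adev->asic_reset_res);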

Thanks,
Jack

-----Original Message-----
From: Koenig, Christian <christian.koe...@amd.com> 
Sent: Monday, March 8, 2021 3:53 PM
To: Liu, Monk <monk....@amd.com>; Zhang, Jack (Jian) <jack.zha...@amd.com>; 
amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey <andrey.grodzov...@amd.com>; 
Deng, Emily <emily.d...@amd.com>
Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode



On 08.03.21 at 05:06, Liu, Monk wrote:
> [AMD Official Use Only - Internal Distribution Only]
>
>>> well first of all please completely drop the affinity group stuff from this 
>>> patch. We should concentrate on one feature at a time.
> We need it to expedite the process; we can introduce this change in 
> another patch.
>
>
>>> Then the implementation is way too complicated. All you need to do is insert 
>>> a dma_fence_wait after re-scheduling each job after a reset.
> No, that's not true. During "drm_sched_resubmit_jobs" it will put 
> all jobs in the mirror list into the hw ring, but we can only allow the 
> first job into the ring to catch the real guilty one (otherwise it is possible 
> that a later job in the ring also has a bug and affects our judgement). So we 
> need to implement a new "drm_sched_resubmit_jobs2()", along these lines:

Something like this. But since waiting for the guilty job is AMD specific we 
should rather rework the stuff from the beginning.

What I have in mind is the following:
1. Add a reference from the scheduler fence back to the job which is cleared 
only when the scheduler fence finishes.
2. Completely drop the ring_mirror_list and replace it with a kfifo of pointers 
to the active scheduler fences.
3. Replace drm_sched_resubmit_jobs with a drm_sched_for_each_active() macro 
which allows drivers to iterate over all the active jobs and resubmit, wait for 
or mark them as guilty etc. (rough sketch below).
4. Remove the guilty/karma handling from the scheduler. This is something AMD 
specific and shouldn't leak into common code.
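
Very roughly something like the following. This is completely untested and all
the new names (the job back-reference, the "active" kfifo, the macro and its
arguments) are just made up to illustrate the direction:

/* 1. + 2.: the scheduler fence keeps its job alive, the scheduler keeps a
 * kfifo of the active scheduler fences instead of the ring_mirror_list
 */
struct drm_sched_fence {
	/* ...existing members... */
	struct drm_sched_job	*job;	/* dropped when the fence finishes */
};

struct drm_gpu_scheduler {
	/* ...existing members... */
	DECLARE_KFIFO_PTR(active, struct drm_sched_fence *);
};

/* 3.: drivers walk the active fences themselves.  Iterating by rotating the
 * fifo is only for illustration: nothing else may touch it concurrently and
 * the body must not break out early, a real version needs something better.
 */
#define drm_sched_for_each_active(sched, s_fence, i)			\
	for ((i) = kfifo_len(&(sched)->active);				\
	     (i)-- > 0 && kfifo_get(&(sched)->active, &(s_fence));	\
	     kfifo_put(&(sched)->active, (s_fence)))

/* amdgpu would then do something like */
	struct drm_sched_fence *s_fence;
	unsigned int i;

	drm_sched_for_each_active(&ring->sched, s_fence, i) {
		/* resubmit, wait, mark the job as guilty, ... */
	}

The guilty/karma handling can then live entirely in amdgpu; the scheduler only
provides the iteration.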

Regards,
Christian.

>
> drm_sched_resubmit_jobs2()
>
>   void drm_sched_resubmit_jobs2(struct drm_gpu_scheduler *sched, int max)
>   {
>       struct drm_sched_job *s_job, *tmp;
>       uint64_t guilty_context;
>       bool found_guilty = false;
>       struct dma_fence *fence;
> +     int i = 0;
>
>       list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>           struct drm_sched_fence *s_fence = s_job->s_fence;
>
> +         if (i >= max)
> +             break;
> +
>           if (!found_guilty && atomic_read(&s_job->karma) > sched->hang_limit) {
>               found_guilty = true;
>               guilty_context = s_job->s_fence->scheduled.context;
>           }
>
>           if (found_guilty && s_job->s_fence->scheduled.context == guilty_context)
>               dma_fence_set_error(&s_fence->finished, -ECANCELED);
>
>           dma_fence_put(s_job->s_fence->parent);
>           fence = sched->ops->run_job(s_job);
> +         i++;
>
>           if (IS_ERR_OR_NULL(fence)) {
>               if (IS_ERR(fence))
>                   dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
>
>               s_job->s_fence->parent = NULL;
>           } else {
>               s_job->s_fence->parent = fence;
>           }
>       }
>   }
>   EXPORT_SYMBOL(drm_sched_resubmit_jobs2);
>
>
>
>
> Thanks
>
> ------------------------------------------
> Monk Liu | Cloud-GPU Core team
> ------------------------------------------
>
> -----Original Message-----
> From: Koenig, Christian <christian.koe...@amd.com>
> Sent: Sunday, March 7, 2021 3:03 AM
> To: Zhang, Jack (Jian) <jack.zha...@amd.com>; 
> amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey 
> <andrey.grodzov...@amd.com>; Liu, Monk <monk....@amd.com>; Deng, Emily 
> <emily.d...@amd.com>
> Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode
>
> Hi Jack,
>
> well first of all please completely drop the affinity group stuff from this 
> patch. We should concentrate on one feature at a time.
>
> Then the implementation is way too complicated. All you need to do is insert a 
> dma_fence_wait after re-scheduling each job after a reset.
>
> In addition to that, this feature is completely AMD specific and shouldn't 
> affect the common scheduler in any way.
>
> Regards,
> Christian.
>
> On 06.03.21 at 18:25, Jack Zhang wrote:
>> [Why]
>> The previous tdr design treats the first job in job_timeout as the bad job.
>> But sometimes a later bad compute job can block a good gfx job and 
>> cause an unexpected gfx job timeout because the gfx and compute rings 
>> share internal GC hardware.
>>
>> [How]
>> This patch implements an advanced tdr mode. It involves an additional 
>> synchronous pre-resubmit step (Step0 Resubmit) before the normal resubmit 
>> step in order to find the real bad job.
>>
>> 1. For a bailing TDR job, re-insert it into the mirror_list; don't set it 
>> as guilty and leave it to be handled by the main reset thread.
>>
>> 2. Don't set the job as guilty in pre_asic_reset; leave it to be 
>> handled by the Step0 Resubmit stage.
>>
>> 3. At the Step0 Resubmit stage, jobs are first resubmitted asynchronously; 
>> then each ring's mirror_list is iterated, synchronously waiting for each hw 
>> fence to be signaled. If a job's hw fence times out, we identify it as 
>> guilty and do a hw reset to recover the hw. After that, we do the normal 
>> resubmit step to resubmit the remaining jobs.
>>
>> 4. For a whole gpu reset (vram lost), skip Step0 Resubmit as every job 
>> after vram loss is considered a bad job.
>>
>> 5. Introduce the concept of an "Affinity Group".
>> Doing two hw resets is not necessary when only one ring among a set of 
>> hw-related rings has jobs. Thus, we introduce the "affinity group".
>> Hw-related rings can be added to a common affinity group, such as the 
>> gfx and compute rings. When tdr happens, we iterate over all rings in the 
>> affinity group and skip the Step0 Resubmit stage if only one ring's 
>> mirror_list has valid sched jobs.
>>
>> V2:
>>       -fix a cherry-pick mistake for bailing TDR handling.
>>
>>       -do the affinity_group check according to the bad job's sched rather
>>        than the default "1" so that multiple affinity groups could be
>>        pre-defined in the future.
>>
>> Signed-off-by: Jack Zhang <jack.zha...@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 102 +++++++++++++++++++--
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   2 +-
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  47 ++++++++++
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_job.h    |   2 +-
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c   |  27 ++++++
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   1 +
>>    include/drm/gpu_scheduler.h                |   1 +
>>    7 files changed, 173 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index e247c3a2ec08..8632d7071292 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4188,6 +4188,37 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev)
>>      return false;
>>    }
>>    
>> +bool amdgpu_affinity_group_has_only_or_null_working_ring(struct amdgpu_device *adev, struct drm_sched_job *s_job) {
>> +       int i;
>> +       int working_ring_num = 0;
>> +
>> +    /*
>> +     * The job is considered as the real bad one
>> +     * if job's sched is not in affinity group
>> +     */
>> +    if (s_job->sched.affinity_group == 0)
>> +                    return true;
>> +
>> +       for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>> +               struct amdgpu_ring *ring = adev->rings[i];
>> +
>> +               if (!ring || !ring->sched.thread)
>> +                       continue;
>> +
>> +               /* for non-empty affinity ring, increase working_ring_num */
>> +               if (ring->sched.affinity_group == s_job->sched.affinity_group) {
>> +                       if (!list_empty(&ring->sched.ring_mirror_list))
>> +                               working_ring_num++;
>> +               }
>> +       }
>> +
>> +       if (working_ring_num > 1) {
>> +               return false;
>> +       }
>> +       return true;
>> +}
>> +
>>    /**
>>     * amdgpu_device_should_recover_gpu - check if we should try GPU recovery
>>     *
>> @@ -4310,8 +4341,10 @@ static int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>              amdgpu_fence_driver_force_completion(ring);
>>      }
>>    
>> -    if(job)
>> -            drm_sched_increase_karma(&job->base);
>> +    if (amdgpu_gpu_recovery != 2) {
>> +            if (job)
>> +                    drm_sched_increase_karma(&job->base);
>> +    }
>>    
>>      /* Don't suspend on bare metal if we are not going to HW reset the ASIC */
>>      if (!amdgpu_sriov_vf(adev)) {
>> @@ -4639,7 +4672,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>      int i, r = 0;
>>      bool need_emergency_restart = false;
>>      bool audio_suspended = false;
>> -
>> +    int     tmp_vram_lost_counter;
>>      /*
>>       * Special case: RAS triggered and full reset isn't supported
>>       */
>> @@ -4690,8 +4723,16 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>                                      job ? job->base.id : -1);
>>    
>>              /* even we skipped this reset, still need to set the job to guilty */
>> -            if (job)
>> -                    drm_sched_increase_karma(&job->base);
>> +            if (job) {
>> +                    if (amdgpu_gpu_recovery == 2) {
>> +                            if (&job->base) {
>> +                                    spin_lock(&job->base.sched->job_list_lock);
>> +                                    list_add(&job->base.node, &job->base.sched->ring_mirror_list);
>> +                                    spin_unlock(&job->base.sched->job_list_lock);
>> +                            }
>> +                    } else
>> +                            drm_sched_increase_karma(&job->base);
>> +            }
>>              goto skip_recovery;
>>      }
>>    
>> @@ -4788,6 +4829,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>              }
>>      }
>>    
>> +    tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));
>>      /* Actual ASIC resets if needed.*/
>>      /* TODO Implement XGMI hive reset logic for SRIOV */
>>      if (amdgpu_sriov_vf(adev)) {
>> @@ -4804,18 +4846,64 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>    
>>      /* Post ASIC reset for all devs .*/
>>      list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
>> +            int step = 1;
>>    
>> +            if (amdgpu_gpu_recovery == 2) {
>> +                    if (amdgpu_affinity_group_has_only_or_null_working_ring(adev,&job->base)
>> +                            || tmp_vram_lost_counter < atomic_read(&adev->vram_lost_counter)) {
>> +                            DRM_INFO("Skip Stage0 Resubmit Stage\n");
>> +                            /* set guilty */
>> +                            drm_sched_increase_karma(&job->base);
>> +                            step = 1;
>> +                    } else {
>> +                            DRM_INFO("Do Stage0 Resubmit Stage\n");
>> +                            step = 0;
>> +                    }
>> +            }
>> +
>> +retry_resubmit:
>>              for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>                      struct amdgpu_ring *ring = tmp_adev->rings[i];
>> +                    int ret = 0;
>> +                    struct drm_sched_job *s_bad_job = NULL;
>>    
>>                      if (!ring || !ring->sched.thread)
>>                              continue;
>>    
>>                      /* No point to resubmit jobs if we didn't HW reset*/
>> -                    if (!tmp_adev->asic_reset_res && !job_signaled)
>> +                    if (!tmp_adev->asic_reset_res && !job_signaled) {
>> +
>>                              drm_sched_resubmit_jobs(&ring->sched);
>>    
>> -                    drm_sched_start(&ring->sched, !tmp_adev->asic_reset_res);
>> +                            if (amdgpu_gpu_recovery == 2 && step == 0) {
>> +                                    ret = amdgpu_wait_resubmitted_jobs_completion(&ring->sched, ring->sched.timeout, &s_bad_job);
>> +                                    if (ret == -1) {
>> +                                            DRM_ERROR("Found the real bad job! ring:%s, job_id:%llx\n", ring->sched.name, s_bad_job->id);
>> +                                            /* set guilty */
>> +                                            drm_sched_increase_karma(s_bad_job);
>> +
>> +                                            /* do hw reset */
>> +                                            if (amdgpu_sriov_vf(adev)) {
>> +                                                    amdgpu_virt_fini_data_exchange(adev);
>> +                                                    r = amdgpu_device_reset_sriov(adev, false);
>> +                                                    if (r)
>> +                                                            adev->asic_reset_res = r;
>> +                                            } else {
>> +                                                    r  = amdgpu_do_asic_reset(hive, device_list_handle, &need_full_reset, false);
>> +                                                    if (r && r == -EAGAIN)
>> +                                                            goto retry;
>> +                                            }
>> +
>> +                                            /* add reset counter so that the following resubmitted job could flush vmid */
>> +                                            atomic_inc(&tmp_adev->gpu_reset_counter);
>> +                                            step = 1;
>> +                                            goto retry_resubmit;
>> +                                    }
>> +                            }
>> +                    }
>> +
>> +                    if (step == 1)
>> +                            drm_sched_start(&ring->sched, !tmp_adev->asic_reset_res);
>>              }
>>    
>>              if (!amdgpu_device_has_dc_support(tmp_adev) && !job_signaled) { 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> index 865f924772b0..9c3f4edb7532 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> @@ -509,7 +509,7 @@ module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
>>     * DOC: gpu_recovery (int)
>>     * Set to enable GPU recovery mechanism (1 = enable, 0 = disable). The default is -1 (auto, disabled except SRIOV).
>>     */
>> -MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = auto)");
>> +MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (2 = advanced tdr mode, 1 = enable, 0 = disable, -1 = auto)");
>>    module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
>>    
>>    /**
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index 759b34799221..28cda321157a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -281,6 +281,53 @@ void amdgpu_job_stop_all_jobs_on_sched(struct drm_gpu_scheduler *sched)
>>      }
>>    }
>>    
>> +int amdgpu_wait_resubmitted_jobs_completion(struct drm_gpu_scheduler *sched, long timeout, struct drm_sched_job **s_bad_job) {
>> +    struct drm_sched_job *s_job, *tmp;
>> +    int ret = 0;
>> +
>> +    list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>> +            struct drm_sched_fence *s_fence = s_job->s_fence;
>> +
>> +                    if (s_fence->parent == NULL) { /* fail to get a hw fence */
>> +                            /* process a job */
>> +                            atomic_dec(&sched->num_jobs);
>> +                            dma_fence_get(&s_fence->finished);
>> +                            dma_fence_signal(&s_fence->finished);
>> +                            dma_fence_put(&s_fence->finished);
>> +
>> +                            /* remove node from mirror_list and free the job */
>> +                            spin_lock(&sched->job_list_lock);
>> +                            list_del_init(&s_job->node);
>> +                            spin_unlock(&sched->job_list_lock);
>> +                            sched->ops->free_job(s_job);
>> +                            continue;
>> +                    }
>> +
>> +                    ret = dma_fence_wait_timeout(s_fence->parent, false, timeout);
>> +
>> +                    if (ret > 0) { /* succeed */
>> +                            /* process a job */
>> +                            atomic_dec(&sched->num_jobs);
>> +                            dma_fence_get(&s_fence->finished);
>> +                            dma_fence_signal(&s_fence->finished);
>> +                            dma_fence_put(&s_fence->finished);
>> +
>> +                            /* remove node from mirror_list and free the job */
>> +                            spin_lock(&sched->job_list_lock);
>> +                            list_del_init(&s_job->node);
>> +                            spin_unlock(&sched->job_list_lock);
>> +                            sched->ops->free_job(s_job);
>> +                            continue;
>> +                    } else if (ret == 0) {
>> +                            *s_bad_job = s_job;
>> +                            return -1; /* timeout */
>> +                    }
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>    const struct drm_sched_backend_ops amdgpu_sched_ops = {
>>      .dependency = amdgpu_job_dependency,
>>      .run_job = amdgpu_job_run,
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>> index 81caac9b958a..25292f4699fb 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>> @@ -76,5 +76,5 @@ int amdgpu_job_submit_direct(struct amdgpu_job *job, struct amdgpu_ring *ring,
>>                           struct dma_fence **fence);
>>    
>>    void amdgpu_job_stop_all_jobs_on_sched(struct drm_gpu_scheduler *sched);
>> -
>> +int amdgpu_wait_resubmitted_jobs_completion(struct drm_gpu_scheduler *sched, long timeout, struct drm_sched_job **s_bad_job);
>>    #endif
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>> index b644c78475fd..cb50bfc80bc9 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>> @@ -35,6 +35,11 @@
>>    #include "amdgpu.h"
>>    #include "atom.h"
>>    
>> +static char *amdgpu_affinity_group[] = { "gfx", "comp"
>> +};
>> +
>>    /*
>>     * Rings
>>     * Most engines on the GPU are fed via ring buffers.  Ring
>> @@ -189,6 +194,7 @@ int amdgpu_ring_init(struct amdgpu_device *adev, struct amdgpu_ring *ring,
>>              ring->adev = adev;
>>              ring->idx = adev->num_rings++;
>>              adev->rings[ring->idx] = ring;
>> +            amdgpu_ring_set_affinity_group(ring);
>>              r = amdgpu_fence_driver_init_ring(ring, sched_hw_submission);
>>              if (r)
>>                      return r;
>> @@ -459,3 +465,24 @@ int amdgpu_ring_test_helper(struct amdgpu_ring *ring)
>>      ring->sched.ready = !r;
>>      return r;
>>    }
>> +
>> +int amdgpu_ring_set_affinity_group(struct amdgpu_ring *ring) {
>> +       struct amdgpu_device *adev = ring->adev;
>> +       int i;
>> +
>> +       for (i = 0; i < ARRAY_SIZE(amdgpu_affinity_group); i++) {
>> +               char *temp_name = amdgpu_affinity_group[i];
>> +
>> +               /* set ring's affinity_group bit if find it in affinity_group list */
>> +               if (strncmp(ring->name, temp_name, strlen(temp_name)) == 0) {
>> +                       DRM_DEV_INFO(adev->dev, "set ring:%s in affinity_group\n",
>> +                             ring->name);
>> +                       ring->sched.affinity_group = 1;
>> +                       return 0;
>> +               }
>> +       }
>> +
>> +       ring->sched.affinity_group = 0;
>> +       return 0;
>> +}
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> index 56acec1075ac..6b0d217e6f5a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> @@ -350,4 +350,5 @@ int amdgpu_debugfs_ring_init(struct amdgpu_device *adev,
>>                           struct amdgpu_ring *ring);
>>    void amdgpu_debugfs_ring_fini(struct amdgpu_ring *ring);
>>    
>> +int amdgpu_ring_set_affinity_group(struct amdgpu_ring *ring);
>>    #endif
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index 1c815e0a14ed..589cbaea35dc 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -301,6 +301,7 @@ struct drm_gpu_scheduler {
>>      atomic_t                        _score;
>>      bool                            ready;
>>      bool                            free_guilty;
>> +    int                             affinity_group;
>>    };
>>    
>>    int drm_sched_init(struct drm_gpu_scheduler *sched,
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
