[AMD Official Use Only - Internal Distribution Only]

Hi Christian,
Since this change is quite critical to our project, we would be grateful if you could review it. Is there anything that is not clear enough that I could help explain?

Again, thanks for your huge help with our problem.

Thanks,
Jack

-----Original Message-----
From: Koenig, Christian <christian.koe...@amd.com>
Sent: Monday, March 8, 2021 8:45 PM
To: Zhang, Jack (Jian) <jack.zha...@amd.com>; Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey <andrey.grodzov...@amd.com>; Deng, Emily <emily.d...@amd.com>
Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode

Hi Jack,

yes, that comes pretty close. I'm going over the patch right now.

Some things still look a bit complicated to me, but I need to wrap my head around how and why we are doing it this way once more.

Christian.

On 08.03.21 at 13:43, Zhang, Jack (Jian) wrote:
> [AMD Public Use]
>
> Hi Christian,
>
> I made some changes in the V3 patch that insert a dma_fence_wait for the
> first jobs after the resubmitted jobs.
> It seems simpler than the V2 patch. Is this what you first had in mind?
>
> Thanks,
> Jack
>
> -----Original Message-----
> From: Koenig, Christian <christian.koe...@amd.com>
> Sent: Monday, March 8, 2021 3:53 PM
> To: Liu, Monk <monk....@amd.com>; Zhang, Jack (Jian) <jack.zha...@amd.com>; amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey <andrey.grodzov...@amd.com>; Deng, Emily <emily.d...@amd.com>
> Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode
>
> On 08.03.21 at 05:06, Liu, Monk wrote:
>> [AMD Official Use Only - Internal Distribution Only]
>>
>>>> well, first of all, please completely drop the affinity group stuff from
>>>> this patch. We should concentrate on one feature at a time.
>> We need it to expedite the process; we can introduce this change in
>> another patch.
>>
>>>> Then the implementation is way too complicated. All you need to do is
>>>> insert a dma_fence_wait after re-scheduling each job after a reset.
>> No, that's not true. During drm_sched_resubmit_jobs() it will put all
>> jobs in the mirror list into the hw ring, but we can only allow the
>> first job into the ring, to catch the real guilty one (otherwise it is
>> possible that a later job in the ring also has a bug and affects our
>> judgement). So we need to implement a new drm_sched_resubmit_jobs2(),
>> like this:
> Something like this. But since waiting for the guilty job is AMD specific,
> we should rather rework the stuff from the beginning.
>
> What I have in mind is the following:
> 1. Add a reference from the scheduler fence back to the job, which is
> cleared only when the scheduler fence finishes.
> 2. Completely drop the ring_mirror_list and replace it with a kfifo of
> pointers to the active scheduler fences.
> 3. Replace drm_sched_resubmit_jobs with a drm_sched_for_each_active()
> macro which allows drivers to iterate over all the active jobs and
> resubmit/wait/mark them as guilty, etc.
> 4. Remove the guilty/karma handling from the scheduler. This is something
> AMD specific and shouldn't leak into common code.
>
> Regards,
> Christian.
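[Editor's illustration] To make points 1-3 of the proposal above more concrete, here is a minimal, hypothetical sketch. None of these names are real scheduler API: the active_fences kfifo member on struct drm_gpu_scheduler, the job back-reference on struct drm_sched_fence and the drm_sched_for_each_active() macro are assumptions for illustration only, following the numbered points above.

#include <linux/dma-fence.h>
#include <linux/err.h>
#include <linux/kfifo.h>
#include <drm/gpu_scheduler.h>

/*
 * Assumed additions (proposal points 1 and 2), not existing code:
 *   struct drm_sched_fence   { ...; struct drm_sched_job *job; };
 *   struct drm_gpu_scheduler { ...;
 *           DECLARE_KFIFO_PTR(active_fences, struct drm_sched_fence *); };
 */

/* Pop each active fence, hand it to the loop body, then push it back so
 * the fifo still holds every active fence once the iteration is done. */
#define drm_sched_for_each_active(sched, s_fence, n, i)                       \
    for ((i) = 0, (n) = kfifo_len(&(sched)->active_fences);                   \
         (i) < (n) && kfifo_out(&(sched)->active_fences, &(s_fence), 1);      \
         kfifo_in(&(sched)->active_fences, &(s_fence), 1), (i)++)

/* AMD-specific policy on top of the macro: resubmit each active job and
 * wait for its hardware fence, so the first job that times out is treated
 * as the real guilty one. */
static struct drm_sched_job *
amdgpu_resubmit_and_find_guilty(struct drm_gpu_scheduler *sched, long timeout)
{
    struct drm_sched_job *guilty = NULL;
    struct drm_sched_fence *s_fence;
    unsigned int n, i;

    drm_sched_for_each_active(sched, s_fence, n, i) {
        struct drm_sched_job *s_job = s_fence->job; /* assumed back-reference */
        struct dma_fence *fence;

        if (guilty)
            continue; /* keep iterating so the fifo is left intact */

        dma_fence_put(s_fence->parent);
        fence = sched->ops->run_job(s_job);
        s_fence->parent = IS_ERR_OR_NULL(fence) ? NULL : fence;

        /* Wait synchronously; a timeout identifies the guilty job. */
        if (s_fence->parent &&
            dma_fence_wait_timeout(s_fence->parent, false, timeout) <= 0)
            guilty = s_job;
    }

    return guilty;
}

With an iteration helper along these lines, the resubmit/wait/mark-guilty policy, and the karma bookkeeping from point 4, could stay entirely inside amdgpu rather than in the common scheduler code.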
>
>> drm_sched_resubmit_jobs2():
>>
>> ~ void drm_sched_resubmit_jobs2(struct drm_gpu_scheduler *sched, int max)
>> {
>>     struct drm_sched_job *s_job, *tmp;
>>     uint64_t guilty_context;
>>     bool found_guilty = false;
>>     struct dma_fence *fence;
>> +   int i = 0;
>>
>>     list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>         struct drm_sched_fence *s_fence = s_job->s_fence;
>>
>> +       if (i >= max)
>> +           break;
>> +
>>         if (!found_guilty && atomic_read(&s_job->karma) > sched->hang_limit) {
>>             found_guilty = true;
>>             guilty_context = s_job->s_fence->scheduled.context;
>>         }
>>
>>         if (found_guilty && s_job->s_fence->scheduled.context == guilty_context)
>>             dma_fence_set_error(&s_fence->finished, -ECANCELED);
>>
>>         dma_fence_put(s_job->s_fence->parent);
>>         fence = sched->ops->run_job(s_job);
>> +       i++;
>>
>>         if (IS_ERR_OR_NULL(fence)) {
>>             if (IS_ERR(fence))
>>                 dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
>>
>>             s_job->s_fence->parent = NULL;
>>         } else {
>>             s_job->s_fence->parent = fence;
>>         }
>>     }
>> }
>> ~ EXPORT_SYMBOL(drm_sched_resubmit_jobs2);
>>
>>
>> Thanks
>>
>> ------------------------------------------
>> Monk Liu | Cloud-GPU Core team
>> ------------------------------------------
>>
>> -----Original Message-----
>> From: Koenig, Christian <christian.koe...@amd.com>
>> Sent: Sunday, March 7, 2021 3:03 AM
>> To: Zhang, Jack (Jian) <jack.zha...@amd.com>; amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey <andrey.grodzov...@amd.com>; Liu, Monk <monk....@amd.com>; Deng, Emily <emily.d...@amd.com>
>> Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode
>>
>> Hi Jack,
>>
>> well, first of all, please completely drop the affinity group stuff from
>> this patch. We should concentrate on one feature at a time.
>>
>> Then the implementation is way too complicated. All you need to do is
>> insert a dma_fence_wait after re-scheduling each job after a reset.
>>
>> In addition to that, this feature is completely AMD specific and shouldn't
>> affect the common scheduler in any way.
>>
>> Regards,
>> Christian.
>>
>> On 06.03.21 at 18:25, Jack Zhang wrote:
>>> [Why]
>>> The previous TDR design treats the first job in job_timeout as the bad job.
>>> But sometimes a later bad compute job can block a good gfx job and
>>> cause an unexpected gfx job timeout, because the gfx and compute rings
>>> share internal GC hardware.
>>>
>>> [How]
>>> This patch implements an advanced TDR mode. It involves an additional
>>> synchronous pre-resubmit step (Step0 Resubmit) before the normal
>>> resubmit step, in order to find the real bad job.
>>>
>>> 1. For the bailing TDR job, re-insert it into the mirror_list, don't set
>>> it as guilty, and leave it to be handled by the main reset thread.
>>>
>>> 2. Don't set the job as guilty in pre_asic_reset, and leave it to be
>>> handled by the Step0 Resubmit stage.
>>>
>>> 3. At the Step0 Resubmit stage, first resubmit jobs asynchronously,
>>> then iterate each ring's mirror_list and synchronously wait for each
>>> hw fence to be signaled. If a job's hw fence times out, identify it as
>>> guilty and do a hw reset to recover the hw. After that, do the normal
>>> resubmit step to resubmit the remaining jobs.
>>>
>>> 4. For a whole-GPU reset (vram lost), skip Step0 Resubmit, as each job
>>> after vram is lost is considered a bad job.
>>>
>>> 5. Introduce the concept of an "Affinity Group".
>>> Doing two hw resets is not necessary when only one ring among some
>>> hw-related rings has jobs. Thus, we introduce the "affinity group".
>>> Hw-related rings can be added to a common affinity group, such as the
>>> gfx and compute rings. When TDR happens, we iterate over all rings in
>>> the affinity group and skip the Step0 Resubmit stage if only one ring's
>>> mirror_list has valid sched jobs.
>>>
>>> V2:
>>> - Fix a cherry-pick mistake for bailing TDR handling.
>>>
>>> - Do the affinity_group check according to the bad job's sched rather
>>>   than the default "1", so that multiple affinity groups can be
>>>   pre-defined in the future.
>>>
>>> Signed-off-by: Jack Zhang <jack.zha...@amd.com>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 102 +++++++++++++++++++--
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   2 +-
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  47 ++++++++++
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.h    |   2 +-
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c   |  27 ++++++
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   1 +
>>>  include/drm/gpu_scheduler.h                |   1 +
>>>  7 files changed, 173 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index e247c3a2ec08..8632d7071292 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -4188,6 +4188,37 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev)
>>>      return false;
>>>  }
>>>
>>> +bool amdgpu_affinity_group_has_only_or_null_working_ring(struct amdgpu_device *adev, struct drm_sched_job *s_job) {
>>> +    int i;
>>> +    int working_ring_num = 0;
>>> +
>>> +    /*
>>> +     * The job is considered as the real bad one
>>> +     * if the job's sched is not in an affinity group
>>> +     */
>>> +    if (s_job->sched.affinity_group == 0)
>>> +        return true;
>>> +
>>> +    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>> +        struct amdgpu_ring *ring = adev->rings[i];
>>> +
>>> +        if (!ring || !ring->sched.thread)
>>> +            continue;
>>> +
>>> +        /* for a non-empty affinity ring, increase working_ring_num */
>>> +        if (ring->sched.affinity_group == s_job->sched.affinity_group) {
>>> +            if (!list_empty(&ring->sched.ring_mirror_list))
>>> +                working_ring_num++;
>>> +        }
>>> +    }
>>> +
>>> +    if (working_ring_num > 1) {
>>> +        return false;
>>> +    }
>>> +    return true;
>>> +}
>>> +
>>>  /**
>>>   * amdgpu_device_should_recover_gpu - check if we should try GPU recovery
>>>   *
>>> @@ -4310,8 +4341,10 @@ static int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>>          amdgpu_fence_driver_force_completion(ring);
>>>      }
>>>
>>> -    if (job)
>>> -        drm_sched_increase_karma(&job->base);
>>> +    if (amdgpu_gpu_recovery != 2) {
>>> +        if (job)
>>> +            drm_sched_increase_karma(&job->base);
>>> +    }
>>>
>>>      /* Don't suspend on bare metal if we are not going to HW reset the ASIC */
>>>      if (!amdgpu_sriov_vf(adev)) {
>>> @@ -4639,7 +4672,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>      int i, r = 0;
>>>      bool need_emergency_restart = false;
>>>      bool audio_suspended = false;
>>> -
>>> +    int tmp_vram_lost_counter;
>>>      /*
>>>       * Special case: RAS triggered and full reset isn't supported
>>>       */
>>> @@ -4690,8 +4723,16 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>            job ? job->base.id : -1);
>>>
>>>          /* even we skipped this reset, still need to set the job to guilty */
>>> -        if (job)
>>> -            drm_sched_increase_karma(&job->base);
>>> +        if (job) {
>>> +            if (amdgpu_gpu_recovery == 2) {
>>> +                if (&job->base) {
>>> +                    spin_lock(&job->base.sched->job_list_lock);
>>> +                    list_add(&job->base.node, &job->base.sched->ring_mirror_list);
>>> +                    spin_unlock(&job->base.sched->job_list_lock);
>>> +                }
>>> +            } else
>>> +                drm_sched_increase_karma(&job->base);
>>> +        }
>>>          goto skip_recovery;
>>>      }
>>>
>>> @@ -4788,6 +4829,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>          }
>>>      }
>>>
>>> +    tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));
>>>      /* Actual ASIC resets if needed.*/
>>>      /* TODO Implement XGMI hive reset logic for SRIOV */
>>>      if (amdgpu_sriov_vf(adev)) {
>>> @@ -4804,18 +4846,64 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>
>>>      /* Post ASIC reset for all devs .*/
>>>      list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
>>> +        int step = 1;
>>>
>>> +        if (amdgpu_gpu_recovery == 2) {
>>> +            if (amdgpu_affinity_group_has_only_or_null_working_ring(adev, &job->base)
>>> +                || tmp_vram_lost_counter < atomic_read(&adev->vram_lost_counter)) {
>>> +                DRM_INFO("Skip Stage0 Resubmit Stage\n");
>>> +                /* set guilty */
>>> +                drm_sched_increase_karma(&job->base);
>>> +                step = 1;
>>> +            } else {
>>> +                DRM_INFO("Do Stage0 Resubmit Stage\n");
>>> +                step = 0;
>>> +            }
>>> +        }
>>> +
>>> +retry_resubmit:
>>>          for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>              struct amdgpu_ring *ring = tmp_adev->rings[i];
>>> +            int ret = 0;
>>> +            struct drm_sched_job *s_bad_job = NULL;
>>>
>>>              if (!ring || !ring->sched.thread)
>>>                  continue;
>>>
>>>              /* No point to resubmit jobs if we didn't HW reset*/
>>> -            if (!tmp_adev->asic_reset_res && !job_signaled)
>>> +            if (!tmp_adev->asic_reset_res && !job_signaled) {
>>> +
>>>                  drm_sched_resubmit_jobs(&ring->sched);
>>>
>>> -            drm_sched_start(&ring->sched, !tmp_adev->asic_reset_res);
>>> +                if (amdgpu_gpu_recovery == 2 && step == 0) {
>>> +                    ret = amdgpu_wait_resubmitted_jobs_completion(&ring->sched, ring->sched.timeout, &s_bad_job);
>>> +                    if (ret == -1) {
>>> +                        DRM_ERROR("Found the real bad job! ring:%s, job_id:%llx\n", ring->sched.name, s_bad_job->id);
>>> +                        /* set guilty */
>>> +                        drm_sched_increase_karma(s_bad_job);
>>> +
>>> +                        /* do hw reset */
>>> +                        if (amdgpu_sriov_vf(adev)) {
>>> +                            amdgpu_virt_fini_data_exchange(adev);
>>> +                            r = amdgpu_device_reset_sriov(adev, false);
>>> +                            if (r)
>>> +                                adev->asic_reset_res = r;
>>> +                        } else {
>>> +                            r = amdgpu_do_asic_reset(hive, device_list_handle, &need_full_reset, false);
>>> +                            if (r && r == -EAGAIN)
>>> +                                goto retry;
>>> +                        }
>>> +
>>> +                        /* add reset counter so that the following resubmitted job could flush vmid */
>>> +                        atomic_inc(&tmp_adev->gpu_reset_counter);
>>> +                        step = 1;
>>> +                        goto retry_resubmit;
>>> +                    }
>>> +                }
>>> +            }
>>> +
>>> +            if (step == 1)
>>> +                drm_sched_start(&ring->sched, !tmp_adev->asic_reset_res);
>>>          }
>>>
>>>          if (!amdgpu_device_has_dc_support(tmp_adev) && !job_signaled) {
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> index 865f924772b0..9c3f4edb7532 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> @@ -509,7 +509,7 @@ module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
>>>   * DOC: gpu_recovery (int)
>>>   * Set to enable GPU recovery mechanism (1 = enable, 0 = disable). The default is -1 (auto, disabled except SRIOV).
>>>   */
>>> -MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = auto)");
>>> +MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (2 = advanced tdr mode, 1 = enable, 0 = disable, -1 = auto)");
>>>  module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
>>>
>>>  /**
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index 759b34799221..28cda321157a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -281,6 +281,53 @@ void amdgpu_job_stop_all_jobs_on_sched(struct drm_gpu_scheduler *sched)
>>>      }
>>>  }
>>>
>>> +int amdgpu_wait_resubmitted_jobs_completion(struct drm_gpu_scheduler *sched, long timeout, struct drm_sched_job **s_bad_job) {
>>> +    struct drm_sched_job *s_job, *tmp;
>>> +    int ret = 0;
>>> +
>>> +    list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>> +        struct drm_sched_fence *s_fence = s_job->s_fence;
>>> +
>>> +        if (s_fence->parent == NULL) { /* fail to get a hw fence */
>>> +            /* process a job */
>>> +            atomic_dec(&sched->num_jobs);
>>> +            dma_fence_get(&s_fence->finished);
>>> +            dma_fence_signal(&s_fence->finished);
>>> +            dma_fence_put(&s_fence->finished);
>>> +
>>> +            /* remove node from mirror_list and free the job */
>>> +            spin_lock(&sched->job_list_lock);
>>> +            list_del_init(&s_job->node);
>>> +            spin_unlock(&sched->job_list_lock);
>>> +            sched->ops->free_job(s_job);
>>> +            continue;
>>> +        }
>>> +
>>> +        ret = dma_fence_wait_timeout(s_fence->parent, false, timeout);
>>> +
>>> +        if (ret > 0) { /* succeed */
>>> +            /* process a job */
>>> +            atomic_dec(&sched->num_jobs);
>>> +            dma_fence_get(&s_fence->finished);
>>> +            dma_fence_signal(&s_fence->finished);
>>> +            dma_fence_put(&s_fence->finished);
>>> +
>>> +            /* remove node from mirror_list and free the job */
>>> +            spin_lock(&sched->job_list_lock);
>>> +            list_del_init(&s_job->node);
>>> +            spin_unlock(&sched->job_list_lock);
>>> +            sched->ops->free_job(s_job);
>>> +            continue;
>>> +        } else if (ret == 0) {
>>> +            *s_bad_job = s_job;
>>> +            return -1; /* timeout */
>>> +        }
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>>  const struct drm_sched_backend_ops amdgpu_sched_ops = {
>>>      .dependency = amdgpu_job_dependency,
>>>      .run_job = amdgpu_job_run,
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>> index 81caac9b958a..25292f4699fb 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>> @@ -76,5 +76,5 @@ int amdgpu_job_submit_direct(struct amdgpu_job *job, struct amdgpu_ring *ring,
>>>                               struct dma_fence **fence);
>>>
>>>  void amdgpu_job_stop_all_jobs_on_sched(struct drm_gpu_scheduler *sched);
>>> -
>>> +int amdgpu_wait_resubmitted_jobs_completion(struct drm_gpu_scheduler *sched, long timeout, struct drm_sched_job **s_bad_job);
>>>  #endif
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>> index b644c78475fd..cb50bfc80bc9 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>> @@ -35,6 +35,11 @@
>>>  #include "amdgpu.h"
>>>  #include "atom.h"
>>>
>>> +static char *amdgpu_affinity_group[] = {
>>> +    "gfx",
>>> +    "comp"
>>> +};
>>> +
>>>  /*
>>>   * Rings
>>>   * Most engines on the GPU are fed via ring buffers.  Ring
>>> @@ -189,6 +194,7 @@ int amdgpu_ring_init(struct amdgpu_device *adev, struct amdgpu_ring *ring,
>>>      ring->adev = adev;
>>>      ring->idx = adev->num_rings++;
>>>      adev->rings[ring->idx] = ring;
>>> +    amdgpu_ring_set_affinity_group(ring);
>>>      r = amdgpu_fence_driver_init_ring(ring, sched_hw_submission);
>>>      if (r)
>>>          return r;
>>> @@ -459,3 +465,24 @@ int amdgpu_ring_test_helper(struct amdgpu_ring *ring)
>>>      ring->sched.ready = !r;
>>>      return r;
>>>  }
>>> +
>>> +int amdgpu_ring_set_affinity_group(struct amdgpu_ring *ring) {
>>> +    struct amdgpu_device *adev = ring->adev;
>>> +    int i;
>>> +
>>> +    for (i = 0; i < ARRAY_SIZE(amdgpu_affinity_group); i++) {
>>> +        char *temp_name = amdgpu_affinity_group[i];
>>> +
>>> +        /* set ring's affinity_group bit if find it in affinity_group list */
>>> +        if (strncmp(ring->name, temp_name, strlen(temp_name)) == 0) {
>>> +            DRM_DEV_INFO(adev->dev, "set ring:%s in affinity_group\n",
>>> +                         ring->name);
>>> +            ring->sched.affinity_group = 1;
>>> +            return 0;
>>> +        }
>>> +    }
>>> +
>>> +    ring->sched.affinity_group = 0;
>>> +    return 0;
>>> +}
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> index 56acec1075ac..6b0d217e6f5a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> @@ -350,4 +350,5 @@ int amdgpu_debugfs_ring_init(struct amdgpu_device *adev,
>>>                               struct amdgpu_ring *ring);
>>>  void amdgpu_debugfs_ring_fini(struct amdgpu_ring *ring);
>>>
>>> +int amdgpu_ring_set_affinity_group(struct amdgpu_ring *ring);
>>>  #endif
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index 1c815e0a14ed..589cbaea35dc 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -301,6 +301,7 @@ struct drm_gpu_scheduler {
>>>      atomic_t                        _score;
>>>      bool                            ready;
>>>      bool                            free_guilty;
>>> +    int                             affinity_group;
>>>  };
>>>
>>>  int drm_sched_init(struct drm_gpu_scheduler *sched,
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
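[Editor's illustration] Christian's feedback above is that the affinity group is AMD specific and shouldn't touch the common scheduler. One possible direction, sketched purely as an assumption, would be to keep the grouping on struct amdgpu_ring instead of struct drm_gpu_scheduler; the affinity_group member used below does not exist in the driver today.

#include "amdgpu.h"
#include "amdgpu_ring.h"

/*
 * Hypothetical sketch only: keeps the affinity grouping inside amdgpu,
 * per the review feedback that it is AMD specific. The "affinity_group"
 * member on struct amdgpu_ring is assumed for illustration.
 */
static bool amdgpu_job_in_same_affinity_group(struct drm_sched_job *s_job,
                                              struct amdgpu_ring *other)
{
    /* to_amdgpu_ring() maps a drm_gpu_scheduler back to its amdgpu_ring */
    struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);

    return ring->affinity_group != 0 &&
           ring->affinity_group == other->affinity_group;
}

The only-one-working-ring check in amdgpu_device_gpu_recover() could then be written against a helper like this, and the int affinity_group field added to include/drm/gpu_scheduler.h in the patch above would no longer be needed.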