Hi Andrey,

I don't have a bare metal environment, I can only test the SRIOV cases.
Best Regards,
JingWen Chen

> On Mar 3, 2022, at 01:55, Grodzovsky, Andrey <andrey.grodzov...@amd.com> wrote:
>
> The patch is acked-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
>
> If you also smoke tested bare metal, feel free to apply all the patches; if not, let me know.
>
> Andrey
>
> On 2022-03-02 04:51, JingWen Chen wrote:
>> Hi Andrey,
>>
>> Most of the patches are OK, but the code will introduce an ib test failure
>> on the disabled vcn of sienna_cichlid.
>>
>> In the SRIOV use case we disable one vcn on sienna_cichlid. I have attached
>> a patch to fix this issue, please check the attachment.
>>
>> Best Regards,
>> Jingwen Chen
>>
>> On 2/26/22 5:22 AM, Andrey Grodzovsky wrote:
>>> Hey, patches attached - I applied the patches and resolved the merge conflicts,
>>> but I wasn't able to test, as my board's onboard network card doesn't work with
>>> the 5.16 kernel (it does with 5.17; maybe it's a Kconfig issue and I need to
>>> check more).
>>> The patches are on top of commit 'cababde192b2 drm/amd/pm: fix mode2 reset
>>> fail for smu 13.0.5' (Yifan Zhang, 2 days ago).
>>>
>>> Please test and let me know. Maybe by Monday I will be able to resolve the
>>> connectivity issue on 5.16.
>>>
>>> Andrey
>>>
>>> On 2022-02-24 22:13, JingWen Chen wrote:
>>>> Hi Andrey,
>>>>
>>>> Sorry for the misleading wording, I meant the whole patch series. We are
>>>> depending on this patch series to fix the concurrency issue within the
>>>> SRIOV TDR sequence.
>>>>
>>>> On 2/25/22 1:26 AM, Andrey Grodzovsky wrote:
>>>>> No problem if so, but before I do:
>>>>>
>>>>> JingWen - why do you think this patch is needed as a standalone now? It has
>>>>> no use without the entire feature together with it. Are there some changes
>>>>> you want to do on top of that code?
>>>>>
>>>>> Andrey
>>>>>
>>>>> On 2022-02-24 12:12, Deucher, Alexander wrote:
>>>>>> [Public]
>>>>>>
>>>>>> If it applies cleanly, feel free to drop it in.
>>>>>> I'll drop those patches
>>>>>> for drm-next since they are already in drm-misc.
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>> ------------------------------------------------------------------------
>>>>>> *From:* amd-gfx <amd-gfx-boun...@lists.freedesktop.org> on behalf of
>>>>>> Andrey Grodzovsky <andrey.grodzov...@amd.com>
>>>>>> *Sent:* Thursday, February 24, 2022 11:24 AM
>>>>>> *To:* Chen, JingWen <jingwen.ch...@amd.com>; Christian König
>>>>>> <ckoenig.leichtzumer...@gmail.com>; dri-de...@lists.freedesktop.org
>>>>>> <dri-de...@lists.freedesktop.org>; amd-gfx@lists.freedesktop.org
>>>>>> <amd-gfx@lists.freedesktop.org>
>>>>>> *Cc:* Liu, Monk <monk....@amd.com>; Chen, Horace <horace.c...@amd.com>;
>>>>>> Lazar, Lijo <lijo.la...@amd.com>; Koenig, Christian
>>>>>> <christian.koe...@amd.com>; dan...@ffwll.ch <dan...@ffwll.ch>
>>>>>> *Subject:* Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after
>>>>>> XGMI is ready
>>>>>>
>>>>>> No, because the whole patch set, including this patch, has landed in
>>>>>> drm-misc-next and will reach amd-staging-drm-next on the next upstream
>>>>>> rebase, I guess.
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>> On 2022-02-24 01:47, JingWen Chen wrote:
>>>>>>> Hi Andrey,
>>>>>>>
>>>>>>> Will you port this patch into amd-staging-drm-next?
>>>>>>>
>>>>>>> On 2/10/22 2:06 AM, Andrey Grodzovsky wrote:
>>>>>>>> All comments are fixed and the code is pushed. Thanks to everyone
>>>>>>>> who helped with reviewing.
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>> On 2022-02-09 02:53, Christian König wrote:
>>>>>>>>> On 09.02.22 at 01:23, Andrey Grodzovsky wrote:
>>>>>>>>>> Before we initialize schedulers we must know which reset
>>>>>>>>>> domain we are in - for a single device there is a single
>>>>>>>>>> domain per device and so a single wq per device. For XGMI
>>>>>>>>>> the reset domain spans the entire XGMI hive, and so the
>>>>>>>>>> reset wq is per hive.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
>>>>>>>>>
>>>>>>>>> One more comment below; with that fixed, Reviewed-by: Christian König
>>>>>>>>> <christian.koe...@amd.com>.
>>>>>>>>>
>>>>>>>>>> ---
>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 45 ++++++++++++++++++++++
>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 34 ++--------------
>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  2 +
>>>>>>>>>>  3 files changed, 51 insertions(+), 30 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> index 9704b0e1fd82..00123b0013d3 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> @@ -2287,6 +2287,47 @@ static int amdgpu_device_fw_loading(struct amdgpu_device *adev)
>>>>>>>>>>      return r;
>>>>>>>>>>  }
>>>>>>>>>> +
>>>>>>>>>> +static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
>>>>>>>>>> +{
>>>>>>>>>> +    long timeout;
>>>>>>>>>> +    int r, i;
>>>>>>>>>> +
>>>>>>>>>> +    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>>>>>> +        struct amdgpu_ring *ring = adev->rings[i];
>>>>>>>>>> +
>>>>>>>>>> +        /* No need to setup the GPU scheduler for rings that don't need it */
>>>>>>>>>> +        if (!ring || ring->no_scheduler)
>>>>>>>>>> +            continue;
>>>>>>>>>> +
>>>>>>>>>> +        switch (ring->funcs->type) {
>>>>>>>>>> +        case AMDGPU_RING_TYPE_GFX:
>>>>>>>>>> +            timeout = adev->gfx_timeout;
>>>>>>>>>> +            break;
>>>>>>>>>> +        case AMDGPU_RING_TYPE_COMPUTE:
>>>>>>>>>> +            timeout = adev->compute_timeout;
>>>>>>>>>> +            break;
>>>>>>>>>> +        case AMDGPU_RING_TYPE_SDMA:
>>>>>>>>>> +            timeout = adev->sdma_timeout;
>>>>>>>>>> +            break;
>>>>>>>>>> +        default:
>>>>>>>>>> +            timeout = adev->video_timeout;
>>>>>>>>>> +            break;
>>>>>>>>>> +        }
>>>>>>>>>> +
>>>>>>>>>> +        r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
>>>>>>>>>> +                   ring->num_hw_submission, amdgpu_job_hang_limit,
>>>>>>>>>> +                   timeout, adev->reset_domain.wq, ring->sched_score, ring->name);
>>>>>>>>>> +        if (r) {
>>>>>>>>>> +            DRM_ERROR("Failed to create scheduler on ring %s.\n",
>>>>>>>>>> +                  ring->name);
>>>>>>>>>> +            return r;
>>>>>>>>>> +        }
>>>>>>>>>> +    }
>>>>>>>>>> +
>>>>>>>>>> +    return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +
>>>>>>>>>>  /**
>>>>>>>>>>   * amdgpu_device_ip_init - run init for hardware IPs
>>>>>>>>>>   *
>>>>>>>>>> @@ -2419,6 +2460,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
>>>>>>>>>>          }
>>>>>>>>>>      }
>>>>>>>>>> +
>>>>>>>>>> +    r = amdgpu_device_init_schedulers(adev);
>>>>>>>>>> +    if (r)
>>>>>>>>>> +        goto init_failed;
>>>>>>>>>> +
>>>>>>>>>>      /* Don't init kfd if whole hive need to be reset during init */
>>>>>>>>>>      if (!adev->gmc.xgmi.pending_reset)
>>>>>>>>>>          amdgpu_amdkfd_device_init(adev);
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>>>>> index 45977a72b5dd..fa302540c69a 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>>>>> @@ -457,8 +457,6 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>>>>>>>>>>                    atomic_t *sched_score)
>>>>>>>>>>  {
>>>>>>>>>>      struct amdgpu_device *adev = ring->adev;
>>>>>>>>>> -    long timeout;
>>>>>>>>>> -    int r;
>>>>>>>>>>      if (!adev)
>>>>>>>>>>          return -EINVAL;
>>>>>>>>>> @@ -478,36 +476,12 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>>>>>>>>>>      spin_lock_init(&ring->fence_drv.lock);
>>>>>>>>>>      ring->fence_drv.fences = kcalloc(num_hw_submission * 2, sizeof(void *),
>>>>>>>>>>                       GFP_KERNEL);
>>>>>>>>>> -    if (!ring->fence_drv.fences)
>>>>>>>>>> -        return -ENOMEM;
>>>>>>>>>> -    /* No need to setup the GPU scheduler for rings that don't need it */
>>>>>>>>>> -    if (ring->no_scheduler)
>>>>>>>>>> -        return 0;
>>>>>>>>>> +    ring->num_hw_submission = num_hw_submission;
>>>>>>>>>> +    ring->sched_score = sched_score;
>>>>>>>>>
>>>>>>>>> Let's move this into the caller and then use ring->num_hw_submission
>>>>>>>>> in the fence code as well.
>>>>>>>>>
>>>>>>>>> The maximum number of jobs on the ring is not really fence specific.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>> -    switch (ring->funcs->type) {
>>>>>>>>>> -    case AMDGPU_RING_TYPE_GFX:
>>>>>>>>>> -        timeout = adev->gfx_timeout;
>>>>>>>>>> -        break;
>>>>>>>>>> -    case AMDGPU_RING_TYPE_COMPUTE:
>>>>>>>>>> -        timeout = adev->compute_timeout;
>>>>>>>>>> -        break;
>>>>>>>>>> -    case AMDGPU_RING_TYPE_SDMA:
>>>>>>>>>> -        timeout = adev->sdma_timeout;
>>>>>>>>>> -        break;
>>>>>>>>>> -    default:
>>>>>>>>>> -        timeout = adev->video_timeout;
>>>>>>>>>> -        break;
>>>>>>>>>> -    }
>>>>>>>>>> -
>>>>>>>>>> -    r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
>>>>>>>>>> -               num_hw_submission, amdgpu_job_hang_limit,
>>>>>>>>>> -               timeout, NULL, sched_score, ring->name);
>>>>>>>>>> -    if (r) {
>>>>>>>>>> -        DRM_ERROR("Failed to create scheduler on ring %s.\n",
>>>>>>>>>> -              ring->name);
>>>>>>>>>> -        return r;
>>>>>>>>>> -    }
>>>>>>>>>> +
>>>>>>>>>> +    if (!ring->fence_drv.fences)
>>>>>>>>>> +        return -ENOMEM;
>>>>>>>>>>      return 0;
>>>>>>>>>>  }
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>>>>>>> index fae7d185ad0d..7f20ce73a243 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>>>>>>> @@ -251,6 +251,8 @@ struct amdgpu_ring {
>>>>>>>>>>      bool            has_compute_vm_bug;
>>>>>>>>>>      bool            no_scheduler;
>>>>>>>>>>      int             hw_prio;
>>>>>>>>>> +    unsigned        num_hw_submission;
>>>>>>>>>> +    atomic_t        *sched_score;
>>>>>>>>>>  };
>>>>>>>>>>
>>>>>>>>>>  #define amdgpu_ring_parse_cs(r, p, ib) ((r)->funcs->parse_cs((p), (ib)))