[AMD Official Use Only - AMD Internal Distribution Only] Hi Lijo
This is the patch we verified before: diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c index 4843dcb9a5f7..39553c7648eb 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c @@ -308,10 +308,12 @@ int kq_submit_packet(struct kernel_queue *kq) if (kq->dev->kfd->device_info.doorbell_size == 8) { *kq->wptr64_kernel = kq->pending_wptr64; + mb(); write_kernel_doorbell64(kq->queue->properties.doorbell_ptr, kq->pending_wptr64); } else { *kq->wptr_kernel = kq->pending_wptr; + mb(); write_kernel_doorbell(kq->queue->properties.doorbell_ptr, kq->pending_wptr); } This mb() doesn't resolve the problem during customer's testing, I also thought of MB() first in beginning like you and Christian ... The mb() here shall resolve the re-ordering between WPTR and doorbell kicking so GPU won't fetch stalled data from WPTR polling once it receives notification from doorbell kicking. (in SR-IOV we set doorbell mode to force GPU still fetch from WPTR polling area, doorbell kicking is just to notify GPU) And by your theory: mb() shall flush the WC storage buffer to memory, thus, this mb() shall also make sure that the ring buffer is not holding stalled data anymore, right ? But we still hit hang and get stalled data from dump. Maybe we need to put mb() in another place ? can you proposal if you insist the cache mapping is not acceptable to you and we can ask customer to verify again. Thanks Monk Liu | Cloud GPU & Virtualization | AMD -----Original Message----- From: Lazar, Lijo <lijo.la...@amd.com> Sent: Friday, November 8, 2024 7:26 PM To: Liu, Monk <monk....@amd.com>; Koenig, Christian <christian.koe...@amd.com>; Alex Deucher <alexdeuc...@gmail.com>; Zhao, Victor <victor.z...@amd.com> Cc: amd-gfx@lists.freedesktop.org; Yang, Philip <philip.y...@amd.com>; Kuehling, Felix <felix.kuehl...@amd.com> Subject: Re: [PATCH 2/2] drm/amdkfd: use cache GTT buffer for PQ and wb pool On 11/8/2024 4:29 PM, Liu, Monk wrote: > [AMD Official Use Only - AMD Internal Distribution Only] > > To be clear for the mb() approach: Even if we insert mb() in prior to > amdgpu_ring_set_wptr(ring), GPU might still fetch stalled data from PQ due to > USWC attributes. > Inserting an mb() doesn't cause WC buffer flush is a wrong assumption. All prior loads/stores are supposed to be globally visible. Hence mb() followed by a write pointer update also should guarantee the same (From Arch manual). The MFENCE instruction establishes a memory fence for both loads and stores. The processor ensures that no load or store after MFENCE will become globally visible *until all loads and stores before MFENCE are globally visible.* Ignoring the amdgpu driver part of it - if mb() is not working as expected as you claim that means something is wrong with the system. USWC or WB for ring type may still be a separate discussion. Thanks, Lijo > The issue here is not the re-ordering but the stalled PQ. > > Monk Liu | Cloud GPU & Virtualization | AMD > > > > -----Original Message----- > From: Liu, Monk > Sent: Friday, November 8, 2024 6:22 PM > To: Koenig, Christian <christian.koe...@amd.com>; Lazar, Lijo > <lijo.la...@amd.com>; Alex Deucher <alexdeuc...@gmail.com>; Zhao, > Victor <victor.z...@amd.com> > Cc: amd-gfx@lists.freedesktop.org; Yang, Philip <philip.y...@amd.com>; > Kuehling, Felix <felix.kuehl...@amd.com> > Subject: RE: [PATCH 2/2] drm/amdkfd: use cache GTT buffer for PQ and > wb pool > > Christian/Lijo > > We verified all approaches, and we know what works and not works, obviously > the mb() doesn't work. > > The way of mb() between set wptr_polling and kicking off doorbell is > theoretically correct, and I'm not object to do so... but this won't resolve > the issue we hit. > First of all, USWC will have some chance that the data is still in CPU's WC > storage place and not flushed to the memory and even with MB() get rid of > re-ordering GPU might still have a chance to read stalled data from ring > buffer as long as it is mapped USWC. > > This is why only cache plus snoop memory can get rid of inconsistence issues. > > Besides, do you know what's the rational to keep all GFX KCQ cache+snoop but > only HIQ/KIQ from KFD configured to USWC ? > > For performance concern that looks to me always the second priority compared > to "correct" especially under the case HIQ contributes very little to ROCM > performance when switching to cache mapping. > > > Monk Liu | Cloud GPU & Virtualization | AMD > > > > > > > > -----Original Message----- > From: Koenig, Christian <christian.koe...@amd.com> > Sent: Thursday, November 7, 2024 7:57 PM > To: Lazar, Lijo <lijo.la...@amd.com>; Alex Deucher > <alexdeuc...@gmail.com>; Zhao, Victor <victor.z...@amd.com> > Cc: amd-gfx@lists.freedesktop.org; Liu, Monk <monk....@amd.com>; Yang, > Philip <philip.y...@amd.com>; Kuehling, Felix <felix.kuehl...@amd.com> > Subject: Re: [PATCH 2/2] drm/amdkfd: use cache GTT buffer for PQ and > wb pool > > Am 07.11.24 um 06:58 schrieb Lazar, Lijo: >> On 11/6/2024 8:42 PM, Alex Deucher wrote: >>> On Wed, Nov 6, 2024 at 1:49 AM Victor Zhao <victor.z...@amd.com> wrote: >>>> From: Monk Liu <monk....@amd.com> >>>> >>>> As cache GTT buffer is snooped, this way the coherence between CPU >>>> write and GPU fetch is guaranteed, but original code uses WC + >>>> unsnooped for HIQ PQ(ring buffer) which introduces coherency issues: >>>> MEC fetches a stall data from PQ and leads to MEC hang. >>> Can you elaborate on this? I can see CPU reads being slower because >>> the memory is uncached, but the ring buffer is mostly writes anyway. >>> IIRC, the driver uses USWC for most if not all of the other ring >>> buffers managed by the kernel. Why aren't those a problem? >> We have this on other rings - >> mb(); >> amdgpu_ring_set_wptr(ring); >> >> I think the solution should be to use barrier before write pointer >> updates rather than relying on PCIe snooping. > > Yeah, completely agree as well. The barrier also takes care of preventing the > compiler from re-ordering writes. > > Regards, > Christian. > >> >> Thanks, >> Lijo >> >>> Alex >>> >>>> Signed-off-by: Monk Liu <monk....@amd.com> >>>> --- >>>> drivers/gpu/drm/amd/amdkfd/kfd_device.c | 2 +- >>>> 1 file changed, 1 insertion(+), 1 deletion(-) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c >>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c >>>> index 1f1d79ac5e6c..fb087a0ff5bc 100644 >>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c >>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c >>>> @@ -779,7 +779,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd, >>>> if (amdgpu_amdkfd_alloc_gtt_mem( >>>> kfd->adev, size, &kfd->gtt_mem, >>>> &kfd->gtt_start_gpu_addr, &kfd->gtt_start_cpu_ptr, >>>> - false, true)) { >>>> + false, false)) { >>>> dev_err(kfd_device, "Could not allocate %d bytes\n", >>>> size); >>>> goto alloc_gtt_mem_failure; >>>> } >>>> -- >>>> 2.34.1 >>>> >