
Hi Lijo

This is the patch we verified before:

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
index 4843dcb9a5f7..39553c7648eb 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
@@ -308,10 +308,12 @@ int kq_submit_packet(struct kernel_queue *kq)

        if (kq->dev->kfd->device_info.doorbell_size == 8) {
                *kq->wptr64_kernel = kq->pending_wptr64;
+               mb();
                write_kernel_doorbell64(kq->queue->properties.doorbell_ptr,
                                        kq->pending_wptr64);
        } else {
                *kq->wptr_kernel = kq->pending_wptr;
+               mb();
                write_kernel_doorbell(kq->queue->properties.doorbell_ptr,
                                        kq->pending_wptr);
        }


This mb() didn't resolve the problem during the customer's testing. I also thought 
of mb() first in the beginning, like you and Christian ...
The mb() here should resolve the re-ordering between the WPTR update and the 
doorbell kick, so the GPU won't fetch stale data from the WPTR polling area once it 
receives the notification from the doorbell kick.
(In SR-IOV we set the doorbell mode to force the GPU to still fetch from the WPTR 
polling area; the doorbell kick is just a notification to the GPU.)

And by your theory, mb() should flush the WC store buffer to memory; thus this 
mb() should also make sure that the ring buffer is not holding stale data 
anymore, right?
But we still hit the hang and see stale data in the dump.

Maybe we need to put mb() in another place? Can you propose one if you insist that 
the cached mapping is not acceptable to you, and we can ask the customer to verify 
again.
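
For reference, here is a rough sketch of what I mean by "another place" - purely 
hypothetical, not the verified patch (the pq_kernel_addr name is just assumed from 
the kernel_queue struct): order the PQ writes themselves before the WPTR update, in 
addition to ordering the WPTR update before the doorbell kick:

        /* ... packets already written into kq->pq_kernel_addr ... */
        mb();                                /* make the PQ writes visible before the WPTR update */
        *kq->wptr_kernel = kq->pending_wptr; /* publish WPTR to the polling area */
        mb();                                /* make the WPTR update visible before the doorbell kick */
        write_kernel_doorbell(kq->queue->properties.doorbell_ptr,
                                kq->pending_wptr);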

Thanks

Monk Liu | Cloud GPU & Virtualization | AMD







-----Original Message-----
From: Lazar, Lijo <lijo.la...@amd.com>
Sent: Friday, November 8, 2024 7:26 PM
To: Liu, Monk <monk....@amd.com>; Koenig, Christian <christian.koe...@amd.com>; 
Alex Deucher <alexdeuc...@gmail.com>; Zhao, Victor <victor.z...@amd.com>
Cc: amd-gfx@lists.freedesktop.org; Yang, Philip <philip.y...@amd.com>; 
Kuehling, Felix <felix.kuehl...@amd.com>
Subject: Re: [PATCH 2/2] drm/amdkfd: use cache GTT buffer for PQ and wb pool



On 11/8/2024 4:29 PM, Liu, Monk wrote:
>
> To be clear on the mb() approach: even if we insert mb() prior to 
> amdgpu_ring_set_wptr(ring), the GPU might still fetch stale data from the PQ due 
> to the USWC attributes.
>
The assumption that inserting an mb() doesn't cause a WC buffer flush is wrong.

After an mb(), all prior loads/stores are supposed to be globally visible. Hence an 
mb() followed by a write pointer update should also give the same guarantee (from 
the Arch manual):

        The MFENCE instruction establishes a memory fence for both loads and 
stores. The processor ensures that no load or store after MFENCE will become 
globally visible *until all loads and stores before MFENCE are globally 
visible.*
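
As a minimal illustration of the ordering this guarantees (variable names here are 
only illustrative, not the actual driver code):

        ring[wptr & ring_mask] = packet;  /* store into the (USWC) ring buffer */
        *wptr_poll = new_wptr;            /* store the new write pointer to the polling area */
        mb();                             /* MFENCE: all stores above become globally visible */
        writel(new_wptr, doorbell);       /* the doorbell write cannot pass the stores above */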


Ignoring the amdgpu driver part of it - if mb() is not working as expected, as you 
claim, that means something is wrong with the system.

USWC or WB for ring type may still be a separate discussion.

Thanks,
Lijo

> The issue here is not the re-ordering but the stale data in the PQ.
>
> Monk Liu | Cloud GPU & Virtualization | AMD
>
>
>
> -----Original Message-----
> From: Liu, Monk
> Sent: Friday, November 8, 2024 6:22 PM
> To: Koenig, Christian <christian.koe...@amd.com>; Lazar, Lijo
> <lijo.la...@amd.com>; Alex Deucher <alexdeuc...@gmail.com>; Zhao,
> Victor <victor.z...@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Yang, Philip <philip.y...@amd.com>;
> Kuehling, Felix <felix.kuehl...@amd.com>
> Subject: RE: [PATCH 2/2] drm/amdkfd: use cache GTT buffer for PQ and
> wb pool
>
> Christian/Lijo
>
> We verified all the approaches, and we know what works and what doesn't; 
> obviously the mb() doesn't work.
>
> Putting an mb() between setting wptr_polling and kicking off the doorbell is 
> theoretically correct, and I don't object to doing so... but this won't resolve 
> the issue we hit.
> First of all, with USWC there is some chance that the data is still in the CPU's 
> WC store buffer and not yet flushed to memory, and even with an mb() getting rid 
> of the re-ordering, the GPU might still have a chance to read stale data from the 
> ring buffer as long as it is mapped USWC.
>
> This is why only cached plus snooped memory can get rid of the inconsistency issues.
>
> Besides, do you know the rationale for keeping all GFX KCQs cached+snooped while 
> only the HIQ/KIQ from KFD is configured as USWC?
>
> As for the performance concern, that always looks to me like a second priority 
> compared to "correct", especially since the HIQ contributes very little to ROCm 
> performance when switching to a cached mapping.
>
>
> Monk Liu | Cloud GPU & Virtualization | AMD
>
>
>
>
>
>
>
> -----Original Message-----
> From: Koenig, Christian <christian.koe...@amd.com>
> Sent: Thursday, November 7, 2024 7:57 PM
> To: Lazar, Lijo <lijo.la...@amd.com>; Alex Deucher
> <alexdeuc...@gmail.com>; Zhao, Victor <victor.z...@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Liu, Monk <monk....@amd.com>; Yang,
> Philip <philip.y...@amd.com>; Kuehling, Felix <felix.kuehl...@amd.com>
> Subject: Re: [PATCH 2/2] drm/amdkfd: use cache GTT buffer for PQ and
> wb pool
>
> On 07.11.24 at 06:58, Lazar, Lijo wrote:
>> On 11/6/2024 8:42 PM, Alex Deucher wrote:
>>> On Wed, Nov 6, 2024 at 1:49 AM Victor Zhao <victor.z...@amd.com> wrote:
>>>> From: Monk Liu <monk....@amd.com>
>>>>
>>>> As a cached GTT buffer is snooped, coherence between CPU writes and GPU
>>>> fetches is guaranteed, but the original code uses WC + unsnooped for the
>>>> HIQ PQ (ring buffer), which introduces coherency issues: the MEC fetches
>>>> stale data from the PQ and leads to an MEC hang.
>>> Can you elaborate on this?  I can see CPU reads being slower because
>>> the memory is uncached, but the ring buffer is mostly writes anyway.
>>> IIRC, the driver uses USWC for most if not all of the other ring
>>> buffers managed by the kernel.  Why aren't those a problem?
>> We have this on other rings -
>>          mb();
>>          amdgpu_ring_set_wptr(ring);
>>
>> I think the solution should be to use barrier before write pointer
>> updates rather than relying on PCIe snooping.
>
> Yeah, completely agree as well. The barrier also takes care of preventing the 
> compiler from re-ordering writes.
>
> Regards,
> Christian.
>
>>
>> Thanks,
>> Lijo
>>
>>> Alex
>>>
>>>> Signed-off-by: Monk Liu <monk....@amd.com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c | 2 +-
>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> index 1f1d79ac5e6c..fb087a0ff5bc 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> @@ -779,7 +779,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>>>          if (amdgpu_amdkfd_alloc_gtt_mem(
>>>>                          kfd->adev, size, &kfd->gtt_mem,
>>>>                          &kfd->gtt_start_gpu_addr, &kfd->gtt_start_cpu_ptr,
>>>> -                       false, true)) {
>>>> +                       false, false)) {
>>>>                  dev_err(kfd_device, "Could not allocate %d bytes\n", size);
>>>>                  goto alloc_gtt_mem_failure;
>>>>          }
>>>> --
>>>> 2.34.1
>>>>
>
