amdkfd: use cache GTT buffer for PQ and wb pool

Lazar, Lijo Fri, 08 Nov 2024 03:26:44 -0800

On 11/8/2024 4:29 PM, Liu, Monk wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> To be clear for the mb() approach: Even if we insert mb() in prior to 
> amdgpu_ring_set_wptr(ring), GPU might still fetch stalled data from PQ due to 
> USWC attributes.
> 
Inserting an mb() doesn't cause WC buffer flush is a wrong assumption.

All prior loads/stores are supposed to be globally visible. Hence mb()
followed by a write pointer update also should guarantee the same (From
Arch manual).

        The MFENCE instruction establishes a memory fence for both loads and
stores. The processor ensures that no load
or store after MFENCE will become globally visible *until all loads and
stores before MFENCE are globally visible.*


Ignoring the amdgpu driver part of it - if mb() is not working as
expected as you claim that means something is wrong with the system.

USWC or WB for ring type may still be a separate discussion.

Thanks,
Lijo

> The issue here is not the re-ordering but the stalled PQ.
> 
> Monk Liu | Cloud GPU & Virtualization | AMD
> 
> 
> 
> -----Original Message-----
> From: Liu, Monk
> Sent: Friday, November 8, 2024 6:22 PM
> To: Koenig, Christian <christian.koe...@amd.com>; Lazar, Lijo 
> <lijo.la...@amd.com>; Alex Deucher <alexdeuc...@gmail.com>; Zhao, Victor 
> <victor.z...@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Yang, Philip <philip.y...@amd.com>; 
> Kuehling, Felix <felix.kuehl...@amd.com>
> Subject: RE: [PATCH 2/2] drm/amdkfd: use cache GTT buffer for PQ and wb pool
> 
> Christian/Lijo
> 
> We verified all approaches, and we know what works and not works, obviously 
> the mb() doesn't work.
> 
> The way of mb() between set wptr_polling and kicking off doorbell is 
> theoretically correct, and I'm not object to do so... but this won't resolve 
> the issue we hit.
> First of all, USWC will have some chance that the data is still in CPU's WC 
> storage place and not flushed to the memory and even with MB() get rid of 
> re-ordering GPU might still have a chance to read stalled data from ring 
> buffer as long as it is mapped USWC.
> 
> This is why only cache plus snoop memory can get rid of inconsistence issues.
> 
> Besides, do you know what's the rational to keep all GFX KCQ cache+snoop but 
> only HIQ/KIQ from KFD configured to USWC ?
> 
> For performance concern that looks to me always the second priority compared 
> to "correct" especially under the case HIQ contributes very little to ROCM 
> performance when switching to cache mapping.
> 
> 
> Monk Liu | Cloud GPU & Virtualization | AMD
> 
> 
> 
> 
> 
> 
> 
> -----Original Message-----
> From: Koenig, Christian <christian.koe...@amd.com>
> Sent: Thursday, November 7, 2024 7:57 PM
> To: Lazar, Lijo <lijo.la...@amd.com>; Alex Deucher <alexdeuc...@gmail.com>; 
> Zhao, Victor <victor.z...@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Liu, Monk <monk....@amd.com>; Yang, Philip 
> <philip.y...@amd.com>; Kuehling, Felix <felix.kuehl...@amd.com>
> Subject: Re: [PATCH 2/2] drm/amdkfd: use cache GTT buffer for PQ and wb pool
> 
> Am 07.11.24 um 06:58 schrieb Lazar, Lijo:
>> On 11/6/2024 8:42 PM, Alex Deucher wrote:
>>> On Wed, Nov 6, 2024 at 1:49 AM Victor Zhao <victor.z...@amd.com> wrote:
>>>> From: Monk Liu <monk....@amd.com>
>>>>
>>>> As cache GTT buffer is snooped, this way the coherence between CPU
>>>> write and GPU fetch is guaranteed, but original code uses WC +
>>>> unsnooped for HIQ PQ(ring buffer) which introduces coherency issues:
>>>> MEC fetches a stall data from PQ and leads to MEC hang.
>>> Can you elaborate on this?  I can see CPU reads being slower because
>>> the memory is uncached, but the ring buffer is mostly writes anyway.
>>> IIRC, the driver uses USWC for most if not all of the other ring
>>> buffers managed by the kernel.  Why aren't those a problem?
>> We have this on other rings -
>>          mb();
>>          amdgpu_ring_set_wptr(ring);
>>
>> I think the solution should be to use barrier before write pointer
>> updates rather than relying on PCIe snooping.
> 
> Yeah, completely agree as well. The barrier also takes care of preventing the 
> compiler from re-ordering writes.
> 
> Regards,
> Christian.
> 
>>
>> Thanks,
>> Lijo
>>
>>> Alex
>>>
>>>> Signed-off-by: Monk Liu <monk....@amd.com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c | 2 +-
>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> index 1f1d79ac5e6c..fb087a0ff5bc 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> @@ -779,7 +779,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>>>          if (amdgpu_amdkfd_alloc_gtt_mem(
>>>>                          kfd->adev, size, &kfd->gtt_mem,
>>>>                          &kfd->gtt_start_gpu_addr, &kfd->gtt_start_cpu_ptr,
>>>> -                       false, true)) {
>>>> +                       false, false)) {
>>>>                  dev_err(kfd_device, "Could not allocate %d bytes\n", 
>>>> size);
>>>>                  goto alloc_gtt_mem_failure;
>>>>          }
>>>> --
>>>> 2.34.1
>>>>
>
Re: [PATCH 2/2] drm/amdkfd: use cache GTT buffer for PQ and wb pool

Reply via email to