On 11/8/2024 4:29 PM, Liu, Monk wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> To be clear for the mb() approach: Even if we insert mb() in prior to
> amdgpu_ring_set_wptr(ring), GPU might still fetch stalled data from PQ due to
> USWC attributes.
>
Inserting an mb() doesn't cause WC buffer flush is a wrong assumption.
All prior loads/stores are supposed to be globally visible. Hence mb()
followed by a write pointer update also should guarantee the same (From
Arch manual).
The MFENCE instruction establishes a memory fence for both loads and
stores. The processor ensures that no load
or store after MFENCE will become globally visible *until all loads and
stores before MFENCE are globally visible.*
Ignoring the amdgpu driver part of it - if mb() is not working as
expected as you claim that means something is wrong with the system.
USWC or WB for ring type may still be a separate discussion.
Thanks,
Lijo
> The issue here is not the re-ordering but the stalled PQ.
>
> Monk Liu | Cloud GPU & Virtualization | AMD
>
>
>
> -----Original Message-----
> From: Liu, Monk
> Sent: Friday, November 8, 2024 6:22 PM
> To: Koenig, Christian <christian.koe...@amd.com>; Lazar, Lijo
> <lijo.la...@amd.com>; Alex Deucher <alexdeuc...@gmail.com>; Zhao, Victor
> <victor.z...@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Yang, Philip <philip.y...@amd.com>;
> Kuehling, Felix <felix.kuehl...@amd.com>
> Subject: RE: [PATCH 2/2] drm/amdkfd: use cache GTT buffer for PQ and wb pool
>
> Christian/Lijo
>
> We verified all approaches, and we know what works and not works, obviously
> the mb() doesn't work.
>
> The way of mb() between set wptr_polling and kicking off doorbell is
> theoretically correct, and I'm not object to do so... but this won't resolve
> the issue we hit.
> First of all, USWC will have some chance that the data is still in CPU's WC
> storage place and not flushed to the memory and even with MB() get rid of
> re-ordering GPU might still have a chance to read stalled data from ring
> buffer as long as it is mapped USWC.
>
> This is why only cache plus snoop memory can get rid of inconsistence issues.
>
> Besides, do you know what's the rational to keep all GFX KCQ cache+snoop but
> only HIQ/KIQ from KFD configured to USWC ?
>
> For performance concern that looks to me always the second priority compared
> to "correct" especially under the case HIQ contributes very little to ROCM
> performance when switching to cache mapping.
>
>
> Monk Liu | Cloud GPU & Virtualization | AMD
>
>
>
>
>
>
>
> -----Original Message-----
> From: Koenig, Christian <christian.koe...@amd.com>
> Sent: Thursday, November 7, 2024 7:57 PM
> To: Lazar, Lijo <lijo.la...@amd.com>; Alex Deucher <alexdeuc...@gmail.com>;
> Zhao, Victor <victor.z...@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Liu, Monk <monk....@amd.com>; Yang, Philip
> <philip.y...@amd.com>; Kuehling, Felix <felix.kuehl...@amd.com>
> Subject: Re: [PATCH 2/2] drm/amdkfd: use cache GTT buffer for PQ and wb pool
>
> Am 07.11.24 um 06:58 schrieb Lazar, Lijo:
>> On 11/6/2024 8:42 PM, Alex Deucher wrote:
>>> On Wed, Nov 6, 2024 at 1:49 AM Victor Zhao <victor.z...@amd.com> wrote:
>>>> From: Monk Liu <monk....@amd.com>
>>>>
>>>> As cache GTT buffer is snooped, this way the coherence between CPU
>>>> write and GPU fetch is guaranteed, but original code uses WC +
>>>> unsnooped for HIQ PQ(ring buffer) which introduces coherency issues:
>>>> MEC fetches a stall data from PQ and leads to MEC hang.
>>> Can you elaborate on this? I can see CPU reads being slower because
>>> the memory is uncached, but the ring buffer is mostly writes anyway.
>>> IIRC, the driver uses USWC for most if not all of the other ring
>>> buffers managed by the kernel. Why aren't those a problem?
>> We have this on other rings -
>> mb();
>> amdgpu_ring_set_wptr(ring);
>>
>> I think the solution should be to use barrier before write pointer
>> updates rather than relying on PCIe snooping.
>
> Yeah, completely agree as well. The barrier also takes care of preventing the
> compiler from re-ordering writes.
>
> Regards,
> Christian.
>
>>
>> Thanks,
>> Lijo
>>
>>> Alex
>>>
>>>> Signed-off-by: Monk Liu <monk....@amd.com>
>>>> ---
>>>> drivers/gpu/drm/amd/amdkfd/kfd_device.c | 2 +-
>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> index 1f1d79ac5e6c..fb087a0ff5bc 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> @@ -779,7 +779,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>>> if (amdgpu_amdkfd_alloc_gtt_mem(
>>>> kfd->adev, size, &kfd->gtt_mem,
>>>> &kfd->gtt_start_gpu_addr, &kfd->gtt_start_cpu_ptr,
>>>> - false, true)) {
>>>> + false, false)) {
>>>> dev_err(kfd_device, "Could not allocate %d bytes\n",
>>>> size);
>>>> goto alloc_gtt_mem_failure;
>>>> }
>>>> --
>>>> 2.34.1
>>>>
>