HDP flushes are done in ROCm using these three methods:

1. For AQL packets, this is done by setting the system-scope acquire and
release fences in the packet header (see sketch 1 below).
     For example, the header is set here:
     
https://github.com/ROCm/rocm-systems/blob/develop/projects/rocr-runtime/runtime/hsa-runtime/core/runtime/amd_blit_kernel.cpp#L878

     And the header values are defined here:
     
https://github.com/ROCm/rocm-systems/blob/develop/projects/clr/rocclr/device/rocm/rocvirtual.cpp#L85


2. Via an SDMA packet. This is done before a memory copy (see sketch 2 below).
     The function is called here:
        
https://github.com/ROCm/rocm-systems/blob/develop/projects/rocr-runtime/runtime/hsa-runtime/core/runtime/amd_blit_sdma.cpp#L484
     And the packet (POLL_REGMEM) is generated here:
        
https://github.com/ROCm/rocm-systems/blob/develop/projects/rocr-runtime/runtime/hsa-runtime/core/runtime/amd_blit_sdma.cpp#L1154


3. By writing to an MMIO-remapped address (see sketch 3 below).
            The address is stored in rocclr here:
        
https://github.com/ROCm/rocm-systems/blob/develop/projects/clr/rocclr/device/rocm/rocdevice.cpp#L607

            And the flush is triggered by writing a 1, e.g. here:
        
https://github.com/ROCm/rocm-systems/blob/develop/projects/clr/rocclr/device/rocm/rocvirtual.cpp#L3831
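
Sketch 1 -- a minimal illustration (not the exact rocr code) of an AQL
dispatch header with system-scope fences; these scope bits are what make
the CP flush/invalidate the HDP around the dispatch. The names are from
the public hsa.h:

#include <hsa/hsa.h>
#include <cstdint>

static uint16_t SystemScopeDispatchHeader() {
  uint16_t header = HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE;
  // System scope orders the dispatch against CPU and peer-device memory.
  header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE;
  header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCRELEASE_FENCE_SCOPE;
  return header;
}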

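Sketch 2 -- the rough 6-DWORD shape of the SDMA POLL_REGMEM packet used
for the HDP flush. The opcode and hdp_flush bit position follow the
public amdgpu SDMA 4.x packet headers; the register offsets and ref/mask
values are ASIC-specific, and the helper below is only illustrative:

#include <cstdint>

constexpr uint32_t kSdmaOpPollRegmem = 8;    // POLL_REGMEM opcode
constexpr uint32_t kHdpFlushBit = 1u << 26;  // hdp_flush header bit (SDMA 4.x)

// Appends a POLL_REGMEM packet that requests an HDP flush and then polls
// the flush-done register until it matches the reference value.
inline uint32_t* AppendHdpFlush(uint32_t* cmd, uint32_t done_reg,
                                uint32_t req_reg, uint32_t ref_and_mask) {
  *cmd++ = kSdmaOpPollRegmem | kHdpFlushBit;
  *cmd++ = done_reg << 2;  // flush-done register (byte address)
  *cmd++ = req_reg << 2;   // flush-request register (byte address)
  *cmd++ = ref_and_mask;   // reference value
  *cmd++ = ref_and_mask;   // mask
  *cmd++ = 0;              // DW5: retry count / poll interval (ASIC-specific)
  return cmd;
}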

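Sketch 3 -- flushing the HDP from the CPU through the remapped MMIO
page. rocclr caches the mapped address in its device info (first link
above); the same register can also be queried through the HSA extension
API, as assumed below. Any 32-bit write to the register triggers the
flush:

#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>
#include <atomic>
#include <cstdint>

inline void CpuHdpFlush(hsa_agent_t gpu_agent) {
  hsa_amd_hdp_flush_t hdp = {};
  hsa_agent_get_info(gpu_agent,
                     (hsa_agent_info_t)HSA_AMD_AGENT_INFO_HDP_FLUSH, &hdp);
  // Conservatively order all prior CPU stores to the VRAM BAR first.
  std::atomic_thread_fence(std::memory_order_seq_cst);
  volatile uint32_t* flush_cntl = hdp.HDP_MEM_FLUSH_CNTL;
  *flush_cntl = 1u;
}
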
Regards,
David


> -----Original Message-----
> From: Alex Deucher <[email protected]>
> Sent: Wednesday, December 17, 2025 9:23 AM
> To: Donet Tom <[email protected]>; Yat Sin, David
> <[email protected]>
> Cc: Koenig, Christian <[email protected]>; Ritesh Harjani (IBM)
> <[email protected]>; [email protected]; Kuehling, Felix
> <[email protected]>; Deucher, Alexander
> <[email protected]>; Russell, Kent <[email protected]>;
> Vaidyanathan Srinivasan <[email protected]>; Mukesh Kumar Chaurasiya
> <[email protected]>
> Subject: Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page
> size systems
>
> On Wed, Dec 17, 2025 at 4:03 AM Donet Tom <[email protected]> wrote:
> >
> >
> > On 12/16/25 7:32 PM, Alex Deucher wrote:
> > > On Tue, Dec 16, 2025 at 8:55 AM Donet Tom <[email protected]> wrote:
> > >>
> > >> On 12/15/25 7:39 PM, Alex Deucher wrote:
> > >>> On Mon, Dec 15, 2025 at 4:47 AM Christian König
> > >>> <[email protected]> wrote:
> > >>>> On 12/12/25 18:24, Alex Deucher wrote:
> > >>>>> On Fri, Dec 12, 2025 at 8:19 AM Christian König
> > >>>>> <[email protected]> wrote:
> > >>>>>> On 12/12/25 11:45, Ritesh Harjani (IBM) wrote:
> > >>>>>>> Christian König <[email protected]> writes:
> > >>>>>>>>> Setup details:
> > >>>>>>>>> ============
> > >>>>>>>>> System details: Power10 LPAR using 64K pagesize.
> > >>>>>>>>> AMD GPU:
> > >>>>>>>>>     Name:                    gfx90a
> > >>>>>>>>>     Marketing Name:          AMD Instinct MI210
> > >>>>>>>>>
> > >>>>>>>>> Queries:
> > >>>>>>>>> =======
> > >>>>>>>>> 1. We currently ran rocr-debug agent tests [1] and rccl unit
> > >>>>>>>>> tests [2] to test these changes. Is there anything else that
> > >>>>>>>>> you would suggest us to run to shake out any other page size
> > >>>>>>>>> related issues w.r.t the kernel driver?
> > >>>>>>>> The ROCm team needs to answer that.
> > >>>>>>>>
> > >>>>>>> Is there any separate mailing list or list of people whom we
> > >>>>>>> can cc then?
> > >>>>>> With Felix on CC you already got the right person, but he's on
> > >>>>>> vacation and will not be back before the end of the year.
> > >>>>>>
> > >>>>>> I can check on Monday if some people are still around who could
> > >>>>>> answer a couple of questions, but in general don't expect a quick
> > >>>>>> response.
> > >>>>>>
> > >>>>>>>>> 2. Patch 1/8: We have a query regarding the EOP buffer size.
> > >>>>>>>>> Is this EOP ring buffer size HW dependent? Should it be made
> > >>>>>>>>> PAGE_SIZE?
> > >>>>>>>> Yes and no.
> > >>>>>>>>
> > >>>>>>> Could you elaborate more on this, please? I am assuming you
> > >>>>>>> would anyway respond with more context / details on Patch-1
> > >>>>>>> itself. If yes, that would be great!
> > >>>>>> Well, in general the EOP (End of Pipe) buffer contains a ring
> > >>>>>> buffer of all the events and actions the CP should execute when
> > >>>>>> shaders and cache flushes finish.
> > >>>>>>
> > >>>>>> The size depends on the HW generation and configuration of the
> > >>>>>> GPU etc., but don't ask me for details on how that is calculated.
> > >>>>>>
> > >>>>>> The point is that the size is completely unrelated to the CPU,
> > >>>>>> so using PAGE_SIZE is clearly incorrect.
> > >>>>>>
> > >>>>>>>>> 3. Patch 5/8: We also have a query w.r.t the error paths when
> > >>>>>>>>> system page size > 4K. Do we need to lift this restriction and
> > >>>>>>>>> add MMIO remap support for systems with non-4K page sizes?
> > >>>>>>>> The problem is the HW can't do this.
> > >>>>>>>>
> > >>>>>>> We aren't that familiar with the HW / SW stack here. We wanted
> > >>>>>>> to understand what functionality will be unsupported due to this
> > >>>>>>> HW limitation then?
> > >>>>>> The problem is that the CPU must map some of the
> > >>>>>> registers/resources of the GPU into the address space of the
> > >>>>>> application, and you run into security issues when you map more
> > >>>>>> than 4k at a time.
> > >>>>> Right.  There are some 4K pages within the MMIO register BAR which
> > >>>>> are empty and registers can be remapped into them.  In this case
> > >>>>> we remap the HDP flush registers into one of those register
> > >>>>> pages.  This allows applications to flush the HDP write FIFO
> > >>>>> from either the CPU or another device.  This is needed to flush
> > >>>>> data written by the CPU or another device to the VRAM BAR out to
> > >>>>> VRAM (i.e., so the GPU can see it).  This is flushed internally
> > >>>>> as part of the shader dispatch packets,
> > >>>> As far as I know this is only done for graphics shader submissions
> > >>>> to the classic CS interface, but not for compute dispatches through
> > >>>> ROCm queues.
> > >>> There is an explicit PM4 packet to flush the HDP cache for userqs
> > >>> and for AQL the flush is handled via one of the flags in the
> > >>> dispatch packet.  The MMIO remap is needed for more fine grained
> > >>> use cases where you might have the CPU or another device operating
> > >>> in a gang like scenario with the GPU.
> > >>
> > >> Thank you, Alex.
> > >>
> > >> We were encountering an issue while running the RCCL unit tests.
> > >> With 2 GPUs, all tests passed successfully; however, when running
> > >> with more than 2 GPUs, the tests began to fail at random points
> > >> with the following
> > >> errors:
> > >>
> > >> [  598.576821] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with doorbell_id: 80030008
> > >> [  606.696820] amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
> > >> [  606.696826] amdgpu 0048:0f:00.0: amdgpu: GPU reset begin!. Source: 4
> > >> [  610.696852] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with doorbell_id: 80030008
> > >> [  610.696869] amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
> > >> [  610.696942] amdgpu 0048:0f:00.0: amdgpu: Failed to restore process queues
> > >>
> > >>
> > >> After applying patches 7/8 and 8/8, we are no longer seeing this issue.
> > >>
> > >> One question I have is: we only started observing this problem when
> > >> the number of GPUs increased. Could this be related to MMIO
> > >> remapping not being available?
> > > It could be.  E.g., if the CPU or a GPU writes data to VRAM on
> > > another GPU, you will need to flush the HDP to make sure that data
> > > hits VRAM before the GPU attached to the VRAM can see it.
> >
> >
> > Thanks Alex
> >
> > I am now suspecting that the queue preemption issue may be related to
> > the unavailability of MMIO remapping. I am not very familiar with this area.
> >
> > Could you please point me to the relevant code path where the PM4
> > packet is issued to flush the HDP cache?
>
> + David who is more familiar with the ROCm runtime.
>
> PM4 has a packet called HDP_FLUSH which flushes the HDP.  For AQL, it's
> handled by one of the flags I think.  Most things in ROCm use AQL.
>
> @David Yat Sin Can you point to how HDP flushes are handled in the ROCm
> runtime?
>
> Alex
>
> >
> > I am consistently able to reproduce this issue on my system when using
> > more than three GPUs if patches 7/8 and 8/8 are not applied. In your
> > opinion, is there anything that can be done to speed up the HDP flush
> > or to avoid this situation altogether?
> >
> >
> >
> > >
> > > Alex
> > >
> > >>
> > >>> Alex
> > >>>
> > >>>> That's the reason why ROCm needs the remapped MMIO register BAR.
> > >>>>
> > >>>>> but there are certain cases where an application may want more
> > >>>>> control.  This is probably not a showstopper for most ROCm apps.
> > >>>> Well the problem is that you absolutely need the HDP
> > >>>> flush/invalidation for 100% correctness. It does work most of the
> > >>>> time without it, but you then risk data corruption.
> > >>>>
> > >>>> Apart from making the flush/invalidate an IOCTL, I think we could
> > >>>> also just use a global dummy page in VRAM.
> > >>>>
> > >>>> If you make two 32bit writes which are apart from each other and
> > >>>> then read back a 32bit value from VRAM, that should invalidate the
> > >>>> HDP as well. It's less efficient than the MMIO BAR remap but still
> > >>>> much better than going through an IOCTL.
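
[A rough sketch of this read-back approach, assuming the CPU has a
mapping of a small VRAM dummy buffer; the pointer and offsets here are
illustrative only:

#include <atomic>
#include <cstdint>

inline void HdpInvalidateViaDummyPage(volatile uint32_t* vram_dummy) {
  vram_dummy[0] = 1u;   // first 32-bit write
  vram_dummy[64] = 1u;  // second 32-bit write, apart from the first
  std::atomic_thread_fence(std::memory_order_seq_cst);
  (void)vram_dummy[0];  // 32-bit read back forces the HDP invalidate
}
]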
> > >>>>
> > >>>> The only tricky part is that you need to get the HW barriers with
> > >>>> the doorbell write right...
> > >>>>
> > >>>>> That said, the region is only 4K so if you allow applications to
> > >>>>> map a larger region they would get access to GPU register pages
> > >>>>> which they shouldn't have access to.
> > >>>> But don't we also have problems with the doorbell? E.g. the global
> > >>>> aggregated one needs to be 4k as well, or is it ok to over-allocate
> > >>>> there?
> > >>>>
> > >>>> Thinking more about it, there is also a major problem with page
> > >>>> tables. Those are 4k by default on modern systems as well, and while
> > >>>> over-allocating them to 64k is possible, that not only wastes some
> > >>>> VRAM but can also result in OOM situations because we can't allocate
> > >>>> the necessary page tables to switch from 2MiB to 4k pages in some
> > >>>> cases.
> > >>>>
> > >>>> Christian.
> > >>>>
> > >>>>> Alex
> > >>>>>
> > >>>>>>>>> [1] ROCr debug agent tests:
> > >>>>>>>>> https://github.com/ROCm/rocr_debug_agent
> > >>>>>>>>> [2] RCCL tests:
> > >>>>>>>>> https://github.com/ROCm/rccl/tree/develop/test
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Please note that the changes in this series are on a best
> > >>>>>>>>> effort basis from our end. Therefore, requesting the amd-gfx
> > >>>>>>>>> community (who have deeper knowledge of the HW & SW stack)
> > >>>>>>>>> to kindly help with the review and provide feedback /
> > >>>>>>>>> comments on these patches. The idea here is to also have
> > >>>>>>>>> non-4K pagesize (e.g. 64K) well supported with the amd gpu
> > >>>>>>>>> kernel driver.
> > >>>>>>>> Well this is generally nice to have, but there are
> > >>>>>>>> unfortunately some HW limitations which make ROCm pretty much
> > >>>>>>>> unusable on non-4k page size systems.
> > >>>>>>> That's a bummer :(
> > >>>>>>> - Do we have some HW documentation on what these limitations
> > >>>>>>> around non-4K pagesize are? Any links to such, please?
> > >>>>>> You already mentioned MMIO remap, which obviously has that
> > >>>>>> problem, but if I'm not completely mistaken the PCIe doorbell BAR
> > >>>>>> and some global seq counter resources will also cause problems
> > >>>>>> here.
> > >>>>>>
> > >>>>>> This can all be worked around by delegating those MMIO accesses
> > >>>>>> into the kernel, but that means tons of extra IOCTL overhead.
> > >>>>>>
> > >>>>>> Especially the cache flushes which are necessary to avoid
> > >>>>>> corruption are really bad for performance in such an approach.
> > >>>>>>
> > >>>>>>> - Are there any newer AMD GPU versions which maybe lift such
> > >>>>>>> restrictions?
> > >>>>>> Not that I know of any.
> > >>>>>>
> > >>>>>>>> What we can do is to support graphics and MM, but that should
> > >>>>>>>> already work out of the box.
> > >>>>>>>>
> > >>>>>>> - Maybe we should also document what will work and what won't
> > >>>>>>> work due to these HW limitations.
> > >>>>>> Well, pretty much everything. I need to double check how ROCm
> > >>>>>> does HDP flushing/invalidating when the MMIO remap isn't
> > >>>>>> available.
> > >>>>>>
> > >>>>>> Could be that there is already a fallback path and that's the
> > >>>>>> reason why this approach actually works at all.
> > >>>>>>
> > >>>>>>>> What we can do is to support graphics and MM, but that should
> > >>>>>>>> already work out of the box.
> > >>>>>>> So these patches helped us resolve most of the issues like
> > >>>>>>> SDMA hangs and GPU kernel page faults which we saw with rocr
> > >>>>>>> and rccl tests with 64K pagesize. Meaning, we didn't see this
> > >>>>>>> working out of the box, perhaps due to the 64K pagesize.
> > >>>>>> Yeah, but this is all for ROCm and not the graphics side.
> > >>>>>>
> > >>>>>> To be honest I'm not sure how ROCm even works when you have 64k
> > >>>>>> pages at the moment. I would expect many more issues lurking in
> > >>>>>> the kernel driver.
> > >>>>>>
> > >>>>>>> AFAIU, some of these patches may require re-work based on
> > >>>>>>> reviews, but at least with these changes, we were able to see
> > >>>>>>> all the tests passing.
> > >>>>>>>
> > >>>>>>>> I need to talk with Alex and the ROCm team about whether
> > >>>>>>>> workarounds can be implemented for those issues.
> > >>>>>>>>
> > >>>>>>> Thanks a lot! That would be super helpful!
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> Regards,
> > >>>>>>>> Christian.
> > >>>>>>>>
> > >>>>>>> Thanks again for the quick response on the patch series.
> > >>>>>> You are welcome, but since it's so near to the end of the year,
> > >>>>>> not all people are available any more.
> > >>>>>>
> > >>>>>> Regards,
> > >>>>>> Christian.
> > >>>>>>
> > >>>>>>> -ritesh
