On Wed, Dec 17, 2025 at 4:03 AM Donet Tom <[email protected]> wrote:
>
>
> On 12/16/25 7:32 PM, Alex Deucher wrote:
> > On Tue, Dec 16, 2025 at 8:55 AM Donet Tom <[email protected]> wrote:
> >>
> >> On 12/15/25 7:39 PM, Alex Deucher wrote:
> >>> On Mon, Dec 15, 2025 at 4:47 AM Christian König
> >>> <[email protected]> wrote:
> >>>> On 12/12/25 18:24, Alex Deucher wrote:
> >>>>> On Fri, Dec 12, 2025 at 8:19 AM Christian König
> >>>>> <[email protected]> wrote:
> >>>>>> On 12/12/25 11:45, Ritesh Harjani (IBM) wrote:
> >>>>>>> Christian König <[email protected]> writes:
> >>>>>>>>> Setup details:
> >>>>>>>>> ============
> >>>>>>>>> System details: Power10 LPAR using 64K pagesize.
> >>>>>>>>> AMD GPU:
> >>>>>>>>>     Name:                    gfx90a
> >>>>>>>>>     Marketing Name:          AMD Instinct MI210
> >>>>>>>>>
> >>>>>>>>> Queries:
> >>>>>>>>> =======
> >>>>>>>>> 1. We have so far run the rocr-debug agent tests [1] and the rccl
> >>>>>>>>>    unit tests [2] against these changes. Is there anything else you
> >>>>>>>>>    would suggest we run to shake out any other page size related
> >>>>>>>>>    issues w.r.t. the kernel driver?
> >>>>>>>> The ROCm team needs to answer that.
> >>>>>>>>
> >>>>>>> Is there any separate mailing list or list of people whom we can cc
> >>>>>>> then?
> >>>>>> With Felix on CC you already have the right person, but he's on
> >>>>>> vacation and will not be back before the end of the year.
> >>>>>>
> >>>>>> I can check on Monday if some people are still around who could
> >>>>>> answer a couple of questions, but in general don't expect a quick
> >>>>>> response.
> >>>>>>
> >>>>>>>>> 2. Patch 1/8: We have a query regarding the EOP buffer size. Is
> >>>>>>>>>    this EOP ring buffer size HW dependent? Should it be made
> >>>>>>>>>    PAGE_SIZE?
> >>>>>>>> Yes and no.
> >>>>>>>>
> >>>>>>> Could you elaborate on this a bit more, please? I assume you will
> >>>>>>> respond with more context / details on Patch-1 itself anyway. If
> >>>>>>> yes, that would be great!
> >>>>>> Well, in general the EOP (End of Pipe) buffer is a ring buffer of
> >>>>>> all the events and actions the CP should execute when shaders and
> >>>>>> cache flushes finish.
> >>>>>>
> >>>>>> The size depends on the HW generation, the configuration of the GPU,
> >>>>>> etc., but don't ask me for the details of how that is calculated.
> >>>>>>
> >>>>>> The point is that the size is completely unrelated to the CPU, so 
> >>>>>> using PAGE_SIZE is clearly incorrect.
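> >>>>>>
> >>>>>> As a rough illustration of the point (made-up names and values,
> >>>>>> not the real code), the size is a per-generation constant:
> >>>>>>
> >>>>>>   enum gpu_generation { GPU_GEN_A, GPU_GEN_B };
> >>>>>>
> >>>>>>   /* The EOP ring size is a property of the GPU generation and
> >>>>>>    * configuration, not of the CPU. Using PAGE_SIZE here would
> >>>>>>    * silently grow the ring 16x on a 64K-page kernel, which the
> >>>>>>    * HW knows nothing about. */
> >>>>>>   static unsigned int eop_ring_size(enum gpu_generation gen)
> >>>>>>   {
> >>>>>>           switch (gen) {
> >>>>>>           case GPU_GEN_A: return 2048; /* made-up value */
> >>>>>>           case GPU_GEN_B:
> >>>>>>           default:        return 4096; /* made-up value */
> >>>>>>           }
> >>>>>>   }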
> >>>>>>
> >>>>>>>>> 3. Patch 5/8: We also have a query w.r.t. the error paths when the
> >>>>>>>>>    system page size is > 4K. Do we need to lift this restriction
> >>>>>>>>>    and add MMIO remap support for systems with non-4K page sizes?
> >>>>>>>> The problem is the HW can't do this.
> >>>>>>>>
> >>>>>>> We aren't that familiar with the HW / SW stack here. We wanted to
> >>>>>>> understand what functionality will be unsupported due to this HW
> >>>>>>> limitation.
> >>>>>> The problem is that the CPU must map some of the registers/resources 
> >>>>>> of the GPU into the address space of the application and you run into 
> >>>>>> security issues when you map more than 4k at a time.
> >>>>> Right.  There are some 4K pages within the MMIO register BAR which are
> >>>>> empty, and registers can be remapped into them.  In this case we remap
> >>>>> the HDP flush registers into one of those register pages.  This allows
> >>>>> applications to flush the HDP write FIFO from either the CPU or
> >>>>> another device.  This is needed to flush data written by the CPU or
> >>>>> another device to the VRAM BAR out to VRAM (i.e., so the GPU can see
> >>>>> it).  This is flushed internally as part of the shader dispatch
> >>>>> packets,
> >>>> As far as I know this is only done for graphics shader submissions to 
> >>>> the classic CS interface, but not for compute dispatches through ROCm 
> >>>> queues.
> >>> There is an explicit PM4 packet to flush the HDP cache for userqs,
> >>> and for AQL the flush is handled via one of the flags in the dispatch
> >>> packet.  The MMIO remap is needed for more fine-grained use cases
> >>> where you might have the CPU or another device operating in a
> >>> gang-like scenario with the GPU.
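> >>>
> >>> From memory (double-check against hsa.h), on the AQL side it is the
> >>> fence-scope bits in the dispatch packet header that request the
> >>> system-scope flush, something like:
> >>>
> >>>   #include <hsa/hsa.h>
> >>>
> >>>   /* System-scope acquire/release around the dispatch covers the
> >>>    * HDP, so no separate MMIO write is needed for this case. */
> >>>   uint16_t header = 0;
> >>>   header |= HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE;
> >>>   header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE;
> >>>   header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCRELEASE_FENCE_SCOPE;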
> >>
> >> Thank you, Alex.
> >>
> >> We were encountering an issue while running the RCCL unit tests. With 2
> >> GPUs, all tests passed successfully; however, when running with more
> >> than 2 GPUs, the tests began to fail at random points with the following
> >> errors:
> >>
> >> [  598.576821] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with doorbell_id: 80030008
> >> [  606.696820] amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
> >> [  606.696826] amdgpu 0048:0f:00.0: amdgpu: GPU reset begin!. Source: 4
> >> [  610.696852] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with doorbell_id: 80030008
> >> [  610.696869] amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
> >> [  610.696942] amdgpu 0048:0f:00.0: amdgpu: Failed to restore process queues
> >>
> >>
> >> After applying patches 7/8 and 8/8, we are no longer seeing this issue.
> >>
> >> One question I have is: we only started observing this problem when the
> >> number of GPUs increased. Could this be related to MMIO remapping not
> >> being available?
> > It could be.  E.g., if the CPU or a GPU writes data to VRAM on another
> > GPU, you will need to flush the HDP to make sure that data hits VRAM
> > before the GPU attached to the VRAM can see it.
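> >
> > Very roughly, the userspace sequence in that case looks like this
> > (names invented; get_hdp_flush_mmio() stands for mmap()ing the
> > remapped MMIO page, get_peer_vram() for mapping the peer's VRAM BAR):
> >
> >   volatile uint32_t *hdp_flush = get_hdp_flush_mmio();
> >   uint32_t *peer_vram = get_peer_vram();
> >
> >   memcpy(peer_vram, src, size);  /* CPU writes may sit in the HDP FIFO */
> >   *hdp_flush = 1;                /* drain the FIFO out to VRAM         */
> >   /* only now may the GPU that owns the VRAM be told the data is ready */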
>
>
> Thanks Alex
>
> I am now suspecting that the queue preemption issue may be related to
> the unavailability of MMIO remapping. I am not very familiar with this area.
>
> Could you please point me to the relevant code path where the PM4 packet
> is issued to flush the HDP cache?

+ David who is more familiar with the ROCm runtime.

PM4 has a packet called HDP_FLUSH which flushes the HDP.  For AQL,
it's handled by one of the flags I think.  Most things in ROCm use
AQL.

@David Yat Sin Can you point to how HDP flushes are handled in the ROCm runtime?

Alex

>
> I am consistently able to reproduce this issue on my system when using
> more than three GPUs if patches 7/8 and 8/8 are not applied. In your
> opinion, is there anything that can be done to speed up the HDP flush or
> to avoid this situation altogether?
>
>
>
> >
> > Alex
> >
> >>
> >>> Alex
> >>>
> >>>> That's the reason why ROCm needs the remapped MMIO register BAR.
> >>>>
> >>>>> but there are certain cases where an application may want
> >>>>> more control.  This is probably not a showstopper for most ROCm apps.
> >>>> Well, the problem is that you absolutely need the HDP
> >>>> flush/invalidation for 100% correctness. It does work most of the
> >>>> time without it, but you then risk data corruption.
> >>>>
> >>>> Apart from making the flush/invalidate an IOCTL, I think we could
> >>>> also just use a global dummy page in VRAM.
> >>>>
> >>>> If you make two 32-bit writes which are far apart from each other
> >>>> and then read back a 32-bit value from VRAM, that should invalidate
> >>>> the HDP as well. It's less efficient than the MMIO BAR remap but
> >>>> still much better than going through an IOCTL.
> >>>>
> >>>> The only tricky part is that you need to get the HW barriers with the 
> >>>> doorbell write right.....
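> >>>>
> >>>> In pseudo code the idea would be something like (untested sketch,
> >>>> dummy_page stands for the global dummy page mapped through the BAR):
> >>>>
> >>>>   volatile uint32_t *vram = dummy_page;
> >>>>
> >>>>   vram[0]  = 0xdeadbeef;  /* two 32-bit writes, far enough apart   */
> >>>>   vram[64] = 0xcafebabe;  /* to land in different HDP cache lines  */
> >>>>   (void)vram[0];          /* the read back flushes/invalidates HDP */
> >>>>   /* ...then the HW barrier, then the doorbell write. */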
> >>>>
> >>>>> That said, the region is only 4K so if you allow applications to map a
> >>>>> larger region they would get access to GPU register pages which they
> >>>>> shouldn't have access to.
> >>>> But don't we also have problems with the doorbell? E.g. the global
> >>>> aggregated one needs to be 4k as well, or is it ok to over-allocate
> >>>> there?
> >>>>
> >>>> Thinking more about it, there is also a major problem with page
> >>>> tables. Those are 4k by default on modern systems as well, and while
> >>>> over-allocating them to 64k is possible, that not only wastes some
> >>>> VRAM but can also result in OOM situations, because we can't allocate
> >>>> the necessary page tables to switch from 2MiB to 4k pages in some
> >>>> cases.
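> >>>>
> >>>> (To put numbers on it: a page table covering a 2MiB range at 4k
> >>>> granularity needs 512 8-byte entries, i.e. one 4k page; allocating
> >>>> it as 64k wastes the other 60k, a 16x overhead per table.)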
> >>>>
> >>>> Christian.
> >>>>
> >>>>> Alex
> >>>>>
> >>>>>>>>> [1] ROCr debug agent tests: https://github.com/ROCm/rocr_debug_agent
> >>>>>>>>> [2] RCCL tests: https://github.com/ROCm/rccl/tree/develop/test
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Please note that the changes in this series are on a best-effort
> >>>>>>>>> basis from our end. We therefore request the amd-gfx community (who
> >>>>>>>>> have deeper knowledge of the HW & SW stack) to kindly help with the
> >>>>>>>>> review and provide feedback / comments on these patches. The idea
> >>>>>>>>> here is to also have non-4K page sizes (e.g. 64K) well supported by
> >>>>>>>>> the amd gpu kernel driver.
> >>>>>>>> Well, this is generally nice to have, but there are unfortunately
> >>>>>>>> some HW limitations which make ROCm pretty much unusable on non-4k
> >>>>>>>> page size systems.
> >>>>>>> That's a bummer :(
> >>>>>>> - Do we have some HW documentation on what these limitations around
> >>>>>>> non-4K page sizes are? Any links, please?
> >>>>>> You already mentioned MMIO remap which obviously has that problem, but 
> >>>>>> if I'm not completely mistaken the PCIe doorbell BAR and some global 
> >>>>>> seq counter resources will also cause problems here.
> >>>>>>
> >>>>>> This can all be worked around by delegating those MMIO accesses into 
> >>>>>> the kernel, but that means tons of extra IOCTL overhead.
> >>>>>>
> >>>>>> Especially the cache flushes which are necessary to avoid corruption 
> >>>>>> are really bad for performance in such an approach.
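> >>>>>>
> >>>>>> I.e. instead of a single MMIO write, every flush would become a
> >>>>>> syscall, something like (purely hypothetical interface, nothing
> >>>>>> like this exists today):
> >>>>>>
> >>>>>>   struct hdp_flush_args args = { .gpu_id = gpu_id };
> >>>>>>   /* one syscall per flush, on a potentially hot path */
> >>>>>>   ioctl(kfd_fd, AMDKFD_IOC_HDP_FLUSH /* hypothetical */, &args);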
> >>>>>>
> >>>>>>> - Are there any newer AMD GPU versions which maybe lift such
> >>>>>>> restrictions?
> >>>>>> Not that I know of.
> >>>>>>
> >>>>>>>> What we can do is to support graphics and MM, but that should 
> >>>>>>>> already work out of the box.
> >>>>>>>>
> >>>>>>> - Maybe we should also document what will work and what won't work
> >>>>>>> due to these HW limitations.
> >>>>>> Well, pretty much everything. I need to double-check how ROCm does
> >>>>>> HDP flushing/invalidating when the MMIO remap isn't available.
> >>>>>>
> >>>>>> Could be that there is already a fallback path and that's the reason 
> >>>>>> why this approach actually works at all.
> >>>>>>
> >>>>>>>> What we can do is to support graphics and MM, but that should
> >>>>>>>> already work out of the box.
> >>>>>>> So these patches helped us resolve most of the issues, like SDMA
> >>>>>>> hangs and GPU kernel page faults, which we saw with the rocr and
> >>>>>>> rccl tests with a 64K page size. Meaning, we didn't see this working
> >>>>>>> out of the box, perhaps due to the 64K page size.
> >>>>>> Yeah, but this is all for ROCm and not the graphics side.
> >>>>>>
> >>>>>> To be honest, I'm not sure how ROCm even works when you have 64k
> >>>>>> pages at the moment. I would expect many more issues lurking in the
> >>>>>> kernel driver.
> >>>>>>
> >>>>>>> AFAIU, some of these patches may require re-work based on reviews, but
> >>>>>>> at least with these changes, we were able to see all the tests 
> >>>>>>> passing.
> >>>>>>>
> >>>>>>>> I need to talk with Alex and the ROCm team about whether
> >>>>>>>> workarounds can be implemented for those issues.
> >>>>>>>>
> >>>>>>> Thanks a lot! That would be super helpful!
> >>>>>>>
> >>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Christian.
> >>>>>>>>
> >>>>>>> Thanks again for the quick response on the patch series.
> >>>>>> You are welcome, but since it's so near the end of the year, not all
> >>>>>> people are available anymore.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Christian.
> >>>>>>
> >>>>>>> -ritesh
