On Wed, Dec 17, 2025 at 4:03 AM Donet Tom <[email protected]> wrote:
>
> On 12/16/25 7:32 PM, Alex Deucher wrote:
> > On Tue, Dec 16, 2025 at 8:55 AM Donet Tom <[email protected]> wrote:
> >>
> >> On 12/15/25 7:39 PM, Alex Deucher wrote:
> >>> On Mon, Dec 15, 2025 at 4:47 AM Christian König <[email protected]> wrote:
> >>>> On 12/12/25 18:24, Alex Deucher wrote:
> >>>>> On Fri, Dec 12, 2025 at 8:19 AM Christian König <[email protected]> wrote:
> >>>>>> On 12/12/25 11:45, Ritesh Harjani (IBM) wrote:
> >>>>>>> Christian König <[email protected]> writes:
> >>>>>>>>> Setup details:
> >>>>>>>>> ============
> >>>>>>>>> System details: Power10 LPAR using 64K page size.
> >>>>>>>>> AMD GPU:
> >>>>>>>>> Name: gfx90a
> >>>>>>>>> Marketing Name: AMD Instinct MI210
> >>>>>>>>>
> >>>>>>>>> Queries:
> >>>>>>>>> =======
> >>>>>>>>> 1. We currently ran the rocr-debug-agent tests [1] and the RCCL unit tests [2] to test these changes. Is there anything else that you would suggest we run to shake out any other page-size-related issues w.r.t. the kernel driver?
> >>>>>>>> The ROCm team needs to answer that.
> >>>>>>>>
> >>>>>>> Is there any separate mailing list or list of people whom we can cc then?
> >>>>>> With Felix on CC you already got the right person, but he's on vacation and will not be back before the end of the year.
> >>>>>>
> >>>>>> I can check on Monday if some people are still around who could answer a couple of questions, but in general don't expect a quick response.
> >>>>>>
> >>>>>>>>> 2. Patch 1/8: We have a query regarding the EOP buffer size. Is this EOP ring buffer size HW dependent? Should it be made PAGE_SIZE?
> >>>>>>>> Yes and no.
> >>>>>>>>
> >>>>>>> Could you elaborate on this a bit more, please? I am assuming you will respond with more context / details on Patch 1 itself anyway. If yes, that would be great!
> >>>>>> Well, in general the EOP (End of Pipe) buffer is a ring buffer of all the events and actions the CP should execute when shaders and cache flushes finish.
> >>>>>>
> >>>>>> The size depends on the HW generation and configuration of the GPU etc., but don't ask me for the details of how that is calculated.
> >>>>>>
> >>>>>> The point is that the size is completely unrelated to the CPU, so using PAGE_SIZE is clearly incorrect.
> >>>>>>
> >>>>>>>>> 3. Patch 5/8: We also have a query w.r.t. the error paths when the system page size is > 4K. Do we need to lift this restriction and add MMIO remap support for systems with non-4K page sizes?
> >>>>>>>> The problem is the HW can't do this.
> >>>>>>>>
> >>>>>>> We aren't that familiar with the HW / SW stack here. We wanted to understand which functionality will be unsupported due to this HW limitation.
> >>>>>> The problem is that the CPU must map some of the registers/resources of the GPU into the address space of the application, and you run into security issues when you map more than 4k at a time.
> >>>>> Right. There are some 4K pages within the MMIO register BAR which are empty, and registers can be remapped into them. In this case we remap the HDP flush registers into one of those register pages. This allows applications to flush the HDP write FIFO from either the CPU or another device.
> >>>>> This is needed to flush data written by the CPU or another device through the VRAM BAR out to VRAM (i.e., so the GPU can see it). This is flushed internally as part of the shader dispatch packets,
> >>>> As far as I know this is only done for graphics shader submissions to the classic CS interface, but not for compute dispatches through ROCm queues.
> >>> There is an explicit PM4 packet to flush the HDP cache for userqs, and for AQL the flush is handled via one of the flags in the dispatch packet. The MMIO remap is needed for more fine-grained use cases where you might have the CPU or another device operating in a gang-like scenario with the GPU.
> >>
> >> Thank you, Alex.
> >>
> >> We were encountering an issue while running the RCCL unit tests. With 2 GPUs, all tests passed successfully; however, when running with more than 2 GPUs, the tests began to fail at random points with the following errors:
> >>
> >> [ 598.576821] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with doorbell_id: 80030008
> >> [ 606.696820] amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
> >> [ 606.696826] amdgpu 0048:0f:00.0: amdgpu: GPU reset begin!. Source: 4
> >> [ 610.696852] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with doorbell_id: 80030008
> >> [ 610.696869] amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
> >> [ 610.696942] amdgpu 0048:0f:00.0: amdgpu: Failed to restore process queues
> >>
> >> After applying patches 7/8 and 8/8, we are no longer seeing this issue.
> >>
> >> One question I have is: we only started observing this problem when the number of GPUs increased. Could this be related to MMIO remapping not being available?
> > It could be. E.g., if the CPU or a GPU writes data to VRAM on another GPU, you will need to flush the HDP to make sure that data hits VRAM before the GPU attached to the VRAM can see it.
>
> Thanks, Alex.
>
> I am now suspecting that the queue preemption issue may be related to the unavailability of MMIO remapping. I am not very familiar with this area.
>
> Could you please point me to the relevant code path where the PM4 packet is issued to flush the HDP cache?

+ David, who is more familiar with the ROCm runtime.

PM4 has a packet called HDP_FLUSH which flushes the HDP. For AQL, it's handled by one of the flags, I think. Most things in ROCm use AQL.

@David Yat Sin, can you point to how HDP flushes are handled in the ROCm runtime?

Alex
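For orientation only, here is a minimal sketch of the AQL side, assuming the "flag" referred to above is the acquire/release fence-scope field in the AQL packet header (that is an assumption, not a confirmed mapping to an HDP flush). The enum names come from the public HSA runtime header; everything else is illustrative.

/*
 * Hedged sketch: build an AQL kernel dispatch packet header that requests
 * system-scope acquire/release fences, using the public enums from
 * <hsa/hsa.h> (ROCm install). Whether these fence-scope bits are what ends
 * up triggering the HDP flush on this hardware is an assumption based on
 * the discussion above.
 */
#include <hsa/hsa.h>
#include <stdint.h>

static uint16_t dispatch_header_system_scope(void)
{
	uint16_t header = 0;

	header |= HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE;
	/* Request system-scope memory fences around the dispatch. */
	header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE;
	header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCRELEASE_FENCE_SCOPE;

	return header;
}

Where exactly ROCR sets this when it builds dispatch packets is presumably what the question to David above is about.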
> I am consistently able to reproduce this issue on my system when using more than three GPUs if patches 7/8 and 8/8 are not applied. In your opinion, is there anything that can be done to speed up the HDP flush or to avoid this situation altogether?
>
> > Alex
> >
> >>> Alex
> >>>
> >>>> That's the reason why ROCm needs the remapped MMIO register BAR.
> >>>>
> >>>>> but there are certain cases where an application may want more control. This is probably not a showstopper for most ROCm apps.
> >>>> Well, the problem is that you absolutely need the HDP flush/invalidation for 100% correctness. It does work most of the time without it, but you then risk data corruption.
> >>>>
> >>>> Apart from making the flush/invalidate an IOCTL, I think we could also just use a global dummy page in VRAM.
> >>>>
> >>>> If you make two 32-bit writes which are apart from each other and then read back a 32-bit value from VRAM, that should invalidate the HDP as well. It's less efficient than the MMIO BAR remap but still much better than going through an IOCTL.
> >>>>
> >>>> The only tricky part is that you need to get the HW barriers with the doorbell write right...
> >>>>
> >>>>> That said, the region is only 4K, so if you allow applications to map a larger region they would get access to GPU register pages which they shouldn't have access to.
> >>>> But don't we also have problems with the doorbell? E.g. the global aggregated one needs to be 4k as well, or is it OK to over-allocate there?
> >>>>
> >>>> Thinking more about it, there is also a major problem with page tables. Those are 4k by default on modern systems as well, and while over-allocating them to 64k is possible, that not only wastes some VRAM but can also result in OOM situations because we can't allocate the necessary page tables to switch from 2MiB to 4k pages in some cases.
> >>>>
> >>>> Christian.
> >>>>
> >>>>> Alex
> >>>>>
> >>>>>>>>> [1] ROCr debug agent tests: https://github.com/ROCm/rocr_debug_agent
> >>>>>>>>> [2] RCCL tests: https://github.com/ROCm/rccl/tree/develop/test
> >>>>>>>>>
> >>>>>>>>> Please note that the changes in this series are on a best-effort basis from our end. Therefore, we request the amd-gfx community (who have deeper knowledge of the HW & SW stack) to kindly help with the review and provide feedback / comments on these patches. The idea here is to also have non-4K page sizes (e.g. 64K) well supported by the amdgpu kernel driver.
> >>>>>>>> Well, this is generally nice to have, but there are unfortunately some HW limitations which make ROCm pretty much unusable on non-4k page size systems.
> >>>>>>> That's a bummer :(
> >>>>>>> - Do we have some HW documentation around what these limitations with non-4K page sizes are? Any links to such, please?
> >>>>>> You already mentioned MMIO remap, which obviously has that problem, but if I'm not completely mistaken the PCIe doorbell BAR and some global seq counter resources will also cause problems here.
> >>>>>>
> >>>>>> This can all be worked around by delegating those MMIO accesses into the kernel, but that means tons of extra IOCTL overhead.
> >>>>>>
> >>>>>> Especially the cache flushes which are necessary to avoid corruption are really bad for performance in such an approach.
> >>>>>>
> >>>>>>> - Are there any newer AMD GPU versions which maybe lift such restrictions?
> >>>>>> Not that I know of any.
> >>>>>>
> >>>>>>>> What we can do is to support graphics and MM, but that should already work out of the box.
> >>>>>>> - Maybe we should also document what will work and what won't work due to these HW limitations.
> >>>>>> Well, pretty much everything. I need to double-check how ROCm does HDP flushing/invalidating when the MMIO remap isn't available.
> >>>>>>
> >>>>>> Could be that there is already a fallback path, and that's the reason why this approach actually works at all.
> >>>>>>
> >>>>>>>> What we can do is to support graphics and MM, but that should already work out of the box.
> >>>>>>> So these patches helped us resolve most of the issues, like SDMA hangs and GPU kernel page faults, which we saw with the rocr and rccl tests with a 64K page size. Meaning, we didn't see this working out of the box, perhaps due to the 64K page size.
> >>>>>> Yeah, but this is all for ROCm and not the graphics side.
> >>>>>>
> >>>>>> To be honest I'm not sure how ROCm even works when you have 64k pages at the moment. I would expect many more issues lurking in the kernel driver.
> >>>>>>
> >>>>>>> AFAIU, some of these patches may require re-work based on reviews, but at least with these changes, we were able to see all the tests passing.
> >>>>>>>
> >>>>>>>> I need to talk with Alex and the ROCm team about whether workarounds can be implemented for those issues.
> >>>>>>>
> >>>>>>> Thanks a lot! That would be super helpful!
> >>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Christian.
> >>>>>>>>
> >>>>>>> Thanks again for the quick response on the patch series.
> >>>>>> You are welcome, but since it's so near to the end of the year not all people are available anymore.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Christian.
> >>>>>>
> >>>>>>> -ritesh
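As a footnote to Christian's dummy-page suggestion quoted above (two spread-out 32-bit writes plus a 32-bit read-back through a CPU mapping of a dummy VRAM page), here is a rough, purely illustrative sketch of what that fallback could look like from the CPU side. The mapping, the spacing of the writes, and especially the ordering barriers relative to the doorbell write are assumptions / open questions, exactly as he notes.

/*
 * Illustrative sketch only of the fallback described in the thread: when the
 * 4K MMIO remap page can't be exposed (e.g. 64K CPU pages), touch a dummy
 * page in VRAM through its CPU mapping so the HDP write FIFO drains and the
 * HDP read path is invalidated. "vram_dummy" is a hypothetical CPU mapping
 * of such a page; the 4 KiB spacing of the two writes is an assumption.
 */
#include <stdint.h>

static void hdp_flush_via_vram_dummy_page(volatile uint32_t *vram_dummy)
{
	/* Two 32-bit writes, far enough apart from each other. */
	vram_dummy[0] = 1;
	vram_dummy[1024] = 1;		/* 1024 * 4 bytes = 4 KiB apart */

	/* Read back a 32-bit value from VRAM to invalidate the HDP as well. */
	(void)vram_dummy[0];

	/*
	 * A real implementation still has to get the HW ordering barriers
	 * right before the subsequent doorbell write -- the "tricky part"
	 * called out above.
	 */
}

The appeal, as described, is that the read-back costs a PCIe round trip per flush rather than an IOCTL.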
