On 12/15/25 9:41 PM, Christian König wrote:
On 12/15/25 11:11, Donet Tom wrote:
On 12/15/25 3:17 PM, Christian König wrote:
On 12/12/25 18:24, Alex Deucher wrote:
On Fri, Dec 12, 2025 at 8:19 AM Christian König
<[email protected]> wrote:
On 12/12/25 11:45, Ritesh Harjani (IBM) wrote:
Christian König <[email protected]> writes:
Setup details:
============
System details: Power10 LPAR using 64K pagesize.
AMD GPU:
    Name:                    gfx90a
    Marketing Name:          AMD Instinct MI210

Queries:
=======
1. We currently ran the rocr-debug agent tests [1] and rccl unit tests [2] to test
     these changes. Is there anything else you would suggest we run to
     shake out other page-size-related issues w.r.t. the kernel driver?
The ROCm team needs to answer that.

Is there any separate mailing list or list of people whom we can cc
then?
With Felix on CC you already got the right person, but he's on vacation and 
will not be back before the end of the year.

I can check on Monday if some people are still around who could answer a 
couple of questions, but in general don't expect a quick response.

2. Patch 1/8: We have a query regarding the EOP buffer size. Is this EOP ring buffer
     size HW dependent? Should it be made PAGE_SIZE?
Yes and no.

Could you elaborate on this a bit more, please? I am assuming you will
respond with more context / details on Patch-1 itself anyway. If so,
that would be great!
Well, in general the EOP (End of Pipe) buffer is a ring buffer of all 
the events and actions the CP should execute when shaders and cache flushes 
finish.

The size depends on the HW generation, configuration of the GPU, etc., but 
don't ask me for the details of how that is calculated.

The point is that the size is completely unrelated to the CPU, so using 
PAGE_SIZE is clearly incorrect.

3. Patch 5/8: We also have a query w.r.t. the error paths when the system page size > 4K.
     Do we need to lift this restriction and add MMIO remap support for systems
     with non-4K page sizes?
The problem is the HW can't do this.

We aren't that familiar with the HW / SW stack here. We wanted to understand
what functionality will be unsupported due to this HW limitation.
The problem is that the CPU must map some of the registers/resources of the GPU 
into the address space of the application, and you run into security issues when 
you map more than 4k at a time.
Right.  There are some 4K pages within the MMIO register BAR which are
empty, and registers can be remapped into them.  In this case we remap
the HDP flush registers into one of those register pages.  This allows
applications to flush the HDP write FIFO from either the CPU or
another device.  This is needed to flush data written by the CPU or
another device to the VRAM BAR out to VRAM (i.e., so the GPU can see
it).  This is flushed internally as part of the shader dispatch
packets,
As far as I know this is only done for graphics shader submissions to the 
classic CS interface, but not for compute dispatches through ROCm queues.

That's the reason why ROCm needs the remapped MMIO register BAR.

but there are certain cases where an application may want
more control.  This is probably not a showstopper for most ROCm apps.
Well the problem is that you absolutely need the HDP flush/invalidation for 
100% correctness. It does work most of the time without it, but you then risk 
data corruption.

Apart from making the flush/invalidate an IOCTL, I think we could also just use 
a global dummy page in VRAM.

If you make two 32bit writes which are apart from each other and then read back 
a 32bit value from VRAM, that should invalidate the HDP as well. It's less 
efficient than the MMIO BAR remap but still much better than going through an 
IOCTL.

The only tricky part is that you need to get the HW barriers with the doorbell 
write right.....
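A rough sketch of what I mean (the names and offsets below are made up, this is 
not an existing interface, just the access pattern):

/* Sketch of the dummy-page idea: instead of poking the remapped HDP flush
 * register, do two 32-bit writes to well-separated offsets of a global
 * dummy page in VRAM and read one 32-bit value back, which should
 * flush/invalidate the HDP as well. */
#include <stdint.h>

static inline void hdp_flush_via_dummy_page(volatile uint32_t *dummy_vram)
{
	/* Two writes far enough apart that they can't be combined
	 * (256 bytes here, the distance is chosen arbitrarily). */
	dummy_vram[0]  = 1;
	dummy_vram[64] = 1;

	/* Read back a 32-bit value from VRAM to force completion. */
	(void)dummy_vram[0];

	/*
	 * The tricky part mentioned above: this sequence still has to be
	 * ordered against the following doorbell write with the proper
	 * CPU/PCIe barriers.
	 */
}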

That said, the region is only 4K so if you allow applications to map a
larger region they would get access to GPU register pages which they
shouldn't have access to.
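Just to illustrate the granularity problem (plain userspace sketch, nothing 
driver-specific assumed): mmap() can only map whole CPU pages, so a 4K register 
window cannot be exposed in isolation once the CPU page size is larger than 4K.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);	/* 4096 on x86, 65536 on the LPAR */
	size_t window = 4096;			/* the remapped register page */
	size_t mapped = ((window + page - 1) / page) * page;

	printf("CPU page size %ld: a %zu byte window maps %zu bytes (%zu extra)\n",
	       page, window, mapped, mapped - window);
	return 0;
}

On a 4K system the extra is 0; on a 64K system the application would also get 
the 60K of neighbouring register space it must not see.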
But don't we also have problems with the doorbell? E.g. the global aggregated 
one needs to be 4k as well, or is it ok to over-allocate there?

Thinking more about it, there is also a major problem with page tables. Those 
are 4k by default on modern systems as well, and while over-allocating them to 
64k is possible, that not only wastes some VRAM but can also result in OOM 
situations because we can't allocate the necessary page tables to switch from 
2MiB to 4k pages in some cases.

Sorry, Christian, I may be misunderstanding this point, so I would appreciate 
some clarification.

If the CPU page size is 64K and the GPU page size is 4K, then from the GPU side 
the page table entries are created and mapped at 4K granularity, while on the 
CPU side the pages remain 64K. To map a single CPU page to the GPU, we 
therefore need to create multiple GPU page table entries for that CPU page.
The GPU page tables are 4k in size no matter what the CPU page size is, and 
there is some special handling so that we can allocate them even under memory 
pressure. The background is that you sometimes need to split up higher order 
pages (1G, 2M) into lower order pages (2M, 4k), for example to be able to swap 
things to system memory, and for that you need an extra layer of page tables.

The problem now is that those 4k page tables are rounded up to your CPU page 
size, which both wastes quite a bit of memory and messes up the special 
handling that avoids OOM situations when swapping things to system memory.
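To put rough numbers on the rounding (assuming the usual layout of 512 
eight-byte PTEs per page-table page; the exact figures vary per ASIC, so treat 
this as illustrative only):

/* Illustrative arithmetic only: a 4 KiB page-table page rounded up to a
 * 64 KiB CPU page wastes roughly 94% of the allocation. */
#include <stdio.h>

int main(void)
{
	const unsigned long ptes       = 512;			/* entries per table (assumed) */
	const unsigned long pte_size   = 8;			/* bytes per PTE (assumed) */
	const unsigned long table_size = ptes * pte_size;	/* 4096 bytes */
	const unsigned long cpu_page   = 64 * 1024;		/* 64 KiB base page */

	printf("table: %lu bytes, rounded allocation: %lu bytes, wasted: %lu bytes\n",
	       table_size, cpu_page, cpu_page - table_size);
	return 0;
}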


Thank you, Christian, for the clarification.

When you say swapping to system memory, does that mean SVM migration to DRAM?

From my understanding of the code, SVM pages are tracked in system page–size PFNs, which on our system is 64 KB. With a 64 KB base page size, buffer objects (BOs) are allocated in 64 KB–aligned chunks, both in VRAM and GTT, while the GPU page-table mappings are still created using 4 KB pages.

During SVM migration from VRAM to system memory, I observed that an entire 64 KB page is migrated. Similarly, when XNACK is enabled, if the GPU accesses a 4 KB page, my understanding is that the entire 64 KB page is migrated.

If my understanding is correct, allocating 4 KB memory on a 64 KB page–size system results in a 64 KB BO allocation, meaning that around 60 KB is effectively wasted. Are you referring to this kind of over-allocation potentially leading to OOM situations under memory pressure?

Since I am still getting familiar with the AMDGPU codebase, could you please point me to the locations where special handling is implemented to avoid OOM conditions during swapping or migration?



What we could potentially do is to switch to 64k pages on the GPU as well (the 
HW is flexible enough to be re-configured), but that would be tons of changes 
and probably not easily testable.

Regards,
Christian.

We found that this was not being handled correctly in the SVM path and 
addressed it with the change in patch 2/8.

Given this, if the memory is allocated and mapped in GPU page-size (4K) 
granularity on the GPU side, could you please clarify how memory waste occurs 
in this scenario?

Thank you for your time and guidance.


Christian.

Alex

[1] ROCr debug agent tests: https://github.com/ROCm/rocr_debug_agent
[2] RCCL tests: https://github.com/ROCm/rccl/tree/develop/test


Please note that the changes in this series are on a best-effort basis from our
end. Therefore, we request the amd-gfx community (who have deeper knowledge of
the HW & SW stack) to kindly help with the review and provide feedback / comments on
these patches. The idea here is to also have non-4K page sizes (e.g. 64K) well
supported by the amd gpu kernel driver.
Well, this is generally nice to have, but there are unfortunately some HW 
limitations which make ROCm pretty much unusable on non-4k page size systems.
That's a bummer :(
- Is there any HW documentation describing these limitations around 
non-4K page sizes? Any links to such, please?
You already mentioned MMIO remap which obviously has that problem, but if I'm 
not completely mistaken the PCIe doorbell BAR and some global seq counter 
resources will also cause problems here.

This can all be worked around by delegating those MMIO accesses into the 
kernel, but that means tons of extra IOCTL overhead.

Especially the cache flushes which are necessary to avoid corruption are really 
bad for performance in such an approach.

- Are there any newer AMD GPU versions which maybe lift such restrictions?
Not that I know of.

What we can do is to support graphics and MM, but that should already work out 
of the box.

- Maybe we should also document what will work and what won't work due to 
these HW limitations.
Well, pretty much everything. I need to double-check how ROCm does HDP 
flushing/invalidating when the MMIO remap isn't available.

Could be that there is already a fallback path and that's the reason why this 
approach actually works at all.

What we can do is to support graphics and MM, but that should already work out of 
the box.
So these patches helped us resolve most of the issues, like SDMA hangs
and GPU kernel page faults, which we saw with the rocr and rccl tests with
a 64K page size. Meaning, we didn't see this working out of the box, perhaps
due to the 64K page size.
Yeah, but this is all for ROCm and not the graphics side.

To be honest, I'm not sure how ROCm even works when you have 64k pages at the 
moment. I would expect many more issues lurking in the kernel driver.

AFAIU, some of these patches may require re-work based on reviews, but
at least with these changes, we were able to see all the tests passing.

I need to talk with Alex and the ROCm team about whether workarounds can be 
implemented for those issues.

Thanks a lot! That would be super helpful!


Regards,
Christian.

Thanks again for the quick response on the patch series.
You are welcome, but since it's so near the end of the year, not all people 
are available any more.

Regards,
Christian.

-ritesh
