On 12/17/25 10:46, Donet Tom wrote:
>>>> But don't we also have problems with the doorbell? E.g. the global
>>>> aggregated one needs to be 4k as well, or is it ok to over-allocate
>>>> there?
>>>>
>>>> Thinking more about it, there is also a major problem with page
>>>> tables. Those are 4k by default on modern systems as well, and while
>>>> over-allocating them to 64k is possible, that not only wastes some
>>>> VRAM but can also result in OOM situations because we can't allocate
>>>> the necessary page tables to switch from 2MiB to 4k pages in some
>>>> cases.
>>>
>>> Sorry, Christian, I may be misunderstanding this point, so I would
>>> appreciate some clarification.
>>>
>>> If the CPU page size is 64K and the GPU page size is 4K, then from the
>>> GPU side the page table entries are created and mapped at 4K
>>> granularity, while on the CPU side the pages remain 64K. To map a
>>> single CPU page to the GPU, we therefore need to create multiple GPU
>>> page table entries for that CPU page.
>>
>> The GPU page tables are 4k in size no matter what the CPU page size is,
>> and there is some special handling so that we can allocate them even
>> under memory pressure. Background is that you sometimes need to split
>> up higher-order pages (1G, 2M) into lower-order pages (2M, 4k), to be
>> able to swap things to system memory for example, and for that you
>> need an extra layer of page tables.
>>
>> The problem is now that those 4k page tables are rounded up to your
>> CPU page size, which both wastes quite some memory and defeats the
>> special handling that keeps us from running into OOM situations when
>> swapping things to system memory...
>
> Thank you, Christian, for the clarification.
>
> When you say swapping to system memory, does that mean SVM migration to
> DRAM?
Yes and no. It's mostly the normal BO-based swapping of TTM. SVM is
still an experimental and extremely rarely used feature.

> From my understanding of the code, SVM pages are tracked in system
> page-size PFNs, which on our system is 64 KB. With a 64 KB base page
> size, buffer objects (BOs) are allocated in 64 KB-aligned chunks, both
> in VRAM and GTT, while the GPU page-table mappings are still created
> using 4 KB pages.
>
> During SVM migration from VRAM to system memory, I observed that an
> entire 64 KB page is migrated. Similarly, when XNACK is enabled, if the
> GPU accesses a 4 KB page, my understanding is that the entire 64 KB
> page is migrated.
>
> If my understanding is correct, allocating 4 KB of memory on a 64 KB
> page-size system results in a 64 KB BO allocation, meaning that around
> 60 KB is effectively wasted. Are you referring to this kind of
> over-allocation potentially leading to OOM situations under memory
> pressure?

Correct, yes.

> Since I am still getting familiar with the AMDGPU codebase, could you
> please point me to the locations where special handling is implemented
> to avoid OOM conditions during swapping or migration?

See AMDGPU_VM_RESERVED_VRAM.

Regards,
Christian.

>
>> What we could potentially do is to switch to 64k pages on the GPU as
>> well (the HW is flexible enough to be re-configurable), but that is
>> tons of changes and probably not easily testable.
>>
>> Regards,
>> Christian.
>>
>>> We found that this was not being handled correctly in the SVM path
>>> and addressed it with the change in patch 2/8.
>>>
>>> Given this, if the memory is allocated and mapped in GPU page-size
>>> (4K) granularity on the GPU side, could you please clarify how memory
>>> waste occurs in this scenario?
>>>
>>> Thank you for your time and guidance.
>>>
>>>> Christian.
>>>>
>>>>> Alex
>>>>>
>>>>>>>>> [1] ROCr debug agent tests: https://github.com/ROCm/rocr_debug_agent
>>>>>>>>> [2] RCCL tests: https://github.com/ROCm/rccl/tree/develop/test
>>>>>>>>>
>>>>>>>>> Please note that the changes in this series are on a best-effort
>>>>>>>>> basis from our end. Therefore, we request the amd-gfx community
>>>>>>>>> (who have deeper knowledge of the HW & SW stack) to kindly help
>>>>>>>>> with the review and provide feedback / comments on these
>>>>>>>>> patches. The idea here is to also have non-4K page sizes (e.g.
>>>>>>>>> 64K) well supported with the amdgpu kernel driver.
>>>>>>>>
>>>>>>>> Well, this is generally nice to have, but there are unfortunately
>>>>>>>> some HW limitations which make ROCm pretty much unusable on
>>>>>>>> non-4k page size systems.
>>>>>>>
>>>>>>> That's a bummer :(
>>>>>>> - Do we have some HW documentation on what these limitations
>>>>>>>   around non-4K page size are? Any links to such, please?
>>>>>>
>>>>>> You already mentioned MMIO remap, which obviously has that problem,
>>>>>> but if I'm not completely mistaken the PCIe doorbell BAR and some
>>>>>> global seq counter resources will also cause problems here.
>>>>>>
>>>>>> This can all be worked around by delegating those MMIO accesses
>>>>>> into the kernel, but that means tons of extra IOCTL overhead.
>>>>>>
>>>>>> Especially the cache flushes which are necessary to avoid
>>>>>> corruption are really bad for performance in such an approach.
>>>>>>
>>>>>>> - Are there any newer AMD GPU versions which maybe lift such
>>>>>>> restrictions?
>>>>>>
>>>>>> Not that I know of any.
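[To make the over-allocation arithmetic in the exchange above concrete,
here is a minimal, self-contained sketch. The helper names are
hypothetical, not actual amdgpu code: it shows how one 64 KB CPU page
fans out into sixteen 4 KB GPU PTEs, and how much of a 64 KB-rounded BO
a 4 KB request wastes.]

/*
 * Hypothetical sketch (not amdgpu code): map one CPU page worth of
 * memory with 4 KiB GPU PTEs and show the rounding waste of a small
 * BO allocation on a 64 KiB base-page system.
 */
#include <stdint.h>
#include <stdio.h>

#define GPU_PAGE_SHIFT 12                      /* 4 KiB GPU pages */
#define GPU_PAGE_SIZE  (1UL << GPU_PAGE_SHIFT)

/* Stand-in for a driver callback that writes a single GPU PTE. */
static void write_gpu_pte(uint64_t gpu_va, uint64_t phys_addr)
{
	printf("PTE: GPU VA 0x%llx -> PA 0x%llx\n",
	       (unsigned long long)gpu_va, (unsigned long long)phys_addr);
}

/* Map one CPU page of cpu_page_size bytes using 4 KiB GPU PTEs. */
static unsigned int map_cpu_page(uint64_t gpu_va, uint64_t phys_addr,
				 unsigned long cpu_page_size)
{
	unsigned int num_ptes = cpu_page_size >> GPU_PAGE_SHIFT;
	unsigned int i;

	for (i = 0; i < num_ptes; i++)
		write_gpu_pte(gpu_va + i * GPU_PAGE_SIZE,
			      phys_addr + i * GPU_PAGE_SIZE);
	return num_ptes;
}

int main(void)
{
	unsigned long cpu_page = 64 * 1024;    /* 64 KiB base pages */
	unsigned long request  = 4 * 1024;     /* caller needs 4 KiB */

	/* BO sizes are rounded up to the CPU page size. */
	unsigned long bo_size = (request + cpu_page - 1) & ~(cpu_page - 1);

	printf("PTEs per CPU page: %u\n",
	       map_cpu_page(0x100000, 0x80000000, cpu_page));
	printf("requested %lu, allocated %lu, wasted %lu bytes\n",
	       request, bo_size, bo_size - request);
	return 0;
}

[On a 64 KB system this prints sixteen PTEs per CPU page and roughly
60 KB of waste per 4 KB request, which is the arithmetic behind the OOM
concern discussed above.]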
>>>>>>
>>>>>>>> What we can do is to support graphics and MM, but that should
>>>>>>>> already work out of the box.
>>>>>>>
>>>>>>> - Maybe we should also document what will work and what won't
>>>>>>>   work due to these HW limitations.
>>>>>>
>>>>>> Well, pretty much everything. I need to double-check how ROCm does
>>>>>> HDP flushing/invalidating when the MMIO remap isn't available.
>>>>>>
>>>>>> Could be that there is already a fallback path, and that's the
>>>>>> reason why this approach actually works at all.
>>>>>>
>>>>>>>> What we can do is to support graphics and MM, but that should
>>>>>>>> already work out of the box.
>>>>>>>
>>>>>>> So these patches helped us resolve most of the issues, like SDMA
>>>>>>> hangs and GPU kernel page faults, which we saw with rocr and rccl
>>>>>>> tests with 64K page size. Meaning, we didn't see this working out
>>>>>>> of the box, perhaps due to the 64K page size.
>>>>>>
>>>>>> Yeah, but this is all for ROCm and not the graphics side.
>>>>>>
>>>>>> To be honest, I'm not sure how ROCm even works when you have 64k
>>>>>> pages at the moment. I would expect many more issues lurking in
>>>>>> the kernel driver.
>>>>>>
>>>>>>> AFAIU, some of these patches may require re-work based on
>>>>>>> reviews, but at least with these changes we were able to see all
>>>>>>> the tests passing.
>>>>>>>
>>>>>>>> I need to talk with Alex and the ROCm team about whether
>>>>>>>> workarounds can be implemented for those issues.
>>>>>>>
>>>>>>> Thanks a lot! That would be super helpful!
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>
>>>>>>> Thanks again for the quick response on the patch series.
>>>>>>
>>>>>> You are welcome, but since it's so near to the end of the year,
>>>>>> not all people are available any more.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> -ritesh
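[For readers following the AMDGPU_VM_RESERVED_VRAM pointer above: the
macro, defined in amdgpu_vm.h, sets aside a slice of VRAM that ordinary
BO allocations may not consume, so the extra page-table layer needed
when splitting huge pages during eviction can still be allocated under
memory pressure. The following is only a conceptual sketch of that
idea; apart from the macro name, every identifier is hypothetical, and
the reserve size shown is illustrative rather than the value in any
particular kernel.]

/*
 * Conceptual sketch of the AMDGPU_VM_RESERVED_VRAM idea (not the
 * actual TTM/amdgpu implementation): ordinary BO allocations stop
 * short of a reserved slice of VRAM, while page-table allocations
 * may still use it, so eviction never fails for lack of page tables.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define AMDGPU_VM_RESERVED_VRAM (8ULL << 20)   /* illustrative size */

struct vram_mgr {
	uint64_t total;    /* total VRAM */
	uint64_t used;     /* bytes handed out so far */
};

/* Ordinary BO allocations may not dip into the reserved slice. */
static bool alloc_bo_vram(struct vram_mgr *mgr, uint64_t size)
{
	if (mgr->used + size > mgr->total - AMDGPU_VM_RESERVED_VRAM)
		return false;  /* would eat into the page-table reserve */
	mgr->used += size;
	return true;
}

/* Page-table allocations are allowed to use the reserved slice. */
static bool alloc_page_table_vram(struct vram_mgr *mgr, uint64_t size)
{
	if (mgr->used + size > mgr->total)
		return false;
	mgr->used += size;
	return true;
}

int main(void)
{
	struct vram_mgr mgr = { .total = 256ULL << 20, .used = 0 };

	/* Fill VRAM with BOs until only the reserve is left. */
	while (alloc_bo_vram(&mgr, 1ULL << 20))
		;

	/* Eviction still has room for the extra page-table layer. */
	printf("page table alloc under pressure: %s\n",
	       alloc_page_table_vram(&mgr, 4096) ? "ok" : "OOM");
	return 0;
}

[Rounding each 4 KB page-table allocation up to a 64 KB CPU page would
drain a fixed reserve like this roughly sixteen times faster, which is
the breakage with non-4K page sizes described earlier in the thread.]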
