On Fri, Apr 04, 2025 at 04:53:13PM +0300, Mike Rapoport wrote:

> > Maybe change the reserved regions code to put the region list in a
> > folio and preserve the folio instead of using FDT as a "demo" for the
> > functionality.
>
> Folios are not available when we restore reserved regions, this just won't
> work.

You don't need the folio at that point, you just need the data in the
page. The folio would be freed after starting up the buddy allocator.

> > We know what the future use case is for the folio preservation, all
> > the drivers and the iommu are going to rely on this.
>
> We don't know how much of the preservation will be based on folios.

I think almost all of it. Where else does memory come from for drivers?

> Most drivers do not use folios

Yes they do, either through kmalloc or through alloc_page/etc. "folio"
here is just some generic word meaning memory from the buddy allocator.

The big question on my mind is if we need a way to preserve slab
objects as well..

> and for preserving memfd* and hugetlb we'd need to have some dance
> around that memory anyway.

memfd is all folios - what do you mean?

hugetlb is moving toward folios.. eg guestmemfd is supposed to be
taking the hugetlb special stuff and turning it into folios.

> So I think kho_preserve_folio() would be a part of the fdbox or
> whatever that functionality will be called.

It is part of KHO. Preserving the folios has to be sequenced with
starting the buddy allocator, and that is KHO's entire responsibility.

I could see something like preserving slab being in a different layer,
built on preserving folios.

> Are they?
> The purpose of basic KHO is to make sure the memory we want to preserve is
> not trampled over. Preserving folios with their orders means we need to
> make sure memory range of the folio is preserved and we carry additional
> information to actually recreate the folio object, in case it is needed and
> in case it is possible. Hughetlb, for instance has its own way initializing
> folios and just keeping the order won't be enough for that.

I expect many things will need a side-car datastructure to record that
additional meta-data. hugetlb can start with folios, then switch them
over to its non-folio stuff based on its metadata.

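
To make the side-car idea concrete, here is a rough userspace sketch, not
kernel code. It assumes the low-level layer records only (pfn, order) and
hands back a neutral descriptor, and that a higher layer keeps its own
record to re-type that descriptor on restore. Every name in it
(preserve_folio, restore_folio, hugetlb_sidecar, hugetlb_adopt, ...) is
hypothetical and chosen for illustration only; none of it is the actual
KHO or hugetlb interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */

enum memdesc_type { MEMDESC_NEUTRAL, MEMDESC_HUGETLB };

/* All the low-level layer records: a naturally aligned (pfn, order) pair. */
struct preserved_folio {
	uint64_t pfn;
	unsigned int order;
};

#define MAX_PRESERVED 16

struct preserve_table {
	struct preserved_folio e[MAX_PRESERVED];
	size_t n;
};

/* Side-car record owned by a higher layer (a pretend hugetlb user here). */
struct hugetlb_sidecar {
	uint64_t pfn;	/* key matching a preserved_folio */
	int reserved;	/* stand-in for hugetlb-specific state */
};

/* What restore hands back before the caller customizes it. */
struct restored_folio {
	uint64_t pfn;
	unsigned int order;
	enum memdesc_type type;
	int reserved;
};

/* Record a folio; reject a full table or a misaligned pfn. */
static int preserve_folio(struct preserve_table *t, uint64_t pfn,
			  unsigned int order)
{
	if (t->n == MAX_PRESERVED || order >= 64 ||
	    (pfn & ((1ULL << order) - 1)))
		return -1;
	t->e[t->n++] = (struct preserved_folio){ pfn, order };
	return 0;
}

/* Look up a preserved folio and return it as a neutral descriptor. */
static int restore_folio(const struct preserve_table *t, uint64_t pfn,
			 struct restored_folio *out)
{
	for (size_t i = 0; i < t->n; i++) {
		if (t->e[i].pfn == pfn) {
			*out = (struct restored_folio){
				pfn, t->e[i].order, MEMDESC_NEUTRAL, 0
			};
			return 0;
		}
	}
	return -1;
}

/* The higher layer re-types the neutral descriptor from its side-car. */
static void hugetlb_adopt(struct restored_folio *f,
			  const struct hugetlb_sidecar *sc)
{
	assert(f->pfn == sc->pfn);
	f->type = MEMDESC_HUGETLB;
	f->reserved = sc->reserved;
}
```

In this model a caller would preserve_folio() before handover, then
restore_folio() and hugetlb_adopt() afterwards. The low-level table never
learns anything hugetlb specific.
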
The point is the basic low level KHO mechanism is simple folios -
memory from the buddy allocator with a neutral struct folio that the
caller can then customize to its own memory descriptor type on restore.

Eventually restore would allocate a caller specific memdesc and it
wouldn't be "folios" at all. We just don't have the right words yet to
describe this.

> As for the optimizations of memblock reserve path, currently it what hurts
> the most in my and Pratyush experiments. They are not very representative,
> but still, preserving lots of pages/folios spread all over would have it's
> toll on the mm initialization.

> And I don't think invasive changes to how
> buddy and memory map initialization are the best way to move forward and
> optimize that.

I'm pretty sure this is going to be the best performance path, but I
have no idea how invasive it would be to the buddy allocator to make it
work.

> Quite possibly we'd want to be able to minimize amount of *ranges*
> that we preserve.

I'm not sure, that seems backwards to me, we really don't want to have
KHO mem zones! So I think optimizing for, and thinking about, ranges
doesn't make sense.

The big ranges will arise naturally because things like hugetlb
reservations should all be contiguous and the resulting folios should
all be allocated for the VM and also all be contiguous. So vast, vast
amounts of memory will be high order and contiguous.

> Preserving folio orders with it is really straighforward and until we see
> some real data of how the entire KHO machinery is used, I'd prefer simple
> over anything else.

The maple tree may not even work as it has a very high bound on memory
usage if the preservation workload is small and fragmented. This is why
I didn't want to use a list of ranges in the first place. It also
doesn't work so well if you need to preserve the order too :\

Until we know the workload(s) and cost how much memory the maple tree
version will use I don't think it is a good general starting point.

Jason