On Tue, Jan 24, 2023 at 12:45:38PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (pet...@redhat.com) wrote:
> > Add a new cap to allow mapping hugetlbfs backed RAMs in small page sizes.
> >
> > Signed-off-by: Peter Xu <pet...@redhat.com>
>
> Reviewed-by: Dr. David Alan Gilbert <dgilb...@redhat.com>
Thanks.

> although, I'm curious if the protocol actually changes

Yes it does.  It differs not in a changed header or any frame definitions,
but in the format in which huge pages are sent.  The old binary can only
send a huge page by sending all of its small pages sequentially, from
index 0 to index N_HUGE-1, while the new binary can send the small pages
out of order.  For the latter it's the same as when huge pages are not
used.

> or whether a doublepage enabled destination would work with an unmodified
> source?

This is an interesting question.

I would expect old -> new to work as usual, because the page frames are
not modified, so the dest node will just see pages being migrated in a
sequential manner.  The page request latency will be the same as with the
old binary, though, because even if the dest host can handle small pages
it won't get the pages it wants as soon as possible - the src host decides
which page to send.

Meanwhile, new -> old shouldn't work, I think, because as described above
the dest host will see weird things happening, e.g., a huge page being
sent starting not from index 0 but from some index X (0<X<N_HUGE-1).  It
should quickly bail out assuming something is wrong.

> I guess potentially you can get away without the dirty clearing
> of the partially sent hugepages that the source normally does.

Good point.  It's actually more relevant to the later patch reworking the
discard logic.  I kept it as-is for mainly two reasons:

1) It is still not 100% confirmed how MADV_DONTNEED should behave on
   HGM-enabled memory ranges where huge pages used to be mapped.  It's
   part of the upstream discussion on the kernel patchset.  I think it's
   settling, but in the current series I kept it in a form that will work
   in all cases.

2) Not dirtying the partially sent huge pages always reduces the number
   of small pages to migrate, but it can also change the content of the
   discard messages due to the frame format of
   MIG_CMD_POSTCOPY_RAM_DISCARD, in that we can have a lot more scattered
   ranges, so a lot more messaging can be needed.  With the existing
   logic, since we always re-dirty the partially sent pages, the ranges
   are more likely to be efficient.

 * CMD_POSTCOPY_RAM_DISCARD consist of:
 *      byte   version (0)
 *      byte   Length of name field (not including 0)
 *  n x byte   RAM block name
 *      byte   0 terminator (just for safety)
 *  n x        Byte ranges within the named RAMBlock
 *      be64   Start of the range
 *      be64   Length

I think 1) may not hold as the kernel series evolves, so it may not be
true anymore.  2) may still be true, but I think it's worth some testing
(especially with 1G pages) to see how it could interfere with the discard
procedure.  Maybe it won't be as bad as I think.  Even if it is, we can
evaluate the tradeoff between "slower discard sync" and "fewer pages to
send".  E.g., we can consider changing the frame layout by bumping
postcopy_ram_discard_version.

I'll take a note on this one and provide more updates in the next version.

-- 
Peter Xu
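
For reference, a tiny illustrative sketch of how a payload with the layout
quoted above could be encoded.  This is not the actual QEMU code - the
names (encode_discard_frame, put_be64, DiscardRange) are made up and the
surrounding QEMU_VM_COMMAND wrapping is omitted - the only point is that
every extra scattered range costs another 16 bytes of be64 start/length on
the wire, so heavier fragmentation means bigger (or more) discard messages.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    uint64_t start;    /* offset of the range within the RAMBlock */
    uint64_t length;   /* length of the range to discard */
} DiscardRange;

/* Store a 64-bit value in big-endian (be64) byte order. */
static void put_be64(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++) {
        p[i] = (uint8_t)(v >> (56 - 8 * i));
    }
}

/*
 * Encode: version byte, name length byte (not including the NUL),
 * RAM block name, 0 terminator, then n x (be64 start, be64 length).
 * Returns the number of bytes written, or 0 if it doesn't fit.
 */
static size_t encode_discard_frame(uint8_t *buf, size_t buf_len,
                                   const char *block_name,
                                   const DiscardRange *ranges,
                                   size_t nranges)
{
    size_t name_len = strlen(block_name);
    size_t need = 1 + 1 + name_len + 1 + nranges * 16;
    size_t off = 0;

    if (name_len > 255 || need > buf_len) {
        return 0;
    }

    buf[off++] = 0;                          /* version (0) */
    buf[off++] = (uint8_t)name_len;          /* length of name field */
    memcpy(buf + off, block_name, name_len); /* RAM block name */
    off += name_len;
    buf[off++] = 0;                          /* terminator, just for safety */

    for (size_t i = 0; i < nranges; i++) {   /* ranges within the RAMBlock */
        put_be64(buf + off, ranges[i].start);
        off += 8;
        put_be64(buf + off, ranges[i].length);
        off += 8;
    }

    return off;
}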