On Tue, Jan 24, 2023 at 12:45:38PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (pet...@redhat.com) wrote:
> > Add a new cap to allow mapping hugetlbfs backed RAMs in small page sizes.
> > 
> > Signed-off-by: Peter Xu <pet...@redhat.com>
> 
> 
> Reviewed-by: Dr. David Alan Gilbert <dgilb...@redhat.com>

Thanks.

> 
> although, I'm curious if the protocol actually changes

Yes it does.

It differs not in the form of a changed header or any frame definitions,
but in how huge pages are sent.  The old binary can only send a huge page
by sending all of its small pages sequentially, from index 0 up to index
N_HUGE-1; the new binary can send the small pages of a huge page out of
order.  For the latter, it's the same as when huge pages are not used.
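
To make that concrete, here's a tiny illustration (not QEMU code; the
sizes and addresses are made up) of the offsets each side may put on the
wire for a single huge page:

    /* Illustration only -- not QEMU code; sizes/addresses are made up. */
    #include <stdio.h>

    #define HUGE_PAGE_SIZE   (2UL * 1024 * 1024)   /* e.g. a 2M hugetlbfs page */
    #define SMALL_PAGE_SIZE  (4UL * 1024)          /* host small page */
    #define N_HUGE           (HUGE_PAGE_SIZE / SMALL_PAGE_SIZE)

    int main(void)
    {
        unsigned long huge_start = 0x40000000UL;   /* arbitrary example */

        /* Old binary: the small pages of one huge page go out in order. */
        for (unsigned long i = 0; i < N_HUGE; i++) {
            printf("send 0x%lx\n", huge_start + i * SMALL_PAGE_SIZE);
        }

        /*
         * New binary (cap enabled): any small page within the huge page
         * can be sent on its own, e.g. right after a postcopy page
         * request, so the offsets can show up in any order.
         */
        printf("send 0x%lx\n", huge_start + 37 * SMALL_PAGE_SIZE);
        printf("send 0x%lx\n", huge_start + 3 * SMALL_PAGE_SIZE);
        return 0;
    }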

> or whether a doublepage enabled destination would work with an unmodified
> source?

This is an interesting question.

I would expect old -> new to work as usual, because the page frames are not
modified, so the dest node will just see pages being migrated in a
sequential manner.  The page request latency will be the same as with the
old binary, though, because even if the dest host can handle small pages it
won't be able to get the pages it wants as soon as possible - the src host
decides which page to send.

Meanwhile, new -> old shouldn't work, I think, for the reason described
above: the dest host will see weird things happening, e.g., a huge page
being sent starting not from index 0 but from some index X (0<X<N_HUGE-1).
It should quickly bail out assuming something is wrong.
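
Roughly, the assumption an old destination makes looks like this (a
hypothetical sketch, not the real QEMU code; the function and names are
made up):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /*
     * Hypothetical check: within one host huge page, an old destination
     * expects the incoming small pages to be strictly sequential,
     * starting at offset 0 of the huge page.
     */
    bool incoming_page_ok(uintptr_t addr, uintptr_t last_addr,
                          size_t small_page, size_t huge_page)
    {
        if ((addr & (huge_page - 1)) == 0) {
            /* First small page of a new huge page: fine. */
            return true;
        }
        /* Otherwise it must immediately follow the previous small page. */
        return addr == last_addr + small_page;
    }

If a new source starts a huge page at some index X != 0, neither condition
holds, so the old destination has no sane way to continue.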

> I guess potentially you can get away without the dirty clearing
> of the partially sent hugepages that the source normally does.

Good point. It's actually more relevant to the other patch later in the
series that reworks the discard logic.  I kept it as-is mainly for two
reasons:

 1) It is still not 100% confirmed how MADV_DONTNEED should behave on
    HGM-enabled memory ranges where huge pages used to be mapped.  It's
    part of the discussion upstream on the kernel patchset.  I think it's
    settling, but in the current series I kept the logic in a form that
    will work in all cases.

 2) Not dirtying the partially sent huge pages always reduces the number
    of small pages being migrated, but it can also change the content of
    the discard messages due to the frame format of
    MIG_CMD_POSTCOPY_RAM_DISCARD: we can end up with many more scattered
    ranges, hence a lot more messaging.  With the existing logic, since we
    always re-dirty the partially sent huge pages, the ranges are more
    likely to stay contiguous and be efficient to encode (see the small
    sketch after the frame layout below).
    
        * CMD_POSTCOPY_RAM_DISCARD consist of:
        *      byte   version (0)
        *      byte   Length of name field (not including 0)
        *  n x byte   RAM block name
        *      byte   0 terminator (just for safety)
        *  n x        Byte ranges within the named RAMBlock
        *      be64   Start of the range
        *      be64   Length
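
To see why the scattered case costs more, here's a tiny illustration (not
QEMU code; the bitmaps are made up) that just counts how many (start,
length) pairs a dirty bitmap would need:

    /* Illustration only: each run of dirty small pages becomes one
     * (start, length) pair in the CMD_POSTCOPY_RAM_DISCARD payload. */
    #include <stdio.h>

    static int count_ranges(const int *dirty, int n)
    {
        int ranges = 0;
        for (int i = 0; i < n; i++) {
            if (dirty[i] && (i == 0 || !dirty[i - 1])) {
                ranges++;
            }
        }
        return ranges;
    }

    int main(void)
    {
        /* Partially sent huge page, not re-dirtied: holes where pages were sent. */
        int scattered[] = { 1, 0, 1, 1, 0, 1, 0, 1 };
        /* Same huge page fully re-dirtied (current logic): one contiguous run. */
        int redirtied[] = { 1, 1, 1, 1, 1, 1, 1, 1 };

        printf("scattered:  %d (start, length) pairs\n", count_ranges(scattered, 8));
        printf("re-dirtied: %d (start, length) pair(s)\n", count_ranges(redirtied, 8));
        return 0;
    }

Four pairs vs. one pair here; with 1G huge pages (262144 small pages each)
the gap could be much larger, which is why I'd like to measure it first.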

I think 1) may not hold as the kernel series evolves, so it may not be true
anymore.  2) may still be true, but I think it's worth some testing
(especially with 1G pages) to see how much it would interfere with the
discard procedure.  Maybe it won't be as bad as I think.  Even if it is, we
can evaluate the tradeoff between "slower discard sync" and "fewer pages to
send".  E.g., we could consider changing the frame layout by bumping
postcopy_ram_discard_version.

I'll take a note on this one and provide more update in the next version.

-- 
Peter Xu

