Re: [PATCH v4 3/4] migration: zero-copy flush only at the end of bitmap scanning

Leonardo Brás Mon, 20 Jun 2022 20:36:35 -0700

On Mon, 2022-06-20 at 11:44 -0400, Peter Xu wrote:
> On Mon, Jun 20, 2022 at 11:23:53AM +0200, Juan Quintela wrote:
> > Once discussed this, what I asked in the past is that you are having too
> > much dirty memory on zero_copy.  When you have a Multiterabyte guest, in
> > a single round you have a "potentially" dirty memory on each channel of:
> > 
> >    total_amount_memory / number of channels.
> > 
> > In a Multiterabyte guest, this is going to be more that probably in the
> > dozens of gigabytes.  As far as I know there is no card/driver that will
> > benefit for so many pages in zero_copy, and kernel will move to
> > synchronous copy at some point.  (In older threads, daniel showed how to
> > test for this case).
> 
> I was wondering whether the kernel needs to cache a lot of messages for
> zero copy if we don't flush it for a long time, as recvmsg(MSG_ERRQUEUE)
> seems to be fetching one message from the kernel one at a time.  And,
> whether that queue has a limit in length or something.


IIRC, if all messages look the same, it 'merges' them in a single message, like,
'this range has these flags and output'.

So, if no issue happens, we should have a single message with the confirmation
of all sent buffers, meaning just a little memory is used for that.

> 
> Does it mean that when the kernel could have cached enough of these
> messages then it'll fallback to the no-zero-copy mode?  And probably that's
> the way how kernel protects itself from using too much buffer for the error
> msgs?

Since it merges the messages, I don't think it uses a lot of space for that.

IIRC, the kernel will fall back to copying only if the network adapter / driver
does not support MSG_ZEROCOPY, like when it does not support scatter-gather.

> 
> This reminded me - Leo, have you considered adding the patch altogether to
> detect the "fallback to non-zero-copy" condition?  Because when with it and
> when the fallback happens at some point (e.g. when the guest memory is
> larger than some value) we'll know.

I still did not consider that, but sure, how do you see that working?

We can't just disable zero-copy-send because the user actually opted in, so we
could instead add a one time error message for when it falls back to copying, as
it should happen in the first try of zero-copy send.

Or we could fail the migration, stating the interface does not support
MSG_ZEROCOPY, since it should happen in the first sendmsg().

I would personally opt for the last option.

What do you think?

> 
> Thanks,
> 

Thanks Peter!

Best regards,
Leo

Re: [PATCH v4 3/4] migration: zero-copy flush only at the end of bitmap scanning

Reply via email to