On Thu, Feb 27, 2025 at 10:30:31PM +0530, Manish wrote:
> Again really sorry, missed this due to some issue with my mail filters and
> came to know about it via the qemu-devel weblink. :)
>
> On 25/02/25 2:37 pm, Daniel P. Berrangé wrote:
> > On Fri, Feb 21, 2025 at 04:44:48AM -0500, Manish Mishra wrote:
> > > We allocate extra metadata SKBs in case of zerocopy send. This metadata
> > > memory is accounted for in the OPTMEM limit. If there is any error with
> > > sending zerocopy data or if zerocopy was skipped, these metadata SKBs
> > > are queued in the socket error queue. This error queue is freed when
> > > userspace reads it.
> > >
> > > Usually, if there are continuous failures, we merge the metadata into a
> > > single SKB and free another one. However, if there is any out-of-order
> > > processing or intermittent zerocopy failures, this error chain can grow
> > > significantly, exhausting the OPTMEM limit. As a result, all new sendmsg
> > > requests fail to allocate any new SKB, leading to an ENOBUF error.
> >
> > IIUC, you are effectively saying that the migration code is calling
> > qio_channel_write() too many times, before it calls qio_channel_flush().
> >
> > Can you clarify what you mean by the "OPTMEM limit" here? I'm wondering
> > if this is potentially triggered by suboptimal tuning of the deployment
> > environment or we need to document tuning better.
>
> I replied to it on the other thread. Posting it again.
>
> We allocate some memory for zerocopy metadata; this is not accounted in
> tcp_send_queue, but it is accounted in optmem_limit.
>
> https://github.com/torvalds/linux/blob/dd83757f6e686a2188997cb58b5975f744bb7786/net/core/skbuff.c#L1607
>
> Also, when the zerocopy data is sent and acked, we try to free this
> allocated skb, as we can see in the code below.
>
> https://github.com/torvalds/linux/blob/dd83757f6e686a2188997cb58b5975f744bb7786/net/core/skbuff.c#L1751
>
> In case we get __msg_zerocopy_callback() on continuous ranges and
> skb_zerocopy_notify_extend() passes, we merge the ranges and free up the
> current skb. But if that is not the case, we insert that skb into the error
> queue and it won't be freed until we do an error flush from userspace. This
> is possible when either zerocopy packets are skipped in between, or zerocopy
> is always skipped but we get out-of-order acks on packets. As a result this
> error chain keeps growing, exhausting the optmem_limit, so when a new
> zerocopy sendmsg request comes, it can't allocate the metadata and returns
> ENOBUF.
>
> I understand there is a separate bug around why zerocopy packets are getting
> skipped, which could be specific to our deployment. But either way, live
> migration should not fail; it is fine to mark zerocopy as skipped, but not
> to fail, right?
>
> > > To work around this, if we encounter an ENOBUF error with a zerocopy
> > > sendmsg, we flush the error queue and retry once more.
> > >
> > > Signed-off-by: Manish Mishra <manish.mis...@nutanix.com>
> > > ---
> > >  include/io/channel-socket.h |  1 +
> > >  io/channel-socket.c         | 52 ++++++++++++++++++++++++++++++++-----
> > >  2 files changed, 46 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
> > > index ab15577d38..6cfc66eb5b 100644
> > > --- a/include/io/channel-socket.h
> > > +++ b/include/io/channel-socket.h
> > > @@ -49,6 +49,7 @@ struct QIOChannelSocket {
> > >      socklen_t remoteAddrLen;
> > >      ssize_t zero_copy_queued;
> > >      ssize_t zero_copy_sent;
> > > +    bool new_zero_copy_sent_success;
> > >  };
> > >
> > > diff --git a/io/channel-socket.c b/io/channel-socket.c
> > > index 608bcf066e..c7f576290f 100644
> > > --- a/io/channel-socket.c
> > > +++ b/io/channel-socket.c
> > > @@ -37,6 +37,11 @@
> > >
> > >  #define SOCKET_MAX_FDS 16
> > >
> > > +#ifdef QEMU_MSG_ZEROCOPY
> > > +static int qio_channel_socket_flush_internal(QIOChannel *ioc,
> > > +                                             Error **errp);
> > > +#endif
> > > +
> > >  SocketAddress *
> > >  qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
> > >                                       Error **errp)
> > > @@ -65,6 +70,7 @@ qio_channel_socket_new(void)
> > >      sioc->fd = -1;
> > >      sioc->zero_copy_queued = 0;
> > >      sioc->zero_copy_sent = 0;
> > > +    sioc->new_zero_copy_sent_success = FALSE;
> > >
> > >      ioc = QIO_CHANNEL(sioc);
> > >      qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
> > > @@ -566,6 +572,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
> > >      size_t fdsize = sizeof(int) * nfds;
> > >      struct cmsghdr *cmsg;
> > >      int sflags = 0;
> > > +    bool zero_copy_flush_pending = TRUE;
> > >
> > >      memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS));
> > >
> > > @@ -612,9 +619,21 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
> > >              goto retry;
> > >          case ENOBUFS:
> > >              if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
> > > -                error_setg_errno(errp, errno,
> > > -                                 "Process can't lock enough memory for using MSG_ZEROCOPY");
> > > -                return -1;
> > > +                if (zero_copy_flush_pending) {
> > > +                    ret = qio_channel_socket_flush_internal(ioc, errp);
> >
> > Calling this is problematic, because qio_channel_socket_flush is
> > designed to block execution until all buffers are flushed. When
> > ioc is in non-blocking mode, this breaks the required semantics
> > of qio_channel_socket_writev which must return EAGAIN and not
> > block.
> >
> > IOW, if we need to be able to flush at this point, we must be
> > able to do a partial flush, rather than blocking on a full
> > flush
>
> Sure.
>
> > > +                    if (ret < 0) {
> > > +                        error_setg_errno(errp, errno,
> > > +                                         "Zerocopy flush failed");
> > > +                        return -1;
> > > +                    }
> > > +                    zero_copy_flush_pending = FALSE;
> > > +                    goto retry;
> > > +                } else {
> > > +                    error_setg_errno(errp, errno,
> > > +                                     "Process can't lock enough memory for "
> > > +                                     "using MSG_ZEROCOPY");
> > > +                    return -1;
> > > +                }
> > >              }
> > >              break;
> > >          }
> > > @@ -725,8 +744,8 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
> > >
> > >  #ifdef QEMU_MSG_ZEROCOPY
> > > -static int qio_channel_socket_flush(QIOChannel *ioc,
> > > -                                    Error **errp)
> > > +static int qio_channel_socket_flush_internal(QIOChannel *ioc,
> > > +                                             Error **errp)
> > >  {
> > >      QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
> > >      struct msghdr msg = {};
> > > @@ -791,15 +810,34 @@ static int qio_channel_socket_flush(QIOChannel *ioc,
> > >
> > >          /* No errors, count successfully finished sendmsg()*/
> > >          sioc->zero_copy_sent += serr->ee_data - serr->ee_info + 1;
> > >
> > > -        /* If any sendmsg() succeeded using zero copy, return 0 at the end */
> > > +        /* If any sendmsg() succeeded using zero copy, mark zerocopy success */
> > >          if (serr->ee_code != SO_EE_CODE_ZEROCOPY_COPIED) {
> > > -            ret = 0;
> > > +            sioc->new_zero_copy_sent_success = TRUE;
> > >          }
> > >      }
> > >
> > >      return ret;
> > >  }
> > >
> > > +static int qio_channel_socket_flush(QIOChannel *ioc,
> > > +                                    Error **errp)
> > > +{
> > > +    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
> > > +    int ret;
> > > +
> > > +    ret = qio_channel_socket_flush_internal(ioc, errp);
> > > +    if (ret < 0) {
> > > +        return ret;
> > > +    }
> > > +
> > > +    if (sioc->new_zero_copy_sent_success) {
> > > +        sioc->new_zero_copy_sent_success = FALSE;
> > > +        ret = 0;
> > > +    }
> > > +
> > > +    return ret;
> > > +}
> >
> > I don't see the point in these changes adding new_zero_copy_sent_success.
> >
> > IIUC, you seem to be trying to make it so that qio_channel_socket_flush
> > will return 0, if qio_channel_socket_writev had called
> > qio_channel_socket_flush_internal behind the scenes.
> >
> > That should already be working though, as the first thing we do in flush
> > is to check if anything was pending and, if not, then return zero:
> >
> >     if (sioc->zero_copy_queued == sioc->zero_copy_sent) {
> >         return 0;
> >     }
> >     ....do the real flush logic....
> >
> >
> > With regards,
> > Daniel
>
> Yes, but the current logic is: if there was any successful zerocopy send in
> the iteration, return 0. I did not want to change that behavior. Now,
> qio_channel_socket_flush_internal() can be called at any point in time, and
> as many times as needed, depending on when the optmem_limit fills up. So we
> may or may not have any data to process when the actual
> qio_channel_socket_flush() comes.
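
A side note on the error-queue mechanics before getting to the flag
question: what the flush boils down to in userspace is reading the
MSG_ZEROCOPY completion notifications off the socket error queue, and each
notification we read releases the metadata skb that was charged against
net.core.optmem_max. Roughly like the standalone sketch below (illustrative
only, not QEMU code; the function name is made up and error handling is
minimal):

/*
 * Illustrative sketch only, not QEMU code: drain whatever MSG_ZEROCOPY
 * completion notifications are already sitting on the socket error queue,
 * without blocking.  Each notification read frees the metadata skb that
 * was charged against net.core.optmem_max.
 */
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <linux/errqueue.h>

/* Returns how many zerocopy sends were acknowledged, or -1 on error. */
static int drain_zerocopy_notifications(int fd)
{
    int acked = 0;

    for (;;) {
        char control[128];
        struct msghdr msg = {
            .msg_control = control,
            .msg_controllen = sizeof(control),
        };
        struct cmsghdr *cm;
        struct sock_extended_err *serr;

        /* MSG_DONTWAIT makes this a *partial* flush: return as soon as
         * the error queue is empty, instead of waiting for all in-flight
         * sends to complete. */
        if (recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT) < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                return acked;           /* nothing left to reap for now */
            }
            if (errno == EINTR) {
                continue;
            }
            return -1;
        }

        cm = CMSG_FIRSTHDR(&msg);
        if (!cm) {
            continue;
        }

        serr = (struct sock_extended_err *)CMSG_DATA(cm);
        if (serr->ee_errno != 0 || serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
            continue;                   /* not a zerocopy notification */
        }

        /* ee_info..ee_data is an inclusive range of completed sendmsg()
         * calls; SO_EE_CODE_ZEROCOPY_COPIED means the kernel fell back
         * to copying for this range. */
        acked += serr->ee_data - serr->ee_info + 1;
        if (serr->ee_code == SO_EE_CODE_ZEROCOPY_COPIED) {
            fprintf(stderr, "zerocopy skipped for sends %u..%u\n",
                    serr->ee_info, serr->ee_data);
        }
    }
}

The existing qio_channel_socket_flush() is roughly the same idea, except
that it keeps waiting until every queued send has been acknowledged, which
is the blocking behaviour Dan points out we cannot rely on from inside a
non-blocking writev; MSG_DONTWAIT above is what would make it a partial,
non-blocking drain.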
IIUC the whole point of that flag was to guarantee that the current stat
counter (dirty_sync_missed_zero_copy) keeps working.

That counter looks like it is pretty much for debugging purposes. If we
want, I think we can drop it and replace it with a tracepoint:

diff --git a/io/channel-socket.c b/io/channel-socket.c
index 608bcf066e..bccb464c9b 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -791,10 +791,9 @@ static int qio_channel_socket_flush(QIOChannel *ioc,
 
         /* No errors, count successfully finished sendmsg()*/
         sioc->zero_copy_sent += serr->ee_data - serr->ee_info + 1;
 
-        /* If any sendmsg() succeeded using zero copy, return 0 at the end */
-        if (serr->ee_code != SO_EE_CODE_ZEROCOPY_COPIED) {
-            ret = 0;
-        }
+        /* Enable this tracepoint if one wants to track zerocopy fallbacks */
+        trace_qio_channel_socket_zerocopy_fallback(
+            serr->ee_code == SO_EE_CODE_ZEROCOPY_COPIED);
     }
 
Then we deprecate this entry in QAPI:

  # @dirty-sync-missed-zero-copy: Number of times dirty RAM
  #     synchronization could not avoid copying dirty pages.  This is
  #     between 0 and @dirty-sync-count * @multifd-channels.  (since
  #     7.1)

I doubt any mgmt app consumes it at all. Probably we shouldn't ever have
exposed that in QAPI..

-- 
Peter Xu
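
PS: the trace_qio_channel_socket_zerocopy_fallback() call in the diff above
is only a sketch; to actually build, it would need a matching entry in
io/trace-events, something along these lines (the name and format string
here are just illustrative, mirroring the call):

  qio_channel_socket_zerocopy_fallback(bool copied) "copied %d"

And the QAPI side would be the usual deprecation route: add the
'deprecated' feature to the dirty-sync-missed-zero-copy member in
qapi/migration.json rather than removing it outright.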