> From: Ananyev, Konstantin [mailto:konstantin.anan...@intel.com] > Sent: Friday, November 6, 2020 12:54 PM > > > > > > > > > > > > > > > > >> Hi Olivier, > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >>> m->nb_seg must be reset on mbuf free > > > whatever > > > > > the > > > > > > > value > > > > > > > > > of m->next, > > > > > > > > > > > > > > > >>> because it can happen that m->nb_seg is > != > > > 1. > > > > > For > > > > > > > > > instance in this > > > > > > > > > > > > > > > >>> case: > > > > > > > > > > > > > > > >>> > > > > > > > > > > > > > > > >>> m1 = rte_pktmbuf_alloc(mp); > > > > > > > > > > > > > > > >>> rte_pktmbuf_append(m1, 500); > > > > > > > > > > > > > > > >>> m2 = rte_pktmbuf_alloc(mp); > > > > > > > > > > > > > > > >>> rte_pktmbuf_append(m2, 500); > > > > > > > > > > > > > > > >>> rte_pktmbuf_chain(m1, m2); > > > > > > > > > > > > > > > >>> m0 = rte_pktmbuf_alloc(mp); > > > > > > > > > > > > > > > >>> rte_pktmbuf_append(m0, 500); > > > > > > > > > > > > > > > >>> rte_pktmbuf_chain(m0, m1); > > > > > > > > > > > > > > > >>> > > > > > > > > > > > > > > > >>> As rte_pktmbuf_chain() does not reset > > > nb_seg in > > > > > the > > > > > > > > > initial m1 > > > > > > > > > > > > > > > >>> segment (this is not required), after > this > > > code > > > > > the > > > > > > > > > mbuf chain > > > > > > > > > > > > > > > >>> have 3 segments: > > > > > > > > > > > > > > > >>> - m0: next=m1, nb_seg=3 > > > > > > > > > > > > > > > >>> - m1: next=m2, nb_seg=2 > > > > > > > > > > > > > > > >>> - m2: next=NULL, nb_seg=1 > > > > > > > > > > > > > > > >>> > > > > > > > > > > > > > > > >>> Freeing this mbuf chain will not > restore > > > > > nb_seg=1 > > > > > > > in > > > > > > > > > the second > > > > > > > > > > > > > > > >>> segment. > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> Hmm, not sure why is that? > > > > > > > > > > > > > > > >> You are talking about freeing m1, right? > > > > > > > > > > > > > > > >> rte_pktmbuf_prefree_seg(struct rte_mbuf > *m) > > > > > > > > > > > > > > > >> { > > > > > > > > > > > > > > > >> ... > > > > > > > > > > > > > > > >> if (m->next != NULL) { > > > > > > > > > > > > > > > >> m->next = NULL; > > > > > > > > > > > > > > > >> m->nb_segs = 1; > > > > > > > > > > > > > > > >> } > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> m1->next != NULL, so it will enter the > if() > > > > > block, > > > > > > > > > > > > > > > >> and will reset both next and nb_segs. > > > > > > > > > > > > > > > >> What I am missing here? > > > > > > > > > > > > > > > >> Thinking in more generic way, that > change: > > > > > > > > > > > > > > > >> - if (m->next != NULL) { > > > > > > > > > > > > > > > >> - m->next = NULL; > > > > > > > > > > > > > > > >> - m->nb_segs = 1; > > > > > > > > > > > > > > > >> - } > > > > > > > > > > > > > > > >> + m->next = NULL; > > > > > > > > > > > > > > > >> + m->nb_segs = 1; > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Ah, sorry. I oversimplified the example > and > > > now > > > > > it > > > > > > > does > > > > > > > > > not > > > > > > > > > > > > > > > > show the issue... > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The full example also adds a split() to > break > > > the > > > > > > > mbuf > > > > > > > > > chain > > > > > > > > > > > > > > > > between m1 and m2. The kind of thing that > > > would > > > > > be > > > > > > > done > > > > > > > > > for > > > > > > > > > > > > > > > > software TCP segmentation. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > If so, may be the right solution is to care > > > about > > > > > > > nb_segs > > > > > > > > > > > > > > > when next is set to NULL on split? Any > place > > > when > > > > > next > > > > > > > is > > > > > > > > > set > > > > > > > > > > > > > > > to NULL. Just to keep the optimization in a > > > more > > > > > > > generic > > > > > > > > > place. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The problem with that approach is that there > are > > > > > already > > > > > > > > > several > > > > > > > > > > > > > > existing split() or trim() implementations in > > > > > different > > > > > > > DPDK- > > > > > > > > > based > > > > > > > > > > > > > > applications. For instance, we have some in > > > > > 6WINDGate. If > > > > > > > we > > > > > > > > > force > > > > > > > > > > > > > > applications to set nb_seg to 1 when > resetting > > > next, > > > > > it > > > > > > > has > > > > > > > > > to be > > > > > > > > > > > > > > documented because it is not straightforward. > > > > > > > > > > > > > > > > > > > > > > > > > > I think it is better to go that way. > > > > > > > > > > > > > From my perspective it seems natural to reset > > > nb_seg at > > > > > > > same > > > > > > > > > time > > > > > > > > > > > > > we reset next, otherwise inconsistency will > occur. > > > > > > > > > > > > > > > > > > > > > > > > While it is not explicitly stated for nb_segs, to > me > > > it > > > > > was > > > > > > > clear > > > > > > > > > that > > > > > > > > > > > > nb_segs is only valid in the first segment, like > for > > > many > > > > > > > fields > > > > > > > > > (port, > > > > > > > > > > > > ol_flags, vlan, rss, ...). > > > > > > > > > > > > > > > > > > > > > > > > If we say that nb_segs has to be valid in any > > > segments, > > > > > it > > > > > > > means > > > > > > > > > that > > > > > > > > > > > > chain() or split() will have to update it in all > > > > > segments, > > > > > > > which > > > > > > > > > is not > > > > > > > > > > > > efficient. > > > > > > > > > > > > > > > > > > > > > > Why in all? > > > > > > > > > > > We can state that nb_segs on non-first segment > should > > > > > always > > > > > > > equal > > > > > > > > > 1. > > > > > > > > > > > As I understand in that case, both split() and > chain() > > > have > > > > > to > > > > > > > > > update nb_segs > > > > > > > > > > > only for head mbufs, rest ones will remain > untouched. > > > > > > > > > > > > > > > > > > > > Well, anyway, I think it's strange to have a > constraint > > > on m- > > > > > > > >nb_segs > > > > > > > > > for > > > > > > > > > > non-first segment. We don't have that kind of > constraints > > > for > > > > > > > other > > > > > > > > > fields. > > > > > > > > > > > > > > > > > > True, we don't. But this is one of the fields we > consider > > > > > critical > > > > > > > > > for proper work of mbuf alloc/free mechanism. > > > > > > > > > > > > > > > > > > > > > > > > > I am not sure that requiring m->nb_segs == 1 on non-first > > > > > segments > > > > > > > will provide any benefits. > > > > > > > > > > > > > > It would make this patch unneeded. > > > > > > > So, for direct, non-segmented mbufs pktmbuf_free() will > remain > > > > > write- > > > > > > > free. > > > > > > > > > > > > I see. Then I agree with Konstantin that alternative > solutions > > > should > > > > > be considered. > > > > > > > > > > > > The benefit regarding free()'ing non-segmented mbufs - which > is a > > > > > very common operation - certainly outweighs the cost of > requiring > > > > > split()/chain() operations to set the new head mbuf's nb_segs = > 1. > > > > > > > > > > > > Nonetheless, the bug needs to be fixed somehow. > > > > > > > > > > > > If we can't come up with a better solution that doesn't break > the > > > > > ABI, we are forced to accept the patch. > > > > > > > > > > > > Unless the techboard accepts to break the ABI in order to > avoid > > > the > > > > > performance cost of this patch. > > > > > > > > > > Did someone notice a performance drop with this patch? > > > > > On my side, I don't see any regression on a L3 use case. > > > > > > > > I am afraid that the DPDK performance regression tests are based > on > > > TX immediately following RX, so cache misses in TX may go by > unnoticed > > > because RX warmed up the cache for TX already. And similarly for RX > > > reusing mbufs that have been warmed up by the preceding free() at > TX. > > > > > > > > Please consider testing the performance difference with the mbuf > > > being completely cold at TX, and going completely cold again before > > > being reused for RX. > > > > > > > > > > > > > > Let's sumarize: splitting a mbuf chain and freeing it causes > > > subsequent > > > > > mbuf > > > > > allocation to return a mbuf which is not correctly initialized. > > > There > > > > > are 2 > > > > > options to fix it: > > > > > > > > > > 1/ change the mbuf free function (this patch) > > > > > > > > > > - m->nb_segs would behave like many other field: valid in > the > > > first > > > > > segment, ignored in other segments > > > > > - may impact performance (suspected) > > > > > > > > > > 2/ change all places where a mbuf chain is split, or trimmed > > > > > > > > > > - m->nb_segs would have a specific behavior: count the > number of > > > > > segments in the first mbuf, should be 1 in the last > segment, > > > > > ignored in other ones. > > > > > - no code change in mbuf library, so no performance impact > > > > > - need to patch all places where we do a mbuf split or trim. > > > From > > > > > afar, > > > > > I see at least mbuf_cut_seg_ofs() in DPDK. Some external > > > > > applications > > > > > may have to be patched (for instance, I already found 3 > places > > > in > > > > > 6WIND code base without a deep search). > > > > > > > > > > In my opinion, 1/ is better, except we notice a significant > > > > > performance, > > > > > because the (implicit) behavior is unchanged. > > > > > > > > > > Whatever the solution, some documentation has to be added. > > > > > > > > > > Olivier > > > > > > > > > > > > > Unfortunately, I don't think that anything but the first option > will > > > go into 20.11 and stable releases of older versions, so I stand by > my > > > acknowledgment of the patch. > > > > > > If we are affraid about 20.11 performance (it is legitimate, few > days > > > before the release), we can target 21.02. After all, everybody > lives > > > with this bug since 2017, so there is no urgency. If accepted and > well > > > tested, it can be backported in stable branches. > > > > +1 > > > > Good thinking, Olivier! > > Looking at the changes once again, it probably can be reworked a bit: > > - if (m->next != NULL) { > - m->next = NULL; > - m->nb_segs = 1; > - } > > + if (m->next != NULL) > + m->next = NULL; > + if (m->nb_segs != 1) > + m->nb_segs = 1; > > That way we add one more condition checking, but I suppose it > shouldn't be that perf critical. > That way for direct,non-segmented mbuf it still should be write-free. > Except cases as you described above: chain(), then split(). > > Of-course we still need to do perf testing for that approach too. > So if your preference it to postpone it till 21.02 - that's ok for me. > Konstantin
With this suggestion, I cannot imagine any performance drop for direct, non-segmented mbufs: It now reads m->nb_segs, residing in the mbuf's first cache line, but the function already reads m->refcnt in the first cache line; so no cache misses are introduced.