> >> > >>>>>>>>>>>>>>>>>> Hi Olivier, > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> m->nb_seg must be reset on mbuf free > >>>> whatever > >>>>>> the > >>>>>>>> value > >>>>>>>>>> of m->next, > >>>>>>>>>>>>>>>>>>> because it can happen that m->nb_seg is > >> != > >>>> 1. > >>>>>> For > >>>>>>>>>> instance in this > >>>>>>>>>>>>>>>>>>> case: > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> m1 = rte_pktmbuf_alloc(mp); > >>>>>>>>>>>>>>>>>>> rte_pktmbuf_append(m1, 500); > >>>>>>>>>>>>>>>>>>> m2 = rte_pktmbuf_alloc(mp); > >>>>>>>>>>>>>>>>>>> rte_pktmbuf_append(m2, 500); > >>>>>>>>>>>>>>>>>>> rte_pktmbuf_chain(m1, m2); > >>>>>>>>>>>>>>>>>>> m0 = rte_pktmbuf_alloc(mp); > >>>>>>>>>>>>>>>>>>> rte_pktmbuf_append(m0, 500); > >>>>>>>>>>>>>>>>>>> rte_pktmbuf_chain(m0, m1); > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> As rte_pktmbuf_chain() does not reset > >>>> nb_seg in > >>>>>> the > >>>>>>>>>> initial m1 > >>>>>>>>>>>>>>>>>>> segment (this is not required), after > >> this > >>>> code > >>>>>> the > >>>>>>>>>> mbuf chain > >>>>>>>>>>>>>>>>>>> have 3 segments: > >>>>>>>>>>>>>>>>>>> - m0: next=m1, nb_seg=3 > >>>>>>>>>>>>>>>>>>> - m1: next=m2, nb_seg=2 > >>>>>>>>>>>>>>>>>>> - m2: next=NULL, nb_seg=1 > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Freeing this mbuf chain will not > >> restore > >>>>>> nb_seg=1 > >>>>>>>> in > >>>>>>>>>> the second > >>>>>>>>>>>>>>>>>>> segment. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Hmm, not sure why is that? > >>>>>>>>>>>>>>>>>> You are talking about freeing m1, right? > >>>>>>>>>>>>>>>>>> rte_pktmbuf_prefree_seg(struct rte_mbuf > >> *m) > >>>>>>>>>>>>>>>>>> { > >>>>>>>>>>>>>>>>>> ... > >>>>>>>>>>>>>>>>>> if (m->next != NULL) { > >>>>>>>>>>>>>>>>>> m->next = NULL; > >>>>>>>>>>>>>>>>>> m->nb_segs = 1; > >>>>>>>>>>>>>>>>>> } > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> m1->next != NULL, so it will enter the > >> if() > >>>>>> block, > >>>>>>>>>>>>>>>>>> and will reset both next and nb_segs. > >>>>>>>>>>>>>>>>>> What I am missing here? > >>>>>>>>>>>>>>>>>> Thinking in more generic way, that > >> change: > >>>>>>>>>>>>>>>>>> - if (m->next != NULL) { > >>>>>>>>>>>>>>>>>> - m->next = NULL; > >>>>>>>>>>>>>>>>>> - m->nb_segs = 1; > >>>>>>>>>>>>>>>>>> - } > >>>>>>>>>>>>>>>>>> + m->next = NULL; > >>>>>>>>>>>>>>>>>> + m->nb_segs = 1; > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Ah, sorry. I oversimplified the example > >> and > >>>> now > >>>>>> it > >>>>>>>> does > >>>>>>>>>> not > >>>>>>>>>>>>>>>>> show the issue... > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> The full example also adds a split() to > >> break > >>>> the > >>>>>>>> mbuf > >>>>>>>>>> chain > >>>>>>>>>>>>>>>>> between m1 and m2. The kind of thing that > >>>> would > >>>>>> be > >>>>>>>> done > >>>>>>>>>> for > >>>>>>>>>>>>>>>>> software TCP segmentation. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> If so, may be the right solution is to care > >>>> about > >>>>>>>> nb_segs > >>>>>>>>>>>>>>>> when next is set to NULL on split? Any > >> place > >>>> when > >>>>>> next > >>>>>>>> is > >>>>>>>>>> set > >>>>>>>>>>>>>>>> to NULL. Just to keep the optimization in a > >>>> more > >>>>>>>> generic > >>>>>>>>>> place. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> The problem with that approach is that there > >> are > >>>>>> already > >>>>>>>>>> several > >>>>>>>>>>>>>>> existing split() or trim() implementations in > >>>>>> different > >>>>>>>> DPDK- > >>>>>>>>>> based > >>>>>>>>>>>>>>> applications. For instance, we have some in > >>>>>> 6WINDGate. If > >>>>>>>> we > >>>>>>>>>> force > >>>>>>>>>>>>>>> applications to set nb_seg to 1 when > >> resetting > >>>> next, > >>>>>> it > >>>>>>>> has > >>>>>>>>>> to be > >>>>>>>>>>>>>>> documented because it is not straightforward. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I think it is better to go that way. > >>>>>>>>>>>>>> From my perspective it seems natural to reset > >>>> nb_seg at > >>>>>>>> same > >>>>>>>>>> time > >>>>>>>>>>>>>> we reset next, otherwise inconsistency will > >> occur. > >>>>>>>>>>>>> > >>>>>>>>>>>>> While it is not explicitly stated for nb_segs, to > >> me > >>>> it > >>>>>> was > >>>>>>>> clear > >>>>>>>>>> that > >>>>>>>>>>>>> nb_segs is only valid in the first segment, like > >> for > >>>> many > >>>>>>>> fields > >>>>>>>>>> (port, > >>>>>>>>>>>>> ol_flags, vlan, rss, ...). > >>>>>>>>>>>>> > >>>>>>>>>>>>> If we say that nb_segs has to be valid in any > >>>> segments, > >>>>>> it > >>>>>>>> means > >>>>>>>>>> that > >>>>>>>>>>>>> chain() or split() will have to update it in all > >>>>>> segments, > >>>>>>>> which > >>>>>>>>>> is not > >>>>>>>>>>>>> efficient. > >>>>>>>>>>>> > >>>>>>>>>>>> Why in all? > >>>>>>>>>>>> We can state that nb_segs on non-first segment > >> should > >>>>>> always > >>>>>>>> equal > >>>>>>>>>> 1. > >>>>>>>>>>>> As I understand in that case, both split() and > >> chain() > >>>> have > >>>>>> to > >>>>>>>>>> update nb_segs > >>>>>>>>>>>> only for head mbufs, rest ones will remain > >> untouched. > >>>>>>>>>>> > >>>>>>>>>>> Well, anyway, I think it's strange to have a > >> constraint > >>>> on m- > >>>>>>>>> nb_segs > >>>>>>>>>> for > >>>>>>>>>>> non-first segment. We don't have that kind of > >> constraints > >>>> for > >>>>>>>> other > >>>>>>>>>> fields. > >>>>>>>>>> > >>>>>>>>>> True, we don't. But this is one of the fields we > >> consider > >>>>>> critical > >>>>>>>>>> for proper work of mbuf alloc/free mechanism. > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> I am not sure that requiring m->nb_segs == 1 on non-first > >>>>>> segments > >>>>>>>> will provide any benefits. > >>>>>>>> > >>>>>>>> It would make this patch unneeded. > >>>>>>>> So, for direct, non-segmented mbufs pktmbuf_free() will > >> remain > >>>>>> write- > >>>>>>>> free. > >>>>>>> > >>>>>>> I see. Then I agree with Konstantin that alternative > >> solutions > >>>> should > >>>>>> be considered. > >>>>>>> > >>>>>>> The benefit regarding free()'ing non-segmented mbufs - which > >> is a > >>>>>> very common operation - certainly outweighs the cost of > >> requiring > >>>>>> split()/chain() operations to set the new head mbuf's nb_segs = > >> 1. > >>>>>>> > >>>>>>> Nonetheless, the bug needs to be fixed somehow. > >>>>>>> > >>>>>>> If we can't come up with a better solution that doesn't break > >> the > >>>>>> ABI, we are forced to accept the patch. > >>>>>>> > >>>>>>> Unless the techboard accepts to break the ABI in order to > >> avoid > >>>> the > >>>>>> performance cost of this patch. > >>>>>> > >>>>>> Did someone notice a performance drop with this patch? > >>>>>> On my side, I don't see any regression on a L3 use case. > >>>>> > >>>>> I am afraid that the DPDK performance regression tests are based > >> on > >>>> TX immediately following RX, so cache misses in TX may go by > >> unnoticed > >>>> because RX warmed up the cache for TX already. And similarly for RX > >>>> reusing mbufs that have been warmed up by the preceding free() at > >> TX. > >>>>> > >>>>> Please consider testing the performance difference with the mbuf > >>>> being completely cold at TX, and going completely cold again before > >>>> being reused for RX. > >>>>> > >>>>>> > >>>>>> Let's sumarize: splitting a mbuf chain and freeing it causes > >>>> subsequent > >>>>>> mbuf > >>>>>> allocation to return a mbuf which is not correctly initialized. > >>>> There > >>>>>> are 2 > >>>>>> options to fix it: > >>>>>> > >>>>>> 1/ change the mbuf free function (this patch) > >>>>>> > >>>>>> - m->nb_segs would behave like many other field: valid in > >> the > >>>> first > >>>>>> segment, ignored in other segments > >>>>>> - may impact performance (suspected) > >>>>>> > >>>>>> 2/ change all places where a mbuf chain is split, or trimmed > >>>>>> > >>>>>> - m->nb_segs would have a specific behavior: count the > >> number of > >>>>>> segments in the first mbuf, should be 1 in the last > >> segment, > >>>>>> ignored in other ones. > >>>>>> - no code change in mbuf library, so no performance impact > >>>>>> - need to patch all places where we do a mbuf split or trim. > >>>> From > >>>>>> afar, > >>>>>> I see at least mbuf_cut_seg_ofs() in DPDK. Some external > >>>>>> applications > >>>>>> may have to be patched (for instance, I already found 3 > >> places > >>>> in > >>>>>> 6WIND code base without a deep search). > >>>>>> > >>>>>> In my opinion, 1/ is better, except we notice a significant > >>>>>> performance, > >>>>>> because the (implicit) behavior is unchanged. > >>>>>> > >>>>>> Whatever the solution, some documentation has to be added. > >>>>>> > >>>>>> Olivier > >>>>>> > >>>>> > >>>>> Unfortunately, I don't think that anything but the first option > >> will > >>>> go into 20.11 and stable releases of older versions, so I stand by > >> my > >>>> acknowledgment of the patch. > >>>> > >>>> If we are affraid about 20.11 performance (it is legitimate, few > >> days > >>>> before the release), we can target 21.02. After all, everybody > >> lives > >>>> with this bug since 2017, so there is no urgency. If accepted and > >> well > >>>> tested, it can be backported in stable branches. > >>> > >>> +1 > >>> > >>> Good thinking, Olivier! > >> > >> Looking at the changes once again, it probably can be reworked a bit: > >> > >> - if (m->next != NULL) { > >> - m->next = NULL; > >> - m->nb_segs = 1; > >> - } > >> > >> + if (m->next != NULL) > >> + m->next = NULL; > >> + if (m->nb_segs != 1) > >> + m->nb_segs = 1; > >> > >> That way we add one more condition checking, but I suppose it > >> shouldn't be that perf critical. > >> That way for direct,non-segmented mbuf it still should be write-free. > >> Except cases as you described above: chain(), then split(). > >> > >> Of-course we still need to do perf testing for that approach too. > >> So if your preference it to postpone it till 21.02 - that's ok for me. > >> Konstantin > > > > With this suggestion, I cannot imagine any performance drop for direct, > > non-segmented mbufs: It now reads m->nb_segs, residing in the > mbuf's first cache line, but the function already reads m->refcnt in the > first cache line; so no cache misses are introduced. > > +1
I don't expect perf drop with that approach either. But some perf testing still needs to be done, just in case 😊