> From: Honnappa Nagarahalli [mailto:honnappa.nagaraha...@arm.com]
> Sent: Wednesday, 29 June 2022 23.58
>
> <snip>

> > > >>>> 16/05/2022 07:10, Feifei Wang writes:

> > > >>>>>>> Currently, the transmit side frees the buffers into the lcore cache and the receive side allocates buffers from the lcore cache. The transmit side typically frees 32 buffers, resulting in 32*8=256B of stores to the lcore cache. The receive side allocates 32 buffers and stores them in the receive side software ring, resulting in 32*8=256B of stores and 256B of loads from the lcore cache.

> > > >>>>>>> This patch proposes a mechanism to avoid freeing to/allocating from the lcore cache, i.e. the receive side will free the buffers from the transmit side directly into its software ring. This will avoid the 256B of loads and stores introduced by the lcore cache. It also frees up the cache lines used by the lcore cache.

> > > >>>>>>> However, this solution poses several constraints:

> > > >>>>>>> 1) The receive queue needs to know which transmit queue it should take the buffers from. The application logic decides which transmit port to use to send out the packets. In many use cases the NIC might have a single port ([1], [2], [3]), in which case a given transmit queue is always mapped to a single receive queue (1:1 Rx queue: Tx queue). This is easy to configure.

> > > >>>>>>> If the NIC has 2 ports (there are several references), then we will have a 1:2 (RX queue: TX queue) mapping, which is still easy to configure. However, if this is generalized to 'N' ports, the configuration can be long. Moreover, the PMD would have to scan a list of transmit queues to pull the buffers from.

> > > >>>>>> Just to re-iterate some generic concerns about this proposal:
> > > >>>>>>  - We effectively link RX and TX queues - when this feature is enabled, the user can't stop a TX queue without stopping the linked RX queue first. Right now the user is free to start/stop any queues at will. If that feature allows linking queues from different ports, then even ports will become dependent and the user will have to pay extra care when managing such ports.

> > > >>>>> [Feifei] When direct rearm is enabled, there are two paths for the thread to choose. If there are enough Tx freed buffers, Rx can take buffers from Tx. Otherwise, Rx will take buffers from the mempool as usual. Thus, users do not need to pay much attention to managing ports.

> > > >>>> What I am talking about: right now different ports or different queues of the same port can be treated as independent entities: in general the user is free to start/stop (and even reconfigure in some cases) one entity without the need to stop the other entity. I.e. the user can stop and re-configure a TX queue while continuing to receive packets from the RX queue.
> > > >>>> With direct re-arm enabled, I think it wouldn't be possible any more: before stopping/reconfiguring a TX queue the user would have to make sure that the corresponding RX queue wouldn't be used by the datapath.

> > > >>> I am trying to understand the problem better. For the TX queue to be stopped, the user must have blocked the data plane from accessing the TX queue.

> > > >> Surely it is the user's responsibility not to call tx_burst() for a stopped/released queue. The problem is that while TX for that queue is stopped, RX for the related queue can still continue. So rx_burst() will try to read/modify TX queue data that might already be freed, or simultaneously modified by the control path.

> > > > Understood, agree on the issue.

> > > >> Again, it all can be mitigated by carefully re-designing and modifying the control and data-path inside the user app - by doing extra checks and synchronizations, etc. But from a practical point of view - I presume most users would simply avoid using this feature due to all the potential problems it might cause.

> > > > That is subjective, it all depends on the performance improvements users see in their application. IMO, the performance improvement seen with this patch is worth a few changes.

> > > Yes, it is subjective to some extent, though my feeling is that it might end up being a sort of synthetic improvement used only by some show-case benchmarks.

> > I believe that one specific important use case has already been mentioned, so I don't think this is a benchmark-only feature.

> +1

> > > From my perspective, it would be much more plausible if we could introduce some sort of generic improvement that doesn't impose all these extra constraints and implications. Like the one discussed below in this thread with the ZC mempool approach.

> > Considering this feature from a high-level perspective, I agree with Konstantin's concerns, so I'll also support his views.

> We did hack the ZC mempool approach [1]; the level of improvement is pretty small compared with this patch.
>
> [1] http://patches.dpdk.org/project/dpdk/patch/20220613055136.1949784-1-feifei.wa...@arm.com/

> > If this patch is supposed to be a generic feature, please add support for it in all NIC PMDs, not just one. (Regardless of whether the feature is defined as 1:1 mapping or N:M mapping.) It is purely software, so it should be available for all PMDs, not just your favorite hardware! Consider the "fast mbuf free" feature, which is pure software; why is that feature not implemented in all PMDs?

> Agree, it is good to have it supported in all the drivers. We do not have favorite hardware, we just picked a PMD which we are more familiar with. We do plan to implement it in other prominent PMDs.

> > A secondary point I'm making here is that this specific feature will lead to an enormous amount of copy-paste code, instead of a generic library function easily available for all PMDs.

> Are you talking about the i40e driver code specifically? If yes, agreed, we should avoid copy-paste and we will look to reduce that.
Yes, I am talking about the code that needs to be copied into all prominent PMDs. Perhaps you can move the majority of it into a common directory, if not into a generic library, so the modification per PMD becomes smaller. (I see the same copy-paste issue with the "fast mbuf free" feature, if it is to be supported by PMDs other than the i40e PMD.)

Please note that I do not expect you to implement this feature in more PMDs than you need. I was trying to make the point that implementing a software feature in a PMD requires copy-pasting it to other PMDs, which can require a big effort; while implementing it in a library and calling the library from the PMDs requires a smaller effort per PMD. I intentionally phrased it somewhat provocatively, and was lucky not to offend anyone. :-)

> > > >>> Like Feifei says, the RX side still has the normal packet allocation path available. Also, this sounds like a corner case to me; we can handle this through checks in the queue_stop API.

> > > >> Depends. If it were allowed to link queues only from the same port, then yes, extra checks for queue-stop might be enough, as right now DPDK doesn't allow the user to change the number of queues without dev_stop() first. Though if it were allowed to link queues from different ports, then the situation would be much worse. Right now ports are totally independent entities (except for some special cases like link bonding, etc.). While one port can keep doing RX/TX, a second one can be stopped, re-configured, even detached, and a newly attached device might re-use the same port number.

> > > > I see this as a similar restriction to the one discussed above.

> > > Yes, they are similar in principle, though I think that the case with queues from different ports would make things much more complex.

> > > > Do you see any issues if we enforce this with checks?

> > > Hard to tell straight away, a lot will depend on how smart such an implementation would be. Usually DPDK tries to avoid heavy synchronizations within its data-path functions.

> > Certainly! Implementing more and more of such features in the PMDs will lead to longer and longer data plane code paths in the PMDs. It is the "salami method", where each small piece makes no performance difference, but they all add up, and eventually the sum of them does impact the performance of the general use case negatively.

> It would be good to have a test running in UNH that shows the performance trend.

+1

> > > >>>>>> - very limited usage scenario - it will have a positive effect only when we have a fixed forwarding mapping: all (or nearly all) packets from the RX queue are forwarded into the same TX queue.

> > > >>>>> [Feifei] Although the usage scenario is limited, this usage scenario has a wide range of applications, such as a NIC with one port.

> > > >>>> Yes, there are NICs with one port, but there is no guarantee there wouldn't be several such NICs within the system.

> > > >>> What I see in my interactions is that a single NIC/DPU is underutilized for a 2-socket system. Some are adding more sockets to the system to better utilize the DPU.
> > > >>> The NIC bandwidth continues to grow significantly. I do not think there will be a multi-DPU per server scenario.

> > > >> Interesting... from my experience it is vice versa: in many cases 200Gb/s is not that much these days to saturate a modern 2-socket x86 server. Though I suppose a lot depends on the particular HW and the actual workload.

> > > >>>>> Furthermore, I think this is a tradeoff between performance and flexibility. Our goal is to achieve the best performance; this means we need to give up some flexibility decisively. For example, 'FAST_FREE mode' deletes most of the buffer checks (refcnt > 1, external buffer, chained buffer), chooses the shortest path, and thereby achieves a significant performance improvement.

> > > >>>>>> I wonder, did you have a chance to consider a mempool-cache ZC API, similar to the one we have for the ring? It would allow us on the TX free path to avoid copying mbufs to a temporary array on the stack. Instead we can put them straight from the TX SW ring into the mempool cache. That should save an extra store/load for the mbuf and might help to achieve some performance gain without by-passing the mempool. It probably wouldn't be as fast as what you are proposing, but it might be fast enough to consider as an alternative. Again, it would be a generic one, so we can avoid all these implications and limitations.

> > > >>>>> [Feifei] I think this is a good try. However, the most important thing is whether we can bypass the mempool decisively to pursue the significant performance gains.

> > > >>>> I understand the intention, and I personally think this is a wrong and dangerous attitude. We have the mempool abstraction in place for a very good reason. So we need to try to improve mempool performance (and the API if necessary) in the first place, not to avoid it and break our own rules and recommendations.

> > > >>> The abstraction can be thought of at a higher level, i.e. the driver manages the buffer allocation/free and this is hidden from the application. The application does not need to be aware of how these changes are implemented.

> > > >>>>> For ZC, there may be a problem with it in i40e. The reason for putting the Tx buffers into a temporary array is that i40e_tx_entry includes the buffer pointer and an index. Thus we cannot put the Tx SW ring entry into the mempool directly; we need to first extract the mbuf pointer. Finally, even though we use ZC, we still can't avoid using a temporary stack to extract the Tx buffer pointers.

> > > >>>> When talking about a ZC API for the mempool cache I meant something like:
> > > >>>>
> > > >>>>   void **mempool_cache_put_zc_start(struct rte_mempool_cache *mc, uint32_t *nb_elem, uint32_t flags);
> > > >>>>   void mempool_cache_put_zc_finish(struct rte_mempool_cache *mc, uint32_t nb_elem);
> > > >>>>
> > > >>>> i.e. _start_ will return to the user a pointer inside the mp-cache where to put the free elems and the max number of slots that can be safely filled.
> > > >>>> _finish_ will update mc->len.
> > > >>>> As an example:
> > > >>>>
> > > >>>> /* expect to free N mbufs */
> > > >>>> uint32_t n = N;
> > > >>>> void **p = mempool_cache_put_zc_start(mc, &n, ...);
> > > >>>>
> > > >>>> /* free up to n elems */
> > > >>>> for (i = 0; i != n; i++) {
> > > >>>>
> > > >>>>     /* get next free mbuf from somewhere */
> > > >>>>     mb = extract_and_prefree_mbuf(...);
> > > >>>>
> > > >>>>     /* no more free mbufs for now */
> > > >>>>     if (mb == NULL)
> > > >>>>         break;
> > > >>>>
> > > >>>>     p[i] = mb;
> > > >>>> }
> > > >>>>
> > > >>>> /* finalize ZC put, with _i_ freed elems */
> > > >>>> mempool_cache_put_zc_finish(mc, i);
> > > >>>>
> > > >>>> That way, I think we can overcome the issue with i40e_tx_entry you mentioned above. Plus it might be useful in other similar places.
> > > >>>>
> > > >>>> Another alternative is obviously to split i40e_tx_entry into two structs (one for mbuf, second for its metadata) and have a separate array for each of them. Though with that approach we need to make sure no perf drops will be introduced, plus probably more code changes will be required.

> > > >>> Commit 5171b4ee6b6 already does this (in a different way), but just for AVX512. Unfortunately, it does not record any performance improvements. We could port this to Arm NEON and look at the performance.
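
To make the combination of the two ideas above more concrete, here is a rough sketch of how a split Tx SW ring entry could feed the proposed zero-copy mempool-cache put. This is illustrative only: the mempool_cache_put_zc_*() functions are the API proposed in this thread (they do not exist in DPDK today), and the struct layouts and helper name below are my assumptions modelled on the i40e driver, not actual driver code.

/* Sketch only - assumes the proposed mempool_cache_put_zc_*() API and a
 * split of the Tx SW ring into an mbuf-pointer array plus a separate
 * metadata array. All names below are illustrative, not real DPDK code.
 */
#include <stdint.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Current-style entry: mbuf pointer and metadata interleaved,
 * so the mbuf pointers are not contiguous in memory.
 */
struct tx_entry {
	struct rte_mbuf *mbuf;
	uint16_t next_id;
	uint16_t last_id;
};

/* Split alternative: mbuf pointers kept contiguous in their own array. */
struct tx_entry_mbuf {
	struct rte_mbuf *mbuf;
};

struct tx_entry_meta {
	uint16_t next_id;
	uint16_t last_id;
};

/* Prototypes of the zero-copy put API proposed above (hypothetical). */
void **mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
				  uint32_t *nb_elem, uint32_t flags);
void mempool_cache_put_zc_finish(struct rte_mempool_cache *mc,
				 uint32_t nb_elem);

/* Free a run of completed Tx mbufs straight into the mempool cache.
 * Assumes RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE-like conditions (refcnt == 1,
 * all mbufs from the same mempool). With the split layout the pointers
 * are adjacent, so no temporary stack array is needed to gather them.
 */
static inline void
tx_free_bufs_zc(struct tx_entry_mbuf *txep, uint32_t nb_free,
		struct rte_mempool_cache *mc)
{
	uint32_t i, n = nb_free;
	void **slots = mempool_cache_put_zc_start(mc, &n, 0);

	if (slots == NULL)
		return; /* real code would fall back to the regular put path */

	for (i = 0; i != n; i++)
		slots[i] = txep[i].mbuf; /* pointer copy only; metadata lives elsewhere */

	mempool_cache_put_zc_finish(mc, i);
}

Whether this actually removes the temporary array in practice depends on how the ZC API ends up looking; the point is only that the split layout makes the mbuf pointers contiguous, which is what both the ZC put and the direct-rearm path want.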