<snip>

> > >>>> 16/05/2022 07:10, Feifei Wang wrote:
> > >>>>>
> > >>>>>>> Currently, the transmit side frees the buffers into the lcore
> > >>>>>>> cache and the receive side allocates buffers from the lcore cache.
> > >>>>>>> The transmit side typically frees 32 buffers, resulting in
> > >>>>>>> 32*8=256B of stores to the lcore cache. The receive side allocates
> > >>>>>>> 32 buffers and stores them in the receive side software ring,
> > >>>>>>> resulting in 32*8=256B of stores and 256B of loads from the lcore cache.
> > >>>>>>>
> > >>>>>>> This patch proposes a mechanism to avoid freeing to/allocating
> > >>>>>>> from the lcore cache. I.e. the receive side will free the buffers
> > >>>>>>> from the transmit side directly into its software ring. This will
> > >>>>>>> avoid the 256B of loads and stores introduced by the lcore cache.
> > >>>>>>> It also frees up the cache lines used by the lcore cache.
> > >>>>>>>
> > >>>>>>> However, this solution poses several constraints:
> > >>>>>>>
> > >>>>>>> 1) The receive queue needs to know which transmit queue it should
> > >>>>>>> take the buffers from. The application logic decides which
> > >>>>>>> transmit port to use to send out the packets. In many use cases
> > >>>>>>> the NIC might have a single port ([1], [2], [3]), in which case a
> > >>>>>>> given transmit queue is always mapped to a single receive queue
> > >>>>>>> (1:1 Rx queue: Tx queue). This is easy to configure.
> > >>>>>>>
> > >>>>>>> If the NIC has 2 ports (there are several references), then we
> > >>>>>>> will have a 1:2 (RX queue: TX queue) mapping, which is still easy
> > >>>>>>> to configure. However, if this is generalized to 'N' ports, the
> > >>>>>>> configuration can be long. Moreover, the PMD would have to scan a
> > >>>>>>> list of transmit queues to pull the buffers from.
> > >>>>>
> > >>>>>> Just to re-iterate some generic concerns about this proposal:
> > >>>>>>  - We effectively link RX and TX queues - when this feature is enabled,
> > >>>>>>    the user can't stop a TX queue without stopping the linked RX queue first.
> > >>>>>>    Right now the user is free to start/stop any queue at will.
> > >>>>>>    If this feature will allow linking queues from different ports,
> > >>>>>>    then even ports will become dependent and the user will have to pay
> > >>>>>>    extra care when managing such ports.
> > >>>>>
> > >>>>> [Feifei] When direct rearm is enabled, there are two paths for the
> > >>>>> thread to choose. If there are enough Tx freed buffers, Rx can take
> > >>>>> buffers from Tx. Otherwise, Rx will take buffers from the mempool as
> > >>>>> usual. Thus, users do not need to pay much attention to managing ports.
> > >>>>
> > >>>> What I am talking about: right now different ports or different queues
> > >>>> of the same port can be treated as independent entities:
> > >>>> in general the user is free to start/stop (and even reconfigure in some
> > >>>> cases) one entity without the need to stop the other entity.
> > >>>> I.e. the user can stop and re-configure a TX queue while continuing to
> > >>>> receive packets from the RX queue.
> > >>>> With direct re-arm enabled, I think it wouldn't be possible any more:
> > >>>> before stopping/reconfiguring a TX queue the user would have to make
> > >>>> sure that the corresponding RX queue wouldn't be used by the datapath.
> > >>> I am trying to understand the problem better. For the TX queue to be
> > >>> stopped, the user must have blocked the data plane from accessing the
> > >>> TX queue.
> > >>
> > >> Surely it is the user's responsibility not to call tx_burst() for a
> > >> stopped/released queue.
> > >> The problem is that while TX for that queue is stopped, RX for the
> > >> related queue can still continue.
> > >> So rx_burst() will try to read/modify TX queue data that might be
> > >> already freed, or simultaneously modified by the control path.
> > > Understood, agree on the issue
> > >
> > >> Again, it all can be mitigated by carefully re-designing and modifying
> > >> control and data-path inside the user app - by doing extra checks and
> > >> synchronizations, etc.
> > >> But from a practical point of view - I presume most users would simply
> > >> avoid using this feature due to all the potential problems it might cause.
> > > That is subjective, it all depends on the performance improvements users
> > > see in their application.
> > > IMO, the performance improvement seen with this patch is worth a few changes.
> >
> > Yes, it is subjective to some extent, though my feeling is that it might
> > end up being a sort of synthetic improvement used only by some show-case
> > benchmarks.
> I believe that one specific important use case has already been mentioned, so I
> don't think this is a benchmark-only feature.
+1
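For readers following the thread, here is a rough sketch of the re-arm decision
being debated here, assuming the 1:1 Rx:Tx queue mapping from the cover letter.
All structure, field and helper names below are hypothetical and not taken from
the patch; it only illustrates the "two paths" Feifei mentions above: reuse
buffers freed by the paired Tx queue when enough are available, otherwise fall
back to the mempool.

    /* Hypothetical sketch only -- none of these names come from the patch. */
    #include <stdint.h>
    #include <rte_mbuf.h>
    #include <rte_mempool.h>

    struct sketch_txq;                      /* paired Tx queue (opaque here) */

    struct sketch_rxq {
    	struct rte_mempool *mp;             /* Rx mbuf pool */
    	struct rte_mbuf **sw_ring;          /* Rx software ring */
    	uint16_t rearm_start;               /* first slot to refill */
    	uint16_t rearm_n;                   /* slots to refill, e.g. 32 */
    	struct sketch_txq *paired_txq;      /* Tx queue mapped 1:1 */
    	int direct_rearm;                   /* feature enabled? */
    };

    /* assumed helpers: how many mbufs the Tx side has finished with, and a
     * routine that moves those pointers out of the Tx SW ring */
    uint16_t sketch_txq_done(struct sketch_txq *txq);
    void sketch_txq_take(struct sketch_txq *txq, struct rte_mbuf **dst, uint16_t n);

    static inline void
    sketch_rx_rearm(struct sketch_rxq *rxq)
    {
    	struct rte_mbuf **dst = &rxq->sw_ring[rxq->rearm_start];

    	if (rxq->direct_rearm &&
    	    sketch_txq_done(rxq->paired_txq) >= rxq->rearm_n) {
    		/* fast path: reuse buffers freed by the paired Tx queue,
    		 * skipping the lcore mempool cache entirely */
    		sketch_txq_take(rxq->paired_txq, dst, rxq->rearm_n);
    	} else if (rte_mempool_get_bulk(rxq->mp, (void **)dst,
    					rxq->rearm_n) != 0) {
    		/* normal path could not allocate: retry on a later burst */
    		return;
    	}
    	/* ...then write the buffer addresses into the Rx descriptors... */
    }

Note that even the check on the paired Tx queue touches that queue's state from
the Rx data path, which is essentially Konstantin's point about the control-path
dependency between the two queues.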
> >
> > From my perspective, it would be much more plausible if we could
> > introduce some sort of generic improvement that doesn't impose all
> > these extra constraints and implications.
> > Like the one discussed below in this thread with the ZC mempool approach.
> >
> Considering this feature from a high level perspective, I agree with Konstantin's
> concerns, so I'll also support his views.
We did hack the ZC mempool approach [1]; the level of improvement is pretty
small compared with this patch.

[1] http://patches.dpdk.org/project/dpdk/patch/20220613055136.1949784-1-feifei.wa...@arm.com/

> If this patch is supposed to be a generic feature, please add support for it in all
> NIC PMDs, not just one. (Regardless if the feature is defined as 1:1 mapping or
> N:M mapping.) It is purely software, so it should be available for all PMDs, not
> just your favorite hardware! Consider the "fast mbuf free" feature, which is
> pure software; why is that feature not implemented in all PMDs?
Agree, it is good to have it supported in all the drivers. We do not have a
favorite hardware, we just picked a PMD which we are more familiar with. We do
plan to implement it in other prominent PMDs.

> A secondary point I'm making here is that this specific feature will lead to an
> enormous amount of copy-paste code, instead of a generic library function
> easily available for all PMDs.
Are you talking about the i40e driver code specifically? If yes, agree we
should avoid copy-paste and we will look to reduce that.

> > >>> Like Feifei says, the RX side still has the normal packet allocation
> > >>> path available.
> > >>> Also this sounds like a corner case to me, we can handle this through
> > >>> checks in the queue_stop API.
> > >>
> > >> Depends.
> > >> If it would be allowed to link queues only from the same port, then yes,
> > >> extra checks for queue-stop might be enough.
> > >> As right now DPDK doesn't allow the user to change the number of queues
> > >> without dev_stop() first.
> > >> Though if it would be allowed to link queues from different ports, then
> > >> the situation will be much worse.
> > >> Right now ports are totally independent entities (except some special
> > >> cases like link-bonding, etc.).
> > >> While one port keeps doing RX/TX, a second one can be stopped,
> > >> re-configured, even detached, and a newly attached device might re-use
> > >> the same port number.
> > > I see this as a similar restriction to the one discussed above.
> >
> > Yes, they are similar in principle, though I think that the case with
> > queues from different ports would make things much more complex.
> >
> > > Do you see any issues if we enforce this with checks?
> >
> > Hard to tell straight away, a lot will depend on how smart such an
> > implementation would be.
> > Usually DPDK tends to avoid heavy synchronizations within its data-path
> > functions.
> Certainly! Implementing more and more of such features in the PMDs will lead
> to longer and longer data plane code paths in the PMDs. It is the "salami
> method", where each small piece makes no performance difference, but they all
> add up, and eventually the sum of them does impact the performance of the
> general use case negatively.
It would be good to have a test running in UNH that shows the performance trend.
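Coming back to the "checks in the queue_stop API" idea raised earlier in this
exchange, below is a hypothetical sketch (not part of the patch, and assuming
queues may only be linked within the same port): a PMD's Tx queue stop handler
could simply refuse to stop a Tx queue while the Rx queue that re-arms from it
is still started.

    /* Hypothetical sketch only -- struct and field names are illustrative. */
    #include <errno.h>
    #include <stdint.h>
    #include <ethdev_driver.h>

    struct sketch_txq {
    	int rearm_linked;        /* an Rx queue re-arms from this Tx queue */
    	uint16_t rx_queue_id;    /* that Rx queue (same port assumed) */
    };

    static int
    sketch_tx_queue_stop(struct rte_eth_dev *dev, uint16_t tx_queue_id)
    {
    	struct sketch_txq *txq = dev->data->tx_queues[tx_queue_id];

    	/* refuse to stop while the linked Rx queue may still read
    	 * this queue's SW ring from the data path */
    	if (txq->rearm_linked &&
    	    dev->data->rx_queue_state[txq->rx_queue_id] ==
    			RTE_ETH_QUEUE_STATE_STARTED)
    		return -EBUSY;

    	/* ...normal HW queue stop sequence would go here... */
    	dev->data->tx_queue_state[tx_queue_id] = RTE_ETH_QUEUE_STATE_STOPPED;
    	return 0;
    }

As Konstantin notes, this only covers the same-port case; once queues from
different ports can be linked, a stop/detach on one port can invalidate state
the other port's Rx path is still reading, which a per-queue check like this
does not protect against.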
> > >>>>>> - very limited usage scenario - it will have a positive effect only
> > >>>>>>   when we have a fixed forwarding mapping: all (or nearly all)
> > >>>>>>   packets from the RX queue are forwarded into the same TX queue.
> > >>>>>
> > >>>>> [Feifei] Although the usage scenario is limited, this usage scenario
> > >>>>> has a wide range of applications, such as a NIC with one port.
> > >>>>
> > >>>> Yes, there are NICs with one port, but there is no guarantee there
> > >>>> wouldn't be several such NICs within the system.
> > >>> What I see in my interactions is that a single NIC/DPU is underutilized
> > >>> for a 2 socket system. Some are adding more sockets to the system to
> > >>> better utilize the DPU. The NIC bandwidth continues to grow
> > >>> significantly. I do not think there will be a multi-DPU per server
> > >>> scenario.
> > >>
> > >> Interesting... from my experience it is vice-versa:
> > >> in many cases 200Gb/s is not that much these days to saturate a modern
> > >> 2 socket x86 server.
> > >> Though I suppose a lot depends on the particular HW and actual workload.
> > >
> > >>>>> Furthermore, I think this is a tradeoff between performance and
> > >>>>> flexibility.
> > >>>>> Our goal is to achieve the best performance, and this means we need
> > >>>>> to give up some flexibility decisively. For example, 'FAST_FREE Mode'
> > >>>>> deletes most of the buffer checks (refcnt > 1, external buffer,
> > >>>>> chained buffer), chooses the shortest path, and then achieves a
> > >>>>> significant performance improvement.
> > >>>>>> Wonder did you have a chance to consider the mempool-cache ZC API,
> > >>>>>> similar to the one we have for the ring?
> > >>>>>> It would allow us on the TX free path to avoid copying mbufs to a
> > >>>>>> temporary array on the stack.
> > >>>>>> Instead we can put them straight from the TX SW ring to the mempool
> > >>>>>> cache.
> > >>>>>> That should save an extra store/load per mbuf and might help to
> > >>>>>> achieve some performance gain without by-passing the mempool.
> > >>>>>> It probably wouldn't be as fast as what you are proposing, but might
> > >>>>>> be fast enough to consider as an alternative.
> > >>>>>> Again, it would be a generic one, so we can avoid all these
> > >>>>>> implications and limitations.
> > >>>>>
> > >>>>> [Feifei] I think this is a good try. However, the most important
> > >>>>> thing is whether we can bypass the mempool decisively to pursue the
> > >>>>> significant performance gains.
> > >>>>
> > >>>> I understand the intention, and I personally think this is a wrong and
> > >>>> dangerous attitude.
> > >>>> We have the mempool abstraction in place for a very good reason.
> > >>>> So we need to try to improve mempool performance (and API if
> > >>>> necessary) in the first place, not to avoid it and break our own rules
> > >>>> and recommendations.
> > >>> The abstraction can be thought of at a higher level. I.e. the driver
> > >>> manages the buffer allocation/free and this is hidden from the
> > >>> application. The application does not need to be aware of how these
> > >>> changes are implemented.
> > >>>
> > >>>>> For ZC, there may be a problem for it in i40e. The reason that we put
> > >>>>> Tx buffers into a temporary array is that i40e_tx_entry includes the
> > >>>>> buffer pointer and index.
> > >>>>> Thus we cannot put a Tx SW ring entry into the mempool directly, we
> > >>>>> need to first extract the mbuf pointer.
> > >>>>> Finally, though we use ZC, we still can't avoid using a temporary
> > >>>>> stack to extract the Tx buffer pointers.
> > >>>>
> > >>>> When talking about a ZC API for the mempool cache I meant something like:
> > >>>>
> > >>>> void **mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
> > >>>>         uint32_t *nb_elem, uint32_t flags);
> > >>>> void mempool_cache_put_zc_finish(struct rte_mempool_cache *mc,
> > >>>>         uint32_t nb_elem);
> > >>>>
> > >>>> I.e. _start_ will return to the user a pointer inside the mp-cache
> > >>>> where to put free elems and the max number of slots that can be safely
> > >>>> filled.
> > >>>> _finish_ will update mc->len.
> > >>>> As an example:
> > >>>>
> > >>>> /* expect to free N mbufs */
> > >>>> uint32_t n = N;
> > >>>> void **p = mempool_cache_put_zc_start(mc, &n, ...);
> > >>>>
> > >>>> /* free up to n elems */
> > >>>> for (i = 0; i != n; i++) {
> > >>>>
> > >>>>     /* get next free mbuf from somewhere */
> > >>>>     mb = extract_and_prefree_mbuf(...);
> > >>>>
> > >>>>     /* no more free mbufs for now */
> > >>>>     if (mb == NULL)
> > >>>>         break;
> > >>>>
> > >>>>     p[i] = mb;
> > >>>> }
> > >>>>
> > >>>> /* finalize ZC put, with _i_ freed elems */
> > >>>> mempool_cache_put_zc_finish(mc, i);
> > >>>>
> > >>>> That way, I think we can overcome the issue with i40e_tx_entry you
> > >>>> mentioned above. Plus it might be useful in other similar places.
> > >>>>
> > >>>> Another alternative is obviously to split i40e_tx_entry into two
> > >>>> structs (one for the mbuf, a second for its metadata) and have a
> > >>>> separate array for each of them.
> > >>>> Though with that approach we need to make sure no perf drops will be
> > >>>> introduced, plus probably more code changes will be required.
> > >>> Commit 5171b4ee6b6 already does this (in a different way), but just for
> > >>> AVX512. Unfortunately, it does not record any performance improvements.
> > >>> We could port this to Arm NEON and look at the performance.
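As a closing illustration of the "split i40e_tx_entry" alternative discussed
above, here is a rough sketch. The type and field names are illustrative rather
than the actual i40e definitions; it only shows why a separate mbuf-pointer
array combines well with a bulk (or zero-copy) mempool put.

    /* Hypothetical sketch only -- not the actual i40e structures. */
    #include <stdint.h>
    #include <rte_mbuf.h>
    #include <rte_mempool.h>

    /* today: one SW ring entry holds the mbuf pointer plus its metadata,
     * so the pointers are interleaved with the metadata in memory */
    struct sketch_tx_entry {
    	struct rte_mbuf *mbuf;
    	uint16_t next_id;
    	uint16_t last_id;
    };

    /* split layout: two parallel arrays indexed by the same descriptor id */
    struct sketch_tx_entry_meta {
    	uint16_t next_id;
    	uint16_t last_id;
    };

    struct sketch_tx_sw_ring {
    	struct rte_mbuf **mbufs;              /* contiguous mbuf pointers */
    	struct sketch_tx_entry_meta *meta;    /* metadata kept separate   */
    };

    /* With the split layout, freeing a completed burst becomes a bulk
     * operation on a contiguous slice of the mbuf array (ring wrap-around
     * handling omitted, and the mbufs are assumed already safe to return to
     * the pool); the same slice could be handed to a ZC put or to the Rx side. */
    static inline void
    sketch_tx_free_done(struct sketch_tx_sw_ring *ring, struct rte_mempool *mp,
    		uint16_t first, uint16_t n)
    {
    	rte_mempool_put_bulk(mp, (void **)&ring->mbufs[first], n);
    }

Commit 5171b4ee6b6, mentioned above, takes a different route to a similar end
for the AVX512 path, so a comparison of the two layouts on Arm NEON would be a
natural follow-up.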