<snip>
> >>
> >>> Refer to "i40e_tx_free_bufs_avx512": this patch puts the mempool cache
> >>> out of the API to free buffers directly. There are two changes compared
> >>> with the previous version:
> >>> 1. change txep from "i40e_entry" to "i40e_vec_entry"
> >>> 2. put the cache out of the "mempool_bulk" API to copy buffers into it
> >>> directly
> >>>
> >>> Performance test with the l3fwd neon path, with this patch:
> >>> n1sdp: no performance change
> >>> ampere-altra: +4.0%
> >>
> >> Thanks for the RFC, appreciate your effort.
> >> So, as I understand, bypassing mempool put/get itself gives about
> >> 7-10% speedup for RX/TX on ARM platforms, correct?
> >
> > It is 7% to 14% on N1SDP, 14% to 17% on Ampere machines. The current
> > RFC needs to be converted to a patch (albeit with zero-copy mempool
> > APIs), as it applies to a more generic use case.
> >
> >> About the direct-rearm RX approach you propose:
> >> after another thought, it is probably possible to re-arrange it in a
> >> way that would help avoid the related negatives.
> >
> > Thanks for the idea.
> >
> >> The basic idea is as follows:
> >>
> >> 1. Make the RXQ sw_ring visible and accessible by 'attached' TX queues.
> >> Also make the sw_ring de-coupled from the RXQ itself, i.e.:
> >> when the RXQ is stopped or even destroyed, the related sw_ring may
> >> still exist (probably a ref-counter or RCU would be sufficient here).
> >> All that means we need a common layout/API for the rxq_sw_ring,
> >> and PMDs that would like to support direct-rearming will have to
> >> use/obey it.
> >
> > This would mean we will have an additional for loop to fill the
> > descriptors on the RX side.
>
> Yes, RX will go through the available entries in the sw_ring and fill the
> actual HW descriptors.
>
> > Maybe we could keep the queue when the RXQ is stopped, and free it only
> > when the RXQ is destroyed, so we don't have to allocate one more if the
> > RXQ is started back again? This means the mbufs will be in the sw_ring
> > if the RXQ is stopped but not destroyed (same as what you have
> > mentioned below).
>
> My thought here was: yes, the sw_ring can stay at queue stop (probably
> even at destroy) to avoid aborting related TXQ operation.
> Though the actual HW descriptor ring will be de-initialized/destroyed as
> usual. That's why I think we need to decouple the sw_ring from the queue
> and its HW ring.
>
> >> 2. Make RXQ sw_ring 'direct' rearming driven by the TXQ itself, i.e.:
> >> at txq_free_bufs() try to store released mbufs inside the attached
> >> sw_ring directly. If there is no attached sw_ring, or not enough
> >> free space in it - continue with mempool_put() as usual.
> >> Note that actual arming of HW RXDs still remains the responsibility
> >
> > What problems do you see if we arm the HW RXDs (assuming that we do not
> > have to support this across multiple devices)?
>
> Same as in the vice-versa scenario:
> - the RXQ might already be stopped while the TXQ is still running;
> - also, it doesn't sound right from a design point of view to allow
>   TX code to manipulate the RX HW queue directly, and vice-versa.
>
> >> of the RX code-path:
> >>
> >> rxq_rearm(rxq) {
> >>     ...
> >>     - check if there are N already filled entries inside rxq_sw_ring;
> >>       if not, populate them from the mempool (usual mempool_get()).
> >>     - arm the related RXDs and mark these sw_ring entries as managed
> >>       by HW.
> >>     ...
> >> }
> >>
> >> So the rxq_sw_ring will serve two purposes:
> >> - track mbufs that are managed by HW (that's what it does now)
> >> - private (per-RXQ) mbuf cache
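To make the rearm flow above concrete, here is a rough C sketch of what such
an rxq_rearm() could look like. Everything in it - struct my_rxq, the
rxq_sw_ring layout and all field names - is an assumption made up for
illustration, not existing DPDK or PMD code, and the dual-purpose sw_ring is
simplified into a plain mbuf cache plus a separate HW tracking array:

    #include <stdint.h>
    #include <rte_mbuf.h>
    #include <rte_mempool.h>

    /* Illustrative layout only: per-RXQ cache of mbufs pre-filled by TXQs. */
    struct rxq_sw_ring {
        struct rte_mbuf *entries[1024];
        uint16_t filled;                 /* entries ready to be armed */
    };

    /* Illustrative RX queue; real PMDs keep their own descriptor layout. */
    struct my_rxq {
        struct rxq_sw_ring *sw_ring;
        struct rte_mempool *mp;
        volatile uint64_t *rxd_addr;     /* stand-in for HW RX descriptor ring */
        struct rte_mbuf **hw_mbufs;      /* mbufs currently owned by HW */
        uint16_t rearm_idx;              /* next HW descriptor to rearm */
    };

    static inline void
    rxq_rearm(struct my_rxq *rxq, uint16_t n)
    {
        struct rxq_sw_ring *sr = rxq->sw_ring;
        uint16_t i;

        /* Not enough entries pre-filled by the TX side: top up from mempool. */
        if (sr->filled < n) {
            void **gap = (void **)&sr->entries[sr->filled];
            if (rte_mempool_get_bulk(rxq->mp, gap, n - sr->filled) != 0)
                return;          /* mempool empty, retry on a later call */
            sr->filled = n;
        }

        /* Arm the related RXDs; these mbufs are now managed by HW. */
        for (i = 0; i < n; i++) {
            struct rte_mbuf *m = sr->entries[--sr->filled];
            rxq->hw_mbufs[rxq->rearm_idx + i] = m;
            rxq->rxd_addr[rxq->rearm_idx + i] =
                rte_mbuf_data_iova_default(m);
        }
        rxq->rearm_idx += n;             /* real code wraps modulo ring size */
    }

The fallback to rte_mempool_get_bulk() keeps the behaviour identical to
today's rearm path whenever no TXQ has pre-filled the cache.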
> >>
> >> Now, if the TXQ is stopped while the RXQ is running - no extra
> >> synchronization is required; the RXQ would just use mempool_get() to
> >> rearm its sw_ring itself.
> >
> > There would be some synchronization required which tells the data plane
> > threads not to access the TXQ before the TXQ is stopped. Other than this,
> > no extra synchronization is required.
> >
> > For the current patch, we could use a similar approach, i.e. when the TXQ
> > is stopped, it does not free the mbufs from the sw_ring; we just let the
> > RXQ consume all the mbufs from the TX side sw_ring till it is empty.
> >
> >> If the RXQ is stopped while the TXQ is still running - the TXQ can still
> >> continue to populate the related sw_ring till it gets full.
> >> Then it will continue with mempool_put() as usual.
> >> Of course it means that a user who wants to use this feature should
> >> probably account for some extra mbufs for such a case, or maybe the
> >> rxq_sw_ring can have an enable/disable flag to mitigate such a situation.
> >>
> >> As another benefit here - such an approach makes it possible to use
> >> several TXQs (even from different devices) to rearm the same RXQ.
> >
> > Being able to use several TXQs is a practical use case. But I am not sure
> > if different devices in the same server is a practical scenario that we
> > need to address.
> >
> >> Have to say that I am still not sure that a 10% RX/TX improvement is
> >> worth bypassing the mempool completely and introducing all this extra
> >> complexity in the RX/TX path.
> >
> > It is more; clarified above.
> >
> >> But if we still decide to go ahead with direct-rearming, this
> >> re-arrangement, I think, should help to keep things clear and avoid
> >> introducing new limitations in existing functionality.
> >>
> >> WDYT?
> >
> > I had another idea. This requires the application to change, which might
> > be OK given that it is a new feature/optimization.
> >
> > We could expose a new API to the application which allows the RX side to
> > take buffers from the TX side. The application would call this additional
> > API just before calling the eth_rx_burst API.
> >
> > The advantages I see with this approach are:
> > 1) The static mapping of the ports is not required; the mapping is
> > dynamic. This allows the application to make decisions based on current
> > conditions in the data plane.
> > 2) The code is simple, because it does not have to check for many
> > conditions. The application can make better decisions as it knows the
> > scenario well. Not all applications have to take the hit of conditionals.
> > 3) The existing synchronization used by the applications (to stop
> > RXQ/TXQ) is sufficient. No new synchronization required.
>
> That sounds like a very good idea to me.
> I think it will allow us to avoid all the shortcomings that are present in
> the original approach.
> Again, if we like, we can even allow the user to feed the RXQ from some
> other sources (e.g. dropped packets).
> Just to confirm that we are talking about the same thing:
>
> 1. rte_ethdev will have an API for the user to get the sw_ring handle from
> the RXQ, something like:
>
> struct rxq_sw_ring *
> rte_eth_rxq_get_sw_ring(uint16_t port_id, uint16_t rxq_id);
>
> We probably can even keep 'struct rxq_sw_ring' as an internal one, and use
> it as a handle.
>
> 2. rte_ethdev will have an API that will ask the TXQ to free mbufs and
> fill the given rxq_sw_ring:
>
> int
> rte_eth_txq_fill_sw_ring(uint16_t port_id, uint16_t txq_id,
>                          struct rxq_sw_ring *);
>
> So the app code will look something like:
>
> rxq_sw_ring = rxq_get_sw_ring(rx_port, rxq);
> ...
> while (...) {
>     n = eth_tx_burst(tx_port, txq, pkts, N);
>     ...
>     eth_txq_fill_sw_ring(tx_port, txq, rxq_sw_ring);
>     n = eth_rx_burst(rx_port, rxq, pkts, N);
>     ...
> }

I think the above loop can be re-written as follows:
while (...) {
    eth_txq_fill_sw_ring(tx_port, txq, rxq_sw_ring);
    n = eth_rx_burst(rx_port, rxq, pkts, N);
    ...
    n = eth_tx_burst(tx_port, txq, pkts, N);
    ...
}

> Did I get your idea right?
>
> > The disadvantage is that it is another function pointer call which will
> > reduce some performance.
>
> >> Konstantin
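To complement the sketch earlier in the thread, here is a rough illustration
of what the TX-side fill step (the behaviour proposed for
rte_eth_txq_fill_sw_ring(): store released mbufs into the attached
rxq_sw_ring, otherwise fall back to mempool_put() as usual) might do inside a
driver. Again, struct my_txq, the rxq_sw_ring layout and all names are
assumptions for this sketch, not an existing rte_ethdev API or driver code:

    #include <stdint.h>
    #include <rte_common.h>
    #include <rte_mbuf.h>
    #include <rte_mempool.h>

    /* Same illustrative per-RXQ cache layout as in the earlier sketch. */
    struct rxq_sw_ring {
        struct rte_mbuf *entries[1024];
        uint16_t filled;
    };

    /* Illustrative TX queue state; real PMDs track completions differently. */
    struct my_txq {
        struct rte_mbuf **sw_ring;   /* mbufs attached to TX descriptors */
        uint16_t next_free;          /* first completed, not yet freed slot */
    };

    /* Free up to 'n' completed TX mbufs: hand them to the RX cache while it
     * has room, otherwise fall back to the usual mempool_put() path. */
    static inline void
    txq_fill_rxq_sw_ring(struct my_txq *txq, struct rxq_sw_ring *sr, uint16_t n)
    {
        uint16_t i;

        for (i = 0; i < n; i++) {
            struct rte_mbuf *m =
                rte_pktmbuf_prefree_seg(txq->sw_ring[txq->next_free + i]);

            if (m == NULL)
                continue;                        /* still referenced, skip */

            if (sr->filled < RTE_DIM(sr->entries))
                sr->entries[sr->filled++] = m;   /* direct re-arm source */
            else
                rte_mempool_put(m->pool, m);     /* cache full: usual path */
        }
        txq->next_free += n;                     /* real code wraps the index */
    }

A real implementation would also have to decide how mbuf headers are reset
before reuse and how the two queues agree on the rxq_sw_ring handle; those
details are left out here.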