> On 2021/6/9 18:25, Ananyev, Konstantin wrote: > >>> On 4/23/21 12:46 PM, Chengchang Tang wrote: > >>>> To use the HW offloads capability (e.g. checksum and TSO) in the Tx > >>>> direction, the upper-layer users need to call rte_eth_dev_prepare to do > >>>> some adjustment to the packets before sending them (e.g. processing > >>>> pseudo headers when Tx checksum offoad enabled). But, the tx_prepare > >>>> callback of the bond driver is not implemented. Therefore, related > >>>> offloads can not be used unless the upper layer users process the packet > >>>> properly in their own application. But it is bad for the > >>>> transplantability. > >>>> > >>>> However, it is difficult to design the tx_prepare callback for bonding > >>>> driver. Because when a bonded device sends packets, the bonded device > >>>> allocates the packets to different slave devices based on the real-time > >>>> link status and bonding mode. That is, it is very difficult for the > >>>> bonding device to determine which slave device's prepare function should > >>>> be invoked. In addition, if the link status changes after the packets are > >>>> prepared, the packets may fail to be sent because packets allocation may > >>>> change. > >>>> > >>>> So, in this patch, the tx_prepare callback of bonding driver is not > >>>> implemented. Instead, the rte_eth_dev_tx_prepare() will be called for > >>>> all the fast path packet in mode 0, 1, 2, 4, 5, 6. In this way, all > >>>> tx_offloads can be processed correctly for all NIC devices in these > >>>> modes. > >>>> If tx_prepare is not required in some cases, then slave PMDs tx_prepare > >>>> pointer should be NULL and rte_eth_tx_prepare() will be just a NOOP. > >>>> In these cases, the impact on performance will be very limited. It is > >>>> the responsibility of the slave PMDs to decide when the real tx_prepare > >>>> needs to be used. The information from dev_config/queue_setup is > >>>> sufficient for them to make these decisions. > >>>> > >>>> Note: > >>>> The rte_eth_tx_prepare is not added to bond mode 3(Broadcast). This is > >>>> because in broadcast mode, a packet needs to be sent by all slave ports. > >>>> Different PMDs process the packets differently in tx_prepare. As a > >>>> result, > >>>> the sent packet may be incorrect. > >>>> > >>>> Signed-off-by: Chengchang Tang <tangchengch...@huawei.com> > >>>> --- > >>>> drivers/net/bonding/rte_eth_bond.h | 1 - > >>>> drivers/net/bonding/rte_eth_bond_pmd.c | 28 ++++++++++++++++++++++++---- > >>>> 2 files changed, 24 insertions(+), 5 deletions(-) > >>>> > >>>> diff --git a/drivers/net/bonding/rte_eth_bond.h > >>>> b/drivers/net/bonding/rte_eth_bond.h > >>>> index 874aa91..1e6cc6d 100644 > >>>> --- a/drivers/net/bonding/rte_eth_bond.h > >>>> +++ b/drivers/net/bonding/rte_eth_bond.h > >>>> @@ -343,7 +343,6 @@ rte_eth_bond_link_up_prop_delay_set(uint16_t > >>>> bonded_port_id, > >>>> int > >>>> rte_eth_bond_link_up_prop_delay_get(uint16_t bonded_port_id); > >>>> > >>>> - > >>>> #ifdef __cplusplus > >>>> } > >>>> #endif > >>>> diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c > >>>> b/drivers/net/bonding/rte_eth_bond_pmd.c > >>>> index 2e9cea5..84af348 100644 > >>>> --- a/drivers/net/bonding/rte_eth_bond_pmd.c > >>>> +++ b/drivers/net/bonding/rte_eth_bond_pmd.c > >>>> @@ -606,8 +606,14 @@ bond_ethdev_tx_burst_round_robin(void *queue, > >>>> struct rte_mbuf **bufs, > >>>> /* Send packet burst on each slave device */ > >>>> for (i = 0; i < num_of_slaves; i++) { > >>>> if (slave_nb_pkts[i] > 0) { > >>>> + int nb_prep_pkts; > >>>> + > >>>> + nb_prep_pkts = rte_eth_tx_prepare(slaves[i], > >>>> + bd_tx_q->queue_id, > >>>> slave_bufs[i], > >>>> + slave_nb_pkts[i]); > >>>> + > >>> > >>> Shouldn't it be called iff queue Tx offloads are not zero? > >>> It will allow to decrease performance degradation if no > >>> Tx offloads are enabled. Same in all cases below. > >> > >> Regarding this point, it has been discussed in the previous RFC: > >> https://inbox.dpdk.org/dev/47f907cf-3933-1de9-9c45-6734b912e...@huawei.com/ > >> > >> According to the TX_OFFLOAD status of the current device, PMDs can > >> determine > >> whether tx_prepare is currently needed. If it is not needed, set > >> pkt_tx_prepare > >> to NULL, so that the actual tx_prepare processing will be skipped directly > >> in > >> rte_eth_tx_prepare(). > >> > >>> > >>>> num_tx_slave = rte_eth_tx_burst(slaves[i], > >>>> bd_tx_q->queue_id, > >>>> - slave_bufs[i], > >>>> slave_nb_pkts[i]); > >>>> + slave_bufs[i], nb_prep_pkts); > >>> > >>> In fact it is a problem here and really big problems. > >>> Tx prepare may fail and return less packets. Tx prepare > >>> of some packet may always fail. If application tries to > >>> send packets in a loop until success, it will be a > >>> forever loop here. Since application calls Tx burst, > >>> it is 100% legal behaviour of the function to return 0 > >>> if Tx ring is full. It is not an error indication. > >>> However, in the case of Tx prepare it is an error > >>> indication. > > > > Yes, that sounds like a problem and existing apps might be affected. > > > >>> > >>> Should we change Tx burst description and enforce callers > >>> to check for rte_errno? It sounds like a major change... > >>> > > > > Agree, rte_errno for tx_burst() is probably a simplest and sanest way, > > but yes, it is a change in behaviour and apps will need to be updated. > > Another option for bond PMD - just silently free mbufs for which prepare() > > fails (and probably update some stats counter). > > Again it is a change in behaviour, but now just for one PMD, with tx > > offloads enabled. > > Also as, I can see some tx_burst() function for that PMD already free > > packets silently: > > bond_ethdev_tx_burst_alb(), bond_ethdev_tx_burst_broadcast(). > > > > Actually another question - why the patch adds tx_prepare() only to some > > TX modes but not all? > > Is that itended? > > > > Yes. Currently, I have no ideal to perform tx_prepare() in broadcast mode > with limited > impact on performance. In broadcast mode, same packets will be send in > several devices. > In this process, we only update the ref_cnt of mbufs, but no copy of packets. > As we know, > tx_prepare() may change the data, so it may cause some problem if we perform > tx_prepare() > several times on the same packet.
You mean tx_prepare() for second dev can void changes made by tx_prepare() for first dev? I suppose in theory it is possible, even if it is probably not the case right now in practise (at least I am not aware about such cases). Actually that's an interesting topic - same can happen even with user implementing multicast on his own (see examples/ipv4_multicast/). I think these new limitations have to be documented clearly (at least). Also probably we need extra changes fo bond device dev_confgiure()/dev_get_info(): to check currently selected mode and based on that allow/reject tx offloads. The question arises (again) how to figure out for which tx offloads dev->tx_prepare() modifies the packet, for which not? Any thoughts here? > > >> > >> I agree that if the failure is caused by Tx ring full, it is a legal > >> behaviour. > >> But what about the failure caused by other reasons? At present, it is > >> possible > >> for some PMDs to fail during tx_burst due to other reasons. In this case, > >> repeated tries to send will also fail. > >> > >> I'm not sure if all PMDs need to support the behavior of sending packets > >> in a > >> loop until it succeeds. If not, I think the current problem can be > >> reminded to > >> the user by adding a description to the bonding. If it is necessary, I > >> think the > >> description of tx_burst should also add related instructions, so that the > >> developers > >> of PMDs can better understand how tx_burst should be designed, such as > >> putting all > >> hardware-related constraint checks into tx_prepare. And another > >> prerequisite for > >> the above behavior is that the packets must be prepared (i.e. checked by > >> rte_eth_tx_prepare()). Otherwise, it may also fail to send. This means > >> that we have > >> to use rte_eth_tx_prepare() in more scenarios. > >> > >> What's Ferruh's opinion on this? > >> > >>> [snip] > >>> > >>> . > >>> > >