[PATCH v5] kni: allow configuring the kni thread granularity
The Kni kthreads seem to be re-scheduled at a granularity of roughly 1 millisecond right now, which seems to be insufficient for performing tests involving a lot of control plane traffic. Even if KNI_KTHREAD_RESCHEDULE_INTERVAL is set to 5 microseconds, it seems that the existing code cannot reschedule at the desired granularily, due to precision constraints of schedule_timeout_interruptible(). In our use case, we leverage the Linux Kernel for control plane, and it is not uncommon to have 60K - 100K pps for some signaling protocols. Since we are not in atomic context, the usleep_range() function seems to be more appropriate for being able to introduce smaller controlled delays, in the range of 5-10 microseconds. Upon reading the existing code, it would seem that this was the original intent. Adding sub-millisecond delays, seems unfeasible with a call to schedule_timeout_interruptible(). KNI_KTHREAD_RESCHEDULE_INTERVAL 5 /* us */ schedule_timeout_interruptible( usecs_to_jiffies(KNI_KTHREAD_RESCHEDULE_INTERVAL)); Below, we attempted a brief comparison between the existing implementation, which uses schedule_timeout_interruptible() and usleep_range(). We attempt to measure the CPU usage, and RTT between two Kni interfaces, which are created on top of vmxnet3 adapters, connected by a vSwitch. insmod rte_kni.ko kthread_mode=single carrier=on schedule_timeout_interruptible(usecs_to_jiffies(5)) kni_single CPU Usage: 2-4 % [root@localhost ~]# ping 1.1.1.2 -I eth1 PING 1.1.1.2 (1.1.1.2) from 1.1.1.1 eth1: 56(84) bytes of data. 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.70 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.00 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.99 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.985 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.00 ms usleep_range(5, 10) kni_single CPU usage: 50% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.338 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.150 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.123 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.139 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.159 ms usleep_range(20, 50) kni_single CPU usage: 24% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.202 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.170 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.171 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.248 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.185 ms usleep_range(50, 100) kni_single CPU usage: 13% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.537 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.257 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.231 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.143 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.200 ms usleep_range(100, 200) kni_single CPU usage: 7% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.716 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.167 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.459 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.455 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.252 ms usleep_range(1000, 1100) kni_single CPU usage: 2% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.22 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.17 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.17 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=1.17 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.15 ms Upon testing, usleep_range(1000, 1100) seems roughly equivalent in latency and cpu usage to the variant with schedule_timeout_interruptible(), while usleep_range(100, 200) seems to give a decent tradeoff between latency and cpu usage, while allowing users to tweak the limits for improved precision if they have such use cases. Disabling RTE_KNI_PREEMPT_DEFAULT, interestingly seems to lead to a softlockup on my kernel. Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 1226 Comm: kni_single Tainted: GW O 3.10 #1 [] dump_stack+0x19/0x1b [] panic+0xcd/0x1e0 [] watchdog_timer_fn+0x160/0x160 [] __run_hrtimer.isra.4+0x42/0xd0 [] hrtimer_interrupt+0xe7/0x1f0 [] smp_apic_timer_interrupt+0x67/0xa0 [] apic_timer_interrupt+0x6d/0x80 This patch also attempts to remove this option. References: [1] https://www.kernel.org/doc/Documentation/timers/timers-howto.txt Signed-off-by: Tudor Cornea Acked-by: Padraig Connolly Reviewed-by: Ferruh Yigit --- v5: * Rebased the patch on top of the dpdk-next-net branch v4: * Removed RTE_KNI_PREEMPT_DEFAULT configuration option v3: * Fixed unwrapped commit description warning * Changed from hrtimers to Linux High Precision Timers in docs * Added two tabs at the beginning of the new params description. Stephen correctly pointed out that the descriptions of the parameters for the Kni module are nonstandard w.r.t existing kernel code. I was thinking to preserve compatibility with the existing parameters of the Kni module for the moment, while an additional clean-up
Re: [PATCH] kernel/kni: retry the xmit in case ring is full
Hi Stephen, > NAK > Doing this risks having a CPU lockup if userspace does not keep up > or the DPDK application gets stuck. > > There are better ways to solve the TCP stack queue overrun issue: > 1. Use a better queueing discipline on the kni device. The Linux default >of pfifo_fast has bufferbloat issues. Use fq_codel, fq, codel or pie? > 2. KNI should implement BQL so that TCP stack can see lock backpressure >about possible queue depth. > > Thanks for the suggestions. I agree that we risk a lockup, in case the DPDK app gets stuck. Indeed, I am running on an older Linux kernel, and the default queuing discipline is pfifo_fast. I'll experiment with the queuing disciplines you recommended. > As a simple workaround increase the KNI ring size. It won't solve the whole > problem but i tcan help > I obtained moderate success with increasing MAX_MBUF_BURST_NUM from 32 to 1024 in librte_kni. I'm not sure if such a change would be upstreamable. Perhaps it needs a bit of testing. I'll drop the current patch.
Re: [PATCH v5] kni: allow configuring the kni thread granularity
On Fri, 14 Jan 2022 at 18:44, Ferruh Yigit wrote: > On 1/14/2022 4:24 PM, Stephen Hemminger wrote: > > On Fri, 14 Jan 2022 17:18:19 +0200 > > Tudor Cornea wrote: > > > >> +module_param(min_scheduling_interval, long, 0644); > >> +MODULE_PARM_DESC(min_scheduling_interval, > >> +"\t\tKni thread min scheduling interval (default=100 microseconds):\n" > >> +"\t\t" > >> +); > >> + > >> +module_param(max_scheduling_interval, long, 0644); > >> +MODULE_PARM_DESC(max_scheduling_interval, > >> +"\t\tKni thread max scheduling interval (default=200 microseconds):\n" > >> +"\t\t" > >> +); > > > > Please don't add more bad module parameter strings. > > The KNI author did something no other kernel modules do with tabs > > and double spacing, stop this bogus stuff. > > > > The patch is good, let's not block it for the module parameter string, > all can be fixed with another patch. > > Can you please give a sample what is a common way of it, me or Tudor can > do the patch? > > I agree that the module parameter string is in non-standard format. I was planning to send a follow-up patch, which would correct the description for all of the KNI parameters (including the two new parameters that the current patch would add) in one shot. > > Is there any reason you have to use KNI at all. > > KNI is broken on many levels and is not fixable. > > What about virtio or tap? > > We've run some tests with tap interfaces and found the performance to not be good enough for our use. We're going to experiment with virtio_user in the future. I'm aware that there is a long term plan to deprecate the KNI.
[PATCH v6] kni: allow configuring the kni thread granularity
The Kni kthreads seem to be re-scheduled at a granularity of roughly 1 millisecond right now, which seems to be insufficient for performing tests involving a lot of control plane traffic. Even if KNI_KTHREAD_RESCHEDULE_INTERVAL is set to 5 microseconds, it seems that the existing code cannot reschedule at the desired granularily, due to precision constraints of schedule_timeout_interruptible(). In our use case, we leverage the Linux Kernel for control plane, and it is not uncommon to have 60K - 100K pps for some signaling protocols. Since we are not in atomic context, the usleep_range() function seems to be more appropriate for being able to introduce smaller controlled delays, in the range of 5-10 microseconds. Upon reading the existing code, it would seem that this was the original intent. Adding sub-millisecond delays, seems unfeasible with a call to schedule_timeout_interruptible(). KNI_KTHREAD_RESCHEDULE_INTERVAL 5 /* us */ schedule_timeout_interruptible( usecs_to_jiffies(KNI_KTHREAD_RESCHEDULE_INTERVAL)); Below, we attempted a brief comparison between the existing implementation, which uses schedule_timeout_interruptible() and usleep_range(). We attempt to measure the CPU usage, and RTT between two Kni interfaces, which are created on top of vmxnet3 adapters, connected by a vSwitch. insmod rte_kni.ko kthread_mode=single carrier=on schedule_timeout_interruptible(usecs_to_jiffies(5)) kni_single CPU Usage: 2-4 % [root@localhost ~]# ping 1.1.1.2 -I eth1 PING 1.1.1.2 (1.1.1.2) from 1.1.1.1 eth1: 56(84) bytes of data. 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.70 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.00 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.99 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.985 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.00 ms usleep_range(5, 10) kni_single CPU usage: 50% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.338 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.150 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.123 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.139 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.159 ms usleep_range(20, 50) kni_single CPU usage: 24% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.202 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.170 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.171 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.248 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.185 ms usleep_range(50, 100) kni_single CPU usage: 13% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.537 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.257 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.231 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.143 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.200 ms usleep_range(100, 200) kni_single CPU usage: 7% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.716 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.167 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.459 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.455 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.252 ms usleep_range(1000, 1100) kni_single CPU usage: 2% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.22 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.17 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.17 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=1.17 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.15 ms Upon testing, usleep_range(1000, 1100) seems roughly equivalent in latency and cpu usage to the variant with schedule_timeout_interruptible(), while usleep_range(100, 200) seems to give a decent tradeoff between latency and cpu usage, while allowing users to tweak the limits for improved precision if they have such use cases. Disabling RTE_KNI_PREEMPT_DEFAULT, interestingly seems to lead to a softlockup on my kernel. Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 1226 Comm: kni_single Tainted: GW O 3.10 #1 [] dump_stack+0x19/0x1b [] panic+0xcd/0x1e0 [] watchdog_timer_fn+0x160/0x160 [] __run_hrtimer.isra.4+0x42/0xd0 [] hrtimer_interrupt+0xe7/0x1f0 [] smp_apic_timer_interrupt+0x67/0xa0 [] apic_timer_interrupt+0x6d/0x80 This patch also attempts to remove this option. References: [1] https://www.kernel.org/doc/Documentation/timers/timers-howto.txt Signed-off-by: Tudor Cornea Acked-by: Padraig Connolly Reviewed-by: Ferruh Yigit --- v6: * Removed tabs and newline in the description of the min_scheduling_interval and max_scheduling_interval parameters. They seem to be non-standard. In a quick glance over the Linux tree, I saw some (rare) usages of newlines and tabs: drivers/scsi/bnx2fc/bnx2fc_fcoe.c (debug_logging) Fixing the other parameters might mean that we have to chop some text, otherwise the line could probably get too big. v5: * Rebased the patch on top of the dpdk-next-net branch v4: * Removed RTE_KNI_PREEMPT_DEFAULT configuration option v3: * Fixed unwrapped commit
[dpdk-dev] [PATCH] net/ixgbe: initialize max_rx_pkt_len if rlpml_set_vf fails
It seems that if the call to ixgbevf_rlpml_set_vf fails, we will not initialize dev_conf.rxmode.max_rx_pkt_len correctly anymore. This happens with a 82599EB NIC and a VMware ESXI 6.0 setup, and is causing VF the ports to fail to initialize We see the following error: ixgbevf_dev_rx_init(): Set max packet length to 1518 failed. Investigating over DPDK 19.11, it seems that the call still fails, but it doesn't exit prematurely, and max_rx_pkt_len is initialized in the respective case. On the ESXi server, we seem to have the following driver net-ixgbe: 3.7.13.7.14iov-20vmw.600.0.0.2494585 It seems that the behavior related to MTU setting has changed since the following commit: commit c77866a16904 ("net/ixgbe: detect failed VF MTU set") We would like to still be able to support older setups if possible, as we might have customers running ESXi 6.0 or 6.5, and these seem to have an older version of the PF driver as default. Signed-off-by: Tudor Cornea --- drivers/net/ixgbe/ixgbe_ethdev.c | 5 +++-- drivers/net/ixgbe/ixgbe_rxtx.c | 1 - 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c index 4dbe049..4301dfd 100644 --- a/drivers/net/ixgbe/ixgbe_ethdev.c +++ b/drivers/net/ixgbe/ixgbe_ethdev.c @@ -6369,6 +6369,9 @@ ixgbevf_dev_set_mtu(struct rte_eth_dev *dev, uint16_t mtu) return -EINVAL; } + /* update max frame size */ + dev->data->dev_conf.rxmode.max_rx_pkt_len = max_frame; + /* * When supported by the underlying PF driver, use the IXGBE_VF_SET_MTU * request of the version 2.0 of the mailbox API. @@ -6381,8 +6384,6 @@ ixgbevf_dev_set_mtu(struct rte_eth_dev *dev, uint16_t mtu) if (ixgbevf_rlpml_set_vf(hw, max_frame)) return -EINVAL; - /* update max frame size */ - dev->data->dev_conf.rxmode.max_rx_pkt_len = max_frame; return 0; } diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c index 0ac89cb..02d9809 100644 --- a/drivers/net/ixgbe/ixgbe_rxtx.c +++ b/drivers/net/ixgbe/ixgbe_rxtx.c @@ -5677,7 +5677,6 @@ ixgbevf_dev_rx_init(struct rte_eth_dev *dev) (uint16_t)dev->data->dev_conf.rxmode.max_rx_pkt_len)) { PMD_INIT_LOG(ERR, "Set max packet length to %d failed.", dev->data->dev_conf.rxmode.max_rx_pkt_len); - return -EINVAL; } /* -- 2.7.4
Re: [dpdk-dev] [PATCH] net/ixgbe: initialize max_rx_pkt_len if rlpml_set_vf fails
Some of our customers use ESXi 6.0 or 6.5 servers, which could have older versions of the PF ixgbe driver. It seems that with a more recent version of the PMD driver, we are not able to initialize 82599EB ports correctly. This scenario seems to have worked with DPDK 19.11. Would it be possible to print a warning, while still allowing the driver to initialize the ports ? I was also thinking about the return code of ixgbevf_dev_set_mtu. Do you think it would be more appropriate to return ENOTSUP or ENOSYS instead of EINVAL ? As a user, calling 'rte_eth_dev_mtu_set', I would expect an error like EINVAL to suggest to me that the mtu value which I provided is incorrect [1]. The 82599 NIC, however has some particularities related to mtu, which could cause the operation to fail. I was thinking that EINVAL might not be most descriptive error. [1] https://doc.dpdk.org/api/rte__ethdev_8h.html Thanks, Tudor On Fri, 15 Oct 2021 at 17:06, Tudor Cornea wrote: > It seems that if the call to ixgbevf_rlpml_set_vf fails, > we will not initialize dev_conf.rxmode.max_rx_pkt_len correctly anymore. > > This happens with a 82599EB NIC and a VMware ESXI 6.0 setup, > and is causing VF the ports to fail to initialize > > We see the following error: > ixgbevf_dev_rx_init(): Set max packet length to 1518 failed. > > Investigating over DPDK 19.11, it seems that the call still fails, > but it doesn't exit prematurely, and max_rx_pkt_len is initialized in > the respective case. > > On the ESXi server, we seem to have the following driver > net-ixgbe: 3.7.13.7.14iov-20vmw.600.0.0.2494585 > > It seems that the behavior related to MTU setting has changed > since the following commit: > > commit c77866a16904 ("net/ixgbe: detect failed VF MTU set") > > We would like to still be able to support older setups if possible, > as we might have customers running ESXi 6.0 or 6.5, and these seem > to have an older version of the PF driver as default. > > Signed-off-by: Tudor Cornea > --- > drivers/net/ixgbe/ixgbe_ethdev.c | 5 +++-- > drivers/net/ixgbe/ixgbe_rxtx.c | 1 - > 2 files changed, 3 insertions(+), 3 deletions(-) > > diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c > b/drivers/net/ixgbe/ixgbe_ethdev.c > index 4dbe049..4301dfd 100644 > --- a/drivers/net/ixgbe/ixgbe_ethdev.c > +++ b/drivers/net/ixgbe/ixgbe_ethdev.c > @@ -6369,6 +6369,9 @@ ixgbevf_dev_set_mtu(struct rte_eth_dev *dev, > uint16_t mtu) > return -EINVAL; > } > > + /* update max frame size */ > + dev->data->dev_conf.rxmode.max_rx_pkt_len = max_frame; > + > /* > * When supported by the underlying PF driver, use the > IXGBE_VF_SET_MTU > * request of the version 2.0 of the mailbox API. > @@ -6381,8 +6384,6 @@ ixgbevf_dev_set_mtu(struct rte_eth_dev *dev, > uint16_t mtu) > if (ixgbevf_rlpml_set_vf(hw, max_frame)) > return -EINVAL; > > - /* update max frame size */ > - dev->data->dev_conf.rxmode.max_rx_pkt_len = max_frame; > return 0; > } > > diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c > b/drivers/net/ixgbe/ixgbe_rxtx.c > index 0ac89cb..02d9809 100644 > --- a/drivers/net/ixgbe/ixgbe_rxtx.c > +++ b/drivers/net/ixgbe/ixgbe_rxtx.c > @@ -5677,7 +5677,6 @@ ixgbevf_dev_rx_init(struct rte_eth_dev *dev) > (uint16_t)dev->data->dev_conf.rxmode.max_rx_pkt_len)) { > PMD_INIT_LOG(ERR, "Set max packet length to %d failed.", > dev->data->dev_conf.rxmode.max_rx_pkt_len); > - return -EINVAL; > } > > /* > -- > 2.7.4 > >
Re: [dpdk-dev] [PATCH] net/ixgbe: initialize max_rx_pkt_len if rlpml_set_vf fails
Hi Ferruh, I have tested with the dpdk-next-net branch. It includes the commit 'ethdev: fix max Rx packet length' Setup: Hypervisor: VMware ESXi 6.0.0 PF ixgbe Driver: 3.7.13.7 (default PF driver installed with ESXi 6.0 and 6.5) NIC: Intel 82599 Guest OS : Ubuntu 20.04 It seems that with the recent changes testpmd still can't initialize the ports. EAL: Detected CPU lcores: 8 EAL: Detected NUMA nodes: 1 EAL: Detected static linkage of DPDK EAL: Multi-process socket /var/run/dpdk/rte/mp_socket EAL: Selected IOVA mode 'PA' EAL: No available 1048576 kB hugepages reported EAL: VFIO support initialized EAL: Probe PCI driver: net_ixgbe_vf (8086:10ed) device: :0b:00.0 (socket 0) EAL: Probe PCI driver: net_ixgbe_vf (8086:10ed) device: :13:00.0 (socket 0) TELEMETRY: No legacy callbacks, legacy socket not created Interactive-mode selected previous number of forwarding ports 2 - changed to number of configured ports 1 Error picking flow transfer proxy for port 0: Function not implemented - ignore Error picking flow transfer proxy for port 1: Function not implemented - ignore testpmd: create a new mbuf pool : n=171456, size=2176, socket=0 testpmd: preferred mempool ops selected: ring_mp_mc Warning! port-topology=paired and odd forward ports number, the last port will pair with itself. Configuring Port 0 (socket 0) ixgbevf_dev_rx_init(): Set max packet length to 1518 failed. ixgbevf_dev_start(): Unable to initialize RX hardware (-22) Fail to start port 0: Invalid argument Configuring Port 1 (socket 0) ixgbevf_dev_rx_init(): Set max packet length to 1518 failed. ixgbevf_dev_start(): Unable to initialize RX hardware (-22) Fail to start port 1: Invalid argument Please stop the ports first Done Error during enabling promiscuous mode for port 0: Operation not supported - ignore Error during enabling promiscuous mode for port 1: Operation not supported - ignore testpmd> start tx_first Not all ports were started I think failing to set set 'max_rx_pkt_len' after ixgbevf_rlpml_set_vf() call failed in ixgbevf_dev_set_mtu(), might have been one half of the problem. The other one is in ixgbevf_dev_rx_init(). The init function seems to return prematurely, and not initialize the ports. With the below patch (on top of net-next branch), I seem to be able to initialize the ports correctly, and send packets using testpmd. diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c index b263dfe1d5..fdd057c789 100644 --- a/drivers/net/ixgbe/ixgbe_rxtx.c +++ b/drivers/net/ixgbe/ixgbe_rxtx.c @@ -5676,7 +5676,6 @@ ixgbevf_dev_rx_init(struct rte_eth_dev *dev) if (ixgbevf_rlpml_set_vf(hw, frame_size) != 0) { PMD_INIT_LOG(ERR, "Set max packet length to %d failed.", frame_size); - return -EINVAL; } /* Alvin, would it be acceptable to not return -EINVAL in this case while still printing an error, so that we can still debug mtu issues on 82599 ? If I recall correctly, the NIC has issues when the MTU of VFs is bigger than the MTU of the PFs. On my setup, however the MTUs have default values (1500). Should I rebase the patch on top of net-next ? Thanks, Tudor On Wed, 20 Oct 2021 at 06:08, Zhang, AlvinX wrote: > > -Original Message- > > From: Yigit, Ferruh > > Sent: Tuesday, October 19, 2021 8:58 PM > > To: Tudor Cornea ; Zhang, Qi Z > > > > Cc: Zhang, AlvinX ; Wang, Haiyue > > ; dev@dpdk.org > > Subject: Re: [dpdk-dev] [PATCH] net/ixgbe: initialize max_rx_pkt_len if > > rlpml_set_vf fails > > > > On 10/15/2021 3:20 PM, Tudor Cornea wrote: > > > Some of our customers use ESXi 6.0 or 6.5 servers, which could have > > > older versions of the PF ixgbe driver. > > > It seems that with a more recent version of the PMD driver, we are not > > > able to initialize 82599EB ports correctly. > > > This scenario seems to have worked with DPDK 19.11. > > > > > > Would it be possible to print a warning, while still allowing the > > > driver to initialize the ports ? > > There is a scenario that we initialize the port successful, with no any > error, but we cannot get any packets. > So keep printing error or warning and still allowing the driver to > initialize the port may be the best solution. > > > > > > > I was also thinking about the return code of ixgbevf_dev_set_mtu. > > > Do you think it would be more appropriate to return ENOTSUP or ENOSYS > > > instead of EINVAL ? > > If ixgbevf_dev_set_mtu fails, we have no way to know it because of invalid > value or not supported? > ixgbevf_dev_set_mtu returns one the these values: 0 or IXGBE_ERR_MBX > > > > > > > As a user, calling 'rte_eth_dev_mtu_set', I would expect an error like > > > EINVAL to
[dpdk-dev] [PATCH v2] net/ixgbe: initialize port even if mtu config fails
On a VMware ESXi 6.0 setup with an Intel 82599 NIC the ports don't seem to initialize anymore, while running testpmd. Configuring Port 0 (socket 0) ixgbevf_dev_rx_init(): Set max packet length to 1518 failed. ixgbevf_dev_start(): Unable to initialize RX hardware (-22) Fail to start port 0: Invalid argument Configuring Port 1 (socket 0) ixgbevf_dev_rx_init(): Set max packet length to 1518 failed. ixgbevf_dev_start(): Unable to initialize RX hardware (-22) Fail to start port 1: Invalid argument Please stop the ports first If the call to ixgbevf_rlpml_set_vf fails and we return prematurely, we will not be able to initialize the ports correctly. The behavior seems to have changed since the following commit: commit c77866a16904 ("net/ixgbe: detect failed VF MTU set") We can make this particular use case work correctly if we don't return an error, which seems to be consistent with the overall kernel ixgbevf implementation. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c#n2015 Signed-off-by: Tudor Cornea --- v2: * Change title * Remove max_rx_pkt_len fix in ixgbe_ethdev.c It's already fixed as part of Ferruh's changes in next-net branch, so this part should be redundant, now --- drivers/net/ixgbe/ixgbe_rxtx.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c index b263dfe..a51450f 100644 --- a/drivers/net/ixgbe/ixgbe_rxtx.c +++ b/drivers/net/ixgbe/ixgbe_rxtx.c @@ -5673,11 +5673,9 @@ ixgbevf_dev_rx_init(struct rte_eth_dev *dev) * ixgbevf_rlpml_set_vf even if jumbo frames are not used. This way, * VF packets received can work in all cases. */ - if (ixgbevf_rlpml_set_vf(hw, frame_size) != 0) { + if (ixgbevf_rlpml_set_vf(hw, frame_size) != 0) PMD_INIT_LOG(ERR, "Set max packet length to %d failed.", frame_size); - return -EINVAL; - } /* * Assume no header split and no VLAN strip support -- 2.7.4
Re: [dpdk-dev] [PATCH v2] net/ixgbe: initialize port even if mtu config fails
On Thu, 21 Oct 2021 at 18:33, Ferruh Yigit wrote: > On 10/20/2021 7:13 PM, Tudor Cornea wrote: > > On a VMware ESXi 6.0 setup with an Intel 82599 NIC the ports don't > > seem to initialize anymore, while running testpmd. > > > > Configuring Port 0 (socket 0) > > ixgbevf_dev_rx_init(): Set max packet length to 1518 failed. > > ixgbevf_dev_start(): Unable to initialize RX hardware (-22) > > Fail to start port 0: Invalid argument > > Configuring Port 1 (socket 0) > > ixgbevf_dev_rx_init(): Set max packet length to 1518 failed. > > ixgbevf_dev_start(): Unable to initialize RX hardware (-22) > > Fail to start port 1: Invalid argument > > Please stop the ports first > > > > If the call to ixgbevf_rlpml_set_vf fails and we return prematurely, > > we will not be able to initialize the ports correctly. > > > > The behavior seems to have changed since the following commit: > > > > commit c77866a16904 ("net/ixgbe: detect failed VF MTU set") > > > > Hi Tudor, > > We document this with explicit 'Fixes' tag, this also helps up to > manage backporting patches to LTS releases, so updating as: > > Fixes: 3a6bfc37eaf4 ("net/ice: support QoS config VF bandwidth in > DCF") > Cc: sta...@dpdk.org > > Also we use 'fix' verb in the patch title almost as keyword, again > to help deciding which patch to backport, also to clarify impact of > the patch, so will update patch title as: > > net/ixgbe: fix port initialization if MTU config fails > > > For more details please check contribution guide: > https://doc.dpdk.org/guides/contributing/patches.html > > Thanks, > ferruh > > > We can make this particular use case work correctly if we don't > > return an error, which seems to be consistent with the overall > > kernel ixgbevf implementation. > > > > [1] > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c#n2015 > > > > The code that this link references can change by time as code changes, > I will update it as following to bind it a specific version (v5.14): > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c?h=v5.14#n2015 > > > Signed-off-by: Tudor Cornea > > > > --- > > v2: > > * Change title > > * Remove max_rx_pkt_len fix in ixgbe_ethdev.c > >It's already fixed as part of Ferruh's changes in next-net branch, > >so this part should be redundant, now > > --- > > drivers/net/ixgbe/ixgbe_rxtx.c | 4 +--- > > 1 file changed, 1 insertion(+), 3 deletions(-) > > > > diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c > b/drivers/net/ixgbe/ixgbe_rxtx.c > > index b263dfe..a51450f 100644 > > --- a/drivers/net/ixgbe/ixgbe_rxtx.c > > +++ b/drivers/net/ixgbe/ixgbe_rxtx.c > > @@ -5673,11 +5673,9 @@ ixgbevf_dev_rx_init(struct rte_eth_dev *dev) > >* ixgbevf_rlpml_set_vf even if jumbo frames are not used. This > way, > >* VF packets received can work in all cases. > >*/ > > - if (ixgbevf_rlpml_set_vf(hw, frame_size) != 0) { > > + if (ixgbevf_rlpml_set_vf(hw, frame_size) != 0) > > PMD_INIT_LOG(ERR, "Set max packet length to %d failed.", > >frame_size); > > - return -EINVAL; > > - } > > > > /* > >* Assume no header split and no VLAN strip support > > > > Hi Ferruh, Thanks a lot for the good observations and for pointing me in the right direction ! Tudor
[dpdk-dev] [PATCH] kni: allow configuring the kni thread granularity
The Kni kthreads seem to be re-scheduled at a granularity of roughly 1 milisecond right now, which seems to be insufficient for performing tests involving a lot of control plane traffic. Even if KNI_KTHREAD_RESCHEDULE_INTERVAL is set to 5 microseconds, it seems that the existing code cannot reschedule at the desired granularily, due to precision constraints of schedule_timeout_interruptible(). In our use case, we leverage the Linux Kernel for control plane, and it is not uncommon to have 60K - 100K pps for some signaling protocols. Since we are in non-atomic context, the usleep_range() function seems to be more appropriate for being able to introduce smaller controlled delays, in the range of 5-10 microseconds. Upon reading the existing code, it would seem that this was the original intent. Adding sub-milisecond delays, seems unfeasible with a call to schedule_timeout_interruptible(). KNI_KTHREAD_RESCHEDULE_INTERVAL 5 /* us */ schedule_timeout_interruptible( usecs_to_jiffies(KNI_KTHREAD_RESCHEDULE_INTERVAL)); Below, we attempted a brief comparison between the existing implementation, which uses schedule_timeout_interruptible() and usleep_range(). insmod rte_kni.ko kthread_mode=single carrier=on schedule_timeout_interruptible(usecs_to_jiffies(5)) kni_single CPU Usage: 2-4 % [root@localhost ~]# ping 1.1.1.2 -I eth1 PING 1.1.1.2 (1.1.1.2) from 1.1.1.1 eth1: 56(84) bytes of data. 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.70 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.00 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.99 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.985 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.00 ms usleep_range(5, 10) kni_single CPU usage: 50% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.338 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.150 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.123 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.139 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.159 ms usleep_range(20, 50) kni_single CPU usage: 24% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.202 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.170 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.171 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.248 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.185 ms usleep_range(50, 100) kni_single CPU usage: 13% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.537 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.257 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.231 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.143 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.200 ms usleep_range(100, 200) kni_single CPU usage: 7% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.716 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.167 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.459 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.455 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.252 ms usleep_range(1000, 1100) kni_single CPU usage: 2% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.22 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.17 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.17 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=1.17 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.15 ms Upon testing, usleep_range(1000, 1100) seems roughly equivalent in latency and cpu usage to the variant with schedule_timeout_interruptible(), while usleep_range(100, 200) seems to give a decent tradeoff between latency and cpu usage, while allowing users to tweak the limits for improved precision if they have such use cases. Disabling RTE_KNI_PREEMPT_DEFAULT, interestingly seems to lead to a softlockup on my kernel. Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 1226 Comm: kni_single Tainted: GW O 3.10 #1 [] dump_stack+0x19/0x1b [] panic+0xcd/0x1e0 [] watchdog_timer_fn+0x160/0x160 [] __run_hrtimer.isra.4+0x42/0xd0 [] hrtimer_interrupt+0xe7/0x1f0 [] smp_apic_timer_interrupt+0x67/0xa0 [] apic_timer_interrupt+0x6d/0x80 References: [1] https://www.kernel.org/doc/Documentation/timers/timers-howto.txt Signed-off-by: Tudor Cornea --- doc/guides/prog_guide/kernel_nic_interface.rst | 33 +++ kernel/linux/kni/kni_dev.h | 2 +- kernel/linux/kni/kni_misc.c| 36 +++--- 3 files changed, 66 insertions(+), 5 deletions(-) diff --git a/doc/guides/prog_guide/kernel_nic_interface.rst b/doc/guides/prog_guide/kernel_nic_interface.rst index 1ce03ec..5d6d535 100644 --- a/doc/guides/prog_guide/kernel_nic_interface.rst +++ b/doc/guides/prog_guide/kernel_nic_interface.rst @@ -56,6 +56,10 @@ can be specified when the module is loaded to control its behavior: off Interfaces will be created with carrier state set to off. onInterfaces will be created with carrier state set to on. (charp) +parm
Re: [dpdk-dev] [PATCH v2] net/af_packet: fix ignoring full ring on tx
On Tue, 26 Oct 2021 at 17:41, Ferruh Yigit wrote: > Hi Tudor, > > I have used testpmd, 'txonly' forwarding. Tx recovers after interface up, > but by adding some debug logs I can see 'poll()' returns with POLLOUT even > there is no space in the buffer. > > According the logic in the PMD, when 'poll()' returns success, it expects > to have some space in the Tx buffer. > > So I agree to add the check. > > Only have a question on the POLLERR, should we separate the POLLERR check > to cover ifdown case, what do you think about following logic: > > if (!TP_STATUS_AVAILABLE) { > if (poll() < 0) > break; > if (pfd.revents & POLLERR) > break; > } > > if (!TP_STATUS_AVAILABLE) > break; > > Hi Ferruh, Thanks for the suggestion. I was thinking of adding this check, and intuitively, it seems correct. I tried to do some further testing. I tested with the POLLERR check, and I don't see the issue anymore. However, if I remove the second if (!TP_STATUS_AVAILABLE), I seem to get bursts of 2048 packets followed by periods of not sending anything when toggling the interface link state. Without the second if(!TP_STATUS_AVAILABLE) statement, the issue seems to reproduce, regardless if I add the check for POLLERR or not. RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=606208 TxP=2048 RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=0 TxP=0 I will send an updated version of the patch. Thanks, Tudor
[dpdk-dev] [PATCH v3] net/af_packet: fix ignoring full ring on tx
The poll call can return POLLERR which is ignored, or it can return POLLOUT, even if there are no free frames in the mmap-ed area. We can account for both of these cases by re-checking if the next frame is empty before writing into it. We have attempted to reproduce this issue with pktgen-dpdk, using the following configuration. pktgen -l 1-4 -n 4 --proc-type=primary --no-pci --no-telemetry \ --no-huge -m 512 \ --vdev=net_af_packet0,iface=eth1,blocksz=16384,framesz=8192, \ framecnt=2048,qpairs=1,qdisc_bypass=0 \ -- \ -P \ -T \ -m "3.0" \ -f themes/black-yellow.theme We configure a low tx rate (~ 335 packets / second) and a small packet size, of about 300 Bytes from the pktgen CLI. set 0 size 300 set 0 rate 0.008 set 0 burst 1 start 0 After bringing the interface down, and up again, we seem to arrive in a state in which the tx rate is inconsistent, and does not recover. ifconfig eth1 down; sleep 7; ifconfig eth1 up [1] http://code.dpdk.org/pktgen-dpdk/pktgen-20.11.2/source/INSTALL.md Signed-off-by: Mihai Pogonaru Signed-off-by: Tudor Cornea --- v2: * Added check for POLLERR * Used tx_ring_status_available() for checking TP_STATUS_AVAILABLE --- drivers/net/af_packet/rte_eth_af_packet.c | 26 -- 1 file changed, 24 insertions(+), 2 deletions(-) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index 559f5a0..d3d3104 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -237,8 +237,30 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) } /* point at the next incoming frame */ - if (!tx_ring_status_available(ppd->tp_status) && - poll(&pfd, 1, -1) < 0) + if (!tx_ring_status_available(ppd->tp_status)) { + if (poll(&pfd, 1, -1) < 0) + break; + if (pfd.revents & POLLERR) + break; + } + + /* +* Poll can return POLLERR if the interface is down +* +* It will almost always return POLLOUT, even if there +* are no extra buffers available +* +* This happens, because packet_poll() calls datagram_poll() +* which checks the space left in the socket buffer and, +* in the case of packet_mmap, the default socket buffer length +* doesn't match the requested size for the tx_ring. +* As such, there is almost always space left in socket buffer, +* which doesn't seem to be correlated to the requested size +* for the tx_ring in packet_mmap. +* +* This results in poll() returning POLLOUT. +*/ + if (!tx_ring_status_available(ppd->tp_status)) break; /* copy the tx frame data */ -- 2.7.4
[dpdk-dev] [PATCH v2] kni: allow configuring the kni thread granularity
The Kni kthreads seem to be re-scheduled at a granularity of roughly 1 millisecond right now, which seems to be insufficient for performing tests involving a lot of control plane traffic. Even if KNI_KTHREAD_RESCHEDULE_INTERVAL is set to 5 microseconds, it seems that the existing code cannot reschedule at the desired granularily, due to precision constraints of schedule_timeout_interruptible(). In our use case, we leverage the Linux Kernel for control plane, and it is not uncommon to have 60K - 100K pps for some signaling protocols. Since we are in non-atomic context, the usleep_range() function seems to be more appropriate for being able to introduce smaller controlled delays, in the range of 5-10 microseconds. Upon reading the existing code, it would seem that this was the original intent. Adding sub-millisecond delays, seems unfeasible with a call to schedule_timeout_interruptible(). KNI_KTHREAD_RESCHEDULE_INTERVAL 5 /* us */ schedule_timeout_interruptible( usecs_to_jiffies(KNI_KTHREAD_RESCHEDULE_INTERVAL)); Below, we attempted a brief comparison between the existing implementation, which uses schedule_timeout_interruptible() and usleep_range(). insmod rte_kni.ko kthread_mode=single carrier=on schedule_timeout_interruptible(usecs_to_jiffies(5)) kni_single CPU Usage: 2-4 % [root@localhost ~]# ping 1.1.1.2 -I eth1 PING 1.1.1.2 (1.1.1.2) from 1.1.1.1 eth1: 56(84) bytes of data. 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.70 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.00 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.99 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.985 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.00 ms usleep_range(5, 10) kni_single CPU usage: 50% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.338 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.150 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.123 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.139 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.159 ms usleep_range(20, 50) kni_single CPU usage: 24% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.202 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.170 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.171 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.248 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.185 ms usleep_range(50, 100) kni_single CPU usage: 13% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.537 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.257 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.231 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.143 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.200 ms usleep_range(100, 200) kni_single CPU usage: 7% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.716 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.167 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.459 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.455 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.252 ms usleep_range(1000, 1100) kni_single CPU usage: 2% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.22 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.17 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.17 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=1.17 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.15 ms Upon testing, usleep_range(1000, 1100) seems roughly equivalent in latency and cpu usage to the variant with schedule_timeout_interruptible(), while usleep_range(100, 200) seems to give a decent tradeoff between latency and cpu usage, while allowing users to tweak the limits for improved precision if they have such use cases. Disabling RTE_KNI_PREEMPT_DEFAULT, interestingly seems to lead to a softlockup on my kernel. Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 1226 Comm: kni_single Tainted: GW O 3.10 #1 [] dump_stack+0x19/0x1b [] panic+0xcd/0x1e0 [] watchdog_timer_fn+0x160/0x160 [] __run_hrtimer.isra.4+0x42/0xd0 [] hrtimer_interrupt+0xe7/0x1f0 [] smp_apic_timer_interrupt+0x67/0xa0 [] apic_timer_interrupt+0x6d/0x80 References: [1] https://www.kernel.org/doc/Documentation/timers/timers-howto.txt Signed-off-by: Tudor Cornea --- v2: * Fixed some spelling errors --- doc/guides/prog_guide/kernel_nic_interface.rst | 33 +++ kernel/linux/kni/kni_dev.h | 2 +- kernel/linux/kni/kni_misc.c| 36 +++--- 3 files changed, 66 insertions(+), 5 deletions(-) diff --git a/doc/guides/prog_guide/kernel_nic_interface.rst b/doc/guides/prog_guide/kernel_nic_interface.rst index 1ce03ec..2dd3481 100644 --- a/doc/guides/prog_guide/kernel_nic_interface.rst +++ b/doc/guides/prog_guide/kernel_nic_interface.rst @@ -56,6 +56,10 @@ can be specified when the module is loaded to control its behavior: off Interfaces will be created with carrier state set to off. onInterfaces will be created with carrier state set to on
[dpdk-dev] [PATCH v4] net/af_packet: fix ignoring full ring on tx
The poll call can return POLLERR which is ignored, or it can return POLLOUT, even if there are no free frames in the mmap-ed area. We can account for both of these cases by re-checking if the next frame is empty before writing into it. We have attempted to reproduce this issue with pktgen-dpdk, using the following configuration. pktgen -l 1-4 -n 4 --proc-type=primary --no-pci --no-telemetry \ --no-huge -m 512 \ --vdev=net_af_packet0,iface=eth1,blocksz=16384,framesz=8192, \ framecnt=2048,qpairs=1,qdisc_bypass=0 \ -- \ -P \ -T \ -m "3.0" \ -f themes/black-yellow.theme We configure a low tx rate (~ 335 packets / second) and a small packet size, of about 300 bytes from the pktgen CLI. set 0 size 300 set 0 rate 0.008 set 0 burst 1 start 0 After bringing the interface down, and up again, we seem to arrive in a state in which the tx rate is inconsistent, and does not recover. ifconfig eth1 down; sleep 7; ifconfig eth1 up [1] http://code.dpdk.org/pktgen-dpdk/pktgen-20.11.2/source/INSTALL.md Signed-off-by: Mihai Pogonaru Signed-off-by: Tudor Cornea Reviewed-by: Ferruh Yigit --- v4: * Updated comment regarding POLLERR Removed some extra details, and left notice about poll() returning POLLERR, in case of interface link down. v3: * Added check for POLLERR * Used tx_ring_status_available() for checking TP_STATUS_AVAILABLE --- drivers/net/af_packet/rte_eth_af_packet.c | 12 ++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index 559f5a0..dd6d165 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -237,8 +237,16 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) } /* point at the next incoming frame */ - if (!tx_ring_status_available(ppd->tp_status) && - poll(&pfd, 1, -1) < 0) + if (!tx_ring_status_available(ppd->tp_status)) { + if (poll(&pfd, 1, -1) < 0) + break; + + /* poll() can return POLLERR if the interface is down */ + if (pfd.revents & POLLERR) + break; + } + + if (!tx_ring_status_available(ppd->tp_status)) break; /* copy the tx frame data */ -- 2.7.4
Re: [dpdk-dev] [PATCH v2] kni: allow configuring the kni thread granularity
On Tue, 2 Nov 2021 at 17:53, Stephen Hemminger wrote: > On Tue, 2 Nov 2021 17:51:13 +0200 > Tudor Cornea wrote: > > > +#ifdef RTE_KNI_PREEMPT_DEFAULT > > +module_param(min_scheduling_interval, long, 0644); > > +MODULE_PARM_DESC(min_scheduling_interval, > > +"Kni thread min scheduling interval (default=100 microseconds):\n" > > +"\t\t" > > +); > > Why the non-standard newline's and tab's? > Please try to make KNI look like other kernel code. > Hi Stephen, I tried to base the description of the new parameters on an existing parameter implemented for the rte_kni module - carrier. module_param(carrier, charp, 0644); MODULE_PARM_DESC(carrier, "Default carrier state for KNI interface (default=off):\n" "\t\toff Interfaces will be created with carrier state set to off.\n" "\t\tonInterfaces will be created with carrier state set to on.\n" "\t\t" ); I thought about keeping the compatibility in terms of coding style with the existing Kni module parameters. Upon browsing the Linux tree, I realise it might not be standard ( checkpatch.pl , interestingly didn't seem to complain about the patch) I also realise now, that I missed two tabs at the beginning of the params description. Should I add the missing tabs, so that the new parameters that I intend to add through this patch are similar in style to the existing ones, or should I remove the newlines and tabs altogether, when specifying the description for min_scheduling_interval and max_scheduling_interval ? Thanks, Tudor
[dpdk-dev] [PATCH v3] kni: allow configuring the kni thread granularity
The Kni kthreads seem to be re-scheduled at a granularity of roughly 1 millisecond right now, which seems to be insufficient for performing tests involving a lot of control plane traffic. Even if KNI_KTHREAD_RESCHEDULE_INTERVAL is set to 5 microseconds, it seems that the existing code cannot reschedule at the desired granularily, due to precision constraints of schedule_timeout_interruptible(). In our use case, we leverage the Linux Kernel for control plane, and it is not uncommon to have 60K - 100K pps for some signaling protocols. Since we are not in atomic context, the usleep_range() function seems to be more appropriate for being able to introduce smaller controlled delays, in the range of 5-10 microseconds. Upon reading the existing code, it would seem that this was the original intent. Adding sub-millisecond delays, seems unfeasible with a call to schedule_timeout_interruptible(). KNI_KTHREAD_RESCHEDULE_INTERVAL 5 /* us */ schedule_timeout_interruptible( usecs_to_jiffies(KNI_KTHREAD_RESCHEDULE_INTERVAL)); Below, we attempted a brief comparison between the existing implementation, which uses schedule_timeout_interruptible() and usleep_range(). We attempt to measure the CPU usage, and RTT between two Kni interfaces, which are created on top of vmxnet3 adapters, connected by a vSwitch. insmod rte_kni.ko kthread_mode=single carrier=on schedule_timeout_interruptible(usecs_to_jiffies(5)) kni_single CPU Usage: 2-4 % [root@localhost ~]# ping 1.1.1.2 -I eth1 PING 1.1.1.2 (1.1.1.2) from 1.1.1.1 eth1: 56(84) bytes of data. 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.70 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.00 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.99 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.985 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.00 ms usleep_range(5, 10) kni_single CPU usage: 50% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.338 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.150 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.123 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.139 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.159 ms usleep_range(20, 50) kni_single CPU usage: 24% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.202 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.170 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.171 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.248 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.185 ms usleep_range(50, 100) kni_single CPU usage: 13% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.537 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.257 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.231 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.143 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.200 ms usleep_range(100, 200) kni_single CPU usage: 7% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.716 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.167 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.459 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.455 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.252 ms usleep_range(1000, 1100) kni_single CPU usage: 2% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.22 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.17 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.17 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=1.17 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.15 ms Upon testing, usleep_range(1000, 1100) seems roughly equivalent in latency and cpu usage to the variant with schedule_timeout_interruptible(), while usleep_range(100, 200) seems to give a decent tradeoff between latency and cpu usage, while allowing users to tweak the limits for improved precision if they have such use cases. Disabling RTE_KNI_PREEMPT_DEFAULT, interestingly seems to lead to a softlockup on my kernel. Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 1226 Comm: kni_single Tainted: GW O 3.10 #1 [] dump_stack+0x19/0x1b [] panic+0xcd/0x1e0 [] watchdog_timer_fn+0x160/0x160 [] __run_hrtimer.isra.4+0x42/0xd0 [] hrtimer_interrupt+0xe7/0x1f0 [] smp_apic_timer_interrupt+0x67/0xa0 [] apic_timer_interrupt+0x6d/0x80 References: [1] https://www.kernel.org/doc/Documentation/timers/timers-howto.txt Signed-off-by: Tudor Cornea --- v3: * Fixed unwrapped commit description warning * Changed from hrtimers to Linux High Precision Timers in docs * Added two tabs at the beginning of the new params description. Stephen correctly pointed out that the descriptions of the parameters for the Kni module are nonstandard w.r.t existing kernel code. I was thinking to preserve compatibility with the existing parameters of the Kni module for the moment, while an additional clean-up patch could format the descriptions to be closer to the kernel standard. v2: * Fixed some spelling errors --- doc/guides/prog_guide/kernel_nic_interface.rst | 33 +++ kernel/linux/kni/kni_dev.h
[dpdk-dev] [PATCH v2] net/af_packet: fix ignoring full ring on tx
The poll call can return POLLERR which is ignored, or it can return POLLOUT, even if there are no free frames in the mmap-ed area. We can account for both of these cases by re-checking if the next frame is empty before writing into it. Signed-off-by: Mihai Pogonaru Signed-off-by: Tudor Cornea --- drivers/net/af_packet/rte_eth_af_packet.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index b73b211..087c196 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -216,6 +216,25 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) (poll(&pfd, 1, -1) < 0)) break; + /* +* Poll can return POLLERR if the interface is down +* +* It will almost always return POLLOUT, even if there +* are no extra buffers available +* +* This happens, because packet_poll() calls datagram_poll() +* which checks the space left in the socket buffer and, +* in the case of packet_mmap, the default socket buffer length +* doesn't match the requested size for the tx_ring. +* As such, there is almost always space left in socket buffer, +* which doesn't seem to be correlated to the requested size +* for the tx_ring in packet_mmap. +* +* This results in poll() returning POLLOUT. +*/ + if (ppd->tp_status != TP_STATUS_AVAILABLE) + break; + /* copy the tx frame data */ pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN - sizeof(struct sockaddr_ll); -- 2.7.4
[dpdk-dev] [PATCH] net/af_packet: remove timestamp from packet status
We should eliminate the timestamp status from the packet status. This should only matter if timestamping is enabled on the socket, but we might hit a kernel bug, which is fixed in newer releases. For interfaces of type 'veth', the sent skb is forwarded to the peer and back into the network stack which timestamps it on the RX path if timestamping is enabled globally (which happens if any socket enables timestamping). When the skb is destructed, tpacket_destruct_skb() is called and it calls __packet_set_timestamp() which doesn't check the flags on the socket and returns the timestamp if it is set in the skb (and for veth it is, as mentioned above). See the following kernel commit for reference [1]: net: packetmmap: fix only tx timestamp on request The packetmmap tx ring should only return timestamps if requested via setsockopt PACKET_TIMESTAMP, as documented. This allows compatibility with non-timestamp aware user-space code which checks tp_status == TP_STATUS_AVAILABLE; not expecting additional timestamp flags to be set in tp_status. [1] https://www.spinics.net/lists/kernel/msg3959391.html Signed-off-by: Mihai Pogonaru Signed-off-by: Tudor Cornea --- drivers/net/af_packet/rte_eth_af_packet.c | 22 -- 1 file changed, 20 insertions(+), 2 deletions(-) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index b73b211..a6638a2 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include @@ -167,6 +168,23 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) return num_rx; } +static inline bool tx_ring_status_unavailable(uint32_t tp_status) +{ +#if KERNEL_VERSION(5, 10, 0) > LINUX_VERSION_CODE + /* +* We eliminate the timestamp status from the packet status. +* This should only matter if timestamping is enabled on the socket, +* but there is a bug in the kernel which is fixed in newer releases. +* +* See the following kernel commit for reference: +* commit 171c3b151118a2fe0fc1e2a9d1b5a1570cfe82d2 +* net: packetmmap: fix only tx timestamp on request +*/ + tp_status &= ~(TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_RAW_HARDWARE); +#endif + return tp_status != TP_STATUS_AVAILABLE; +} + /* * Callback to handle sending packets through a real NIC. */ @@ -212,8 +230,8 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) } /* point at the next incoming frame */ - if ((ppd->tp_status != TP_STATUS_AVAILABLE) && - (poll(&pfd, 1, -1) < 0)) + if (tx_ring_status_unavailable(ppd->tp_status) && + poll(&pfd, 1, -1) < 0) break; /* copy the tx frame data */ -- 2.7.4
[dpdk-dev] [PATCH v2] net/af_packet: remove timestamp from packet status
We should eliminate the timestamp status from the packet status. This should only matter if timestamping is enabled on the socket, but we might hit a kernel bug, which is fixed in newer releases. For interfaces of type 'veth', the sent skb is forwarded to the peer and back into the network stack which timestamps it on the RX path if timestamping is enabled globally (which happens if any socket enables timestamping). When the skb is destructed, tpacket_destruct_skb() is called and it calls __packet_set_timestamp() which doesn't check the flags on the socket and returns the timestamp if it is set in the skb (and for veth it is, as mentioned above). See the following kernel commit for reference [1]: net: packetmmap: fix only tx timestamp on request The packetmmap tx ring should only return timestamps if requested via setsockopt PACKET_TIMESTAMP, as documented. This allows compatibility with non-timestamp aware user-space code which checks tp_status == TP_STATUS_AVAILABLE; not expecting additional timestamp flags to be set in tp_status. [1] https://www.spinics.net/lists/kernel/msg3959391.html Signed-off-by: Mihai Pogonaru Signed-off-by: Tudor Cornea --- v2: * Remove compile-time check for kernel version --- drivers/net/af_packet/rte_eth_af_packet.c | 20 ++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index b73b211..7ecea4e 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -167,6 +167,22 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) return num_rx; } +static inline bool tx_ring_status_unavailable(uint32_t tp_status) +{ + /* +* We eliminate the timestamp status from the packet status. +* This should only matter if timestamping is enabled on the socket, +* but there is a bug in the kernel which is fixed in newer releases. +* +* See the following kernel commit for reference: +* commit 171c3b151118a2fe0fc1e2a9d1b5a1570cfe82d2 +* net: packetmmap: fix only tx timestamp on request +*/ + tp_status &= ~(TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_RAW_HARDWARE); + + return tp_status != TP_STATUS_AVAILABLE; +} + /* * Callback to handle sending packets through a real NIC. */ @@ -212,8 +228,8 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) } /* point at the next incoming frame */ - if ((ppd->tp_status != TP_STATUS_AVAILABLE) && - (poll(&pfd, 1, -1) < 0)) + if (tx_ring_status_unavailable(ppd->tp_status) && + poll(&pfd, 1, -1) < 0) break; /* copy the tx frame data */ -- 2.7.4
Re: [dpdk-dev] [PATCH] net/af_packet: remove timestamp from packet status
Thanks for the observation. I have removed the compile-time kernel version check in v2 of the patch On Mon, 13 Sept 2021 at 18:09, Stephen Hemminger wrote: > On Mon, 13 Sep 2021 17:09:11 +0300 > Tudor Cornea wrote: > > > +static inline bool tx_ring_status_unavailable(uint32_t tp_status) > > +{ > > +#if KERNEL_VERSION(5, 10, 0) > LINUX_VERSION_CODE > > No, having kernel dependent userspace in DPDK is not good practice. > > Distribution vendors don't number their kernels the same as upstream. > RHEL for example, keeps same version over life or release but backports > many fixes. > > Also, the system DPDK runs on is often not the system DPDK is built > on. >
Re: [dpdk-dev] [PATCH v2] net/af_packet: reinsert the stripped vlan tag
Thanks, Ferruh I will perform the suggested recommendations in version 3 of the patch. On Mon, 20 Sept 2021 at 18:41, Ferruh Yigit wrote: > On 9/8/2021 9:59 AM, Tudor Cornea wrote: > > The af_packet pmd driver binds to a raw socket and allows > > sending and receiving of packets through the kernel. > > > > Since commit [1], the kernel strips the vlan tags early in > > __netif_receive_skb_core(), so we receive untagged packets while > > running with the af_packet pmd. > > > > Luckily for us, the skb vlan-related fields are still populated from the > > stripped vlan tags, so we end up having all the information > > that we need in the mbuf. > > > > Having the PMD driver support DEV_RX_OFFLOAD_VLAN_STRIP allows the > > application to control the desired vlan stripping behavior. > > > > [1] > https://github.com/torvalds/linux/commit/bcc6d47903612c3861201cc3a866fb604f26b8b2 > > > > Signed-off-by: Tudor Cornea > > > > Hi Tudor, > > The concern was unexpected performance degradation (user not setting any > offload > will have performance drop). But since your measurements show no > significant > drop, I think it is fair to make driver behave same as other drivers. > (Until we have a way to describe offloads that can't be disabled by PMDs.) > > Can you do a few minor updates: > - Put your performance measurements into to the commit log to record them > - Update the af_packet documentation (doc/guides/nics/af_packet.rst) to > document > PMD behavior with packets with VLAN tag > - Update release note (doc/guides/rel_notes/release_21_11.rst) with a > one/two > sentences to document the change, to notify possible users of the > af_packet with > the change. > > Thanks, > ferruh > > > --- > > v2: > > * Add DEV_RX_OFFLOAD_VLAN_STRIP to rxmode->offloads > > --- > > drivers/net/af_packet/rte_eth_af_packet.c | 12 > > 1 file changed, 12 insertions(+) > > > > diff --git a/drivers/net/af_packet/rte_eth_af_packet.c > b/drivers/net/af_packet/rte_eth_af_packet.c > > index b73b211..5ed9dd6 100644 > > --- a/drivers/net/af_packet/rte_eth_af_packet.c > > +++ b/drivers/net/af_packet/rte_eth_af_packet.c > > @@ -48,6 +48,7 @@ struct pkt_rx_queue { > > > > struct rte_mempool *mb_pool; > > uint16_t in_port; > > + uint8_t vlan_strip; > > > > volatile unsigned long rx_pkts; > > volatile unsigned long rx_bytes; > > @@ -78,6 +79,7 @@ struct pmd_internals { > > > > struct pkt_rx_queue *rx_queue; > > struct pkt_tx_queue *tx_queue; > > + uint8_t vlan_strip; > > }; > > > > static const char *valid_arguments[] = { > > @@ -148,6 +150,9 @@ eth_af_packet_rx(void *queue, struct rte_mbuf > **bufs, uint16_t nb_pkts) > > if (ppd->tp_status & TP_STATUS_VLAN_VALID) { > > mbuf->vlan_tci = ppd->tp_vlan_tci; > > mbuf->ol_flags |= (PKT_RX_VLAN | > PKT_RX_VLAN_STRIPPED); > > + > > + if (!pkt_q->vlan_strip && rte_vlan_insert(&mbuf)) > > + PMD_LOG(ERR, "Failed to reinsert VLAN > tag"); > > } > > > > /* release incoming frame and advance ring buffer */ > > @@ -302,6 +307,11 @@ eth_dev_stop(struct rte_eth_dev *dev) > > static int > > eth_dev_configure(struct rte_eth_dev *dev __rte_unused) > > { > > + struct rte_eth_conf *dev_conf = &dev->data->dev_conf; > > + const struct rte_eth_rxmode *rxmode = &dev_conf->rxmode; > > + struct pmd_internals *internals = dev->data->dev_private; > > + > > + internals->vlan_strip = !!(rxmode->offloads & > DEV_RX_OFFLOAD_VLAN_STRIP); > > return 0; > > } > > > > @@ -318,6 +328,7 @@ eth_dev_info(struct rte_eth_dev *dev, struct > rte_eth_dev_info *dev_info) > > dev_info->min_rx_bufsize = 0; > > dev_info->tx_offload_capa = DEV_TX_OFFLOAD_MULTI_SEGS | > > DEV_TX_OFFLOAD_VLAN_INSERT; > > + dev_info->rx_offload_capa = DEV_RX_OFFLOAD_VLAN_STRIP; > > > > return 0; > > } > > @@ -448,6 +459,7 @@ eth_rx_queue_setup(struct rte_eth_dev *dev, > > > > dev->data->rx_queues[rx_queue_id] = pkt_q; > > pkt_q->in_port = dev->data->port_id; > > + pkt_q->vlan_strip = internals->vlan_strip; > > > > return 0; > > } > > > >
Re: [dpdk-dev] [PATCH v2] net/af_packet: remove timestamp from packet status
Thanks for the suggestion. I will send a new version of the patch with the required changes. Tudor On Mon, 20 Sept 2021 at 20:49, Ferruh Yigit wrote: > On 9/13/2021 6:23 PM, Tudor Cornea wrote: > > We should eliminate the timestamp status from the packet > > status. This should only matter if timestamping is enabled > > on the socket, but we might hit a kernel bug, which is fixed > > in newer releases. > > > > For interfaces of type 'veth', the sent skb is forwarded > > to the peer and back into the network stack which timestamps > > it on the RX path if timestamping is enabled globally > > (which happens if any socket enables timestamping). > > > > When the skb is destructed, tpacket_destruct_skb() is called > > and it calls __packet_set_timestamp() which doesn't check > > the flags on the socket and returns the timestamp if it is > > set in the skb (and for veth it is, as mentioned above). > > > > See the following kernel commit for reference [1]: > > > > net: packetmmap: fix only tx timestamp on request > > > > The packetmmap tx ring should only return timestamps if requested > > via setsockopt PACKET_TIMESTAMP, as documented. This allows > > compatibility with non-timestamp aware user-space code which checks > > tp_status == TP_STATUS_AVAILABLE; not expecting additional timestamp > > flags to be set in tp_status. > > > > [1] https://www.spinics.net/lists/kernel/msg3959391.html > > > > Signed-off-by: Mihai Pogonaru > > Signed-off-by: Tudor Cornea > > > > --- > > v2: > > * Remove compile-time check for kernel version > > OK, Stephen's comment makes sense. > > > --- > > drivers/net/af_packet/rte_eth_af_packet.c | 20 ++-- > > 1 file changed, 18 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/net/af_packet/rte_eth_af_packet.c > b/drivers/net/af_packet/rte_eth_af_packet.c > > index b73b211..7ecea4e 100644 > > --- a/drivers/net/af_packet/rte_eth_af_packet.c > > +++ b/drivers/net/af_packet/rte_eth_af_packet.c > > @@ -167,6 +167,22 @@ eth_af_packet_rx(void *queue, struct rte_mbuf > **bufs, uint16_t nb_pkts) > > return num_rx; > > } > > > > +static inline bool tx_ring_status_unavailable(uint32_t tp_status) > > +{ > > Minor syntax comment, can you have the 'static inline bool' part in > separate > line. And a basic function comment can be good. > > Thanks, > ferruh > > > + /* > > + * We eliminate the timestamp status from the packet status. > > + * This should only matter if timestamping is enabled on the > socket, > > + * but there is a bug in the kernel which is fixed in newer > releases. > > + * > > + * See the following kernel commit for reference: > > + * commit 171c3b151118a2fe0fc1e2a9d1b5a1570cfe82d2 > > + * net: packetmmap: fix only tx timestamp on request > > + */ > > + tp_status &= ~(TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_RAW_HARDWARE); > > + > > + return tp_status != TP_STATUS_AVAILABLE; > > +} > > + > > /* > > * Callback to handle sending packets through a real NIC. > > */ > > @@ -212,8 +228,8 @@ eth_af_packet_tx(void *queue, struct rte_mbuf > **bufs, uint16_t nb_pkts) > > } > > > > /* point at the next incoming frame */ > > - if ((ppd->tp_status != TP_STATUS_AVAILABLE) && > > - (poll(&pfd, 1, -1) < 0)) > > + if (tx_ring_status_unavailable(ppd->tp_status) && > > + poll(&pfd, 1, -1) < 0) > > break; > > > > /* copy the tx frame data */ > > > >
[dpdk-dev] [PATCH v3] net/af_packet: remove timestamp from packet status
We should eliminate the timestamp status from the packet status. This should only matter if timestamping is enabled on the socket, but we might hit a kernel bug, which is fixed in newer releases. For interfaces of type 'veth', the sent skb is forwarded to the peer and back into the network stack which timestamps it on the RX path if timestamping is enabled globally (which happens if any socket enables timestamping). When the skb is destructed, tpacket_destruct_skb() is called and it calls __packet_set_timestamp() which doesn't check the flags on the socket and returns the timestamp if it is set in the skb (and for veth it is, as mentioned above). See the following kernel commit for reference [1]: net: packetmmap: fix only tx timestamp on request The packetmmap tx ring should only return timestamps if requested via setsockopt PACKET_TIMESTAMP, as documented. This allows compatibility with non-timestamp aware user-space code which checks tp_status == TP_STATUS_AVAILABLE; not expecting additional timestamp flags to be set in tp_status. [1] https://www.spinics.net/lists/kernel/msg3959391.html Signed-off-by: Mihai Pogonaru Signed-off-by: Tudor Cornea --- v3: * Place function name and parameters on a separate line. v2: * Remove compile-time check for kernel version --- drivers/net/af_packet/rte_eth_af_packet.c | 24 ++-- 1 file changed, 22 insertions(+), 2 deletions(-) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index b73b211..fcd8090 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -168,6 +168,26 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) } /* + * Check if there is an available frame in the ring + */ +static inline bool +tx_ring_status_available(uint32_t tp_status) +{ + /* +* We eliminate the timestamp status from the packet status. +* This should only matter if timestamping is enabled on the socket, +* but there is a bug in the kernel which is fixed in newer releases. +* +* See the following kernel commit for reference: +* commit 171c3b151118a2fe0fc1e2a9d1b5a1570cfe82d2 +* net: packetmmap: fix only tx timestamp on request +*/ + tp_status &= ~(TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_RAW_HARDWARE); + + return tp_status == TP_STATUS_AVAILABLE; +} + +/* * Callback to handle sending packets through a real NIC. */ static uint16_t @@ -212,8 +232,8 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) } /* point at the next incoming frame */ - if ((ppd->tp_status != TP_STATUS_AVAILABLE) && - (poll(&pfd, 1, -1) < 0)) + if (!tx_ring_status_available(ppd->tp_status) && + poll(&pfd, 1, -1) < 0) break; /* copy the tx frame data */ -- 2.7.4
[dpdk-dev] [PATCH v3] net/af_packet: reinsert the stripped vlan tag
The af_packet pmd driver binds to a raw socket and allows sending and receiving of packets through the kernel. Since commit [1], the kernel strips the vlan tags early in __netif_receive_skb_core(), so we receive untagged packets while running with the af_packet pmd. Luckily for us, the skb vlan-related fields are still populated from the stripped vlan tags, so we end up having all the information that we need in the mbuf. Having the pmd driver support DEV_RX_OFFLOAD_VLAN_STRIP allows the application to control the desired vlan stripping behavior, until we have a way to describe offloads that can't be disabled by pmd drivers. This patch will cause a change in the default way that the af_packet pmd treats received vlan-tagged frames. While previously, the application was required to check the PKT_RX_VLAN_STRIPPED flag, after this patch, the pmd will re-insert the vlan tag transparently to the user, unless the DEV_RX_OFFLOAD_VLAN_STRIP is enabled in rxmode.offloads. I've attempted a preliminary benchmark to understand if the change could cause a sizable performance hit. Setup: Two virtual machines running on top of an ESXi hypervisor Tx: DPDK app (running on top of vmxnet3 PMD) Rx: af_packet (running on top of a kernel vmxnet3 interface) Packet size :68 (packet contains a vlan tag) Rates: Tx - 1.419 Mpps Rx (without vlan insertion) - 1227636 pps Rx (with vlan insertion)- 1220081 pps At a first glance, we don't seem to have a large degradation in terms of packet rate [1] https://github.com/torvalds/linux/commit/bcc6d47903612c3861201cc3a866fb604f26b8b2 Signed-off-by: Tudor Cornea --- v3: * Updated release note and documentation * Updated commit with performance measurements v2: * Added DEV_RX_OFFLOAD_VLAN_STRIP to rxmode->offloads --- doc/guides/nics/af_packet.rst | 38 +++ doc/guides/rel_notes/release_21_11.rst| 4 drivers/net/af_packet/rte_eth_af_packet.c | 12 ++ 3 files changed, 54 insertions(+) diff --git a/doc/guides/nics/af_packet.rst b/doc/guides/nics/af_packet.rst index efd6f1c..97d5502 100644 --- a/doc/guides/nics/af_packet.rst +++ b/doc/guides/nics/af_packet.rst @@ -65,3 +65,41 @@ framecnt=512): .. code-block:: console --vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0 + +Features and Limitations of the af_packet PMD +- + +Since the following commit, the Linux kernel strips the vlan tag + +.. code-block:: console + +commit bcc6d47903612c3861201cc3a866fb604f26b8b2 +Author: Jiri Pirko +Date: Thu Apr 7 19:48:33 2011 + + + net: vlan: make non-hw-accel rx path similar to hw-accel + +Running on such a kernel results in receiving untagged frames while using +the af_packet PMD. Fortunately, the stripped information is still available +for use in ``mbuf->vlan_tci``, and applications could check ``PKT_RX_VLAN_STRIPPED``. + +However, we currently don't have a way to describe offloads which can't be +disabled by PMDs, and this creates an inconsistency with the way applications +expect the PMD offloads to work, and requires them to be aware of which +underlying driver they use. + +Since release 21.11 the af_packet PMD will implement support for the +``DEV_RX_OFFLOAD_VLAN_STRIP`` offload, and users can control the desired vlan +stripping behavior. + +It's important to note that the default case will change. If previously, +the vlan tag was stripped, if the application now requires the same behavior, +it will need to configure ``rxmode.offloads`` with ``DEV_RX_OFFLOAD_VLAN_STRIP``. + +The PMD driver will re-insert the vlan tag transparently to the application +if the kernel strips it, as long as the ``DEV_RX_OFFLOAD_VLAN_STRIP`` is not +enabled. + +.. code-block:: console + +port_conf.rxmode.offloads |= DEV_RX_OFFLOAD_VLAN_STRIP diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst index ad7c1af..095fd5b 100644 --- a/doc/guides/rel_notes/release_21_11.rst +++ b/doc/guides/rel_notes/release_21_11.rst @@ -66,6 +66,10 @@ New Features * Added rte_flow support for dual VLAN insert and strip actions. +* **Updated af_packet ethdev driver.** + + * Added DEV_RX_OFFLOAD_VLAN_STRIP capability. + * **Updated Marvell cnxk crypto PMD.** * Added AES-CBC SHA1-HMAC support in lookaside protocol (IPsec) for CN10K. diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index b73b211..5ed9dd6 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -48,6 +48,7 @@ struct pkt_rx_queue { struct rte_mempool *mb_pool; uint16_t in_port; + uint8_t vlan_strip; volatile unsigned long rx_pkts; volatile unsigned long rx_bytes; @@ -78,6 +79,7 @@ struct pmd_internals { struct pkt_rx_queue *rx_queue;
[dpdk-dev] [PATCH] net/af_packet: run on kernels without qdisc bypass support
Some older kernels do not support the PACKET_QDISC_BYPASS socket option. Such an example is the CentOS 7 kernel (3.10). If we only check for the definition of PACKET_QDISC_BYPASS, it might mean that we will not be able to compile the PMD driver on a newer platform, and run in on a machine with an older kernel. Setting the socket option only if it is specifically requested from the EAL arguments, allows us to have a way to run the PMD compiled against newer kernel headers, on platforms having older kernels. Signed-off-by: Tudor Cornea --- drivers/net/af_packet/rte_eth_af_packet.c | 16 +--- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index 2e90e29..60b485a 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -749,13 +749,15 @@ rte_pmd_init_internals(struct rte_vdev_device *dev, } #if defined(PACKET_QDISC_BYPASS) - rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS, - &qdisc_bypass, sizeof(qdisc_bypass)); - if (rc == -1) { - PMD_LOG_ERRNO(ERR, - "%s: could not set PACKET_QDISC_BYPASS on AF_PACKET socket for %s", - name, pair->value); - goto error; + if (qdisc_bypass) { + rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS, + &qdisc_bypass, sizeof(qdisc_bypass)); + if (rc == -1) { + PMD_LOG_ERRNO(ERR, + "%s: could not set PACKET_QDISC_BYPASS on AF_PACKET socket for %s", + name, pair->value); + goto error; + } } #else RTE_SET_USED(qdisc_bypass); -- 2.7.4
[dpdk-dev] [PATCH] net/iavf: fix overflow in maximum packet length config
The len variable, used in the computation of max_pkt_len could overflow, if used to store the result of the following computation: rxq->rx_buf_len * IAVF_MAX_CHAINED_RX_BUFFERS Since, we could define the mbuf size to have a large value (i.e 13312), and IAVF_MAX_CHAINED_RX_BUFFERS is defined as 5, the computation mentioned above could potentially result in a value which might be bigger than MAX_USHORT. The result will be that Jumbo Frames will not work properly A similar fix was submitted for the ice driver Signed-off-by: Tudor Cornea --- drivers/net/iavf/iavf_ethdev.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/net/iavf/iavf_ethdev.c b/drivers/net/iavf/iavf_ethdev.c index 574cfe0..dc5cbc2 100644 --- a/drivers/net/iavf/iavf_ethdev.c +++ b/drivers/net/iavf/iavf_ethdev.c @@ -574,13 +574,14 @@ iavf_init_rxq(struct rte_eth_dev *dev, struct iavf_rx_queue *rxq) { struct iavf_hw *hw = IAVF_DEV_PRIVATE_TO_HW(dev->data->dev_private); struct rte_eth_dev_data *dev_data = dev->data; - uint16_t buf_size, max_pkt_len, len; + uint16_t buf_size, max_pkt_len; buf_size = rte_pktmbuf_data_room_size(rxq->mp) - RTE_PKTMBUF_HEADROOM; /* Calculate the maximum packet length allowed */ - len = rxq->rx_buf_len * IAVF_MAX_CHAINED_RX_BUFFERS; - max_pkt_len = RTE_MIN(len, dev->data->dev_conf.rxmode.max_rx_pkt_len); + max_pkt_len = RTE_MIN((uint32_t) + rxq->rx_buf_len * IAVF_MAX_CHAINED_RX_BUFFERS, + dev->data->dev_conf.rxmode.max_rx_pkt_len); /* Check if the jumbo frame and maximum packet length are set * correctly. -- 2.7.4
Re: [PATCH v3] kni: allow configuring the kni thread granularity
Hi Ferruh, > As I tested both with KNI sample app and KNI PMD, change looks good, > practically we can't change the scheduler delay with existing code but > this patch enables it and lets configure performance/CPU usage trade of. > > Only possible change is to remove 'RTE_KNI_PREEMPT_DEFAULT' macro as > mentioned below. > > Thanks for the suggestion. I will send an updated version of the patch, which will remove the RTE_KNI_PREEMPT_DEFAULT macro
[PATCH v4] kni: allow configuring the kni thread granularity
The Kni kthreads seem to be re-scheduled at a granularity of roughly 1 millisecond right now, which seems to be insufficient for performing tests involving a lot of control plane traffic. Even if KNI_KTHREAD_RESCHEDULE_INTERVAL is set to 5 microseconds, it seems that the existing code cannot reschedule at the desired granularily, due to precision constraints of schedule_timeout_interruptible(). In our use case, we leverage the Linux Kernel for control plane, and it is not uncommon to have 60K - 100K pps for some signaling protocols. Since we are not in atomic context, the usleep_range() function seems to be more appropriate for being able to introduce smaller controlled delays, in the range of 5-10 microseconds. Upon reading the existing code, it would seem that this was the original intent. Adding sub-millisecond delays, seems unfeasible with a call to schedule_timeout_interruptible(). KNI_KTHREAD_RESCHEDULE_INTERVAL 5 /* us */ schedule_timeout_interruptible( usecs_to_jiffies(KNI_KTHREAD_RESCHEDULE_INTERVAL)); Below, we attempted a brief comparison between the existing implementation, which uses schedule_timeout_interruptible() and usleep_range(). We attempt to measure the CPU usage, and RTT between two Kni interfaces, which are created on top of vmxnet3 adapters, connected by a vSwitch. insmod rte_kni.ko kthread_mode=single carrier=on schedule_timeout_interruptible(usecs_to_jiffies(5)) kni_single CPU Usage: 2-4 % [root@localhost ~]# ping 1.1.1.2 -I eth1 PING 1.1.1.2 (1.1.1.2) from 1.1.1.1 eth1: 56(84) bytes of data. 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.70 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.00 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.99 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.985 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.00 ms usleep_range(5, 10) kni_single CPU usage: 50% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.338 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.150 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.123 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.139 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.159 ms usleep_range(20, 50) kni_single CPU usage: 24% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.202 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.170 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.171 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.248 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.185 ms usleep_range(50, 100) kni_single CPU usage: 13% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.537 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.257 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.231 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.143 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.200 ms usleep_range(100, 200) kni_single CPU usage: 7% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.716 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.167 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.459 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.455 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.252 ms usleep_range(1000, 1100) kni_single CPU usage: 2% 64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.22 ms 64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.17 ms 64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.17 ms 64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=1.17 ms 64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.15 ms Upon testing, usleep_range(1000, 1100) seems roughly equivalent in latency and cpu usage to the variant with schedule_timeout_interruptible(), while usleep_range(100, 200) seems to give a decent tradeoff between latency and cpu usage, while allowing users to tweak the limits for improved precision if they have such use cases. Disabling RTE_KNI_PREEMPT_DEFAULT, interestingly seems to lead to a softlockup on my kernel. Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 1226 Comm: kni_single Tainted: GW O 3.10 #1 [] dump_stack+0x19/0x1b [] panic+0xcd/0x1e0 [] watchdog_timer_fn+0x160/0x160 [] __run_hrtimer.isra.4+0x42/0xd0 [] hrtimer_interrupt+0xe7/0x1f0 [] smp_apic_timer_interrupt+0x67/0xa0 [] apic_timer_interrupt+0x6d/0x80 This patch also attempts to remove this option. References: [1] https://www.kernel.org/doc/Documentation/timers/timers-howto.txt Signed-off-by: Tudor Cornea --- v4: * Removed RTE_KNI_PREEMPT_DEFAULT configuration option v3: * Fixed unwrapped commit description warning * Changed from hrtimers to Linux High Precision Timers in docs * Added two tabs at the beginning of the new params description. Stephen correctly pointed out that the descriptions of the parameters for the Kni module are nonstandard w.r.t existing kernel code. I was thinking to preserve compatibility with the existing parameters of the Kni module for the moment, while an additional clean-up patch could format the descriptions to be closer to the kernel standard. v2: * Fixed some spelling errors
[PATCH] kernel/kni: retry the xmit in case ring is full
This patch attempts to avoid dropping packets that are to be transmitted, in case there is no space in the KNI ring. We have a use case in which we leverage the Linux TCP / IP stack for control plane, and some protocols might be sensitive to packet drops. This might mean that the sender (Kernel) might be moving at a faster pace than the receiver end (DPDK application), or it might have some brief moments of bursty traffic patterns. Requeuing the packets could add a kind of backpressure until a transmit window is available to us. The burden of retransmitting is shifted to the caller of ndo_start_xmit, which in our case is the configured queuing discipline. This way, the user should be able to influence the behavior w.r.t dropping packets, by picking the desired queuing discipline. Although it should technically be a good approach, from what I have tested, stopping the queue prior to returning NETDEV_TX_BUSY seems to add some extra overhead, and degrade the control-plane performance a bit. Signed-off-by: Tudor Cornea --- kernel/linux/kni/kni_net.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/kernel/linux/kni/kni_net.c b/kernel/linux/kni/kni_net.c index 29e5b9e..db0330f 100644 --- a/kernel/linux/kni/kni_net.c +++ b/kernel/linux/kni/kni_net.c @@ -321,10 +321,9 @@ kni_net_tx(struct sk_buff *skb, struct net_device *dev) if (kni_fifo_free_count(kni->tx_q) == 0 || kni_fifo_count(kni->alloc_q) == 0) { /** -* If no free entry in tx_q or no entry in alloc_q, -* drops skb and goes out. +* Tell the caller to requeue, and retry at a later time. */ - goto drop; + goto requeue; } /* dequeue a mbuf from alloc_q */ @@ -371,6 +370,10 @@ kni_net_tx(struct sk_buff *skb, struct net_device *dev) dev->stats.tx_dropped++; return NETDEV_TX_OK; + +requeue: + /* Signal the caller to re-transmit at a later time */ + return NETDEV_TX_BUSY; } /* -- 2.7.4
Re: [dpdk-dev] [PATCH v2] net/af_packet: fix ignoring full ring on tx
Hi Ferruh, What you described above looks like a ring buffer with single producer and > single consumer, and producer overwrites the not consumed items. Indeed. This is also my understanding of the bug. I am going to try to isolate the issue, and should probably be able to come up with a script in a few days. Our of curiosity, are you using an modified af_packet implementation in > kernel > for above described usage? We are currently using an Ubuntu-based distro with a 4.15 Linux kernel. We don't have any kernel patches for the af_packet implementation to my knowledge (probably excepting patches that are back-ported by Ubuntu maintainers from newer releases). On Mon, 20 Sept 2021 at 20:44, Ferruh Yigit wrote: > On 9/13/2021 2:45 PM, Tudor Cornea wrote: > > The poll call can return POLLERR which is ignored, or it can return > > POLLOUT, even if there are no free frames in the mmap-ed area. > > > > We can account for both of these cases by re-checking if the next > > frame is empty before writing into it. > > > > Signed-off-by: Mihai Pogonaru > > Signed-off-by: Tudor Cornea > > --- > > drivers/net/af_packet/rte_eth_af_packet.c | 19 +++ > > 1 file changed, 19 insertions(+) > > > > diff --git a/drivers/net/af_packet/rte_eth_af_packet.c > b/drivers/net/af_packet/rte_eth_af_packet.c > > index b73b211..087c196 100644 > > --- a/drivers/net/af_packet/rte_eth_af_packet.c > > +++ b/drivers/net/af_packet/rte_eth_af_packet.c > > @@ -216,6 +216,25 @@ eth_af_packet_tx(void *queue, struct rte_mbuf > **bufs, uint16_t nb_pkts) > > (poll(&pfd, 1, -1) < 0)) > > break; > > > > + /* > > + * Poll can return POLLERR if the interface is down > > + * > > + * It will almost always return POLLOUT, even if there > > + * are no extra buffers available > > + * > > + * This happens, because packet_poll() calls > datagram_poll() > > + * which checks the space left in the socket buffer and, > > + * in the case of packet_mmap, the default socket buffer > length > > + * doesn't match the requested size for the tx_ring. > > + * As such, there is almost always space left in socket > buffer, > > + * which doesn't seem to be correlated to the requested > size > > + * for the tx_ring in packet_mmap. > > + * > > + * This results in poll() returning POLLOUT. > > + */ > > + if (ppd->tp_status != TP_STATUS_AVAILABLE) > > + break; > > + > > If 'POLLOUT' doesn't indicate that there is space in the buffer, what is > the > point of the 'poll()' at all? > > What can we test/reproduce the mentioned behavior? Or is there a way to > fix the > behavior of poll() or use an alternative of it? > > > OK to break on the 'POLLERR', I guess it can be detected in the > 'pfd.revent'. > > > > /* copy the tx frame data */ > > pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN - > > sizeof(struct sockaddr_ll); > > > >
[dpdk-dev] [PATCH v4] net/af_packet: reinsert the stripped vlan tag
The af_packet pmd driver binds to a raw socket and allows sending and receiving of packets through the kernel. Since commit [1], the kernel strips the vlan tags early in __netif_receive_skb_core(), so we receive untagged packets while running with the af_packet pmd. Luckily for us, the skb vlan-related fields are still populated from the stripped vlan tags, so we end up having all the information that we need in the mbuf. Having the pmd driver support DEV_RX_OFFLOAD_VLAN_STRIP allows the application to control the desired vlan stripping behavior, until we have a way to describe offloads that can't be disabled by pmd drivers. This patch will cause a change in the default way that the af_packet pmd treats received vlan-tagged frames. While previously, the application was required to check the PKT_RX_VLAN_STRIPPED flag, after this patch, the pmd will re-insert the vlan tag transparently to the user, unless the DEV_RX_OFFLOAD_VLAN_STRIP is enabled in rxmode.offloads. I've attempted a preliminary benchmark to understand if the change could cause a sizable performance hit. Setup: Two virtual machines running on top of an ESXi hypervisor Tx: DPDK app (running on top of vmxnet3 PMD) Rx: af_packet (running on top of a kernel vmxnet3 interface) Packet size :68 (packet contains a vlan tag) Rates: Tx - 1.419 Mpps Rx (without vlan insertion) - 1227636 pps Rx (with vlan insertion)- 1220081 pps At a first glance, we don't seem to have a large degradation in terms of packet rate. [1] https://github.com/torvalds/linux/commit/bcc6d47903612c3861201cc3a866fb604f26b8b2 Signed-off-by: Tudor Cornea --- v4: * Updated the af_packet documentation v3: * Updated release note and documentation * Updated commit with performance measurements v2: * Added DEV_RX_OFFLOAD_VLAN_STRIP to rxmode->offloads --- doc/guides/nics/af_packet.rst | 5 + doc/guides/rel_notes/release_21_11.rst| 4 drivers/net/af_packet/rte_eth_af_packet.c | 12 3 files changed, 21 insertions(+) diff --git a/doc/guides/nics/af_packet.rst b/doc/guides/nics/af_packet.rst index efd6f1c..c87310b 100644 --- a/doc/guides/nics/af_packet.rst +++ b/doc/guides/nics/af_packet.rst @@ -65,3 +65,8 @@ framecnt=512): .. code-block:: console --vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0 + +Features and Limitations of the af_packet PMD +- + +Af_packet PMD now works with VLAN's on Linux diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst index ad7c1af..095fd5b 100644 --- a/doc/guides/rel_notes/release_21_11.rst +++ b/doc/guides/rel_notes/release_21_11.rst @@ -66,6 +66,10 @@ New Features * Added rte_flow support for dual VLAN insert and strip actions. +* **Updated af_packet ethdev driver.** + + * Added DEV_RX_OFFLOAD_VLAN_STRIP capability. + * **Updated Marvell cnxk crypto PMD.** * Added AES-CBC SHA1-HMAC support in lookaside protocol (IPsec) for CN10K. diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index b73b211..5ed9dd6 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -48,6 +48,7 @@ struct pkt_rx_queue { struct rte_mempool *mb_pool; uint16_t in_port; + uint8_t vlan_strip; volatile unsigned long rx_pkts; volatile unsigned long rx_bytes; @@ -78,6 +79,7 @@ struct pmd_internals { struct pkt_rx_queue *rx_queue; struct pkt_tx_queue *tx_queue; + uint8_t vlan_strip; }; static const char *valid_arguments[] = { @@ -148,6 +150,9 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) if (ppd->tp_status & TP_STATUS_VLAN_VALID) { mbuf->vlan_tci = ppd->tp_vlan_tci; mbuf->ol_flags |= (PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED); + + if (!pkt_q->vlan_strip && rte_vlan_insert(&mbuf)) + PMD_LOG(ERR, "Failed to reinsert VLAN tag"); } /* release incoming frame and advance ring buffer */ @@ -302,6 +307,11 @@ eth_dev_stop(struct rte_eth_dev *dev) static int eth_dev_configure(struct rte_eth_dev *dev __rte_unused) { + struct rte_eth_conf *dev_conf = &dev->data->dev_conf; + const struct rte_eth_rxmode *rxmode = &dev_conf->rxmode; + struct pmd_internals *internals = dev->data->dev_private; + + internals->vlan_strip = !!(rxmode->offloads & DEV_RX_OFFLOAD_VLAN_STRIP); return 0; } @@ -318,6 +328,7 @@ eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) dev_info->min_rx_bufsize = 0; dev_info->tx_offload_capa = DEV_TX_OFFLOAD_MULTI_SEGS | DEV_TX_OFFLOAD_VLAN_INSERT;
Re: [dpdk-dev] [PATCH v3] net/af_packet: reinsert the stripped vlan tag
Thanks Stephen, for the suggestion I've sent v4 of the patch, which adds the succinct description in the af_packet documentation.. On Fri, 24 Sept 2021 at 18:11, Stephen Hemminger wrote: > On Fri, 24 Sep 2021 14:44:45 +0300 > Tudor Cornea wrote: > > > +Features and Limitations of the af_packet PMD > > +- > > + > > +Since the following commit, the Linux kernel strips the vlan tag > > + > > +.. code-block:: console > > + > > +commit bcc6d47903612c3861201cc3a866fb604f26b8b2 > > +Author: Jiri Pirko > > +Date: Thu Apr 7 19:48:33 2011 + > > + > > + net: vlan: make non-hw-accel rx path similar to hw-accel > > + > > +Running on such a kernel results in receiving untagged frames while > using > > +the af_packet PMD. Fortunately, the stripped information is still > available > > +for use in ``mbuf->vlan_tci``, and applications could check > ``PKT_RX_VLAN_STRIPPED``. > > + > > +However, we currently don't have a way to describe offloads which can't > be > > +disabled by PMDs, and this creates an inconsistency with the way > applications > > +expect the PMD offloads to work, and requires them to be aware of which > > +underlying driver they use. > > + > > +Since release 21.11 the af_packet PMD will implement support for the > > +``DEV_RX_OFFLOAD_VLAN_STRIP`` offload, and users can control the > desired vlan > > +stripping behavior. > > + > > +It's important to note that the default case will change. If previously, > > +the vlan tag was stripped, if the application now requires the same > behavior, > > +it will need to configure ``rxmode.offloads`` with > ``DEV_RX_OFFLOAD_VLAN_STRIP``. > > + > > +The PMD driver will re-insert the vlan tag transparently to the > application > > +if the kernel strips it, as long as the ``DEV_RX_OFFLOAD_VLAN_STRIP`` > is not > > +enabled. > > This seems like the wrong place for this text. > It is not a new feature, it is a bug fix. > > If you want to describe it as an enhancement, the text should be succinct > and describe > the situation from user point of view. Something like: > > Af_packet PMD now works with VLAN's on Linux > >
[dpdk-dev] [PATCH v5] net/af_packet: reinsert the stripped vlan tag
The af_packet pmd driver binds to a raw socket and allows sending and receiving of packets through the kernel. Since commit [1], the kernel strips the vlan tags early in __netif_receive_skb_core(), so we receive untagged packets while running with the af_packet pmd. Luckily for us, the skb vlan-related fields are still populated from the stripped vlan tags, so we end up having all the information that we need in the mbuf. Having the pmd driver support DEV_RX_OFFLOAD_VLAN_STRIP allows the application to control the desired vlan stripping behavior, until we have a way to describe offloads that can't be disabled by pmd drivers. This patch will cause a change in the default way that the af_packet pmd treats received vlan-tagged frames. While previously, the application was required to check the PKT_RX_VLAN_STRIPPED flag, after this patch, the pmd will re-insert the vlan tag transparently to the user, unless the DEV_RX_OFFLOAD_VLAN_STRIP is enabled in rxmode.offloads. I've attempted a preliminary benchmark to understand if the change could cause a sizable performance hit. Setup: Two virtual machines running on top of an ESXi hypervisor Tx: DPDK app (running on top of vmxnet3 PMD) Rx: af_packet (running on top of a kernel vmxnet3 interface) Packet size :68 (packet contains a vlan tag) Rates: Tx - 1.419 Mpps Rx (without vlan insertion) - 1227636 pps Rx (with vlan insertion)- 1220081 pps At a first glance, we don't seem to have a large degradation in terms of packet rate. [1] https://github.com/torvalds/linux/commit/bcc6d47903612c3861201cc3a866fb604f26b8b2 Signed-off-by: Tudor Cornea --- v5: * Updated the af_packet documentation * Updated the af_packet release notes v4: * Updated the af_packet documentation v3: * Updated release note and documentation * Updated commit with performance measurements v2: * Added DEV_RX_OFFLOAD_VLAN_STRIP to rxmode->offloads --- doc/guides/nics/af_packet.rst | 7 +++ doc/guides/rel_notes/release_21_11.rst| 5 + drivers/net/af_packet/rte_eth_af_packet.c | 12 3 files changed, 24 insertions(+) diff --git a/doc/guides/nics/af_packet.rst b/doc/guides/nics/af_packet.rst index efd6f1c..168a946 100644 --- a/doc/guides/nics/af_packet.rst +++ b/doc/guides/nics/af_packet.rst @@ -65,3 +65,10 @@ framecnt=512): .. code-block:: console --vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0 + +Features and Limitations of the af_packet PMD +- + +The PMD will re-insert the VLAN tag transparently to the packet +if the kernel strips it, as long as the ``DEV_RX_OFFLOAD_VLAN_STRIP`` is not +enabled by the application. diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst index ad7c1af..3315703 100644 --- a/doc/guides/rel_notes/release_21_11.rst +++ b/doc/guides/rel_notes/release_21_11.rst @@ -66,6 +66,11 @@ New Features * Added rte_flow support for dual VLAN insert and strip actions. +* **Updated af_packet ethdev driver.** + + * Default VLAN strip behavior changed. +If previously, the VLAN tag was stripped by the kernel, if the application now requires the same behavior, it will need to enable ``DEV_RX_OFFLOAD_VLAN_STRIP``. + * **Updated Marvell cnxk crypto PMD.** * Added AES-CBC SHA1-HMAC support in lookaside protocol (IPsec) for CN10K. diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index b73b211..5ed9dd6 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -48,6 +48,7 @@ struct pkt_rx_queue { struct rte_mempool *mb_pool; uint16_t in_port; + uint8_t vlan_strip; volatile unsigned long rx_pkts; volatile unsigned long rx_bytes; @@ -78,6 +79,7 @@ struct pmd_internals { struct pkt_rx_queue *rx_queue; struct pkt_tx_queue *tx_queue; + uint8_t vlan_strip; }; static const char *valid_arguments[] = { @@ -148,6 +150,9 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) if (ppd->tp_status & TP_STATUS_VLAN_VALID) { mbuf->vlan_tci = ppd->tp_vlan_tci; mbuf->ol_flags |= (PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED); + + if (!pkt_q->vlan_strip && rte_vlan_insert(&mbuf)) + PMD_LOG(ERR, "Failed to reinsert VLAN tag"); } /* release incoming frame and advance ring buffer */ @@ -302,6 +307,11 @@ eth_dev_stop(struct rte_eth_dev *dev) static int eth_dev_configure(struct rte_eth_dev *dev __rte_unused) { + struct rte_eth_conf *dev_conf = &dev->data->dev_conf; + const struct rte_eth_rxmode *rxmode = &dev_conf->rxmode; + struct pmd_internals *internals = dev->data->dev_private;
Re: [dpdk-dev] [PATCH v4] net/af_packet: reinsert the stripped vlan tag
Hi Ferruh, Also driver was already working with VLAN, the change is VLAN is not > force stripped anymore. Agreed. It makes more sense to inform the users that the default behavior of the PMD has changed w.r.t VLAN stripping. I have updated the patch On Thu, 30 Sept 2021 at 11:14, Ferruh Yigit wrote: > On 9/29/2021 3:08 PM, Tudor Cornea wrote: > > The af_packet pmd driver binds to a raw socket and allows > > sending and receiving of packets through the kernel. > > > > Since commit [1], the kernel strips the vlan tags early in > > __netif_receive_skb_core(), so we receive untagged packets while > > running with the af_packet pmd. > > > > Luckily for us, the skb vlan-related fields are still populated from the > > stripped vlan tags, so we end up having all the information > > that we need in the mbuf. > > > > Having the pmd driver support DEV_RX_OFFLOAD_VLAN_STRIP allows the > > application to control the desired vlan stripping behavior, > > until we have a way to describe offloads that can't be disabled by > > pmd drivers. > > > > This patch will cause a change in the default way that the af_packet > > pmd treats received vlan-tagged frames. While previously, the > > application was required to check the PKT_RX_VLAN_STRIPPED flag, after > > this patch, the pmd will re-insert the vlan tag transparently to the > > user, unless the DEV_RX_OFFLOAD_VLAN_STRIP is enabled in > > rxmode.offloads. > > > > I've attempted a preliminary benchmark to understand if the change could > > cause a sizable performance hit. > > > > Setup: > > Two virtual machines running on top of an ESXi hypervisor > > > > Tx: DPDK app (running on top of vmxnet3 PMD) > > Rx: af_packet (running on top of a kernel vmxnet3 interface) > > Packet size :68 (packet contains a vlan tag) > > > > Rates: > > Tx - 1.419 Mpps > > Rx (without vlan insertion) - 1227636 pps > > Rx (with vlan insertion) - 1220081 pps > > > > At a first glance, we don't seem to have a large degradation in terms > > of packet rate. > > > > [1] > https://github.com/torvalds/linux/commit/bcc6d47903612c3861201cc3a866fb604f26b8b2 > > > > Signed-off-by: Tudor Cornea > > > > --- > > v4: > > * Updated the af_packet documentation > > v3: > > * Updated release note and documentation > > * Updated commit with performance measurements > > v2: > > * Added DEV_RX_OFFLOAD_VLAN_STRIP to rxmode->offloads > > --- > > doc/guides/nics/af_packet.rst | 5 + > > doc/guides/rel_notes/release_21_11.rst| 4 > > drivers/net/af_packet/rte_eth_af_packet.c | 12 > > 3 files changed, 21 insertions(+) > > > > diff --git a/doc/guides/nics/af_packet.rst > b/doc/guides/nics/af_packet.rst > > index efd6f1c..c87310b 100644 > > --- a/doc/guides/nics/af_packet.rst > > +++ b/doc/guides/nics/af_packet.rst > > @@ -65,3 +65,8 @@ framecnt=512): > > .. code-block:: console > > > > > --vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0 > > + > > +Features and Limitations of the af_packet PMD > > +- > > + > > +Af_packet PMD now works with VLAN's on Linux > > Thanks Tudor. > > I see this is suggested by Stephen, but I think 'now' is confusing in the > driver > guide. Also driver was already working with VLAN, the change is VLAN is not > force stripped anymore. > > I think following part from your previous version is better, what do you > think > something like following? > " > The PMD will re-insert the VLAN tag transparently to the packet > if the kernel strips it, as long as the ``DEV_RX_OFFLOAD_VLAN_STRIP`` is > not > enabled by application. > " > > > diff --git a/doc/guides/rel_notes/release_21_11.rst > b/doc/guides/rel_notes/release_21_11.rst > > index ad7c1af..095fd5b 100644 > > --- a/doc/guides/rel_notes/release_21_11.rst > > +++ b/doc/guides/rel_notes/release_21_11.rst > > @@ -66,6 +66,10 @@ New Features > > > >* Added rte_flow support for dual VLAN insert and strip actions. > > > > +* **Updated af_packet ethdev driver.** > > + > > + * Added DEV_RX_OFFLOAD_VLAN_STRIP capability. > > + > > I think change in the default behavior is more important in the release > notes, > again what do you think to have your following update here: > > " > Default VLAN strip behavior changed. If previously, > the vlan tag was stripped, if the application now requires the same > behavior, > it will need to configure ``DEV_RX_OFFLOAD_VLAN_STRIP``. > " >
Re: [dpdk-dev] [PATCH v2] net/af_packet: fix ignoring full ring on tx
Hi Ferruh, I have attempted to narrow down the issue. I have the following bash script, which computes packet rates on an interface. [root@localhost ~]# cat compute-rates.sh #!/usr/bin/env bash if [[ ${#} -ne 2 ]]; then echo "Usage: ${0} " exit 1 fi IFACE_NAME="${1}" SLEEP_INTERVAL_SECONDS="${2}" TMP_STATS_FILE="/tmp/netstat" # Clear Previous stats file echo "0 0 0 0" > "${TMP_STATS_FILE}" echo "Press CTRL+C to exit..." while true; do export "RxB=0" "RxP=0" "TxB=0" "TxP=0" # Extract Rx{Bytes,Packets} and Tx{Bytes,Packets} and # format the output. Individual fields will be exported export $(\ ifconfig "${IFACE_NAME}" \ | grep 'packets' \ | awk '{print $5, $3}' \ | xargs echo \ | sed -E -e \ "s/([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+)/RxB=\1 RxP=\2 TxB=\3 TxP=\4/") # Print Packet and Byte Rates # Format: | Rx Bytes | Rx Packets | Tx Bytes | Tx Packets | echo "${RxB}" "${RxP}" "${TxB}" "${TxP}" $(cat "${TMP_STATS_FILE}") \ | awk '{print "RxB="$1-$5, "RxP="$2-$6, "TxB="$3-$7, "TxP="$4-$8}' # Save the new values echo "${RxB}" "${RxP}" "${TxB}" "${TxP}" > "${TMP_STATS_FILE}" sleep "${SLEEP_INTERVAL_SECONDS}" done On the transmit side, I'm using the engine behind [1] with the af_packet PMD. The configuration for the af_packet PMD is the following: --vdev=net_af_packet0,iface=eth1,blocksz=16384,framesz=8192,framecnt=2048,qpairs=1,qdisc_bypass=0 I'm configuring a Tx rate of 335 packets / second and a packet size of 300 Bytes. These seem to be the values using which we seem to have better chances of seeing the problem. I suspect it might also be linked with the af_packet configuration. I'm starting traffic using the specified configuration, and in parallel, running the script that computes the rates as follows: ./compute-rates.sh eth1 0.1 Initially, the packet rates seem steady RxB=0 RxP=0 TxB=10952 TxP=37 RxB=0 RxP=0 TxB=10656 TxP=36 RxB=0 RxP=0 TxB=10656 TxP=36 RxB=0 RxP=0 TxB=10656 TxP=36 RxB=0 RxP=0 TxB=10952 TxP=37 RxB=0 RxP=0 TxB=10952 TxP=37 RxB=0 RxP=0 TxB=10360 TxP=35 RxB=0 RxP=0 TxB=10952 TxP=37 [...] After a while, we toggle the interface up / down with a sleep between the steps. I suspect the length of the sleep might be a variable in the equation. ifconfig eth1 down; sleep 7; ifconfig eth1 up What we see, is that even after the interface is toggled back up, the rates never seem to recover. RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=2072 TxP=7 RxB=0 RxP=0 TxB=10360 TxP=35 RxB=0 RxP=0 TxB=10360 TxP=35 RxB=0 RxP=0 TxB=10360 TxP=35 RxB=0 RxP=0 TxB=10360 TxP=35 RxB=0 RxP=0 TxB=10360 TxP=35 RxB=0 RxP=0 TxB=10360 TxP=35 RxB=0 RxP=0 TxB=10360 TxP=35 RxB=0 RxP=0 TxB=10360 TxP=35 RxB=0 RxP=0 TxB=521256 TxP=1761 RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=0 TxP=0 RxB=0 RxP=0 TxB=0 TxP=0 [...] I've attempted to mirror the same behavior using dpdk-pktgen [2] on a different machine (Ubuntu 20.04). This time, af_packet runs on top of a Linux virtio_net interface. I seem to be getting a similar behavior. I have used the following dpdk-pktgen configuration and run-time settings pktgen \ -l 1-4 \ -n 4 \ --proc-type=primary \ --no-pci \ --no-telemetry \ --no-huge \ -m 512 \ --vdev=net_af_packet0,iface=eth1,blocksz=16384,framesz=8192,framecnt=2048,qpairs=1,qdisc_bypass=0 \ -- \ -P \ -T \ -m "3.0" \ -f themes/black-yellow.theme set 0 size 300 set 0 rate 0.008 set 0 burst 1 start 0 [1] https://github.com/open-traffic-generator/ixia-c [2] http://code.dpdk.org/pktgen-dpdk/pktgen-20.11.2/source/INSTALL.md On Wed, 29 Sept 2021 at 13:03, Tudor Cornea wrote: > Hi Ferruh, > > What you described above looks like a ring buffer with single producer and >> single consumer, and producer overwrites the not consumed items. > > > Indeed. This is also my understanding of the bug. > I am going to try to isolate the issue, and should probably be able to > come up with a script in a few days. > > Our of curiosity, are you using an modified af_packet implementation in >> kernel >> for above described usage? > > > We are currently using an Ubuntu-based distro with a 4.15 Linux kernel. > We don't have any kernel patches for the af_packet implementation to my > knowledge (probably excepting patches that are back-ported by Ubuntu > maintainers from newer releases). > > > On Mon, 20 Sept 2021 at 20:44, Ferruh Yigit > wrote: > >> O
[dpdk-dev] [PATCH] net/af_packet: try to reinsert the stripped vlan tag
The af_packet pmd driver binds to a raw socket and allows sending and receiving of packets through the kernel. Since commit bcc6d47903 [1], the kernel strips the vlan tags early in __netif_receive_skb_core(), so we receive untagged packets while running with the af_packet pmd. Luckily for us, the skb vlan-related fields are still populated from the stripped vlan tags, so we end up having all the information that we need in the mbuf. We would like to have the the vlan tag inside the mbuf. Let's take a shot at it by trying to reinsert the stripped vlan tag. As a side note, something similar was done for the netvsc pmd. [1] https://github.com/torvalds/linux/commit/bcc6d47903612c3861201cc3a866fb604f26b8b2 Signed-off-by: Tudor Cornea --- drivers/net/af_packet/rte_eth_af_packet.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index b73b211..d116583 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -148,6 +148,10 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) if (ppd->tp_status & TP_STATUS_VLAN_VALID) { mbuf->vlan_tci = ppd->tp_vlan_tci; mbuf->ol_flags |= (PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED); + + /* the kernel always strips the vlan tag, try to reinsert it */ + if (rte_vlan_insert(&mbuf)) + PMD_LOG(ERR, "Failed to reinsert vlan tag"); } /* release incoming frame and advance ring buffer */ -- 2.7.4
[dpdk-dev] [PATCH] net/af_packet: fix ignoring full ring on tx
The poll call can return POLLERR which is ignored, or it can return POLLOUT, even if there are no free frames in the mmap-ed area. We can account for both of these cases by re-checking if the next frame is empty before writing into it. We also now eliminate the timestamp status from the frame status. Signed-off-by: Mihai Pogonaru Signed-off-by: Tudor Cornea --- drivers/net/af_packet/rte_eth_af_packet.c | 47 +-- 1 file changed, 45 insertions(+), 2 deletions(-) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index b73b211..3845df5 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -167,6 +167,12 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) return num_rx; } +static inline __u32 tx_ring_status_remove_ts(volatile __u32 *tp_status) +{ + return *tp_status & + ~(TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_RAW_HARDWARE); +} + /* * Callback to handle sending packets through a real NIC. */ @@ -211,9 +217,41 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) } } + /* +* We must eliminate the timestamp status from the packet +* status. This should only matter if timestamping is enabled +* on the socket, but there is a BUG in the kernel which is +* fixed in newer releases. + +* For interfaces of type 'veth', the sent skb is forwarded +* to the peer and back into the network stack which timestamps +* it on the RX path if timestamping is enabled globally +* (which happens if any socket enables timestamping). + +* When the skb is destructed, tpacket_destruct_skb() is called +* and it calls __packet_set_timestamp() which doesn't check +* the flags on the socket and returns the timestamp if it is +* set in the skb (and for veth it is, as mentioned above). +*/ + /* point at the next incoming frame */ - if ((ppd->tp_status != TP_STATUS_AVAILABLE) && - (poll(&pfd, 1, -1) < 0)) + if ((tx_ring_status_remove_ts(&ppd->tp_status) + != TP_STATUS_AVAILABLE) && (poll(&pfd, 1, -1) < 0)) + break; + + /* +* Poll can return POLLERR if the interface is down or POLLOUT, +* even if there are no extra buffers available. +* This happens, because packet_poll() calls datagram_poll() +* which checks the space left in the socket buffer and in the +* case of packet_mmap the default socket buffer length +* doesn't match the requested size for the tx_ring so there +* is always space left in socket buffer, which doesn't seem +* to be correlated to the requested size for the tx_ring +* in packet_mmap. +*/ + if (tx_ring_status_remove_ts(&ppd->tp_status) + != TP_STATUS_AVAILABLE) break; /* copy the tx frame data */ @@ -242,6 +280,11 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) rte_pktmbuf_free(mbuf); } + /* +* We might have to ignore a few more errnos here since the packets +* remain in the mmap-ed queue and will be sent later, presumably. +*/ + /* kick-off transmits */ if (sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 && errno != ENOBUFS && errno != EAGAIN) { -- 2.7.4
Re: [dpdk-dev] [PATCH] net/af_packet: try to reinsert the stripped vlan tag
Indeed, the vlan insertion could be a costly operation. We should probably do it only if the user specifically asks to have the vlan tag in the packet. Otherwise, af_packet PMD users might pay a price in terms of performance for something they didn't ask for. I was thinking of avoiding having to change the application in order to re-insert the vlan tag. Doing this operation inside the PMD driver seemed like a good fit. Looking at the netvsc driver (drivers/net/netvsc), the vlan insertion is guarded by a check to hv->vlan_strip if (!hv->vlan_strip && rte_vlan_insert(&m)) { hv->vlan_strip seems to be initialized in hn_dev_configure() in the following way hv->vlan_strip = !!(rxmode->offloads & DEV_RX_OFFLOAD_VLAN_STRIP); while 'hv' seems to be stored in rte_eth_dev->data->dev_private I am thinking of doing something similar for the af_packet PMD. The 'pmd_internals' structure could potentially hold a field, say vlan_strip', which could be initialized if the application enables the DEV_RX_OFFLOAD_VLAN_STRIP in rxmode->offloads This way, I'm thinking that the application could potentially control the effect of vlan stripping for the af_packet PMD, in an uniform way, similar to other PMDs. Would this be considered an acceptable solution ? On Tue, 31 Aug 2021 at 18:31, Ferruh Yigit wrote: > On 8/20/2021 1:46 PM, Tudor Cornea wrote: > > The af_packet pmd driver binds to a raw socket and allows > > sending and receiving of packets through the kernel. > > > > Since commit bcc6d47903 [1], the kernel strips the vlan tags early in > > __netif_receive_skb_core(), so we receive untagged packets while > > running with the af_packet pmd. > > > > Luckily for us, the skb vlan-related fields are still populated from the > > stripped vlan tags, so we end up having all the information > > that we need in the mbuf. > > > > We would like to have the the vlan tag inside the mbuf. > > Let's take a shot at it by trying to reinsert the stripped vlan tag. > > > > PMD already sets 'mbuf->vlan_tci' and 'PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED' > flags, so application can be aware of the vlan tag and can consume it. > > Inserting the vlan tag back to packet is costly, what is the motivation to > do so? > > > As a side note, something similar was done for the netvsc pmd. > > > > [1] > https://github.com/torvalds/linux/commit/bcc6d47903612c3861201cc3a866fb604f26b8b2 > > > > Signed-off-by: Tudor Cornea > > --- > > drivers/net/af_packet/rte_eth_af_packet.c | 4 > > 1 file changed, 4 insertions(+) > > > > diff --git a/drivers/net/af_packet/rte_eth_af_packet.c > b/drivers/net/af_packet/rte_eth_af_packet.c > > index b73b211..d116583 100644 > > --- a/drivers/net/af_packet/rte_eth_af_packet.c > > +++ b/drivers/net/af_packet/rte_eth_af_packet.c > > @@ -148,6 +148,10 @@ eth_af_packet_rx(void *queue, struct rte_mbuf > **bufs, uint16_t nb_pkts) > > if (ppd->tp_status & TP_STATUS_VLAN_VALID) { > > mbuf->vlan_tci = ppd->tp_vlan_tci; > > mbuf->ol_flags |= (PKT_RX_VLAN | > PKT_RX_VLAN_STRIPPED); > > + > > + /* the kernel always strips the vlan tag, try to > reinsert it */ > > + if (rte_vlan_insert(&mbuf)) > > + PMD_LOG(ERR, "Failed to reinsert vlan > tag"); > > } > > > > /* release incoming frame and advance ring buffer */ > > > >
Re: [dpdk-dev] [PATCH] net/af_packet: try to reinsert the stripped vlan tag
Thanks, I hope I understood correctly the above comments. I'm thinking of adding DEV_RX_OFFLOAD_VLAN_STRIP to dev_info->rx_offload_capa eth_dev_info() +dev_info->rx_offload_capa = DEV_RX_OFFLOAD_VLAN_STRIP; then populating pmd_internals->vlan_strip with the vlan stripping option that the application requests eth_dev_configure() +internals->vlan_strip = !!(rxmode->offloads & DEV_RX_OFFLOAD_VLAN_STRIP); >From the internals structure, we could populate a newly-added field in the pkt_rx_queue structure 'vlan_strip' eth_rx_queue_setup() +pkt_q->vlan_strip = internals->vlan_strip; And attempt to re-insert the vlan only if required in eth_af_packet_rx eth_af_packet_rx() +if (!pkt_q->vlan_strip && rte_vlan_insert(&mbuf)) I've attempted a simple benchmark to understand if the change could cause a sizable performance hit. Setup: Tx: vmxnet3 PMD Rx: af_packet (running on top of a vmxnet3 interface) Packet size :68 (packet contains a vlan tag) Rates: Tx - 1.419 Mpps Rx (without vlan insertion) - 1227636 pps Rx (with vlan insertion) - 1220081 pps I don't seem to have a large degradation in terms of packet rate at first glance, but maybe the experiment could be repeated on different setups as I'm using a virtual environment. Would it be reasonable if I send v2 of the patch for review, with the above changes ? On Thu, 2 Sept 2021 at 13:49, Ferruh Yigit wrote: > On 9/1/2021 10:34 PM, Stephen Hemminger wrote: > > On Wed, 1 Sep 2021 22:07:22 +0300 > > Tudor Cornea wrote: > > > >> Indeed, the vlan insertion could be a costly operation. We should > probably > >> do it only if the user specifically asks to have the vlan tag in the > packet. > >> Otherwise, af_packet PMD users might pay a price in terms of performance > >> for something they didn't ask for. > >> > >> I was thinking of avoiding having to change the application in order to > >> re-insert the vlan tag. > >> Doing this operation inside the PMD driver seemed like a good fit. > >> > >> Looking at the netvsc driver (drivers/net/netvsc), the vlan insertion is > >> guarded by a check to hv->vlan_strip > >> > >> if (!hv->vlan_strip && rte_vlan_insert(&m)) { > >> > >> hv->vlan_strip seems to be initialized in hn_dev_configure() in the > >> following way > >> > >> hv->vlan_strip = !!(rxmode->offloads & DEV_RX_OFFLOAD_VLAN_STRIP); > >> > >> while 'hv' seems to be stored in rte_eth_dev->data->dev_private > >> > >> I am thinking of doing something similar for the af_packet PMD. > >> The 'pmd_internals' structure could potentially hold a field, say > >> vlan_strip', which could be initialized if the application enables the > >> DEV_RX_OFFLOAD_VLAN_STRIP in rxmode->offloads > >> > >> This way, I'm thinking that the application could potentially control > the > >> effect of vlan stripping for the af_packet PMD, in an uniform way, > similar > >> to other PMDs. > >> Would this be considered an acceptable solution ? > >> > >> > >> > >> > >> On Tue, 31 Aug 2021 at 18:31, Ferruh Yigit > wrote: > >> > >>> On 8/20/2021 1:46 PM, Tudor Cornea wrote: > >>>> The af_packet pmd driver binds to a raw socket and allows > >>>> sending and receiving of packets through the kernel. > >>>> > >>>> Since commit bcc6d47903 [1], the kernel strips the vlan tags early in > >>>> __netif_receive_skb_core(), so we receive untagged packets while > >>>> running with the af_packet pmd. > >>>> > >>>> Luckily for us, the skb vlan-related fields are still populated from > the > >>>> stripped vlan tags, so we end up having all the information > >>>> that we need in the mbuf. > >>>> > >>>> We would like to have the the vlan tag inside the mbuf. > >>>> Let's take a shot at it by trying to reinsert the stripped vlan tag. > >>>> > >>> > >>> PMD already sets 'mbuf->vlan_tci' and 'PKT_RX_VLAN | > PKT_RX_VLAN_STRIPPED' > >>> flags, so application can be aware of the vlan tag and can consume it. > >>> > >>> Inserting the vlan tag back to packet is costly, what is the > motivation to > >>> do so? > >>> > >>>> As a side note, something similar was done for the netvsc pmd. > >>>> > >>>> [1] > >>> > https:
Re: [dpdk-dev] [PATCH] net/af_packet: fix ignoring full ring on tx
Hi Ferruh, Would you mind separate timestamp status fix to its own patch? I think > better to > fix 'ignoring full Tx ring' first, to not make it dependent to timestamp > patch. Agreed. There are two issues solved by this patch. We will break it in two different patches. I can see 'TP_STATUS_TS_SYS_HARDWARE' is deprecated, and I assume in the > kernel > versions the bug exists, this flag is not set, but can you please confirm? And does it only seen with veth, if so I wonder if we can ignore it, not > sure > how common to use af_packet PMD over veth interface, do you have this > usecase? We've seen the timestamping issue only when running af_packet over veth interfaces. We have a particular use-case internally, in which we need to run inside a Kubernetes cluster. We've found the following resources [1] , [2] related to this behavior in the kernel. We believe that issue #2 (the ring getting full), can theoretically occur on any type of NIC. We managed to reproduce the bursty behavior on af_packet PMD over vmxnet3 interface, by Tx-ing packets at a low rate (e.g ~340 pps), and toggling the interface on / off ifconfig $iface_name down; sleep 10; ifconfig $iface_name up We will attempt to give more context on the issue below, about what we think happens: - we have a 2048 queue shared between the kernel and the dpdk socket, there's an index the queue in both the kernel and the dpdk driver - the dpdk driver writes a packet or a burst, advances its idx and tells the kernel to send the packets via a call to sendto() and the kernel sends the packets and advances its idx - once the interface is down the kernel can no longer send packets, but it doesn't drop them, it just doesn't advance its idx - for each packet there is header and in the header there is a status integer which, among others, indicates the owner of the packet: the userspace or the kernel - the userspace (dpdk driver) sets the status as owned by the kernel when it adds another packet ; the kernel sets the status back to owned by the userspace once it sends a packet - the dpdk driver was ignoring this status bit and, even after the queue was full, it would continue to put packets in the queue - its idx would be "after" that of the kernel - once the interface is brought up, the kernel would send all the packets in the queue (since they have the status of being owned by the kernel) on the next call to sendto() and the idx would be back to where it was before the interface was brought up (let's call it k1) - the dpdk driver idx into the queue would point somewhere in the queue (let's call it d1) and would continue to add packets at that point, but the kernel wouldn't send any packet anymore since there is now a gap of packets owned by the userspace between the kernel index (k1) and the dpdk driver idx (d1) - the dpdk idx would eventually reach k1 and packets would be transferred at a normal rate until both the dpdk idx and the kernel idx would reach d1 again - between d1 and k1 there are only packets with the status as owned by the kernel - which where added by the dpdk driver while its index was between d1 and k1 ; thus the kernel would burst all the packets till k1, while the dpdk idx is at d1 - the cycle repeats If a new traffic config comes (in our application) while this cycle is happening, it could be that some of the packets of the old config are still in queue (between d1 and k1) and will be bursted when the dpdk and kernel idx reach d1 ; this would explain seeing packets from an old config, but only in the first 2048 packets (which is the queue size) [1] https://www.spinics.net/lists/kernel/msg3959391.html [2] https://www.spinics.net/lists/netdev/msg739372.html On Wed, 1 Sept 2021 at 19:34, Ferruh Yigit wrote: > On 8/20/2021 2:39 PM, Tudor Cornea wrote: > > The poll call can return POLLERR which is ignored, or it can return > > POLLOUT, even if there are no free frames in the mmap-ed area. > > > > We can account for both of these cases by re-checking if the next > > frame is empty before writing into it. > > > > We also now eliminate the timestamp status from the frame status. > > > > Hi Tudor, > > Would you mind separate timestamp status fix to its own patch? I think > better to > fix 'ignoring full Tx ring' first, to not make it dependent to timestamp > patch. > > > Signed-off-by: Mihai Pogonaru > > Signed-off-by: Tudor Cornea > > --- > > drivers/net/af_packet/rte_eth_af_packet.c | 47 > +-- > > 1 file changed, 45 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/net/af_packet/rte_eth_af_packet.c > b/drivers/net/af_packet/rte_eth_af_packet.c > > index b73b211..3845df5 100644 > > --- a/drivers/net/af_packet/rte_eth_af_packet.c > > +++ b/drivers
[dpdk-dev] [PATCH v2] net/af_packet: reinsert the stripped vlan tag
The af_packet pmd driver binds to a raw socket and allows sending and receiving of packets through the kernel. Since commit [1], the kernel strips the vlan tags early in __netif_receive_skb_core(), so we receive untagged packets while running with the af_packet pmd. Luckily for us, the skb vlan-related fields are still populated from the stripped vlan tags, so we end up having all the information that we need in the mbuf. Having the PMD driver support DEV_RX_OFFLOAD_VLAN_STRIP allows the application to control the desired vlan stripping behavior. [1] https://github.com/torvalds/linux/commit/bcc6d47903612c3861201cc3a866fb604f26b8b2 Signed-off-by: Tudor Cornea --- v2: * Add DEV_RX_OFFLOAD_VLAN_STRIP to rxmode->offloads --- drivers/net/af_packet/rte_eth_af_packet.c | 12 1 file changed, 12 insertions(+) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index b73b211..5ed9dd6 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -48,6 +48,7 @@ struct pkt_rx_queue { struct rte_mempool *mb_pool; uint16_t in_port; + uint8_t vlan_strip; volatile unsigned long rx_pkts; volatile unsigned long rx_bytes; @@ -78,6 +79,7 @@ struct pmd_internals { struct pkt_rx_queue *rx_queue; struct pkt_tx_queue *tx_queue; + uint8_t vlan_strip; }; static const char *valid_arguments[] = { @@ -148,6 +150,9 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) if (ppd->tp_status & TP_STATUS_VLAN_VALID) { mbuf->vlan_tci = ppd->tp_vlan_tci; mbuf->ol_flags |= (PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED); + + if (!pkt_q->vlan_strip && rte_vlan_insert(&mbuf)) + PMD_LOG(ERR, "Failed to reinsert VLAN tag"); } /* release incoming frame and advance ring buffer */ @@ -302,6 +307,11 @@ eth_dev_stop(struct rte_eth_dev *dev) static int eth_dev_configure(struct rte_eth_dev *dev __rte_unused) { + struct rte_eth_conf *dev_conf = &dev->data->dev_conf; + const struct rte_eth_rxmode *rxmode = &dev_conf->rxmode; + struct pmd_internals *internals = dev->data->dev_private; + + internals->vlan_strip = !!(rxmode->offloads & DEV_RX_OFFLOAD_VLAN_STRIP); return 0; } @@ -318,6 +328,7 @@ eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) dev_info->min_rx_bufsize = 0; dev_info->tx_offload_capa = DEV_TX_OFFLOAD_MULTI_SEGS | DEV_TX_OFFLOAD_VLAN_INSERT; + dev_info->rx_offload_capa = DEV_RX_OFFLOAD_VLAN_STRIP; return 0; } @@ -448,6 +459,7 @@ eth_rx_queue_setup(struct rte_eth_dev *dev, dev->data->rx_queues[rx_queue_id] = pkt_q; pkt_q->in_port = dev->data->port_id; + pkt_q->vlan_strip = internals->vlan_strip; return 0; } -- 2.7.4
[dpdk-dev] [PATCH] net/ice: fix integer overflow when computing max_pkt_len
Greetings, Please review the following patch for the dpdk-next-net-intel branch. The len variable, used in the computation of max_pkt_len could overflow, if used to store the result of the following computation: ICE_SUPPORT_CHAIN_NUM * rxq->rx_buf_len Since, we could define the mbuf size to have a large value (i.e 13312), and ICE_SUPPORT_CHAIN_NUM is defined as 5, the computation mentioned above could potentially result in a value which might be bigger than MAX_USHORT. The result will be that Jumbo Frames will not work properly Signed-off-by: Tudor Cornea --- drivers/net/ice/ice_dcf_ethdev.c | 7 --- drivers/net/ice/ice_rxtx.c | 6 +++--- 2 files changed, 7 insertions(+), 6 deletions(-) diff --git a/drivers/net/ice/ice_dcf_ethdev.c b/drivers/net/ice/ice_dcf_ethdev.c index b937cbb..f73dc80 100644 --- a/drivers/net/ice/ice_dcf_ethdev.c +++ b/drivers/net/ice/ice_dcf_ethdev.c @@ -54,13 +54,14 @@ ice_dcf_init_rxq(struct rte_eth_dev *dev, struct ice_rx_queue *rxq) struct ice_dcf_adapter *dcf_ad = dev->data->dev_private; struct rte_eth_dev_data *dev_data = dev->data; struct iavf_hw *hw = &dcf_ad->real_hw.avf; - uint16_t buf_size, max_pkt_len, len; + uint16_t buf_size, max_pkt_len; buf_size = rte_pktmbuf_data_room_size(rxq->mp) - RTE_PKTMBUF_HEADROOM; rxq->rx_hdr_len = 0; rxq->rx_buf_len = RTE_ALIGN(buf_size, (1 << ICE_RLAN_CTX_DBUF_S)); - len = ICE_SUPPORT_CHAIN_NUM * rxq->rx_buf_len; - max_pkt_len = RTE_MIN(len, dev->data->dev_conf.rxmode.max_rx_pkt_len); + max_pkt_len = RTE_MIN((uint32_t) + ICE_SUPPORT_CHAIN_NUM * rxq->rx_buf_len, + dev->data->dev_conf.rxmode.max_rx_pkt_len); /* Check if the jumbo frame and maximum packet length are set * correctly. diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c index fc9bb5a..20352b0 100644 --- a/drivers/net/ice/ice_rxtx.c +++ b/drivers/net/ice/ice_rxtx.c @@ -258,7 +258,7 @@ ice_program_hw_rx_queue(struct ice_rx_queue *rxq) struct rte_eth_dev_data *dev_data = rxq->vsi->adapter->pf.dev_data; struct ice_rlan_ctx rx_ctx; enum ice_status err; - uint16_t buf_size, len; + uint16_t buf_size; struct rte_eth_rxmode *rxmode = &dev_data->dev_conf.rxmode; uint32_t rxdid = ICE_RXDID_COMMS_OVS; uint32_t regval; @@ -268,8 +268,8 @@ ice_program_hw_rx_queue(struct ice_rx_queue *rxq) RTE_PKTMBUF_HEADROOM); rxq->rx_hdr_len = 0; rxq->rx_buf_len = RTE_ALIGN(buf_size, (1 << ICE_RLAN_CTX_DBUF_S)); - len = ICE_SUPPORT_CHAIN_NUM * rxq->rx_buf_len; - rxq->max_pkt_len = RTE_MIN(len, + rxq->max_pkt_len = RTE_MIN((uint32_t) + ICE_SUPPORT_CHAIN_NUM * rxq->rx_buf_len, dev_data->dev_conf.rxmode.max_rx_pkt_len); if (rxmode->offloads & DEV_RX_OFFLOAD_JUMBO_FRAME) { -- 2.7.4
[dpdk-dev] [PATCH] net/ice: fix integer overflow when computing max_pkt_len
The len variable, used in the computation of max_pkt_len could overflow, if used to store the result of the following computation: ICE_SUPPORT_CHAIN_NUM * rxq->rx_buf_len Since, we could define the mbuf size to have a large value (i.e 13312), and ICE_SUPPORT_CHAIN_NUM is defined as 5, the computation mentioned above could potentially result in a value which might be bigger than MAX_USHORT. The result will be that Jumbo Frames will not work properly Signed-off-by: Tudor Cornea --- drivers/net/ice/ice_dcf_ethdev.c | 7 --- drivers/net/ice/ice_rxtx.c | 6 +++--- 2 files changed, 7 insertions(+), 6 deletions(-) diff --git a/drivers/net/ice/ice_dcf_ethdev.c b/drivers/net/ice/ice_dcf_ethdev.c index b937cbb..f73dc80 100644 --- a/drivers/net/ice/ice_dcf_ethdev.c +++ b/drivers/net/ice/ice_dcf_ethdev.c @@ -54,13 +54,14 @@ ice_dcf_init_rxq(struct rte_eth_dev *dev, struct ice_rx_queue *rxq) struct ice_dcf_adapter *dcf_ad = dev->data->dev_private; struct rte_eth_dev_data *dev_data = dev->data; struct iavf_hw *hw = &dcf_ad->real_hw.avf; - uint16_t buf_size, max_pkt_len, len; + uint16_t buf_size, max_pkt_len; buf_size = rte_pktmbuf_data_room_size(rxq->mp) - RTE_PKTMBUF_HEADROOM; rxq->rx_hdr_len = 0; rxq->rx_buf_len = RTE_ALIGN(buf_size, (1 << ICE_RLAN_CTX_DBUF_S)); - len = ICE_SUPPORT_CHAIN_NUM * rxq->rx_buf_len; - max_pkt_len = RTE_MIN(len, dev->data->dev_conf.rxmode.max_rx_pkt_len); + max_pkt_len = RTE_MIN((uint32_t) + ICE_SUPPORT_CHAIN_NUM * rxq->rx_buf_len, + dev->data->dev_conf.rxmode.max_rx_pkt_len); /* Check if the jumbo frame and maximum packet length are set * correctly. diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c index fc9bb5a..20352b0 100644 --- a/drivers/net/ice/ice_rxtx.c +++ b/drivers/net/ice/ice_rxtx.c @@ -258,7 +258,7 @@ ice_program_hw_rx_queue(struct ice_rx_queue *rxq) struct rte_eth_dev_data *dev_data = rxq->vsi->adapter->pf.dev_data; struct ice_rlan_ctx rx_ctx; enum ice_status err; - uint16_t buf_size, len; + uint16_t buf_size; struct rte_eth_rxmode *rxmode = &dev_data->dev_conf.rxmode; uint32_t rxdid = ICE_RXDID_COMMS_OVS; uint32_t regval; @@ -268,8 +268,8 @@ ice_program_hw_rx_queue(struct ice_rx_queue *rxq) RTE_PKTMBUF_HEADROOM); rxq->rx_hdr_len = 0; rxq->rx_buf_len = RTE_ALIGN(buf_size, (1 << ICE_RLAN_CTX_DBUF_S)); - len = ICE_SUPPORT_CHAIN_NUM * rxq->rx_buf_len; - rxq->max_pkt_len = RTE_MIN(len, + rxq->max_pkt_len = RTE_MIN((uint32_t) + ICE_SUPPORT_CHAIN_NUM * rxq->rx_buf_len, dev_data->dev_conf.rxmode.max_rx_pkt_len); if (rxmode->offloads & DEV_RX_OFFLOAD_JUMBO_FRAME) { -- 2.7.4
[PATCH v2] net/af_packet: allow changing fanout mode
This allows us to control the algorithm used to spread traffic between sockets, adding more fine grained control. If the user does not specify a fanout mode, the PMD driver will default to PACKET_FANOUT_HASH. Signed-off-by: Tudor Cornea --- v2: * Renamed the patch * Replaced packet_fanout argument with fanout_mode, which allows more fine grained control --- doc/guides/nics/af_packet.rst | 4 +- drivers/net/af_packet/rte_eth_af_packet.c | 92 --- 2 files changed, 83 insertions(+), 13 deletions(-) diff --git a/doc/guides/nics/af_packet.rst b/doc/guides/nics/af_packet.rst index a343d3a961..3443f95004 100644 --- a/doc/guides/nics/af_packet.rst +++ b/doc/guides/nics/af_packet.rst @@ -25,6 +25,8 @@ Some of these, in turn, will be used to configure the PACKET_MMAP settings. * ``qpairs`` - number of Rx and Tx queues (optional, default 1); * ``qdisc_bypass`` - set PACKET_QDISC_BYPASS option in AF_PACKET (optional, disabled by default); +* ``fanout_mode`` - set fanout algorithm. Possible choices: hash,lb,cpu,rollover,rnd,qm (optional, +default hash); * ``blocksz`` - PACKET_MMAP block size (optional, default 4096); * ``framesz`` - PACKET_MMAP frame size (optional, default 2048B; Note: multiple of 16B); @@ -64,7 +66,7 @@ framecnt=512): .. code-block:: console - --vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0 + --vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0,fanout_mode=hash Features and Limitations diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index ceb8d9356a..8449975384 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -36,6 +36,7 @@ #define ETH_AF_PACKET_FRAMESIZE_ARG"framesz" #define ETH_AF_PACKET_FRAMECOUNT_ARG "framecnt" #define ETH_AF_PACKET_QDISC_BYPASS_ARG "qdisc_bypass" +#define ETH_AF_PACKET_FANOUT_MODE_ARG "fanout_mode" #define DFLT_FRAME_SIZE(1 << 11) #define DFLT_FRAME_COUNT (1 << 9) @@ -96,6 +97,7 @@ static const char *valid_arguments[] = { ETH_AF_PACKET_FRAMESIZE_ARG, ETH_AF_PACKET_FRAMECOUNT_ARG, ETH_AF_PACKET_QDISC_BYPASS_ARG, + ETH_AF_PACKET_FANOUT_MODE_ARG, NULL }; @@ -700,6 +702,61 @@ open_packet_iface(const char *key __rte_unused, return 0; } +#if defined(PACKET_FANOUT) +#define PACKET_FANOUT_INVALID -1 + +static int +get_fanout_group_id(int if_index) +{ + return (getpid() ^ if_index) & 0x; +} + +static int +get_fanout_mode(const char *fanout_mode) +{ + int mode = PACKET_FANOUT_FLAG_DEFRAG; + +#if defined(PACKET_FANOUT_FLAG_ROLLOVER) + mode |= PACKET_FANOUT_FLAG_ROLLOVER; +#endif + + if (!fanout_mode) { + /* Default */ + mode |= PACKET_FANOUT_HASH; + } else if (!strcmp(fanout_mode, "hash")) { + mode |= PACKET_FANOUT_HASH; + } else if (!strcmp(fanout_mode, "lb")) { + mode |= PACKET_FANOUT_LB; + } else if (!strcmp(fanout_mode, "cpu")) { + mode |= PACKET_FANOUT_CPU; + } else if (!strcmp(fanout_mode, "rollover")) { + mode |= PACKET_FANOUT_ROLLOVER; + } else if (!strcmp(fanout_mode, "rnd")) { + mode |= PACKET_FANOUT_RND; + } else if (!strcmp(fanout_mode, "qm")) { + mode |= PACKET_FANOUT_QM; + } else { + /* Invalid Fanout Mode */ + mode = PACKET_FANOUT_INVALID; + } + + return mode; +} + +static int +get_fanout(const char *fanout_mode, int if_index) +{ + int group_id = get_fanout_group_id(if_index); + int mode = get_fanout_mode(fanout_mode); + int fanout = PACKET_FANOUT_INVALID; + + if (mode != PACKET_FANOUT_INVALID) + fanout = group_id | (mode << 16); + + return fanout; +} +#endif + static int rte_pmd_init_internals(struct rte_vdev_device *dev, const int sockfd, @@ -709,6 +766,7 @@ rte_pmd_init_internals(struct rte_vdev_device *dev, unsigned int framesize, unsigned int framecnt, unsigned int qdisc_bypass, + const char *fanout_mode, struct pmd_internals **internals, struct rte_eth_dev **eth_dev, struct rte_kvargs *kvlist) @@ -810,11 +868,12 @@ rte_pmd_init_internals(struct rte_vdev_device *dev, sockaddr.sll_ifindex = (*internals)->if_index; #if defined(PACKET_FANOUT) - fanout_arg = (getpid() ^ (*internals)->if_index) & 0x; - fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG) << 16;
Re: [PATCH] net/af_packet: allow disabling packet fanout
> This is a great idea. Would introducing a new devarg, (e.g > 'fanout_mode') be the proper way to allow the application to customize > fanout in more detail ? > > --vdev=net_af_packet0,iface=eth1,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,fanout_mode=[fanout_hash|fanout_cpu|fanout_rnd|fanout_qm] I'll send a new version of the patch which will add some additional customization
[PATCH] net/af_packet: fix socket close on device stop
Currently, if we call rte_eth_dev_stop(), the sockets are closed. If we attempt to start the port again, socket related operations will not work correctly. This can be alleviated by closing the socket at the same place in which we currently free the memory, in eth_dev_close(). If an application calls rte_eth_dev_stop() on a port managed by the af_packet PMD, the port becomes unusable. This is in contrast with ports managed by other drivers (e.g virtio). I also managed to reproduce the issue using testpmd. sudo ip link add test-veth0 type veth peer name test-veth1 sudo ip link set test-veth0 up sudo ip link set test-veth1 up AF_PACKET_ARGS=\ "blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0" sudo ./dpdk-testpmd \ -l 0-3 \ -m 1024 \ --no-huge \ --no-shconf \ --no-pci \ --vdev=net_af_packet0,iface=test-veth0,${AF_PACKET_ARGS} \ --vdev=net_af_packet1,iface=test-veth1,${AF_PACKET_ARGS} \ -- \ -i testpmd> start tx_first Forwarding will start, and we will see traffic on the interfaces. testpmd> stop testpmd> port stop 0 Stopping ports... Checking link statuses... Done testpmd> port stop 1 Stopping ports... Checking link statuses... Done testpmd> port start 0 AFPACKET: eth_dev_macaddr_set(): receive socket not found Port 0: CA:65:81:63:81:B2 Checking link statuses... Done testpmd> port start 1 AFPACKET: eth_dev_macaddr_set(): receive socket not found Port 1: CA:12:D0:BE:93:3F Checking link statuses... Done testpmd> start tx_first When we start forwarding again, we can see that there is no traffic on the interfaces. This does not happen when testing with other PMD drivers (e.g virtio). With the patch, the port should re-initialize correctly. testpmd> port start 0 Port 0: CA:65:81:63:81:B2 Checking link statuses... Done Fixes: 364e08f2bbc0 ("af_packet: add PMD for AF_PACKET-based virtual devices") Signed-off-by: Tudor Cornea --- drivers/net/af_packet/rte_eth_af_packet.c | 30 +++ 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index ceb8d9356a..efae15a493 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -357,27 +357,12 @@ static int eth_dev_stop(struct rte_eth_dev *dev) { unsigned i; - int sockfd; struct pmd_internals *internals = dev->data->dev_private; for (i = 0; i < internals->nb_queues; i++) { - sockfd = internals->rx_queue[i].sockfd; - if (sockfd != -1) - close(sockfd); - - /* Prevent use after free in case tx fd == rx fd */ - if (sockfd != internals->tx_queue[i].sockfd) { - sockfd = internals->tx_queue[i].sockfd; - if (sockfd != -1) - close(sockfd); - } - - internals->rx_queue[i].sockfd = -1; - internals->tx_queue[i].sockfd = -1; dev->data->rx_queue_state[i] = RTE_ETH_QUEUE_STATE_STOPPED; dev->data->tx_queue_state[i] = RTE_ETH_QUEUE_STATE_STOPPED; } - dev->data->dev_link.link_status = RTE_ETH_LINK_DOWN; return 0; } @@ -474,6 +459,7 @@ eth_dev_close(struct rte_eth_dev *dev) struct pmd_internals *internals; struct tpacket_req *req; unsigned int q; + int sockfd; if (rte_eal_process_type() != RTE_PROC_PRIMARY) return 0; @@ -484,6 +470,20 @@ eth_dev_close(struct rte_eth_dev *dev) internals = dev->data->dev_private; req = &internals->req; for (q = 0; q < internals->nb_queues; q++) { + sockfd = internals->rx_queue[q].sockfd; + if (sockfd != -1) + close(sockfd); + + /* Prevent use after free in case tx fd == rx fd */ + if (sockfd != internals->tx_queue[q].sockfd) { + sockfd = internals->tx_queue[q].sockfd; + if (sockfd != -1) + close(sockfd); + } + + internals->rx_queue[q].sockfd = -1; + internals->tx_queue[q].sockfd = -1; + munmap(internals->rx_queue[q].map, 2 * req->tp_block_size * req->tp_block_nr); rte_free(internals->rx_queue[q].rd); -- 2.34.1
Re: [PATCH] net/af_packet: fix socket close on device stop
> Applied to next-net > Should this go to stable as well I think it would probably make sense to have it in stable.
[PATCH] net/af_packet: allow disabling packet fanout
This allows us to control whether the PMD will attempt to use the PACKET_FANOUT socket option, and allows the binary compiled against newer kernel headers to run on an older kernel, which lacks support for it. Signed-off-by: Tudor Cornea --- doc/guides/nics/af_packet.rst | 4 ++- drivers/net/af_packet/rte_eth_af_packet.c | 32 ++- 2 files changed, 28 insertions(+), 8 deletions(-) diff --git a/doc/guides/nics/af_packet.rst b/doc/guides/nics/af_packet.rst index a343d3a961..42837031de 100644 --- a/doc/guides/nics/af_packet.rst +++ b/doc/guides/nics/af_packet.rst @@ -25,6 +25,8 @@ Some of these, in turn, will be used to configure the PACKET_MMAP settings. * ``qpairs`` - number of Rx and Tx queues (optional, default 1); * ``qdisc_bypass`` - set PACKET_QDISC_BYPASS option in AF_PACKET (optional, disabled by default); +* ``packet_fanout`` - set PACKET_FANOUT option in AF_PACKET (optional, +enabled by default); * ``blocksz`` - PACKET_MMAP block size (optional, default 4096); * ``framesz`` - PACKET_MMAP frame size (optional, default 2048B; Note: multiple of 16B); @@ -64,7 +66,7 @@ framecnt=512): .. code-block:: console - --vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0 + --vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0,packet_fanout=0 Features and Limitations diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index ceb8d9356a..521dbef317 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -36,6 +36,7 @@ #define ETH_AF_PACKET_FRAMESIZE_ARG"framesz" #define ETH_AF_PACKET_FRAMECOUNT_ARG "framecnt" #define ETH_AF_PACKET_QDISC_BYPASS_ARG "qdisc_bypass" +#define ETH_AF_PACKET_PACKET_FANOUT_ARG"packet_fanout" #define DFLT_FRAME_SIZE(1 << 11) #define DFLT_FRAME_COUNT (1 << 9) @@ -96,6 +97,7 @@ static const char *valid_arguments[] = { ETH_AF_PACKET_FRAMESIZE_ARG, ETH_AF_PACKET_FRAMECOUNT_ARG, ETH_AF_PACKET_QDISC_BYPASS_ARG, + ETH_AF_PACKET_PACKET_FANOUT_ARG, NULL }; @@ -709,6 +711,7 @@ rte_pmd_init_internals(struct rte_vdev_device *dev, unsigned int framesize, unsigned int framecnt, unsigned int qdisc_bypass, + unsigned int packet_fanout, struct pmd_internals **internals, struct rte_eth_dev **eth_dev, struct rte_kvargs *kvlist) @@ -926,14 +929,17 @@ rte_pmd_init_internals(struct rte_vdev_device *dev, goto error; } + if (packet_fanout) { #if defined(PACKET_FANOUT) - rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT, - &fanout_arg, sizeof(fanout_arg)); - if (rc == -1) { - PMD_LOG_ERRNO(ERR, - "%s: could not set PACKET_FANOUT on AF_PACKET socket for %s", - name, pair->value); - goto error; + rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT, + &fanout_arg, sizeof(fanout_arg)); + if (rc == -1) { + PMD_LOG_ERRNO(ERR, + "%s: could not set PACKET_FANOUT " + "on AF_PACKET socket for %s", + name, pair->value); + goto error; + } } #endif } @@ -1003,6 +1009,7 @@ rte_eth_from_packet(struct rte_vdev_device *dev, unsigned int framecount = DFLT_FRAME_COUNT; unsigned int qpairs = 1; unsigned int qdisc_bypass = 1; + unsigned int packet_fanout = 1; /* do some parameter checking */ if (*sockfd < 0) @@ -1065,6 +1072,16 @@ rte_eth_from_packet(struct rte_vdev_device *dev, } continue; } + if (strstr(pair->key, ETH_AF_PACKET_PACKET_FANOUT_ARG) != NULL) { + packet_fanout = atoi(pair->value); + if (packet_fanout > 1) { + PMD_LOG(ERR, + "%s: invalid packet fanout value", + name); + return -1; + } + continue; + } } if (framesize > blocksize) { @@ -1091,6 +1108,7 @@ rte_eth_from_packet(struct rte
Re: [PATCH] net/af_packet: allow disabling packet fanout
> Controlling fanout more is a good idea but not sure what this patch > is trying to do with it. I am maintaining a DPDK application which should run on a large number of setups. Unfortunately, I do not have a lot of control over the environment the application runs on (e.g kernel). The problem I was trying to solve is that the Af_Packet PMD seems to fail to initialize properly on some setups. I managed to reproduce the issue with testpmd. sudo ./x86_64-native-linuxapp-gcc/app/dpdk-testpmd \ -l 0-3 \ -m 1024 \ --no-huge \ --no-shconf \ --no-pci \ --vdev=net_af_packet0,iface=eth1,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0 \ --vdev=net_af_packet1,iface=eth2,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0 \ -- \ -i Error: Unknown device type. EAL: Detected CPU lcores: 8 EAL: Detected NUMA nodes: 1 EAL: Detected static linkage of DPDK EAL: Selected IOVA mode 'VA' AFPACKET: rte_pmd_init_internals(): net_af_packet0: could not set PACKET_FANOUT on AF_PACKET socket for eth1:Invalid argument VDEV_BUS: vdev_probe(): failed to initialize net_af_packet0 device AFPACKET: rte_pmd_init_internals(): net_af_packet1: could not set PACKET_FANOUT on AF_PACKET socket for eth1:Invalid argument VDEV_BUS: vdev_probe(): failed to initialize net_af_packet1 device EAL: Bus (vdev) probe failed. testpmd: No probed ethernet devices It seems that the call to PACKET_FANOUT setsockopt failed with EINVAL. I debugged the issue further. It seems that the root cause is that when the interface is down (e.g eth1), after the bind operation succeeds, the PACKET_FANOUT setsockopt will fail. This is a behavior of AF_Packet in Linux which I'm not sure is that well documented. It seems to be fixed in a recent Linux Kernel commit. I re-compiled the kernel with the change, and did not see the issue afterwards. Commit 2cee3e6e2e4b74bec96694169f01cd3feec1f264 af_packet: allow fanout_add when socket is not RUNNING [1] https://github.com/torvalds/linux/commit/2cee3e6e2e4b74bec96694169f01cd3feec1f264 I was trying to find a solution for my application to run on setups which don't have the kernel patch. Introducing a devarg to control when packet_fanout is enabled seemed like a possible way around the issue. As a second solution around the issue, I was also thinking of using the PACKET_FANOUT setsockopt only when nb_queues > 1. Conceptually, I think it would make some sense, because fanout doesn't really help when having a single socket. We would need more sockets in order to load balance incoming packets. And this would still allow me to control the behavior from the existing 'qpairs' devarg. The idea would be something along these lines if (nb_queues > 1) { rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT, &fanout_arg, sizeof(fanout_arg)); /* ... */ } Would a patch using this idea be considered more suitable ? > - DPDK minimum kernel version is now 4.19 so no point in worrying about >backward compatibility. According to man page for packet, fanout >was added in 3.1 kernel. > > - It would be useful to allow application to control fanout in more detail. This is a great idea. Would introducing a new devarg, (e.g 'fanout_mode') be the proper way to allow the application to customize fanout in more detail ? --vdev=net_af_packet0,iface=eth1,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,fanout_mode=[fanout_hash|fanout_cpu|fanout_rnd|fanout_qm]
[PATCH v3] net/af_packet: allow changing fanout mode
This allows us to control the algorithm used to spread traffic between sockets, adding more fine grained control. If the user does not specify a fanout mode, the PMD driver will default to PACKET_FANOUT_HASH. Signed-off-by: Tudor Cornea --- v3: * Removed PACKET_FANOUT ifdef. The feature has existed since Linux 3.1 * Removed PACKET_FANOUT_FLAG_ROLLOVER ifdef * Simplified the get_fanout() function * Renamed mode variable to load_balance. I was also thinking of load_balance_mode as an alternative. * Removed space between get_fanout and following if statement * Improved documentation. Added link to manual page for PACKET_FANOUT v2: * Renamed the patch * Replaced packet_fanout argument with fanout_mode, which allows more fine grained control --- doc/guides/nics/af_packet.rst | 16 +++- drivers/net/af_packet/rte_eth_af_packet.c | 89 ++- 2 files changed, 84 insertions(+), 21 deletions(-) diff --git a/doc/guides/nics/af_packet.rst b/doc/guides/nics/af_packet.rst index a343d3a961..2ec9e9e674 100644 --- a/doc/guides/nics/af_packet.rst +++ b/doc/guides/nics/af_packet.rst @@ -13,8 +13,6 @@ PACKET_MMAP, which provides a mmapped ring buffer, shared between user space and kernel, that's used to send and receive packets. This helps reducing system calls and the copies needed between user space and Kernel. -The PACKET_FANOUT_HASH behavior of AF_PACKET is used for frame reception. - Options and inherent limitations @@ -25,11 +23,23 @@ Some of these, in turn, will be used to configure the PACKET_MMAP settings. * ``qpairs`` - number of Rx and Tx queues (optional, default 1); * ``qdisc_bypass`` - set PACKET_QDISC_BYPASS option in AF_PACKET (optional, disabled by default); +* ``fanout_mode`` - set fanout algorithm. +Possible choices: hash, lb, cpu, rollover, rnd, qm (optional, default hash); * ``blocksz`` - PACKET_MMAP block size (optional, default 4096); * ``framesz`` - PACKET_MMAP frame size (optional, default 2048B; Note: multiple of 16B); * ``framecnt`` - PACKET_MMAP frame count (optional, default 512). +For details regarding ``fanout_mode`` argument, you can consult the +`PACKET_FANOUT documentation <https://www.man7.org/linux/man-pages/man7/packet.7.html>`_. + +As an example, when ``fanout_mode=cpu`` is selected, the PACKET_FANOUT_CPU +mode will be set on the sockets, so that on frame reception, the socket +will be selected based on the CPU on which the packet arrived. + +Only one ``fanout_mode`` can be chosen. If left unspecified, the default is to +use the PACKET_FANOUT_HASH behavior of AF_PACKET for frame reception. + Because this implementation is based on PACKET_MMAP, and PACKET_MMAP has its own pre-requisites, it should be noted that the inner workings of PACKET_MMAP should be carefully considered before modifying some of these options (namely, @@ -64,7 +74,7 @@ framecnt=512): .. code-block:: console - --vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0 + --vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0,fanout_mode=hash Features and Limitations diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index ceb8d9356a..9ca43fc54e 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -36,6 +36,7 @@ #define ETH_AF_PACKET_FRAMESIZE_ARG"framesz" #define ETH_AF_PACKET_FRAMECOUNT_ARG "framecnt" #define ETH_AF_PACKET_QDISC_BYPASS_ARG "qdisc_bypass" +#define ETH_AF_PACKET_FANOUT_MODE_ARG "fanout_mode" #define DFLT_FRAME_SIZE(1 << 11) #define DFLT_FRAME_COUNT (1 << 9) @@ -96,6 +97,7 @@ static const char *valid_arguments[] = { ETH_AF_PACKET_FRAMESIZE_ARG, ETH_AF_PACKET_FRAMECOUNT_ARG, ETH_AF_PACKET_QDISC_BYPASS_ARG, + ETH_AF_PACKET_FANOUT_MODE_ARG, NULL }; @@ -700,6 +702,53 @@ open_packet_iface(const char *key __rte_unused, return 0; } +#define PACKET_FANOUT_INVALID -1 + +static int +get_fanout_group_id(int if_index) +{ + return (getpid() ^ if_index) & 0x; +} + +static int +get_fanout_mode(const char *fanout_mode) +{ + int load_balance = PACKET_FANOUT_FLAG_DEFRAG | + PACKET_FANOUT_FLAG_ROLLOVER; + + if (!fanout_mode) { + /* Default */ + load_balance |= PACKET_FANOUT_HASH; + } else if (!strcmp(fanout_mode, "hash")) { + load_balance |= PACKET_FANOUT_HASH; + } else if (!strcmp(fanout_mode, "lb")) { + load_balance |= PACKET_FANOUT_LB; + } else if (!strcmp(fanout_mode, "cpu")) { + load_balance |= PACKET_FANOUT_CPU; + } else if (!strcmp(fanout_mode, "rollover&quo