[PATCH v5] kni: allow configuring the kni thread granularity

2022-01-14 Thread Tudor Cornea
The Kni kthreads seem to be re-scheduled at a granularity of roughly
1 millisecond right now, which seems to be insufficient for performing
tests involving a lot of control plane traffic.

Even if KNI_KTHREAD_RESCHEDULE_INTERVAL is set to 5 microseconds, it
seems that the existing code cannot reschedule at the desired granularily,
due to precision constraints of schedule_timeout_interruptible().

In our use case, we leverage the Linux Kernel for control plane, and
it is not uncommon to have 60K - 100K pps for some signaling protocols.

Since we are not in atomic context, the usleep_range() function seems to be
more appropriate for being able to introduce smaller controlled delays,
in the range of 5-10 microseconds. Upon reading the existing code, it would
seem that this was the original intent. Adding sub-millisecond delays,
seems unfeasible with a call to schedule_timeout_interruptible().

KNI_KTHREAD_RESCHEDULE_INTERVAL 5 /* us */
schedule_timeout_interruptible(
usecs_to_jiffies(KNI_KTHREAD_RESCHEDULE_INTERVAL));

Below, we attempted a brief comparison between the existing implementation,
which uses schedule_timeout_interruptible() and usleep_range().

We attempt to measure the CPU usage, and RTT between two Kni interfaces,
which are created on top of vmxnet3 adapters, connected by a vSwitch.

insmod rte_kni.ko kthread_mode=single carrier=on

schedule_timeout_interruptible(usecs_to_jiffies(5))
kni_single CPU Usage: 2-4 %
[root@localhost ~]# ping 1.1.1.2 -I eth1
PING 1.1.1.2 (1.1.1.2) from 1.1.1.1 eth1: 56(84) bytes of data.
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.70 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.00 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.99 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.985 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.00 ms

usleep_range(5, 10)
kni_single CPU usage: 50%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.338 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.150 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.123 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.139 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.159 ms

usleep_range(20, 50)
kni_single CPU usage: 24%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.202 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.170 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.171 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.248 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.185 ms

usleep_range(50, 100)
kni_single CPU usage: 13%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.537 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.257 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.231 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.143 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.200 ms

usleep_range(100, 200)
kni_single CPU usage: 7%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.716 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.167 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.459 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.455 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.252 ms

usleep_range(1000, 1100)
kni_single CPU usage: 2%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.22 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.17 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.17 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=1.17 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.15 ms

Upon testing, usleep_range(1000, 1100) seems roughly equivalent in
latency and cpu usage to the variant with schedule_timeout_interruptible(),
while usleep_range(100, 200) seems to give a decent tradeoff between
latency and cpu usage, while allowing users to tweak the limits for
improved precision if they have such use cases.

Disabling RTE_KNI_PREEMPT_DEFAULT, interestingly seems to lead to a
softlockup on my kernel.

Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 1226 Comm: kni_single Tainted: GW  O 3.10 #1
   [] dump_stack+0x19/0x1b
 [] panic+0xcd/0x1e0
 [] watchdog_timer_fn+0x160/0x160
 [] __run_hrtimer.isra.4+0x42/0xd0
 [] hrtimer_interrupt+0xe7/0x1f0
 [] smp_apic_timer_interrupt+0x67/0xa0
 [] apic_timer_interrupt+0x6d/0x80

This patch also attempts to remove this option.

References:
[1] https://www.kernel.org/doc/Documentation/timers/timers-howto.txt

Signed-off-by: Tudor Cornea 
Acked-by: Padraig Connolly 
Reviewed-by: Ferruh Yigit 
---
v5:
* Rebased the patch on top of the dpdk-next-net branch
v4:
* Removed RTE_KNI_PREEMPT_DEFAULT configuration option
v3:
* Fixed unwrapped commit description warning
* Changed from hrtimers to Linux High Precision Timers in docs
* Added two tabs at the beginning of the new params description.
  Stephen correctly pointed out that the descriptions of the parameters
  for the Kni module are nonstandard w.r.t existing kernel code.
  I was thinking to preserve compatibility with the existing parameters
  of the Kni module for the moment, while an additional clean-up

Re: [PATCH] kernel/kni: retry the xmit in case ring is full

2022-01-17 Thread Tudor Cornea
Hi Stephen,


> NAK
> Doing this risks having a CPU lockup if userspace does not keep up
> or the DPDK application gets stuck.
>
> There are better ways to solve the TCP stack queue overrun issue:
> 1. Use a better queueing discipline on the kni device. The Linux default
>of pfifo_fast has bufferbloat issues. Use fq_codel, fq, codel or pie?
> 2. KNI should implement BQL so that TCP stack can see lock backpressure
>about possible queue depth.
>
>
Thanks for the suggestions.
I agree that we risk a lockup, in case the DPDK app gets stuck.

Indeed, I am running on an older Linux kernel, and the default queuing
discipline is pfifo_fast.
I'll experiment with the queuing disciplines you recommended.


> As a simple workaround increase the KNI ring size. It won't solve the whole
> problem but i tcan help
>

I obtained moderate success with increasing MAX_MBUF_BURST_NUM from 32 to
1024 in librte_kni.
I'm not sure if such a change would be upstreamable. Perhaps it needs a bit
of testing.

I'll drop the current patch.


Re: [PATCH v5] kni: allow configuring the kni thread granularity

2022-01-17 Thread Tudor Cornea
On Fri, 14 Jan 2022 at 18:44, Ferruh Yigit  wrote:

> On 1/14/2022 4:24 PM, Stephen Hemminger wrote:
> > On Fri, 14 Jan 2022 17:18:19 +0200
> > Tudor Cornea  wrote:
> >
> >> +module_param(min_scheduling_interval, long, 0644);
> >> +MODULE_PARM_DESC(min_scheduling_interval,
> >> +"\t\tKni thread min scheduling interval (default=100 microseconds):\n"
> >> +"\t\t"
> >> +);
> >> +
> >> +module_param(max_scheduling_interval, long, 0644);
> >> +MODULE_PARM_DESC(max_scheduling_interval,
> >> +"\t\tKni thread max scheduling interval (default=200 microseconds):\n"
> >> +"\t\t"
> >> +);
> >
> > Please don't add more bad module parameter strings.
> > The KNI author did something no other kernel modules do with tabs
> > and double spacing, stop this bogus stuff.
> >
>
> The patch is good, let's not block it for the module parameter string,
> all can be fixed with another patch.
>
> Can you please give a sample what is a common way of it, me or Tudor can
> do the patch?
>
>
I agree that the module parameter string is in non-standard format.
I was planning to send a follow-up patch, which would correct the
description for all of the KNI parameters (including the two new parameters
that the current patch would add) in one shot.


> > Is there any reason you have to use KNI at all.
> > KNI is broken on many levels and is not fixable.
> > What about virtio or tap?
>
>
We've run some tests with tap interfaces and found the performance to not
be good enough for our use.
We're going to experiment with virtio_user in the future. I'm aware that
there is a long term plan to deprecate the KNI.


[PATCH v6] kni: allow configuring the kni thread granularity

2022-01-20 Thread Tudor Cornea
The Kni kthreads seem to be re-scheduled at a granularity of roughly
1 millisecond right now, which seems to be insufficient for performing
tests involving a lot of control plane traffic.

Even if KNI_KTHREAD_RESCHEDULE_INTERVAL is set to 5 microseconds, it
seems that the existing code cannot reschedule at the desired granularily,
due to precision constraints of schedule_timeout_interruptible().

In our use case, we leverage the Linux Kernel for control plane, and
it is not uncommon to have 60K - 100K pps for some signaling protocols.

Since we are not in atomic context, the usleep_range() function seems to be
more appropriate for being able to introduce smaller controlled delays,
in the range of 5-10 microseconds. Upon reading the existing code, it would
seem that this was the original intent. Adding sub-millisecond delays,
seems unfeasible with a call to schedule_timeout_interruptible().

KNI_KTHREAD_RESCHEDULE_INTERVAL 5 /* us */
schedule_timeout_interruptible(
usecs_to_jiffies(KNI_KTHREAD_RESCHEDULE_INTERVAL));

Below, we attempted a brief comparison between the existing implementation,
which uses schedule_timeout_interruptible() and usleep_range().

We attempt to measure the CPU usage, and RTT between two Kni interfaces,
which are created on top of vmxnet3 adapters, connected by a vSwitch.

insmod rte_kni.ko kthread_mode=single carrier=on

schedule_timeout_interruptible(usecs_to_jiffies(5))
kni_single CPU Usage: 2-4 %
[root@localhost ~]# ping 1.1.1.2 -I eth1
PING 1.1.1.2 (1.1.1.2) from 1.1.1.1 eth1: 56(84) bytes of data.
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.70 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.00 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.99 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.985 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.00 ms

usleep_range(5, 10)
kni_single CPU usage: 50%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.338 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.150 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.123 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.139 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.159 ms

usleep_range(20, 50)
kni_single CPU usage: 24%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.202 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.170 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.171 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.248 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.185 ms

usleep_range(50, 100)
kni_single CPU usage: 13%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.537 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.257 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.231 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.143 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.200 ms

usleep_range(100, 200)
kni_single CPU usage: 7%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.716 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.167 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.459 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.455 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.252 ms

usleep_range(1000, 1100)
kni_single CPU usage: 2%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.22 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.17 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.17 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=1.17 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.15 ms

Upon testing, usleep_range(1000, 1100) seems roughly equivalent in
latency and cpu usage to the variant with schedule_timeout_interruptible(),
while usleep_range(100, 200) seems to give a decent tradeoff between
latency and cpu usage, while allowing users to tweak the limits for
improved precision if they have such use cases.

Disabling RTE_KNI_PREEMPT_DEFAULT, interestingly seems to lead to a
softlockup on my kernel.

Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 1226 Comm: kni_single Tainted: GW  O 3.10 #1
   [] dump_stack+0x19/0x1b
 [] panic+0xcd/0x1e0
 [] watchdog_timer_fn+0x160/0x160
 [] __run_hrtimer.isra.4+0x42/0xd0
 [] hrtimer_interrupt+0xe7/0x1f0
 [] smp_apic_timer_interrupt+0x67/0xa0
 [] apic_timer_interrupt+0x6d/0x80

This patch also attempts to remove this option.

References:
[1] https://www.kernel.org/doc/Documentation/timers/timers-howto.txt

Signed-off-by: Tudor Cornea 
Acked-by: Padraig Connolly 
Reviewed-by: Ferruh Yigit 
---
v6:
* Removed tabs and newline in the description of the
  min_scheduling_interval and max_scheduling_interval
  parameters. They seem to be non-standard.
  In a quick glance over the Linux tree, I saw some (rare)
  usages of newlines and tabs:
  drivers/scsi/bnx2fc/bnx2fc_fcoe.c (debug_logging)
  Fixing the other parameters might mean that we have to chop
  some text, otherwise the line could probably get too big.
v5:
* Rebased the patch on top of the dpdk-next-net branch
v4:
* Removed RTE_KNI_PREEMPT_DEFAULT configuration option
v3:
* Fixed unwrapped commit

[dpdk-dev] [PATCH] net/ixgbe: initialize max_rx_pkt_len if rlpml_set_vf fails

2021-10-15 Thread Tudor Cornea
It seems that if the call to ixgbevf_rlpml_set_vf fails,
we will not initialize dev_conf.rxmode.max_rx_pkt_len correctly anymore.

This happens with a 82599EB NIC and a VMware ESXI 6.0 setup,
and is causing VF the ports to fail to initialize

We see the following error:
ixgbevf_dev_rx_init(): Set max packet length to 1518 failed.

Investigating over DPDK 19.11, it seems that the call still fails,
but it doesn't exit prematurely, and max_rx_pkt_len is initialized in
the respective case.

On the ESXi server, we seem to have the following driver
net-ixgbe: 3.7.13.7.14iov-20vmw.600.0.0.2494585

It seems that the behavior related to MTU setting has changed
since the following commit:

commit c77866a16904 ("net/ixgbe: detect failed VF MTU set")

We would like to still be able to support older setups if possible,
as we might have customers running ESXi 6.0 or 6.5, and these seem
to have an older version of the PF driver as default.

Signed-off-by: Tudor Cornea 
---
 drivers/net/ixgbe/ixgbe_ethdev.c | 5 +++--
 drivers/net/ixgbe/ixgbe_rxtx.c   | 1 -
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 4dbe049..4301dfd 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -6369,6 +6369,9 @@ ixgbevf_dev_set_mtu(struct rte_eth_dev *dev, uint16_t mtu)
return -EINVAL;
}
 
+   /* update max frame size */
+   dev->data->dev_conf.rxmode.max_rx_pkt_len = max_frame;
+
/*
 * When supported by the underlying PF driver, use the IXGBE_VF_SET_MTU
 * request of the version 2.0 of the mailbox API.
@@ -6381,8 +6384,6 @@ ixgbevf_dev_set_mtu(struct rte_eth_dev *dev, uint16_t mtu)
if (ixgbevf_rlpml_set_vf(hw, max_frame))
return -EINVAL;
 
-   /* update max frame size */
-   dev->data->dev_conf.rxmode.max_rx_pkt_len = max_frame;
return 0;
 }
 
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 0ac89cb..02d9809 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -5677,7 +5677,6 @@ ixgbevf_dev_rx_init(struct rte_eth_dev *dev)
(uint16_t)dev->data->dev_conf.rxmode.max_rx_pkt_len)) {
PMD_INIT_LOG(ERR, "Set max packet length to %d failed.",
 dev->data->dev_conf.rxmode.max_rx_pkt_len);
-   return -EINVAL;
}
 
/*
-- 
2.7.4



Re: [dpdk-dev] [PATCH] net/ixgbe: initialize max_rx_pkt_len if rlpml_set_vf fails

2021-10-15 Thread Tudor Cornea
Some of our customers use ESXi 6.0 or 6.5 servers, which could have older
versions of the PF ixgbe driver.
It seems that with a more recent version of the PMD driver, we are not able
to initialize 82599EB ports correctly.
This scenario seems to have worked with DPDK 19.11.

Would it be possible to print a warning, while still allowing the driver to
initialize the ports ?

I was also thinking about the return code of ixgbevf_dev_set_mtu.
Do you think it would be more appropriate to return ENOTSUP or ENOSYS
instead of EINVAL ?

As a user, calling 'rte_eth_dev_mtu_set', I would expect an error like
EINVAL to suggest to me that
the mtu value which I provided is incorrect [1]. The 82599 NIC, however has
some particularities related to mtu,
which could cause the operation to fail. I was thinking that EINVAL might
not be most descriptive error.

[1] https://doc.dpdk.org/api/rte__ethdev_8h.html

Thanks,
Tudor

On Fri, 15 Oct 2021 at 17:06, Tudor Cornea  wrote:

> It seems that if the call to ixgbevf_rlpml_set_vf fails,
> we will not initialize dev_conf.rxmode.max_rx_pkt_len correctly anymore.
>
> This happens with a 82599EB NIC and a VMware ESXI 6.0 setup,
> and is causing VF the ports to fail to initialize
>
> We see the following error:
> ixgbevf_dev_rx_init(): Set max packet length to 1518 failed.
>
> Investigating over DPDK 19.11, it seems that the call still fails,
> but it doesn't exit prematurely, and max_rx_pkt_len is initialized in
> the respective case.
>
> On the ESXi server, we seem to have the following driver
> net-ixgbe: 3.7.13.7.14iov-20vmw.600.0.0.2494585
>
> It seems that the behavior related to MTU setting has changed
> since the following commit:
>
> commit c77866a16904 ("net/ixgbe: detect failed VF MTU set")
>
> We would like to still be able to support older setups if possible,
> as we might have customers running ESXi 6.0 or 6.5, and these seem
> to have an older version of the PF driver as default.
>
> Signed-off-by: Tudor Cornea 
> ---
>  drivers/net/ixgbe/ixgbe_ethdev.c | 5 +++--
>  drivers/net/ixgbe/ixgbe_rxtx.c   | 1 -
>  2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c
> b/drivers/net/ixgbe/ixgbe_ethdev.c
> index 4dbe049..4301dfd 100644
> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
> @@ -6369,6 +6369,9 @@ ixgbevf_dev_set_mtu(struct rte_eth_dev *dev,
> uint16_t mtu)
> return -EINVAL;
> }
>
> +   /* update max frame size */
> +   dev->data->dev_conf.rxmode.max_rx_pkt_len = max_frame;
> +
> /*
>  * When supported by the underlying PF driver, use the
> IXGBE_VF_SET_MTU
>  * request of the version 2.0 of the mailbox API.
> @@ -6381,8 +6384,6 @@ ixgbevf_dev_set_mtu(struct rte_eth_dev *dev,
> uint16_t mtu)
> if (ixgbevf_rlpml_set_vf(hw, max_frame))
> return -EINVAL;
>
> -   /* update max frame size */
> -   dev->data->dev_conf.rxmode.max_rx_pkt_len = max_frame;
> return 0;
>  }
>
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c
> b/drivers/net/ixgbe/ixgbe_rxtx.c
> index 0ac89cb..02d9809 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> @@ -5677,7 +5677,6 @@ ixgbevf_dev_rx_init(struct rte_eth_dev *dev)
> (uint16_t)dev->data->dev_conf.rxmode.max_rx_pkt_len)) {
> PMD_INIT_LOG(ERR, "Set max packet length to %d failed.",
>  dev->data->dev_conf.rxmode.max_rx_pkt_len);
> -   return -EINVAL;
> }
>
> /*
> --
> 2.7.4
>
>


Re: [dpdk-dev] [PATCH] net/ixgbe: initialize max_rx_pkt_len if rlpml_set_vf fails

2021-10-20 Thread Tudor Cornea
Hi Ferruh,

I have tested with the dpdk-next-net branch.
It includes the commit 'ethdev: fix max Rx packet length'

Setup:
Hypervisor: VMware ESXi 6.0.0
PF ixgbe Driver: 3.7.13.7 (default PF driver installed with ESXi 6.0 and
6.5)
NIC: Intel 82599
Guest OS : Ubuntu 20.04

It seems that with the recent changes testpmd still can't initialize the
ports.

EAL: Detected CPU lcores: 8
EAL: Detected NUMA nodes: 1
EAL: Detected static linkage of DPDK
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'PA'
EAL: No available 1048576 kB hugepages reported
EAL: VFIO support initialized
EAL: Probe PCI driver: net_ixgbe_vf (8086:10ed) device: :0b:00.0
(socket 0)
EAL: Probe PCI driver: net_ixgbe_vf (8086:10ed) device: :13:00.0
(socket 0)
TELEMETRY: No legacy callbacks, legacy socket not created
Interactive-mode selected
previous number of forwarding ports 2 - changed to number of configured
ports 1
Error picking flow transfer proxy for port 0: Function not implemented -
ignore
Error picking flow transfer proxy for port 1: Function not implemented -
ignore
testpmd: create a new mbuf pool : n=171456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc

Warning! port-topology=paired and odd forward ports number, the last port
will pair with itself.

Configuring Port 0 (socket 0)
ixgbevf_dev_rx_init(): Set max packet length to 1518 failed.
ixgbevf_dev_start(): Unable to initialize RX hardware (-22)
Fail to start port 0: Invalid argument
Configuring Port 1 (socket 0)
ixgbevf_dev_rx_init(): Set max packet length to 1518 failed.
ixgbevf_dev_start(): Unable to initialize RX hardware (-22)
Fail to start port 1: Invalid argument
Please stop the ports first
Done
Error during enabling promiscuous mode for port 0: Operation not supported
- ignore
Error during enabling promiscuous mode for port 1: Operation not supported
- ignore
testpmd> start tx_first
Not all ports were started

I think failing to set set 'max_rx_pkt_len' after ixgbevf_rlpml_set_vf()
call failed in ixgbevf_dev_set_mtu(), might have been one half of the
problem.
The other one is in ixgbevf_dev_rx_init(). The init function seems to
return prematurely, and not initialize the ports.

With the below patch (on top of net-next branch), I seem to be able to
initialize the ports correctly, and send packets using testpmd.

diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index b263dfe1d5..fdd057c789 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -5676,7 +5676,6 @@ ixgbevf_dev_rx_init(struct rte_eth_dev *dev)
if (ixgbevf_rlpml_set_vf(hw, frame_size) != 0) {
PMD_INIT_LOG(ERR, "Set max packet length to %d failed.",
 frame_size);
-   return -EINVAL;
}

/*

Alvin, would it be acceptable to not return -EINVAL in this case while
still printing an error, so that we can still debug mtu issues on 82599 ?

If I recall correctly, the NIC has issues when the MTU of VFs is bigger
than the MTU of the PFs. On my setup, however the MTUs have default values
(1500).
Should I rebase the patch on top of net-next ?

Thanks,
Tudor

On Wed, 20 Oct 2021 at 06:08, Zhang, AlvinX  wrote:

> > -Original Message-
> > From: Yigit, Ferruh 
> > Sent: Tuesday, October 19, 2021 8:58 PM
> > To: Tudor Cornea ; Zhang, Qi Z
> > 
> > Cc: Zhang, AlvinX ; Wang, Haiyue
> > ; dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH] net/ixgbe: initialize max_rx_pkt_len if
> > rlpml_set_vf fails
> >
> > On 10/15/2021 3:20 PM, Tudor Cornea wrote:
> > > Some of our customers use ESXi 6.0 or 6.5 servers, which could have
> > > older versions of the PF ixgbe driver.
> > > It seems that with a more recent version of the PMD driver, we are not
> > > able to initialize 82599EB ports correctly.
> > > This scenario seems to have worked with DPDK 19.11.
> > >
> > > Would it be possible to print a warning, while still allowing the
> > > driver to initialize the ports ?
>
> There is a scenario that we initialize the port successful, with no any
> error, but we cannot get any packets.
> So keep printing error or warning and still allowing the driver to
> initialize the port may be the best solution.
>
> > >
> > > I was also thinking about the return code of ixgbevf_dev_set_mtu.
> > > Do you think it would be more appropriate to return ENOTSUP or ENOSYS
> > > instead of EINVAL ?
>
> If ixgbevf_dev_set_mtu fails, we have no way to know it because of invalid
> value or not supported?
> ixgbevf_dev_set_mtu returns one the these values: 0 or IXGBE_ERR_MBX
>
> > >
> > > As a user, calling 'rte_eth_dev_mtu_set', I would expect an error like
> > > EINVAL to

[dpdk-dev] [PATCH v2] net/ixgbe: initialize port even if mtu config fails

2021-10-20 Thread Tudor Cornea
On a VMware ESXi 6.0 setup with an Intel 82599 NIC the ports don't
seem to initialize anymore, while running testpmd.

Configuring Port 0 (socket 0)
ixgbevf_dev_rx_init(): Set max packet length to 1518 failed.
ixgbevf_dev_start(): Unable to initialize RX hardware (-22)
Fail to start port 0: Invalid argument
Configuring Port 1 (socket 0)
ixgbevf_dev_rx_init(): Set max packet length to 1518 failed.
ixgbevf_dev_start(): Unable to initialize RX hardware (-22)
Fail to start port 1: Invalid argument
Please stop the ports first

If the call to ixgbevf_rlpml_set_vf fails and we return prematurely,
we will not be able to initialize the ports correctly.

The behavior seems to have changed since the following commit:

commit c77866a16904 ("net/ixgbe: detect failed VF MTU set")

We can make this particular use case work correctly if we don't
return an error, which seems to be consistent with the overall
kernel ixgbevf implementation.

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c#n2015

Signed-off-by: Tudor Cornea 

---
v2:
* Change title
* Remove max_rx_pkt_len fix in ixgbe_ethdev.c
  It's already fixed as part of Ferruh's changes in next-net branch,
  so this part should be redundant, now
---
 drivers/net/ixgbe/ixgbe_rxtx.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index b263dfe..a51450f 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -5673,11 +5673,9 @@ ixgbevf_dev_rx_init(struct rte_eth_dev *dev)
 * ixgbevf_rlpml_set_vf even if jumbo frames are not used. This way,
 * VF packets received can work in all cases.
 */
-   if (ixgbevf_rlpml_set_vf(hw, frame_size) != 0) {
+   if (ixgbevf_rlpml_set_vf(hw, frame_size) != 0)
PMD_INIT_LOG(ERR, "Set max packet length to %d failed.",
 frame_size);
-   return -EINVAL;
-   }
 
/*
 * Assume no header split and no VLAN strip support
-- 
2.7.4



Re: [dpdk-dev] [PATCH v2] net/ixgbe: initialize port even if mtu config fails

2021-10-21 Thread Tudor Cornea
On Thu, 21 Oct 2021 at 18:33, Ferruh Yigit  wrote:

> On 10/20/2021 7:13 PM, Tudor Cornea wrote:
> > On a VMware ESXi 6.0 setup with an Intel 82599 NIC the ports don't
> > seem to initialize anymore, while running testpmd.
> >
> > Configuring Port 0 (socket 0)
> > ixgbevf_dev_rx_init(): Set max packet length to 1518 failed.
> > ixgbevf_dev_start(): Unable to initialize RX hardware (-22)
> > Fail to start port 0: Invalid argument
> > Configuring Port 1 (socket 0)
> > ixgbevf_dev_rx_init(): Set max packet length to 1518 failed.
> > ixgbevf_dev_start(): Unable to initialize RX hardware (-22)
> > Fail to start port 1: Invalid argument
> > Please stop the ports first
> >
> > If the call to ixgbevf_rlpml_set_vf fails and we return prematurely,
> > we will not be able to initialize the ports correctly.
> >
> > The behavior seems to have changed since the following commit:
> >
> > commit c77866a16904 ("net/ixgbe: detect failed VF MTU set")
> >
>
> Hi Tudor,
>
> We document this with explicit 'Fixes' tag, this also helps up to
> manage backporting patches to LTS releases, so updating as:
>
>  Fixes: 3a6bfc37eaf4 ("net/ice: support QoS config VF bandwidth in
> DCF")
>  Cc: sta...@dpdk.org
>
> Also we use 'fix' verb in the patch title almost as keyword, again
> to help deciding which patch to backport, also to clarify impact of
> the patch, so will update patch title as:
>
>  net/ixgbe: fix port initialization if MTU config fails
>
>
> For more details please check contribution guide:
> https://doc.dpdk.org/guides/contributing/patches.html
>
> Thanks,
> ferruh
>
> > We can make this particular use case work correctly if we don't
> > return an error, which seems to be consistent with the overall
> > kernel ixgbevf implementation.
> >
> > [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c#n2015
> >
>
> The code that this link references can change by time as code changes,
> I will update it as following to bind it a specific version (v5.14):
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c?h=v5.14#n2015
>
> > Signed-off-by: Tudor Cornea 
> >
> > ---
> > v2:
> > * Change title
> > * Remove max_rx_pkt_len fix in ixgbe_ethdev.c
> >It's already fixed as part of Ferruh's changes in next-net branch,
> >so this part should be redundant, now
> > ---
> >   drivers/net/ixgbe/ixgbe_rxtx.c | 4 +---
> >   1 file changed, 1 insertion(+), 3 deletions(-)
> >
> > diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c
> b/drivers/net/ixgbe/ixgbe_rxtx.c
> > index b263dfe..a51450f 100644
> > --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> > +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> > @@ -5673,11 +5673,9 @@ ixgbevf_dev_rx_init(struct rte_eth_dev *dev)
> >* ixgbevf_rlpml_set_vf even if jumbo frames are not used. This
> way,
> >* VF packets received can work in all cases.
> >*/
> > - if (ixgbevf_rlpml_set_vf(hw, frame_size) != 0) {
> > + if (ixgbevf_rlpml_set_vf(hw, frame_size) != 0)
> >   PMD_INIT_LOG(ERR, "Set max packet length to %d failed.",
> >frame_size);
> > - return -EINVAL;
> > - }
> >
> >   /*
> >* Assume no header split and no VLAN strip support
> >
>
>
Hi Ferruh,

Thanks a lot for the good observations and for pointing me in the right
direction !

Tudor


[dpdk-dev] [PATCH] kni: allow configuring the kni thread granularity

2021-11-02 Thread Tudor Cornea
The Kni kthreads seem to be re-scheduled at a granularity of roughly
1 milisecond right now, which seems to be insufficient for performing tests
involving a lot of control plane traffic.

Even if KNI_KTHREAD_RESCHEDULE_INTERVAL is set to 5 microseconds, it
seems that the existing code cannot reschedule at the desired granularily,
due to precision constraints of schedule_timeout_interruptible().

In our use case, we leverage the Linux Kernel for control plane, and
it is not uncommon to have 60K - 100K pps for some signaling protocols.

Since we are in non-atomic context, the usleep_range() function seems to be
more appropriate for being able to introduce smaller controlled delays,
in the range of 5-10 microseconds. Upon reading the existing code, it would
seem that this was the original intent. Adding sub-milisecond delays,
seems unfeasible with a call to schedule_timeout_interruptible().

KNI_KTHREAD_RESCHEDULE_INTERVAL 5 /* us */
schedule_timeout_interruptible(
usecs_to_jiffies(KNI_KTHREAD_RESCHEDULE_INTERVAL));

Below, we attempted a brief comparison between the existing implementation,
which uses schedule_timeout_interruptible() and usleep_range().

insmod rte_kni.ko kthread_mode=single carrier=on

schedule_timeout_interruptible(usecs_to_jiffies(5))
kni_single CPU Usage: 2-4 %
[root@localhost ~]# ping 1.1.1.2 -I eth1
PING 1.1.1.2 (1.1.1.2) from 1.1.1.1 eth1: 56(84) bytes of data.
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.70 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.00 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.99 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.985 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.00 ms

usleep_range(5, 10)
kni_single CPU usage: 50%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.338 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.150 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.123 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.139 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.159 ms

usleep_range(20, 50)
kni_single CPU usage: 24%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.202 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.170 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.171 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.248 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.185 ms

usleep_range(50, 100)
kni_single CPU usage: 13%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.537 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.257 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.231 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.143 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.200 ms

usleep_range(100, 200)
kni_single CPU usage: 7%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.716 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.167 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.459 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.455 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.252 ms

usleep_range(1000, 1100)
kni_single CPU usage: 2%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.22 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.17 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.17 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=1.17 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.15 ms

Upon testing, usleep_range(1000, 1100) seems roughly equivalent in
latency and cpu usage to the variant with schedule_timeout_interruptible(),
while usleep_range(100, 200) seems to give a decent tradeoff between
latency and cpu usage, while allowing users to tweak the limits for
improved precision if they have such use cases.

Disabling RTE_KNI_PREEMPT_DEFAULT, interestingly seems to lead to a
softlockup on my kernel.

Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 1226 Comm: kni_single Tainted: GW  O 3.10 #1
   [] dump_stack+0x19/0x1b
 [] panic+0xcd/0x1e0
 [] watchdog_timer_fn+0x160/0x160
 [] __run_hrtimer.isra.4+0x42/0xd0
 [] hrtimer_interrupt+0xe7/0x1f0
 [] smp_apic_timer_interrupt+0x67/0xa0
 [] apic_timer_interrupt+0x6d/0x80

References:
[1] https://www.kernel.org/doc/Documentation/timers/timers-howto.txt

Signed-off-by: Tudor Cornea 
---
 doc/guides/prog_guide/kernel_nic_interface.rst | 33 +++
 kernel/linux/kni/kni_dev.h |  2 +-
 kernel/linux/kni/kni_misc.c| 36 +++---
 3 files changed, 66 insertions(+), 5 deletions(-)

diff --git a/doc/guides/prog_guide/kernel_nic_interface.rst 
b/doc/guides/prog_guide/kernel_nic_interface.rst
index 1ce03ec..5d6d535 100644
--- a/doc/guides/prog_guide/kernel_nic_interface.rst
+++ b/doc/guides/prog_guide/kernel_nic_interface.rst
@@ -56,6 +56,10 @@ can be specified when the module is loaded to control its 
behavior:
 off   Interfaces will be created with carrier state set to 
off.
 onInterfaces will be created with carrier state set to 
on.
  (charp)
+parm

Re: [dpdk-dev] [PATCH v2] net/af_packet: fix ignoring full ring on tx

2021-11-02 Thread Tudor Cornea
On Tue, 26 Oct 2021 at 17:41, Ferruh Yigit  wrote:

> Hi Tudor,
>
> I have used testpmd, 'txonly' forwarding. Tx recovers after interface up,
> but by adding some debug logs I can see 'poll()' returns with POLLOUT even
> there is no space in the buffer.
>
> According the logic in the PMD, when 'poll()' returns success, it expects
> to have some space in the Tx buffer.
>
> So I agree to add the check.
>
> Only have a question on the POLLERR, should we separate the POLLERR check
> to cover ifdown case, what do you think about following logic:
>
> if (!TP_STATUS_AVAILABLE) {
>  if (poll() < 0)
>  break;
>  if (pfd.revents & POLLERR)
>  break;
> }
>
> if (!TP_STATUS_AVAILABLE)
>  break;
>
>
Hi Ferruh,

Thanks for the suggestion.
I was thinking of adding this check, and intuitively, it seems correct. I
tried to do some further testing.

I tested with the POLLERR check, and I don't see the issue anymore.
However, if I remove the second if (!TP_STATUS_AVAILABLE), I seem to get
bursts of 2048 packets followed by periods of not sending anything when
toggling the interface link state.
Without the second if(!TP_STATUS_AVAILABLE) statement, the issue seems to
reproduce, regardless if I add the check for POLLERR or not.

RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=606208 TxP=2048
RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=0 TxP=0

I will send an updated version of the patch.

Thanks,
Tudor


[dpdk-dev] [PATCH v3] net/af_packet: fix ignoring full ring on tx

2021-11-02 Thread Tudor Cornea
The poll call can return POLLERR which is ignored, or it can return
POLLOUT, even if there are no free frames in the mmap-ed area.

We can account for both of these cases by re-checking if the next
frame is empty before writing into it.

We have attempted to reproduce this issue with pktgen-dpdk, using the
following configuration.

pktgen -l 1-4 -n 4 --proc-type=primary --no-pci --no-telemetry \
--no-huge -m 512 \
--vdev=net_af_packet0,iface=eth1,blocksz=16384,framesz=8192, \
framecnt=2048,qpairs=1,qdisc_bypass=0 \
-- \
-P \
-T \
-m "3.0" \
-f themes/black-yellow.theme

We configure a low tx rate (~ 335 packets / second) and a small
packet size, of about 300 Bytes from the pktgen CLI.

set 0 size 300
set 0 rate 0.008
set 0 burst 1
start 0

After bringing the interface down, and up again, we seem to arrive
in a state in which the tx rate is inconsistent, and does not recover.

ifconfig eth1 down; sleep 7; ifconfig eth1 up

[1] http://code.dpdk.org/pktgen-dpdk/pktgen-20.11.2/source/INSTALL.md

Signed-off-by: Mihai Pogonaru 
Signed-off-by: Tudor Cornea 

---
v2:
* Added check for POLLERR
* Used tx_ring_status_available() for checking TP_STATUS_AVAILABLE
---
 drivers/net/af_packet/rte_eth_af_packet.c | 26 --
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/drivers/net/af_packet/rte_eth_af_packet.c 
b/drivers/net/af_packet/rte_eth_af_packet.c
index 559f5a0..d3d3104 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -237,8 +237,30 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, 
uint16_t nb_pkts)
}
 
/* point at the next incoming frame */
-   if (!tx_ring_status_available(ppd->tp_status) &&
-   poll(&pfd, 1, -1) < 0)
+   if (!tx_ring_status_available(ppd->tp_status)) {
+   if (poll(&pfd, 1, -1) < 0)
+   break;
+   if (pfd.revents & POLLERR)
+   break;
+   }
+
+   /*
+* Poll can return POLLERR if the interface is down
+*
+* It will almost always return POLLOUT, even if there
+* are no extra buffers available
+*
+* This happens, because packet_poll() calls datagram_poll()
+* which checks the space left in the socket buffer and,
+* in the case of packet_mmap, the default socket buffer length
+* doesn't match the requested size for the tx_ring.
+* As such, there is almost always space left in socket buffer,
+* which doesn't seem to be correlated to the requested size
+* for the tx_ring in packet_mmap.
+*
+* This results in poll() returning POLLOUT.
+*/
+   if (!tx_ring_status_available(ppd->tp_status))
break;
 
/* copy the tx frame data */
-- 
2.7.4



[dpdk-dev] [PATCH v2] kni: allow configuring the kni thread granularity

2021-11-02 Thread Tudor Cornea
The Kni kthreads seem to be re-scheduled at a granularity of roughly
1 millisecond right now, which seems to be insufficient for performing tests
involving a lot of control plane traffic.

Even if KNI_KTHREAD_RESCHEDULE_INTERVAL is set to 5 microseconds, it
seems that the existing code cannot reschedule at the desired granularily,
due to precision constraints of schedule_timeout_interruptible().

In our use case, we leverage the Linux Kernel for control plane, and
it is not uncommon to have 60K - 100K pps for some signaling protocols.

Since we are in non-atomic context, the usleep_range() function seems to be
more appropriate for being able to introduce smaller controlled delays,
in the range of 5-10 microseconds. Upon reading the existing code, it would
seem that this was the original intent. Adding sub-millisecond delays,
seems unfeasible with a call to schedule_timeout_interruptible().

KNI_KTHREAD_RESCHEDULE_INTERVAL 5 /* us */
schedule_timeout_interruptible(
usecs_to_jiffies(KNI_KTHREAD_RESCHEDULE_INTERVAL));

Below, we attempted a brief comparison between the existing implementation,
which uses schedule_timeout_interruptible() and usleep_range().

insmod rte_kni.ko kthread_mode=single carrier=on

schedule_timeout_interruptible(usecs_to_jiffies(5))
kni_single CPU Usage: 2-4 %
[root@localhost ~]# ping 1.1.1.2 -I eth1
PING 1.1.1.2 (1.1.1.2) from 1.1.1.1 eth1: 56(84) bytes of data.
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.70 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.00 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.99 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.985 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.00 ms

usleep_range(5, 10)
kni_single CPU usage: 50%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.338 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.150 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.123 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.139 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.159 ms

usleep_range(20, 50)
kni_single CPU usage: 24%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.202 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.170 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.171 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.248 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.185 ms

usleep_range(50, 100)
kni_single CPU usage: 13%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.537 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.257 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.231 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.143 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.200 ms

usleep_range(100, 200)
kni_single CPU usage: 7%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.716 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.167 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.459 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.455 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.252 ms

usleep_range(1000, 1100)
kni_single CPU usage: 2%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.22 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.17 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.17 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=1.17 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.15 ms

Upon testing, usleep_range(1000, 1100) seems roughly equivalent in
latency and cpu usage to the variant with schedule_timeout_interruptible(),
while usleep_range(100, 200) seems to give a decent tradeoff between
latency and cpu usage, while allowing users to tweak the limits for
improved precision if they have such use cases.

Disabling RTE_KNI_PREEMPT_DEFAULT, interestingly seems to lead to a
softlockup on my kernel.

Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 1226 Comm: kni_single Tainted: GW  O 3.10 #1
   [] dump_stack+0x19/0x1b
 [] panic+0xcd/0x1e0
 [] watchdog_timer_fn+0x160/0x160
 [] __run_hrtimer.isra.4+0x42/0xd0
 [] hrtimer_interrupt+0xe7/0x1f0
 [] smp_apic_timer_interrupt+0x67/0xa0
 [] apic_timer_interrupt+0x6d/0x80

References:
[1] https://www.kernel.org/doc/Documentation/timers/timers-howto.txt

Signed-off-by: Tudor Cornea 

---
v2:
* Fixed some spelling errors
---
 doc/guides/prog_guide/kernel_nic_interface.rst | 33 +++
 kernel/linux/kni/kni_dev.h |  2 +-
 kernel/linux/kni/kni_misc.c| 36 +++---
 3 files changed, 66 insertions(+), 5 deletions(-)

diff --git a/doc/guides/prog_guide/kernel_nic_interface.rst 
b/doc/guides/prog_guide/kernel_nic_interface.rst
index 1ce03ec..2dd3481 100644
--- a/doc/guides/prog_guide/kernel_nic_interface.rst
+++ b/doc/guides/prog_guide/kernel_nic_interface.rst
@@ -56,6 +56,10 @@ can be specified when the module is loaded to control its 
behavior:
 off   Interfaces will be created with carrier state set to 
off.
 onInterfaces will be created with carrier state set to 
on

[dpdk-dev] [PATCH v4] net/af_packet: fix ignoring full ring on tx

2021-11-03 Thread Tudor Cornea
The poll call can return POLLERR which is ignored, or it can return
POLLOUT, even if there are no free frames in the mmap-ed area.

We can account for both of these cases by re-checking if the next
frame is empty before writing into it.

We have attempted to reproduce this issue with pktgen-dpdk, using the
following configuration.

pktgen -l 1-4 -n 4 --proc-type=primary --no-pci --no-telemetry \
--no-huge -m 512 \
--vdev=net_af_packet0,iface=eth1,blocksz=16384,framesz=8192, \
framecnt=2048,qpairs=1,qdisc_bypass=0 \
-- \
-P \
-T \
-m "3.0" \
-f themes/black-yellow.theme

We configure a low tx rate (~ 335 packets / second) and a small
packet size, of about 300 bytes from the pktgen CLI.

set 0 size 300
set 0 rate 0.008
set 0 burst 1
start 0

After bringing the interface down, and up again, we seem to arrive
in a state in which the tx rate is inconsistent, and does not recover.

ifconfig eth1 down; sleep 7; ifconfig eth1 up

[1] http://code.dpdk.org/pktgen-dpdk/pktgen-20.11.2/source/INSTALL.md

Signed-off-by: Mihai Pogonaru 
Signed-off-by: Tudor Cornea 
Reviewed-by: Ferruh Yigit 

---
v4:
* Updated comment regarding POLLERR
  Removed some extra details, and left notice about poll() returning
  POLLERR, in case of interface link down.
v3:
* Added check for POLLERR
* Used tx_ring_status_available() for checking TP_STATUS_AVAILABLE
---
 drivers/net/af_packet/rte_eth_af_packet.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/net/af_packet/rte_eth_af_packet.c 
b/drivers/net/af_packet/rte_eth_af_packet.c
index 559f5a0..dd6d165 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -237,8 +237,16 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, 
uint16_t nb_pkts)
}
 
/* point at the next incoming frame */
-   if (!tx_ring_status_available(ppd->tp_status) &&
-   poll(&pfd, 1, -1) < 0)
+   if (!tx_ring_status_available(ppd->tp_status)) {
+   if (poll(&pfd, 1, -1) < 0)
+   break;
+
+   /* poll() can return POLLERR if the interface is down */
+   if (pfd.revents & POLLERR)
+   break;
+   }
+
+   if (!tx_ring_status_available(ppd->tp_status))
break;
 
/* copy the tx frame data */
-- 
2.7.4



Re: [dpdk-dev] [PATCH v2] kni: allow configuring the kni thread granularity

2021-11-03 Thread Tudor Cornea
On Tue, 2 Nov 2021 at 17:53, Stephen Hemminger 
wrote:

> On Tue,  2 Nov 2021 17:51:13 +0200
> Tudor Cornea  wrote:
>
> > +#ifdef RTE_KNI_PREEMPT_DEFAULT
> > +module_param(min_scheduling_interval, long, 0644);
> > +MODULE_PARM_DESC(min_scheduling_interval,
> > +"Kni thread min scheduling interval (default=100 microseconds):\n"
> > +"\t\t"
> > +);
>
> Why the non-standard newline's and tab's?
> Please try to make KNI look like other kernel code.
>

Hi Stephen,

I tried to base the description of the new parameters on an existing
parameter implemented for the rte_kni module - carrier.

module_param(carrier, charp, 0644);
MODULE_PARM_DESC(carrier,
"Default carrier state for KNI interface (default=off):\n"
"\t\toff   Interfaces will be created with carrier state set to off.\n"
"\t\tonInterfaces will be created with carrier state set to on.\n"
"\t\t"
);

I thought about keeping the compatibility in terms of coding style with the
existing Kni module parameters.
Upon browsing the Linux tree, I realise it might not be standard (
checkpatch.pl , interestingly didn't seem to complain about the patch)

I also realise now, that I missed two tabs at the beginning of the params
description.
Should I add the missing tabs, so that the new parameters that I intend to
add through this patch are similar in style to the existing ones, or should
I remove the newlines and tabs altogether, when specifying the description
for min_scheduling_interval and max_scheduling_interval ?

Thanks,
Tudor


[dpdk-dev] [PATCH v3] kni: allow configuring the kni thread granularity

2021-11-08 Thread Tudor Cornea
The Kni kthreads seem to be re-scheduled at a granularity of roughly
1 millisecond right now, which seems to be insufficient for performing
tests involving a lot of control plane traffic.

Even if KNI_KTHREAD_RESCHEDULE_INTERVAL is set to 5 microseconds, it
seems that the existing code cannot reschedule at the desired granularily,
due to precision constraints of schedule_timeout_interruptible().

In our use case, we leverage the Linux Kernel for control plane, and
it is not uncommon to have 60K - 100K pps for some signaling protocols.

Since we are not in atomic context, the usleep_range() function seems to be
more appropriate for being able to introduce smaller controlled delays,
in the range of 5-10 microseconds. Upon reading the existing code, it would
seem that this was the original intent. Adding sub-millisecond delays,
seems unfeasible with a call to schedule_timeout_interruptible().

KNI_KTHREAD_RESCHEDULE_INTERVAL 5 /* us */
schedule_timeout_interruptible(
usecs_to_jiffies(KNI_KTHREAD_RESCHEDULE_INTERVAL));

Below, we attempted a brief comparison between the existing implementation,
which uses schedule_timeout_interruptible() and usleep_range().

We attempt to measure the CPU usage, and RTT between two Kni interfaces,
which are created on top of vmxnet3 adapters, connected by a vSwitch.

insmod rte_kni.ko kthread_mode=single carrier=on

schedule_timeout_interruptible(usecs_to_jiffies(5))
kni_single CPU Usage: 2-4 %
[root@localhost ~]# ping 1.1.1.2 -I eth1
PING 1.1.1.2 (1.1.1.2) from 1.1.1.1 eth1: 56(84) bytes of data.
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.70 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.00 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.99 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.985 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.00 ms

usleep_range(5, 10)
kni_single CPU usage: 50%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.338 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.150 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.123 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.139 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.159 ms

usleep_range(20, 50)
kni_single CPU usage: 24%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.202 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.170 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.171 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.248 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.185 ms

usleep_range(50, 100)
kni_single CPU usage: 13%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.537 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.257 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.231 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.143 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.200 ms

usleep_range(100, 200)
kni_single CPU usage: 7%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.716 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.167 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.459 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.455 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.252 ms

usleep_range(1000, 1100)
kni_single CPU usage: 2%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.22 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.17 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.17 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=1.17 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.15 ms

Upon testing, usleep_range(1000, 1100) seems roughly equivalent in
latency and cpu usage to the variant with schedule_timeout_interruptible(),
while usleep_range(100, 200) seems to give a decent tradeoff between
latency and cpu usage, while allowing users to tweak the limits for
improved precision if they have such use cases.

Disabling RTE_KNI_PREEMPT_DEFAULT, interestingly seems to lead to a
softlockup on my kernel.

Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 1226 Comm: kni_single Tainted: GW  O 3.10 #1
   [] dump_stack+0x19/0x1b
 [] panic+0xcd/0x1e0
 [] watchdog_timer_fn+0x160/0x160
 [] __run_hrtimer.isra.4+0x42/0xd0
 [] hrtimer_interrupt+0xe7/0x1f0
 [] smp_apic_timer_interrupt+0x67/0xa0
 [] apic_timer_interrupt+0x6d/0x80

References:
[1] https://www.kernel.org/doc/Documentation/timers/timers-howto.txt

Signed-off-by: Tudor Cornea 

---
v3:
* Fixed unwrapped commit description warning
* Changed from hrtimers to Linux High Precision Timers in docs
* Added two tabs at the beginning of the new params description.
  Stephen correctly pointed out that the descriptions of the parameters
  for the Kni module are nonstandard w.r.t existing kernel code.
  I was thinking to preserve compatibility with the existing parameters
  of the Kni module for the moment, while an additional clean-up patch
  could format the descriptions to be closer to the kernel standard.
v2:
* Fixed some spelling errors
---
 doc/guides/prog_guide/kernel_nic_interface.rst | 33 +++
 kernel/linux/kni/kni_dev.h

[dpdk-dev] [PATCH v2] net/af_packet: fix ignoring full ring on tx

2021-09-13 Thread Tudor Cornea
The poll call can return POLLERR which is ignored, or it can return
POLLOUT, even if there are no free frames in the mmap-ed area.

We can account for both of these cases by re-checking if the next
frame is empty before writing into it.

Signed-off-by: Mihai Pogonaru 
Signed-off-by: Tudor Cornea 
---
 drivers/net/af_packet/rte_eth_af_packet.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/drivers/net/af_packet/rte_eth_af_packet.c 
b/drivers/net/af_packet/rte_eth_af_packet.c
index b73b211..087c196 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -216,6 +216,25 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, 
uint16_t nb_pkts)
(poll(&pfd, 1, -1) < 0))
break;
 
+   /*
+* Poll can return POLLERR if the interface is down
+*
+* It will almost always return POLLOUT, even if there
+* are no extra buffers available
+*
+* This happens, because packet_poll() calls datagram_poll()
+* which checks the space left in the socket buffer and,
+* in the case of packet_mmap, the default socket buffer length
+* doesn't match the requested size for the tx_ring.
+* As such, there is almost always space left in socket buffer,
+* which doesn't seem to be correlated to the requested size
+* for the tx_ring in packet_mmap.
+*
+* This results in poll() returning POLLOUT.
+*/
+   if (ppd->tp_status != TP_STATUS_AVAILABLE)
+   break;
+
/* copy the tx frame data */
pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
sizeof(struct sockaddr_ll);
-- 
2.7.4



[dpdk-dev] [PATCH] net/af_packet: remove timestamp from packet status

2021-09-13 Thread Tudor Cornea
We should eliminate the timestamp status from the packet
status. This should only matter if timestamping is enabled
on the socket, but we might hit a kernel bug, which is fixed
in newer releases.

For interfaces of type 'veth', the sent skb is forwarded
to the peer and back into the network stack which timestamps
it on the RX path if timestamping is enabled globally
(which happens if any socket enables timestamping).

When the skb is destructed, tpacket_destruct_skb() is called
and it calls __packet_set_timestamp() which doesn't check
the flags on the socket and returns the timestamp if it is
set in the skb (and for veth it is, as mentioned above).

See the following kernel commit for reference [1]:

net: packetmmap: fix only tx timestamp on request

The packetmmap tx ring should only return timestamps if requested
via setsockopt PACKET_TIMESTAMP, as documented. This allows
compatibility with non-timestamp aware user-space code which checks
tp_status == TP_STATUS_AVAILABLE; not expecting additional timestamp
flags to be set in tp_status.

[1] https://www.spinics.net/lists/kernel/msg3959391.html

Signed-off-by: Mihai Pogonaru 
Signed-off-by: Tudor Cornea 
---
 drivers/net/af_packet/rte_eth_af_packet.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/net/af_packet/rte_eth_af_packet.c 
b/drivers/net/af_packet/rte_eth_af_packet.c
index b73b211..a6638a2 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -167,6 +168,23 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, 
uint16_t nb_pkts)
return num_rx;
 }
 
+static inline bool tx_ring_status_unavailable(uint32_t tp_status)
+{
+#if KERNEL_VERSION(5, 10, 0) > LINUX_VERSION_CODE
+   /*
+* We eliminate the timestamp status from the packet status.
+* This should only matter if timestamping is enabled on the socket,
+* but there is a bug in the kernel which is fixed in newer releases.
+*
+* See the following kernel commit for reference:
+* commit 171c3b151118a2fe0fc1e2a9d1b5a1570cfe82d2
+* net: packetmmap: fix only tx timestamp on request
+*/
+   tp_status &= ~(TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_RAW_HARDWARE);
+#endif
+   return tp_status != TP_STATUS_AVAILABLE;
+}
+
 /*
  * Callback to handle sending packets through a real NIC.
  */
@@ -212,8 +230,8 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, 
uint16_t nb_pkts)
}
 
/* point at the next incoming frame */
-   if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
-   (poll(&pfd, 1, -1) < 0))
+   if (tx_ring_status_unavailable(ppd->tp_status) &&
+   poll(&pfd, 1, -1) < 0)
break;
 
/* copy the tx frame data */
-- 
2.7.4



[dpdk-dev] [PATCH v2] net/af_packet: remove timestamp from packet status

2021-09-13 Thread Tudor Cornea
We should eliminate the timestamp status from the packet
status. This should only matter if timestamping is enabled
on the socket, but we might hit a kernel bug, which is fixed
in newer releases.

For interfaces of type 'veth', the sent skb is forwarded
to the peer and back into the network stack which timestamps
it on the RX path if timestamping is enabled globally
(which happens if any socket enables timestamping).

When the skb is destructed, tpacket_destruct_skb() is called
and it calls __packet_set_timestamp() which doesn't check
the flags on the socket and returns the timestamp if it is
set in the skb (and for veth it is, as mentioned above).

See the following kernel commit for reference [1]:

net: packetmmap: fix only tx timestamp on request

The packetmmap tx ring should only return timestamps if requested
via setsockopt PACKET_TIMESTAMP, as documented. This allows
compatibility with non-timestamp aware user-space code which checks
tp_status == TP_STATUS_AVAILABLE; not expecting additional timestamp
flags to be set in tp_status.

[1] https://www.spinics.net/lists/kernel/msg3959391.html

Signed-off-by: Mihai Pogonaru 
Signed-off-by: Tudor Cornea 

---
v2:
* Remove compile-time check for kernel version
---
 drivers/net/af_packet/rte_eth_af_packet.c | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/net/af_packet/rte_eth_af_packet.c 
b/drivers/net/af_packet/rte_eth_af_packet.c
index b73b211..7ecea4e 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -167,6 +167,22 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, 
uint16_t nb_pkts)
return num_rx;
 }
 
+static inline bool tx_ring_status_unavailable(uint32_t tp_status)
+{
+   /*
+* We eliminate the timestamp status from the packet status.
+* This should only matter if timestamping is enabled on the socket,
+* but there is a bug in the kernel which is fixed in newer releases.
+*
+* See the following kernel commit for reference:
+* commit 171c3b151118a2fe0fc1e2a9d1b5a1570cfe82d2
+* net: packetmmap: fix only tx timestamp on request
+*/
+   tp_status &= ~(TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_RAW_HARDWARE);
+
+   return tp_status != TP_STATUS_AVAILABLE;
+}
+
 /*
  * Callback to handle sending packets through a real NIC.
  */
@@ -212,8 +228,8 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, 
uint16_t nb_pkts)
}
 
/* point at the next incoming frame */
-   if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
-   (poll(&pfd, 1, -1) < 0))
+   if (tx_ring_status_unavailable(ppd->tp_status) &&
+   poll(&pfd, 1, -1) < 0)
break;
 
/* copy the tx frame data */
-- 
2.7.4



Re: [dpdk-dev] [PATCH] net/af_packet: remove timestamp from packet status

2021-09-13 Thread Tudor Cornea
Thanks for the observation.

I have removed the compile-time kernel version check in v2 of the patch


On Mon, 13 Sept 2021 at 18:09, Stephen Hemminger 
wrote:

> On Mon, 13 Sep 2021 17:09:11 +0300
> Tudor Cornea  wrote:
>
> > +static inline bool tx_ring_status_unavailable(uint32_t tp_status)
> > +{
> > +#if KERNEL_VERSION(5, 10, 0) > LINUX_VERSION_CODE
>
> No, having kernel dependent userspace in DPDK is not good practice.
>
> Distribution vendors don't number their kernels the same as upstream.
> RHEL for example, keeps same version over life or release but backports
> many fixes.
>
> Also, the system DPDK runs on is often not the system DPDK is built
> on.
>


Re: [dpdk-dev] [PATCH v2] net/af_packet: reinsert the stripped vlan tag

2021-09-21 Thread Tudor Cornea
Thanks, Ferruh

I will perform the suggested recommendations in version 3 of the patch.

On Mon, 20 Sept 2021 at 18:41, Ferruh Yigit  wrote:

> On 9/8/2021 9:59 AM, Tudor Cornea wrote:
> > The af_packet pmd driver binds to a raw socket and allows
> > sending and receiving of packets through the kernel.
> >
> > Since commit [1], the kernel strips the vlan tags early in
> > __netif_receive_skb_core(), so we receive untagged packets while
> > running with the af_packet pmd.
> >
> > Luckily for us, the skb vlan-related fields are still populated from the
> > stripped vlan tags, so we end up having all the information
> > that we need in the mbuf.
> >
> > Having the PMD driver support DEV_RX_OFFLOAD_VLAN_STRIP allows the
> > application to control the desired vlan stripping behavior.
> >
> > [1]
> https://github.com/torvalds/linux/commit/bcc6d47903612c3861201cc3a866fb604f26b8b2
> >
> > Signed-off-by: Tudor Cornea 
> >
>
> Hi Tudor,
>
> The concern was unexpected performance degradation (user not setting any
> offload
> will have performance drop). But since your measurements show no
> significant
> drop, I think it is fair to make driver behave same as other drivers.
> (Until we have a way to describe offloads that can't be disabled by PMDs.)
>
> Can you do a few minor updates:
> - Put your performance measurements into to the commit log to record them
> - Update the af_packet documentation (doc/guides/nics/af_packet.rst) to
> document
> PMD behavior with packets with VLAN tag
> - Update release note (doc/guides/rel_notes/release_21_11.rst) with a
> one/two
> sentences to document the change, to notify possible users of the
> af_packet with
> the change.
>
> Thanks,
> ferruh
>
> > ---
> > v2:
> > * Add DEV_RX_OFFLOAD_VLAN_STRIP to rxmode->offloads
> > ---
> >  drivers/net/af_packet/rte_eth_af_packet.c | 12 
> >  1 file changed, 12 insertions(+)
> >
> > diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
> b/drivers/net/af_packet/rte_eth_af_packet.c
> > index b73b211..5ed9dd6 100644
> > --- a/drivers/net/af_packet/rte_eth_af_packet.c
> > +++ b/drivers/net/af_packet/rte_eth_af_packet.c
> > @@ -48,6 +48,7 @@ struct pkt_rx_queue {
> >
> >   struct rte_mempool *mb_pool;
> >   uint16_t in_port;
> > + uint8_t vlan_strip;
> >
> >   volatile unsigned long rx_pkts;
> >   volatile unsigned long rx_bytes;
> > @@ -78,6 +79,7 @@ struct pmd_internals {
> >
> >   struct pkt_rx_queue *rx_queue;
> >   struct pkt_tx_queue *tx_queue;
> > + uint8_t vlan_strip;
> >  };
> >
> >  static const char *valid_arguments[] = {
> > @@ -148,6 +150,9 @@ eth_af_packet_rx(void *queue, struct rte_mbuf
> **bufs, uint16_t nb_pkts)
> >   if (ppd->tp_status & TP_STATUS_VLAN_VALID) {
> >   mbuf->vlan_tci = ppd->tp_vlan_tci;
> >   mbuf->ol_flags |= (PKT_RX_VLAN |
> PKT_RX_VLAN_STRIPPED);
> > +
> > + if (!pkt_q->vlan_strip && rte_vlan_insert(&mbuf))
> > + PMD_LOG(ERR, "Failed to reinsert VLAN
> tag");
> >   }
> >
> >   /* release incoming frame and advance ring buffer */
> > @@ -302,6 +307,11 @@ eth_dev_stop(struct rte_eth_dev *dev)
> >  static int
> >  eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
> >  {
> > + struct rte_eth_conf *dev_conf = &dev->data->dev_conf;
> > + const struct rte_eth_rxmode *rxmode = &dev_conf->rxmode;
> > + struct pmd_internals *internals = dev->data->dev_private;
> > +
> > + internals->vlan_strip = !!(rxmode->offloads &
> DEV_RX_OFFLOAD_VLAN_STRIP);
> >   return 0;
> >  }
> >
> > @@ -318,6 +328,7 @@ eth_dev_info(struct rte_eth_dev *dev, struct
> rte_eth_dev_info *dev_info)
> >   dev_info->min_rx_bufsize = 0;
> >   dev_info->tx_offload_capa = DEV_TX_OFFLOAD_MULTI_SEGS |
> >   DEV_TX_OFFLOAD_VLAN_INSERT;
> > + dev_info->rx_offload_capa = DEV_RX_OFFLOAD_VLAN_STRIP;
> >
> >   return 0;
> >  }
> > @@ -448,6 +459,7 @@ eth_rx_queue_setup(struct rte_eth_dev *dev,
> >
> >   dev->data->rx_queues[rx_queue_id] = pkt_q;
> >   pkt_q->in_port = dev->data->port_id;
> > + pkt_q->vlan_strip = internals->vlan_strip;
> >
> >   return 0;
> >  }
> >
>
>


Re: [dpdk-dev] [PATCH v2] net/af_packet: remove timestamp from packet status

2021-09-21 Thread Tudor Cornea
Thanks for the suggestion. I will send a new version of the patch with the
required changes.

Tudor

On Mon, 20 Sept 2021 at 20:49, Ferruh Yigit  wrote:

> On 9/13/2021 6:23 PM, Tudor Cornea wrote:
> > We should eliminate the timestamp status from the packet
> > status. This should only matter if timestamping is enabled
> > on the socket, but we might hit a kernel bug, which is fixed
> > in newer releases.
> >
> > For interfaces of type 'veth', the sent skb is forwarded
> > to the peer and back into the network stack which timestamps
> > it on the RX path if timestamping is enabled globally
> > (which happens if any socket enables timestamping).
> >
> > When the skb is destructed, tpacket_destruct_skb() is called
> > and it calls __packet_set_timestamp() which doesn't check
> > the flags on the socket and returns the timestamp if it is
> > set in the skb (and for veth it is, as mentioned above).
> >
> > See the following kernel commit for reference [1]:
> >
> > net: packetmmap: fix only tx timestamp on request
> >
> > The packetmmap tx ring should only return timestamps if requested
> > via setsockopt PACKET_TIMESTAMP, as documented. This allows
> > compatibility with non-timestamp aware user-space code which checks
> > tp_status == TP_STATUS_AVAILABLE; not expecting additional timestamp
> > flags to be set in tp_status.
> >
> > [1] https://www.spinics.net/lists/kernel/msg3959391.html
> >
> > Signed-off-by: Mihai Pogonaru 
> > Signed-off-by: Tudor Cornea 
> >
> > ---
> > v2:
> > * Remove compile-time check for kernel version
>
> OK, Stephen's comment makes sense.
>
> > ---
> >  drivers/net/af_packet/rte_eth_af_packet.c | 20 ++--
> >  1 file changed, 18 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
> b/drivers/net/af_packet/rte_eth_af_packet.c
> > index b73b211..7ecea4e 100644
> > --- a/drivers/net/af_packet/rte_eth_af_packet.c
> > +++ b/drivers/net/af_packet/rte_eth_af_packet.c
> > @@ -167,6 +167,22 @@ eth_af_packet_rx(void *queue, struct rte_mbuf
> **bufs, uint16_t nb_pkts)
> >   return num_rx;
> >  }
> >
> > +static inline bool tx_ring_status_unavailable(uint32_t tp_status)
> > +{
>
> Minor syntax comment, can you have the 'static inline bool' part in
> separate
> line. And a basic function comment can be good.
>
> Thanks,
> ferruh
>
> > + /*
> > +  * We eliminate the timestamp status from the packet status.
> > +  * This should only matter if timestamping is enabled on the
> socket,
> > +  * but there is a bug in the kernel which is fixed in newer
> releases.
> > +  *
> > +  * See the following kernel commit for reference:
> > +  * commit 171c3b151118a2fe0fc1e2a9d1b5a1570cfe82d2
> > +  * net: packetmmap: fix only tx timestamp on request
> > +  */
> > + tp_status &= ~(TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_RAW_HARDWARE);
> > +
> > + return tp_status != TP_STATUS_AVAILABLE;
> > +}
> > +
> >  /*
> >   * Callback to handle sending packets through a real NIC.
> >   */
> > @@ -212,8 +228,8 @@ eth_af_packet_tx(void *queue, struct rte_mbuf
> **bufs, uint16_t nb_pkts)
> >   }
> >
> >   /* point at the next incoming frame */
> > - if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
> > - (poll(&pfd, 1, -1) < 0))
> > + if (tx_ring_status_unavailable(ppd->tp_status) &&
> > + poll(&pfd, 1, -1) < 0)
> >   break;
> >
> >   /* copy the tx frame data */
> >
>
>


[dpdk-dev] [PATCH v3] net/af_packet: remove timestamp from packet status

2021-09-23 Thread Tudor Cornea
We should eliminate the timestamp status from the packet
status. This should only matter if timestamping is enabled
on the socket, but we might hit a kernel bug, which is fixed
in newer releases.

For interfaces of type 'veth', the sent skb is forwarded
to the peer and back into the network stack which timestamps
it on the RX path if timestamping is enabled globally
(which happens if any socket enables timestamping).

When the skb is destructed, tpacket_destruct_skb() is called
and it calls __packet_set_timestamp() which doesn't check
the flags on the socket and returns the timestamp if it is
set in the skb (and for veth it is, as mentioned above).

See the following kernel commit for reference [1]:

net: packetmmap: fix only tx timestamp on request

The packetmmap tx ring should only return timestamps if requested
via setsockopt PACKET_TIMESTAMP, as documented. This allows
compatibility with non-timestamp aware user-space code which checks
tp_status == TP_STATUS_AVAILABLE; not expecting additional timestamp
flags to be set in tp_status.

[1] https://www.spinics.net/lists/kernel/msg3959391.html

Signed-off-by: Mihai Pogonaru 
Signed-off-by: Tudor Cornea 

---
v3:
* Place function name and parameters on a separate line.
v2:
* Remove compile-time check for kernel version
---
 drivers/net/af_packet/rte_eth_af_packet.c | 24 ++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/drivers/net/af_packet/rte_eth_af_packet.c 
b/drivers/net/af_packet/rte_eth_af_packet.c
index b73b211..fcd8090 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -168,6 +168,26 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, 
uint16_t nb_pkts)
 }
 
 /*
+ * Check if there is an available frame in the ring
+ */
+static inline bool
+tx_ring_status_available(uint32_t tp_status)
+{
+   /*
+* We eliminate the timestamp status from the packet status.
+* This should only matter if timestamping is enabled on the socket,
+* but there is a bug in the kernel which is fixed in newer releases.
+*
+* See the following kernel commit for reference:
+* commit 171c3b151118a2fe0fc1e2a9d1b5a1570cfe82d2
+* net: packetmmap: fix only tx timestamp on request
+*/
+   tp_status &= ~(TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_RAW_HARDWARE);
+
+   return tp_status == TP_STATUS_AVAILABLE;
+}
+
+/*
  * Callback to handle sending packets through a real NIC.
  */
 static uint16_t
@@ -212,8 +232,8 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, 
uint16_t nb_pkts)
}
 
/* point at the next incoming frame */
-   if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
-   (poll(&pfd, 1, -1) < 0))
+   if (!tx_ring_status_available(ppd->tp_status) &&
+   poll(&pfd, 1, -1) < 0)
break;
 
/* copy the tx frame data */
-- 
2.7.4



[dpdk-dev] [PATCH v3] net/af_packet: reinsert the stripped vlan tag

2021-09-24 Thread Tudor Cornea
The af_packet pmd driver binds to a raw socket and allows
sending and receiving of packets through the kernel.

Since commit [1], the kernel strips the vlan tags early in
__netif_receive_skb_core(), so we receive untagged packets while
running with the af_packet pmd.

Luckily for us, the skb vlan-related fields are still populated from the
stripped vlan tags, so we end up having all the information
that we need in the mbuf.

Having the pmd driver support DEV_RX_OFFLOAD_VLAN_STRIP allows the
application to control the desired vlan stripping behavior,
until we have a way to describe offloads that can't be disabled by
pmd drivers.

This patch will cause a change in the default way that the af_packet
pmd treats received vlan-tagged frames. While previously, the
application was required to check the PKT_RX_VLAN_STRIPPED flag, after
this patch, the pmd will re-insert the vlan tag transparently to the
user, unless the DEV_RX_OFFLOAD_VLAN_STRIP is enabled in
rxmode.offloads.

I've attempted a preliminary benchmark to understand if the change could
cause a sizable performance hit.

Setup:
Two virtual machines running on top of an ESXi hypervisor

Tx: DPDK app (running on top of vmxnet3 PMD)
Rx: af_packet (running on top of a kernel vmxnet3 interface)
Packet size :68 (packet contains a vlan tag)

Rates:
Tx - 1.419 Mpps
Rx (without vlan insertion) - 1227636 pps
Rx (with vlan insertion)- 1220081 pps

At a first glance, we don't seem to have a large degradation in terms of packet 
rate

[1] 
https://github.com/torvalds/linux/commit/bcc6d47903612c3861201cc3a866fb604f26b8b2

Signed-off-by: Tudor Cornea 

---
v3:
* Updated release note and documentation
* Updated commit with performance measurements
v2:
* Added DEV_RX_OFFLOAD_VLAN_STRIP to rxmode->offloads
---
 doc/guides/nics/af_packet.rst | 38 +++
 doc/guides/rel_notes/release_21_11.rst|  4 
 drivers/net/af_packet/rte_eth_af_packet.c | 12 ++
 3 files changed, 54 insertions(+)

diff --git a/doc/guides/nics/af_packet.rst b/doc/guides/nics/af_packet.rst
index efd6f1c..97d5502 100644
--- a/doc/guides/nics/af_packet.rst
+++ b/doc/guides/nics/af_packet.rst
@@ -65,3 +65,41 @@ framecnt=512):
 .. code-block:: console
 
 
--vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0
+
+Features and Limitations of the af_packet PMD
+-
+
+Since the following commit, the Linux kernel strips the vlan tag
+
+.. code-block:: console
+
+commit bcc6d47903612c3861201cc3a866fb604f26b8b2
+Author: Jiri Pirko 
+Date:   Thu Apr 7 19:48:33 2011 +
+
+ net: vlan: make non-hw-accel rx path similar to hw-accel
+
+Running on such a kernel results in receiving untagged frames while using
+the af_packet PMD. Fortunately, the stripped information is still available
+for use in ``mbuf->vlan_tci``, and applications could check 
``PKT_RX_VLAN_STRIPPED``.
+
+However, we currently don't have a way to describe offloads which can't be
+disabled by PMDs, and this creates an inconsistency with the way applications
+expect the PMD offloads to work, and requires them to be aware of which
+underlying driver they use.
+
+Since release 21.11 the af_packet PMD will implement support for the
+``DEV_RX_OFFLOAD_VLAN_STRIP`` offload, and users can control the desired vlan
+stripping behavior.
+
+It's important to note that the default case will change. If previously,
+the vlan tag was stripped, if the application now requires the same behavior,
+it will need to configure ``rxmode.offloads`` with 
``DEV_RX_OFFLOAD_VLAN_STRIP``.
+
+The PMD driver will re-insert the vlan tag transparently to the application
+if the kernel strips it, as long as the ``DEV_RX_OFFLOAD_VLAN_STRIP`` is not
+enabled.
+
+.. code-block:: console
+
+port_conf.rxmode.offloads |= DEV_RX_OFFLOAD_VLAN_STRIP
diff --git a/doc/guides/rel_notes/release_21_11.rst 
b/doc/guides/rel_notes/release_21_11.rst
index ad7c1af..095fd5b 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -66,6 +66,10 @@ New Features
 
   * Added rte_flow support for dual VLAN insert and strip actions.
 
+* **Updated af_packet ethdev driver.**
+
+  * Added DEV_RX_OFFLOAD_VLAN_STRIP capability.
+
 * **Updated Marvell cnxk crypto PMD.**
 
   * Added AES-CBC SHA1-HMAC support in lookaside protocol (IPsec) for CN10K.
diff --git a/drivers/net/af_packet/rte_eth_af_packet.c 
b/drivers/net/af_packet/rte_eth_af_packet.c
index b73b211..5ed9dd6 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -48,6 +48,7 @@ struct pkt_rx_queue {
 
struct rte_mempool *mb_pool;
uint16_t in_port;
+   uint8_t vlan_strip;
 
volatile unsigned long rx_pkts;
volatile unsigned long rx_bytes;
@@ -78,6 +79,7 @@ struct pmd_internals {
 
struct pkt_rx_queue *rx_queue;
 

[dpdk-dev] [PATCH] net/af_packet: run on kernels without qdisc bypass support

2021-07-14 Thread Tudor Cornea
Some older kernels do not support the PACKET_QDISC_BYPASS socket
option. Such an example is the CentOS 7 kernel (3.10).

If we only check for the definition of PACKET_QDISC_BYPASS, it might mean
that we will not be able to compile the PMD driver on a newer platform,
and run in on a machine with an older kernel.

Setting the socket option only if it is specifically requested from
the EAL arguments, allows us to have a way to run the PMD compiled
against newer kernel headers, on platforms having older kernels.

Signed-off-by: Tudor Cornea 
---
 drivers/net/af_packet/rte_eth_af_packet.c | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/net/af_packet/rte_eth_af_packet.c 
b/drivers/net/af_packet/rte_eth_af_packet.c
index 2e90e29..60b485a 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -749,13 +749,15 @@ rte_pmd_init_internals(struct rte_vdev_device *dev,
}
 
 #if defined(PACKET_QDISC_BYPASS)
-   rc = setsockopt(qsockfd, SOL_PACKET, PACKET_QDISC_BYPASS,
-   &qdisc_bypass, sizeof(qdisc_bypass));
-   if (rc == -1) {
-   PMD_LOG_ERRNO(ERR,
-   "%s: could not set PACKET_QDISC_BYPASS on 
AF_PACKET socket for %s",
-   name, pair->value);
-   goto error;
+   if (qdisc_bypass) {
+   rc = setsockopt(qsockfd, SOL_PACKET, 
PACKET_QDISC_BYPASS,
+   &qdisc_bypass, sizeof(qdisc_bypass));
+   if (rc == -1) {
+   PMD_LOG_ERRNO(ERR,
+   "%s: could not set PACKET_QDISC_BYPASS 
on AF_PACKET socket for %s",
+   name, pair->value);
+   goto error;
+   }
}
 #else
RTE_SET_USED(qdisc_bypass);
-- 
2.7.4



[dpdk-dev] [PATCH] net/iavf: fix overflow in maximum packet length config

2021-08-06 Thread Tudor Cornea
The len variable, used in the computation of max_pkt_len could
overflow, if used to store the result of the following computation:

rxq->rx_buf_len * IAVF_MAX_CHAINED_RX_BUFFERS

Since, we could define the mbuf size to have a large value (i.e 13312),
and IAVF_MAX_CHAINED_RX_BUFFERS is defined as 5, the computation mentioned
above could potentially result in a value which might be bigger than MAX_USHORT.

The result will be that Jumbo Frames will not work properly

A similar fix was submitted for the ice driver

Signed-off-by: Tudor Cornea 
---
 drivers/net/iavf/iavf_ethdev.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/iavf/iavf_ethdev.c b/drivers/net/iavf/iavf_ethdev.c
index 574cfe0..dc5cbc2 100644
--- a/drivers/net/iavf/iavf_ethdev.c
+++ b/drivers/net/iavf/iavf_ethdev.c
@@ -574,13 +574,14 @@ iavf_init_rxq(struct rte_eth_dev *dev, struct 
iavf_rx_queue *rxq)
 {
struct iavf_hw *hw = IAVF_DEV_PRIVATE_TO_HW(dev->data->dev_private);
struct rte_eth_dev_data *dev_data = dev->data;
-   uint16_t buf_size, max_pkt_len, len;
+   uint16_t buf_size, max_pkt_len;
 
buf_size = rte_pktmbuf_data_room_size(rxq->mp) - RTE_PKTMBUF_HEADROOM;
 
/* Calculate the maximum packet length allowed */
-   len = rxq->rx_buf_len * IAVF_MAX_CHAINED_RX_BUFFERS;
-   max_pkt_len = RTE_MIN(len, dev->data->dev_conf.rxmode.max_rx_pkt_len);
+   max_pkt_len = RTE_MIN((uint32_t)
+   rxq->rx_buf_len * IAVF_MAX_CHAINED_RX_BUFFERS,
+   dev->data->dev_conf.rxmode.max_rx_pkt_len);
 
/* Check if the jumbo frame and maximum packet length are set
 * correctly.
-- 
2.7.4



Re: [PATCH v3] kni: allow configuring the kni thread granularity

2021-11-24 Thread Tudor Cornea
Hi Ferruh,



> As I tested both with KNI sample app and KNI PMD, change looks good,
> practically we can't change the scheduler delay with existing code but
> this patch enables it and lets configure performance/CPU usage trade of.
>
> Only possible change is to remove 'RTE_KNI_PREEMPT_DEFAULT' macro as
> mentioned below.
>
>
Thanks for the suggestion.
I will send an updated version of the patch, which will remove the
RTE_KNI_PREEMPT_DEFAULT macro


[PATCH v4] kni: allow configuring the kni thread granularity

2021-11-24 Thread Tudor Cornea
The Kni kthreads seem to be re-scheduled at a granularity of roughly
1 millisecond right now, which seems to be insufficient for performing
tests involving a lot of control plane traffic.

Even if KNI_KTHREAD_RESCHEDULE_INTERVAL is set to 5 microseconds, it
seems that the existing code cannot reschedule at the desired granularily,
due to precision constraints of schedule_timeout_interruptible().

In our use case, we leverage the Linux Kernel for control plane, and
it is not uncommon to have 60K - 100K pps for some signaling protocols.

Since we are not in atomic context, the usleep_range() function seems to be
more appropriate for being able to introduce smaller controlled delays,
in the range of 5-10 microseconds. Upon reading the existing code, it would
seem that this was the original intent. Adding sub-millisecond delays,
seems unfeasible with a call to schedule_timeout_interruptible().

KNI_KTHREAD_RESCHEDULE_INTERVAL 5 /* us */
schedule_timeout_interruptible(
usecs_to_jiffies(KNI_KTHREAD_RESCHEDULE_INTERVAL));

Below, we attempted a brief comparison between the existing implementation,
which uses schedule_timeout_interruptible() and usleep_range().

We attempt to measure the CPU usage, and RTT between two Kni interfaces,
which are created on top of vmxnet3 adapters, connected by a vSwitch.

insmod rte_kni.ko kthread_mode=single carrier=on

schedule_timeout_interruptible(usecs_to_jiffies(5))
kni_single CPU Usage: 2-4 %
[root@localhost ~]# ping 1.1.1.2 -I eth1
PING 1.1.1.2 (1.1.1.2) from 1.1.1.1 eth1: 56(84) bytes of data.
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.70 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.00 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.99 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.985 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.00 ms

usleep_range(5, 10)
kni_single CPU usage: 50%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.338 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.150 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.123 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.139 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.159 ms

usleep_range(20, 50)
kni_single CPU usage: 24%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.202 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.170 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.171 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.248 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.185 ms

usleep_range(50, 100)
kni_single CPU usage: 13%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.537 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.257 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.231 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.143 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.200 ms

usleep_range(100, 200)
kni_single CPU usage: 7%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.716 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.167 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.459 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.455 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=0.252 ms

usleep_range(1000, 1100)
kni_single CPU usage: 2%
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=2.22 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.17 ms
64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=1.17 ms
64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=1.17 ms
64 bytes from 1.1.1.2: icmp_seq=5 ttl=64 time=1.15 ms

Upon testing, usleep_range(1000, 1100) seems roughly equivalent in
latency and cpu usage to the variant with schedule_timeout_interruptible(),
while usleep_range(100, 200) seems to give a decent tradeoff between
latency and cpu usage, while allowing users to tweak the limits for
improved precision if they have such use cases.

Disabling RTE_KNI_PREEMPT_DEFAULT, interestingly seems to lead to a
softlockup on my kernel.

Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 1226 Comm: kni_single Tainted: GW  O 3.10 #1
   [] dump_stack+0x19/0x1b
 [] panic+0xcd/0x1e0
 [] watchdog_timer_fn+0x160/0x160
 [] __run_hrtimer.isra.4+0x42/0xd0
 [] hrtimer_interrupt+0xe7/0x1f0
 [] smp_apic_timer_interrupt+0x67/0xa0
 [] apic_timer_interrupt+0x6d/0x80

This patch also attempts to remove this option.

References:
[1] https://www.kernel.org/doc/Documentation/timers/timers-howto.txt

Signed-off-by: Tudor Cornea 

---
v4:
* Removed RTE_KNI_PREEMPT_DEFAULT configuration option
v3:
* Fixed unwrapped commit description warning
* Changed from hrtimers to Linux High Precision Timers in docs
* Added two tabs at the beginning of the new params description.
  Stephen correctly pointed out that the descriptions of the parameters
  for the Kni module are nonstandard w.r.t existing kernel code.
  I was thinking to preserve compatibility with the existing parameters
  of the Kni module for the moment, while an additional clean-up patch
  could format the descriptions to be closer to the kernel standard.
v2:
* Fixed some spelling errors

[PATCH] kernel/kni: retry the xmit in case ring is full

2021-12-17 Thread Tudor Cornea
This patch attempts to avoid dropping packets that are to be
transmitted, in case there is no space in the KNI ring.

We have a use case in which we leverage the Linux TCP / IP stack for
control plane, and some protocols might be sensitive to packet drops.

This might mean that the sender (Kernel) might be moving at a faster pace
than the receiver end (DPDK application), or it might have some brief
moments of bursty traffic patterns.

Requeuing the packets could add a kind of backpressure until a transmit
window is available to us.

The burden of retransmitting is shifted to the caller of ndo_start_xmit,
which in our case is the configured queuing discipline. This way, the
user should be able to influence the behavior w.r.t dropping packets,
by picking the desired queuing discipline.

Although it should technically be a good approach, from what
I have tested, stopping the queue prior to returning NETDEV_TX_BUSY seems
to add some extra overhead, and degrade the control-plane performance
a bit.

Signed-off-by: Tudor Cornea 
---
 kernel/linux/kni/kni_net.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/linux/kni/kni_net.c b/kernel/linux/kni/kni_net.c
index 29e5b9e..db0330f 100644
--- a/kernel/linux/kni/kni_net.c
+++ b/kernel/linux/kni/kni_net.c
@@ -321,10 +321,9 @@ kni_net_tx(struct sk_buff *skb, struct net_device *dev)
if (kni_fifo_free_count(kni->tx_q) == 0 ||
kni_fifo_count(kni->alloc_q) == 0) {
/**
-* If no free entry in tx_q or no entry in alloc_q,
-* drops skb and goes out.
+* Tell the caller to requeue, and retry at a later time.
 */
-   goto drop;
+   goto requeue;
}
 
/* dequeue a mbuf from alloc_q */
@@ -371,6 +370,10 @@ kni_net_tx(struct sk_buff *skb, struct net_device *dev)
dev->stats.tx_dropped++;
 
return NETDEV_TX_OK;
+
+requeue:
+   /* Signal the caller to re-transmit at a later time */
+   return NETDEV_TX_BUSY;
 }
 
 /*
-- 
2.7.4



Re: [dpdk-dev] [PATCH v2] net/af_packet: fix ignoring full ring on tx

2021-09-29 Thread Tudor Cornea
Hi Ferruh,

What you described above looks like a ring buffer with single producer and
> single consumer, and producer overwrites the not consumed items.


Indeed. This is also my understanding of the bug.
I am going to try to isolate the issue, and should probably be able to come
up with a script in a few days.

Our of curiosity, are you using an modified af_packet implementation in
> kernel
> for above described usage?


We are currently using an Ubuntu-based distro with a 4.15 Linux kernel.
We don't have any kernel patches for the af_packet implementation to my
knowledge (probably excepting patches that are back-ported by Ubuntu
maintainers from newer releases).


On Mon, 20 Sept 2021 at 20:44, Ferruh Yigit  wrote:

> On 9/13/2021 2:45 PM, Tudor Cornea wrote:
> > The poll call can return POLLERR which is ignored, or it can return
> > POLLOUT, even if there are no free frames in the mmap-ed area.
> >
> > We can account for both of these cases by re-checking if the next
> > frame is empty before writing into it.
> >
> > Signed-off-by: Mihai Pogonaru 
> > Signed-off-by: Tudor Cornea 
> > ---
> >  drivers/net/af_packet/rte_eth_af_packet.c | 19 +++
> >  1 file changed, 19 insertions(+)
> >
> > diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
> b/drivers/net/af_packet/rte_eth_af_packet.c
> > index b73b211..087c196 100644
> > --- a/drivers/net/af_packet/rte_eth_af_packet.c
> > +++ b/drivers/net/af_packet/rte_eth_af_packet.c
> > @@ -216,6 +216,25 @@ eth_af_packet_tx(void *queue, struct rte_mbuf
> **bufs, uint16_t nb_pkts)
> >   (poll(&pfd, 1, -1) < 0))
> >   break;
> >
> > + /*
> > +  * Poll can return POLLERR if the interface is down
> > +  *
> > +  * It will almost always return POLLOUT, even if there
> > +  * are no extra buffers available
> > +  *
> > +  * This happens, because packet_poll() calls
> datagram_poll()
> > +  * which checks the space left in the socket buffer and,
> > +  * in the case of packet_mmap, the default socket buffer
> length
> > +  * doesn't match the requested size for the tx_ring.
> > +  * As such, there is almost always space left in socket
> buffer,
> > +  * which doesn't seem to be correlated to the requested
> size
> > +  * for the tx_ring in packet_mmap.
> > +  *
> > +  * This results in poll() returning POLLOUT.
> > +  */
> > + if (ppd->tp_status != TP_STATUS_AVAILABLE)
> > + break;
> > +
>
> If 'POLLOUT' doesn't indicate that there is space in the buffer, what is
> the
> point of the 'poll()' at all?
>
> What can we test/reproduce the mentioned behavior? Or is there a way to
> fix the
> behavior of poll() or use an alternative of it?
>
>
> OK to break on the 'POLLERR', I guess it can be detected in the
> 'pfd.revent'.
>
>
> >   /* copy the tx frame data */
> >   pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
> >   sizeof(struct sockaddr_ll);
> >
>
>


[dpdk-dev] [PATCH v4] net/af_packet: reinsert the stripped vlan tag

2021-09-29 Thread Tudor Cornea
The af_packet pmd driver binds to a raw socket and allows
sending and receiving of packets through the kernel.

Since commit [1], the kernel strips the vlan tags early in
__netif_receive_skb_core(), so we receive untagged packets while
running with the af_packet pmd.

Luckily for us, the skb vlan-related fields are still populated from the
stripped vlan tags, so we end up having all the information
that we need in the mbuf.

Having the pmd driver support DEV_RX_OFFLOAD_VLAN_STRIP allows the
application to control the desired vlan stripping behavior,
until we have a way to describe offloads that can't be disabled by
pmd drivers.

This patch will cause a change in the default way that the af_packet
pmd treats received vlan-tagged frames. While previously, the
application was required to check the PKT_RX_VLAN_STRIPPED flag, after
this patch, the pmd will re-insert the vlan tag transparently to the
user, unless the DEV_RX_OFFLOAD_VLAN_STRIP is enabled in
rxmode.offloads.

I've attempted a preliminary benchmark to understand if the change could
cause a sizable performance hit.

Setup:
Two virtual machines running on top of an ESXi hypervisor

Tx: DPDK app (running on top of vmxnet3 PMD)
Rx: af_packet (running on top of a kernel vmxnet3 interface)
Packet size :68 (packet contains a vlan tag)

Rates:
Tx - 1.419 Mpps
Rx (without vlan insertion) - 1227636 pps
Rx (with vlan insertion)- 1220081 pps

At a first glance, we don't seem to have a large degradation in terms
of packet rate.

[1] 
https://github.com/torvalds/linux/commit/bcc6d47903612c3861201cc3a866fb604f26b8b2

Signed-off-by: Tudor Cornea 

---
v4:
* Updated the af_packet documentation
v3:
* Updated release note and documentation
* Updated commit with performance measurements
v2:
* Added DEV_RX_OFFLOAD_VLAN_STRIP to rxmode->offloads
---
 doc/guides/nics/af_packet.rst |  5 +
 doc/guides/rel_notes/release_21_11.rst|  4 
 drivers/net/af_packet/rte_eth_af_packet.c | 12 
 3 files changed, 21 insertions(+)

diff --git a/doc/guides/nics/af_packet.rst b/doc/guides/nics/af_packet.rst
index efd6f1c..c87310b 100644
--- a/doc/guides/nics/af_packet.rst
+++ b/doc/guides/nics/af_packet.rst
@@ -65,3 +65,8 @@ framecnt=512):
 .. code-block:: console
 
 
--vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0
+
+Features and Limitations of the af_packet PMD
+-
+
+Af_packet PMD now works with VLAN's on Linux
diff --git a/doc/guides/rel_notes/release_21_11.rst 
b/doc/guides/rel_notes/release_21_11.rst
index ad7c1af..095fd5b 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -66,6 +66,10 @@ New Features
 
   * Added rte_flow support for dual VLAN insert and strip actions.
 
+* **Updated af_packet ethdev driver.**
+
+  * Added DEV_RX_OFFLOAD_VLAN_STRIP capability.
+
 * **Updated Marvell cnxk crypto PMD.**
 
   * Added AES-CBC SHA1-HMAC support in lookaside protocol (IPsec) for CN10K.
diff --git a/drivers/net/af_packet/rte_eth_af_packet.c 
b/drivers/net/af_packet/rte_eth_af_packet.c
index b73b211..5ed9dd6 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -48,6 +48,7 @@ struct pkt_rx_queue {
 
struct rte_mempool *mb_pool;
uint16_t in_port;
+   uint8_t vlan_strip;
 
volatile unsigned long rx_pkts;
volatile unsigned long rx_bytes;
@@ -78,6 +79,7 @@ struct pmd_internals {
 
struct pkt_rx_queue *rx_queue;
struct pkt_tx_queue *tx_queue;
+   uint8_t vlan_strip;
 };
 
 static const char *valid_arguments[] = {
@@ -148,6 +150,9 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, 
uint16_t nb_pkts)
if (ppd->tp_status & TP_STATUS_VLAN_VALID) {
mbuf->vlan_tci = ppd->tp_vlan_tci;
mbuf->ol_flags |= (PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED);
+
+   if (!pkt_q->vlan_strip && rte_vlan_insert(&mbuf))
+   PMD_LOG(ERR, "Failed to reinsert VLAN tag");
}
 
/* release incoming frame and advance ring buffer */
@@ -302,6 +307,11 @@ eth_dev_stop(struct rte_eth_dev *dev)
 static int
 eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
 {
+   struct rte_eth_conf *dev_conf = &dev->data->dev_conf;
+   const struct rte_eth_rxmode *rxmode = &dev_conf->rxmode;
+   struct pmd_internals *internals = dev->data->dev_private;
+
+   internals->vlan_strip = !!(rxmode->offloads & 
DEV_RX_OFFLOAD_VLAN_STRIP);
return 0;
 }
 
@@ -318,6 +328,7 @@ eth_dev_info(struct rte_eth_dev *dev, struct 
rte_eth_dev_info *dev_info)
dev_info->min_rx_bufsize = 0;
dev_info->tx_offload_capa = DEV_TX_OFFLOAD_MULTI_SEGS |
DEV_TX_OFFLOAD_VLAN_INSERT;

Re: [dpdk-dev] [PATCH v3] net/af_packet: reinsert the stripped vlan tag

2021-09-29 Thread Tudor Cornea
Thanks Stephen, for the suggestion

I've sent v4 of the patch, which adds the succinct description in the
af_packet documentation..


On Fri, 24 Sept 2021 at 18:11, Stephen Hemminger 
wrote:

> On Fri, 24 Sep 2021 14:44:45 +0300
> Tudor Cornea  wrote:
>
> > +Features and Limitations of the af_packet PMD
> > +-
> > +
> > +Since the following commit, the Linux kernel strips the vlan tag
> > +
> > +.. code-block:: console
> > +
> > +commit bcc6d47903612c3861201cc3a866fb604f26b8b2
> > +Author: Jiri Pirko 
> > +Date:   Thu Apr 7 19:48:33 2011 +
> > +
> > + net: vlan: make non-hw-accel rx path similar to hw-accel
> > +
> > +Running on such a kernel results in receiving untagged frames while
> using
> > +the af_packet PMD. Fortunately, the stripped information is still
> available
> > +for use in ``mbuf->vlan_tci``, and applications could check
> ``PKT_RX_VLAN_STRIPPED``.
> > +
> > +However, we currently don't have a way to describe offloads which can't
> be
> > +disabled by PMDs, and this creates an inconsistency with the way
> applications
> > +expect the PMD offloads to work, and requires them to be aware of which
> > +underlying driver they use.
> > +
> > +Since release 21.11 the af_packet PMD will implement support for the
> > +``DEV_RX_OFFLOAD_VLAN_STRIP`` offload, and users can control the
> desired vlan
> > +stripping behavior.
> > +
> > +It's important to note that the default case will change. If previously,
> > +the vlan tag was stripped, if the application now requires the same
> behavior,
> > +it will need to configure ``rxmode.offloads`` with
> ``DEV_RX_OFFLOAD_VLAN_STRIP``.
> > +
> > +The PMD driver will re-insert the vlan tag transparently to the
> application
> > +if the kernel strips it, as long as the ``DEV_RX_OFFLOAD_VLAN_STRIP``
> is not
> > +enabled.
>
> This seems like the wrong place for this text.
> It is not a new feature, it is a bug fix.
>
> If you want to describe it as an enhancement, the text should be succinct
> and describe
> the situation from user point of view. Something like:
>
> Af_packet PMD now works with VLAN's on Linux
>
>


[dpdk-dev] [PATCH v5] net/af_packet: reinsert the stripped vlan tag

2021-10-01 Thread Tudor Cornea
The af_packet pmd driver binds to a raw socket and allows
sending and receiving of packets through the kernel.

Since commit [1], the kernel strips the vlan tags early in
__netif_receive_skb_core(), so we receive untagged packets while
running with the af_packet pmd.

Luckily for us, the skb vlan-related fields are still populated from the
stripped vlan tags, so we end up having all the information
that we need in the mbuf.

Having the pmd driver support DEV_RX_OFFLOAD_VLAN_STRIP allows the
application to control the desired vlan stripping behavior,
until we have a way to describe offloads that can't be disabled by
pmd drivers.

This patch will cause a change in the default way that the af_packet
pmd treats received vlan-tagged frames. While previously, the
application was required to check the PKT_RX_VLAN_STRIPPED flag, after
this patch, the pmd will re-insert the vlan tag transparently to the
user, unless the DEV_RX_OFFLOAD_VLAN_STRIP is enabled in
rxmode.offloads.

I've attempted a preliminary benchmark to understand if the change could
cause a sizable performance hit.

Setup:
Two virtual machines running on top of an ESXi hypervisor

Tx: DPDK app (running on top of vmxnet3 PMD)
Rx: af_packet (running on top of a kernel vmxnet3 interface)
Packet size :68 (packet contains a vlan tag)

Rates:
Tx - 1.419 Mpps
Rx (without vlan insertion) - 1227636 pps
Rx (with vlan insertion)- 1220081 pps

At a first glance, we don't seem to have a large degradation in terms
of packet rate.

[1] 
https://github.com/torvalds/linux/commit/bcc6d47903612c3861201cc3a866fb604f26b8b2

Signed-off-by: Tudor Cornea 

---
v5:
* Updated the af_packet documentation
* Updated the af_packet release notes
v4:
* Updated the af_packet documentation
v3:
* Updated release note and documentation
* Updated commit with performance measurements
v2:
* Added DEV_RX_OFFLOAD_VLAN_STRIP to rxmode->offloads
---
 doc/guides/nics/af_packet.rst |  7 +++
 doc/guides/rel_notes/release_21_11.rst|  5 +
 drivers/net/af_packet/rte_eth_af_packet.c | 12 
 3 files changed, 24 insertions(+)

diff --git a/doc/guides/nics/af_packet.rst b/doc/guides/nics/af_packet.rst
index efd6f1c..168a946 100644
--- a/doc/guides/nics/af_packet.rst
+++ b/doc/guides/nics/af_packet.rst
@@ -65,3 +65,10 @@ framecnt=512):
 .. code-block:: console
 
 
--vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0
+
+Features and Limitations of the af_packet PMD
+-
+
+The PMD will re-insert the VLAN tag transparently to the packet
+if the kernel strips it, as long as the ``DEV_RX_OFFLOAD_VLAN_STRIP`` is not
+enabled by the application.
diff --git a/doc/guides/rel_notes/release_21_11.rst 
b/doc/guides/rel_notes/release_21_11.rst
index ad7c1af..3315703 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -66,6 +66,11 @@ New Features
 
   * Added rte_flow support for dual VLAN insert and strip actions.
 
+* **Updated af_packet ethdev driver.**
+
+  * Default VLAN strip behavior changed.
+If previously, the VLAN tag was stripped by the kernel, if the application 
now requires the same behavior, it will need to enable 
``DEV_RX_OFFLOAD_VLAN_STRIP``.
+
 * **Updated Marvell cnxk crypto PMD.**
 
   * Added AES-CBC SHA1-HMAC support in lookaside protocol (IPsec) for CN10K.
diff --git a/drivers/net/af_packet/rte_eth_af_packet.c 
b/drivers/net/af_packet/rte_eth_af_packet.c
index b73b211..5ed9dd6 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -48,6 +48,7 @@ struct pkt_rx_queue {
 
struct rte_mempool *mb_pool;
uint16_t in_port;
+   uint8_t vlan_strip;
 
volatile unsigned long rx_pkts;
volatile unsigned long rx_bytes;
@@ -78,6 +79,7 @@ struct pmd_internals {
 
struct pkt_rx_queue *rx_queue;
struct pkt_tx_queue *tx_queue;
+   uint8_t vlan_strip;
 };
 
 static const char *valid_arguments[] = {
@@ -148,6 +150,9 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, 
uint16_t nb_pkts)
if (ppd->tp_status & TP_STATUS_VLAN_VALID) {
mbuf->vlan_tci = ppd->tp_vlan_tci;
mbuf->ol_flags |= (PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED);
+
+   if (!pkt_q->vlan_strip && rte_vlan_insert(&mbuf))
+   PMD_LOG(ERR, "Failed to reinsert VLAN tag");
}
 
/* release incoming frame and advance ring buffer */
@@ -302,6 +307,11 @@ eth_dev_stop(struct rte_eth_dev *dev)
 static int
 eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
 {
+   struct rte_eth_conf *dev_conf = &dev->data->dev_conf;
+   const struct rte_eth_rxmode *rxmode = &dev_conf->rxmode;
+   struct pmd_internals *internals = dev->data->dev_private;

Re: [dpdk-dev] [PATCH v4] net/af_packet: reinsert the stripped vlan tag

2021-10-01 Thread Tudor Cornea
Hi Ferruh,

 Also driver was already working with VLAN, the change is VLAN is not
> force stripped anymore.


 Agreed. It makes more sense to inform the users that the default behavior
of the PMD has changed w.r.t VLAN stripping.

I have updated the patch

On Thu, 30 Sept 2021 at 11:14, Ferruh Yigit  wrote:

> On 9/29/2021 3:08 PM, Tudor Cornea wrote:
> > The af_packet pmd driver binds to a raw socket and allows
> > sending and receiving of packets through the kernel.
> >
> > Since commit [1], the kernel strips the vlan tags early in
> > __netif_receive_skb_core(), so we receive untagged packets while
> > running with the af_packet pmd.
> >
> > Luckily for us, the skb vlan-related fields are still populated from the
> > stripped vlan tags, so we end up having all the information
> > that we need in the mbuf.
> >
> > Having the pmd driver support DEV_RX_OFFLOAD_VLAN_STRIP allows the
> > application to control the desired vlan stripping behavior,
> > until we have a way to describe offloads that can't be disabled by
> > pmd drivers.
> >
> > This patch will cause a change in the default way that the af_packet
> > pmd treats received vlan-tagged frames. While previously, the
> > application was required to check the PKT_RX_VLAN_STRIPPED flag, after
> > this patch, the pmd will re-insert the vlan tag transparently to the
> > user, unless the DEV_RX_OFFLOAD_VLAN_STRIP is enabled in
> > rxmode.offloads.
> >
> > I've attempted a preliminary benchmark to understand if the change could
> > cause a sizable performance hit.
> >
> > Setup:
> > Two virtual machines running on top of an ESXi hypervisor
> >
> > Tx: DPDK app (running on top of vmxnet3 PMD)
> > Rx: af_packet (running on top of a kernel vmxnet3 interface)
> > Packet size :68 (packet contains a vlan tag)
> >
> > Rates:
> > Tx - 1.419 Mpps
> > Rx (without vlan insertion) - 1227636 pps
> > Rx (with vlan insertion)    - 1220081 pps
> >
> > At a first glance, we don't seem to have a large degradation in terms
> > of packet rate.
> >
> > [1]
> https://github.com/torvalds/linux/commit/bcc6d47903612c3861201cc3a866fb604f26b8b2
> >
> > Signed-off-by: Tudor Cornea 
> >
> > ---
> > v4:
> > * Updated the af_packet documentation
> > v3:
> > * Updated release note and documentation
> > * Updated commit with performance measurements
> > v2:
> > * Added DEV_RX_OFFLOAD_VLAN_STRIP to rxmode->offloads
> > ---
> >  doc/guides/nics/af_packet.rst |  5 +
> >  doc/guides/rel_notes/release_21_11.rst|  4 
> >  drivers/net/af_packet/rte_eth_af_packet.c | 12 
> >  3 files changed, 21 insertions(+)
> >
> > diff --git a/doc/guides/nics/af_packet.rst
> b/doc/guides/nics/af_packet.rst
> > index efd6f1c..c87310b 100644
> > --- a/doc/guides/nics/af_packet.rst
> > +++ b/doc/guides/nics/af_packet.rst
> > @@ -65,3 +65,8 @@ framecnt=512):
> >  .. code-block:: console
> >
> >
> --vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0
> > +
> > +Features and Limitations of the af_packet PMD
> > +-
> > +
> > +Af_packet PMD now works with VLAN's on Linux
>
> Thanks Tudor.
>
> I see this is suggested by Stephen, but I think 'now' is confusing in the
> driver
> guide. Also driver was already working with VLAN, the change is VLAN is not
> force stripped anymore.
>
> I think following part from your previous version is better, what do you
> think
> something like following?
> "
> The PMD will re-insert the VLAN tag transparently to the packet
> if the kernel strips it, as long as the ``DEV_RX_OFFLOAD_VLAN_STRIP`` is
> not
> enabled by application.
> "
>
> > diff --git a/doc/guides/rel_notes/release_21_11.rst
> b/doc/guides/rel_notes/release_21_11.rst
> > index ad7c1af..095fd5b 100644
> > --- a/doc/guides/rel_notes/release_21_11.rst
> > +++ b/doc/guides/rel_notes/release_21_11.rst
> > @@ -66,6 +66,10 @@ New Features
> >
> >* Added rte_flow support for dual VLAN insert and strip actions.
> >
> > +* **Updated af_packet ethdev driver.**
> > +
> > +  * Added DEV_RX_OFFLOAD_VLAN_STRIP capability.
> > +
>
> I think change in the default behavior is more important in the release
> notes,
> again what do you think to have your following update here:
>
> "
> Default VLAN strip behavior changed. If previously,
> the vlan tag was stripped, if the application now requires the same
> behavior,
> it will need to configure ``DEV_RX_OFFLOAD_VLAN_STRIP``.
> "
>


Re: [dpdk-dev] [PATCH v2] net/af_packet: fix ignoring full ring on tx

2021-10-05 Thread Tudor Cornea
Hi Ferruh,

I have attempted to narrow down the issue.
I have the following bash script, which computes packet rates on an
interface.

[root@localhost ~]# cat compute-rates.sh
#!/usr/bin/env bash

if [[ ${#} -ne 2 ]]; then
echo "Usage: ${0}  "
exit 1
fi

IFACE_NAME="${1}"
SLEEP_INTERVAL_SECONDS="${2}"
TMP_STATS_FILE="/tmp/netstat"

# Clear Previous stats file
echo "0 0 0 0" > "${TMP_STATS_FILE}"

echo "Press CTRL+C to exit..."

while true; do
export "RxB=0" "RxP=0" "TxB=0" "TxP=0"

# Extract Rx{Bytes,Packets} and Tx{Bytes,Packets} and
# format the output. Individual fields will be exported
export $(\
ifconfig "${IFACE_NAME}" \
| grep 'packets' \
| awk '{print $5, $3}' \
| xargs echo \
| sed -E -e \
"s/([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+)/RxB=\1 RxP=\2
TxB=\3 TxP=\4/")

# Print Packet and Byte Rates
# Format: | Rx Bytes | Rx Packets | Tx Bytes | Tx Packets |

echo "${RxB}" "${RxP}" "${TxB}" "${TxP}" $(cat "${TMP_STATS_FILE}") \
| awk '{print "RxB="$1-$5, "RxP="$2-$6, "TxB="$3-$7, "TxP="$4-$8}'

# Save the new values
echo "${RxB}" "${RxP}" "${TxB}" "${TxP}" > "${TMP_STATS_FILE}"

sleep "${SLEEP_INTERVAL_SECONDS}"

done

On the transmit side, I'm using the engine behind [1] with the af_packet
PMD.

The configuration for the af_packet PMD is the following:
--vdev=net_af_packet0,iface=eth1,blocksz=16384,framesz=8192,framecnt=2048,qpairs=1,qdisc_bypass=0

I'm configuring a Tx rate of 335 packets / second and a packet size of 300
Bytes.
These seem to be the values using which we seem to have better chances of
seeing the problem. I suspect it might also be linked with the af_packet
configuration.

I'm starting traffic using the specified configuration, and in parallel,
running the script that computes the rates as follows:
./compute-rates.sh eth1 0.1

Initially, the packet rates seem steady

RxB=0 RxP=0 TxB=10952 TxP=37
RxB=0 RxP=0 TxB=10656 TxP=36
RxB=0 RxP=0 TxB=10656 TxP=36
RxB=0 RxP=0 TxB=10656 TxP=36
RxB=0 RxP=0 TxB=10952 TxP=37
RxB=0 RxP=0 TxB=10952 TxP=37
RxB=0 RxP=0 TxB=10360 TxP=35
RxB=0 RxP=0 TxB=10952 TxP=37

[...]

After a while, we toggle the interface up / down with a sleep between the
steps. I suspect the length of the sleep might be a variable in the
equation.

ifconfig eth1 down; sleep 7; ifconfig eth1 up


What we see, is that even after the interface is toggled back up, the rates
never seem to recover.

RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=2072 TxP=7
RxB=0 RxP=0 TxB=10360 TxP=35
RxB=0 RxP=0 TxB=10360 TxP=35
RxB=0 RxP=0 TxB=10360 TxP=35
RxB=0 RxP=0 TxB=10360 TxP=35
RxB=0 RxP=0 TxB=10360 TxP=35
RxB=0 RxP=0 TxB=10360 TxP=35
RxB=0 RxP=0 TxB=10360 TxP=35
RxB=0 RxP=0 TxB=10360 TxP=35
RxB=0 RxP=0 TxB=521256 TxP=1761
RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=0 TxP=0

[...]


I've attempted to mirror the same behavior using dpdk-pktgen [2] on a
different machine (Ubuntu 20.04). This time, af_packet runs on top of
a Linux virtio_net interface.

I seem to be getting a  similar behavior. I have used the following
dpdk-pktgen configuration and run-time settings


pktgen \
-l 1-4 \
-n 4 \
--proc-type=primary \
--no-pci \
--no-telemetry \
--no-huge \
-m 512 \

--vdev=net_af_packet0,iface=eth1,blocksz=16384,framesz=8192,framecnt=2048,qpairs=1,qdisc_bypass=0
\
-- \
-P \
-T \
-m "3.0" \
-f themes/black-yellow.theme

set 0 size 300
set 0 rate 0.008
set 0 burst 1
start 0


[1] https://github.com/open-traffic-generator/ixia-c
[2] http://code.dpdk.org/pktgen-dpdk/pktgen-20.11.2/source/INSTALL.md

On Wed, 29 Sept 2021 at 13:03, Tudor Cornea  wrote:

> Hi Ferruh,
>
> What you described above looks like a ring buffer with single producer and
>> single consumer, and producer overwrites the not consumed items.
>
>
> Indeed. This is also my understanding of the bug.
> I am going to try to isolate the issue, and should probably be able to
> come up with a script in a few days.
>
> Our of curiosity, are you using an modified af_packet implementation in
>> kernel
>> for above described usage?
>
>
> We are currently using an Ubuntu-based distro with a 4.15 Linux kernel.
> We don't have any kernel patches for the af_packet implementation to my
> knowledge (probably excepting patches that are back-ported by Ubuntu
> maintainers from newer releases).
>
>
> On Mon, 20 Sept 2021 at 20:44, Ferruh Yigit 
> wrote:
>
>> O

[dpdk-dev] [PATCH] net/af_packet: try to reinsert the stripped vlan tag

2021-08-20 Thread Tudor Cornea
The af_packet pmd driver binds to a raw socket and allows
sending and receiving of packets through the kernel.

Since commit bcc6d47903 [1], the kernel strips the vlan tags early in
__netif_receive_skb_core(), so we receive untagged packets while
running with the af_packet pmd.

Luckily for us, the skb vlan-related fields are still populated from the
stripped vlan tags, so we end up having all the information
that we need in the mbuf.

We would like to have the the vlan tag inside the mbuf.
Let's take a shot at it by trying to reinsert the stripped vlan tag.

As a side note, something similar was done for the netvsc pmd.

[1] 
https://github.com/torvalds/linux/commit/bcc6d47903612c3861201cc3a866fb604f26b8b2

Signed-off-by: Tudor Cornea 
---
 drivers/net/af_packet/rte_eth_af_packet.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/af_packet/rte_eth_af_packet.c 
b/drivers/net/af_packet/rte_eth_af_packet.c
index b73b211..d116583 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -148,6 +148,10 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, 
uint16_t nb_pkts)
if (ppd->tp_status & TP_STATUS_VLAN_VALID) {
mbuf->vlan_tci = ppd->tp_vlan_tci;
mbuf->ol_flags |= (PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED);
+
+   /* the kernel always strips the vlan tag, try to 
reinsert it */
+   if (rte_vlan_insert(&mbuf))
+   PMD_LOG(ERR, "Failed to reinsert vlan tag");
}
 
/* release incoming frame and advance ring buffer */
-- 
2.7.4



[dpdk-dev] [PATCH] net/af_packet: fix ignoring full ring on tx

2021-08-20 Thread Tudor Cornea
The poll call can return POLLERR which is ignored, or it can return
POLLOUT, even if there are no free frames in the mmap-ed area.

We can account for both of these cases by re-checking if the next
frame is empty before writing into it.

We also now eliminate the timestamp status from the frame status.

Signed-off-by: Mihai Pogonaru 
Signed-off-by: Tudor Cornea 
---
 drivers/net/af_packet/rte_eth_af_packet.c | 47 +--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/drivers/net/af_packet/rte_eth_af_packet.c 
b/drivers/net/af_packet/rte_eth_af_packet.c
index b73b211..3845df5 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -167,6 +167,12 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, 
uint16_t nb_pkts)
return num_rx;
 }
 
+static inline __u32 tx_ring_status_remove_ts(volatile __u32 *tp_status)
+{
+   return *tp_status &
+   ~(TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_RAW_HARDWARE);
+}
+
 /*
  * Callback to handle sending packets through a real NIC.
  */
@@ -211,9 +217,41 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, 
uint16_t nb_pkts)
}
}
 
+   /*
+* We must eliminate the timestamp status from the packet
+* status. This should only matter if timestamping is enabled
+* on the socket, but there is a BUG in the kernel which is
+* fixed in newer releases.
+
+* For interfaces of type 'veth', the sent skb is forwarded
+* to the peer and back into the network stack which timestamps
+* it on the RX path if timestamping is enabled globally
+* (which happens if any socket enables timestamping).
+
+* When the skb is destructed, tpacket_destruct_skb() is called
+* and it calls __packet_set_timestamp() which doesn't check
+* the flags on the socket and returns the timestamp if it is
+* set in the skb (and for veth it is, as mentioned above).
+*/
+
/* point at the next incoming frame */
-   if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
-   (poll(&pfd, 1, -1) < 0))
+   if ((tx_ring_status_remove_ts(&ppd->tp_status)
+   != TP_STATUS_AVAILABLE) && (poll(&pfd, 1, -1) < 0))
+   break;
+
+   /*
+* Poll can return POLLERR if the interface is down or POLLOUT,
+* even if there are no extra buffers available.
+* This happens, because packet_poll() calls datagram_poll()
+* which checks the space left in the socket buffer and in the
+* case of packet_mmap the default socket buffer length
+* doesn't match the requested size for the tx_ring so there
+* is always space left in socket buffer, which doesn't seem
+* to be correlated to the requested size for the tx_ring
+* in packet_mmap.
+*/
+   if (tx_ring_status_remove_ts(&ppd->tp_status)
+   != TP_STATUS_AVAILABLE)
break;
 
/* copy the tx frame data */
@@ -242,6 +280,11 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, 
uint16_t nb_pkts)
rte_pktmbuf_free(mbuf);
}
 
+   /*
+* We might have to ignore a few more errnos here since the packets
+* remain in the mmap-ed queue and will be sent later, presumably.
+*/
+
/* kick-off transmits */
if (sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 &&
errno != ENOBUFS && errno != EAGAIN) {
-- 
2.7.4



Re: [dpdk-dev] [PATCH] net/af_packet: try to reinsert the stripped vlan tag

2021-09-01 Thread Tudor Cornea
Indeed, the vlan insertion could be a costly operation. We should probably
do it only if the user specifically asks to have the vlan tag in the packet.
Otherwise, af_packet PMD users might pay a price in terms of performance
for something they didn't ask for.

I was thinking of avoiding having to change the application in order to
re-insert the vlan tag.
Doing this operation inside the PMD driver seemed like a good fit.

Looking at the netvsc driver (drivers/net/netvsc), the vlan insertion is
guarded by a check to hv->vlan_strip

  if (!hv->vlan_strip && rte_vlan_insert(&m)) {

hv->vlan_strip seems to be initialized in hn_dev_configure() in the
following way

 hv->vlan_strip = !!(rxmode->offloads & DEV_RX_OFFLOAD_VLAN_STRIP);

while 'hv' seems to be stored in rte_eth_dev->data->dev_private

I am thinking of doing something similar for the af_packet PMD.
The 'pmd_internals' structure could potentially hold a field, say
vlan_strip', which could be initialized if the application enables the
DEV_RX_OFFLOAD_VLAN_STRIP in rxmode->offloads

This way, I'm thinking that the application could potentially control the
effect of vlan stripping for the af_packet PMD, in an uniform way, similar
to other PMDs.
Would this be considered an acceptable solution ?




On Tue, 31 Aug 2021 at 18:31, Ferruh Yigit  wrote:

> On 8/20/2021 1:46 PM, Tudor Cornea wrote:
> > The af_packet pmd driver binds to a raw socket and allows
> > sending and receiving of packets through the kernel.
> >
> > Since commit bcc6d47903 [1], the kernel strips the vlan tags early in
> > __netif_receive_skb_core(), so we receive untagged packets while
> > running with the af_packet pmd.
> >
> > Luckily for us, the skb vlan-related fields are still populated from the
> > stripped vlan tags, so we end up having all the information
> > that we need in the mbuf.
> >
> > We would like to have the the vlan tag inside the mbuf.
> > Let's take a shot at it by trying to reinsert the stripped vlan tag.
> >
>
> PMD already sets 'mbuf->vlan_tci' and 'PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED'
> flags, so application can be aware of the vlan tag and can consume it.
>
> Inserting the vlan tag back to packet is costly, what is the motivation to
> do so?
>
> > As a side note, something similar was done for the netvsc pmd.
> >
> > [1]
> https://github.com/torvalds/linux/commit/bcc6d47903612c3861201cc3a866fb604f26b8b2
> >
> > Signed-off-by: Tudor Cornea 
> > ---
> >  drivers/net/af_packet/rte_eth_af_packet.c | 4 
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
> b/drivers/net/af_packet/rte_eth_af_packet.c
> > index b73b211..d116583 100644
> > --- a/drivers/net/af_packet/rte_eth_af_packet.c
> > +++ b/drivers/net/af_packet/rte_eth_af_packet.c
> > @@ -148,6 +148,10 @@ eth_af_packet_rx(void *queue, struct rte_mbuf
> **bufs, uint16_t nb_pkts)
> >   if (ppd->tp_status & TP_STATUS_VLAN_VALID) {
> >   mbuf->vlan_tci = ppd->tp_vlan_tci;
> >   mbuf->ol_flags |= (PKT_RX_VLAN |
> PKT_RX_VLAN_STRIPPED);
> > +
> > + /* the kernel always strips the vlan tag, try to
> reinsert it */
> > + if (rte_vlan_insert(&mbuf))
> > + PMD_LOG(ERR, "Failed to reinsert vlan
> tag");
> >   }
> >
> >   /* release incoming frame and advance ring buffer */
> >
>
>


Re: [dpdk-dev] [PATCH] net/af_packet: try to reinsert the stripped vlan tag

2021-09-03 Thread Tudor Cornea
Thanks,

I hope I understood correctly the above comments.

I'm thinking of adding DEV_RX_OFFLOAD_VLAN_STRIP to
dev_info->rx_offload_capa

eth_dev_info()
+dev_info->rx_offload_capa = DEV_RX_OFFLOAD_VLAN_STRIP;

then populating pmd_internals->vlan_strip with the vlan stripping
option that the application requests

eth_dev_configure()
+internals->vlan_strip = !!(rxmode->offloads & DEV_RX_OFFLOAD_VLAN_STRIP);

>From the internals structure, we could populate a newly-added field in the
pkt_rx_queue structure 'vlan_strip'

eth_rx_queue_setup()
+pkt_q->vlan_strip = internals->vlan_strip;

And attempt to re-insert the vlan only if required in eth_af_packet_rx

eth_af_packet_rx()
+if (!pkt_q->vlan_strip && rte_vlan_insert(&mbuf))

I've attempted a simple benchmark to understand if the change could cause a
sizable performance hit.

Setup:
Tx: vmxnet3 PMD
Rx: af_packet (running on top of a vmxnet3 interface)
Packet size :68 (packet contains a vlan tag)

Rates:
Tx - 1.419 Mpps
Rx (without vlan insertion) -   1227636 pps
Rx (with vlan insertion) - 1220081 pps

I don't seem to have a large degradation in terms of packet rate at first
glance, but maybe the experiment could be repeated on different setups as
I'm using a virtual environment.

Would it be reasonable if I send v2 of the patch for review, with the above
changes ?


On Thu, 2 Sept 2021 at 13:49, Ferruh Yigit  wrote:

> On 9/1/2021 10:34 PM, Stephen Hemminger wrote:
> > On Wed, 1 Sep 2021 22:07:22 +0300
> > Tudor Cornea  wrote:
> >
> >> Indeed, the vlan insertion could be a costly operation. We should
> probably
> >> do it only if the user specifically asks to have the vlan tag in the
> packet.
> >> Otherwise, af_packet PMD users might pay a price in terms of performance
> >> for something they didn't ask for.
> >>
> >> I was thinking of avoiding having to change the application in order to
> >> re-insert the vlan tag.
> >> Doing this operation inside the PMD driver seemed like a good fit.
> >>
> >> Looking at the netvsc driver (drivers/net/netvsc), the vlan insertion is
> >> guarded by a check to hv->vlan_strip
> >>
> >>   if (!hv->vlan_strip && rte_vlan_insert(&m)) {
> >>
> >> hv->vlan_strip seems to be initialized in hn_dev_configure() in the
> >> following way
> >>
> >>  hv->vlan_strip = !!(rxmode->offloads & DEV_RX_OFFLOAD_VLAN_STRIP);
> >>
> >> while 'hv' seems to be stored in rte_eth_dev->data->dev_private
> >>
> >> I am thinking of doing something similar for the af_packet PMD.
> >> The 'pmd_internals' structure could potentially hold a field, say
> >> vlan_strip', which could be initialized if the application enables the
> >> DEV_RX_OFFLOAD_VLAN_STRIP in rxmode->offloads
> >>
> >> This way, I'm thinking that the application could potentially control
> the
> >> effect of vlan stripping for the af_packet PMD, in an uniform way,
> similar
> >> to other PMDs.
> >> Would this be considered an acceptable solution ?
> >>
> >>
> >>
> >>
> >> On Tue, 31 Aug 2021 at 18:31, Ferruh Yigit 
> wrote:
> >>
> >>> On 8/20/2021 1:46 PM, Tudor Cornea wrote:
> >>>> The af_packet pmd driver binds to a raw socket and allows
> >>>> sending and receiving of packets through the kernel.
> >>>>
> >>>> Since commit bcc6d47903 [1], the kernel strips the vlan tags early in
> >>>> __netif_receive_skb_core(), so we receive untagged packets while
> >>>> running with the af_packet pmd.
> >>>>
> >>>> Luckily for us, the skb vlan-related fields are still populated from
> the
> >>>> stripped vlan tags, so we end up having all the information
> >>>> that we need in the mbuf.
> >>>>
> >>>> We would like to have the the vlan tag inside the mbuf.
> >>>> Let's take a shot at it by trying to reinsert the stripped vlan tag.
> >>>>
> >>>
> >>> PMD already sets 'mbuf->vlan_tci' and 'PKT_RX_VLAN |
> PKT_RX_VLAN_STRIPPED'
> >>> flags, so application can be aware of the vlan tag and can consume it.
> >>>
> >>> Inserting the vlan tag back to packet is costly, what is the
> motivation to
> >>> do so?
> >>>
> >>>> As a side note, something similar was done for the netvsc pmd.
> >>>>
> >>>> [1]
> >>>
> https:

Re: [dpdk-dev] [PATCH] net/af_packet: fix ignoring full ring on tx

2021-09-06 Thread Tudor Cornea
Hi Ferruh,

Would you mind separate timestamp status fix to its own patch? I think
> better to
> fix 'ignoring full Tx ring' first, to not make it dependent to timestamp
> patch.


Agreed. There are two issues solved by this patch. We will break it in two
different patches.

I can see 'TP_STATUS_TS_SYS_HARDWARE' is deprecated, and I assume in the
> kernel
> versions the bug exists, this flag is not set, but can you please confirm?


And does it only seen with veth, if so I wonder if we can ignore it, not
> sure
> how common to use af_packet PMD over veth interface, do you have this
> usecase?


We've seen the timestamping issue only when running af_packet over
veth interfaces. We have a particular use-case internally, in which we need
to run inside a Kubernetes cluster.
We've found the following resources [1] , [2] related to this behavior in
the kernel.

We believe that issue #2 (the ring getting full), can theoretically occur
on any type of NIC.
We managed to reproduce the bursty behavior on af_packet PMD over vmxnet3
interface, by Tx-ing packets at a low rate (e.g ~340 pps), and toggling the
interface on / off
ifconfig $iface_name down; sleep 10; ifconfig $iface_name up

We will attempt to give more context on the issue below, about what we
think happens:
- we have a 2048 queue shared between the kernel and the dpdk socket,
there's an index the queue in both the kernel and the dpdk driver
- the dpdk driver writes a packet or a burst, advances its idx and tells
the kernel to send the packets via a call to sendto() and the kernel sends
the packets and advances its idx
- once the interface is down the kernel can no longer send packets, but it
doesn't drop them, it just doesn't advance its idx
- for each packet there is header and in the header there is a status
integer which, among others, indicates the owner of the packet: the
userspace or the kernel - the userspace (dpdk driver) sets the status as
owned by the kernel when it adds another packet ; the kernel sets the
status back to owned by the userspace once it sends a packet
- the dpdk driver was ignoring this status bit and,  even after the queue
was full, it would continue to put packets in the queue - its idx would be
"after" that of the kernel
- once the interface is brought up, the kernel would send all the packets
in the queue (since they have the status of being owned by the kernel) on
the next call to sendto() and the idx would be back to where it was before
the interface was brought up (let's call it k1)
- the dpdk driver idx into the queue would point somewhere in the queue
(let's call it d1) and would continue to add packets at that point, but the
kernel wouldn't send any packet anymore since there is now a gap of packets
owned by the userspace between the kernel index (k1) and the dpdk driver
idx (d1)
- the dpdk idx would eventually reach k1 and packets would be transferred
at a normal rate until both the dpdk idx and the kernel idx would reach d1
again
- between d1 and k1 there are only packets with the status as owned by the
kernel - which where added by the dpdk driver while its index was between
d1 and k1 ; thus the kernel would burst all the packets till k1, while the
dpdk idx is at d1
- the cycle repeats

If a new traffic config comes (in our application) while this cycle is
happening, it could be that some of the packets of the old config are still
in queue (between d1 and k1) and will be bursted when the dpdk and kernel
idx reach d1 ; this would explain seeing packets from an old config, but
only in the first 2048 packets (which is the queue size)


[1] https://www.spinics.net/lists/kernel/msg3959391.html
[2] https://www.spinics.net/lists/netdev/msg739372.html

On Wed, 1 Sept 2021 at 19:34, Ferruh Yigit  wrote:

> On 8/20/2021 2:39 PM, Tudor Cornea wrote:
> > The poll call can return POLLERR which is ignored, or it can return
> > POLLOUT, even if there are no free frames in the mmap-ed area.
> >
> > We can account for both of these cases by re-checking if the next
> > frame is empty before writing into it.
> >
> > We also now eliminate the timestamp status from the frame status.
> >
>
> Hi Tudor,
>
> Would you mind separate timestamp status fix to its own patch? I think
> better to
> fix 'ignoring full Tx ring' first, to not make it dependent to timestamp
> patch.
>
> > Signed-off-by: Mihai Pogonaru 
> > Signed-off-by: Tudor Cornea 
> > ---
> >  drivers/net/af_packet/rte_eth_af_packet.c | 47
> +--
> >  1 file changed, 45 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
> b/drivers/net/af_packet/rte_eth_af_packet.c
> > index b73b211..3845df5 100644
> > --- a/drivers/net/af_packet/rte_eth_af_packet.c
> > +++ b/drivers

[dpdk-dev] [PATCH v2] net/af_packet: reinsert the stripped vlan tag

2021-09-08 Thread Tudor Cornea
The af_packet pmd driver binds to a raw socket and allows
sending and receiving of packets through the kernel.

Since commit [1], the kernel strips the vlan tags early in
__netif_receive_skb_core(), so we receive untagged packets while
running with the af_packet pmd.

Luckily for us, the skb vlan-related fields are still populated from the
stripped vlan tags, so we end up having all the information
that we need in the mbuf.

Having the PMD driver support DEV_RX_OFFLOAD_VLAN_STRIP allows the
application to control the desired vlan stripping behavior.

[1] 
https://github.com/torvalds/linux/commit/bcc6d47903612c3861201cc3a866fb604f26b8b2

Signed-off-by: Tudor Cornea 

---
v2:
* Add DEV_RX_OFFLOAD_VLAN_STRIP to rxmode->offloads
---
 drivers/net/af_packet/rte_eth_af_packet.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/net/af_packet/rte_eth_af_packet.c 
b/drivers/net/af_packet/rte_eth_af_packet.c
index b73b211..5ed9dd6 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -48,6 +48,7 @@ struct pkt_rx_queue {
 
struct rte_mempool *mb_pool;
uint16_t in_port;
+   uint8_t vlan_strip;
 
volatile unsigned long rx_pkts;
volatile unsigned long rx_bytes;
@@ -78,6 +79,7 @@ struct pmd_internals {
 
struct pkt_rx_queue *rx_queue;
struct pkt_tx_queue *tx_queue;
+   uint8_t vlan_strip;
 };
 
 static const char *valid_arguments[] = {
@@ -148,6 +150,9 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, 
uint16_t nb_pkts)
if (ppd->tp_status & TP_STATUS_VLAN_VALID) {
mbuf->vlan_tci = ppd->tp_vlan_tci;
mbuf->ol_flags |= (PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED);
+
+   if (!pkt_q->vlan_strip && rte_vlan_insert(&mbuf))
+   PMD_LOG(ERR, "Failed to reinsert VLAN tag");
}
 
/* release incoming frame and advance ring buffer */
@@ -302,6 +307,11 @@ eth_dev_stop(struct rte_eth_dev *dev)
 static int
 eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
 {
+   struct rte_eth_conf *dev_conf = &dev->data->dev_conf;
+   const struct rte_eth_rxmode *rxmode = &dev_conf->rxmode;
+   struct pmd_internals *internals = dev->data->dev_private;
+
+   internals->vlan_strip = !!(rxmode->offloads & 
DEV_RX_OFFLOAD_VLAN_STRIP);
return 0;
 }
 
@@ -318,6 +328,7 @@ eth_dev_info(struct rte_eth_dev *dev, struct 
rte_eth_dev_info *dev_info)
dev_info->min_rx_bufsize = 0;
dev_info->tx_offload_capa = DEV_TX_OFFLOAD_MULTI_SEGS |
DEV_TX_OFFLOAD_VLAN_INSERT;
+   dev_info->rx_offload_capa = DEV_RX_OFFLOAD_VLAN_STRIP;
 
return 0;
 }
@@ -448,6 +459,7 @@ eth_rx_queue_setup(struct rte_eth_dev *dev,
 
dev->data->rx_queues[rx_queue_id] = pkt_q;
pkt_q->in_port = dev->data->port_id;
+   pkt_q->vlan_strip = internals->vlan_strip;
 
return 0;
 }
-- 
2.7.4



[dpdk-dev] [PATCH] net/ice: fix integer overflow when computing max_pkt_len

2021-06-15 Thread Tudor Cornea
Greetings,

Please review the following patch for the dpdk-next-net-intel branch.

The len variable, used in the computation of max_pkt_len could overflow,
if used to store the result of the following computation:

ICE_SUPPORT_CHAIN_NUM * rxq->rx_buf_len

Since, we could define the mbuf size to have a large value (i.e 13312),
and ICE_SUPPORT_CHAIN_NUM is defined as 5, the computation mentioned
above could potentially result in a value which might be bigger than
MAX_USHORT.

The result will be that Jumbo Frames will not work properly

Signed-off-by: Tudor Cornea 
---
 drivers/net/ice/ice_dcf_ethdev.c | 7 ---
 drivers/net/ice/ice_rxtx.c   | 6 +++---
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ice/ice_dcf_ethdev.c
b/drivers/net/ice/ice_dcf_ethdev.c
index b937cbb..f73dc80 100644
--- a/drivers/net/ice/ice_dcf_ethdev.c
+++ b/drivers/net/ice/ice_dcf_ethdev.c
@@ -54,13 +54,14 @@ ice_dcf_init_rxq(struct rte_eth_dev *dev, struct
ice_rx_queue *rxq)
struct ice_dcf_adapter *dcf_ad = dev->data->dev_private;
struct rte_eth_dev_data *dev_data = dev->data;
struct iavf_hw *hw = &dcf_ad->real_hw.avf;
-   uint16_t buf_size, max_pkt_len, len;
+   uint16_t buf_size, max_pkt_len;

buf_size = rte_pktmbuf_data_room_size(rxq->mp) -
RTE_PKTMBUF_HEADROOM;
rxq->rx_hdr_len = 0;
rxq->rx_buf_len = RTE_ALIGN(buf_size, (1 << ICE_RLAN_CTX_DBUF_S));
-   len = ICE_SUPPORT_CHAIN_NUM * rxq->rx_buf_len;
-   max_pkt_len = RTE_MIN(len,
dev->data->dev_conf.rxmode.max_rx_pkt_len);
+   max_pkt_len = RTE_MIN((uint32_t)
+ ICE_SUPPORT_CHAIN_NUM * rxq->rx_buf_len,
+ dev->data->dev_conf.rxmode.max_rx_pkt_len);

/* Check if the jumbo frame and maximum packet length are set
 * correctly.
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index fc9bb5a..20352b0 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -258,7 +258,7 @@ ice_program_hw_rx_queue(struct ice_rx_queue *rxq)
struct rte_eth_dev_data *dev_data = rxq->vsi->adapter->pf.dev_data;
struct ice_rlan_ctx rx_ctx;
enum ice_status err;
-   uint16_t buf_size, len;
+   uint16_t buf_size;
struct rte_eth_rxmode *rxmode = &dev_data->dev_conf.rxmode;
uint32_t rxdid = ICE_RXDID_COMMS_OVS;
uint32_t regval;
@@ -268,8 +268,8 @@ ice_program_hw_rx_queue(struct ice_rx_queue *rxq)
  RTE_PKTMBUF_HEADROOM);
rxq->rx_hdr_len = 0;
rxq->rx_buf_len = RTE_ALIGN(buf_size, (1 << ICE_RLAN_CTX_DBUF_S));
-   len = ICE_SUPPORT_CHAIN_NUM * rxq->rx_buf_len;
-   rxq->max_pkt_len = RTE_MIN(len,
+   rxq->max_pkt_len = RTE_MIN((uint32_t)
+  ICE_SUPPORT_CHAIN_NUM * rxq->rx_buf_len,

 dev_data->dev_conf.rxmode.max_rx_pkt_len);

if (rxmode->offloads & DEV_RX_OFFLOAD_JUMBO_FRAME) {
-- 
2.7.4


[dpdk-dev] [PATCH] net/ice: fix integer overflow when computing max_pkt_len

2021-06-15 Thread Tudor Cornea
The len variable, used in the computation of max_pkt_len could overflow,
if used to store the result of the following computation:

ICE_SUPPORT_CHAIN_NUM * rxq->rx_buf_len

Since, we could define the mbuf size to have a large value (i.e 13312),
and ICE_SUPPORT_CHAIN_NUM is defined as 5, the computation mentioned
above could potentially result in a value which might be bigger than
MAX_USHORT.

The result will be that Jumbo Frames will not work properly

Signed-off-by: Tudor Cornea 
---
 drivers/net/ice/ice_dcf_ethdev.c | 7 ---
 drivers/net/ice/ice_rxtx.c   | 6 +++---
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ice/ice_dcf_ethdev.c b/drivers/net/ice/ice_dcf_ethdev.c
index b937cbb..f73dc80 100644
--- a/drivers/net/ice/ice_dcf_ethdev.c
+++ b/drivers/net/ice/ice_dcf_ethdev.c
@@ -54,13 +54,14 @@ ice_dcf_init_rxq(struct rte_eth_dev *dev, struct 
ice_rx_queue *rxq)
struct ice_dcf_adapter *dcf_ad = dev->data->dev_private;
struct rte_eth_dev_data *dev_data = dev->data;
struct iavf_hw *hw = &dcf_ad->real_hw.avf;
-   uint16_t buf_size, max_pkt_len, len;
+   uint16_t buf_size, max_pkt_len;
 
buf_size = rte_pktmbuf_data_room_size(rxq->mp) - RTE_PKTMBUF_HEADROOM;
rxq->rx_hdr_len = 0;
rxq->rx_buf_len = RTE_ALIGN(buf_size, (1 << ICE_RLAN_CTX_DBUF_S));
-   len = ICE_SUPPORT_CHAIN_NUM * rxq->rx_buf_len;
-   max_pkt_len = RTE_MIN(len, dev->data->dev_conf.rxmode.max_rx_pkt_len);
+   max_pkt_len = RTE_MIN((uint32_t)
+ ICE_SUPPORT_CHAIN_NUM * rxq->rx_buf_len,
+ dev->data->dev_conf.rxmode.max_rx_pkt_len);
 
/* Check if the jumbo frame and maximum packet length are set
 * correctly.
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index fc9bb5a..20352b0 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -258,7 +258,7 @@ ice_program_hw_rx_queue(struct ice_rx_queue *rxq)
struct rte_eth_dev_data *dev_data = rxq->vsi->adapter->pf.dev_data;
struct ice_rlan_ctx rx_ctx;
enum ice_status err;
-   uint16_t buf_size, len;
+   uint16_t buf_size;
struct rte_eth_rxmode *rxmode = &dev_data->dev_conf.rxmode;
uint32_t rxdid = ICE_RXDID_COMMS_OVS;
uint32_t regval;
@@ -268,8 +268,8 @@ ice_program_hw_rx_queue(struct ice_rx_queue *rxq)
  RTE_PKTMBUF_HEADROOM);
rxq->rx_hdr_len = 0;
rxq->rx_buf_len = RTE_ALIGN(buf_size, (1 << ICE_RLAN_CTX_DBUF_S));
-   len = ICE_SUPPORT_CHAIN_NUM * rxq->rx_buf_len;
-   rxq->max_pkt_len = RTE_MIN(len,
+   rxq->max_pkt_len = RTE_MIN((uint32_t)
+  ICE_SUPPORT_CHAIN_NUM * rxq->rx_buf_len,
   dev_data->dev_conf.rxmode.max_rx_pkt_len);
 
if (rxmode->offloads & DEV_RX_OFFLOAD_JUMBO_FRAME) {
-- 
2.7.4



[PATCH v2] net/af_packet: allow changing fanout mode

2025-01-06 Thread Tudor Cornea
This allows us to control the algorithm used to spread traffic between
sockets, adding more fine grained control. If the user does not
specify a fanout mode, the PMD driver will default to
PACKET_FANOUT_HASH.

Signed-off-by: Tudor Cornea 

---
v2:
* Renamed the patch
* Replaced packet_fanout argument with fanout_mode, which allows
more fine grained control
---
 doc/guides/nics/af_packet.rst |  4 +-
 drivers/net/af_packet/rte_eth_af_packet.c | 92 ---
 2 files changed, 83 insertions(+), 13 deletions(-)

diff --git a/doc/guides/nics/af_packet.rst b/doc/guides/nics/af_packet.rst
index a343d3a961..3443f95004 100644
--- a/doc/guides/nics/af_packet.rst
+++ b/doc/guides/nics/af_packet.rst
@@ -25,6 +25,8 @@ Some of these, in turn, will be used to configure the 
PACKET_MMAP settings.
 *   ``qpairs`` - number of Rx and Tx queues (optional, default 1);
 *   ``qdisc_bypass`` - set PACKET_QDISC_BYPASS option in AF_PACKET (optional,
 disabled by default);
+*   ``fanout_mode`` - set fanout algorithm. Possible choices: 
hash,lb,cpu,rollover,rnd,qm (optional,
+default hash);
 *   ``blocksz`` - PACKET_MMAP block size (optional, default 4096);
 *   ``framesz`` - PACKET_MMAP frame size (optional, default 2048B; Note: 
multiple
 of 16B);
@@ -64,7 +66,7 @@ framecnt=512):
 
 .. code-block:: console
 
-
--vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0
+
--vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0,fanout_mode=hash
 
 Features and Limitations
 
diff --git a/drivers/net/af_packet/rte_eth_af_packet.c 
b/drivers/net/af_packet/rte_eth_af_packet.c
index ceb8d9356a..8449975384 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -36,6 +36,7 @@
 #define ETH_AF_PACKET_FRAMESIZE_ARG"framesz"
 #define ETH_AF_PACKET_FRAMECOUNT_ARG   "framecnt"
 #define ETH_AF_PACKET_QDISC_BYPASS_ARG "qdisc_bypass"
+#define ETH_AF_PACKET_FANOUT_MODE_ARG  "fanout_mode"
 
 #define DFLT_FRAME_SIZE(1 << 11)
 #define DFLT_FRAME_COUNT   (1 << 9)
@@ -96,6 +97,7 @@ static const char *valid_arguments[] = {
ETH_AF_PACKET_FRAMESIZE_ARG,
ETH_AF_PACKET_FRAMECOUNT_ARG,
ETH_AF_PACKET_QDISC_BYPASS_ARG,
+   ETH_AF_PACKET_FANOUT_MODE_ARG,
NULL
 };
 
@@ -700,6 +702,61 @@ open_packet_iface(const char *key __rte_unused,
return 0;
 }
 
+#if defined(PACKET_FANOUT)
+#define PACKET_FANOUT_INVALID -1
+
+static int
+get_fanout_group_id(int if_index)
+{
+   return (getpid() ^ if_index) & 0x;
+}
+
+static int
+get_fanout_mode(const char *fanout_mode)
+{
+   int mode = PACKET_FANOUT_FLAG_DEFRAG;
+
+#if defined(PACKET_FANOUT_FLAG_ROLLOVER)
+   mode |= PACKET_FANOUT_FLAG_ROLLOVER;
+#endif
+
+   if (!fanout_mode) {
+   /* Default */
+   mode |= PACKET_FANOUT_HASH;
+   } else if (!strcmp(fanout_mode, "hash")) {
+   mode |= PACKET_FANOUT_HASH;
+   } else if (!strcmp(fanout_mode, "lb")) {
+   mode |= PACKET_FANOUT_LB;
+   } else if (!strcmp(fanout_mode, "cpu")) {
+   mode |= PACKET_FANOUT_CPU;
+   } else if (!strcmp(fanout_mode, "rollover")) {
+   mode |= PACKET_FANOUT_ROLLOVER;
+   } else if (!strcmp(fanout_mode, "rnd")) {
+   mode |= PACKET_FANOUT_RND;
+   } else if (!strcmp(fanout_mode, "qm")) {
+   mode |= PACKET_FANOUT_QM;
+   } else {
+   /* Invalid Fanout Mode */
+   mode = PACKET_FANOUT_INVALID;
+   }
+
+   return mode;
+}
+
+static int
+get_fanout(const char *fanout_mode, int if_index)
+{
+   int group_id = get_fanout_group_id(if_index);
+   int mode = get_fanout_mode(fanout_mode);
+   int fanout = PACKET_FANOUT_INVALID;
+
+   if (mode != PACKET_FANOUT_INVALID)
+   fanout = group_id | (mode << 16);
+
+   return fanout;
+}
+#endif
+
 static int
 rte_pmd_init_internals(struct rte_vdev_device *dev,
const int sockfd,
@@ -709,6 +766,7 @@ rte_pmd_init_internals(struct rte_vdev_device *dev,
unsigned int framesize,
unsigned int framecnt,
   unsigned int qdisc_bypass,
+  const char *fanout_mode,
struct pmd_internals **internals,
struct rte_eth_dev **eth_dev,
struct rte_kvargs *kvlist)
@@ -810,11 +868,12 @@ rte_pmd_init_internals(struct rte_vdev_device *dev,
sockaddr.sll_ifindex = (*internals)->if_index;
 
 #if defined(PACKET_FANOUT)
-   fanout_arg = (getpid() ^ (*internals)->if_index) & 0x;
-   fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG) << 16;

Re: [PATCH] net/af_packet: allow disabling packet fanout

2025-01-06 Thread Tudor Cornea
> This is a great idea. Would introducing a new devarg, (e.g
> 'fanout_mode') be the proper way to allow the application to customize
> fanout in more detail ?
>
> --vdev=net_af_packet0,iface=eth1,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,fanout_mode=[fanout_hash|fanout_cpu|fanout_rnd|fanout_qm]

I'll send a new version of the patch which will add some additional
customization


[PATCH] net/af_packet: fix socket close on device stop

2025-02-04 Thread Tudor Cornea
Currently, if we call rte_eth_dev_stop(), the sockets are closed.
If we attempt to start the port again, socket related operations
will not work correctly.

This can be alleviated by closing the socket at the same place in
which we currently free the memory, in eth_dev_close().

If an application calls rte_eth_dev_stop() on a port managed
by the af_packet PMD, the port becomes unusable. This is in contrast
with ports managed by other drivers (e.g virtio).

I also managed to reproduce the issue using testpmd.

sudo ip link add test-veth0 type veth peer name test-veth1

sudo ip link set test-veth0 up
sudo ip link set test-veth1 up

AF_PACKET_ARGS=\
"blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0"

sudo ./dpdk-testpmd \
-l 0-3 \
-m 1024 \
--no-huge \
--no-shconf \
--no-pci \
--vdev=net_af_packet0,iface=test-veth0,${AF_PACKET_ARGS} \
--vdev=net_af_packet1,iface=test-veth1,${AF_PACKET_ARGS} \
-- \
-i

testpmd> start tx_first

Forwarding will start, and we will see traffic on the interfaces.

testpmd> stop
testpmd> port stop 0
Stopping ports...
Checking link statuses...
Done
testpmd> port stop 1
Stopping ports...
Checking link statuses...
Done

testpmd> port start 0
AFPACKET: eth_dev_macaddr_set(): receive socket not found
Port 0: CA:65:81:63:81:B2
Checking link statuses...
Done
testpmd> port start 1
AFPACKET: eth_dev_macaddr_set(): receive socket not found
Port 1: CA:12:D0:BE:93:3F
Checking link statuses...
Done

testpmd> start tx_first

When we start forwarding again, we can see that there is no traffic
on the interfaces. This does not happen when testing with other PMD
drivers (e.g virtio).

With the patch, the port should re-initialize correctly.

testpmd> port start 0
Port 0: CA:65:81:63:81:B2
Checking link statuses...
Done

Fixes: 364e08f2bbc0 ("af_packet: add PMD for AF_PACKET-based virtual devices")

Signed-off-by: Tudor Cornea 
---
 drivers/net/af_packet/rte_eth_af_packet.c | 30 +++
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/drivers/net/af_packet/rte_eth_af_packet.c 
b/drivers/net/af_packet/rte_eth_af_packet.c
index ceb8d9356a..efae15a493 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -357,27 +357,12 @@ static int
 eth_dev_stop(struct rte_eth_dev *dev)
 {
unsigned i;
-   int sockfd;
struct pmd_internals *internals = dev->data->dev_private;
 
for (i = 0; i < internals->nb_queues; i++) {
-   sockfd = internals->rx_queue[i].sockfd;
-   if (sockfd != -1)
-   close(sockfd);
-
-   /* Prevent use after free in case tx fd == rx fd */
-   if (sockfd != internals->tx_queue[i].sockfd) {
-   sockfd = internals->tx_queue[i].sockfd;
-   if (sockfd != -1)
-   close(sockfd);
-   }
-
-   internals->rx_queue[i].sockfd = -1;
-   internals->tx_queue[i].sockfd = -1;
dev->data->rx_queue_state[i] = RTE_ETH_QUEUE_STATE_STOPPED;
dev->data->tx_queue_state[i] = RTE_ETH_QUEUE_STATE_STOPPED;
}
-
dev->data->dev_link.link_status = RTE_ETH_LINK_DOWN;
return 0;
 }
@@ -474,6 +459,7 @@ eth_dev_close(struct rte_eth_dev *dev)
struct pmd_internals *internals;
struct tpacket_req *req;
unsigned int q;
+   int sockfd;
 
if (rte_eal_process_type() != RTE_PROC_PRIMARY)
return 0;
@@ -484,6 +470,20 @@ eth_dev_close(struct rte_eth_dev *dev)
internals = dev->data->dev_private;
req = &internals->req;
for (q = 0; q < internals->nb_queues; q++) {
+   sockfd = internals->rx_queue[q].sockfd;
+   if (sockfd != -1)
+   close(sockfd);
+
+   /* Prevent use after free in case tx fd == rx fd */
+   if (sockfd != internals->tx_queue[q].sockfd) {
+   sockfd = internals->tx_queue[q].sockfd;
+   if (sockfd != -1)
+   close(sockfd);
+   }
+
+   internals->rx_queue[q].sockfd = -1;
+   internals->tx_queue[q].sockfd = -1;
+
munmap(internals->rx_queue[q].map,
2 * req->tp_block_size * req->tp_block_nr);
rte_free(internals->rx_queue[q].rd);
-- 
2.34.1



Re: [PATCH] net/af_packet: fix socket close on device stop

2025-02-10 Thread Tudor Cornea
> Applied to next-net
> Should this go to stable as well

I think it would probably make sense to have it in stable.


[PATCH] net/af_packet: allow disabling packet fanout

2024-12-12 Thread Tudor Cornea
This allows us to control whether the PMD will attempt to use
the PACKET_FANOUT socket option, and allows the binary compiled
against newer kernel headers to run on an older kernel, which
lacks support for it.

Signed-off-by: Tudor Cornea 
---
 doc/guides/nics/af_packet.rst |  4 ++-
 drivers/net/af_packet/rte_eth_af_packet.c | 32 ++-
 2 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/doc/guides/nics/af_packet.rst b/doc/guides/nics/af_packet.rst
index a343d3a961..42837031de 100644
--- a/doc/guides/nics/af_packet.rst
+++ b/doc/guides/nics/af_packet.rst
@@ -25,6 +25,8 @@ Some of these, in turn, will be used to configure the 
PACKET_MMAP settings.
 *   ``qpairs`` - number of Rx and Tx queues (optional, default 1);
 *   ``qdisc_bypass`` - set PACKET_QDISC_BYPASS option in AF_PACKET (optional,
 disabled by default);
+*   ``packet_fanout`` - set PACKET_FANOUT option in AF_PACKET (optional,
+enabled by default);
 *   ``blocksz`` - PACKET_MMAP block size (optional, default 4096);
 *   ``framesz`` - PACKET_MMAP frame size (optional, default 2048B; Note: 
multiple
 of 16B);
@@ -64,7 +66,7 @@ framecnt=512):
 
 .. code-block:: console
 
-
--vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0
+
--vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0,packet_fanout=0
 
 Features and Limitations
 
diff --git a/drivers/net/af_packet/rte_eth_af_packet.c 
b/drivers/net/af_packet/rte_eth_af_packet.c
index ceb8d9356a..521dbef317 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -36,6 +36,7 @@
 #define ETH_AF_PACKET_FRAMESIZE_ARG"framesz"
 #define ETH_AF_PACKET_FRAMECOUNT_ARG   "framecnt"
 #define ETH_AF_PACKET_QDISC_BYPASS_ARG "qdisc_bypass"
+#define ETH_AF_PACKET_PACKET_FANOUT_ARG"packet_fanout"
 
 #define DFLT_FRAME_SIZE(1 << 11)
 #define DFLT_FRAME_COUNT   (1 << 9)
@@ -96,6 +97,7 @@ static const char *valid_arguments[] = {
ETH_AF_PACKET_FRAMESIZE_ARG,
ETH_AF_PACKET_FRAMECOUNT_ARG,
ETH_AF_PACKET_QDISC_BYPASS_ARG,
+   ETH_AF_PACKET_PACKET_FANOUT_ARG,
NULL
 };
 
@@ -709,6 +711,7 @@ rte_pmd_init_internals(struct rte_vdev_device *dev,
unsigned int framesize,
unsigned int framecnt,
   unsigned int qdisc_bypass,
+  unsigned int packet_fanout,
struct pmd_internals **internals,
struct rte_eth_dev **eth_dev,
struct rte_kvargs *kvlist)
@@ -926,14 +929,17 @@ rte_pmd_init_internals(struct rte_vdev_device *dev,
goto error;
}
 
+   if (packet_fanout) {
 #if defined(PACKET_FANOUT)
-   rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
-   &fanout_arg, sizeof(fanout_arg));
-   if (rc == -1) {
-   PMD_LOG_ERRNO(ERR,
-   "%s: could not set PACKET_FANOUT on AF_PACKET 
socket for %s",
-   name, pair->value);
-   goto error;
+   rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT,
+   &fanout_arg, sizeof(fanout_arg));
+   if (rc == -1) {
+   PMD_LOG_ERRNO(ERR,
+   "%s: could not set PACKET_FANOUT "
+   "on AF_PACKET socket for %s",
+   name, pair->value);
+   goto error;
+   }
}
 #endif
}
@@ -1003,6 +1009,7 @@ rte_eth_from_packet(struct rte_vdev_device *dev,
unsigned int framecount = DFLT_FRAME_COUNT;
unsigned int qpairs = 1;
unsigned int qdisc_bypass = 1;
+   unsigned int packet_fanout = 1;
 
/* do some parameter checking */
if (*sockfd < 0)
@@ -1065,6 +1072,16 @@ rte_eth_from_packet(struct rte_vdev_device *dev,
}
continue;
}
+   if (strstr(pair->key, ETH_AF_PACKET_PACKET_FANOUT_ARG) != NULL) 
{
+   packet_fanout = atoi(pair->value);
+   if (packet_fanout > 1) {
+   PMD_LOG(ERR,
+   "%s: invalid packet fanout value",
+   name);
+   return -1;
+   }
+   continue;
+   }
}
 
if (framesize > blocksize) {
@@ -1091,6 +1108,7 @@ rte_eth_from_packet(struct rte

Re: [PATCH] net/af_packet: allow disabling packet fanout

2024-12-16 Thread Tudor Cornea
> Controlling fanout more is a good idea but not sure what this patch
> is trying to do with it.

I am maintaining a DPDK application which should run on a large number
of setups.
Unfortunately, I do not have a lot of control over the environment the
application runs on (e.g kernel).

The problem I was trying to solve is that the Af_Packet PMD seems to
fail to initialize properly on some setups.
I managed to reproduce the issue with testpmd.

sudo ./x86_64-native-linuxapp-gcc/app/dpdk-testpmd \
 -l 0-3 \
 -m 1024 \
 --no-huge \
 --no-shconf \
 --no-pci \
 
--vdev=net_af_packet0,iface=eth1,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0
\
 
--vdev=net_af_packet1,iface=eth2,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0
\
 -- \
 -i

Error: Unknown device type.
EAL: Detected CPU lcores: 8
EAL: Detected NUMA nodes: 1
EAL: Detected static linkage of DPDK
EAL: Selected IOVA mode 'VA'
AFPACKET: rte_pmd_init_internals(): net_af_packet0: could not set
PACKET_FANOUT on AF_PACKET socket for eth1:Invalid argument
VDEV_BUS: vdev_probe(): failed to initialize net_af_packet0 device
AFPACKET: rte_pmd_init_internals(): net_af_packet1: could not set
PACKET_FANOUT on AF_PACKET socket for eth1:Invalid argument
VDEV_BUS: vdev_probe(): failed to initialize net_af_packet1 device
EAL: Bus (vdev) probe failed.
testpmd: No probed ethernet devices

It seems that the call to PACKET_FANOUT setsockopt failed with EINVAL.
I debugged the issue further. It seems that the root cause is that
when the interface is down (e.g eth1), after the bind operation
succeeds, the PACKET_FANOUT setsockopt will fail.

This is a behavior of AF_Packet in Linux which I'm not sure is that
well documented.
It seems to be fixed in a recent Linux Kernel commit. I re-compiled
the kernel with the change, and did not see the issue afterwards.

Commit 2cee3e6e2e4b74bec96694169f01cd3feec1f264
af_packet: allow fanout_add when socket is not RUNNING

[1] 
https://github.com/torvalds/linux/commit/2cee3e6e2e4b74bec96694169f01cd3feec1f264

I was trying to find a solution for my application to run on setups
which don't have the kernel patch.
Introducing a devarg to control when packet_fanout is enabled seemed
like a possible way around the issue.

As a second solution around the issue, I was also thinking of using
the PACKET_FANOUT setsockopt only when nb_queues > 1.
Conceptually, I think it would make some sense, because fanout doesn't
really help when having a single socket.
We would need more sockets in order to load balance incoming packets.
And this would still allow me to control the behavior from the
existing 'qpairs' devarg.

The idea would be something along these lines

if (nb_queues > 1)  {
   rc = setsockopt(qsockfd, SOL_PACKET, PACKET_FANOUT, &fanout_arg,
sizeof(fanout_arg));
 /* ... */
}

Would a patch using this idea be considered more suitable ?

> - DPDK minimum kernel version is now 4.19 so no point in worrying about
>backward compatibility. According to man page for packet, fanout
>was added in 3.1 kernel.
>
> - It would be useful to allow application to control fanout in more detail.

This is a great idea. Would introducing a new devarg, (e.g
'fanout_mode') be the proper way to allow the application to customize
fanout in more detail ?

--vdev=net_af_packet0,iface=eth1,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,fanout_mode=[fanout_hash|fanout_cpu|fanout_rnd|fanout_qm]


[PATCH v3] net/af_packet: allow changing fanout mode

2025-01-20 Thread Tudor Cornea
This allows us to control the algorithm used to spread traffic between
sockets, adding more fine grained control. If the user does not
specify a fanout mode, the PMD driver will default to
PACKET_FANOUT_HASH.

Signed-off-by: Tudor Cornea 

---
v3:
* Removed PACKET_FANOUT ifdef. The feature has existed since Linux 3.1
* Removed PACKET_FANOUT_FLAG_ROLLOVER ifdef
* Simplified the get_fanout() function
* Renamed mode variable to load_balance.
  I was also thinking of load_balance_mode as an alternative.
* Removed space between get_fanout and following if statement
* Improved documentation. Added link to manual page for PACKET_FANOUT
v2:
* Renamed the patch
* Replaced packet_fanout argument with fanout_mode, which allows
more fine grained control
---
 doc/guides/nics/af_packet.rst | 16 +++-
 drivers/net/af_packet/rte_eth_af_packet.c | 89 ++-
 2 files changed, 84 insertions(+), 21 deletions(-)

diff --git a/doc/guides/nics/af_packet.rst b/doc/guides/nics/af_packet.rst
index a343d3a961..2ec9e9e674 100644
--- a/doc/guides/nics/af_packet.rst
+++ b/doc/guides/nics/af_packet.rst
@@ -13,8 +13,6 @@ PACKET_MMAP, which provides a mmapped ring buffer, shared 
between user space
 and kernel, that's used to send and receive packets. This helps reducing system
 calls and the copies needed between user space and Kernel.
 
-The PACKET_FANOUT_HASH behavior of AF_PACKET is used for frame reception.
-
 Options and inherent limitations
 
 
@@ -25,11 +23,23 @@ Some of these, in turn, will be used to configure the 
PACKET_MMAP settings.
 *   ``qpairs`` - number of Rx and Tx queues (optional, default 1);
 *   ``qdisc_bypass`` - set PACKET_QDISC_BYPASS option in AF_PACKET (optional,
 disabled by default);
+*   ``fanout_mode`` - set fanout algorithm.
+Possible choices: hash, lb, cpu, rollover, rnd, qm (optional, default 
hash);
 *   ``blocksz`` - PACKET_MMAP block size (optional, default 4096);
 *   ``framesz`` - PACKET_MMAP frame size (optional, default 2048B; Note: 
multiple
 of 16B);
 *   ``framecnt`` - PACKET_MMAP frame count (optional, default 512).
 
+For details regarding ``fanout_mode`` argument, you can consult the
+`PACKET_FANOUT documentation 
<https://www.man7.org/linux/man-pages/man7/packet.7.html>`_.
+
+As an example, when ``fanout_mode=cpu`` is selected, the PACKET_FANOUT_CPU
+mode will be set on the sockets, so that on frame reception, the socket
+will be selected based on the CPU on which the packet arrived.
+
+Only one ``fanout_mode`` can be chosen. If left unspecified, the default is to
+use the PACKET_FANOUT_HASH behavior of AF_PACKET for frame reception.
+
 Because this implementation is based on PACKET_MMAP, and PACKET_MMAP has its
 own pre-requisites, it should be noted that the inner workings of PACKET_MMAP
 should be carefully considered before modifying some of these options (namely,
@@ -64,7 +74,7 @@ framecnt=512):
 
 .. code-block:: console
 
-
--vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0
+
--vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0,fanout_mode=hash
 
 Features and Limitations
 
diff --git a/drivers/net/af_packet/rte_eth_af_packet.c 
b/drivers/net/af_packet/rte_eth_af_packet.c
index ceb8d9356a..9ca43fc54e 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -36,6 +36,7 @@
 #define ETH_AF_PACKET_FRAMESIZE_ARG"framesz"
 #define ETH_AF_PACKET_FRAMECOUNT_ARG   "framecnt"
 #define ETH_AF_PACKET_QDISC_BYPASS_ARG "qdisc_bypass"
+#define ETH_AF_PACKET_FANOUT_MODE_ARG  "fanout_mode"
 
 #define DFLT_FRAME_SIZE(1 << 11)
 #define DFLT_FRAME_COUNT   (1 << 9)
@@ -96,6 +97,7 @@ static const char *valid_arguments[] = {
ETH_AF_PACKET_FRAMESIZE_ARG,
ETH_AF_PACKET_FRAMECOUNT_ARG,
ETH_AF_PACKET_QDISC_BYPASS_ARG,
+   ETH_AF_PACKET_FANOUT_MODE_ARG,
NULL
 };
 
@@ -700,6 +702,53 @@ open_packet_iface(const char *key __rte_unused,
return 0;
 }
 
+#define PACKET_FANOUT_INVALID -1
+
+static int
+get_fanout_group_id(int if_index)
+{
+   return (getpid() ^ if_index) & 0x;
+}
+
+static int
+get_fanout_mode(const char *fanout_mode)
+{
+   int load_balance = PACKET_FANOUT_FLAG_DEFRAG |
+  PACKET_FANOUT_FLAG_ROLLOVER;
+
+   if (!fanout_mode) {
+   /* Default */
+   load_balance |= PACKET_FANOUT_HASH;
+   } else if (!strcmp(fanout_mode, "hash")) {
+   load_balance |= PACKET_FANOUT_HASH;
+   } else if (!strcmp(fanout_mode, "lb")) {
+   load_balance |= PACKET_FANOUT_LB;
+   } else if (!strcmp(fanout_mode, "cpu")) {
+   load_balance |= PACKET_FANOUT_CPU;
+   } else if (!strcmp(fanout_mode, "rollover&quo