[dpdk-dev] DPDK OVS on Ubuntu 14.04
May need to set up huge pages on the kernel boot line (this is an example, you may need to adjust). The huge page configuration can be added to the default configuration file /etc/default/grub by adding it to GRUB_CMDLINE_LINUX_DEFAULT, and the grub configuration file regenerated to get an updated configuration file for Linux boot.

# vim /etc/default/grub   // edit file
. . .
GRUB_CMDLINE_LINUX_DEFAULT="... default_hugepagesz=1GB hugepagesz=1GB hugepages=4 hugepagesz=2M hugepages=2048 ..."
. . .

This example sets up both 1 GB pages (4 pages, 4 GB of 1 GB hugepage memory) and 2 MB pages (2048 pages, 4 GB of 2 MB hugepage memory). After boot the number of 1 GB pages cannot be changed, but the number of 2 MB pages can be changed. After editing /etc/default/grub, the new grub.cfg boot file needs to be regenerated:

# update-grub

And reboot. After reboot the hugepage filesystems need to be mounted. If /dev/hugepages does not exist:

# mkdir /dev/hugepages
# mount -t hugetlbfs nodev /dev/hugepages
# mkdir /dev/hugepages_2mb
# mount -t hugetlbfs nodev /dev/hugepages_2mb -o pagesize=2MB

Mike

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Abhijeet Karve
Sent: Monday, November 30, 2015 10:14 PM
To: dev at dpdk.org
Cc: bhavya.addep at gmail.com
Subject: [dpdk-dev] DPDK OVS on Ubuntu 14.04

Dear All,

We are trying to install DPDK OVS on top of OpenStack Juno on an Ubuntu 14.04 single server. We are referring to the following steps for executing the same:
https://software.intel.com/en-us/blogs/2015/06/09/building-vhost-user-for-ovs-today-using-dpdk-200

During execution we are getting some issues with the ovs-vswitchd service, as it hangs during startup.
_
nfv-dpdk at nfv-dpdk:~$ tail -f /var/log/openvswitch/ovs-vswitchd.log
2015-11-24T10:54:34.036Z|6|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2015-11-24T10:54:34.036Z|7|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2015-11-24T10:54:34.064Z|8|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.4.90
2015-11-24T11:03:42.957Z|2|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2015-11-24T11:03:42.958Z|3|ovs_numa|INFO|Discovered 24 CPU cores on NUMA node 0
2015-11-24T11:03:42.958Z|4|ovs_numa|INFO|Discovered 24 CPU cores on NUMA node 1
2015-11-24T11:03:42.958Z|5|ovs_numa|INFO|Discovered 2 NUMA nodes and 48 CPU cores
2015-11-24T11:03:42.958Z|6|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2015-11-24T11:03:42.958Z|7|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2015-11-24T11:03:42.961Z|8|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.4.90
_

Also, attaching output (Hugepage.txt) of "./ovs-vswitchd --dpdk -c 0x0FF8 -n 4 --socket-mem 1024,0 -- --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid".

We tried setting up echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages, but couldn't succeed.

Can anyone please help us identify anything we are missing that could be causing ovs-vswitchd to get stuck while starting?

Also, when we create a VM in OpenStack with DPDK OVS, dpdkvhost-user type interfaces are created automatically. If these interfaces are mapped to the regular br-int bridge rather than the DPDK bridge br0, does this mean that we have successfully enabled DPDK with the netdev datapath?

We would really appreciate any advice you have.

Thanks & Regards
Abhijeet Karve
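For reference, a minimal sketch (not from the thread above; the EAL options and values shown are only placeholders) of how a DPDK process, such as the ovs-vswitchd invocation quoted above, picks up the mounted hugepages through its EAL arguments:

#include <rte_eal.h>
#include <rte_debug.h>

int
main(int argc, char **argv)
{
	/* Typically launched as, e.g.:
	 *   ./app -c 0x3 -n 4 --socket-mem 1024,0 --huge-dir /dev/hugepages
	 * (coremask and memory sizes here are placeholder values only) */
	int ret = rte_eal_init(argc, argv);

	if (ret < 0)
		rte_panic("Cannot init EAL\n");

	/* ... port and mempool setup would follow ... */
	return 0;
}

If --socket-mem requests memory on a NUMA socket that has no free hugepages of a suitable size, EAL initialization fails, which is one common reason a DPDK-enabled daemon stalls at startup.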
[dpdk-dev] Does anybody know OpenDataPlane
I don't think you have researched this enough. Asking these questions shows that you are just beginning your research or do not understand how this fits into current telco NFV/SDN efforts. Why does this exist: "OpenDataPlane using DPDK for Intel NIC", listed below? Why would competing technologies use the competition's technology to solve a problem? Maybe you can change your thesis to "Current Open Source Dataplane Methods" and do a comparison between the two. However, if you just look at the sales documentation then you may not understand the real difference.

Mike

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Kury Nicolas
Sent: Wednesday, December 2, 2015 6:22 AM
To: dev at dpdk.org
Subject: [dpdk-dev] Does anybody know OpenDataPlane

Hi!

Does anybody know OpenDataPlane? http://www.opendataplane.org/ It is a framework designed to enable software portability between networking SoCs, regardless of the underlying instruction set architecture. There are several implementations:
* OpenDataPlane using DPDK for Intel NICs
* OpenDataPlane using DPAA for Freescale platforms (QorIQ)
* OpenDataPlane using MCSDK for Texas Instruments platforms (KeyStone II)
* etc.

When a developer wants to port his application, he just needs to recompile it with the implementation of OpenDataPlane for the new platform.

I'm doing my Master's thesis on OpenDataPlane and I have some questions.
- Now that OpenDataPlane (ODP) exists, should every developer start a new project with ODP, or are there still reasons to use DPDK? What do you think?

Thank you very much
Nicolas
[dpdk-dev] Does anybody know OpenDataPlane
A hint of the fundamental difference: one originated somewhat more from the embedded orientation and one originated somewhat more from the server orientation. Both efforts are driving toward each other and have overlap.

Mike

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Polehn, Mike A
Sent: Wednesday, December 2, 2015 8:32 AM
To: Kury Nicolas; dev at dpdk.org
Subject: Re: [dpdk-dev] Does anybody know OpenDataPlane

I don't think you have researched this enough. Asking these questions shows that you are just beginning your research or do not understand how this fits into current telco NFV/SDN efforts. Why does this exist: "OpenDataPlane using DPDK for Intel NIC", listed below? Why would competing technologies use the competition's technology to solve a problem? Maybe you can change your thesis to "Current Open Source Dataplane Methods" and do a comparison between the two. However, if you just look at the sales documentation then you may not understand the real difference.

Mike

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Kury Nicolas
Sent: Wednesday, December 2, 2015 6:22 AM
To: dev at dpdk.org
Subject: [dpdk-dev] Does anybody know OpenDataPlane

Hi!

Does anybody know OpenDataPlane? http://www.opendataplane.org/ It is a framework designed to enable software portability between networking SoCs, regardless of the underlying instruction set architecture. There are several implementations:
* OpenDataPlane using DPDK for Intel NICs
* OpenDataPlane using DPAA for Freescale platforms (QorIQ)
* OpenDataPlane using MCSDK for Texas Instruments platforms (KeyStone II)
* etc.

When a developer wants to port his application, he just needs to recompile it with the implementation of OpenDataPlane for the new platform.

I'm doing my Master's thesis on OpenDataPlane and I have some questions.
- Now that OpenDataPlane (ODP) exists, should every developer start a new project with ODP, or are there still reasons to use DPDK? What do you think?

Thank you very much
Nicolas
[dpdk-dev] rte_prefetch0() is effective?
Prefetches make a big difference because a powerful CPU like IA is always trying to find items to prefetch, and the priority of these is not always easy to determine. This is especially a problem across subroutine calls, since the compiler cannot determine what has priority in the other subroutines, and the runtime CPU logic cannot always have the future well predicted far enough ahead for all possible paths, especially if you have a cache miss, which takes eons of clock cycles to do a memory access and probably results in a CPU stall.

Until computers fully understand the logic of a program and write optimum code themselves (putting programmers out of business), the programmer's understanding of what is important as the program progresses tells them what is desirable to prefetch. It is difficult to determine whether the CPU is going to give the prefetch the same priority, so a prefetch may or may not show up as a measurable performance improvement under some conditions; but having the prefetch decision in place can make the prefetch priority decision correct in those other cases, which does yield a performance improvement.

Removing a prefetch without thinking through and fully understanding the logic of why it is there, or what the added cost is, if any (for example when calculating an address for the prefetch affects other current operations), is just plain amateur work. That is not to say people do not make bad judgments about what needs to be prefetched or place prefetches poorly; a prefetch should only be removed if it is not logically proper for the expected runtime operation. Only more primitive CPUs with no prefetch capabilities don't benefit from properly placed prefetches.

Mike

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Bruce Richardson
Sent: Wednesday, January 13, 2016 3:35 AM
To: Moon-Sang Lee
Cc: dev at dpdk.org
Subject: Re: [dpdk-dev] rte_prefetch0() is effective?

On Thu, Dec 24, 2015 at 03:35:14PM +0900, Moon-Sang Lee wrote:
> I see code as below in the example directory, and I wonder if it is effective.
> Coherent IO is adopted in modern architectures, so I think that DMA
> initiation by rte_eth_rx_burst() might already fill the cache lines of
> RX buffers. Do I really need to call rte_prefetchX()?
>
>     nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst, MAX_PKT_BURST);
>     ...
>     /* Prefetch and forward already prefetched packets */
>     for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
>         rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j + PREFETCH_OFFSET], void *));
>         l3fwd_simple_forward(pkts_burst[j], portid, qconf);
>     }
>
Good question. When the first example apps using this style of prefetch were originally written, yes, there was a noticeable performance increase achieved by using the prefetch. Thereafter, I'm not sure that anyone has checked with each generation of platforms whether the prefetches are still necessary and how much they help, but I suspect that they still help a bit, and don't hurt performance.

It would be an interesting exercise to check whether the prefetch offsets used in code like the above can be adjusted to give better performance on our latest supported platforms.

/Bruce
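To make the pattern concrete, below is a sketch of the prefetch-ahead RX loop in the style of the l3fwd code quoted above; PREFETCH_OFFSET, the burst size, and the handle_packet() stand-in are assumptions to experiment with, not fixed values from the thread.

#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_prefetch.h>

#define MAX_PKT_BURST   32
#define PREFETCH_OFFSET 3   /* tunable: how far ahead to prefetch */

/* Placeholder for the real per-packet work (l3fwd_simple_forward in the
 * quoted example); here it just frees the packet. */
static inline void
handle_packet(struct rte_mbuf *m)
{
	rte_pktmbuf_free(m);
}

static void
rx_loop_once(uint16_t portid, uint16_t queueid)
{
	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
	int j, nb_rx;

	nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst, MAX_PKT_BURST);

	/* Start prefetching the first few packet headers */
	for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++)
		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j], void *));

	/* Prefetch ahead while handling packets already prefetched */
	for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
		rte_prefetch0(rte_pktmbuf_mtod(
				pkts_burst[j + PREFETCH_OFFSET], void *));
		handle_packet(pkts_burst[j]);
	}

	/* Handle any remaining packets (nothing left to prefetch) */
	for (; j < nb_rx; j++)
		handle_packet(pkts_burst[j]);
}

The offset trades how early the header data is requested against how much useful work is available to hide the memory latency; re-measuring it per platform is exactly the exercise Bruce suggests.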
[dpdk-dev] [PATCH] vhost: remove lockless enqueue to the virtio ring
SMP operations can be very expensive, sometimes impacting operations by hundreds to thousands of clock cycles depending on the circumstances of the synchronization. It is how you arrange the SMP operations within the tasks at hand, across the SMP cores, that gives methods for top performance. Using traditional general purpose SMP methods will result in traditional general purpose performance. Migrating from expert techniques (understood by a much smaller group of expert programmers focused on performance) to general libraries (understood by most general purpose programmers) will greatly reduce the value of DPDK, since the end result will be lower performance and/or less predictable operation, where rate performance, predictability, and low latency are the primary goals.

The best method to date for multiple outputs to a single port is to use a DPDK queue with multiple producers and a single consumer: the SMP operation is confined to the queue, which lets multiple sources feed a single non-SMP task that outputs to the port (that is why the ports are not SMP protected). Also, when considerable contention from multiple sources occurs often (data feeding at the same time), having the DPDK queue's input and output variables in separate cache lines can give a notable throughput improvement.

Mike

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Xie, Huawei
Sent: Tuesday, January 19, 2016 8:44 AM
To: Tan, Jianfeng; dev at dpdk.org
Cc: ann.zhuangyanying at huawei.com
Subject: Re: [dpdk-dev] [PATCH] vhost: remove lockless enqueue to the virtio ring

On 1/20/2016 12:25 AM, Tan, Jianfeng wrote:
> Hi Huawei,
>
> On 1/4/2016 10:46 PM, Huawei Xie wrote:
>> This patch removes the internal lockless enqueue implmentation.
>> DPDK doesn't support receiving/transmitting packets from/to the same
>> queue. Vhost PMD wraps vhost device as normal DPDK port. DPDK
>> applications normally have their own lock implmentation when enqueue
>> packets to the same queue of a port.
>>
>> The atomic cmpset is a costly operation. This patch should help
>> performance a bit.
>>
>> Signed-off-by: Huawei Xie
>> ---
>> lib/librte_vhost/vhost_rxtx.c | 86 +--
>> 1 file changed, 25 insertions(+), 61 deletions(-)
>>
>> diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
>> index bbf3fac..26a1b9c 100644
>> --- a/lib/librte_vhost/vhost_rxtx.c
>> +++ b/lib/librte_vhost/vhost_rxtx.c
>
> I think vhost example will not work well with this patch when
> vm2vm=software.
>
> Test case:
> Two virtio ports handled by two pmd threads. Thread 0 polls pkts from
> physical NIC and sends to virtio0, while thread0 receives pkts from
> virtio1 and routes it to virtio0.

The vhost port will be wrapped as a port by the vhost PMD. A DPDK app treats all physical and virtual ports equally as ports. When two DPDK threads try to enqueue to the same port, the app needs to consider the contention. None of the physical PMDs support concurrent enqueuing/dequeuing. The vhost PMD should expose the same behavior unless absolutely necessary, and then we expose the difference between PMDs.

>
>> -    *(volatile uint16_t *)&vq->used->idx += entry_success;
>
> Another unrelated question: We ever try to move this assignment out of
> loop to save cost as it's a data contention?

This operation itself is not that costly, but it has a side effect on the cache transfer. It is outside of the loop for the non-mergeable case. For the mergeable case, it is inside the loop. Actually it has pros and cons whether we do this in burst or in a smaller step. I prefer to move it outside of the loop.
Let us address this later.

> Thanks,
> Jianfeng
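A minimal sketch of the multi-producer/single-consumer arrangement Mike describes above (the ring name, sizes, and drop-on-full policy are assumptions; the rte_ring burst calls use the signatures of DPDK releases of that era, while newer releases add a fourth parameter): several worker cores enqueue into one rte_ring, and a single core drains it and calls rte_eth_tx_burst(), so the non-thread-safe PMD transmit path is only ever touched by one task.

#include <rte_ring.h>
#include <rte_mbuf.h>
#include <rte_ethdev.h>

#define TX_RING_SIZE 1024
#define BURST        32

/* One ring per output port: any core may enqueue (multi-producer),
 * only the port's owner core dequeues (single consumer). */
static struct rte_ring *tx_ring;

static int
init_tx_ring(int socket_id)
{
	/* MP enqueue is the default; RING_F_SC_DEQ selects single consumer */
	tx_ring = rte_ring_create("tx_ring", TX_RING_SIZE, socket_id,
				  RING_F_SC_DEQ);
	return tx_ring != NULL ? 0 : -1;
}

/* Called from any worker core */
static void
worker_send(struct rte_mbuf **pkts, unsigned int n)
{
	unsigned int sent = rte_ring_mp_enqueue_burst(tx_ring,
						      (void **)pkts, n);
	while (sent < n)            /* drop what did not fit */
		rte_pktmbuf_free(pkts[sent++]);
}

/* Called only from the single core that owns the TX queue */
static void
tx_drain(uint16_t port_id, uint16_t queue_id)
{
	struct rte_mbuf *pkts[BURST];
	unsigned int n = rte_ring_sc_dequeue_burst(tx_ring,
						   (void **)pkts, BURST);
	uint16_t sent = rte_eth_tx_burst(port_id, queue_id, pkts, (uint16_t)n);

	while (sent < n)            /* free packets the NIC did not take */
		rte_pktmbuf_free(pkts[sent++]);
}

Keeping the ring's producer and consumer indices in separate cache lines, as the ring library already does, is what gives the throughput benefit under heavy contention that the message above refers to.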
[dpdk-dev] rte_mbuf size for jumbo frame
Jumbo frames are generally handled by linked lists (mbuf chains) of mbufs. Enabling jumbo frames for the device driver should enable the portion of the driver which handles those chains. Don't make the mbufs huge.

Mike

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Masaru OKI
Sent: Monday, January 25, 2016 2:41 PM
To: Saurabh Mishra; users at dpdk.org; dev at dpdk.org
Subject: Re: [dpdk-dev] rte_mbuf size for jumbo frame

Hi,

1. Take care of the unit size of the mempool for mbufs.
2. Call rte_eth_dev_set_mtu() for each interface.
   Note that some PMDs do not support changing the MTU.

On 2016/01/26 6:02, Saurabh Mishra wrote:
> Hi,
>
> We wanted to use a 10400-byte size for each rte_mbuf to enable jumbo frames.
> Do you guys see any problem with that? Would all the drivers like
> ixgbe, i40e, vmxnet3, virtio and bnx2x work with a larger rte_mbuf size?
>
> We would want to avoid dealing with chained mbufs.
>
> /Saurabh
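As a small illustration of the mbuf chaining described above (not code from the thread), a jumbo frame received with scatter enabled arrives as a chain of normal-sized mbufs linked through the next pointer:

#include <rte_mbuf.h>

/* Sum the bytes of a possibly multi-segment packet; for a jumbo frame
 * received into normal-sized mbufs this walks the whole chain. */
static uint32_t
packet_bytes(const struct rte_mbuf *m)
{
	uint32_t total = 0;

	/* pkt_len of the first segment already holds the total length;
	 * the walk below just shows how the segments are linked. */
	for (; m != NULL; m = m->next)
		total += m->data_len;   /* bytes held in this segment */

	return total;
}

The driver fills pkt_len and nb_segs on the first segment, so applications only need the manual walk when they must touch every segment's data.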
[dpdk-dev] [Dpdk-ovs] problem in binding interfaces of virtio-pci on the VM
In this example, the control network interface 00:03.0 is left attached to the Linux device driver (for ssh access with putty) rather than bound to the UIO driver; just the target interfaces are bound. Below, the target interfaces are shown bound to the uio driver; bound interfaces are not usable until a task uses the UIO driver.

[root at F21vm l3fwd-vf]# lspci -nn
00:00.0 Host bridge [0600]: Intel Corporation 440FX - 82441FX PMC [Natoma] [8086:1237] (rev 02)
00:01.0 ISA bridge [0601]: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] [8086:7000]
00:01.1 IDE interface [0101]: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II] [8086:7010]
00:01.3 Bridge [0680]: Intel Corporation 82371AB/EB/MB PIIX4 ACPI [8086:7113] (rev 03)
00:02.0 VGA compatible controller [0300]: Cirrus Logic GD 5446 [1013:00b8]
00:03.0 Ethernet controller [0200]: Red Hat, Inc Virtio network device [1af4:1000]
00:04.0 Ethernet controller [0200]: Intel Corporation XL710/X710 Virtual Function [8086:154c] (rev 01)
00:05.0 Ethernet controller [0200]: Intel Corporation XL710/X710 Virtual Function [8086:154c] (rev 01)

[root at F21vm l3fwd-vf]# /usr/src/dpdk/tools/dpdk_nic_bind.py --bind=igb_uio 00:04.0
[root at F21vm l3fwd-vf]# /usr/src/dpdk/tools/dpdk_nic_bind.py --bind=igb_uio 00:05.0
[root at F21vm l3fwd-vf]# /usr/src/dpdk/tools/dpdk_nic_bind.py --status

Network devices using DPDK-compatible driver
:00:04.0 'XL710/X710 Virtual Function' drv=igb_uio unused=i40evf
:00:05.0 'XL710/X710 Virtual Function' drv=igb_uio unused=i40evf

Network devices using kernel driver
===
:00:03.0 'Virtio network device' if= drv=virtio-pci unused=virtio_pci,igb_uio

Other network devices
=

-Original Message-
From: Dpdk-ovs [mailto:dpdk-ovs-boun...@lists.01.org] On Behalf Of Srinivasreddy R
Sent: Thursday, February 26, 2015 6:11 AM
To: dev at dpdk.org; dpdk-ovs at lists.01.org
Subject: [Dpdk-ovs] problem in binding interfaces of virtio-pci on the VM

Hi,

I have written a sample program for usvhost supported by OVDK. I have initialized the VM using the below command. On the VM I am able to see two interfaces, and they work fine with traffic in raw socket mode. My problem is that when I bind the interfaces to the PMD driver [igb_uio], my virtual machine hangs and I am not able to access it further. Now my question is: what may be the reason for this behavior, and how can I debug the root cause? Please help in finding out the problem.

./tools/dpdk_nic_bind.py --status

Network devices using DPDK-compatible driver

Network devices using kernel driver
===
:00:03.0 '82540EM Gigabit Ethernet Controller' if=ens3 drv=e1000 unused=igb_uio *Active*
:00:04.0 'Virtio network device' if= drv=virtio-pci unused=igb_uio
:00:05.0 'Virtio network device' if= drv=virtio-pci unused=igb_uio

Other network devices
=

./dpdk_nic_bind.py --bind=igb_uio 00:04.0 00:05.0

./x86_64-softmmu/qemu-system-x86_64 -cpu host -boot c -hda /home/utils/images/vm1.img -m 2048M -smp 3 --enable-kvm -name 'VM1' -nographic -vnc :1 -pidfile /tmp/vm1.pid -drive file=fat:rw:/tmp/qemu_share,snapshot=off -monitor unix:/tmp/vm1monitor,server,nowait -net none -no-reboot -mem-path /dev/hugepages -mem-prealloc -netdev type=tap,id=net1,script=no,downscript=no,ifname=usvhost1,vhost=on -device virtio-net-pci,netdev=net1,mac=00:16:3e:00:03:03,csum=off,gso=off,guest_tso4=off,guest_tso6=off,guest_ecn=off -netdev type=tap,id=net2,script=no,downscript=no,ifname=usvhost2,vhost=on -device virtio-net-pci,netdev=net2,mac=00:16:3e:00:03:04,csum=off,gso=off,guest_tso4=off,guest_tso6=off,guest_ecn=off

--
thanks
srinivas.
[dpdk-dev] [Patch 1/2] i40e RX Bulk Alloc: Larger list size (33 to 128) throughput optimization
Combined 2 subroutines of code into one subroutine with one read operation followed by buffer allocate and load loop. Eliminated the staging queue and subroutine, which removed extra pointer list movements and reduced number of active variable cache pages during for call. Reduced queue position variables to just 2, the next read point and last NIC RX descriptor position, also changed to allowing NIC descriptor table to not always need to be filled. Removed NIC register update write from per loop to one per driver write call to minimize CPU stalls waiting of multiple SMB synchronization points and for earlier NIC register writes that often take large cycle counts to complete. For example with an input packet list of 33, with the default loops size of 32, the second NIC register write will occur just after RX processing for just 1 packet, resulting in large CPU stall time. Eliminated initial rx packet present test before rx processing loop that also checks, since less free time is generally available when packets are present than when not processing any input packets. Used some standard variables to help reduce overhead of non-standard variable sizes. Reduced number of variables, reordered variable structure to put most active variables in first cache line, better utilize memory bytes inside cache line, and reduced active cache line count to 1 cache line during processing call. Other RX subroutine sets might still use more than 1 variable cache line. Signed-off-by: Mike A. Polehn diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c index fd656d5..ea63f2f 100644 --- a/drivers/net/i40e/i40e_rxtx.c +++ b/drivers/net/i40e/i40e_rxtx.c @@ -63,6 +63,7 @@ #define DEFAULT_TX_RS_THRESH 32 #define DEFAULT_TX_FREE_THRESH 32 #define I40E_MAX_PKT_TYPE 256 +#define I40E_RX_INPUT_BUF_MAX 256 #define I40E_TX_MAX_BURST 32 @@ -959,115 +960,97 @@ check_rx_burst_bulk_alloc_preconditions(__rte_unused struct i40e_rx_queue *rxq) } #ifdef RTE_LIBRTE_I40E_RX_ALLOW_BULK_ALLOC -#define I40E_LOOK_AHEAD 8 -#if (I40E_LOOK_AHEAD != 8) -#error "PMD I40E: I40E_LOOK_AHEAD must be 8\n" -#endif -static inline int -i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq) + +static inline unsigned +i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts, + unsigned nb_pkts) { volatile union i40e_rx_desc *rxdp; struct i40e_rx_entry *rxep; - struct rte_mbuf *mb; - uint16_t pkt_len; - uint64_t qword1; - uint32_t rx_status; - int32_t s[I40E_LOOK_AHEAD], nb_dd; - int32_t i, j, nb_rx = 0; - uint64_t pkt_flags; + unsigned i, n, tail; - rxdp = &rxq->rx_ring[rxq->rx_tail]; - rxep = &rxq->sw_ring[rxq->rx_tail]; - - qword1 = rte_le_to_cpu_64(rxdp->wb.qword1.status_error_len); - rx_status = (qword1 & I40E_RXD_QW1_STATUS_MASK) >> - I40E_RXD_QW1_STATUS_SHIFT; + /* Wrap tail */ + if (rxq->rx_tail >= rxq->nb_rx_desc) + tail = 0; + else + tail = rxq->rx_tail; + + /* Stop at end of Q, for end, next read alligned at Q start */ + n = rxq->nb_rx_desc - tail; + if (n < nb_pkts) + nb_pkts = n; + + rxdp = &rxq->rx_ring[tail]; + rte_prefetch0(rxdp); + rxep = &rxq->sw_ring[tail]; + rte_prefetch0(rxep); + + i = 0; + while (nb_pkts > 0) { + /* Prefetch NIC descriptors and packet list */ + if (likely(nb_pkts > 4)) { + rte_prefetch0(&rxdp[4]); + if (likely(nb_pkts > 8)) { + rte_prefetch0(&rxdp[8]); + rte_prefetch0(&rxep[8]); + } + } - /* Make sure there is at least 1 packet to receive */ - if (!(rx_status & (1 << I40E_RX_DESC_STATUS_DD_SHIFT))) - return 0; + for (n = 0; (nb_pkts > 0)&&(n < 8); n++, nb_pkts--, i++) { + uint64_t 
qword1; + uint64_t pkt_flags; + uint16_t pkt_len; + struct rte_mbuf *mb = rxep->mbuf; + rxep++; - /** -* Scan LOOK_AHEAD descriptors at a time to determine which -* descriptors reference packets that are ready to be received. -*/ - for (i = 0; i < RTE_PMD_I40E_RX_MAX_BURST; i+=I40E_LOOK_AHEAD, - rxdp += I40E_LOOK_AHEAD, rxep += I40E_LOOK_AHEAD) { - /* Read desc statuses backwards to avoid race condition */ - for (j = I40E_LOOK_AHEAD - 1; j >= 0; j--) { + /* Translate descriptor info to mbuf parameters */ qword1 = rte_le_to_cpu_64(\ - rxdp[j].wb.qword1.status_error_len); - s[j] = (qword1 & I40E_RXD_QW1_STATUS_MASK) >> -
[dpdk-dev] [Patch] Eth Driver: Optimization for improved NIC processing rates
Prefetch of interface access variables while calling into driver RX and TX subroutines. For converging zero loss packet task tests, a small drop in latency for zero loss measurements and small drop in lost packet counts for the lossy measurement points was observed, indicating some savings of execution clock cycles. Signed-off-by: Mike A. Polehn diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h index 8a8c82b..09f1069 100644 --- a/lib/librte_ether/rte_ethdev.h +++ b/lib/librte_ether/rte_ethdev.h @@ -2357,11 +2357,15 @@ rte_eth_rx_burst(uint8_t port_id, uint16_t queue_id, struct rte_mbuf **rx_pkts, const uint16_t nb_pkts) { struct rte_eth_dev *dev; + void *rxq; dev = &rte_eth_devices[port_id]; - int16_t nb_rx = (*dev->rx_pkt_burst)(dev->data->rx_queues[queue_id], - rx_pkts, nb_pkts); + /* rxq is going to be immediately used, prefetch it */ + rxq = dev->data->rx_queues[queue_id]; + rte_prefetch0(rxq); + + int16_t nb_rx = (*dev->rx_pkt_burst)(rxq, rx_pkts, nb_pkts); #ifdef RTE_ETHDEV_RXTX_CALLBACKS struct rte_eth_rxtx_callback *cb = dev->post_rx_burst_cbs[queue_id]; @@ -2499,6 +2503,7 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id, struct rte_mbuf **tx_pkts, uint16_t nb_pkts) { struct rte_eth_dev *dev; + void *txq; dev = &rte_eth_devices[port_id]; @@ -2514,7 +2519,11 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id, } #endif - return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id], tx_pkts, nb_pkts); + /* txq is going to be immediately used, prefetch it */ + txq = dev->data->tx_queues[queue_id]; + rte_prefetch0(txq); + + return (*dev->tx_pkt_burst)(txq, tx_pkts, nb_pkts); } #endif
[dpdk-dev] [Patch 2/2] i40e rx Bulk Alloc: Larger list size (33 to 128) throughput optimization
Added check of minimum of 2 packet allocation count to eliminate the extra overhead for supporting prefetch for the case of checking for only one packet allocated into the queue at a time. Used some standard variables to help reduce overhead of non-standard variable sizes. Added second level prefetch to get packet address in cache 0 earlier and eliminated calculation inside loop to determine end of prefetch loop. Used old time instruction C optimization methods of, using pointers instead of arrays, and reducing scope of some variables to improve chances of using register variables instead of a stack variables. Signed-off-by: Mike A. Polehn diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c index ec62f75..2032e06 100644 --- a/drivers/net/i40e/i40e_rxtx.c +++ b/drivers/net/i40e/i40e_rxtx.c @@ -64,6 +64,7 @@ #define DEFAULT_TX_FREE_THRESH 32 #define I40E_MAX_PKT_TYPE 256 #define I40E_RX_INPUT_BUF_MAX 256 +#define I40E_RX_FREE_THRESH_MIN 2 #define I40E_TX_MAX_BURST 32 @@ -942,6 +943,12 @@ check_rx_burst_bulk_alloc_preconditions(__rte_unused struct i40e_rx_queue *rxq) "rxq->rx_free_thresh=%d", rxq->nb_rx_desc, rxq->rx_free_thresh); ret = -EINVAL; + } else if (rxq->rx_free_thresh < I40E_RX_FREE_THRESH_MIN) { + PMD_INIT_LOG(DEBUG, "Rx Burst Bulk Alloc Preconditions: " + "rxq->rx_free_thresh=%d, " + "I40E_RX_FREE_THRESH_MIN=%d", + rxq->rx_free_thresh, I40E_RX_FREE_THRESH_MIN); + ret = -EINVAL; } else if (!(rxq->nb_rx_desc < (I40E_MAX_RING_DESC - RTE_PMD_I40E_RX_MAX_BURST))) { PMD_INIT_LOG(DEBUG, "Rx Burst Bulk Alloc Preconditions: " @@ -1058,9 +1065,8 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq) { volatile union i40e_rx_desc *rxdp; struct i40e_rx_entry *rxep; - struct rte_mbuf *mb; - unsigned alloc_idx, i; - uint64_t dma_addr; + struct rte_mbuf *pk, *npk; + unsigned alloc_idx, i, l; int diag; /* Allocate buffers in bulk */ @@ -1076,22 +1082,36 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq) return -ENOMEM; } + pk = rxep->mbuf; + rte_prefetch0(pk); + rxep++; + npk = rxep->mbuf; + rte_prefetch0(npk); + rxep++; + l = rxq->rx_free_thresh - 2; + rxdp = &rxq->rx_ring[alloc_idx]; for (i = 0; i < rxq->rx_free_thresh; i++) { - if (likely(i < (rxq->rx_free_thresh - 1))) + struct rte_mbuf *mb = pk; + pk = npk; + if (likely(i < l)) { /* Prefetch next mbuf */ - rte_prefetch0(rxep[i + 1].mbuf); - - mb = rxep[i].mbuf; - rte_mbuf_refcnt_set(mb, 1); - mb->next = NULL; + npk = rxep->mbuf; + rte_prefetch0(npk); + rxep++; + } mb->data_off = RTE_PKTMBUF_HEADROOM; + rte_mbuf_refcnt_set(mb, 1); mb->nb_segs = 1; mb->port = rxq->port_id; - dma_addr = rte_cpu_to_le_64(\ - RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb)); - rxdp[i].read.hdr_addr = 0; - rxdp[i].read.pkt_addr = dma_addr; + mb->next = NULL; + { + uint64_t dma_addr = rte_cpu_to_le_64( + RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb)); + rxdp->read.hdr_addr = dma_addr; + rxdp->read.pkt_addr = dma_addr; + } + rxdp++; } rxq->rx_last_pos = alloc_idx + rxq->rx_free_thresh - 1;
[dpdk-dev] [Patch 1/2] i40e simple tx: Larger list size (33 to 128) throughput optimization
Reduce the 32 packet list size focus for better packet list size range handling. Changed maximum new buffer loop process size to the NIC queue free buffer count per loop. Removed redundant single call check to just one call with focused loop. Remove NIC register update write from per loop to one per write driver call to minimize CPU stalls waiting for multiple SMP synchronization points and for earlier NIC register writes that often take large cycle counts to complete. For example with an output list size of 64, the default loops size of 32, when 33 packets are queued on descriptor table, the second NIC register write will occur just after TX processing for 1 packet, resulting in a large CPU stall time. Used some standard variables to help reduce overhead of non-standard variable sizes. Reordered variable structure to put most active variables in first cache line, better utilize memory bytes inside cache line, and reduced active cache line count during call. Signed-off-by: Mike A. Polehn diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c index ec62f75..2032e06 100644 --- a/drivers/net/i40e/i40e_rxtx.c +++ b/drivers/net/i40e/i40e_rxtx.c @@ -64,6 +64,7 @@ #define DEFAULT_TX_FREE_THRESH 32 #define I40E_MAX_PKT_TYPE 256 #define I40E_RX_INPUT_BUF_MAX 256 +#define I40E_RX_FREE_THRESH_MIN 2 #define I40E_TX_MAX_BURST 32 @@ -942,6 +943,12 @@ check_rx_burst_bulk_alloc_preconditions(__rte_unused struct i40e_rx_queue *rxq) "rxq->rx_free_thresh=%d", rxq->nb_rx_desc, rxq->rx_free_thresh); ret = -EINVAL; + } else if (rxq->rx_free_thresh < I40E_RX_FREE_THRESH_MIN) { + PMD_INIT_LOG(DEBUG, "Rx Burst Bulk Alloc Preconditions: " + "rxq->rx_free_thresh=%d, " + "I40E_RX_FREE_THRESH_MIN=%d", + rxq->rx_free_thresh, I40E_RX_FREE_THRESH_MIN); + ret = -EINVAL; } else if (!(rxq->nb_rx_desc < (I40E_MAX_RING_DESC - RTE_PMD_I40E_RX_MAX_BURST))) { PMD_INIT_LOG(DEBUG, "Rx Burst Bulk Alloc Preconditions: " @@ -1058,9 +1065,8 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq) { volatile union i40e_rx_desc *rxdp; struct i40e_rx_entry *rxep; - struct rte_mbuf *mb; - unsigned alloc_idx, i; - uint64_t dma_addr; + struct rte_mbuf *pk, *npk; + unsigned alloc_idx, i, l; int diag; /* Allocate buffers in bulk */ @@ -1076,22 +1082,36 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq) return -ENOMEM; } + pk = rxep->mbuf; + rte_prefetch0(pk); + rxep++; + npk = rxep->mbuf; + rte_prefetch0(npk); + rxep++; + l = rxq->rx_free_thresh - 2; + rxdp = &rxq->rx_ring[alloc_idx]; for (i = 0; i < rxq->rx_free_thresh; i++) { - if (likely(i < (rxq->rx_free_thresh - 1))) + struct rte_mbuf *mb = pk; + pk = npk; + if (likely(i < l)) { /* Prefetch next mbuf */ - rte_prefetch0(rxep[i + 1].mbuf); - - mb = rxep[i].mbuf; - rte_mbuf_refcnt_set(mb, 1); - mb->next = NULL; + npk = rxep->mbuf; + rte_prefetch0(npk); + rxep++; + } mb->data_off = RTE_PKTMBUF_HEADROOM; + rte_mbuf_refcnt_set(mb, 1); mb->nb_segs = 1; mb->port = rxq->port_id; - dma_addr = rte_cpu_to_le_64(\ - RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb)); - rxdp[i].read.hdr_addr = 0; - rxdp[i].read.pkt_addr = dma_addr; + mb->next = NULL; + { + uint64_t dma_addr = rte_cpu_to_le_64( + RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb)); + rxdp->read.hdr_addr = dma_addr; + rxdp->read.pkt_addr = dma_addr; + } + rxdp++; } rxq->rx_last_pos = alloc_idx + rxq->rx_free_thresh - 1;
[dpdk-dev] [Patch 2/2] i40e simple tx: Larger list size (33 to 128) throughput optimization
Added packet memory prefetch for faster access to variables inside packet buffer needed for the free operation. Signed-off-by: Mike A. Polehn diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c index 177fb2e..d9bc30a 100644 --- a/drivers/net/i40e/i40e_rxtx.c +++ b/drivers/net/i40e/i40e_rxtx.c @@ -1748,7 +1748,8 @@ static inline int __attribute__((always_inline)) i40e_tx_free_bufs(struct i40e_tx_queue *txq) { struct i40e_tx_entry *txep; - uint16_t i; + unsigned i, l, tx_rs_thresh; + struct rte_mbuf *pk, *pk_next; if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz & rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) != @@ -1757,18 +1758,46 @@ i40e_tx_free_bufs(struct i40e_tx_queue *txq) txep = &(txq->sw_ring[txq->tx_next_dd - (txq->tx_rs_thresh - 1)]); - for (i = 0; i < txq->tx_rs_thresh; i++) - rte_prefetch0((txep + i)->mbuf); + /* Prefetch first 2 packets */ + pk = txep->mbuf; + rte_prefetch0(pk); + txep->mbuf = NULL; + txep++; + tx_rs_thresh = txq->tx_rs_thresh; + if (likely(txq->tx_rs_thresh > 1)) { + pk_next = txep->mbuf; + rte_prefetch0(pk_next); + txep->mbuf = NULL; + txep++; + l = tx_rs_thresh - 2; + } else { + pk_next = pk; + l = tx_rs_thresh - 1; + } if (!(txq->txq_flags & (uint32_t)ETH_TXQ_FLAGS_NOREFCOUNT)) { - for (i = 0; i < txq->tx_rs_thresh; ++i, ++txep) { - rte_mempool_put(txep->mbuf->pool, txep->mbuf); - txep->mbuf = NULL; + for (i = 0; i < tx_rs_thresh; ++i) { + struct rte_mbuf *mbuf = pk; + pk = pk_next; + if (likely(i < l)) { + pk_next = txep->mbuf; + rte_prefetch0(pk_next); + txep->mbuf = NULL; + txep++; + } + rte_mempool_put(mbuf->pool, mbuf); } } else { - for (i = 0; i < txq->tx_rs_thresh; ++i, ++txep) { - rte_pktmbuf_free_seg(txep->mbuf); - txep->mbuf = NULL; + for (i = 0; i < tx_rs_thresh; ++i) { + struct rte_mbuf *mbuf = pk; + pk = pk_next; + if (likely(i < l)) { + pk_next = txep->mbuf; + rte_prefetch0(pk_next); + txep->mbuf = NULL; + txep++; + } + rte_pktmbuf_free_seg(mbuf); } }
[dpdk-dev] [Patch] Eth Driver: Optimization for improved NIC processing rates
Hi Bruce!

Thank you for reviewing; sorry I didn't write as clearly as possible. I was trying to say more than "the performance improved". I didn't call out RFC 2544 since many people may not know much about it. I was also trying to convey what was observed and the conclusion derived from the observation without getting too big.

When the NIC processing loop rate is around 400,000/sec, the entry and exit savings are not easily observable when the average data rate variation from test to test is higher than the packet rate gain. If the RFC 2544 zero-loss convergence is set too fine, the time it takes to make a complete test increases substantially (I set my convergence to about 0.25% of line rate) at 60 seconds per measurement point. Unless the current convergence data rate is close to zero loss for the next point, a small improvement is not going to show up as a higher zero-loss rate. However, the test is a series of measurements, each with average latency and packet loss. Also, since the test equipment uses a predefined sequence algorithm that causes the same data rate to be generated for each test to a high degree of accuracy, the results for the same data rates can be compared across tests. If someone repeats the tests, I am pointing to the particular data to look at. One 60-second measurement by itself does not give sufficient accuracy to make a conclusion, but information correlated across multiple measurements gives the basis for a correct conclusion.

For l3fwd to be stable with i40e requires the queues to be increased (I use 2k) and the packet count to also be increased. This then gets 100% zero-loss line rate with 64-byte packets for 2 10 GbE connections (given the correct Fortville firmware). This makes it good for verifying the correct NIC firmware, but it does not work well for testing since the data is network limited. I have my own stable packet processing code which I used for testing. I have multiple programs, but during the optimization cycle I hit line rate and had to move to a 5-tuple processing program for a higher load to proceed. I have a doc that covers this setup and the optimization results, but it cannot be shared. Someone making their own measurements needs to have made sufficient tests to understand the stability of their test environment.

Mike

-Original Message-
From: Richardson, Bruce
Sent: Wednesday, October 28, 2015 3:45 AM
To: Polehn, Mike A
Cc: dev at dpdk.org
Subject: Re: [dpdk-dev] [Patch] Eth Driver: Optimization for improved NIC processing rates

On Tue, Oct 27, 2015 at 08:56:31PM +, Polehn, Mike A wrote:
> Prefetch of interface access variables while calling into driver RX and TX subroutines.
>
> For converging zero loss packet task tests, a small drop in latency
> for zero loss measurements and small drop in lost packet counts for
> the lossy measurement points was observed, indicating some savings of
> execution clock cycles.
>
Hi Mike,

the commit log message above seems a bit awkward to read. If I understand it correctly, would the below suggestion be a shorter, clearer equivalent?

    Prefetch RX and TX queue variables in ethdev before driver function call

    This has been measured to produce higher throughput and reduced latency
    in RFC 2544 throughput tests.

Or perhaps you could suggest yourself some similar wording. It would also be good to clarify with what applications the improvements were seen - was it using testpmd or l3fwd or something else?

Regards,
/Bruce
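For reference, the queue sizing mentioned above might look roughly like the sketch below in an l3fwd-style setup; the descriptor counts, the single queue per port, and the assumption that rte_eth_dev_configure() has already been called are placeholders, not values from the thread.

#include <rte_ethdev.h>
#include <rte_mempool.h>

#define NB_RXD  2048   /* enlarged RX descriptor ring ("queues to 2k") */
#define NB_TXD  2048   /* enlarged TX descriptor ring */

static int
setup_port_queues(uint16_t port, struct rte_mempool *mbuf_pool, int socket)
{
	int ret;

	/* Deeper rings need a correspondingly larger mbuf pool behind them */
	ret = rte_eth_rx_queue_setup(port, 0, NB_RXD, socket, NULL, mbuf_pool);
	if (ret < 0)
		return ret;

	return rte_eth_tx_queue_setup(port, 0, NB_TXD, socket, NULL);
}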
[dpdk-dev] [Patch v2] Eth driver optimization: Prefetch variable structure
Adds Eth driver prefetch variable structure to CPU cache 0 while calling into tx or rx device driver operation. RFC 2544 test of NIC task test measurement points show improvement of lower latency and/or better packet throughput indicating clock cycles saved. Signed-off-by: Mike A. Polehn diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h index 8a8c82b..09f1069 100644 --- a/lib/librte_ether/rte_ethdev.h +++ b/lib/librte_ether/rte_ethdev.h @@ -2357,11 +2357,15 @@ rte_eth_rx_burst(uint8_t port_id, uint16_t queue_id, struct rte_mbuf **rx_pkts, const uint16_t nb_pkts) { struct rte_eth_dev *dev; + void *rxq; dev = &rte_eth_devices[port_id]; - int16_t nb_rx = (*dev->rx_pkt_burst)(dev->data->rx_queues[queue_id], - rx_pkts, nb_pkts); + /* rxq is going to be immediately used, prefetch it */ + rxq = dev->data->rx_queues[queue_id]; + rte_prefetch0(rxq); + + int16_t nb_rx = (*dev->rx_pkt_burst)(rxq, rx_pkts, nb_pkts); #ifdef RTE_ETHDEV_RXTX_CALLBACKS struct rte_eth_rxtx_callback *cb = dev->post_rx_burst_cbs[queue_id]; @@ -2499,6 +2503,7 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id, struct rte_mbuf **tx_pkts, uint16_t nb_pkts) { struct rte_eth_dev *dev; + void *txq; dev = &rte_eth_devices[port_id]; @@ -2514,7 +2519,11 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id, } #endif - return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id], tx_pkts, nb_pkts); + /* txq is going to be immediately used, prefetch it */ + txq = dev->data->tx_queues[queue_id]; + rte_prefetch0(txq); + + return (*dev->tx_pkt_burst)(txq, tx_pkts, nb_pkts); } #endif
[dpdk-dev] How can I calculate/estimate pps(packet per seocond) and bps(bit per second) in DPDK pktg
I used the following code snippet with the i40e device; with a 1 second sample time it had very high accuracy for IPv4 UDP packets:

#define FLOWD_PERF_PACKET_OVERHEAD 24 /* CRC + Preamble + SOF + Interpacket gap */
#define FLOWD_REF_NETWORK_SPEED 10e9

double Ave_Bytes_per_Packet, Data_Rate, Net_Rate;
uint64_t Bits;
uint64_t Bytes = pFlow->flow.n_bytes - pMatch_Prev->flow.n_bytes;
uint64_t Packets = pFlow->flow.n_packets - pMatch_Prev->flow.n_packets;
uint64_t Time_us = pFlow->flow.flow_time_us - pMatch_Prev->flow.flow_time_us;

if (Bytes == 0)
	Ave_Bytes_per_Packet = 0.0;
else
	Ave_Bytes_per_Packet = ((double)Bytes / (double)Packets) + 4.0;

Bits = (Bytes + (Packets*FLOWD_PERF_PACKET_OVERHEAD)) * 8;
if (Bits == 0)
	Data_Rate = 0.0;
else
	Data_Rate = (((double)Bits) / Time_us) * 1e6;

if (Data_Rate == 0.0)
	Net_Rate = 0.0;
else
	Net_Rate = Data_Rate / FLOWD_REF_NETWORK_SPEED;

For packet rate:

double pk_rate = (((double)Packets) / ((double)Time_us)) * 1e6;

To calculate elapsed time in a DPDK app, use the CPU counter (will not work if the counter is being modified):

Initialization:

double flow_time_scale_us;
...
flow_time_scale_us = 1e6/rte_get_tsc_hz();

Elapsed time (uSec) example:

elapse_us = (rte_rdtsc() - entry->tsc_first_packet) *
	flow_time_scale_us; /* calc total elapsed us */

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Wiles, Keith
Sent: Tuesday, November 3, 2015 6:33 AM
To: Van Haaren, Harry; ???; dev at dpdk.org
Subject: Re: [dpdk-dev] How can I calculate/estimate pps(packet per seocond) and bps(bit per second) in DPDK pktg

On 11/3/15, 8:30 AM, "Van Haaren, Harry" wrote:

>Hi Keith,
>
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wiles, Keith
>
>> Hmm, I just noticed I did not include the FCS bytes. Does the NIC
>> include FCS bytes in the counters? Need to verify that one and if not then
>> it becomes a bit more complex.
>
>The Intel NICs count packet sizes inclusive of CRC / FCS, from eg the
>ixgbe/82599 datasheet:
>"This register includes bytes received in a packet from the <Destination
>Address> field through the <CRC> field, inclusively."

Thanks I assumed I had known that at the time :-)
>
>-Harry
>

Regards,
Keith
[dpdk-dev] How can I calculate/estimate pps(packet per seocond) and bps(bit per second) in DPDK pktg
Accessing registers on the NIC has very high access latency and will often stall the CPU waiting for the response, especially with multiple register reads and high-throughput packet data also being transferred. The size value was derived from the NIC writing a value to the descriptor table, which was then written to the packet buffer. The bitrate calculation included the FCS/CRC as packet overhead, and the packet size was 4 bytes short. The inclusion or exclusion of the FCS on receive might be a programmable option. For tx, it might be a flag set in the TX descriptor table to either use the FCS in the packet buffer or calculate it on the fly. Where you get the numbers and the initialization may affect the calculation.

A very important rating for a CPU is its FLOPs performance. Almost all modern CPUs do single-cycle floating point multiplies (divides are done with shifts and adds and take a clock per set bit in the float mantissa or in the int). Conversion to and from floating point is often done in parallel with other operations, which means using integer math is not always faster. Often the additional checks for edge conditions and adjustments needed with integer processing lose the gain, but it all depends on the exact algorithm and end scaling. Being able to do high quality integer processing is a good skill, especially when doing work like signal processing.

-Original Message-
From: Wiles, Keith
Sent: Tuesday, November 3, 2015 11:01 AM
To: Polehn, Mike A; Van Haaren, Harry; ???; dev at dpdk.org
Subject: Re: [dpdk-dev] How can I calculate/estimate pps(packet per seocond) and bps(bit per second) in DPDK pktg

On 11/3/15, 9:59 AM, "Polehn, Mike A" wrote:

>I used the following code snip-it with the i40e device, with 1 second sample time had very high accuracy for IPv4 UDP packets:
>
>#define FLOWD_PERF_PACKET_OVERHEAD 24 /* CRC + Preamble + SOF + Interpacket gap */
>#define FLOWD_REF_NETWORK_SPEED 10e9
>
>double Ave_Bytes_per_Packet, Data_Rate, Net_Rate;
>uint64_t Bits;
>uint64_t Bytes = pFlow->flow.n_bytes - pMatch_Prev->flow.n_bytes;
>uint64_t Packets = pFlow->flow.n_packets - pMatch_Prev->flow.n_packets;
>uint64_t Time_us = pFlow->flow.flow_time_us - pMatch_Prev->flow.flow_time_us;
>
>if (Bytes == 0)
>	Ave_Bytes_per_Packet = 0.0;
>else
>	Ave_Bytes_per_Packet = ((double)Bytes / (double)Packets) + 4.0;
>
>Bits = (Bytes + (Packets*FLOWD_PERF_PACKET_OVERHEAD)) * 8;
>if (Bits == 0)
>	Data_Rate = 0.0;
>else
>	Data_Rate = (((double)Bits) / Time_us) * 1e6;
>
>if (Data_Rate == 0.0)
>	Net_Rate = 0.0;
>else
>	Net_Rate = Data_Rate / FLOWD_REF_NETWORK_SPEED;
>
>For packet rate: double pk_rate = (((double)Packets)/ ((double)Time_us)) * 1e6;
>
>To calculate elapsed time in DPDK app, used CPU counter (will not work if counter is being modified):
>
>Initialization:
>double flow_time_scale_us;
>...
>flow_time_scale_us = 1e6/rte_get_tsc_hz();
>
>Elapsed time (uSec) example:
>
>elapse_us = (rte_rdtsc() - entry->tsc_first_packet) *
>	flow_time_scale_us; /* calc total elapsed us */

Looks reasonable. I assume the n_bytes does not include FCS, as is the case with the NIC counters.
Also I decided to avoid using double?s in my code and just used 64bit registers and integer math :-) > >-Original Message- >From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wiles, Keith >Sent: Tuesday, November 3, 2015 6:33 AM >To: Van Haaren, Harry; ???; dev at dpdk.org >Subject: Re: [dpdk-dev] How can I calculate/estimate pps(packet per >seocond) and bps(bit per second) in DPDK pktg > >On 11/3/15, 8:30 AM, "Van Haaren, Harry" wrote: > >>Hi Keith, >> >>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wiles, Keith >> >>> Hmm, I just noticed I did not include the FCS bytes. Does the NIC >>> include FCS bytes in the counters? Need to verify that one and if not then >>> it becomes a bit more complex. >> >>The Intel NICs count packet sizes inclusive of CRC / FCS, from eg the >>ixgbe/82599 datasheet: >>"This register includes bytes received in a packet from the >Address> field through the field, inclusively." > >Thanks I assumed I had known that at the time :-) >> >>-Harry >> > > >Regards, >Keith > > > > > Regards, Keith
[dpdk-dev] FW: [Patch v2] Eth driver optimization: Prefetch variable structure
My email address is my official email address and it can only be used with the official email system, or in other words the corporate MS windows email system. Can I use an oddball junk email address, such as a gmail account or non-returnable IP named Sendmail server (has no registered DNS name), to submit patches into dpdk.org user account, with patches signed with my official email address (which is different than the sending email address which is just junk name)? Mike -Original Message- From: Polehn, Mike A Sent: Tuesday, November 3, 2015 12:17 PM To: St Leger, Jim Subject: RE: [dpdk-dev] [Patch v2] Eth driver optimization: Prefetch variable structure I don't understand why a development system must also support a user email system and then also a full email server needed to deliver it. I have half a dozen servers I have various projects on... Seems like same server used to move git updates could also be made to move patch email for the project. -Original Message- From: St Leger, Jim Sent: Tuesday, November 3, 2015 7:36 AM To: Polehn, Mike A Subject: RE: [dpdk-dev] [Patch v2] Eth driver optimization: Prefetch variable structure Mike: If you need any help/guidance of navigating the DPDK.org forums and community reach out to some of the crew. Our Shannon and Shanghai teams have it down to a science, okay, an artful science anyway. And there are some in the States such as Keith Wiles (and Jeff Shaw up your way) who could also give some BKMs. Jim -Original Message- From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Thomas Monjalon Sent: Tuesday, November 3, 2015 8:03 AM To: Polehn, Mike A Cc: dev at dpdk.org Subject: Re: [dpdk-dev] [Patch v2] Eth driver optimization: Prefetch variable structure Hi, Please use git-send-email and check how titles are formatted in the git tree. Thanks
[dpdk-dev] How can I calculate/estimate pps(packet per seocond) and bps(bit per second) in DPDK pktg
The change in tsc value from rte_rdtsc() needs to be multiplied by the scale to convert from clocks to get change in seconds. For example from below: elapse_us = (rte_rdtsc() - entry->tsc_first_packet) * flow_time_scale_us; The bit rate requires the number of bytes passed in the time period then adjusted by the overhead of the number of packets transferred in the time period. #define FLOWD_PERF_PACKET_OVERHEAD 24 /* CRC + Preamble + SOF + Interpacket gap */ Bits = (Bytes + (Packets*FLOWD_PERF_PACKET_OVERHEAD)) * 8; Data_Rate = (((double)Bits) / Time_us) * 1e6; Integer math is very tricky and often is not any faster than floating point math when using multiplies except on the very low performance processors. Mike From: ??? [mailto:pnk...@naver.com] Sent: Tuesday, November 3, 2015 5:45 PM To: Polehn, Mike A; Wiles, Keith; Van Haaren, Harry; dev at dpdk.org Subject: RE: [dpdk-dev] How can I calculate/estimate pps(packet per seocond) and bps(bit per second) in DPDK pktg Dear Wiles, Keith , Van Haaren, Harry, Polehn, Mike A, Stephen Hemminger, Kyle Larose, and DPDK experts. I really appreciate for your precious answers and advices. I will find and study the corresponding codes and CRC checking. Last night, I tried to estimate bps and pps by using the following code. // rte_distributor_process() gets 64 mbufs packets at a time. // rte_distributor_process() gets packets from Intel? 82599ES 10 Gigabit Ethernet 2 port Controller (2 10gbE ports). int rte_distributor_process(struct rte_distributor *d, struct rte_mbuf **mbufs, unsigned num_mbufs) { uint64_t ticks_per_ms = rte_get_tsc_hz()/1000 ; uint64_t ticks_per_s = rte_get_tsc_hz() ; uint64_t ticks_per_s_div_8 = rte_get_tsc_hz()/8 ; uint64_t cur_tsc = 0, last_tsc = 0, sum_len, bps, pps ; cur_tsc = rte_rdtsc(); sum_len = 0 ; for (l=0; l < num_mbufs; l++ ) { sum_len += mbufs[l]->pkt_len ; } if ((cur_tsc - last_tsc)!=0) { bps = (sum_len * ticks_per_s_div_8 ) / (cur_tsc - last_tsc) ; pps = num_mbufs * ticks_per_s / (cur_tsc - last_tsc) ; } else bps = pps = 0 ; last_tsc = cur_tsc ; } I got max. bit per second = 6,835,440,833 for 20 Gbps 1500 bytes packet traffic, and got max. bit per second = 6,808,524,220 for 2 Gbps 1500 bytes packet traffic. I guess there can be packet burst, however the estimated value has too many errors. I will try the methods you proposed. Thank you very much. Sincerely Yours, Ick-Sung Choi. -Original Message- From: "Polehn, Mike A"mailto:mike.a.pol...@intel.com>> To: "Wiles, Keith"mailto:keith.wiles at intel.com>>; "Van Haaren, Harry"mailto:harry.van.haaren at intel.com>>; "???"mailto:pnk003 at naver.com>>; "dev at dpdk.org<mailto:dev at dpdk.org>"mailto:dev at dpdk.org>>; Cc: Sent: 2015-11-04 (?) 
00:59:34 Subject: RE: [dpdk-dev] How can I calculate/estimate pps(packet per seocond) and bps(bit per second) in DPDK pktg I used the following code snip-it with the i40e device, with 1 second sample time had very high accuracy for IPv4 UDP packets: #define FLOWD_PERF_PACKET_OVERHEAD 24 /* CRC + Preamble + SOF + Interpacket gap */ #define FLOWD_REF_NETWORK_SPEED 10e9 double Ave_Bytes_per_Packet, Data_Rate, Net_Rate; uint64_t Bits; uint64_t Bytes = pFlow->flow.n_bytes - pMatch_Prev->flow.n_bytes; uint64_t Packets = pFlow->flow.n_packets - pMatch_Prev->flow.n_packets; uint64_t Time_us = pFlow->flow.flow_time_us - pMatch_Prev->flow.flow_time_us; if (Bytes == 0) Ave_Bytes_per_Packet = 0.0; else Ave_Bytes_per_Packet = ((double)Bytes / (double)Packets) + 4.0; Bits = (Bytes + (Packets*FLOWD_PERF_PACKET_OVERHEAD)) * 8; if (Bits == 0) Data_Rate = 0.0; else Data_Rate = (((double)Bits) / Time_us) * 1e6; if (Data_Rate == 0.0) Net_Rate = 0.0; else Net_Rate = Data_Rate / FLOWD_REF_NETWORK_SPEED; For packet rate: double pk_rate = (((double)Packets)/ ((double)Time_us)) * 1e6; To calculate elapsed time in DPDK app, used CPU counter (will not work if counter is being modified): Initialization: double flow_time_scale_us; ... flow_time_scale_us = 1e6/rte_get_tsc_hz(); Elapsed time (uSec) example: elapse_us = (rte_rdtsc() - entry->tsc_first_packet) * flow_time_scale_us; /* calc total elapsed us */ -Original Message- From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Wiles, Keith Sent: Tuesday, November 3, 2015 6:33 AM To: Van Haaren, Harry; ???; dev at dpdk.org<mailto:dev at dpdk.org> Subject: Re: [dpdk-dev] How can I calculate/estimate pps(packet per seocond) and bps(bit per second) in DPDK pktg On 11/3/15, 8:30 AM, "Van Haaren, Harry" mailto:harry.van.haaren at intel.com>> wrote: >Hi Keith, > >> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wiles, Keith > >> Hmm, I jus
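Pulling the corrections from this thread together (persistent accumulators, a tsc delta scaled by the counter frequency, 24 bytes of per-packet wire overhead on top of byte counts that exclude the 4-byte FCS), a rate sampler might look like the sketch below; the state structure and its field names are assumptions, not code from the thread. As a sanity check, a 10 GbE link carries at most 10e9 / ((64 + 20) * 8) ≈ 14.88 million 64-byte frames per second.

#include <stdint.h>
#include <rte_cycles.h>

#define WIRE_OVERHEAD 24   /* FCS + preamble + SOF + inter-packet gap */

/* Assumed per-port (or per-flow) accumulator updated on the fast path */
struct rate_state {
	uint64_t bytes;        /* payload bytes, FCS not included */
	uint64_t packets;
	uint64_t last_tsc;     /* tsc value at the previous sample */
	double   tsc_to_us;    /* set once: 1e6 / rte_get_tsc_hz() */
};

static void
rate_sample(struct rate_state *s, double *pps, double *bps)
{
	uint64_t now = rte_rdtsc();
	double us = (now - s->last_tsc) * s->tsc_to_us;  /* elapsed microseconds */

	if (us > 0.0) {
		uint64_t bits = (s->bytes + s->packets * WIRE_OVERHEAD) * 8;

		*pps = ((double)s->packets / us) * 1e6;
		*bps = ((double)bits / us) * 1e6;
	} else {
		*pps = 0.0;
		*bps = 0.0;
	}

	/* Restart the measurement window */
	s->last_tsc = now;
	s->bytes = 0;
	s->packets = 0;
}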
[dpdk-dev] SR-IOV: API to tell VF from PF
I can think of a very good reason to want to know if the device is a VF or PF.

The VF has to go through a layer 2 switch, which does not allow it to just receive anything coming across the Ethernet.

The PF can receive all the packets, including packets with different NIC addresses. This allows the packets to be just data and allows the processing of data without needing to adjust each NIC L2 address before sending through to the Ethernet. So data can be moved through a series of NICs between systems without the extra processing. Not doing unnecessary processing leaves more clock cycles to do high value processing.

Mike

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Bruce Richardson
Sent: Thursday, November 5, 2015 1:51 AM
To: Shaham Fridenberg
Cc: dev at dpdk.org
Subject: Re: [dpdk-dev] SR-IOV: API to tell VF from PF

On Thu, Nov 05, 2015 at 09:39:19AM +, Shaham Fridenberg wrote:
> Hey all,
>
> Is there some API to tell VF from PF?
>
> Only way I found so far is deducing that from driver name in the
> rte_eth_devices struct.
>
> Thanks,
> Shaham

Hi Shaham,

yes, checking the driver name is probably the only way to do so. However, why do you need or want to know this? If you want to know the capabilities of a device, basing it on a list of known device types is probably not the best way.

Regards,
/Bruce
[dpdk-dev] SR-IOV: API to tell VF from PF
A VF should support promiscuous mode; however, this is different from a PF supporting promiscuous mode. What happens to network throughput, which is tied to PCIe throughput, when, say, 4 VFs are each in promiscuous mode? It should support it, but with a very negative effect. Not all NICs are created equal.

The program should be able to query the device driver and determine whether the correct NIC type is being used. The device driver type should only match the device type, which should be specific to VF or PF.

Mike

-Original Message-
From: Richardson, Bruce
Sent: Thursday, November 5, 2015 7:51 AM
To: Polehn, Mike A; Shaham Fridenberg
Cc: dev at dpdk.org
Subject: RE: [dpdk-dev] SR-IOV: API to tell VF from PF

> -----Original Message-----
> From: Polehn, Mike A
> Sent: Thursday, November 5, 2015 3:43 PM
> To: Richardson, Bruce; Shaham Fridenberg
> Cc: dev at dpdk.org
> Subject: RE: [dpdk-dev] SR-IOV: API to tell VF from PF
>
> I can think of a very good reason to want to know if the device is a VF
> or PF.
>
> The VF has to go through a layer 2 switch, which does not allow it to just
> receive anything coming across the Ethernet.
>
> The PF can receive all the packets, including packets with different
> NIC addresses. This allows the packets to be just data and allows the
> processing of data without needing to adjust each NIC L2 address
> before sending through to the Ethernet. So data can be moved through a
> series of NICs between systems without the extra processing. Not doing
> unnecessary processing leaves more clock cycles to do high value
> processing.
>
> Mike
>
Yes, the capabilities of the different types of devices are different. However, is a better solution not to provide the ability to query a NIC if it supports promiscuous mode, rather than set up a specific query for a VF? What if (hypothetically) you get a PF that doesn't support promiscuous mode, for instance, or a bifurcated driver where the kernel part prevents the userspace part from enabling promiscuous mode? In both these cases a direct feature query works better than asking about PF/VF.

Regards,
/Bruce

> -----Original Message-----
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson
> Sent: Thursday, November 5, 2015 1:51 AM
> To: Shaham Fridenberg
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] SR-IOV: API to tell VF from PF
>
> On Thu, Nov 05, 2015 at 09:39:19AM +, Shaham Fridenberg wrote:
> > Hey all,
> >
> > Is there some API to tell VF from PF?
> >
> > Only way I found so far is deducing that from driver name in the
> > rte_eth_devices struct.
> >
> > Thanks,
> > Shaham
>
> Hi Shaham,
>
> yes, checking the driver name is probably the only way to do so.
> However, why do you need or want to know this? If you want to know the
> capabilities of a device basing it on a list of known device types is
> probably not the best way.
>
> Regards,
> /Bruce
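A small sketch of the driver-name check Shaham describes (the substring test is an assumption; exact driver name strings differ per PMD, so querying the specific capability is the more robust route, as Bruce suggests):

#include <string.h>
#include <rte_ethdev.h>

/* Returns 1 if the port's PMD name suggests an SR-IOV virtual function. */
static int
port_is_vf(uint16_t port_id)
{
	struct rte_eth_dev_info info;

	rte_eth_dev_info_get(port_id, &info);
	return strstr(info.driver_name, "vf") != NULL;
}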
[dpdk-dev] [PATCH v2] ethdev: Prefetch driver variable structure
Adds ethdev driver prefetch of variable structure to CPU cache 0 while calling into tx or rx device driver operation. RFC 2544 test of NIC task test measurement points show improvement of lower latency and/or better packet throughput indicating clock cycles saved. Signed-off-by: Mike A. Polehn --- lib/librte_ether/rte_ethdev.h | 16 +--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h index 48a540d..f1c35de 100644 --- a/lib/librte_ether/rte_ethdev.h +++ b/lib/librte_ether/rte_ethdev.h @@ -2458,12 +2458,17 @@ rte_eth_rx_burst(uint8_t port_id, uint16_t queue_id, struct rte_mbuf **rx_pkts, const uint16_t nb_pkts) { struct rte_eth_dev *dev; + int16_t nb_rx; dev = &rte_eth_devices[port_id]; - int16_t nb_rx = (*dev->rx_pkt_burst)(dev->data->rx_queues[queue_id], -rx_pkts, nb_pkts); + { /* limit scope of rxq variable */ + /* rxq is going to be immediately used, prefetch it */ + void *rxq = dev->data->rx_queues[queue_id]; + rte_prefetch0(rxq); + nb_rx = (*dev->rx_pkt_burst)(rxq, rx_pkts, nb_pkts); + } #ifdef RTE_ETHDEV_RXTX_CALLBACKS struct rte_eth_rxtx_callback *cb = dev->post_rx_burst_cbs[queue_id]; @@ -2600,6 +2605,7 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id, struct rte_mbuf **tx_pkts, uint16_t nb_pkts) { struct rte_eth_dev *dev; + void *txq; dev = &rte_eth_devices[port_id]; @@ -2615,7 +2621,11 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id, } #endif - return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id], tx_pkts, nb_pkts); + /* txq is going to be immediately used, prefetch it */ + txq = dev->data->tx_queues[queue_id]; + rte_prefetch0(txq); + + return (*dev->tx_pkt_burst)(txq, tx_pkts, nb_pkts); } #endif -- 2.6.0
[dpdk-dev] [PATCH v2] ethdev: Prefetch driver variable structure
It is probably the usual MS operation issues, I'll resubmit. -Original Message- From: Stephen Hemminger [mailto:step...@networkplumber.org] Sent: Tuesday, November 10, 2015 9:03 AM To: Polehn, Mike A Cc: dev at dpdk.org Subject: Re: [dpdk-dev] [PATCH v2] ethdev: Prefetch driver variable structure On Tue, 10 Nov 2015 14:17:41 +0000 "Polehn, Mike A" wrote: > Adds ethdev driver prefetch of variable structure to CPU cache 0 while > calling into tx or rx device driver operation. > > RFC 2544 test of NIC task test measurement points show improvement of > lower latency and/or better packet throughput indicating clock cycles > saved. > > Signed-off-by: Mike A. Polehn Good idea, but lots of whitespace issues. Please also check your mail client.. ERROR: patch seems to be corrupt (line wrapped?) #80: FILE: lib/librte_ether/rte_ethdev.h:2457: , WARNING: please, no spaces at the start of a line #84: FILE: lib/librte_ether/rte_ethdev.h:2460: + int16_t nb_rx;$ WARNING: please, no spaces at the start of a line #89: FILE: lib/librte_ether/rte_ethdev.h:2462: + { /* limit scope of rxq variable */$ ERROR: code indent should use tabs where possible #90: FILE: lib/librte_ether/rte_ethdev.h:2463: + /* rxq is going to be immediately used, prefetch it */$ ERROR: code indent should use tabs where possible #91: FILE: lib/librte_ether/rte_ethdev.h:2464: + void *rxq =3D dev->data->rx_queues[queue_id];$ WARNING: please, no spaces at the start of a line #91: FILE: lib/librte_ether/rte_ethdev.h:2464: + void *rxq =3D dev->data->rx_queues[queue_id];$ ERROR: spaces required around that '=' (ctx:WxV) #91: FILE: lib/librte_ether/rte_ethdev.h:2464: + void *rxq =3D dev->data->rx_queues[queue_id]; ^ ERROR: code indent should use tabs where possible #92: FILE: lib/librte_ether/rte_ethdev.h:2465: + rte_prefetch0(rxq);$ WARNING: Missing a blank line after declarations #92: FILE: lib/librte_ether/rte_ethdev.h:2465: + void *rxq =3D dev->data->rx_queues[queue_id]; + rte_prefetch0(rxq); WARNING: please, no spaces at the start of a line #92: FILE: lib/librte_ether/rte_ethdev.h:2465: + rte_prefetch0(rxq);$ ERROR: code indent should use tabs where possible #93: FILE: lib/librte_ether/rte_ethdev.h:2466: + nb_rx =3D (*dev->rx_pkt_burst)(rxq, rx_pkts, nb_pkts);$ WARNING: please, no spaces at the start of a line #93: FILE: lib/librte_ether/rte_ethdev.h:2466: + nb_rx =3D (*dev->rx_pkt_burst)(rxq, rx_pkts, nb_pkts);$ WARNING: space prohibited between function name and open parenthesis '(' #93: FILE: lib/librte_ether/rte_ethdev.h:2466: + nb_rx =3D (*dev->rx_pkt_burst)(rxq, rx_pkts, nb_pkts); ERROR: spaces required around that '=' (ctx:WxV) #93: FILE: lib/librte_ether/rte_ethdev.h:2466: + nb_rx =3D (*dev->rx_pkt_burst)(rxq, rx_pkts, nb_pkts); ^ WARNING: please, no spaces at the start of a line #94: FILE: lib/librte_ether/rte_ethdev.h:2467: + }$ WARNING: please, no spaces at the start of a line #102: FILE: lib/librte_ether/rte_ethdev.h:2607: + void *txq;$ WARNING: please, no spaces at the start of a line #110: FILE: lib/librte_ether/rte_ethdev.h:2624: + txq =3D dev->data->tx_queues[queue_id];$ ERROR: spaces required around that '=' (ctx:WxV) #110: FILE: lib/librte_ether/rte_ethdev.h:2624: + txq =3D dev->data->tx_queues[queue_id]; ^ WARNING: please, no spaces at the start of a line #111: FILE: lib/librte_ether/rte_ethdev.h:2625: + rte_prefetch0(txq);$ WARNING: please, no spaces at the start of a line #113: FILE: lib/librte_ether/rte_ethdev.h:2627: + return (*dev->tx_pkt_burst)(txq, tx_pkts, nb_pkts);$ total: 8 errors, 12 warnings, 38 lines checked
[dpdk-dev] [PATCH v2] ethdev: Prefetch driver variable structure
Adds ethdev driver prefetch of variable structure to CPU cache 0 while calling into tx or rx device driver operation. RFC 2544 test of NIC task test measurement points show improvement of lower latency and/or better packet throughput indicating clock cycles saved. Signed-off-by: Mike A. Polehn --- lib/librte_ether/rte_ethdev.h | 16 +--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h index 48a540d..f1c35de 100644 --- a/lib/librte_ether/rte_ethdev.h +++ b/lib/librte_ether/rte_ethdev.h @@ -2458,12 +2458,17 @@ rte_eth_rx_burst(uint8_t port_id, uint16_t queue_id, struct rte_mbuf **rx_pkts, const uint16_t nb_pkts) { struct rte_eth_dev *dev; + int16_t nb_rx; dev = &rte_eth_devices[port_id]; - int16_t nb_rx = (*dev->rx_pkt_burst)(dev->data->rx_queues[queue_id], - rx_pkts, nb_pkts); + { /* limit scope of rxq variable */ + /* rxq is going to be immediately used, prefetch it */ + void *rxq = dev->data->rx_queues[queue_id]; + rte_prefetch0(rxq); + nb_rx = (*dev->rx_pkt_burst)(rxq, rx_pkts, nb_pkts); + } #ifdef RTE_ETHDEV_RXTX_CALLBACKS struct rte_eth_rxtx_callback *cb = dev->post_rx_burst_cbs[queue_id]; @@ -2600,6 +2605,7 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id, struct rte_mbuf **tx_pkts, uint16_t nb_pkts) { struct rte_eth_dev *dev; + void *txq; dev = &rte_eth_devices[port_id]; @@ -2615,7 +2621,11 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id, } #endif - return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id], tx_pkts, nb_pkts); + /* txq is going to be immediately used, prefetch it */ + txq = dev->data->tx_queues[queue_id]; + rte_prefetch0(txq); + + return (*dev->tx_pkt_burst)(txq, tx_pkts, nb_pkts); } #endif -- 2.6.0