[dpdk-dev] DPDK OVS on Ubuntu 14.04

2015-12-01 Thread Polehn, Mike A
You may need to set up huge pages on the kernel boot line (this is an example; you 
may need to adjust it):

The huge page configuration can be added to the default configuration file 
/etc/default/grub by appending it to GRUB_CMDLINE_LINUX_DEFAULT, after which the 
grub configuration file is regenerated to get an updated configuration file for 
Linux boot. 
# vim /etc/default/grub    # edit the file

. . .
GRUB_CMDLINE_LINUX_DEFAULT="... default_hugepagesz=1G hugepagesz=1G 
hugepages=4 hugepagesz=2M hugepages=2048 ..."
. . .


This example sets up both page sizes: four 1 GB pages (4 GB of 1 GB hugepage 
memory) and 2048 2 MB pages (4 GB of 2 MB hugepage memory). After boot the number 
of 1 GB pages cannot be changed, but the number of 2 MB pages can be changed.

After editing the configuration file /etc/default/grub, the new grub.cfg boot file 
needs to be regenerated: 
# update-grub

Then reboot. After the reboot, the hugetlbfs mount points need to be set up:

If /dev/hugepages does not exist:
# mkdir /dev/hugepages

# mount -t hugetlbfs nodev /dev/hugepages

# mkdir /dev/hugepages_2mb
# mount -t hugetlbfs nodev /dev/hugepages_2mb -o pagesize=2MB

Mike

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Abhijeet Karve
Sent: Monday, November 30, 2015 10:14 PM
To: dev at dpdk.org
Cc: bhavya.addep at gmail.com
Subject: [dpdk-dev] DPDK OVS on Ubuntu 14.04

Dear All,


We are trying to install DPDK OVS on top of OpenStack Juno on a single Ubuntu 
14.04 server. We are following the steps below for this.

https://software.intel.com/en-us/blogs/2015/06/09/building-vhost-user-for-ovs-today-using-dpdk-200

During execution we are getting some issues with the ovs-vswitchd service, as it 
hangs during startup.
_

nfv-dpdk at nfv-dpdk:~$ tail -f /var/log/openvswitch/ovs-vswitchd.log
2015-11-24T10:54:34.036Z|6|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2015-11-24T10:54:34.036Z|7|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2015-11-24T10:54:34.064Z|8|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.4.90
2015-11-24T11:03:42.957Z|2|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2015-11-24T11:03:42.958Z|3|ovs_numa|INFO|Discovered 24 CPU cores on NUMA node 0
2015-11-24T11:03:42.958Z|4|ovs_numa|INFO|Discovered 24 CPU cores on NUMA node 1
2015-11-24T11:03:42.958Z|5|ovs_numa|INFO|Discovered 2 NUMA nodes and 48 CPU cores
2015-11-24T11:03:42.958Z|6|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2015-11-24T11:03:42.958Z|7|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2015-11-24T11:03:42.961Z|8|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.4.90
_

Also, attaching the output (Hugepage.txt) of "./ovs-vswitchd --dpdk -c 0x0FF8 -n 4 
--socket-mem 1024,0 -- --log-file=/var/log/openvswitch/ovs-vswitchd.log
--pidfile=/var/run/oppenvswitch/ovs-vswitchd.pid"

- We tried setting echo 0 > 
/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages, but could not succeed.
Can anyone please point out anything we are missing that is causing ovs-vswitchd 
to get stuck while starting?

Also, when we create a VM in OpenStack with DPDK OVS, dpdkvhost-user type 
interfaces are created automatically. If these interfaces are getting mapped to 
the regular br-int bridge rather than the DPDK bridge br0, does this mean that we 
have successfully enabled DPDK with the netdev datapath?



We would really appreciate any advice you may have.

Thanks & Regards,
Abhijeet Karve





[dpdk-dev] Does anybody know OpenDataPlane

2015-12-02 Thread Polehn, Mike A
I don't think you have researched this enough. 
Asking this question shows that you are just beginning your research or do not 
understand how this fits into current telco NFV/SDN efforts.

Why does this exist: "OpenDataPlane using DPDK for Intel NIC", listed below? 
Why would competing technologies use the competition's technology to solve a 
problem?

Maybe you can change your thesis to "Current Open Source Dataplane Methods" 
and do a comparison between the two. However, if you just look at the sales 
documentation then you may not understand the real difference.

Mike

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Kury Nicolas
Sent: Wednesday, December 2, 2015 6:22 AM
To: dev at dpdk.org
Subject: [dpdk-dev] Does anybody know OpenDataPlane

Hi!


Does anybody know OpenDataPlane? http://www.opendataplane.org/ It is a 
framework designed to enable software portability between networking SoCs, 
regardless of the underlying instruction set architecture. There are several 
implementations:

  *   OpenDataPlane using DPDK for Intel NIC
  *   OpenDataPlane using DPAA for Freescale platforms (QorIQ)
  *   OpenDataPlane using MCSDK for Texas Instruments platforms (KeyStone II)
  *   etc.

When a developer wants to port his application, he just needs to recompile it 
with the implementation of OpenDataPlane related to the new platform.


I'm doing my Master's thesis on OpenDataPlane and I have some questions.

- Now that OpenDataPlane (ODP) exists, should every developer start a new 
project with ODP, or are there some reasons to still use DPDK? What do you 
think?


Thank you very much

Nicolas




[dpdk-dev] Does anybody know OpenDataPlane

2015-12-02 Thread Polehn, Mike A
A hint of the fundamental difference:
one originated somewhat more from an embedded orientation and one originated 
somewhat more from a server orientation. Both efforts are driving towards each 
other and have overlap.

Mike





[dpdk-dev] rte_prefetch0() is effective?

2016-01-13 Thread Polehn, Mike A
Prefetches make a big difference because a powerful CPU like IA is always trying 
to find items to prefetch, and the priority of these is not always easy to 
determine. This is especially a problem across subroutine calls, since the 
compiler cannot determine what has priority in the other subroutines, and the 
runtime CPU logic cannot always predict the future far enough ahead for all 
possible paths, especially if you have a cache miss, which takes eons of clock 
cycles for the memory access and probably results in a CPU stall.

Until we get to the point where computers fully understand the logic of a 
program and write optimum code (putting programmers out of business), the 
programmer's understanding of what is important as the program progresses tells 
the programmer what is desirable to prefetch. It is difficult to determine 
whether the CPU is going to give the prefetch the same priority, so a prefetch 
may or may not show up as a measurable performance improvement under some 
conditions, but having the prefetch decision in place can make the prefetch 
priority decision correct in the other cases, which does make a performance 
improvement.

Removing a prefetch without thinking through and fully understanding why it is 
there, or what the added cost is (in the case where calculating an address for 
the prefetch affects other current operations), if any, is just plain amateur 
work. That is not to say people do not make bad judgments about what needs to be 
prefetched and place prefetches poorly; a prefetch should only be removed if it 
is not logically proper for the expected runtime operation.

Only more primitive CPUs with no prefetch capabilities don't benefit from 
properly placed prefetches. 
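
A minimal sketch of the prefetch pipelining pattern being discussed, in the style 
of the l3fwd example quoted later in this thread; process_burst() and 
handle_packet() are hypothetical names, with handle_packet() standing in for the 
real per-packet work:

#include <rte_mbuf.h>
#include <rte_prefetch.h>

#define PREFETCH_OFFSET 3

/* Hypothetical stand-in for the real per-packet work. */
static inline void
handle_packet(struct rte_mbuf *m)
{
    rte_pktmbuf_free(m);
}

static void
process_burst(struct rte_mbuf **pkts, uint16_t nb_rx)
{
    uint16_t j;

    /* Prime the pipeline: start fetching the first few packets. */
    for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++)
        rte_prefetch0(rte_pktmbuf_mtod(pkts[j], void *));

    /* Prefetch ahead while handling packets whose data is already in flight. */
    for (j = 0; j + PREFETCH_OFFSET < nb_rx; j++) {
        rte_prefetch0(rte_pktmbuf_mtod(pkts[j + PREFETCH_OFFSET], void *));
        handle_packet(pkts[j]);
    }

    /* Drain the remaining packets; their prefetches were already issued. */
    for (; j < nb_rx; j++)
        handle_packet(pkts[j]);
}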

Mike

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Bruce Richardson
Sent: Wednesday, January 13, 2016 3:35 AM
To: Moon-Sang Lee
Cc: dev at dpdk.org
Subject: Re: [dpdk-dev] rte_prefetch0() is effective?

On Thu, Dec 24, 2015 at 03:35:14PM +0900, Moon-Sang Lee wrote:
> I see code as below in the example directory, and I wonder whether it is effective.
> Coherent I/O is adopted in modern architectures, so I think that the DMA 
> initiated by rte_eth_rx_burst() might already fill the cache lines of 
> the RX buffers.
> Do I really need to call rte_prefetchX()?
> 
> nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst, 
> MAX_PKT_BURST);
> ...
> /* Prefetch and forward already prefetched packets */
> for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
> rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
> j + PREFETCH_OFFSET], void *));
> l3fwd_simple_forward(pkts_burst[j], portid,
> qconf);
> }
> 

Good question.
When the first example apps using this style of prefetch were originally 
written, yes, there was a noticeable performance increase achieved by using the 
prefetch.
Thereafter, I'm not sure that anyone has checked with each generation of 
platforms whether the prefetches are still necessary and how much they help, 
but I suspect that they still help a bit, and don't hurt performance.
It would be an interesting exercise to check whether the prefetch offsets used 
in code like above can be adjusted to give better performance on our latest 
supported platforms.

/Bruce


[dpdk-dev] [PATCH] vhost: remove lockless enqueue to the virtio ring

2016-01-19 Thread Polehn, Mike A
SMP operations can be very expensive, sometimes impacting operations by 100s 
to 1000s of clock cycles depending on the circumstances of the synchronization. 
It is how you arrange the SMP operations within the tasks at hand, across the 
SMP cores, that gives the methods for top performance. Using traditional 
general purpose SMP methods will result in traditional general purpose 
performance. Migrating from expert techniques (understood by a much smaller 
group of expert programmers focused on performance) to general libraries 
(understood by most general purpose programmers) will greatly reduce the value 
of DPDK, since the end result will be lower performance and/or less predictable 
operation, where rate performance, predictability, and low latency are the 
primary goals.

The best method to date for multiple outputs to a single port is to use a DPDK 
queue with multiple producers and a single consumer: the SMP operation covers 
the multiple sources feeding the queue, and a single non-SMP task drains it to 
output to the port (that is why the ports are not SMP protected). Also, when 
considerable contention from multiple sources occurs often (data feeding at the 
same time), having the DPDK queue's input and output variables in separate 
cache lines can give a notable throughput improvement.
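
A minimal sketch of that arrangement, assuming tx_ring was created at 
initialization with rte_ring_create("tx_ring", 4096, rte_socket_id(), 
RING_F_SC_DEQ); producer_enqueue() and consumer_drain() are hypothetical names:

#include <rte_ring.h>
#include <rte_mbuf.h>
#include <rte_ethdev.h>

#define BURST 32

struct rte_ring *tx_ring;  /* multi-producer, single-consumer ring */

/* Producer cores: SMP-safe enqueue of their output bursts. */
static void
producer_enqueue(struct rte_mbuf **pkts, unsigned n)
{
    unsigned sent = rte_ring_mp_enqueue_burst(tx_ring, (void **)pkts, n);

    while (sent < n)  /* drop what did not fit */
        rte_pktmbuf_free(pkts[sent++]);
}

/* Single consumer core: the only caller of rte_eth_tx_burst() on this port,
 * so the unprotected PMD TX path needs no lock. */
static void
consumer_drain(uint8_t port, uint16_t queue)
{
    struct rte_mbuf *pkts[BURST];
    unsigned n = rte_ring_sc_dequeue_burst(tx_ring, (void **)pkts, BURST);
    unsigned sent = rte_eth_tx_burst(port, queue, pkts, (uint16_t)n);

    while (sent < n)
        rte_pktmbuf_free(pkts[sent++]);
}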

Mike 

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Xie, Huawei
Sent: Tuesday, January 19, 2016 8:44 AM
To: Tan, Jianfeng; dev at dpdk.org
Cc: ann.zhuangyanying at huawei.com
Subject: Re: [dpdk-dev] [PATCH] vhost: remove lockless enqueue to the virtio 
ring

On 1/20/2016 12:25 AM, Tan, Jianfeng wrote:
> Hi Huawei,
>
> On 1/4/2016 10:46 PM, Huawei Xie wrote:
>> This patch removes the internal lockless enqueue implementation.
>> DPDK doesn't support receiving/transmitting packets from/to the same 
>> queue. The vhost PMD wraps a vhost device as a normal DPDK port. DPDK 
>> applications normally have their own lock implementation when enqueuing 
>> packets to the same queue of a port.
>>
>> The atomic cmpset is a costly operation. This patch should help 
>> performance a bit.
>>
>> Signed-off-by: Huawei Xie 
>> ---
>>   lib/librte_vhost/vhost_rxtx.c | 86
>> +--
>>   1 file changed, 25 insertions(+), 61 deletions(-)
>>
>> diff --git a/lib/librte_vhost/vhost_rxtx.c 
>> b/lib/librte_vhost/vhost_rxtx.c index bbf3fac..26a1b9c 100644
>> --- a/lib/librte_vhost/vhost_rxtx.c
>> +++ b/lib/librte_vhost/vhost_rxtx.c
>
> I think vhost example will not work well with this patch when
> vm2vm=software.
>
> Test case:
> Two virtio ports handled by two pmd threads. Thread 0 polls pkts from
> physical NIC and sends to virtio0, while thread0 receives pkts from
> virtio1 and routes it to virtio0.

A vhost port will be wrapped as a port by the vhost PMD. A DPDK app treats all
physical and virtual ports equally. When two DPDK threads try
to enqueue to the same port, the app needs to consider the contention.
None of the physical PMDs support concurrent enqueuing/dequeuing.
The vhost PMD should expose the same behavior unless it is absolutely necessary
to expose a difference between different PMDs.

>
>> -
>>   *(volatile uint16_t *)&vq->used->idx += entry_success;
>
> Another unrelated question: We ever try to move this assignment out of
> loop to save cost as it's a data contention?

This operation itself is not that costly, but it has a side effect on the
cache transfer.
It is outside of the loop for the non-mergeable case. For the mergeable case, it
is inside the loop.
Actually there are pros and cons to whether we do this in a burst or in smaller
steps. I prefer to move it outside of the loop. Let us address this later.

>
> Thanks,
> Jianfeng
>
>



[dpdk-dev] rte_mbuf size for jumbo frame

2016-01-26 Thread Polehn, Mike A
Jumbo frames are generally handled by linked lists of mbufs (called chained or 
segmented mbufs).
Enabling jumbo frames for the device driver should enable the portion of the 
driver which handles these chained mbufs.

Don't make the mbufs huge.
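
A minimal sketch of such a configuration, assuming a PMD that supports scattered 
RX and using the DPDK 2.x rte_eth_conf field names; the pool name, queue and pool 
sizes, and setup_jumbo_port() are hypothetical values for illustration:

#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_lcore.h>

#define JUMBO_FRAME_LEN 9600

static int
setup_jumbo_port(uint8_t port)
{
    /* Normal-sized mbufs; one jumbo frame arrives as a chain of them. */
    struct rte_mempool *pool = rte_pktmbuf_pool_create("jumbo_pool",
            8192, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    struct rte_eth_conf conf = {
        .rxmode = {
            .max_rx_pkt_len = JUMBO_FRAME_LEN,
            .jumbo_frame    = 1, /* accept frames above the standard size */
            .enable_scatter = 1, /* spread one frame over chained mbufs */
        },
    };

    if (pool == NULL)
        return -1;
    if (rte_eth_dev_configure(port, 1, 1, &conf) < 0)
        return -1;
    if (rte_eth_rx_queue_setup(port, 0, 512,
            rte_eth_dev_socket_id(port), NULL, pool) < 0)
        return -1;
    /* Some PMDs also want the MTU raised explicitly; not all support it. */
    return rte_eth_dev_set_mtu(port, JUMBO_FRAME_LEN);
}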

Mike 

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Masaru OKI
Sent: Monday, January 25, 2016 2:41 PM
To: Saurabh Mishra; users at dpdk.org; dev at dpdk.org
Subject: Re: [dpdk-dev] rte_mbuf size for jumbo frame

Hi,

1. Take care of the element size of the mempool for mbufs.
2. Call rte_eth_dev_set_mtu() for each interface.
Note that some PMDs do not support changing the MTU.

On 2016/01/26 6:02, Saurabh Mishra wrote:
> Hi,
>
> We wanted to use a 10400-byte size for each rte_mbuf to enable jumbo frames.
> Do you guys see any problem with that? Would all the drivers like 
> ixgbe, i40e, vmxnet3, virtio and bnx2x work with larger rte_mbuf size?
>
> We would want to avoid dealing with chained mbufs.
>
> /Saurabh


[dpdk-dev] [Dpdk-ovs] problem in binding interfaces of virtio-pci on the VM

2015-02-26 Thread Polehn, Mike A
In this example, the control network interface, 00:03.0, remains unbound from the 
UIO driver and stays attached to the Linux device driver (for ssh access with 
PuTTY), and just the target interfaces are bound.
Below, the output shows which interfaces are bound to the uio driver; interfaces 
bound to the uio driver are not usable until a task uses the UIO driver. 

[root at F21vm l3fwd-vf]# lspci -nn
00:00.0 Host bridge [0600]: Intel Corporation 440FX - 82441FX PMC [Natoma] 
[8086:1237] (rev 02)
00:01.0 ISA bridge [0601]: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton 
II] [8086:7000]
00:01.1 IDE interface [0101]: Intel Corporation 82371SB PIIX3 IDE 
[Natoma/Triton II] [8086:7010]
00:01.3 Bridge [0680]: Intel Corporation 82371AB/EB/MB PIIX4 ACPI [8086:7113] 
(rev 03)
00:02.0 VGA compatible controller [0300]: Cirrus Logic GD 5446 [1013:00b8]
00:03.0 Ethernet controller [0200]: Red Hat, Inc Virtio network device 
[1af4:1000]
00:04.0 Ethernet controller [0200]: Intel Corporation XL710/X710 Virtual 
Function [8086:154c] (rev 01)
00:05.0 Ethernet controller [0200]: Intel Corporation XL710/X710 Virtual 
Function [8086:154c] (rev 01)

[root at F21vm l3fwd-vf]# /usr/src/dpdk/tools/dpdk_nic_bind.py --bind=igb_uio 
00:04.0
[root at F21vm l3fwd-vf]# /usr/src/dpdk/tools/dpdk_nic_bind.py --bind=igb_uio 
00:05.0
[root at F21vm l3fwd-vf]# /usr/src/dpdk/tools/dpdk_nic_bind.py --status

Network devices using DPDK-compatible driver

:00:04.0 'XL710/X710 Virtual Function' drv=igb_uio unused=i40evf
:00:05.0 'XL710/X710 Virtual Function' drv=igb_uio unused=i40evf

Network devices using kernel driver
===
:00:03.0 'Virtio network device' if= drv=virtio-pci 
unused=virtio_pci,igb_uio

Other network devices
=


-Original Message-
From: Dpdk-ovs [mailto:dpdk-ovs-boun...@lists.01.org] On Behalf Of 
Srinivasreddy R
Sent: Thursday, February 26, 2015 6:11 AM
To: dev at dpdk.org; dpdk-ovs at lists.01.org
Subject: [Dpdk-ovs] problem in binding interfaces of virtio-pci on the VM

Hi,
I have written a sample program for usvhost supported by OVDK.

I have initialized the VM using the command below.
On the VM:

I am able to see two interfaces, and they work fine with traffic in raw socket 
mode.
My problem is that when I bind the interfaces to the PMD driver [igb_uio], my 
virtual machine hangs and I am not able to access it further.
My question is: what may be the reason for this behavior, and how can I debug 
the root cause?
Please help in finding out the problem.



 ./tools/dpdk_nic_bind.py --status

Network devices using DPDK-compatible driver 



Network devices using kernel driver
===
:00:03.0 '82540EM Gigabit Ethernet Controller' if=ens3 drv=e1000 
unused=igb_uio *Active*
:00:04.0 'Virtio network device' if= drv=virtio-pci unused=igb_uio
:00:05.0 'Virtio network device' if= drv=virtio-pci unused=igb_uio

Other network devices
=



./dpdk_nic_bind.py --bind=igb_uio 00:04.0 00:05.0



./x86_64-softmmu/qemu-system-x86_64 -cpu host -boot c  -hda 
/home/utils/images/vm1.img  -m 2048M -smp 3 --enable-kvm -name 'VM1'
-nographic -vnc :1 -pidfile /tmp/vm1.pid -drive 
file=fat:rw:/tmp/qemu_share,snapshot=off -monitor 
unix:/tmp/vm1monitor,server,nowait  -net none -no-reboot -mem-path 
/dev/hugepages -mem-prealloc -netdev 
type=tap,id=net1,script=no,downscript=no,ifname=usvhost1,vhost=on -device 
virtio-net-pci,netdev=net1,mac=00:16:3e:00:03:03,csum=off,gso=off,guest_tso4=off,guest_tso6=off,guest_ecn=off
-netdev type=tap,id=net2,script=no,downscript=no,ifname=usvhost2,vhost=on
-device
virtio-net-pci,netdev=net2,mac=00:16:3e:00:03:04,csum=off,gso=off,guest_tso4=off,guest_tso6=off,guest_ecn=off




--
thanks
srinivas.
___
Dpdk-ovs mailing list
Dpdk-ovs at lists.01.org
https://lists.01.org/mailman/listinfo/dpdk-ovs


[dpdk-dev] [Patch 1/2] i40e RX Bulk Alloc: Larger list size (33 to 128) throughput optimization

2015-10-27 Thread Polehn, Mike A
Combined 2 subroutines of code into one subroutine with one read operation 
followed by a buffer allocate and load loop.

Eliminated the staging queue and its subroutine, which removed extra pointer list 
movements and reduced the number of active variable cache pages during the call.

Reduced the queue position variables to just 2, the next read point and the last 
NIC RX descriptor position, and also changed the logic to allow the NIC 
descriptor table to not always need to be filled.

Reduced the NIC register update write from one per loop to one per driver call, 
to minimize CPU stalls waiting on multiple SMP synchronization points and on 
earlier NIC register writes, which often take large cycle counts to complete. 
For example, with an input packet list of 33 and the default loop size of 32, 
the second NIC register write would occur just after RX processing of only 1 
packet, resulting in a large CPU stall time.

Eliminated the initial "RX packet present" test before the RX processing loop, 
which duplicates a check made inside the loop, since less free time is generally 
available when packets are present than when no input packets are being 
processed. 

Used standard-sized variables to help reduce the overhead of non-standard 
variable sizes.

Reduced the number of variables, reordered the variable structure to put the 
most active variables in the first cache line and better utilize the memory 
bytes inside the cache line, and reduced the active cache line count to 1 cache 
line during the processing call. Other RX subroutine sets might still use more 
than 1 variable cache line.

Signed-off-by: Mike A. Polehn 

diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index fd656d5..ea63f2f 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -63,6 +63,7 @@
 #define DEFAULT_TX_RS_THRESH   32
 #define DEFAULT_TX_FREE_THRESH 32
 #define I40E_MAX_PKT_TYPE  256
+#define I40E_RX_INPUT_BUF_MAX  256

 #define I40E_TX_MAX_BURST  32

@@ -959,115 +960,97 @@ check_rx_burst_bulk_alloc_preconditions(__rte_unused 
struct i40e_rx_queue *rxq)
 }

 #ifdef RTE_LIBRTE_I40E_RX_ALLOW_BULK_ALLOC
-#define I40E_LOOK_AHEAD 8
-#if (I40E_LOOK_AHEAD != 8)
-#error "PMD I40E: I40E_LOOK_AHEAD must be 8\n"
-#endif
-static inline int
-i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
+
+static inline unsigned
+i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
+   unsigned nb_pkts)
 {
volatile union i40e_rx_desc *rxdp;
struct i40e_rx_entry *rxep;
-   struct rte_mbuf *mb;
-   uint16_t pkt_len;
-   uint64_t qword1;
-   uint32_t rx_status;
-   int32_t s[I40E_LOOK_AHEAD], nb_dd;
-   int32_t i, j, nb_rx = 0;
-   uint64_t pkt_flags;
+   unsigned i, n, tail;

-   rxdp = &rxq->rx_ring[rxq->rx_tail];
-   rxep = &rxq->sw_ring[rxq->rx_tail];
-
-   qword1 = rte_le_to_cpu_64(rxdp->wb.qword1.status_error_len);
-   rx_status = (qword1 & I40E_RXD_QW1_STATUS_MASK) >>
-   I40E_RXD_QW1_STATUS_SHIFT;
+   /* Wrap tail */
+   if (rxq->rx_tail >= rxq->nb_rx_desc)
+   tail = 0;
+   else
+   tail = rxq->rx_tail;
+
+   /* Stop at end of Q, for end, next read alligned at Q start */
+   n = rxq->nb_rx_desc - tail;
+   if (n < nb_pkts)
+   nb_pkts = n;
+
+   rxdp = &rxq->rx_ring[tail];
+   rte_prefetch0(rxdp);
+   rxep = &rxq->sw_ring[tail];
+   rte_prefetch0(rxep);
+
+   i = 0;
+   while (nb_pkts > 0) {
+   /* Prefetch NIC descriptors and packet list */
+   if (likely(nb_pkts > 4)) {
+   rte_prefetch0(&rxdp[4]);
+   if (likely(nb_pkts > 8)) {
+   rte_prefetch0(&rxdp[8]);
+   rte_prefetch0(&rxep[8]);
+   }
+   }

-   /* Make sure there is at least 1 packet to receive */
-   if (!(rx_status & (1 << I40E_RX_DESC_STATUS_DD_SHIFT)))
-   return 0;
+   for (n = 0; (nb_pkts > 0)&&(n < 8); n++, nb_pkts--, i++) {
+   uint64_t qword1;
+   uint64_t pkt_flags;
+   uint16_t pkt_len;
+   struct rte_mbuf *mb = rxep->mbuf;
+   rxep++;

-   /**
-* Scan LOOK_AHEAD descriptors at a time to determine which
-* descriptors reference packets that are ready to be received.
-*/
-   for (i = 0; i < RTE_PMD_I40E_RX_MAX_BURST; i+=I40E_LOOK_AHEAD,
-   rxdp += I40E_LOOK_AHEAD, rxep += I40E_LOOK_AHEAD) {
-   /* Read desc statuses backwards to avoid race condition */
-   for (j = I40E_LOOK_AHEAD - 1; j >= 0; j--) {
+   /* Translate descriptor info to mbuf parameters */
qword1 = rte_le_to_cpu_64(\
-   rxdp[j].wb.qword1.status_error_len);
-   s[j] = (qword1 & I40E_RXD_QW1_STATUS_MASK) >>
-   

[dpdk-dev] [Patch] Eth Driver: Optimization for improved NIC processing rates

2015-10-27 Thread Polehn, Mike A
Prefetch of the interface access variables while calling into the driver RX and 
TX subroutines.

For converging zero-loss packet task tests, a small drop in latency for the 
zero-loss measurements and a small drop in lost packet counts for the lossy 
measurement points were observed, indicating some savings of execution clock 
cycles.

Signed-off-by: Mike A. Polehn 

diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
index 8a8c82b..09f1069 100644
--- a/lib/librte_ether/rte_ethdev.h
+++ b/lib/librte_ether/rte_ethdev.h
@@ -2357,11 +2357,15 @@ rte_eth_rx_burst(uint8_t port_id, uint16_t queue_id,
 struct rte_mbuf **rx_pkts, const uint16_t nb_pkts)
 {
struct rte_eth_dev *dev;
+   void *rxq;

dev = &rte_eth_devices[port_id];

-   int16_t nb_rx = (*dev->rx_pkt_burst)(dev->data->rx_queues[queue_id],
-   rx_pkts, nb_pkts);
+   /* rxq is going to be immediately used, prefetch it */
+   rxq = dev->data->rx_queues[queue_id];
+   rte_prefetch0(rxq);
+
+   int16_t nb_rx = (*dev->rx_pkt_burst)(rxq, rx_pkts, nb_pkts);

 #ifdef RTE_ETHDEV_RXTX_CALLBACKS
struct rte_eth_rxtx_callback *cb = dev->post_rx_burst_cbs[queue_id];
@@ -2499,6 +2503,7 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
 struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 {
struct rte_eth_dev *dev;
+   void *txq;

dev = &rte_eth_devices[port_id];

@@ -2514,7 +2519,11 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
}
 #endif

-   return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id], tx_pkts, 
nb_pkts);
+   /* txq is going to be immediately used, prefetch it */
+   txq = dev->data->tx_queues[queue_id];
+   rte_prefetch0(txq);
+
+   return (*dev->tx_pkt_burst)(txq, tx_pkts, nb_pkts);
 }
 #endif


[dpdk-dev] [Patch 2/2] i40e rx Bulk Alloc: Larger list size (33 to 128) throughput optimization

2015-10-27 Thread Polehn, Mike A
Added a check for a minimum packet allocation count of 2 to eliminate the extra 
overhead of supporting prefetch for the case where only one packet is allocated 
into the queue at a time.

Used standard-sized variables to help reduce the overhead of non-standard 
variable sizes.

Added a second-level prefetch to get the packet address into cache 0 earlier, and 
eliminated the in-loop calculation that determined the end of the prefetch loop.

Used old-time C optimization methods: using pointers instead of arrays, and 
reducing the scope of some variables to improve the chances of using register 
variables instead of stack variables.

Signed-off-by: Mike A. Polehn 

diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index ec62f75..2032e06 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -64,6 +64,7 @@
 #define DEFAULT_TX_FREE_THRESH 32
 #define I40E_MAX_PKT_TYPE  256
 #define I40E_RX_INPUT_BUF_MAX  256
+#define I40E_RX_FREE_THRESH_MIN  2

 #define I40E_TX_MAX_BURST  32

@@ -942,6 +943,12 @@ check_rx_burst_bulk_alloc_preconditions(__rte_unused 
struct i40e_rx_queue *rxq)
 "rxq->rx_free_thresh=%d",
 rxq->nb_rx_desc, rxq->rx_free_thresh);
ret = -EINVAL;
+   } else if (rxq->rx_free_thresh < I40E_RX_FREE_THRESH_MIN) {
+   PMD_INIT_LOG(DEBUG, "Rx Burst Bulk Alloc Preconditions: "
+   "rxq->rx_free_thresh=%d, "
+   "I40E_RX_FREE_THRESH_MIN=%d",
+   rxq->rx_free_thresh, I40E_RX_FREE_THRESH_MIN);
+   ret = -EINVAL;
} else if (!(rxq->nb_rx_desc < (I40E_MAX_RING_DESC -
RTE_PMD_I40E_RX_MAX_BURST))) {
PMD_INIT_LOG(DEBUG, "Rx Burst Bulk Alloc Preconditions: "
@@ -1058,9 +1065,8 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq)
 {
volatile union i40e_rx_desc *rxdp;
struct i40e_rx_entry *rxep;
-   struct rte_mbuf *mb;
-   unsigned alloc_idx, i;
-   uint64_t dma_addr;
+   struct rte_mbuf *pk, *npk;
+   unsigned alloc_idx, i, l;
int diag;

/* Allocate buffers in bulk */
@@ -1076,22 +1082,36 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq)
return -ENOMEM;
}

+   pk = rxep->mbuf;
+   rte_prefetch0(pk);
+   rxep++;
+   npk = rxep->mbuf;
+   rte_prefetch0(npk);
+   rxep++;
+   l = rxq->rx_free_thresh - 2;
+
rxdp = &rxq->rx_ring[alloc_idx];
for (i = 0; i < rxq->rx_free_thresh; i++) {
-   if (likely(i < (rxq->rx_free_thresh - 1)))
+   struct rte_mbuf *mb = pk;
+   pk = npk;
+   if (likely(i < l)) {
/* Prefetch next mbuf */
-   rte_prefetch0(rxep[i + 1].mbuf);
-
-   mb = rxep[i].mbuf;
-   rte_mbuf_refcnt_set(mb, 1);
-   mb->next = NULL;
+   npk = rxep->mbuf;
+   rte_prefetch0(npk);
+   rxep++;
+   }
mb->data_off = RTE_PKTMBUF_HEADROOM;
+   rte_mbuf_refcnt_set(mb, 1);
mb->nb_segs = 1;
mb->port = rxq->port_id;
-   dma_addr = rte_cpu_to_le_64(\
-   RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb));
-   rxdp[i].read.hdr_addr = 0;
-   rxdp[i].read.pkt_addr = dma_addr;
+   mb->next = NULL;
+   {
+   uint64_t dma_addr = rte_cpu_to_le_64(
+   RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb));
+   rxdp->read.hdr_addr = dma_addr;
+   rxdp->read.pkt_addr = dma_addr;
+   }
+   rxdp++;
}

rxq->rx_last_pos = alloc_idx + rxq->rx_free_thresh - 1;



[dpdk-dev] [Patch 1/2] i40e simple tx: Larger list size (33 to 128) throughput optimization

2015-10-27 Thread Polehn, Mike A
Reduced the focus on a 32-packet list size for better handling of a range of packet list sizes.

Changed maximum new buffer loop process size to the NIC queue free buffer count 
per loop.

Removed the redundant single-call check, leaving just one call with a focused loop.

Reduced the NIC register update write from one per loop to one per driver call, 
to minimize CPU stalls waiting on multiple SMP synchronization points and on 
earlier NIC register writes, which often take large cycle counts to complete. 
For example, with an output list size of 64 and the default loop size of 32, 
when 33 packets are queued on the descriptor table, the second NIC register 
write would occur just after TX processing of only 1 packet, resulting in a 
large CPU stall time.

Used standard-sized variables to help reduce the overhead of non-standard 
variable sizes.

Reordered the variable structure to put the most active variables in the first 
cache line and better utilize the memory bytes inside the cache line, and 
reduced the active cache line count during the call.

Signed-off-by: Mike A. Polehn 

diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index ec62f75..2032e06 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -64,6 +64,7 @@
 #define DEFAULT_TX_FREE_THRESH 32
 #define I40E_MAX_PKT_TYPE  256
 #define I40E_RX_INPUT_BUF_MAX  256
+#define I40E_RX_FREE_THRESH_MIN  2

 #define I40E_TX_MAX_BURST  32

@@ -942,6 +943,12 @@ check_rx_burst_bulk_alloc_preconditions(__rte_unused 
struct i40e_rx_queue *rxq)
 "rxq->rx_free_thresh=%d",
 rxq->nb_rx_desc, rxq->rx_free_thresh);
ret = -EINVAL;
+   } else if (rxq->rx_free_thresh < I40E_RX_FREE_THRESH_MIN) {
+   PMD_INIT_LOG(DEBUG, "Rx Burst Bulk Alloc Preconditions: "
+   "rxq->rx_free_thresh=%d, "
+   "I40E_RX_FREE_THRESH_MIN=%d",
+   rxq->rx_free_thresh, I40E_RX_FREE_THRESH_MIN);
+   ret = -EINVAL;
} else if (!(rxq->nb_rx_desc < (I40E_MAX_RING_DESC -
RTE_PMD_I40E_RX_MAX_BURST))) {
PMD_INIT_LOG(DEBUG, "Rx Burst Bulk Alloc Preconditions: "
@@ -1058,9 +1065,8 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq)
 {
volatile union i40e_rx_desc *rxdp;
struct i40e_rx_entry *rxep;
-   struct rte_mbuf *mb;
-   unsigned alloc_idx, i;
-   uint64_t dma_addr;
+   struct rte_mbuf *pk, *npk;
+   unsigned alloc_idx, i, l;
int diag;

/* Allocate buffers in bulk */
@@ -1076,22 +1082,36 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq)
return -ENOMEM;
}

+   pk = rxep->mbuf;
+   rte_prefetch0(pk);
+   rxep++;
+   npk = rxep->mbuf;
+   rte_prefetch0(npk);
+   rxep++;
+   l = rxq->rx_free_thresh - 2;
+
rxdp = &rxq->rx_ring[alloc_idx];
for (i = 0; i < rxq->rx_free_thresh; i++) {
-   if (likely(i < (rxq->rx_free_thresh - 1)))
+   struct rte_mbuf *mb = pk;
+   pk = npk;
+   if (likely(i < l)) {
/* Prefetch next mbuf */
-   rte_prefetch0(rxep[i + 1].mbuf);
-
-   mb = rxep[i].mbuf;
-   rte_mbuf_refcnt_set(mb, 1);
-   mb->next = NULL;
+   npk = rxep->mbuf;
+   rte_prefetch0(npk);
+   rxep++;
+   }
mb->data_off = RTE_PKTMBUF_HEADROOM;
+   rte_mbuf_refcnt_set(mb, 1);
mb->nb_segs = 1;
mb->port = rxq->port_id;
-   dma_addr = rte_cpu_to_le_64(\
-   RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb));
-   rxdp[i].read.hdr_addr = 0;
-   rxdp[i].read.pkt_addr = dma_addr;
+   mb->next = NULL;
+   {
+   uint64_t dma_addr = rte_cpu_to_le_64(
+   RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb));
+   rxdp->read.hdr_addr = dma_addr;
+   rxdp->read.pkt_addr = dma_addr;
+   }
+   rxdp++;
}

rxq->rx_last_pos = alloc_idx + rxq->rx_free_thresh - 1;


[dpdk-dev] [Patch 2/2] i40e simple tx: Larger list size (33 to 128) throughput optimization

2015-10-27 Thread Polehn, Mike A
Added packet memory prefetch for faster access to the variables inside the 
packet buffer that are needed for the free operation.

Signed-off-by: Mike A. Polehn 

diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 177fb2e..d9bc30a 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -1748,7 +1748,8 @@ static inline int __attribute__((always_inline))
 i40e_tx_free_bufs(struct i40e_tx_queue *txq)
 {
struct i40e_tx_entry *txep;
-   uint16_t i;
+   unsigned i, l, tx_rs_thresh;
+   struct rte_mbuf *pk, *pk_next;

if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
@@ -1757,18 +1758,46 @@ i40e_tx_free_bufs(struct i40e_tx_queue *txq)

txep = &(txq->sw_ring[txq->tx_next_dd - (txq->tx_rs_thresh - 1)]);

-   for (i = 0; i < txq->tx_rs_thresh; i++)
-   rte_prefetch0((txep + i)->mbuf);
+   /* Prefetch first 2 packets */
+   pk = txep->mbuf;
+   rte_prefetch0(pk);
+   txep->mbuf = NULL;
+   txep++;
+   tx_rs_thresh = txq->tx_rs_thresh;
+   if (likely(txq->tx_rs_thresh > 1)) {
+   pk_next = txep->mbuf;
+   rte_prefetch0(pk_next);
+   txep->mbuf = NULL;
+   txep++;
+   l = tx_rs_thresh - 2;
+   } else {
+   pk_next = pk;
+   l = tx_rs_thresh - 1;
+   }

if (!(txq->txq_flags & (uint32_t)ETH_TXQ_FLAGS_NOREFCOUNT)) {
-   for (i = 0; i < txq->tx_rs_thresh; ++i, ++txep) {
-   rte_mempool_put(txep->mbuf->pool, txep->mbuf);
-   txep->mbuf = NULL;
+   for (i = 0; i < tx_rs_thresh; ++i) {
+   struct rte_mbuf *mbuf = pk;
+   pk = pk_next;
+   if (likely(i < l)) {
+   pk_next = txep->mbuf;
+   rte_prefetch0(pk_next);
+   txep->mbuf = NULL;
+   txep++;
+   }
+   rte_mempool_put(mbuf->pool, mbuf);
}
} else {
-   for (i = 0; i < txq->tx_rs_thresh; ++i, ++txep) {
-   rte_pktmbuf_free_seg(txep->mbuf);
-   txep->mbuf = NULL;
+   for (i = 0; i < tx_rs_thresh; ++i) {
+   struct rte_mbuf *mbuf = pk;
+   pk = pk_next;
+   if (likely(i < l)) {
+   pk_next = txep->mbuf;
+   rte_prefetch0(pk_next);
+   txep->mbuf = NULL;
+   txep++;
+   }
+   rte_pktmbuf_free_seg(mbuf);
}
}



[dpdk-dev] [Patch] Eth Driver: Optimization for improved NIC processing rates

2015-10-28 Thread Polehn, Mike A
Hi Bruce!

Thank you for reviewing; sorry I didn't write as clearly as possible.

I was trying to say more than "the performance improved". I didn't call out RFC 
2544 since many people may not know much about it. I was also trying to convey 
what was observed and the conclusion derived from the observation without 
getting too long.

When the NIC processing loop rate is around 400,000/sec, the entry and exit 
savings are not easily observable, because the average data rate variation from 
test to test is higher than the packet rate gain. If the RFC 2544 zero-loss 
convergence is set too fine, the time it takes to make a complete test increases 
substantially (I set my convergence to about 0.25% of line rate) at 60 seconds 
per measurement point. Unless the current convergence data rate is close to zero 
loss for the next point, a small improvement is not going to show up as a higher 
zero-loss rate. However, the test has a series of measurements, which include 
average latency and packet loss. Also, since the test equipment uses a 
predefined sequence algorithm that causes the same data rate to be generated for 
each test to a high degree of accuracy, the results for the same data rates can 
be compared across tests. If someone repeats the tests, I am pointing to the 
particular data to look at. One 60-second measurement by itself does not give 
sufficient accuracy to make a conclusion, but information correlated across 
multiple measurements gives the basis for a correct conclusion.

For l3fwd to be stable with i40e requires the queues to be increased (I use 2k) 
and the packet count to also be increased. This then gets 100% zero-loss line 
rate with 64-byte packets for two 10 GbE connections (given the correct 
Fortville firmware). This makes it good for verifying the correct NIC firmware, 
but it does not work well for testing since the data is network limited. I have 
my own stable packet processing code which I used for testing. I have multiple 
programs, but during the optimization cycle I hit line rate and had to move to a 
5-tuple processing program to get a higher load to proceed. I have a doc that 
covers this setup and the optimization results, but it cannot be shared. Someone 
making their own measurements needs to have made sufficient tests to understand 
the stability of their test environment.

Mike


-Original Message-
From: Richardson, Bruce 
Sent: Wednesday, October 28, 2015 3:45 AM
To: Polehn, Mike A
Cc: dev at dpdk.org
Subject: Re: [dpdk-dev] [Patch] Eth Driver: Optimization for improved NIC 
processing rates

On Tue, Oct 27, 2015 at 08:56:31PM +, Polehn, Mike A wrote:
> Prefetch of interface access variables while calling into driver RX and TX 
> subroutines.
> 
> For converging zero loss packet task tests, a small drop in latency 
> for zero loss measurements and small drop in lost packet counts for 
> the lossy measurement points was observed, indicating some savings of 
> execution clock cycles.
> 
Hi Mike,

the commit log message above seems a bit awkward to read. If I understand it 
correctly, would the below suggestion be a shorter, clearer equivalent?

Prefetch RX and TX queue variables in ethdev before driver function call

This has been measured to produce higher throughput and reduced latency
in RFC 2544 throughput tests.

Or perhaps you could suggest yourself some similar wording. It would also be 
good to clarify with what applications the improvements were seen - was it 
using testpmd or l3fwd or something else?

Regards,
/Bruce



[dpdk-dev] [Patch v2] Eth driver optimization: Prefetch variable structure

2015-11-03 Thread Polehn, Mike A
Adds Eth driver prefetch variable structure to CPU cache 0 while calling into 
tx or rx 
device driver operation.

RFC 2544 tests at NIC task measurement points show improvement in the form of 
lower latency and/or better packet throughput, indicating clock cycles saved.

Signed-off-by: Mike A. Polehn 

diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
index 8a8c82b..09f1069 100644
--- a/lib/librte_ether/rte_ethdev.h
+++ b/lib/librte_ether/rte_ethdev.h
@@ -2357,11 +2357,15 @@ rte_eth_rx_burst(uint8_t port_id, uint16_t queue_id,
 struct rte_mbuf **rx_pkts, const uint16_t nb_pkts)
 {
struct rte_eth_dev *dev;
+   void *rxq;

dev = &rte_eth_devices[port_id];

-   int16_t nb_rx = (*dev->rx_pkt_burst)(dev->data->rx_queues[queue_id],
-   rx_pkts, nb_pkts);
+   /* rxq is going to be immediately used, prefetch it */
+   rxq = dev->data->rx_queues[queue_id];
+   rte_prefetch0(rxq);
+
+   int16_t nb_rx = (*dev->rx_pkt_burst)(rxq, rx_pkts, nb_pkts);

 #ifdef RTE_ETHDEV_RXTX_CALLBACKS
struct rte_eth_rxtx_callback *cb = dev->post_rx_burst_cbs[queue_id];
@@ -2499,6 +2503,7 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
 struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 {
struct rte_eth_dev *dev;
+   void *txq;

dev = &rte_eth_devices[port_id];

@@ -2514,7 +2519,11 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
}
 #endif

-   return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id], tx_pkts, 
nb_pkts);
+   /* txq is going to be immediately used, prefetch it */
+   txq = dev->data->tx_queues[queue_id];
+   rte_prefetch0(txq);
+
+   return (*dev->tx_pkt_burst)(txq, tx_pkts, nb_pkts);
 }
 #endif



[dpdk-dev] How can I calculate/estimate pps(packet per seocond) and bps(bit per second) in DPDK pktg

2015-11-03 Thread Polehn, Mike A
I used the following code snippet with the i40e device; with a 1-second sample 
time it had very high accuracy for IPv4 UDP packets:

#define FLOWD_PERF_PACKET_OVERHEAD 24  /* CRC + Preamble + SOF + Interpacket 
gap */
#define FLOWD_REF_NETWORK_SPEED   10e9

double Ave_Bytes_per_Packet, Data_Rate, Net_Rate;
uint64_t Bits;
uint64_t Bytes = pFlow->flow.n_bytes - pMatch_Prev->flow.n_bytes;
uint64_t Packets = pFlow->flow.n_packets - pMatch_Prev->flow.n_packets;
uint64_t Time_us = pFlow->flow.flow_time_us - pMatch_Prev->flow.flow_time_us;

if (Bytes == 0)
Ave_Bytes_per_Packet = 0.0;
else
Ave_Bytes_per_Packet = ((double)Bytes / (double)Packets) + 4.0;

Bits = (Bytes + (Packets*FLOWD_PERF_PACKET_OVERHEAD)) * 8;
if (Bits == 0)
Data_Rate = 0.0;
else
Data_Rate = (((double)Bits) / Time_us) * 1e6;

if (Data_Rate == 0.0)
Net_Rate = 0.0;
else
Net_Rate = Data_Rate / FLOWD_REF_NETWORK_SPEED;

For packet rate: double pk_rate = (((double)Packets)/ ((double)Time_us)) * 1e6;

To calculate elapsed time in a DPDK app, I used the CPU time-stamp counter (this 
will not work if the counter is being modified):

Initialization:
double flow_time_scale_us;
...
flow_time_scale_us = 1e6/rte_get_tsc_hz();

Elapsed time (uSec) example: 

elapse_us = (rte_rdtsc() - entry->tsc_first_packet) *
flow_time_scale_us; /* calc total elapsed us */

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Wiles, Keith
Sent: Tuesday, November 3, 2015 6:33 AM
To: Van Haaren, Harry; ???; dev at dpdk.org
Subject: Re: [dpdk-dev] How can I calculate/estimate pps(packet per seocond) 
and bps(bit per second) in DPDK pktg

On 11/3/15, 8:30 AM, "Van Haaren, Harry"  wrote:

>Hi Keith,
>
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wiles, Keith
>
>> Hmm, I just noticed I did not include the FCS bytes. Does the NIC 
>> include FCS bytes in the counters? Need to verify that one and if not then 
>> it becomes a bit more complex.
>
>The Intel NICs count packet sizes inclusive of CRC / FCS, from eg the 
>ixgbe/82599 datasheet:
>"This register includes bytes received in a packet from the Address> field through the  field, inclusively."

Thanks I assumed I had known that at the time :-)
>
>-Harry
>


Regards,
Keith






[dpdk-dev] How can I calculate/estimate pps(packet per seocond) and bps(bit per second) in DPDK pktg

2015-11-03 Thread Polehn, Mike A
Accessing registers on the NIC has very high latency and will often stall the 
CPU waiting for the response, especially with multiple register reads and 
high-throughput packet data also being transferred. The size value was derived 
from the NIC writing a value to the descriptor table, which was then written to 
the packet buffer. The bit rate calculation included the FCS/CRC as packet 
overhead, and the packet size was 4 bytes short. 

The inclusion or exclusion of the FCS on receive might be a programmable option. 
For TX, it might be a flag set in the TX descriptor table to either use the FCS 
in the packet buffer or calculate it on the fly. Where you get the numbers, and 
the initialization, may affect the calculation. 

A very important rating for a CPU is its FLOPS performance. Almost all modern 
CPUs do single-cycle floating point multiplies (divides are done with shifts and 
adds and cost a clock per set bit in the float mantissa, or in the int). 
Conversions to and from floating point are often done in parallel with other 
operations, which makes using integer math not always faster. Often the 
additional checks for edge conditions and the adjustments needed with integer 
processing lose the gain, but it all depends on the exact algorithm and end 
scaling. Being able to do high-quality integer processing is a good skill, 
especially when doing work like signal processing.

-Original Message-
From: Wiles, Keith 
Sent: Tuesday, November 3, 2015 11:01 AM
To: Polehn, Mike A; Van Haaren, Harry; ???; dev at dpdk.org
Subject: Re: [dpdk-dev] How can I calculate/estimate pps(packet per seocond) 
and bps(bit per second) in DPDK pktg

On 11/3/15, 9:59 AM, "Polehn, Mike A"  wrote:

>I used the following code snip-it with the i40e device, with 1 second sample 
>time had very high accuracy for IPv4 UDP packets:
>
>#define FLOWD_PERF_PACKET_OVERHEAD 24  /* CRC + Preamble + SOF + Interpacket 
>gap */
>#define FLOWD_REF_NETWORK_SPEED   10e9
>
>double Ave_Bytes_per_Packet, Data_Rate, Net_Rate; uint64_t Bits; 
>uint64_t Bytes = pFlow->flow.n_bytes - pMatch_Prev->flow.n_bytes; 
>uint64_t Packets = pFlow->flow.n_packets - pMatch_Prev->flow.n_packets; 
>uint64_t Time_us = pFlow->flow.flow_time_us - 
>pMatch_Prev->flow.flow_time_us;
>
>if (Bytes == 0)
>   Ave_Bytes_per_Packet = 0.0;
>else
>   Ave_Bytes_per_Packet = ((double)Bytes / (double)Packets) + 4.0;
>
>Bits = (Bytes + (Packets*FLOWD_PERF_PACKET_OVERHEAD)) * 8; if (Bits == 
>0)
>   Data_Rate = 0.0;
>else
>   Data_Rate = (((double)Bits) / Time_us) * 1e6;
>
>if (Data_Rate == 0.0)
>   Net_Rate = 0.0;
>else
>   Net_Rate = Data_Rate / FLOWD_REF_NETWORK_SPEED;
>
>For packet rate: double pk_rate = (((double)Packets)/ 
>((double)Time_us)) * 1e6;
>
>To calculate elapsed time in DPDK app, used CPU counter (will not work if 
>counter is being modified):
>
>Initialization:
>double flow_time_scale_us;
>...
>flow_time_scale_us = 1e6/rte_get_tsc_hz();
>
>Elapsed time (uSec) example: 
>
>elapse_us = (rte_rdtsc() - entry->tsc_first_packet) *
>   flow_time_scale_us; /* calc total elapsed us */

Looks reasonable. I assume n_bytes does not include the FCS, unlike the NIC 
counters.

Also, I decided to avoid using doubles in my code and just used 64-bit registers 
and integer math :-) 

>
>-Original Message-
>From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wiles, Keith
>Sent: Tuesday, November 3, 2015 6:33 AM
>To: Van Haaren, Harry; ???; dev at dpdk.org
>Subject: Re: [dpdk-dev] How can I calculate/estimate pps(packet per 
>seocond) and bps(bit per second) in DPDK pktg
>
>On 11/3/15, 8:30 AM, "Van Haaren, Harry"  wrote:
>
>>Hi Keith,
>>
>>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wiles, Keith
>>
>>> Hmm, I just noticed I did not include the FCS bytes. Does the NIC 
>>> include FCS bytes in the counters? Need to verify that one and if not then 
>>> it becomes a bit more complex.
>>
>>The Intel NICs count packet sizes inclusive of CRC / FCS, from eg the 
>>ixgbe/82599 datasheet:
>>"This register includes bytes received in a packet from the >Address> field through the  field, inclusively."
>
>Thanks I assumed I had known that at the time :-)
>>
>>-Harry
>>
>
>
>Regards,
>Keith
>
>
>
>
>


Regards,
Keith






[dpdk-dev] FW: [Patch v2] Eth driver optimization: Prefetch variable structure

2015-11-03 Thread Polehn, Mike A
My email address is my official email address and can only be used with the 
official email system, in other words the corporate MS Windows email system. 
Can I use an oddball junk email address, such as a gmail account or a 
non-returnable, IP-named Sendmail server (with no registered DNS name), to 
submit patches under a dpdk.org user account, with the patches signed with my 
official email address (which is different from the sending email address, which 
is just a junk name)?

Mike

-Original Message-
From: Polehn, Mike A 
Sent: Tuesday, November 3, 2015 12:17 PM
To: St Leger, Jim
Subject: RE: [dpdk-dev] [Patch v2] Eth driver optimization: Prefetch variable 
structure

I don't understand why a development system must also support a user email 
system and then also the full email server needed to deliver it. I have half a 
dozen servers I have various projects on...

Seems like the same server used to move git updates could also be made to move 
patch email for the project. 

-Original Message-
From: St Leger, Jim 
Sent: Tuesday, November 3, 2015 7:36 AM
To: Polehn, Mike A
Subject: RE: [dpdk-dev] [Patch v2] Eth driver optimization: Prefetch variable 
structure

Mike:
If you need any help/guidance navigating the DPDK.org forums and community, 
reach out to some of the crew.  Our Shannon and Shanghai teams have it down to 
a science, okay, an artful science anyway.  And there are some in the States 
such as Keith Wiles (and Jeff Shaw up your way) who could also give some BKMs.
Jim


-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Thomas Monjalon
Sent: Tuesday, November 3, 2015 8:03 AM
To: Polehn, Mike A 
Cc: dev at dpdk.org
Subject: Re: [dpdk-dev] [Patch v2] Eth driver optimization: Prefetch variable 
structure

Hi,
Please use git-send-email and check how titles are formatted in the git tree.
Thanks


[dpdk-dev] How can I calculate/estimate pps(packet per seocond) and bps(bit per second) in DPDK pktg

2015-11-04 Thread Polehn, Mike A
The change in TSC value from rte_rdtsc() needs to be multiplied by the scale to 
convert from clocks to the elapsed time in seconds.
For example from below:

elapse_us = (rte_rdtsc() - entry->tsc_first_packet) * flow_time_scale_us;

The bit rate requires the number of bytes passed in the time period, adjusted by 
the per-packet overhead for the number of packets transferred in the same time 
period.

#define FLOWD_PERF_PACKET_OVERHEAD 24 /* CRC + Preamble + SOF + Interpacket gap 
*/

Bits = (Bytes + (Packets*FLOWD_PERF_PACKET_OVERHEAD)) * 8;
Data_Rate = (((double)Bits) / Time_us) * 1e6;

Integer math is very tricky and, when multiplies are used, is often no faster 
than floating point math except on very low-performance processors.

Mike

From: ??? [mailto:pnk...@naver.com]
Sent: Tuesday, November 3, 2015 5:45 PM
To: Polehn, Mike A; Wiles, Keith; Van Haaren, Harry; dev at dpdk.org
Subject: RE: [dpdk-dev] How can I calculate/estimate pps(packet per seocond) 
and bps(bit per second) in DPDK pktg


Dear  Wiles, Keith ,  Van Haaren, Harry,  Polehn, Mike A,  Stephen Hemminger, 
Kyle Larose, and DPDK experts.



I really appreciate your precious answers and advice.



I will find and study the corresponding codes and CRC checking.





Last night, I tried to estimate bps and pps by using the following code.





// rte_distributor_process() gets 64 mbuf packets at a time.
// rte_distributor_process() gets packets from an Intel 82599ES 10 Gigabit
// Ethernet 2-port controller (two 10 GbE ports).

int rte_distributor_process(struct rte_distributor *d, struct rte_mbuf **mbufs, unsigned num_mbufs)
{
    uint64_t ticks_per_ms = rte_get_tsc_hz()/1000 ;
    uint64_t ticks_per_s = rte_get_tsc_hz() ;
    uint64_t ticks_per_s_div_8 = rte_get_tsc_hz()/8 ;
    uint64_t cur_tsc = 0, last_tsc = 0, sum_len, bps, pps ;

    cur_tsc = rte_rdtsc();

    sum_len = 0 ;
    for (l=0; l < num_mbufs; l++ ) { sum_len += mbufs[l]->pkt_len ; }

    if ((cur_tsc - last_tsc)!=0) {
        bps = (sum_len * ticks_per_s_div_8 ) / (cur_tsc - last_tsc) ;
        pps = num_mbufs * ticks_per_s / (cur_tsc - last_tsc) ;
    } else bps = pps = 0 ;

    last_tsc = cur_tsc ;
}



I got a maximum of 6,835,440,833 bits per second for 20 Gbps 1500-byte packet 
traffic, and a maximum of 6,808,524,220 bits per second for 2 Gbps 1500-byte 
packet traffic.

I guess there can be packet bursts; however, the estimated value has too many 
errors.

I will try the methods you proposed.

Thank you very much.

Sincerely yours,

Ick-Sung Choi.


-Original Message-
From: "Polehn, Mike A"mailto:mike.a.pol...@intel.com>>
To: "Wiles, Keith"mailto:keith.wiles at intel.com>>; 
"Van Haaren, Harry"mailto:harry.van.haaren at 
intel.com>>; "???"mailto:pnk003 at naver.com>>; "dev at 
dpdk.org<mailto:dev at dpdk.org>"mailto:dev at dpdk.org>>;
Cc:
Sent: 2015-11-04 (?) 00:59:34
Subject: RE: [dpdk-dev] How can I calculate/estimate pps(packet per seocond) 
and bps(bit per second) in DPDK pktg

I used the following code snip-it with the i40e device, with 1 second sample 
time had very high accuracy for IPv4 UDP packets:

#define FLOWD_PERF_PACKET_OVERHEAD 24 /* CRC + Preamble + SOF + Interpacket gap 
*/
#define FLOWD_REF_NETWORK_SPEED 10e9

double Ave_Bytes_per_Packet, Data_Rate, Net_Rate;
uint64_t Bits;
uint64_t Bytes = pFlow->flow.n_bytes - pMatch_Prev->flow.n_bytes;
uint64_t Packets = pFlow->flow.n_packets - pMatch_Prev->flow.n_packets;
uint64_t Time_us = pFlow->flow.flow_time_us - pMatch_Prev->flow.flow_time_us;

if (Bytes == 0)
Ave_Bytes_per_Packet = 0.0;
else
Ave_Bytes_per_Packet = ((double)Bytes / (double)Packets) + 4.0;

Bits = (Bytes + (Packets*FLOWD_PERF_PACKET_OVERHEAD)) * 8;
if (Bits == 0)
Data_Rate = 0.0;
else
Data_Rate = (((double)Bits) / Time_us) * 1e6;

if (Data_Rate == 0.0)
Net_Rate = 0.0;
else
Net_Rate = Data_Rate / FLOWD_REF_NETWORK_SPEED;

For packet rate: double pk_rate = (((double)Packets)/ ((double)Time_us)) * 1e6;

To calculate elapsed time in DPDK app, used CPU counter (will not work if 
counter is being modified):

Initialization:
double flow_time_scale_us;
...
flow_time_scale_us = 1e6/rte_get_tsc_hz();

Elapsed time (uSec) example:

elapse_us = (rte_rdtsc() - entry->tsc_first_packet) *
flow_time_scale_us; /* calc total elapsed us */

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Wiles, Keith
Sent: Tuesday, November 3, 2015 6:33 AM
To: Van Haaren, Harry; ???; dev at dpdk.org
Subject: Re: [dpdk-dev] How can I calculate/estimate pps(packet per seocond) 
and bps(bit per second) in DPDK pktg

On 11/3/15, 8:30 AM, "Van Haaren, Harry" <harry.van.haaren at intel.com> wrote:

>Hi Keith,
>
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wiles, Keith
>
>> Hmm, I jus

[dpdk-dev] SR-IOV: API to tell VF from PF

2015-11-05 Thread Polehn, Mike A
I can think of a very good reason to want to know if the device is a VF or a PF. 

The VF has to go through a layer 2 switch, which does not allow it to just 
receive anything coming across the Ethernet.

The PF can receive all the packets, including packets with different NIC 
addresses. This allows the packets to be treated as just data, and allows the 
processing of that data without needing to adjust each NIC L2 address before 
sending it through to the Ethernet. So data can be moved through a series of 
NICs between systems without the extra processing. Not doing unnecessary 
processing leaves more clock cycles to do high-value processing.

Mike

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Bruce Richardson
Sent: Thursday, November 5, 2015 1:51 AM
To: Shaham Fridenberg
Cc: dev at dpdk.org
Subject: Re: [dpdk-dev] SR-IOV: API to tell VF from PF

On Thu, Nov 05, 2015 at 09:39:19AM +, Shaham Fridenberg wrote:
> Hey all,
> 
> Is there some API to tell VF from PF?
> 
> Only way I found so far is deducing that from driver name in the 
> rte_eth_devices struct.
> 
> Thanks,
> Shaham

Hi Shaham,

yes, checking the driver name is probably the only way to do so. However, why 
do you need or want to know this? If you want to know the capabilities of a 
device basing it on a list of known device types is probably not the best way.

Regards,
/Bruce
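
A minimal sketch of that driver-name check, assuming the DPDK 2.x 
rte_eth_dev_info_get(); port_is_vf() and the "vf" substring match are a 
best-effort heuristic for illustration, not a stable API:

#include <string.h>
#include <rte_ethdev.h>

/* Best-effort guess: VF PMD driver names such as "rte_i40evf_pmd" or
 * "rte_ixgbevf_pmd" contain "vf", while the PF names do not. */
static int
port_is_vf(uint8_t port_id)
{
    struct rte_eth_dev_info info;

    rte_eth_dev_info_get(port_id, &info);
    return info.driver_name != NULL &&
           strstr(info.driver_name, "vf") != NULL;
}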


[dpdk-dev] SR-IOV: API to tell VF from PF

2015-11-05 Thread Polehn, Mike A
A VF should support promiscuous mode; however, this is different from a PF 
supporting promiscuous mode.

What happens to network throughput, which is tied to PCIe throughput, when, say, 
4 VFs are each in promiscuous mode? It should be supported, but with a very 
negative effect.

Not all NICs are created equal. The program should be able to query the device 
driver and determine whether the correct NIC type is being used. 
The device driver type should only match the device type, which should be 
specific to VF or PF.

Mike

-Original Message-
From: Richardson, Bruce 
Sent: Thursday, November 5, 2015 7:51 AM
To: Polehn, Mike A; Shaham Fridenberg
Cc: dev at dpdk.org
Subject: RE: [dpdk-dev] SR-IOV: API to tell VF from PF



> -Original Message-
> From: Polehn, Mike A
> Sent: Thursday, November 5, 2015 3:43 PM
> To: Richardson, Bruce ; Shaham Fridenberg 
> 
> Cc: dev at dpdk.org
> Subject: RE: [dpdk-dev] SR-IOV: API to tell VF from PF
> 
> I can think of a very good reason to want to know if the device is VF
> or PF.
> 
> The VF has to go through a layer 2 switch, which does not allow it to
> just receive anything coming across the Ethernet.
> 
> The PF can receive all the packets, including packets with different
> NIC addresses. This allows the packets to be just data and allows the
> data to be processed without needing to adjust each NIC L2 address
> before sending it through to the Ethernet. So data can be moved through
> a series of NICs between systems without the extra processing. Not
> doing unnecessary processing leaves more clock cycles for high-value
> processing.
> 
> Mike
> 

Yes, the capabilities of the different types of devices are different.

However, is a better solution not to provide the ability to query a NIC if it 
supports promiscuous mode, rather than set up a specific query for a VF? What 
if (hypothetically) you get a PF that doesn't support promiscuous mode, for 
instance, or a bifurcated driver where the kernel part prevents the userspace 
part from enabling promiscuous mode? In both these cases, having a direct
feature query works better than asking about PF/VF.

Regards,

/Bruce
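
The thread implies there was no direct promiscuous-capability query in the
ethdev API at the time. As a rough illustration only, one could enable
promiscuous mode and read the state back; this is a probe rather than a true
capability query, and it will not catch a driver that accepts the call without
the hardware honouring it:

#include <rte_ethdev.h>

static int
port_promisc_probe(uint8_t port_id)
{
	rte_eth_promiscuous_enable(port_id);

	/* rte_eth_promiscuous_get() returns 1 if enabled, 0 if not,
	 * -1 on an invalid port. */
	return rte_eth_promiscuous_get(port_id) == 1;
}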

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson
> Sent: Thursday, November 5, 2015 1:51 AM
> To: Shaham Fridenberg
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] SR-IOV: API to tell VF from PF
> 
> On Thu, Nov 05, 2015 at 09:39:19AM +, Shaham Fridenberg wrote:
> > Hey all,
> >
> > Is there some API to tell VF from PF?
> >
> > Only way I found so far is deducing that from driver name in the
> rte_eth_devices struct.
> >
> > Thanks,
> > Shaham
> 
> Hi Shaham,
> 
> Yes, checking the driver name is probably the only way to do so.
> However, why do you need or want to know this? If you want to know the
> capabilities of a device, basing that on a list of known device types
> is probably not the best way.
> 
> Regards,
> /Bruce


[dpdk-dev] [PATCH v2] ethdev: Prefetch driver variable structure

2015-11-10 Thread Polehn, Mike A
Adds an ethdev driver prefetch of the queue variable structure to CPU cache 0
while calling into the tx or rx device driver operation.

RFC 2544 tests at the NIC task measurement points show lower latency and/or
better packet throughput, indicating clock cycles saved.

Signed-off-by: Mike A. Polehn 
---
lib/librte_ether/rte_ethdev.h | 16 +---
1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
index 48a540d..f1c35de 100644
--- a/lib/librte_ether/rte_ethdev.h
+++ b/lib/librte_ether/rte_ethdev.h
@@ -2458,12 +2458,17 @@ rte_eth_rx_burst(uint8_t port_id, uint16_t queue_id,
  struct rte_mbuf **rx_pkts, const uint16_t nb_pkts)
{
  struct rte_eth_dev *dev;
+  int16_t nb_rx;
   dev = &rte_eth_devices[port_id];
-  int16_t nb_rx = (*dev->rx_pkt_burst)(dev->data->rx_queues[queue_id],
-rx_pkts, nb_pkts);
+  { /* limit scope of rxq variable */
+ /* rxq is going to be immediately used, prefetch it */
+ void *rxq = dev->data->rx_queues[queue_id];
+ rte_prefetch0(rxq);
+ nb_rx = (*dev->rx_pkt_burst)(rxq, rx_pkts, nb_pkts);
+  }
#ifdef RTE_ETHDEV_RXTX_CALLBACKS
  struct rte_eth_rxtx_callback *cb = dev->post_rx_burst_cbs[queue_id];
@@ -2600,6 +2605,7 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
  struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
{
  struct rte_eth_dev *dev;
+  void *txq;
   dev = &rte_eth_devices[port_id];
@@ -2615,7 +2621,11 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
  }
#endif
-  return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id], tx_pkts, 
nb_pkts);
+  /* txq is going to be immediately used, prefetch it */
+  txq = dev->data->tx_queues[queue_id];
+  rte_prefetch0(txq);
+
+  return (*dev->tx_pkt_burst)(txq, tx_pkts, nb_pkts);
}
#endif
--
2.6.0



[dpdk-dev] [PATCH v2] ethdev: Prefetch driver variable structure

2015-11-11 Thread Polehn, Mike A
It is probably the usual MS mail client operation issues; I'll resubmit.

-Original Message-
From: Stephen Hemminger [mailto:step...@networkplumber.org] 
Sent: Tuesday, November 10, 2015 9:03 AM
To: Polehn, Mike A
Cc: dev at dpdk.org
Subject: Re: [dpdk-dev] [PATCH v2] ethdev: Prefetch driver variable structure

On Tue, 10 Nov 2015 14:17:41 +0000
"Polehn, Mike A"  wrote:

> Adds an ethdev driver prefetch of the queue variable structure to CPU cache
> 0 while calling into the tx or rx device driver operation.
> 
> RFC 2544 tests at the NIC task measurement points show lower latency and/or
> better packet throughput, indicating clock cycles saved.
> 
> Signed-off-by: Mike A. Polehn 

Good idea, but lots of whitespace issues.
Please also check your mail client.


ERROR: patch seems to be corrupt (line wrapped?)
#80: FILE: lib/librte_ether/rte_ethdev.h:2457:
,

WARNING: please, no spaces at the start of a line
#84: FILE: lib/librte_ether/rte_ethdev.h:2460:
+  int16_t nb_rx;$

WARNING: please, no spaces at the start of a line
#89: FILE: lib/librte_ether/rte_ethdev.h:2462:
+  { /* limit scope of rxq variable */$

ERROR: code indent should use tabs where possible
#90: FILE: lib/librte_ether/rte_ethdev.h:2463:
+ /* rxq is going to be immediately used, prefetch it */$

ERROR: code indent should use tabs where possible
#91: FILE: lib/librte_ether/rte_ethdev.h:2464:
+ void *rxq =3D dev->data->rx_queues[queue_id];$

WARNING: please, no spaces at the start of a line
#91: FILE: lib/librte_ether/rte_ethdev.h:2464:
+ void *rxq =3D dev->data->rx_queues[queue_id];$

ERROR: spaces required around that '=' (ctx:WxV)
#91: FILE: lib/librte_ether/rte_ethdev.h:2464:
+ void *rxq =3D dev->data->rx_queues[queue_id];
^

ERROR: code indent should use tabs where possible
#92: FILE: lib/librte_ether/rte_ethdev.h:2465:
+ rte_prefetch0(rxq);$

WARNING: Missing a blank line after declarations
#92: FILE: lib/librte_ether/rte_ethdev.h:2465:
+ void *rxq =3D dev->data->rx_queues[queue_id];
+ rte_prefetch0(rxq);

WARNING: please, no spaces at the start of a line
#92: FILE: lib/librte_ether/rte_ethdev.h:2465:
+ rte_prefetch0(rxq);$

ERROR: code indent should use tabs where possible
#93: FILE: lib/librte_ether/rte_ethdev.h:2466:
+ nb_rx =3D (*dev->rx_pkt_burst)(rxq, rx_pkts, nb_pkts);$

WARNING: please, no spaces at the start of a line
#93: FILE: lib/librte_ether/rte_ethdev.h:2466:
+ nb_rx =3D (*dev->rx_pkt_burst)(rxq, rx_pkts, nb_pkts);$

WARNING: space prohibited between function name and open parenthesis '('
#93: FILE: lib/librte_ether/rte_ethdev.h:2466:
+ nb_rx =3D (*dev->rx_pkt_burst)(rxq, rx_pkts, nb_pkts);

ERROR: spaces required around that '=' (ctx:WxV)
#93: FILE: lib/librte_ether/rte_ethdev.h:2466:
+ nb_rx =3D (*dev->rx_pkt_burst)(rxq, rx_pkts, nb_pkts);
^

WARNING: please, no spaces at the start of a line
#94: FILE: lib/librte_ether/rte_ethdev.h:2467:
+  }$

WARNING: please, no spaces at the start of a line
#102: FILE: lib/librte_ether/rte_ethdev.h:2607:
+  void *txq;$

WARNING: please, no spaces at the start of a line
#110: FILE: lib/librte_ether/rte_ethdev.h:2624:
+  txq =3D dev->data->tx_queues[queue_id];$

ERROR: spaces required around that '=' (ctx:WxV)
#110: FILE: lib/librte_ether/rte_ethdev.h:2624:
+  txq =3D dev->data->tx_queues[queue_id];
   ^

WARNING: please, no spaces at the start of a line
#111: FILE: lib/librte_ether/rte_ethdev.h:2625:
+  rte_prefetch0(txq);$

WARNING: please, no spaces at the start of a line
#113: FILE: lib/librte_ether/rte_ethdev.h:2627:
+  return (*dev->tx_pkt_burst)(txq, tx_pkts, nb_pkts);$

total: 8 errors, 12 warnings, 38 lines checked


[dpdk-dev] [PATCH v2] ethdev: Prefetch driver variable structure

2015-11-11 Thread Polehn, Mike A
Adds an ethdev driver prefetch of the queue variable structure to CPU cache 0
while calling into the tx or rx device driver operation.

RFC 2544 tests at the NIC task measurement points show lower latency and/or
better packet throughput, indicating clock cycles saved.

Signed-off-by: Mike A. Polehn 
---
 lib/librte_ether/rte_ethdev.h | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
index 48a540d..f1c35de 100644
--- a/lib/librte_ether/rte_ethdev.h
+++ b/lib/librte_ether/rte_ethdev.h
@@ -2458,12 +2458,17 @@ rte_eth_rx_burst(uint8_t port_id, uint16_t queue_id,
 struct rte_mbuf **rx_pkts, const uint16_t nb_pkts)
 {
struct rte_eth_dev *dev;
+   int16_t nb_rx;

dev = &rte_eth_devices[port_id];

-   int16_t nb_rx = (*dev->rx_pkt_burst)(dev->data->rx_queues[queue_id],
-   rx_pkts, nb_pkts);
+   { /* limit scope of rxq variable */
+   /* rxq is going to be immediately used, prefetch it */
+   void *rxq = dev->data->rx_queues[queue_id];
+   rte_prefetch0(rxq);

+   nb_rx = (*dev->rx_pkt_burst)(rxq, rx_pkts, nb_pkts);
+   }
 #ifdef RTE_ETHDEV_RXTX_CALLBACKS
struct rte_eth_rxtx_callback *cb = dev->post_rx_burst_cbs[queue_id];

@@ -2600,6 +2605,7 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
 struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 {
struct rte_eth_dev *dev;
+   void *txq;

dev = &rte_eth_devices[port_id];

@@ -2615,7 +2621,11 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
}
 #endif

-   return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id], tx_pkts, 
nb_pkts);
+   /* txq is going to be immediately used, prefetch it */
+   txq = dev->data->tx_queues[queue_id];
+   rte_prefetch0(txq);
+
+   return (*dev->tx_pkt_burst)(txq, tx_pkts, nb_pkts);
 }
 #endif

-- 
2.6.0