Hi Chris,

Here is the reply I sent to Jun Han from pktgen at gmail.com, which happens to match Venky's statements as well :-) Let me know if you see anything else that may be wrong with Pktgen, but the numbers are correct :-)

--------------------------------------------------------
Hi Jun,

That does make more sense with 12x10G ports. Regarding the other papers, do you have some links, or are you able to share those papers?

From what I can tell you now have 12x10G ports, which means you have 6x20Gbits, or 120Gbits of bi-directional bandwidth in the system. My previous email still holds true: Pktgen can send and receive traffic at 10Gbits/s with 64-byte packets, which for a full-duplex port is 20Gbits of bandwidth. (PortCnt/2) * 20Gbits = 120Gbits is the way I calculate the performance. You can check with Intel, but the performance looks correct to me.

Only getting 80Gbits for 64-byte packets seems low to me, as I would have expected 120Gbits, the same as for 1500-byte packets. It is possible your system has hit some bottleneck on the total number of packets per second. Normally this is memory bandwidth, PCI bandwidth, or transactions per second on the PCI bus.

Run the system with 10 ports, 8 ports, 6 ports, ... and see if the 64-byte packet rate changes, as this will tell you something about the system's total bandwidth. For 10 ports you should get (10/2) * 20 = 100Gbits, ...
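
A minimal sketch of that arithmetic, in plain C rather than anything Pktgen-specific, for the port counts suggested above:

#include <stdio.h>

/*
 * Back-to-back forwarding setup: each port pair carries 10 Gbit/s in
 * each direction, so the aggregate is (port_count / 2) * 20 Gbit/s.
 */
static int expected_gbits(int port_count)
{
    return (port_count / 2) * 20;
}

int main(void)
{
    const int counts[] = { 12, 10, 8, 6 };

    for (int i = 0; i < 4; i++)
        printf("%2d ports -> %3d Gbit/s aggregate\n",
               counts[i], expected_gbits(counts[i]));
    return 0;
}

If the 64-byte rate stops scaling with port count while the 1500-byte rate keeps scaling, that points at a packets-per-second ceiling rather than a raw bandwidth ceiling, which is the bottleneck described above.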

Thank you, ++Keith
-------------------------------
Keith Wiles
pktgen.dpdk at gmail.com
Principal Technologist for Networking
Wind River Systems

On Sep 21, 2013, at 9:05 AM, Jun Han <junhanece at gmail.com> wrote:

Hi Keith,

I think you misunderstood my setup. As mentioned in my previous email, I have 6 dual-port 10Gbps NICs, meaning a total of 12 10G ports per machine. They are connected back to back to another machine with an identical setup. Hence, we get a total of 120Gbps for 1500-byte packets, and 80Gbps for 64-byte packets. We did a theoretical calculation and found that this should be possible, as it does not exceed the PCIe bandwidth of our machine, nor the QPI bandwidth when packets are forwarded across the NUMA nodes. Our machine block diagram is shown below, with three NICs per riser slot. We were careful to pin the NIC ports to cores of the CPU sockets that are directly connected to their riser slots.

Do these numbers make sense to you? As stated in my previous email, we find that these numbers are much higher than those reported in other papers in this domain, so I wanted to ask for your input or thoughts on this.

<image.png>

Thank you very much,

Jun


On Sat, Sep 21, 2013 at 4:58 AM, Pktgen DPDK <pktgen.dpdk at gmail.com> wrote:
Hi Jun,

I do not have any numbers with that many ports, as I have a very limited number of machines and 10G NICs. I can tell you that Pktgen, if set up correctly, can send 14.88 Mpps (million packets per second), or wire rate, for 64-byte packets. The DPDK L2FWD code is able to forward at wire rate for 64-byte packets. If each port is sending wire-rate traffic and receiving wire-rate traffic, then you have 10Gbits in each direction, or 20Gbits per port pair. You have 6 ports, or 3 port pairs, doing 20Gbits x 3 = 60Gbits of traffic at 64-byte packets, assuming you do not hit a limit on the PCIe bus or the NIC.
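
For reference, the 14.88 Mpps figure follows from the 10GbE framing overhead: each 64-byte frame also occupies a 7-byte preamble, a 1-byte start-of-frame delimiter and a 12-byte inter-frame gap, i.e. 84 bytes on the wire. A small stand-alone C sketch of those numbers:

#include <stdio.h>

int main(void)
{
    const double line_rate_bps = 10e9;  /* 10GbE line rate                  */
    const double frame_bytes   = 64.0;  /* minimum Ethernet frame           */
    const double wire_overhead = 20.0;  /* preamble + SFD + inter-frame gap */

    /* Wire rate in packets per second for 64-byte frames. */
    double pps = line_rate_bps / ((frame_bytes + wire_overhead) * 8.0);

    printf("64B wire rate per 10G port: %.2f Mpps\n", pps / 1e6);
    printf("3 port pairs x 20 Gbit/s = %d Gbit/s aggregate\n", 3 * 20);
    return 0;
}

At 64 bytes each 10G port therefore moves about 14.88 million packets per second, so 12 ports at wire rate would be roughly 178 Mpps across the whole system.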

On my Westmere machine with a total of 4 10G ports on two NIC cards I cannot get 40Gbits of data; I hit a PCIe bug and can only get about 32Gbits, if I remember correctly. The newer systems do not have this bug.

Sending frames larger than 64 bytes means you send fewer packets per second to obtain 10Gbits of throughput. You cannot get more than 10Gbits, or 20Gbits of bi-directional traffic, per port.

If Pktgen is reporting more than 60Gbits per second of throughput for 6 ports, then Pktgen has a bug. If Pktgen is reporting more than 10Gbits of Rx or Tx traffic on a port, then Pktgen has a bug. I have never seen Pktgen report more than 10Gbits Rx or Tx.

The most throughput for 6 ports in this forwarding configuration would be 60Gbits (3 x 20Gbits). If you had each port both sending and receiving traffic, rather than a forwarding configuration, then you could get 20Gbits per port, or 120Gbits. Does this make sense?

Let's say on a single machine you loop back the Tx/Rx on each port, so the packet sent is received by the same port; then you would have 20Gbits of bi-directional traffic per port. The problem is that this is not how your system is configured: you are consuming two ports per 20Gbits of traffic.

I hope I have the above correct as it is late for me :-) If you see something 
wrong with my statements please let me know what I did wrong in my logic.

Thank you, ++Keith
-------------------------------
Keith Wiles
pktgen.dpdk at gmail.com
Principal Technologist for Networking
Wind River Systems

On Sep 20, 2013, at 2:11 PM, Jun Han <junhanece at gmail.com> wrote:

Hi Keith,

Thanks so much for all your prompt replies. Thanks to you, we are now utilizing 
your packet gen code.

We have a question about the performance numbers we are measuring with your packet gen program. The current setup is the following:

We have two machines, each equipped with 6 dual-port 10 GbE NICs. Machine 0 runs the DPDK L2FWD code, and Machine 1 runs your packet gen. L2FWD is modified to forward the incoming packets to another statically assigned output port. With this setup, we are getting 120 Gbps throughput measured by your packet gen with a packet size of 1500 bytes. For 64-byte packets, we are getting around 80 Gbps.

Do these performance numbers make sense? We are reading related papers in this domain, and it seems like our numbers are unusually high. Could you please give us your thoughts on this, or share the performance numbers from your setup?

Thank you so much,

Jun

Keith Wiles, Principal Technologist for Networking, member of the CTO office, Wind River

On Sep 22, 2013, at 1:41 PM, Venkatesan, Venky <venky.venkatesan at intel.com> wrote:

Chris,

The numbers you are getting are correct. :)

Practically speaking, most motherboards pin out between 4 and 5 x8 slots to every CPU socket. At PCIe Gen 2 speeds (5 GT/s), each slot is capable of carrying 20 Gb/s of traffic (limited to ~16 Gb/s with 64B packets). I would have expected the 64-byte traffic capacity to be a bit higher than 80 Gb/s, but either way the numbers you are achieving are well within the capability of the system if you are careful about pinning cores to ports, which you seem to be doing. QPI is not a limiter either for the amount of traffic you are generating currently.
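
As a rough illustration of where per-slot figures like these come from (the per-packet PCIe overhead value below is an assumed, illustrative number, not something quoted in this thread):

#include <stdio.h>

int main(void)
{
    /* PCIe Gen2: 5 GT/s per lane with 8b/10b encoding -> 4 Gbit/s per lane. */
    const double lane_gbps = 5.0 * 8.0 / 10.0;
    const double raw_x8    = 8.0 * lane_gbps;   /* 32 Gbit/s per direction */

    /*
     * Assumed per-packet cost on the bus (TLP headers plus descriptor
     * fetch/writeback traffic) -- an illustrative guess, not a measurement.
     */
    const double pkt_bytes      = 64.0;
    const double overhead_bytes = 64.0;
    double efficiency = pkt_bytes / (pkt_bytes + overhead_bytes);

    printf("x8 Gen2 raw bandwidth: %.0f Gbit/s per direction\n", raw_x8);
    printf("64B effective with %.0f B/pkt assumed overhead: ~%.0f Gbit/s\n",
           overhead_bytes, raw_x8 * efficiency);
    return 0;
}

With 1500-byte frames the same per-packet overhead is negligible, which is consistent with the large-packet case reaching the full 120 Gb/s.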

Regards,
-Venky

-----Original Message-----
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Chris Pappas
Sent: Sunday, September 22, 2013 7:32 AM
To: dev at dpdk.org
Subject: [dpdk-dev] Question regarding throughput number with DPDK l2fwd with 
Wind River System's pktgen

Hi,

We have a question about the performance numbers we are measuring with the pktgen application provided by Wind River Systems. The current setup is the following:

We have two machines, each equipped with 6 dual-port 10 GbE NICs (a total of 12 ports). Machine 0 runs the DPDK L2FWD code, and Machine 1 runs Wind River Systems' pktgen. L2FWD is modified to forward the incoming packets to another statically assigned output port.

Our machines have two Intel Xeon E5-2600 CPUs connected via QPI, and two riser slots, each holding three 10Gbps NICs. Two NICs in riser slot 1 (NIC0 and NIC1) are connected to CPU 1 via PCIe Gen3, while the remaining NIC2 is connected to CPU 2, also via PCIe Gen3. In riser slot 2, all NICs (NICs 3, 4, and 5) are connected to CPU 2 via PCIe Gen3. We were careful to assign the NIC ports to cores of the CPU sockets they are directly connected to, in order to achieve maximum performance.


With this setup, we are getting 120 Gbps throughput measured by pktgen with a packet size of 1500 bytes. For 64-byte packets, we are getting around 80 Gbps. Do these performance numbers make sense? We are reading related papers in this domain, and it seems like our numbers are unusually high. We did a theoretical calculation and found that this should be possible, because it does not exceed the PCIe bandwidth of our machine, nor the QPI bandwidth when packets are forwarded across the NUMA nodes. Can you share your thoughts / experience with this?

Thank you,

Chris
