[dpdk-dev] NUMA CPU Sockets and DPDK

2014-02-12 Thread Prashant Upadhyaya
Hi guys,

What has been your experience of using DPDK-based apps in NUMA mode with
multiple sockets, where some cores are present on one socket and other cores on
the other socket?

I am migrating my application from an Intel machine with 8 cores, all on one
socket, to a 32-core machine where 16 cores are on one socket and the other 16
cores are on the second socket.
My core 0 does all initialization for mbufs, NIC ports, queues, etc. and uses
SOCKET_ID_ANY for socket-related parameters.

The use case works, but I think I am running into performance issues on the
32-core machine.
The lscpu output on my 32-core machine shows the following:
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
I am using core 1 to lift all the data from a single queue of an 82599EB port
and I see that the CPU utilization for this core 1 is way too high even for
lifting traffic of 1 Gbps with a packet size of 650 bytes.

In general, does one need to be careful when working with multiple sockets and
so forth? Any comments would be helpful.

Regards
-Prashant







[dpdk-dev] NUMA CPU Sockets and DPDK

2014-02-12 Thread Etai Lev Ran
Hi Prashant,

Based on our experience, using DPDK across CPU sockets may indeed result in
some performance degradation (~10% for our application vs. staying
in-socket; YMMV based on HW, application structure, etc.).

Regarding CPU utilization on core 1, the one picking up traffic: perhaps I
had misunderstood your comment, but I would expect it to always be close 
to 100% since it's polling the device via the PMD and not driven by
interrupts. 

Regards,
Etai

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Prashant Upadhyaya
Sent: Wednesday, February 12, 2014 1:28 PM
To: dev at dpdk.org
Subject: [dpdk-dev] NUMA CPU Sockets and DPDK

Hi guys,

What has been your experience of using DPDK-based apps in NUMA mode with
multiple sockets, where some cores are present on one socket and other cores
on the other socket?

I am migrating my application from an Intel machine with 8 cores, all on
one socket, to a 32-core machine where 16 cores are on one socket and the
other 16 cores are on the second socket.
My core 0 does all initialization for mbufs, NIC ports, queues, etc. and
uses SOCKET_ID_ANY for socket-related parameters.

The use case works, but I think I am running into performance issues on the
32-core machine.
The lscpu output on my 32-core machine shows the following:
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
I am using core 1 to lift all the data from a single queue of an 82599EB
port and I see that the CPU utilization for this core 1 is way too high even
for lifting traffic of 1 Gbps with a packet size of 650 bytes.

In general, does one need to be careful when working with multiple sockets and
so forth? Any comments would be helpful.

Regards
-Prashant









[dpdk-dev] NUMA CPU Sockets and DPDK

2014-02-12 Thread Richardson, Bruce
> 
> What has been your experience of using DPDK-based apps in NUMA mode
> with multiple sockets, where some cores are present on one socket and
> other cores on the other socket?
> 
> I am migrating my application from an Intel machine with 8 cores, all on
> one socket, to a 32-core machine where 16 cores are on one socket and the
> other 16 cores are on the second socket.
> My core 0 does all initialization for mbufs, NIC ports, queues, etc. and uses
> SOCKET_ID_ANY for socket-related parameters.


It is recommended that you decide ahead of time which cores on which NUMA 
socket the different parts of your application are going to run on, and then 
set up your objects in memory appropriately. SOCKET_ID_ANY should only be used 
to allocate items that are not used in the data path and for which you 
therefore don't care about access time. Any objects such as rings or mempools 
should be created by specifying the correct socket to allocate the memory on. 
If you are working with two sockets, in some cases you may want to duplicate 
your data structures, for example, use two memory pools - one on each socket - 
instead of one, so that all data access is local.
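
For illustration only (this sketch is not from the thread; the pool sizes, the
names and the assumption of exactly two sockets are made up), creating one mbuf
pool per NUMA socket with the classic rte_mempool_create() pattern might look
roughly like this:

    #include <stdio.h>
    #include <stdlib.h>
    #include <rte_debug.h>
    #include <rte_mempool.h>
    #include <rte_mbuf.h>

    #define NB_MBUF    8192
    #define MBUF_SIZE  (2048 + sizeof(struct rte_mbuf) + RTE_PKTMBUF_HEADROOM)
    #define NB_SOCKETS 2   /* assumption: a two-socket machine */

    static struct rte_mempool *pktmbuf_pool[NB_SOCKETS];

    static void
    create_per_socket_pools(void)
    {
            char name[32];
            int socket;

            for (socket = 0; socket < NB_SOCKETS; socket++) {
                    snprintf(name, sizeof(name), "mbuf_pool_s%d", socket);
                    /* allocate this pool's memory on the given NUMA socket,
                     * instead of SOCKET_ID_ANY, so lcores on that socket
                     * always get local access */
                    pktmbuf_pool[socket] = rte_mempool_create(name, NB_MBUF,
                                    MBUF_SIZE, 256,
                                    sizeof(struct rte_pktmbuf_pool_private),
                                    rte_pktmbuf_pool_init, NULL,
                                    rte_pktmbuf_init, NULL,
                                    socket, 0);
                    if (pktmbuf_pool[socket] == NULL)
                            rte_exit(EXIT_FAILURE,
                                     "cannot create mbuf pool on socket %d\n",
                                     socket);
            }
    }

Each lcore then takes its mbufs (and its rings) from the pool that matches its
own socket, so the fast path never reaches across the QPI link for pool memory.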

> 
> The use case works, but I think I am running into performance issues on the
> 32-core machine.
> The lscpu output on my 32-core machine shows the following:
> NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
> NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
> I am using core 1 to lift all the data from a single queue of an 82599EB port
> and I see that the CPU utilization for this core 1 is way too high even for
> lifting traffic of 1 Gbps with a packet size of 650 bytes.

How are you measuring the CPU utilization? When using the Intel DPDK, in most 
cases your CPU utilization will always be 100%, as you are constantly polling, 
so actual CPU headroom can be hard to judge at times.
Another thing to consider is the NUMA nodes to which your NICs are connected. 
You can check which NUMA socket your NIC is connected to using 
rte_eth_dev_socket_id() - assuming a modern platform where the PCI connects 
straight to the CPUs. Whichever NUMA node the NIC is connected to, you want to 
run the code polling the NIC RX queues on that NUMA node, and do all packet 
transmission using cores on that NUMA node.
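
As a hedged illustration of that check (the function name and the warning
message are made up; the uint8_t port type follows the API of this DPDK
generation), port/lcore affinity might be verified like so:

    #include <stdio.h>
    #include <stdint.h>
    #include <rte_ethdev.h>
    #include <rte_lcore.h>

    static void
    check_port_affinity(uint8_t port_id, unsigned rx_lcore_id)
    {
            int port_socket  = rte_eth_dev_socket_id(port_id);
            int lcore_socket = (int)rte_lcore_to_socket_id(rx_lcore_id);

            /* rte_eth_dev_socket_id() may return -1 if the NUMA node
             * cannot be determined */
            if (port_socket >= 0 && port_socket != lcore_socket)
                    printf("WARNING: port %u is on socket %d but lcore %u "
                           "is on socket %d; RX polling will cross the "
                           "QPI link\n", (unsigned)port_id, port_socket,
                           rx_lcore_id, lcore_socket);
    }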

> 
> In general, does one need to be careful when working with multiple sockets
> and so forth? Any comments would be helpful.

In general, yes, you need to be a bit more careful, but the basic rules as 
outlined above should give you a good start.


[dpdk-dev] NUMA CPU Sockets and DPDK

2014-02-12 Thread Prashant Upadhyaya
Hi Etai,

Of course all DPDK threads consume 100% (unless some waits are introduced for
power saving etc.; all typical DPDK threads are while(1) loops).
When I said core 1 is unusually busy, I meant that it is not able to
read beyond 2 Gbps or so and packets are dropping at the NIC.
(I have my own custom way of calculating the CPU utilization of core 1, based on
how many polls were empty and how many polls got me data, which I then
process.)
On the 8-core machine with a single socket, core 1 was able to lift much higher
data rates successfully, hence the question.
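
For illustration only (this is not the actual accounting code from the thread;
the burst size, the reporting interval and the names are made up), such
empty-poll based load estimation might look roughly like this:

    #include <stdio.h>
    #include <stdint.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    static void
    rx_loop(uint8_t port, uint16_t queue)
    {
            struct rte_mbuf *bufs[BURST_SIZE];
            uint64_t empty_polls = 0, busy_polls = 0;
            uint16_t i, nb_rx;

            for (;;) {
                    nb_rx = rte_eth_rx_burst(port, queue, bufs, BURST_SIZE);
                    if (nb_rx == 0) {
                            empty_polls++;
                    } else {
                            busy_polls++;
                            for (i = 0; i < nb_rx; i++)
                                    rte_pktmbuf_free(bufs[i]); /* process here */
                    }

                    /* every ~1M polls, report the share that returned packets
                     * - a rough proxy for how loaded this always-spinning
                     * core really is */
                    if (((empty_polls + busy_polls) & 0xFFFFF) == 0)
                            printf("approx load: %.1f%%\n",
                                   100.0 * busy_polls /
                                   (busy_polls + empty_polls));
            }
    }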

Regards
-Prashant


-Original Message-
From: Etai Lev Ran [mailto:elev...@gmail.com]
Sent: Wednesday, February 12, 2014 5:18 PM
To: Prashant Upadhyaya
Cc: dev at dpdk.org
Subject: RE: [dpdk-dev] NUMA CPU Sockets and DPDK

Hi Prashant,

Based on our experience, using DPDK across CPU sockets may indeed result in some 
performance degradation (~10% for our application vs. staying in-socket; YMMV 
based on HW, application structure, etc.).

Regarding CPU utilization on core 1, the one picking up traffic: perhaps I had 
misunderstood your comment, but I would expect it to always be close to 100% 
since it's polling the device via the PMD and not driven by interrupts.

Regards,
Etai

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Prashant Upadhyaya
Sent: Wednesday, February 12, 2014 1:28 PM
To: dev at dpdk.org
Subject: [dpdk-dev] NUMA CPU Sockets and DPDK

Hi guys,

What has been your experience of using DPDK-based apps in NUMA mode with 
multiple sockets, where some cores are present on one socket and other cores on 
the other socket?

I am migrating my application from an Intel machine with 8 cores, all on one 
socket, to a 32-core machine where 16 cores are on one socket and the other 16 
cores are on the second socket.
My core 0 does all initialization for mbufs, NIC ports, queues, etc. and uses 
SOCKET_ID_ANY for socket-related parameters.

The use case works, but I think I am running into performance issues on the
32-core machine.
The lscpu output on my 32-core machine shows the following:
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
I am using core 1 to lift all the data from a single queue of an 82599EB port 
and I see that the CPU utilization for this core 1 is way too high even for 
lifting traffic of 1 Gbps with a packet size of 650 bytes.

In general, does one need to be careful when working with multiple sockets and 
so forth? Any comments would be helpful.

Regards
-Prashant








[dpdk-dev] NUMA CPU Sockets and DPDK

2014-02-12 Thread François-Frédéric Ozog
Hi Prashant,

Maybe you could monitor RAM, QPI and PCIe activity with
http://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization

It may help in investigating the issue.

François-Frédéric


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Prashant Upadhyaya
> Sent: Wednesday, February 12, 2014 1:03 PM
> To: Etai Lev Ran
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] NUMA CPU Sockets and DPDK
> 
> Hi Etai,
> 
> Of course all DPDK threads consume 100% (unless some waits are introduced
> for power saving etc.; all typical DPDK threads are while(1) loops).
> When I said core 1 is unusually busy, I meant that it is not able to
> read beyond 2 Gbps or so and packets are dropping at the NIC.
> (I have my own custom way of calculating the CPU utilization of core 1,
> based on how many polls were empty and how many polls got me data,
> which I then process.) On the 8-core machine with a single socket, core 1
> was able to lift much higher data rates successfully, hence the
> question.
> 
> Regards
> -Prashant
> 
> 
> -Original Message-
> From: Etai Lev Ran [mailto:elevran at gmail.com]
> Sent: Wednesday, February 12, 2014 5:18 PM
> To: Prashant Upadhyaya
> Cc: dev at dpdk.org
> Subject: RE: [dpdk-dev] NUMA CPU Sockets and DPDK
> 
> Hi Prashant,
> 
> Based on our experience, using DPDK across CPU sockets may indeed result in
> some performance degradation (~10% for our application vs. staying
> in-socket; YMMV based on HW, application structure, etc.).
> 
> Regarding CPU utilization on core 1, the one picking up traffic: perhaps I
> had misunderstood your comment, but I would expect it to always be close
> to 100% since it's polling the device via the PMD and not driven by
> interrupts.
> 
> Regards,
> Etai
> 
> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Prashant Upadhyaya
> Sent: Wednesday, February 12, 2014 1:28 PM
> To: dev at dpdk.org
> Subject: [dpdk-dev] NUMA CPU Sockets and DPDK
> 
> Hi guys,
> 
> What has been your experience of using DPDK-based apps in NUMA mode with
> multiple sockets, where some cores are present on one socket and other
> cores on the other socket?
> 
> I am migrating my application from an Intel machine with 8 cores, all on
> one socket, to a 32-core machine where 16 cores are on one socket and the
> other 16 cores are on the second socket.
> My core 0 does all initialization for mbufs, NIC ports, queues, etc. and
> uses SOCKET_ID_ANY for socket-related parameters.
> 
> The use case works, but I think I am running into performance issues on the
> 32-core machine.
> The lscpu output on my 32-core machine shows the following:
> NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
> NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
> I am using core 1 to lift all the data from a single queue of an 82599EB
> port and I see that the CPU utilization for this core 1 is way too high
> even for lifting traffic of 1 Gbps with a packet size of 650 bytes.
> 
> In general, does one need to be careful when working with multiple sockets
> and so forth? Any comments would be helpful.
> 
> Regards
> -Prashant



[dpdk-dev] Is it possible to have dpdk running with no dependency on a nic ?

2014-02-12 Thread Ymo Lists
1) I have two apps that need to communicate on the same machine. Is it
possible to have these two apps communicate via DPDK without referencing
a NIC?

2) The apps need to run on an Amazon VM. How can you run DPDK on an Amazon
VM with only one NIC if the above is not possible?
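
One possible approach to (1), offered purely as a hedged sketch rather than
something discussed on the list: DPDK's primary/secondary multi-process model
lets two EAL processes share hugepage memory, so they can exchange pointers
over a named rte_ring without any NIC involved. The names below are
illustrative:

    #include <rte_ring.h>
    #include <rte_lcore.h>

    #define IPC_RING_NAME "app_ipc_ring"

    /* in the primary process: create the ring in shared hugepage memory */
    static struct rte_ring *
    ipc_ring_create(void)
    {
            return rte_ring_create(IPC_RING_NAME, 1024, rte_socket_id(), 0);
    }

    /* in the secondary process (started with --proc-type=secondary):
     * attach to the same ring by name */
    static struct rte_ring *
    ipc_ring_attach(void)
    {
            return rte_ring_lookup(IPC_RING_NAME);
    }

    /* either side can then pass message pointers - which must point into
     * memory visible to both processes, e.g. objects from a shared mempool -
     * with rte_ring_enqueue()/rte_ring_dequeue() */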


[dpdk-dev] condition for calling ixgbe_xmit_cleanup

2014-02-12 Thread Qing Wan
Hi,



There is the following code in the function ixgbe_xmit_pkts():



if ((txq->nb_tx_desc - txq->nb_tx_free) > txq->tx_free_thresh) {
        ixgbe_xmit_cleanup(txq);
}



My understanding is that nb_tx_desc means the total number of descriptors in
the ring and nb_tx_free represents how many descriptors are available, so
txq->nb_tx_desc - txq->nb_tx_free means how many we have used. I don't
quite understand the meaning of this comparison. Why is the condition
not something like "if (txq->nb_tx_free < tx_free_thresh)"?



I'd really appreciate it if someone could help me with this.



Thanks

Qing



[dpdk-dev] condition for calling ixgbe_xmit_cleanup

2014-02-12 Thread Shaw, Jeffrey B
Hi Qing,

The idea is that we do not want to clean the descriptor ring until we have used 
"enough" descriptors.
So (nb_tx_desc - nb_tx_free) tells us how many descriptors we've used. Once 
we've used "enough" (i.e. tx_free_thresh), we will try to clean the 
descriptor ring.
If you look at the simpler "tx_xmit_pkts()" (simple is kind of a misnomer 
here... it refers to simplicity of features, not simplicity of implementation), 
we chose to implement the "nb_tx_free < tx_free_thresh" variant.
The only real difference is that the semantics of "tx_free_thresh" change from 
"free descriptors after this many are used" to "free descriptors after this 
many are remaining".
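
To make the two semantics concrete, here is a small self-contained toy; the
struct is a pared-down stand-in for the real ixgbe TX queue, and the numbers
are made up:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct toy_txq {
            uint16_t nb_tx_desc;     /* total descriptors in the ring          */
            uint16_t nb_tx_free;     /* descriptors currently available        */
            uint16_t tx_free_thresh; /* threshold controlling when to clean up */
    };

    /* ixgbe_xmit_pkts() style: clean after "this many are used" */
    static bool should_clean_used(const struct toy_txq *q)
    {
            return (q->nb_tx_desc - q->nb_tx_free) > q->tx_free_thresh;
    }

    /* tx_xmit_pkts() style: clean when only "this many remain free" */
    static bool should_clean_remaining(const struct toy_txq *q)
    {
            return q->nb_tx_free < q->tx_free_thresh;
    }

    int main(void)
    {
            struct toy_txq q = { .nb_tx_desc = 512, .nb_tx_free = 100,
                                 .tx_free_thresh = 32 };
            /* 412 descriptors used > 32, so the "used" variant would clean;
             * 100 free >= 32, so the "remaining" variant would not (yet) */
            printf("used-variant: %d, remaining-variant: %d\n",
                   should_clean_used(&q), should_clean_remaining(&q));
            return 0;
    }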

Thanks,
Jeff

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Qing Wan
Sent: Wednesday, February 12, 2014 3:50 PM
To: dev at dpdk.org
Subject: [dpdk-dev] condition for calling ixgbe_xmit_cleanup

Hi,



There is the following code in the function ixgbe_xmit_pkts():



if ((txq->nb_tx_desc - txq->nb_tx_free) > txq->tx_free_thresh) {
        ixgbe_xmit_cleanup(txq);
}



My understanding is that nb_tx_desc means the total number of descriptors in the
ring and nb_tx_free represents how many descriptors are available, so
txq->nb_tx_desc - txq->nb_tx_free means how many we have used. I don't
quite understand the meaning of this comparison. Why is the condition not
something like "if (txq->nb_tx_free < tx_free_thresh)"?



I'd really appreciate it if someone could help me with this.



Thanks

Qing