[dpdk-dev] Rx-errors with testpmd (only 75% line rate)

2014-01-24 Thread François-Frédéric Ozog
> -----Original Message-----
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Michael Quicquaro
> Sent: Friday, January 24, 2014 00:23
> To: Robert Sanford
> Cc: dev at dpdk.org; mayhan at mayhan.org
> Subject: Re: [dpdk-dev] Rx-errors with testpmd (only 75% line rate)
> 
> Thank you, everyone, for all of your suggestions, but unfortunately I'm
> still having the problem.
> 
> I have reduced the test down to using 2 cores (one is the master core), both
> of which are on the socket in which the NIC's PCI slot is connected.  I am
> running in rxonly mode, so I am basically just counting the packets.  I've
> tried all different burst sizes.  Nothing seems to make any difference.
> 
> Since my original post, I have acquired an IXIA tester so I have better
> control over my testing.  I send 250,000,000 packets to the interface.  I
> am getting roughly 25,000,000 Rx-errors with every run.  I have verified
> that the number of Rx-errors is consistent with the value in the RXMPC
> register of the NIC.
> 
> Just for sanity's sake, I tried switching the cores to the other socket and
> ran the same test.  As expected I got more packet loss.  Roughly 87,000,000
> 
> I am running Red Hat 6.4 which uses kernel 2.6.32-358
> 
> This is a numa supported system, but whether or not I use --numa doesn't
> seem to make a difference.
> 

Is the BIOS configured for NUMA? If not, the BIOS may program System Address
Decoding so that the memory address space is interleaved between sockets on 64MB
boundaries (you may have a look at the Xeon 7500 datasheet volume 2, a public
document, §4.4 for an "explanation" of this).

In general you don't want memory interleaving: QPI bandwidth tops out at 16GB/s
on the latest processors, while single-node aggregate memory bandwidth can
be over 60GB/s.


> Looking at the Intel documentation it appears that I should be able to
> easily do what I am trying to do.  Actually, the documentation implies that
> I should be able to do roughly 40 Gbps with a single 2.x GHz processor core
> with other configuration (memory, os, etc.) similar to my system.  It
> appears to me that many of the details of these benchmarks are missing.
> 
> Can someone on this list actually verify for me that what I am trying to do
> is possible and that they have done it with success?

I have done a NAT64 proof of concept that handled 40Gbps throughput on a
single Xeon E5 2697v2.
Intel NIC chip was 82599ES (if I recall correctly, I don't have the card
handy anymore), 4 rx queues 4 tx queues per port, 32768 descriptors per
queue, Intel DCA on, Ethernet pause parameters OFF: 14.8Mpps per port, no
packet loss.
However this was with a kernel based proprietary packet framework. I expect
DPDK to achieve the same results.

> 
> Much appreciation for all the help.
> - Michael
> 
> 
> On Wed, Jan 22, 2014 at 3:38 PM, Robert Sanford
> wrote:
> 
> > Hi Michael,
> >
> > > What can I do to trace down this problem?
> >
> > May I suggest that you try to be more selective in the core masks on
> > the command line. The test app may choose some cores from "other" CPU
> > sockets.
> > Only enable cores of the one socket to which the NIC is attached.
> >
> >
> > > It seems very similar to a
> > > thread on this list back in May titled "Best example for showing
> > > throughput?" where no resolution was ever mentioned in the thread.
> >
> > After re-reading *that* thread, it appears that their problem may have
> > been trying to achieve ~40 Gbits/s of bandwidth (2 ports x 10 Gb Rx +
> > 2 ports x 10 Gb Tx), plus overhead, over a typical dual-port NIC whose
> > total bus bandwidth is a maximum of 32 Gbits/s (PCI express 2.1 x8).


PCIe is "32Gbps" full duplex, meaning 32Gbps in each direction.
On a single dual-port card you have 20Gbps of inbound traffic (below 32Gbps)
and 20Gbps of outbound traffic (below 32Gbps).

A 10Gbps port runs at 10,000,000,000bps (10^10 bps, *not* a power of
two). A 64-byte frame (incl. CRC) also carries a preamble, a start-of-frame
delimiter and an interframe gap, so on the wire there are
7+1+64+12 = 84 bytes = 672 bits. The max packet rate is thus 10^10 / 672 =
14,880,952 pps.

On the PCI Express side, 60 bytes (frame excluding CRC) are
transferred in a single DMA transaction with additional overhead, plus
8b/10b encoding per packet:
(60 + 8 + 16) = 84 bytes (fits into a typical 128-byte max payload), or 840
'bits' on the link after 8b/10b encoding.
An 8-lane 5GT/s link (5 gigatransfers = 5*10^9 transfers per second per lane,
i.e. a "bit" every 200 picoseconds) can be viewed as a 40GT/s link, so we can
have 4*10^10 / 840 = 47,619,047 pps per direction (PCIe is full duplex).

So two fully loaded ports generate 29,761,904 pps in each direction, which can
be absorbed by PCI Express Gen2 x8 even taking the DMA overhead into account.
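
The arithmetic above can be checked with a few lines of C. This is only a
sketch under the assumptions stated in the message (7+1+12 bytes of Ethernet
framing overhead, 8+16 bytes of per-packet PCIe overhead, 8b/10b encoding).
The function names eth_max_pps() and pcie_max_pps() are made up for
illustration and are not part of DPDK.

```c
#include <stdint.h>

/* Ethernet per-frame overhead on the wire, in bytes. */
#define PREAMBLE_BYTES 7
#define SFD_BYTES      1
#define IFG_BYTES      12

/* Max packets per second on an Ethernet link; frame_bytes includes the CRC. */
uint64_t eth_max_pps(uint64_t link_bps, uint64_t frame_bytes)
{
    uint64_t wire_bits =
        (PREAMBLE_BYTES + SFD_BYTES + frame_bytes + IFG_BYTES) * 8;

    return link_bps / wire_bits;
}

/*
 * Max packets per second, per direction, on an 8b/10b encoded PCIe link.
 * overhead_bytes is the assumed per-packet DMA/TLP cost (8 + 16 in the
 * message above); real overhead depends on the device and chipset.
 */
uint64_t pcie_max_pps(uint64_t lanes, uint64_t transfers_per_lane,
                      uint64_t payload_bytes, uint64_t overhead_bytes)
{
    /* 8b/10b: every byte costs 10 transfers on the link. */
    uint64_t link_bits = (payload_bytes + overhead_bytes) * 10;

    return lanes * transfers_per_lane / link_bits;
}
```

With these inputs, eth_max_pps(10000000000, 64) yields 14,880,952 and
pcie_max_pps(8, 5000000000, 60, 24) yields 47,619,047, matching the figures
in the message.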

> >
> > --
> > Regards,
> > Robert
> >
> >



[dpdk-dev] [PATCH] timer: add lfence before TSC read

2014-01-24 Thread Didier Pallard
According to Intel Developer's Manual:

"The RDTSC instruction is not a serializing instruction. It does not
necessarily wait until all previous instructions have been executed before
reading the counter. Similarly, subsequent instructions may begin execution
before the read operation is performed. If software requires RDTSC to be
executed only after all previous instructions have completed locally, it can
either use RDTSCP (if the processor supports that instruction) or execute the
sequence LFENCE;RDTSC."

So add an lfence instruction before rdtsc to synchronize read operations and
ensure that the read is done at the expected instant.

Signed-off-by: Didier Pallard 
---
 lib/librte_eal/common/include/rte_cycles.h |3 +++
 1 file changed, 3 insertions(+)

diff --git a/lib/librte_eal/common/include/rte_cycles.h b/lib/librte_eal/common/include/rte_cycles.h
index cc6fe71..487dba6 100644
--- a/lib/librte_eal/common/include/rte_cycles.h
+++ b/lib/librte_eal/common/include/rte_cycles.h
@@ -110,6 +110,9 @@ rte_rdtsc(void)
};
} tsc;

+   /* serialize previous load instructions in pipe */
+   asm volatile("lfence");
+
 #ifdef RTE_LIBRTE_EAL_VMWARE_TSC_MAP_SUPPORT
if (unlikely(rte_cycles_vmware_tsc_map)) {
/* ecx = 0x1 corresponds to the physical TSC for VMware */
-- 
1.7.10.4



[dpdk-dev] [PATCH] timer: add lfence before TSC read

2014-01-24 Thread François-Frédéric Ozog
Hi,

Most of the time rdtsc is used for timestamping, and being off by a few
cycles is usually not an issue (a precision of 0.1us for a session start is
usually enough).

Sometimes you do need to serialize, because the interval you want to measure
is very short, on the order of a few nanoseconds.

If the code is running in a VM, which usually virtualizes the rdtsc
instruction, then it makes no sense to ask for more "precision".

IMHO, adding the lfence in all cases introduces an unnecessary
performance penalty.

What about adding rte_rdtsc_sync() or rte_rdtsc_serial(), with a comment
about the rdtsc instruction behavior, so that developers can choose which
form they want?

François-Frédéric
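
The two forms discussed in this thread could look roughly like the sketch
below. The names tsc_read() and tsc_read_serialized() are hypothetical and
are not the DPDK API. The inline assembly assumes an x86 target and a
GCC-compatible compiler; on other targets the sketch falls back to clock()
only so that it still builds.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)

/* Plain TSC read: cheap, but may execute before earlier instructions retire. */
uint64_t tsc_read(void)
{
    uint32_t lo, hi;

    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* LFENCE;RDTSC: earlier instructions complete locally before the read. */
uint64_t tsc_read_serialized(void)
{
    uint32_t lo, hi;

    __asm__ volatile("lfence\n\t"
                     "rdtsc"
                     : "=a"(lo), "=d"(hi)
                     :
                     : "memory");
    return ((uint64_t)hi << 32) | lo;
}

#else /* not x86: placeholder so the sketch compiles, not a cycle counter */

#include <time.h>

uint64_t tsc_read(void)            { return (uint64_t)clock(); }
uint64_t tsc_read_serialized(void) { return (uint64_t)clock(); }

#endif
```

Callers that only need coarse timestamps would use the plain form; code that
measures very short intervals would pay for the fence.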




[dpdk-dev] [PATCH] timer: add lfence before TSC read

2014-01-24 Thread Ananyev, Konstantin
Hi,
Totally agree with François-Frédéric.
Actually was going to suggest exactly the same thing.
BTW, there is rte_rmb() defined inside rte_atomic.h, which should produce an
lfence instruction.
Probably better to use it to keep the code consistent.
Konstantin 






[dpdk-dev] [PATCH v4 0/2] introduce if_index in device info

2014-01-24 Thread liljegren.ma...@gmail.com
Changes since v1:
- Split into two patches: Generic and pcap specific.
- Changed interface name to interface index

Changes since v2:
- Interface index is now unsigned
- Value 0 is now used to signal an error
- Added missing include of net/if.h in rte_eth_pcap.c
- Declared struct args_dict in rte_eth_pcap.h

Changes since v3:
- Changed commit messages
- Removed compiler warning about unused variable pair



[dpdk-dev] [PATCH v4 1/2] ethdev: introduce if_index in device info

2014-01-24 Thread liljegren.ma...@gmail.com
From: Mats Liljegren 

This field lets the pcap PMD identify the host interface as known to
Linux. It holds an interface index, which can be translated into an
interface name using the if_indextoname() function.

When using pcap, interrupt affinity becomes important, and this field
gives the application a chance to ensure that interrupt affinity is set
to the lcore handling the device.

Signed-off-by: Mats Liljegren 
---
 lib/librte_ether/rte_ethdev.c |1 +
 lib/librte_ether/rte_ethdev.h |2 ++
 2 files changed, 3 insertions(+)

diff --git a/lib/librte_ether/rte_ethdev.c b/lib/librte_ether/rte_ethdev.c
index 859ec92..09cc4c7 100644
--- a/lib/librte_ether/rte_ethdev.c
+++ b/lib/librte_ether/rte_ethdev.c
@@ -1037,6 +1037,7 @@ rte_eth_dev_info_get(uint8_t port_id, struct rte_eth_dev_info *dev_info)
/* Default device offload capabilities to zero */
dev_info->rx_offload_capa = 0;
dev_info->tx_offload_capa = 0;
+   dev_info->if_index = 0;
FUNC_PTR_OR_RET(*dev->dev_ops->dev_infos_get);
(*dev->dev_ops->dev_infos_get)(dev, dev_info);
dev_info->pci_dev = dev->pci_dev;
diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
index 302d378..89e343c 100644
--- a/lib/librte_ether/rte_ethdev.h
+++ b/lib/librte_ether/rte_ethdev.h
@@ -787,6 +787,8 @@ struct rte_eth_conf {
 struct rte_eth_dev_info {
struct rte_pci_device *pci_dev; /**< Device PCI information. */
const char *driver_name; /**< Device Driver name. */
+   unsigned int if_index; /**< Index to bound host interface, or 0 if none.
+   Use if_indextoname() to translate into an interface name. */
uint32_t min_rx_bufsize; /**< Minimum size of RX buffer. */
uint32_t max_rx_pktlen; /**< Maximum configurable length of RX pkt. */
uint16_t max_rx_queues; /**< Maximum number of RX queues. */
-- 
1.7.10.4
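
An application consuming the new if_index field might translate it as in the
sketch below. if_indextoname() and IF_NAMESIZE are standard POSIX; the
wrapper name ifindex_to_name() is hypothetical, and treating 0 as "no bound
interface" follows the comment added by the patch.

```c
#include <net/if.h>   /* if_indextoname(), IF_NAMESIZE */
#include <stddef.h>

/*
 * Translate an interface index into a name.
 * Returns name on success, or NULL when the index is 0 (no bound host
 * interface) or does not match any interface on the host.
 */
const char *ifindex_to_name(unsigned int if_index, char name[IF_NAMESIZE])
{
    if (if_index == 0)
        return NULL;

    return if_indextoname(if_index, name);
}
```

The caller supplies a buffer of IF_NAMESIZE bytes and can then use the
returned name, for example to look up the interface's interrupts when
setting affinity.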



[dpdk-dev] [PATCH v4 2/2] pcap: save if_index of the bound device

2014-01-24 Thread liljegren.ma...@gmail.com
From: Mats Liljegren 

Use command line parameters to get the name of the interface.
This name is converted into if_index, which is provided as
device info.

Signed-off-by: Mats Liljegren 
---
 lib/librte_pmd_pcap/rte_eth_pcap.c |   36 
 lib/librte_pmd_pcap/rte_eth_pcap.h |9 +++--
 2 files changed, 35 insertions(+), 10 deletions(-)

diff --git a/lib/librte_pmd_pcap/rte_eth_pcap.c b/lib/librte_pmd_pcap/rte_eth_pcap.c
index 87d1306..2345ecd 100644
--- a/lib/librte_pmd_pcap/rte_eth_pcap.c
+++ b/lib/librte_pmd_pcap/rte_eth_pcap.c
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include <net/if.h>

 #include "rte_eth_pcap.h"
 #include "rte_eth_pcap_arg_parser.h"
@@ -86,6 +87,8 @@ struct pmd_internals {
unsigned nb_rx_queues;
unsigned nb_tx_queues;

+   int if_index;
+
struct pcap_rx_queue rx_queue[RTE_PMD_RING_MAX_RX_RINGS];
struct pcap_tx_queue tx_queue[RTE_PMD_RING_MAX_TX_RINGS];
 };
@@ -300,6 +303,7 @@ eth_dev_info(struct rte_eth_dev *dev,
 {
struct pmd_internals *internals = dev->data->dev_private;
dev_info->driver_name = drivername;
+   dev_info->if_index = internals->if_index;
dev_info->max_mac_addrs = 1;
dev_info->max_rx_pktlen = (uint32_t) -1;
dev_info->max_rx_queues = (uint16_t)internals->nb_rx_queues;
@@ -543,10 +547,19 @@ rte_pmd_init_internals(const unsigned nb_rx_queues,
const unsigned nb_tx_queues,
const unsigned numa_node,
struct pmd_internals **internals,
-   struct rte_eth_dev **eth_dev)
+   struct rte_eth_dev **eth_dev,
+   struct args_dict *dict)
 {
struct rte_eth_dev_data *data = NULL;
struct rte_pci_device *pci_dev = NULL;
+   unsigned k_idx;
+   struct key_value *pair = NULL;
+
+   for (k_idx = 0; k_idx < dict->index; k_idx++) {
+   pair = &dict->pairs[k_idx];
+   if (strstr(pair->key, ETH_PCAP_IFACE_ARG) != NULL)
+   break;
+   }

RTE_LOG(INFO, PMD,
"Creating pcap-backed ethdev on numa socket %u\n", numa_node);
@@ -583,6 +596,11 @@ rte_pmd_init_internals(const unsigned nb_rx_queues,
(*internals)->nb_rx_queues = nb_rx_queues;
(*internals)->nb_tx_queues = nb_tx_queues;

+   if (pair == NULL)
+   (*internals)->if_index = 0;
+   else
+   (*internals)->if_index = if_nametoindex(pair->value);
+
pci_dev->numa_node = numa_node;

data->dev_private = *internals;
@@ -612,7 +630,8 @@ rte_eth_from_pcaps_n_dumpers(pcap_t * const rx_queues[],
const unsigned nb_rx_queues,
pcap_dumper_t * const tx_queues[],
const unsigned nb_tx_queues,
-   const unsigned numa_node)
+   const unsigned numa_node,
+   struct args_dict *dict)
 {
struct pmd_internals *internals = NULL;
struct rte_eth_dev *eth_dev = NULL;
@@ -625,7 +644,7 @@ rte_eth_from_pcaps_n_dumpers(pcap_t * const rx_queues[],
return -1;

if (rte_pmd_init_internals(nb_rx_queues, nb_tx_queues, numa_node,
-   &internals, &eth_dev) < 0)
+   &internals, &eth_dev, dict) < 0)
return -1;

for (i = 0; i < nb_rx_queues; i++) {
@@ -646,7 +665,8 @@ rte_eth_from_pcaps(pcap_t * const rx_queues[],
const unsigned nb_rx_queues,
pcap_t * const tx_queues[],
const unsigned nb_tx_queues,
-   const unsigned numa_node)
+   const unsigned numa_node,
+   struct args_dict *dict)
 {
struct pmd_internals *internals = NULL;
struct rte_eth_dev *eth_dev = NULL;
@@ -659,7 +679,7 @@ rte_eth_from_pcaps(pcap_t * const rx_queues[],
return -1;

if (rte_pmd_init_internals(nb_rx_queues, nb_tx_queues, numa_node,
-   &internals, &eth_dev) < 0)
+   &internals, &eth_dev, dict) < 0)
return -1;

for (i = 0; i < nb_rx_queues; i++) {
@@ -707,7 +727,7 @@ rte_pmd_pcap_init(const char *name, const char *params)
if (ret < 0)
return -1;

-   return rte_eth_from_pcaps(pcaps.pcaps, 1, pcaps.pcaps, 1, numa_node);
+   return rte_eth_from_pcaps(pcaps.pcaps, 1, pcaps.pcaps, 1, numa_node, &dict);
}

/*
@@ -748,10 +768,10 @@ rte_pmd_pcap_init(const char *name, const char *params)

if (using_dumpers)
return rte_eth_from_pcaps_n_dumpers(pcaps.pcaps, pcaps.num_of_rx,
-   dumpers.dumpers, dumpers.num_of_tx, numa_node);
+   dumpers.dumpers, dumpers.num_of_tx, numa_node, &dict);

return rte_eth_from_pcaps(pcaps.pcaps, pcaps.num_of_rx, dumpers.pcaps,
-   dumpers.num_of_tx, numa_node);
+  
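
The lookup described in the commit message (find the interface name among
the parsed device arguments, then convert it to an index) can be sketched in
isolation as follows. struct kv_pair and iface_index_from_args() are
hypothetical stand-ins for the PMD's own argument dictionary, and the sketch
uses an exact strcmp() match where the patch uses strstr(). The result
pointer is only set on a match, so a missing key yields 0.

```c
#include <net/if.h>   /* if_nametoindex() */
#include <stddef.h>
#include <string.h>

/* Hypothetical key/value pair, standing in for the PMD's parsed arguments. */
struct kv_pair {
    const char *key;
    const char *value;
};

/*
 * Return the index of the interface named by the first pair whose key
 * equals iface_key, or 0 if there is no such pair or no such interface.
 */
unsigned int iface_index_from_args(const struct kv_pair *pairs, size_t count,
                                   const char *iface_key)
{
    const struct kv_pair *found = NULL;
    size_t i;

    for (i = 0; i < count; i++) {
        if (strcmp(pairs[i].key, iface_key) == 0) {
            found = &pairs[i];
            break;
        }
    }

    if (found == NULL)
        return 0;

    /* if_nametoindex() itself returns 0 when the name is unknown. */
    return if_nametoindex(found->value);
}
```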

[dpdk-dev] [PATCH] spinlock: fix build with clang

2014-01-24 Thread Olivier Matz
LLVM clang requires an explicitly sized "cmp" assembly instruction.

Signed-off-by: Olivier Matz 
---
 lib/librte_eal/common/include/rte_spinlock.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/librte_eal/common/include/rte_spinlock.h b/lib/librte_eal/common/include/rte_spinlock.h
index f7a245a..c530993 100644
--- a/lib/librte_eal/common/include/rte_spinlock.h
+++ b/lib/librte_eal/common/include/rte_spinlock.h
@@ -98,7 +98,7 @@ rte_spinlock_lock(rte_spinlock_t *sl)
"jz 3f\n"
"2:\n"
"pause\n"
-   "cmp $0, %[locked]\n"
+   "cmpl $0, %[locked]\n"
"jnz 2b\n"
"jmp 1b\n"
"3:\n"
-- 
1.8.4.rc3
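
For context: when "cmp" compares an immediate with a memory operand, neither
operand tells the assembler how wide the access is, and as far as I know
clang's integrated assembler rejects the ambiguous form instead of choosing
a default. The sketch below shows the explicitly sized form on a 32-bit lock
word. lock_word_is_set() is a hypothetical helper, not the DPDK spinlock
code, and the non-x86 branch exists only so that the sketch builds
everywhere.

```c
/* Return 1 if the 32-bit word at *locked is non-zero, 0 otherwise. */
int lock_word_is_set(volatile int *locked)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned char not_zero;

    /* "cmpl" states the 32-bit operand size explicitly. */
    __asm__ volatile("cmpl $0, %[locked]\n\t"
                     "setne %[nz]"
                     : [nz] "=q"(not_zero)
                     : [locked] "m"(*locked)
                     : "cc", "memory");
    return not_zero;
#else
    return *locked != 0;
#endif
}
```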



[dpdk-dev] [PATCH 0/2] pci: allow to run without hotplug

2014-01-24 Thread Olivier Matz
The default behavior of DPDK is to wait for the creation of the /dev/uioX
devices. This work is usually done by hotplug, triggered by a
kernel notification.

On some embedded systems, there is no hotplug program. These 2
patches introduce a new EAL option "--create-uio-dev" that tells
the DPDK to create the /dev/uioX devices using mknod().

Olivier Matz (2):
  pci: split the function providing uio device and mappings
  pci: add option --create-uio-dev to run without hotplug

 lib/librte_eal/linuxapp/eal/eal.c  |   6 +
 lib/librte_eal/linuxapp/eal/eal_pci.c  | 127 +++--
 .../linuxapp/eal/include/eal_internal_cfg.h|   1 +
 3 files changed, 102 insertions(+), 32 deletions(-)

-- 
1.8.4.rc3



[dpdk-dev] [PATCH 1/2] pci: split the function providing uio device and mappings

2014-01-24 Thread Olivier Matz
Add a new function pci_get_uio_dev() that parses /sys/bus/pci/devices
to get the uio device associated with a PCI device. This patch just
moves some code that was in pci_uio_map_resource() into the new function
without any functional change.

Thanks to this change, the next commit will be easier to understand.
Moreover it improves readability: having smaller functions helps to
understand what pci_uio_map_resource() does.

Signed-off-by: Olivier Matz 
---
 lib/librte_eal/linuxapp/eal/eal_pci.c | 82 +--
 1 file changed, 50 insertions(+), 32 deletions(-)

diff --git a/lib/librte_eal/linuxapp/eal/eal_pci.c b/lib/librte_eal/linuxapp/eal/eal_pci.c
index 37ee6f1..1039777 100644
--- a/lib/librte_eal/linuxapp/eal/eal_pci.c
+++ b/lib/librte_eal/linuxapp/eal/eal_pci.c
@@ -460,34 +460,20 @@ pci_uio_map_secondary(struct rte_pci_device *dev)
return -1;
 }

-/* map the PCI resource of a PCI device in virtual memory */
-static int
-pci_uio_map_resource(struct rte_pci_device *dev)
+/*
+ * Return the uioX char device used for a pci device. On success, return
+ * the UIO number and fill dstbuf string with the path of the device in
+ * sysfs. On error, return a negative value. In this case dstbuf is
+ * invalid.
+ */
+static int pci_get_uio_dev(struct rte_pci_device *dev, char *dstbuf,
+  unsigned int buflen)
 {
-   int i, j;
+   struct rte_pci_addr *loc = &dev->addr;
+   unsigned int uio_num;
struct dirent *e;
DIR *dir;
char dirname[PATH_MAX];
-   char filename[PATH_MAX];
-   char dirname2[PATH_MAX];
-   char devname[PATH_MAX]; /* contains the /dev/uioX */
-   void *mapaddr;
-   unsigned uio_num;
-   unsigned long start,size;
-   uint64_t phaddr;
-   uint64_t offset;
-   uint64_t pagesz;
-   ssize_t nb_maps;
-   struct rte_pci_addr *loc = &dev->addr;
-   struct uio_resource *uio_res;
-   struct uio_map *maps;
-
-   dev->intr_handle.fd = -1;
-
-   /* secondary processes - use already recorded details */
-   if ((rte_eal_process_type() != RTE_PROC_PRIMARY) &&
-   (dev->id.vendor_id != PCI_VENDOR_ID_QUMRANET))
-   return (pci_uio_map_secondary(dev));

/* depending on kernel version, uio can be located in uio/uioX
 * or uio:uioX */
@@ -525,8 +511,7 @@ pci_uio_map_resource(struct rte_pci_device *dev)
errno = 0;
uio_num = strtoull(e->d_name + shortprefix_len, &endptr, 10);
if (errno == 0 && endptr != (e->d_name + shortprefix_len)) {
-   rte_snprintf(dirname2, sizeof(dirname2),
-"%s/uio%u", dirname, uio_num);
+   rte_snprintf(dstbuf, buflen, "%s/uio%u", dirname, uio_num);
break;
}

@@ -534,15 +519,48 @@ pci_uio_map_resource(struct rte_pci_device *dev)
errno = 0;
uio_num = strtoull(e->d_name + longprefix_len, &endptr, 10);
if (errno == 0 && endptr != (e->d_name + longprefix_len)) {
-   rte_snprintf(dirname2, sizeof(dirname2),
-"%s/uio:uio%u", dirname, uio_num);
+   rte_snprintf(dstbuf, buflen, "%s/uio:uio%u", dirname, uio_num);
break;
}
}
closedir(dir);

/* No uio resource found */
-   if (e == NULL) {
+   if (e == NULL)
+   return -1;
+
+   return 0;
+}
+
+/* map the PCI resource of a PCI device in virtual memory */
+static int
+pci_uio_map_resource(struct rte_pci_device *dev)
+{
+   int i, j;
+   char dirname[PATH_MAX];
+   char filename[PATH_MAX];
+   char devname[PATH_MAX]; /* contains the /dev/uioX */
+   void *mapaddr;
+   int uio_num;
+   unsigned long start,size;
+   uint64_t phaddr;
+   uint64_t offset;
+   uint64_t pagesz;
+   ssize_t nb_maps;
+   struct rte_pci_addr *loc = &dev->addr;
+   struct uio_resource *uio_res;
+   struct uio_map *maps;
+
+   dev->intr_handle.fd = -1;
+
+   /* secondary processes - use already recorded details */
+   if ((rte_eal_process_type() != RTE_PROC_PRIMARY) &&
+   (dev->id.vendor_id != PCI_VENDOR_ID_QUMRANET))
+   return (pci_uio_map_secondary(dev));
+
+   /* find uio resource */
+   uio_num = pci_get_uio_dev(dev, dirname, sizeof(dirname));
+   if (uio_num < 0) {
RTE_LOG(WARNING, EAL, "  "PCI_PRI_FMT" not managed by UIO driver, "
"skipping\n", loc->domain, loc->bus, loc->devid, loc->function);
return -1;
@@ -551,7 +569,7 @@ pci_uio_map_resource(struct rte_pci_device *dev)
if(dev->id.vendor_id == PCI_VENDOR_ID_QUMRANET) {
/* get portio size */
rte_snprintf(filename, sizeof(filename),
-"%s/portio/port0/siz
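
The name matching that pci_get_uio_dev() performs on each sysfs directory
entry can be illustrated on its own. As the patch comment notes, depending
on the kernel version the entry is named either "uioX" or "uio:uioX".
parse_uio_name() below is a hypothetical helper written for this
illustration; like the patch, it relies on strtoull() and on comparing the
end pointer to detect a missing number.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Extract X from a directory entry named "uioX" or "uio:uioX".
 * Returns 0 and stores X in *uio_num on success, -1 otherwise.
 */
int parse_uio_name(const char *d_name, unsigned int *uio_num)
{
    /* Longest prefix first, although either order gives the same result. */
    static const char *const prefixes[] = { "uio:uio", "uio" };
    size_t i;

    for (i = 0; i < sizeof(prefixes) / sizeof(prefixes[0]); i++) {
        size_t len = strlen(prefixes[i]);
        unsigned long long value;
        char *endptr;

        if (strncmp(d_name, prefixes[i], len) != 0)
            continue;

        errno = 0;
        value = strtoull(d_name + len, &endptr, 10);
        if (errno == 0 && endptr != d_name + len) {
            *uio_num = (unsigned int)value;
            return 0;
        }
    }

    return -1;
}
```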

[dpdk-dev] [PATCH 2/2] pci: add option --create-uio-dev to run without hotplug

2014-01-24 Thread Olivier Matz
When the user specifies --create-uio-dev in the DPDK EAL start options, the
DPDK will create the /dev/uioX device itself instead of waiting for another
program (usually hotplug) to do it.

This option is useful in embedded environments where there is no hotplug
to do the work.

Signed-off-by: Olivier Matz 
---
 lib/librte_eal/linuxapp/eal/eal.c  |  6 +++
 lib/librte_eal/linuxapp/eal/eal_pci.c  | 47 +-
 .../linuxapp/eal/include/eal_internal_cfg.h|  1 +
 3 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/lib/librte_eal/linuxapp/eal/eal.c b/lib/librte_eal/linuxapp/eal/eal.c
index bd20331..9168b3f 100644
--- a/lib/librte_eal/linuxapp/eal/eal.c
+++ b/lib/librte_eal/linuxapp/eal/eal.c
@@ -94,6 +94,7 @@
 #define OPT_SOCKET_MEM  "socket-mem"
 #define OPT_USE_DEVICE  "use-device"
 #define OPT_SYSLOG  "syslog"
+#define OPT_CREATE_UIO_DEV "create-uio-dev"

 #define RTE_EAL_BLACKLIST_SIZE 0x100

@@ -357,6 +358,7 @@ eal_usage(const char *prgname)
   "  --"OPT_NO_PCI"   : disable pci\n"
   "  --"OPT_NO_HPET"  : disable hpet\n"
   "  --"OPT_NO_SHCONF": no shared config (mmap'd files)\n"
+  "  --"OPT_CREATE_UIO_DEV": create /dev/uioX (usually done by hotplug)\n"
   "\n",
   prgname);
/* Allow the application to print its usage message too if hook is set */
@@ -608,6 +610,7 @@ eal_parse_args(int argc, char **argv)
{OPT_SOCKET_MEM, 1, 0, 0},
{OPT_USE_DEVICE, 1, 0, 0},
{OPT_SYSLOG, 1, NULL, 0},
+   {OPT_CREATE_UIO_DEV, 0, NULL, 0},
{0, 0, 0, 0}
};
struct shared_driver *solib;
@@ -747,6 +750,9 @@ eal_parse_args(int argc, char **argv)
return -1;
}
}
+   else if (!strcmp(lgopts[option_index].name, OPT_CREATE_UIO_DEV)) {
+   internal_config.create_uio_dev = 1;
+   }
break;

default:
diff --git a/lib/librte_eal/linuxapp/eal/eal_pci.c b/lib/librte_eal/linuxapp/eal/eal_pci.c
index 1039777..af5a8b9 100644
--- a/lib/librte_eal/linuxapp/eal/eal_pci.c
+++ b/lib/librte_eal/linuxapp/eal/eal_pci.c
@@ -460,6 +460,47 @@ pci_uio_map_secondary(struct rte_pci_device *dev)
return -1;
 }

+static int pci_mknod_uio_dev(const char *sysfs_uio_path, unsigned uio_num)
+{
+   FILE *f;
+   char filename[PATH_MAX];
+   int ret;
+   unsigned major, minor;
+   dev_t dev;
+
+   /* get the name of the sysfs file that contains the major and minor
+* of the uio device and read its content */
+   rte_snprintf(filename, sizeof(filename), "%s/dev", sysfs_uio_path);
+
+   f = fopen(filename, "r");
+   if (f == NULL) {
+   RTE_LOG(ERR, EAL, "%s(): cannot open sysfs to get major:minor\n",
+   __func__);
+   return -1;
+   }
+
+   ret = fscanf(f, "%u:%u", &major, &minor);
+   if (ret != 2) {
+   RTE_LOG(ERR, EAL, "%s(): cannot parse sysfs to get major:minor\n",
+   __func__);
+   fclose(f);
+   return -1;
+   }
+   fclose(f);
+
+   /* create the char device "mknod /dev/uioX c major minor" */
+   rte_snprintf(filename, sizeof(filename), "/dev/uio%u", uio_num);
+   dev = makedev(major, minor);
+   ret = mknod(filename, S_IFCHR | S_IRUSR | S_IWUSR, dev);
+   if (ret != 0) {
+   RTE_LOG(ERR, EAL, "%s(): mknod() failed %s\n",
+   __func__, strerror(errno));
+   return -1;
+   }
+
+   return ret;
+}
+
 /*
  * Return the uioX char device used for a pci device. On success, return
  * the UIO number and fill dstbuf string with the path of the device in
@@ -529,7 +570,11 @@ static int pci_get_uio_dev(struct rte_pci_device *dev, char *dstbuf,
if (e == NULL)
return -1;

-   return 0;
+   /* create uio device if we've been asked to */
+   if (internal_config.create_uio_dev && pci_mknod_uio_dev(dstbuf, uio_num) < 0)
+   RTE_LOG(WARNING, EAL, "Cannot create /dev/uio%u\n", uio_num);
+
+   return uio_num;
 }

 /* map the PCI resource of a PCI device in virtual memory */
diff --git a/lib/librte_eal/linuxapp/eal/include/eal_internal_cfg.h b/lib/librte_eal/linuxapp/eal/include/eal_internal_cfg.h
index 45ec058..649bd2b 100644
--- a/lib/librte_eal/linuxapp/eal/include/eal_internal_cfg.h
+++ b/lib/librte_eal/linuxapp/eal/include/eal_internal_cfg.h
@@ -68,6 +68,7 @@ struct internal_config {
volatile unsigned vmware_tsc_map; /**< true to use VMware TSC mapping
* instead of native TSC */
volatile unsigned no_shconf;  /**< true if there is no shared config */
+  

[dpdk-dev] [memnic PATCH] pmd: fix attributes

2014-01-24 Thread Olivier Matz
Add missing "const" and remove useless "rte_unused" attributes.

Signed-off-by: Olivier Matz 
---
 pmd/pmd_memnic.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/pmd/pmd_memnic.c b/pmd/pmd_memnic.c
index d16eb0d..bc01746 100644
--- a/pmd/pmd_memnic.c
+++ b/pmd/pmd_memnic.c
@@ -57,7 +57,7 @@ struct memnic_adapter {
struct ether_addr mac_addr;
 };

-static inline struct memnic_adapter *get_adapter(struct rte_eth_dev *dev)
+static inline struct memnic_adapter *get_adapter(const struct rte_eth_dev *dev)
 {
return (struct memnic_adapter *)(dev->data->dev_private);
 }
@@ -67,7 +67,7 @@ struct memnic_queue {
uint8_t port_id;
 };

-static struct memnic_queue *memnic_queue_alloc(struct rte_eth_dev *dev,
+static struct memnic_queue *memnic_queue_alloc(const struct rte_eth_dev *dev,
   int tx, uint16_t id)
 {
struct memnic_adapter *adapter = get_adapter(dev);
@@ -119,7 +119,7 @@ static void memnic_dev_stop(struct rte_eth_dev *dev)
return;
 }

-static void memnic_dev_infos_get(__rte_unused struct rte_eth_dev *dev,
+static void memnic_dev_infos_get(struct rte_eth_dev *dev,
 struct rte_eth_dev_info *dev_info)
 {
dev_info->driver_name = dev->driver->pci_drv.name;
-- 
1.8.4.rc3



[dpdk-dev] [memnic PATCH] pmd: use memory barrier function instead of asm volatile

2014-01-24 Thread Olivier Matz
Use the DPDK specific function rte_mb() instead of
the GCC statement asm volatile ("" ::: "memory").

Signed-off-by: Olivier Matz 
---
 common/memnic.h  | 2 --
 pmd/pmd_memnic.c | 6 +++---
 2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/common/memnic.h b/common/memnic.h
index 6ff38a0..fdc9fa3 100644
--- a/common/memnic.h
+++ b/common/memnic.h
@@ -123,8 +123,6 @@ struct memnic_area {
 /* for userspace */
 #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

-#define barrier() do { asm volatile("": : :"memory"); } while (0)
-
 static inline uint32_t cmpxchg(uint32_t *dst, uint32_t old, uint32_t new)
 {
volatile uint32_t *ptr = (volatile uint32_t *)dst;
diff --git a/pmd/pmd_memnic.c b/pmd/pmd_memnic.c
index bc01746..1586222 100644
--- a/pmd/pmd_memnic.c
+++ b/pmd/pmd_memnic.c
@@ -100,7 +100,7 @@ static int memnic_dev_start(struct rte_eth_dev *dev)

/* invalidate */
adapter->nic->hdr.valid = 0;
-   barrier();
+   rte_mb();
/* reset */
adapter->nic->hdr.reset = 1;
/* no need to wait here */
@@ -242,7 +242,7 @@ static uint16_t memnic_recv_pkts(void *rx_queue,
mb->pkt.data_len = p->len;
rx_pkts[nr] = mb;

-   barrier();
+   rte_mb();
p->status = MEMNIC_PKT_ST_FREE;

if (++idx >= MEMNIC_NR_PACKET)
@@ -290,7 +290,7 @@ retry:

rte_memcpy(p->data, rte_pktmbuf_mtod(tx_pkts[nr], void *), len);

-   barrier();
+   rte_mb();
p->status = MEMNIC_PKT_ST_FILLED;

rte_pktmbuf_free(tx_pkts[nr]);
-- 
1.8.4.rc3
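
The two kinds of barrier are not equivalent: asm volatile("" ::: "memory")
only stops the compiler from reordering memory accesses, while a full memory
barrier such as rte_mb() also orders them at the CPU level. The sketch below
contrasts the two using standard C11 fences rather than the DPDK or memnic
macros. struct slot and both publish functions are hypothetical and only
mimic the "fill data, barrier, flip status" pattern of the patch.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared packet slot: payload length plus a status word. */
struct slot {
    uint32_t len;
    uint32_t status;
};

#define SLOT_FREE   0u
#define SLOT_FILLED 1u

/* Compiler-only barrier, comparable to asm volatile("" ::: "memory"). */
void publish_compiler_barrier(volatile struct slot *s, uint32_t len)
{
    s->len = len;
    atomic_signal_fence(memory_order_seq_cst);
    s->status = SLOT_FILLED;
}

/* Full fence, comparable in intent to rte_mb(). */
void publish_full_fence(volatile struct slot *s, uint32_t len)
{
    s->len = len;
    atomic_thread_fence(memory_order_seq_cst);
    s->status = SLOT_FILLED;
}
```

Whether the full fence is strictly needed depends on the memory model of the
CPUs on both sides of the shared area; it costs more than the compiler-only
barrier.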



[dpdk-dev] [memnic PATCH] linux: fix build with kernel >= 3.3

2014-01-24 Thread Olivier Matz
Signed-off-by: Olivier Matz 
---
 linux/memnic_net.c | 28 ++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/linux/memnic_net.c b/linux/memnic_net.c
index 747ae51..b6018fb 100644
--- a/linux/memnic_net.c
+++ b/linux/memnic_net.c
@@ -2,6 +2,7 @@
  *   BSD LICENSE
  *
  *   Copyright(c) 2013-2014 NEC All rights reserved.
+ *   Copyright(c) 2014 6WIND S.A.
  *
  *   Redistribution and use in source and binary forms, with or without
  *   modification, are permitted provided that the following conditions
@@ -29,6 +30,7 @@
  */
 /* Dual BSD/GPL */

+#include 
 #include 
 #include 

@@ -259,13 +261,35 @@ static void memnic_tx_timeout(struct net_device *netdev)
 {
 }

-static void memnic_vlan_rx_add_vid(struct net_device *netdev, unsigned short vid)
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,10,0)
+static int memnic_vlan_rx_add_vid(struct net_device *netdev, __be16 proto, u16 vid)
+{
+   return 0;
+}
+
+static int memnic_vlan_rx_kill_vid(struct net_device *netdev, __be16 proto, u16 vid)
+{
+   return 0;
+}
+#elif LINUX_VERSION_CODE >= KERNEL_VERSION(3,3,0)
+static int memnic_vlan_rx_add_vid(struct net_device *netdev, uint16_t vid)
+{
+   return 0;
+}
+
+static int memnic_vlan_rx_kill_vid(struct net_device *netdev, uint16_t vid)
+{
+   return 0;
+}
+#else
+static void memnic_vlan_rx_add_vid(struct net_device *netdev, uint16_t vid)
 {
 }

-static void memnic_vlan_rx_kill_vid(struct net_device *netdev, unsigned short vid)
+static void memnic_vlan_rx_kill_vid(struct net_device *netdev, uint16_t vid)
 {
 }
+#endif

 static int memnic_ioctl(struct net_device *netdev, struct ifreq *req, int cmd)
 {
-- 
1.8.4.rc3
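
The patch above selects one of three callback signatures by comparing
LINUX_VERSION_CODE against KERNEL_VERSION(). The sketch below reproduces that
selection in plain C, assuming the (a << 16) + (b << 8) + c encoding the
kernel macro used at the time. KVER, enum vlan_api and vlan_api_for() are
hypothetical names for illustration only.

```c
#include <assert.h>

/* Same encoding as the kernel's KERNEL_VERSION() macro of that era. */
#define KVER(a, b, c) (((a) << 16) + ((b) << 8) + (c))

static_assert(KVER(3, 3, 0) == 0x030300, "one byte per version component");

/* The three shapes of ndo_vlan_rx_{add,kill}_vid handled by the patch. */
enum vlan_api {
	VLAN_API_VOID,      /* < 3.3:   void f(netdev, vid)        */
	VLAN_API_INT,       /* >= 3.3:  int  f(netdev, vid)        */
	VLAN_API_INT_PROTO  /* >= 3.10: int  f(netdev, proto, vid) */
};

/* Mirror of the #if / #elif / #else ladder, evaluated at run time. */
static enum vlan_api vlan_api_for(int version_code)
{
	if (version_code >= KVER(3, 10, 0))
		return VLAN_API_INT_PROTO;
	if (version_code >= KVER(3, 3, 0))
		return VLAN_API_INT;
	return VLAN_API_VOID;
}
```

The order of the tests matters: checking 3.10 before 3.3 is what keeps a
3.10 kernel from matching the older int-returning variant.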



[dpdk-dev] [memnic PATCH 0/3] cosmetic improvements

2014-01-24 Thread Thomas Monjalon
These patches are some minor improvements for the new memnic PMD.

---

Thomas Monjalon (3):
  pmd: remove symlink
  pmd: remove useless includes
  common: remove double underscores

 common/memnic.h  |   10 +++---
 pmd/Makefile |2 +-
 pmd/memnic.h |1 -
 pmd/pmd_memnic.c |4 
 4 files changed, 4 insertions(+), 13 deletions(-)
 delete mode 120000 pmd/memnic.h

-- 
1.7.10.4



[dpdk-dev] [memnic PATCH 1/3] pmd: remove symlink

2014-01-24 Thread Thomas Monjalon
No need to have a symbolic link to a common file
when it can be simply included.

Signed-off-by: Thomas Monjalon 
---
 pmd/Makefile |2 +-
 pmd/memnic.h |1 -
 2 files changed, 1 insertion(+), 2 deletions(-)
 delete mode 120000 pmd/memnic.h

diff --git a/pmd/Makefile b/pmd/Makefile
index a96e125..7f96af1 100644
--- a/pmd/Makefile
+++ b/pmd/Makefile
@@ -59,7 +59,7 @@ ifeq '$(RTE_INCLUDE)' ''
 endif
$(CC) $(CPPFLAGS) $(CFLAGS) $(LDFLAGS) \
-I$(RTE_INCLUDE) -include $(RTE_CONFIG) \
-   -o $@ $<
+   -I$S/../common -o $@ $<

 install : $(DESTDIR)$(libdir)/$(SOLIB)
install -D -m 644 $S/README.rst $(DESTDIR)$(docdir)/README.rst
diff --git a/pmd/memnic.h b/pmd/memnic.h
deleted file mode 120000
index 5303ad4..0000000
--- a/pmd/memnic.h
+++ /dev/null
@@ -1 +0,0 @@
-../common/memnic.h
\ No newline at end of file
-- 
1.7.10.4



[dpdk-dev] [memnic PATCH 2/3] pmd: remove useless includes

2014-01-24 Thread Thomas Monjalon
Signed-off-by: Thomas Monjalon 
---
 common/memnic.h  |4 
 pmd/pmd_memnic.c |4 
 2 files changed, 8 deletions(-)

diff --git a/common/memnic.h b/common/memnic.h
index 6ff38a0..58dd019 100644
--- a/common/memnic.h
+++ b/common/memnic.h
@@ -31,10 +31,6 @@
 #ifndef __MEMNIC_H__
 #define __MEMNIC_H__

-#ifndef __KERNEL__
-#include 
-#endif /* __KERNEL__ */
-
 #define MEMNIC_MAGIC   0x43494e76
 #define MEMNIC_VERSION 0x0001
 #define MEMNIC_VERSION_1   0x0001
diff --git a/pmd/pmd_memnic.c b/pmd/pmd_memnic.c
index d16eb0d..619941a 100644
--- a/pmd/pmd_memnic.c
+++ b/pmd/pmd_memnic.c
@@ -30,18 +30,14 @@
  */

 #include 
-
 #include 
 #include 
 #include 
-#include 

 #include "memnic.h"

 #include 
-#include 
 #include 
-#include 
 #include 
 #include 

-- 
1.7.10.4



[dpdk-dev] [memnic PATCH 3/3] common: remove double underscores

2014-01-24 Thread Thomas Monjalon
Identifiers beginning with a double underscore are reserved for the
implementation.

Signed-off-by: Thomas Monjalon 
---
 common/memnic.h |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/common/memnic.h b/common/memnic.h
index 58dd019..e5b3c6f 100644
--- a/common/memnic.h
+++ b/common/memnic.h
@@ -28,8 +28,8 @@
  *
  */

-#ifndef __MEMNIC_H__
-#define __MEMNIC_H__
+#ifndef MEMNIC_H
+#define MEMNIC_H

 #define MEMNIC_MAGIC   0x43494e76
 #define MEMNIC_VERSION 0x0001
@@ -135,4 +135,4 @@ static inline uint32_t cmpxchg(uint32_t *dst, uint32_t old, uint32_t new)
 }
 #endif /* __KERNEL__ */

-#endif /* __MEMNIC_H__ */
+#endif /* MEMNIC_H */
-- 
1.7.10.4



[dpdk-dev] DPDK L2fwd benchmark with 64byte packets

2014-01-24 Thread Chris Pappas
Hi,

We are benchmarking DPDK l2fwd performance using DPDK Pktgen (both up to
date). We have connected two servers back-to-back. Each machine is a
dual-socket server with 6 dual-port 10G NICs (12 ports, 120 Gbps in total).
Four of the NICs (8 ports) are connected to socket 0 and the other two
(4 ports) are connected to socket 1. With 1500-byte packets we saturate
line rate; with 64-byte packets we do not.

Running l2fwd (./l2fwd -c 0xff0f -n 4 -- -p 0xfff), we get the following
performance reported by Pktgen:

Rx/Tx
7386/9808  7386/9807  7413/9837  7413/9827   7397/9816   7397/9822
7400/9823  7400/9823  7394/9820  7394/9807   7372/9768   7372/9788

L2fwd reports 0 dropped packets in total.
Another observation is that Pktgen does not quite reach line rate on Tx with
64-byte packets, whereas with 1500-byte packets we observe exactly 10 Gbps Tx.
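
To put the packet sizes in perspective, here is a small sketch of the
theoretical packet rate of one port. It assumes the frame length includes the
CRC and that each frame costs 20 extra bytes on the wire (preamble, start
delimiter and inter-frame gap); line_rate_pps() is a hypothetical helper, not
part of Pktgen or DPDK.

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame wire overhead: 7B preamble + 1B SFD + 12B inter-frame gap. */
#define ETH_WIRE_OVERHEAD 20u

/* Maximum frames per second on a link of link_bps bits per second. */
static uint64_t line_rate_pps(uint64_t link_bps, uint32_t frame_len)
{
	assert(frame_len > 0);
	return link_bps / ((uint64_t)(frame_len + ETH_WIRE_OVERHEAD) * 8u);
}
```

Under these assumptions a 10 Gbps port carries about 14.88 Mpps with 64-byte
frames but only about 0.82 Mpps with 1500-byte frames, so the small-packet
test asks roughly 18 times more per-packet work from each core.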

* How the coremask (-c) works is quite clear (in our case the 4 LSBs are
cores of socket 0, the next 4 bits are cores of socket 1, then socket 0 and
socket 1 again). However, the port mask only defines which NICs are enabled.
How do we ensure that the cores assigned to the NICs are on the same socket
as the corresponding NICs, or is this done automatically?

The command we use to run l2fwd is the following:
./l2fwd -c 0xff0f -n 4 -- -p 0xfff
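
The coremask layout described above can be decoded with a short sketch. It
assumes exactly the layout from this post (groups of 4 consecutive core ids
alternating between socket 0 and socket 1), which may not hold on other
machines; socket_of_core() and cores_on_socket() are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

#define CORES_PER_GROUP 4u  /* consecutive core ids sharing a socket */
#define NUM_SOCKETS     2u

/* Socket of a core id under the alternating layout assumed above. */
static unsigned socket_of_core(unsigned core)
{
	return (core / CORES_PER_GROUP) % NUM_SOCKETS;
}

/* Count the cores enabled in coremask that sit on the given socket. */
static unsigned cores_on_socket(uint64_t coremask, unsigned socket)
{
	unsigned core, count = 0;

	assert(socket < NUM_SOCKETS);
	for (core = 0; core < 64; core++)
		if (((coremask >> core) & 1u) && socket_of_core(core) == socket)
			count++;
	return count;
}
```

For the mask 0xff0f this counts 8 cores on socket 0 and 4 on socket 1, the
same split as the ports, but it says nothing about which port each core
actually polls.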

* The next observation is that if we run l2fwd again with a different
coremask that enables all our cores (./l2fwd -c 0x -n 4 -- -p 0xfff),
performance drops significantly. The results are the following:

Rx/Tx
7380/9807  7380/9806  7422/9850  7423/9789  2467/9585  2467/9624
1399/9809  1399/9806  7391/9816  7392/9802  7370/9789  7370/9789

We observe that ports P4-P7 have very low throughput, and they correspond
to the additional cores we enabled in the coremask. This result seems odd
and suggests that the assignment of cores to NICs is the likely explanation.
Moreover, l2fwd reports many dropped packets only for these 4 ports.

We would like to know if there is an obvious mistake in our configuration,
or if there are some steps we can take to debug this. 6Wind reports a
platform limit of 160 Mpps, but we are below this with a similar platform.
Is the bottleneck the PCIe?

Thank you in advance for your time.

Best regards,
Chris Pappas


[dpdk-dev] pktgen offload checksum flag not able to make it work with pcap packets.

2014-01-24 Thread Banashankar KV
I am modifying a packet in pktgen_pcap_mbuf_ctor(). After modifying it, I
want to offload the checksum calculation to h/w, so I am setting these
fields in the pktgen_pcap_mbuf_ctor() function:

m->pkt.vlan_macip.f.l2_len = sizeof(struct ether_hdr);
m->pkt.vlan_macip.f.l3_len = sizeof(struct ipv4_hdr);

m->ol_flags = PKT_TX_IP_CKSUM;

I even tried setting .txq_flags = 0 in the rte_eth_txconf struct in pktgen.c,
but I am still not able to get the h/w checksum. Am I missing anything?



Thanks
Banashankar
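
One way to check whether the NIC actually filled in the checksum is to
capture a transmitted packet and compare its IPv4 header checksum with a
software computation. Below is a minimal sketch of the standard Internet
checksum (RFC 1071) over a raw header; ipv4_cksum() is a hypothetical helper
for illustration, not a pktgen or DPDK function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * One's-complement checksum over an IPv4 header given as raw bytes.
 * Bytes are combined in network order, so the result does not depend
 * on host endianness.
 */
static uint16_t ipv4_cksum(const uint8_t *hdr, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(len % 2 == 0); /* IPv4 header length is a multiple of 4 */
	for (i = 0; i < len; i += 2)
		sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

Run over a header whose checksum field is zero, the function returns the
value the field should hold; run over a header that already carries a
correct checksum, it returns 0.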


[dpdk-dev] pktgen offload checksum flag not able to make it work with pcap packets.

2014-01-24 Thread Wiles, Roger Keith
I have not enabled that feature myself, but I would expect it to work as long
as the hardware supports it. What do the docs say about enabling hardware
offload support? Did you look at the following files:

ip_reassembly/ipv4_rsmbl.h: m->ol_flags |= PKT_TX_IP_CKSUM;
ipv4_frag/rte_ipv4_frag.h:  out_pkt->ol_flags |= PKT_TX_IP_CKSUM;

Thanks
++Keith

Keith Wiles, Principal Technologist for Networking member of the CTO office, 
Wind River
mobile 940.213.5533

On Jan 24, 2014, at 12:54 PM, Banashankar KV <banveerad at gmail.com> wrote:

> I was modifying a packet in pktgen_pcap_mbuf_ctor()
> and after modifying I wanted to offload the checksum calculation to h/w
> so I am setting these flags in pktgen_pcap_mbuf_ctor function.
>
> m->pkt.vlan_macip.f.l2_len = sizeof(struct ether_hdr);
> m->pkt.vlan_macip.f.l3_len = sizeof(struct ipv4_hdr);
>
> m->ol_flags = PKT_TX_IP_CKSUM
>
> I even tried with setting .txq_flags = 0 in rte_eth_txconf struct in pktgen.c.
>
> But still not able to get the h/w checksum. Am I missing anything ?
>
> Thanks
> Banashankar