[dpdk-dev] TX performance regression caused by the mbuf cachline split

2015-05-11 Thread Paul Emmerich
Hi,

this is a follow-up to my post from 3 weeks ago [1]. I'm starting a new 
thread here since I now got a completely new test setup for improved 
reproducibility.

Background for anyone that didn't catch my last post:
I'm investigating a performance regression in my packet generator [2] 
that occurs since I tried to upgrade from DPDK 1.7.1 to 1.8 or 2.0. DPDK 
1.7.1 is about 25% faster than 2.0 in my application.
I suspected that this is due to the new 2-cacheline mbufs, which I now 
confirmed with a bisect.

My old test setup was based on the l2fwd example and required an 
external packet generator and was kind of hard to reproduce.

I built a simple tx benchmark application that simply sends nonsensical 
packets with a sequence number as fast as possible on two ports with a 
single core. You can download the benchmark app at [3].
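
For reference, the core of such a tx loop looks roughly like this (an
illustrative sketch, not the actual code from [3]; BURST_SIZE, SEQ_OFFSET
and send_loop() are made-up names, and error handling is omitted):

    #include <rte_mbuf.h>
    #include <rte_ethdev.h>

    static void send_loop(uint8_t port, struct rte_mempool *pool)
    {
        struct rte_mbuf *bufs[BURST_SIZE];
        uint32_t seq = 0;

        for (;;) {
            for (int i = 0; i < BURST_SIZE; i++) {
                bufs[i] = rte_pktmbuf_alloc(pool);
                /* set data_len/pkt_len, write a fixed Ethernet/IP/UDP header,
                 * then stamp the per-packet sequence number into the payload */
                uint32_t *payload = rte_pktmbuf_mtod(bufs[i], uint32_t *);
                payload[SEQ_OFFSET] = seq++;
            }
            uint16_t sent = rte_eth_tx_burst(port, 0, bufs, BURST_SIZE);
            for (uint16_t i = sent; i < BURST_SIZE; i++)
                rte_pktmbuf_free(bufs[i]);  /* drop whatever the NIC didn't take */
        }
    }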

Hardware setup:
CPU: E5-2620 v3 underclocked to 1.2 GHz
RAM: 4x 8 GB 1866 MHz DDR4 memory
NIC: X540-T2


Baseline test results:

DPDK   simple tx  full-featured tx
1.7.1  14.1 Mpps  10.7 Mpps
2.0.0  11.0 Mpps   9.3 Mpps

DPDK 1.7.1 is 28%/15% faster than 2.0 with simple/full-featured tx in 
this benchmark.


I then did a few runs of git bisect to identify commits that caused a 
significant drop in performance. You can find the script that I used to 
quickly test the performance of a version at [4].


Commit                                    simple  full-featured
7869536f3f8edace05043be6f322b835702b201c  13.9    10.4
mbuf: flatten struct vlan_macip

The commit log explains that there is a perf regression and that it 
cannot be avoided while staying future-compatible. The log claims < 5%, 
which is consistent with my test results (the old code is 4% faster). 
I guess that is okay and cannot be avoided.


Commit                                    simple  full-featured
08b563ffb19d8baf59dd84200f25bc85031d18a7  12.8    10.4
mbuf: replace data pointer by an offset

This affects the simple tx path significantly.
This performance regression is probably simply caused by the 
(temporarily) disabled vector tx code that is mentioned in the commit 
log. Not investigated further.



Commit                                    simple  full-featured
f867492346bd271742dd34974e9cf8ac55ddb869  10.7    9.1
mbuf: split mbuf across two cache lines.

This one is the real culprit.
The commit log does not mention any performance evaluations and a quick 
scan of the mailing list also doesn't reveal any evaluations of the 
impact of this change.

It looks like the main problem for tx is that the mempool pointer is in 
the second cacheline.

I think the new mbuf structure is too bloated. It forces you to pay for 
features that you don't need or don't want. I understand that it needs 
to support all possible filters and offload features. But it's kind of 
hard to justify a 25% difference in performance for a framework that sets 
performance above everything (does it? I picked that up from the 
discussion in the "Beyond DPDK 2.0" thread).

I've counted 56 bytes in use in the first cacheline in v2.0.0.

Would it be possible to move the pool pointer and tx offload fields to 
the first cacheline?

We would just need to free up 8 bytes. One candidate would be the seqn 
field: does it really have to be in the first cache line? Another 
candidate is the size of the ol_flags field: do we really need 64 flags? 
Sharing bits between rx and tx worked fine.


I naively tried to move the pool pointer into the first cache line in 
the v2.0.0 tag and the performance actually decreased; I'm not yet sure 
why. There are probably assumptions about the cacheline layout and 
prefetching in the code that would need to be adjusted.


Another possible solution would be a more dynamic approach to mbufs: the 
mbuf struct could be made configurable to fit the requirements of the 
application. This would probably require code generation or a lot of 
ugly preprocessor hacks and add a lot of complexity to the code.
The question would be if DPDK really values performance above everything 
else.


Paul


P.S.: I'm kind of disappointed by the lack of performance regression 
tests. I think that such tests should be an integral part of a 
framework with the explicit goal to be fast. For example, the main page 
at dpdk.org claims a performance of "usually less than 80 cycles" for an 
rx or tx operation. This claim is no longer true :(
Touching the layout of a core data structure like the mbuf shouldn't be 
done without carefully evaluating the performance impacts.
But this discussion probably belongs in the "Beyond DPDK 2.0" thread.


P.P.S.: Benchmarking an rx-only application (e.g. traffic analysis) 
would also be interesting, but that's not really on my todo list right 
now. Mixed rx/tx like forwarding is also affected as discussed in my 
last thread [1].

[1] http://dpdk.org/ml/archives/dev/2015-April/016921.html
[2] https://github.com/emmericp/MoonGen
[3] https://github.com/emmericp/dpdk-tx-performance
[4] https://g

[dpdk-dev] Intel fortville not working with multi-segment

2015-05-11 Thread Zhang, Helin
Hi Nissim

Are you using PF pass-through or VF pass-through?
For PF pass-through, you might have already gotten the fix.
For VF pass-through, there is a bug fix which is needed for supporting jumbo 
frame and multiple mbuf. http://www.dpdk.org/dev/patchwork/patch/4641/


Regards,
Helin

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Nissim Nisimov
> Sent: Monday, May 11, 2015 3:48 AM
> To: Nissim Nisimov; 'dev at dpdk.org'
> Subject: Re: [dpdk-dev] Intel fortville not working with multi-segment
> 
> Hi,
> 
> can someone assist regarding this issue?
> 
> Is it a known limitation in i40e/dpdk (no support for multi-segment)?
> 
> Thx
> Nissim
> 
> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Nissim Nisimov
> Sent: Thursday, May 07, 2015 5:44 PM
> To: 'dev at dpdk.org'
> Subject: [dpdk-dev] Intel fortville not working with multi-segment
> 
> Hi,
> 
> 
> 
> I am trying to work with Intel Fortville (XL710) NICs in Passthrough mode
> from a VM running dpdk app.
> 
> 
> First I didn't have any TX traffic from the VM, I got dpdk patch for this 
> issue
> and it fixed it. (http://www.dpdk.org/dev/patchwork/patch/4588/)
> 
> But now I see that when trying to run multi-segment traffic not all the
> packets reaching the VM (I tested it on bare metal as well and saw the
> same issue)
> 
> Is it a known issue? any workaround for it?
> 
> Thanks,
> Nissim



[dpdk-dev] [PATCH v7 02/10] eal/linux: add rte_epoll_wait/ctl support

2015-05-11 Thread Liang, Cunming


On 5/8/2015 10:57 AM, Stephen Hemminger wrote:
> On Tue,  5 May 2015 13:39:38 +0800
> Cunming Liang  wrote:
>
>> +else if (rc < 0) {
>> +/* epoll_wait fail */
>> +RTE_LOG(ERR, EAL, "epoll_wait returns with fail %s\n",
>> +strerror(errno));
> In real application there maybe other random signals.
> Therefore the code should ignore and return for case of EWOULDBLOCK and EINTR
[LCM] Thanks, you're right: when EINTR happens, we should continue with 
epoll_wait instead of returning.
As for EWOULDBLOCK, it seems epoll_wait won't return it, so I assume you 
are referring to the epoll event read.
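
A minimal sketch of the retry-on-EINTR pattern being discussed (illustrative
only, not the actual patch code):

    for (;;) {
        rc = epoll_wait(efd, events, maxevents, timeout);
        if (rc >= 0)
            break;                /* rc > 0: events ready, rc == 0: timeout */
        if (errno == EINTR)
            continue;             /* interrupted by a signal: just retry */
        RTE_LOG(ERR, EAL, "epoll_wait returns with fail %s\n",
                strerror(errno));
        return -1;
    }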


[dpdk-dev] Intel fortville not working with multi-segment

2015-05-11 Thread Nissim Nisimov
Hi,

I am using PF pass-through and it doesn't work even with a server 
response page size of 2000 bytes.
Looks like the first segment of each session is not received.

When I change the server response size to 1000 bytes, all works as 
expected.

I am working with dpdk 1.8 version.

Any idea why? Is it related to i40e multi-segment support?

Thx
Nissim

On May 11, 2015 5:03 AM, "Zhang, Helin"  wrote:
>
> Hi Nissim
>
> Are you using PF pass-through or VF pass-through?
> For PF pass-through, you might have already gotten the fix.
> For VF pass-through, there is




[dpdk-dev] [PATCH 0/6] extend flow director to support L2_payload type and VF filtering in i40e driver

2015-05-11 Thread Jingjing Wu
This patch set extends flow director to support the L2_payload flow type and VF 
filtering in the i40e driver.

Jingjing Wu (6):
  ethdev: add struct rte_eth_l2_flow to support l2_payload flow type
  i40e: extend flow director to support l2_payload flow type
  ethdev: extend struct to support flow director in VFs
  i40e: extend flow director to support filtering in VFs
  testpmd: extend commands
  doc: extend commands in testpmd

 app/test-pmd/cmdline.c  | 87 +++--
 doc/guides/testpmd_app_ug/testpmd_funcs.rst | 15 ++---
 lib/librte_ether/rte_eth_ctrl.h | 10 
 lib/librte_pmd_i40e/i40e_fdir.c | 39 +++--
 4 files changed, 132 insertions(+), 19 deletions(-)

-- 
1.9.3



[dpdk-dev] [PATCH 1/6] ethdev: add struct rte_eth_l2_flow to support l2_payload flow type

2015-05-11 Thread Jingjing Wu
This patch adds a new struct rte_eth_l2_flow to support l2_payload flow type

Signed-off-by: Jingjing Wu 
---
 lib/librte_ether/rte_eth_ctrl.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/lib/librte_ether/rte_eth_ctrl.h b/lib/librte_ether/rte_eth_ctrl.h
index 498fc85..0e30dd9 100644
--- a/lib/librte_ether/rte_eth_ctrl.h
+++ b/lib/librte_ether/rte_eth_ctrl.h
@@ -298,6 +298,13 @@ struct rte_eth_tunnel_filter_conf {
 #define RTE_ETH_FDIR_MAX_FLEXLEN 16 /** < Max length of flexbytes. */

 /**
+ * A structure used to define the input for L2 flow
+ */
+struct rte_eth_l2_flow {
+   uint16_t ether_type;  /**< Ether type to match */
+};
+
+/**
  * A structure used to define the input for IPV4 flow
  */
 struct rte_eth_ipv4_flow {
@@ -369,6 +376,7 @@ struct rte_eth_sctpv6_flow {
  * An union contains the inputs for all types of flow
  */
 union rte_eth_fdir_flow {
+   struct rte_eth_l2_flow l2_flow;
struct rte_eth_udpv4_flow  udp4_flow;
struct rte_eth_tcpv4_flow  tcp4_flow;
struct rte_eth_sctpv4_flow sctp4_flow;
-- 
1.9.3



[dpdk-dev] [PATCH 2/6] i40e: extend flow director to support l2_payload flow type

2015-05-11 Thread Jingjing Wu
This patch extends flow director to support the l2_payload flow
type in the i40e driver.

Signed-off-by: Jingjing Wu 
---
 lib/librte_pmd_i40e/i40e_fdir.c | 24 +---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/lib/librte_pmd_i40e/i40e_fdir.c b/lib/librte_pmd_i40e/i40e_fdir.c
index 7b68c78..27c2102 100644
--- a/lib/librte_pmd_i40e/i40e_fdir.c
+++ b/lib/librte_pmd_i40e/i40e_fdir.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -104,7 +105,8 @@
(1 << RTE_ETH_FLOW_NONFRAG_IPV6_UDP) | \
(1 << RTE_ETH_FLOW_NONFRAG_IPV6_TCP) | \
(1 << RTE_ETH_FLOW_NONFRAG_IPV6_SCTP) | \
-   (1 << RTE_ETH_FLOW_NONFRAG_IPV6_OTHER))
+   (1 << RTE_ETH_FLOW_NONFRAG_IPV6_OTHER) | \
+   (1 << RTE_ETH_FLOW_L2_PAYLOAD))

 #define I40E_FLEX_WORD_MASK(off) (0x80 >> (off))

@@ -366,7 +368,9 @@ i40e_init_flx_pld(struct i40e_pf *pf)

/* initialize the masks */
for (pctype = I40E_FILTER_PCTYPE_NONF_IPV4_UDP;
-pctype <= I40E_FILTER_PCTYPE_FRAG_IPV6; pctype++) {
+pctype <= I40E_FILTER_PCTYPE_L2_PAYLOAD; pctype++) {
+   if (!I40E_VALID_PCTYPE((enum i40e_filter_pctype)pctype))
+   continue;
pf->fdir.flex_mask[pctype].word_mask = 0;
I40E_WRITE_REG(hw, I40E_PRTQF_FD_FLXINSET(pctype), 0);
for (i = 0; i < I40E_FDIR_BITMASK_NUM_WORD; i++) {
@@ -704,6 +708,9 @@ i40e_fdir_fill_eth_ip_head(const struct rte_eth_fdir_input 
*fdir_input,
};

switch (fdir_input->flow_type) {
+   case RTE_ETH_FLOW_L2_PAYLOAD:
+   ether->ether_type = fdir_input->flow.l2_flow.ether_type;
+   break;
case RTE_ETH_FLOW_NONFRAG_IPV4_TCP:
case RTE_ETH_FLOW_NONFRAG_IPV4_UDP:
case RTE_ETH_FLOW_NONFRAG_IPV4_SCTP:
@@ -866,6 +873,17 @@ i40e_fdir_construct_pkt(struct i40e_pf *pf,
  sizeof(struct ipv6_hdr);
set_idx = I40E_FLXPLD_L3_IDX;
break;
+   case RTE_ETH_FLOW_L2_PAYLOAD:
+   payload = raw_pkt + sizeof(struct ether_hdr);
+   /*
+* ARP packet is a special case on which the payload
+* starts after the whole ARP header
+*/
+   if (fdir_input->flow.l2_flow.ether_type ==
+   rte_cpu_to_be_16(ETHER_TYPE_ARP))
+   payload += sizeof(struct arp_hdr);
+   set_idx = I40E_FLXPLD_L2_IDX;
+   break;
default:
PMD_DRV_LOG(ERR, "unknown flow type %u.", 
fdir_input->flow_type);
return -EINVAL;
@@ -1218,7 +1236,7 @@ i40e_fdir_info_get_flex_mask(struct i40e_pf *pf,
uint16_t off_bytes, mask_tmp;

for (i = I40E_FILTER_PCTYPE_NONF_IPV4_UDP;
-i <= I40E_FILTER_PCTYPE_FRAG_IPV6;
+i <= I40E_FILTER_PCTYPE_L2_PAYLOAD;
 i++) {
mask =  &pf->fdir.flex_mask[i];
if (!I40E_VALID_PCTYPE((enum i40e_filter_pctype)i))
-- 
1.9.3



[dpdk-dev] [PATCH 3/6] ethdev: extend struct to support flow director in VFs

2015-05-11 Thread Jingjing Wu
This patch extends struct rte_eth_fdir_flow_ext to support flow
director in VFs.

Signed-off-by: Jingjing Wu 
---
 lib/librte_ether/rte_eth_ctrl.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/librte_ether/rte_eth_ctrl.h b/lib/librte_ether/rte_eth_ctrl.h
index 0e30dd9..601a4d3 100644
--- a/lib/librte_ether/rte_eth_ctrl.h
+++ b/lib/librte_ether/rte_eth_ctrl.h
@@ -394,6 +394,8 @@ struct rte_eth_fdir_flow_ext {
uint16_t vlan_tci;
uint8_t flexbytes[RTE_ETH_FDIR_MAX_FLEXLEN];
/**< It is filled by the flexible payload to match. */
+   uint8_t is_vf;   /**< 1 for VF, 0 for port dev */
+   uint16_t dst_id; /**< VF ID, available when is_vf is 1*/
 };

 /**
-- 
1.9.3



[dpdk-dev] [PATCH 4/6] i40e: extend flow director to support filtering in VFs

2015-05-11 Thread Jingjing Wu
This patch extends flow director to support filtering in VFs.

Signed-off-by: Jingjing Wu 
---
 lib/librte_pmd_i40e/i40e_fdir.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/lib/librte_pmd_i40e/i40e_fdir.c b/lib/librte_pmd_i40e/i40e_fdir.c
index 27c2102..2f4c247 100644
--- a/lib/librte_pmd_i40e/i40e_fdir.c
+++ b/lib/librte_pmd_i40e/i40e_fdir.c
@@ -1008,6 +1008,11 @@ i40e_add_del_fdir_filter(struct rte_eth_dev *dev,
PMD_DRV_LOG(ERR, "Invalid queue ID");
return -EINVAL;
}
+   if (filter->input.flow_ext.is_vf &&
+   filter->input.flow_ext.dst_id >= pf->vf_num) {
+   PMD_DRV_LOG(ERR, "Invalid VF ID");
+   return -EINVAL;
+   }

memset(pkt, 0, I40E_FDIR_PKT_LEN);

@@ -1047,7 +1052,7 @@ i40e_fdir_filter_programming(struct i40e_pf *pf,
volatile struct i40e_tx_desc *txdp;
volatile struct i40e_filter_program_desc *fdirdp;
uint32_t td_cmd;
-   uint16_t i;
+   uint16_t vsi_id, i;
uint8_t dest;

PMD_DRV_LOG(INFO, "filling filter programming descriptor.");
@@ -1069,9 +1074,13 @@ i40e_fdir_filter_programming(struct i40e_pf *pf,
  I40E_TXD_FLTR_QW0_PCTYPE_SHIFT) &
  I40E_TXD_FLTR_QW0_PCTYPE_MASK);

-   /* Use LAN VSI Id by default */
+   if (filter->input.flow_ext.is_vf)
+   vsi_id = pf->vfs[filter->input.flow_ext.dst_id].vsi->vsi_id;
+   else
+   /* Use LAN VSI Id by default */
+   vsi_id = pf->main_vsi->vsi_id;
fdirdp->qindex_flex_ptype_vsi |=
-   rte_cpu_to_le_32((pf->main_vsi->vsi_id <<
+   rte_cpu_to_le_32((vsi_id <<
  I40E_TXD_FLTR_QW0_DEST_VSI_SHIFT) &
  I40E_TXD_FLTR_QW0_DEST_VSI_MASK);

-- 
1.9.3



[dpdk-dev] [PATCH 6/6] doc: extend commands in testpmd

2015-05-11 Thread Jingjing Wu
Modify the doc about flow director commands to support l2_payload
flow type and filtering in VFs.

Signed-off-by: Jingjing Wu 
---
 doc/guides/testpmd_app_ug/testpmd_funcs.rst | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/doc/guides/testpmd_app_ug/testpmd_funcs.rst 
b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
index 761172e..3d56097 100644
--- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
+++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
@@ -1527,27 +1527,28 @@ Different NICs may have different capabilities, command 
show port fdir (port_id)

 flow_director_filter (port_id) (add|del|update) flow 
(ipv4-other|ipv4-frag|ipv6-other|ipv6-frag)
 src (src_ip_address) dst (dst_ip_address) vlan (vlan_value) flexbytes 
(flexbytes_value)
-(drop|fwd) queue (queue_id) fd_id (fd_id_value)
+(drop|fwd) pf|vf(vf_id) queue (queue_id) fd_id (fd_id_value)

 flow_director_filter (port_id) (add|del|update) flow 
(ipv4-tcp|ipv4-udp|ipv6-tcp|ipv6-udp)
 src (src_ip_address) (src_port) dst (dst_ip_address) (dst_port) vlan 
(vlan_value)
-flexbytes (flexbytes_value) (drop|fwd) queue (queue_id) fd_id (fd_id_value)
+flexbytes (flexbytes_value) (drop|fwd) pf|vf(vf_id) queue (queue_id) fd_id 
(fd_id_value)

 flow_director_filter (port_id) (add|del|update) flow (ipv4-sctp|ipv6-sctp)
 src (src_ip_address) (src_port) dst (dst_ip_address) (dst_port) tag 
(verification_tag)
-vlan (vlan_value) flexbytes (flexbytes_value) (drop|fwd) queue (queue_id) 
fd_id (fd_id_value)
+vlan (vlan_value) flexbytes (flexbytes_value) (drop|fwd) pf|vf(vf_id) queue 
(queue_id) fd_id (fd_id_value)

-For example, to add an ipv4-udp flow type filter:
+flow_director_filter (port_id) (add|del|update) flow l2_payload
+ether (ethertype) flexbytes (flexbytes_value) (drop|fwd) pf|vf(vf_id) queue 
(queue_id) fd_id (fd_id_value)

 .. code-block:: console

-testpmd> flow_director_filter 0 add flow ipv4-udp src 2.2.2.3 32 dst 
2.2.2.5 33 vlan 0x1 flexbytes (0x88,0x48) fwd queue 1 fd_id 1
+testpmd> flow_director_filter 0 add flow ipv4-udp src 2.2.2.3 32 dst 
2.2.2.5 33 vlan 0x1 flexbytes (0x88,0x48) fwd pf queue 1 fd_id 1

 For example, add an ipv4-other flow type filter:

 .. code-block:: console

-testpmd> flow_director_filter 0 add flow ipv4-other src 2.2.2.3 dst 
2.2.2.5 vlan 0x1 flexbytes (0x88,0x48) fwd queue 1 fd_id 1
+testpmd> flow_director_filter 0 add flow ipv4-other src 2.2.2.3 dst 
2.2.2.5 vlan 0x1 flexbytes (0x88,0x48) fwd pf queue 1 fd_id 1

 flush_flow_director
 ~~~
@@ -1582,7 +1583,7 @@ flow_director_flex_mask
 set masks of flow director's flexible payload based on certain flow type:

 flow_director_flex_mask (port_id) flow 
(none|ipv4-other|ipv4-frag|ipv4-tcp|ipv4-udp|ipv4-sctp|
-ipv6-other|ipv6-frag|ipv6-tcp|ipv6-udp|ipv6-sctp|all) (mask)
+ipv6-other|ipv6-frag|ipv6-tcp|ipv6-udp|ipv6-sctp|l2_payload|all) (mask)

 Example, to set flow director's flex mask for all flow type on port 0:

-- 
1.9.3



[dpdk-dev] [PATCH 5/6] testpmd: extend commands

2015-05-11 Thread Jingjing Wu
This patch extends commands to support l2_payload flow type
and filtering in VFs of flow director.

Signed-off-by: Jingjing Wu 
---
 app/test-pmd/cmdline.c | 87 ++
 1 file changed, 81 insertions(+), 6 deletions(-)

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index f01db2a..438e948 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -632,7 +632,8 @@ static void cmd_help_long_parsed(void *parsed_result,
" flow (ipv4-other|ipv4-frag|ipv6-other|ipv6-frag)"
" src (src_ip_address) dst (dst_ip_address)"
" vlan (vlan_value) flexbytes (flexbytes_value)"
-   " (drop|fwd) queue (queue_id) fd_id (fd_id_value)\n"
+   " (drop|fwd) pf|vf(vf_id) queue (queue_id)"
+   " fd_id (fd_id_value)\n"
"Add/Del an IP type flow director filter.\n\n"

"flow_director_filter (port_id) (add|del|update)"
@@ -640,7 +641,8 @@ static void cmd_help_long_parsed(void *parsed_result,
" src (src_ip_address) (src_port)"
" dst (dst_ip_address) (dst_port)"
" vlan (vlan_value) flexbytes (flexbytes_value)"
-   " (drop|fwd) queue (queue_id) fd_id (fd_id_value)\n"
+   " (drop|fwd) pf|vf(vf_id)queue (queue_id)"
+   " fd_id (fd_id_value)\n"
"Add/Del an UDP/TCP type flow director filter.\n\n"

"flow_director_filter (port_id) (add|del|update)"
@@ -649,9 +651,15 @@ static void cmd_help_long_parsed(void *parsed_result,
" dst (dst_ip_address) (dst_port)"
" tag (verification_tag) vlan (vlan_value)"
" flexbytes (flexbytes_value) (drop|fwd)"
-   " queue (queue_id) fd_id (fd_id_value)\n"
+   " pf|vf(vf_id) queue (queue_id) fd_id (fd_id_value)\n"
"Add/Del a SCTP type flow director filter.\n\n"

+   "flow_director_filter (port_id) (add|del|update)"
+   " flow l2_payload ether (ethertype)"
+   " flexbytes (flexbytes_value) (drop|fwd)"
+   " pf|vf(vf_id) queue (queue_id) fd_id (fd_id_value)\n"
+   "Add/Del a l2 payload type flow director 
filter.\n\n"
+
"flush_flow_director (port_id)\n"
"Flush all flow director entries of a device.\n\n"

@@ -662,7 +670,7 @@ static void cmd_help_long_parsed(void *parsed_result,

"flow_director_flex_mask (port_id)"
" flow 
(none|ipv4-other|ipv4-frag|ipv4-tcp|ipv4-udp|ipv4-sctp|"
-   "ipv6-other|ipv6-frag|ipv6-tcp|ipv6-udp|ipv6-sctp|all)"
+   
"ipv6-other|ipv6-frag|ipv6-tcp|ipv6-udp|ipv6-sctp|l2_payload|all)"
" (mask)\n"
"Configure mask of flex payload.\n\n"

@@ -7653,6 +7661,8 @@ struct cmd_flow_director_result {
cmdline_fixed_string_t ops;
cmdline_fixed_string_t flow;
cmdline_fixed_string_t flow_type;
+   cmdline_fixed_string_t ether;
+   uint16_t ether_type;
cmdline_fixed_string_t src;
cmdline_ipaddr_t ip_src;
uint16_t port_src;
@@ -7665,6 +7675,7 @@ struct cmd_flow_director_result {
uint16_t vlan_value;
cmdline_fixed_string_t flexbytes;
cmdline_fixed_string_t flexbytes_value;
+   cmdline_fixed_string_t pf_vf;
cmdline_fixed_string_t drop;
cmdline_fixed_string_t queue;
uint16_t  queue_id;
@@ -7771,6 +7782,8 @@ cmd_flow_director_filter_parsed(void *parsed_result,
struct cmd_flow_director_result *res = parsed_result;
struct rte_eth_fdir_filter entry;
uint8_t flexbytes[RTE_ETH_FDIR_MAX_FLEXLEN];
+   char *end;
+   unsigned long vf_id;
int ret = 0;

ret = rte_eth_dev_filter_supported(res->port_id, RTE_ETH_FILTER_FDIR);
@@ -7837,6 +7850,10 @@ cmd_flow_director_filter_parsed(void *parsed_result,
entry.input.flow.sctp6_flow.verify_tag =
rte_cpu_to_be_32(res->verify_tag_value);
break;
+   case RTE_ETH_FLOW_L2_PAYLOAD:
+   entry.input.flow.l2_flow.ether_type =
+   rte_cpu_to_be_16(res->ether_type);
+   break;
default:
printf("invalid parameter.\n");
return;
@@ -7852,6 +7869,27 @@ cmd_flow_director_filter_parsed(void *parsed_result,
entry.action.behavior = RTE_ETH_FDIR_REJECT;
else
entry.action.behavior = RTE_ETH_FDIR_ACCEPT;
+
+   if (!strcmp(res->pf_vf, "pf"))
+   entry.input.flow_ext.is_vf = 0;
+   else if (!strnc

[dpdk-dev] [PATCH v7 09/10] igb: enable rx queue interrupts for PF

2015-05-11 Thread Liang, Cunming


On 5/6/2015 7:16 AM, Stephen Hemminger wrote:
> On Tue,  5 May 2015 13:39:45 +0800
> Cunming Liang  wrote:
>
>> The patch does below for igb PF:
>> - Setup NIC to generate MSI-X interrupts
>> - Set the IVAR register to map interrupt causes to vectors
>> - Implement interrupt enable/disable functions
>>
>> Signed-off-by: Danny Zhou 
>> Signed-off-by: Cunming Liang 
> What about E1000?
>
> This only usable if it works on all devices.
[LCM] Agree with you; I will send a separate patch for e1000 after the patch 
series closes.


[dpdk-dev] [PATCH v6 7/8] igb: enable rx queue interrupts for PF

2015-05-11 Thread Liang, Cunming


On 3/21/2015 4:51 AM, Stephen Hemminger wrote:
> On Fri, 27 Feb 2015 12:56:15 +0800
> Cunming Liang  wrote:
>
>>   
>>   /*
>> + * It clears the interrupt causes and enables the interrupt.
>> + * It will be called once only during nic initialized.
>> + *
>> + * @param dev
>> + *  Pointer to struct rte_eth_dev.
>> + *
>> + * @return
>> + *  - On success, zero.
>> + *  - On failure, a negative value.
>> + */
>> +static int eth_igb_rxq_interrupt_setup(struct rte_eth_dev *dev)
>> +{
>> +
> This function should be void
> It always succeeds and the caller just not check the return value.
>
> If you did this in one driver, I bet other drivers have same problem.
[LCM] The previous reason was probably to keep consistent with 
lsc_interrupt_setup, but I think it's reasonable to change it to void.
Another thing I'm considering is whether it is necessary to condition 
rxq_interrupt_setup on intr_conf.rxq at all, since even without it we can 
manually turn on the rxq interrupt via the rte_eth_dev_rx_intr_enable API.



[dpdk-dev] [PATCH v7 08/10] ixgbe: enable rx queue interrupts for both PF and VF

2015-05-11 Thread Liang, Cunming


On 5/6/2015 2:36 AM, Stephen Hemminger wrote:
> On Tue,  5 May 2015 13:39:44 +0800
> Cunming Liang  wrote:
>
>>   
>> +/* set max interrupt vfio request */
>> +if (pci_dev->intr_handle.vec_en) {
>> +pci_dev->intr_handle.max_intr = hw->mac.max_rx_queues +
>> +IXGBEVF_MAX_OTHER_INTR;
>> +pci_dev->intr_handle.intr_vec =
>> +rte_zmalloc("intr_vec",
>> +hw->mac.max_rx_queues * sizeof(int), 0);
>> +
> Since MSI-X vectors are limited on many hardware platforms, this whole API
> should be changed so that max_intr is based on number of rx_queues actually
> used by the application.  That means the setup needs to move from init to 
> configure.
[LCM] When MSI-X is not used, intr_vec and max_intr are unused, so this 
doesn't matter for non-MSI-X mode.
Since the sequence "dev_stop->dev_reconfig->dev_start" is allowed, the 
number of queues actually in use may change.
So allocating only in dev_init and releasing only in dev_close keeps it 
simple. During MSI-X configuration, the real number of queues in use is 
used to set the queue/vector mapping; refer to xxx_configure_msix().


[dpdk-dev] [PATCH v3] kni: fix compilation issue in KNI vhost on kernel 3.19/4.0

2015-05-11 Thread Thomas Monjalon
2015-05-10 23:01, De Lara Guarch, Pablo:
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> > I have the following errors with Linux 4.0.1:
> > 
> > lib/librte_eal/linuxapp/kni/igb_main.c:2321:2: error: initialization from
> > incompatible pointer type
> >   .ndo_bridge_setlink = igb_ndo_bridge_setlink,
> >   ^
> > lib/librte_eal/linuxapp/kni/igb_main.c:2321:2: error: (near initialization 
> > for
> > 'igb_netdev_ops.ndo_bridge_setlink')
> > lib/librte_eal/linuxapp/kni/igb_main.c: In function 'igb_xmit_frame_ring':
> > lib/librte_eal/linuxapp/kni/igb_main.c:5482:2: error: implicit declaration 
> > of
> > function 'vlan_tx_tag_present'
> >   if (vlan_tx_tag_present(skb)) {
> >   ^
> > lib/librte_eal/linuxapp/kni/igb_main.c:5484:3: error: implicit declaration 
> > of
> > function 'vlan_tx_tag_get'
> >tx_flags |= (vlan_tx_tag_get(skb) << IGB_TX_FLAGS_VLAN_SHIFT);
> >^
> 
> I sent a patch for that (kni: fix compilation issue on kernel 4.0.0), by the 
> end of last month.

Oh yes, you're right, sorry.

> Is it OK to merge it or do you want me to send a v4 of this one, including 
> that fix?

Separate patches are OK, thanks.


[dpdk-dev] [PATCH v3] kni: fix compilation issue in KNI vhost on kernel 3.19/4.0

2015-05-11 Thread Thomas Monjalon
> Due to commit c0371da6 in kernel 3.19, which removed msg_iov
> and msg_iovlen from struct msghdr, DPDK would not build.
> Also, functions memcpy_toiovecend and memcpy_fromiovecend
> were removed in commits ba7438ae and 57dd8a07, being substituted by
> copy_from_iter and copy_to_iter.
> 
> This patch makes use of struct iov_iter, which has references
> to msg_iov and msg_iovln, and makes use of copy_from_iter
> and copy_to_iter.
> 
> Changes in v2:
> - Replaced functions memcpy_toiovecend and memcpy_fromiovecend
>   with copy_from_iter and copy_to_iter
> 
> Changes in v3:
> - Fixed variable names
> - Add missing checks
> 
> Reported-by: Thomas Monjalon 
> Signed-off-by: Pablo de Lara 

Applied, thanks


[dpdk-dev] [PATCH] kni: fix compilation issue on kernel 4.0.0

2015-05-11 Thread Thomas Monjalon
> Due to API changes in function pointer ndo_bridge_setlink
> (commit ad41faa8) and the rename of functions vlan_tx_*
> (commit df8a39de) in kernel 4.0, DPDK would not build.
> 
> This patch adds the properly checks to fix the compilation.
> 
> Reported-by: Stephen Hemminger 
> Signed-off-by: Pablo de Lara 

Applied, thanks


[dpdk-dev] TX performance regression caused by the mbuf cachline split

2015-05-11 Thread Luke Gorrie
Hi Paul,

On 11 May 2015 at 02:14, Paul Emmerich  wrote:

> Another possible solution would be a more dynamic approach to mbufs:


Let me suggest a slightly more extreme idea for your consideration. This
method can easily do > 100 Mpps with one very lightly loaded core. I don't
know if it works for your application or not but I share it just in case.

Background: Load generators are specialist applications and can benefit
from specialist transmit mechanisms.

You can instruct the NIC to send up to 32K packets with one operation: load
the address of a descriptor list into the TDBA register (Transmit
Descriptor Base Address).

The descriptor list is a simple series of 64-bit values: addr0, flags0,
addr1, flags1, ... etc. It is easy to construct by hand.

The NIC can also be made to play the packets in a loop. You just have to
periodically reset the DMA cursor to make all the packets valid again. That
is a simple register poke: TDT = TDH-1.
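
A rough sketch of the idea (TDBA/TDLEN/TDH/TDT are the registers named in the
Intel datasheets; dma_alloc(), write_reg(), read_reg() and the flag names are
hypothetical helpers, not DPDK or SnabbSwitch APIs):

    struct tx_desc { uint64_t addr; uint64_t flags; };  /* 16-byte legacy descriptor */

    struct tx_desc *ring = dma_alloc(n * sizeof(*ring)); /* physically contiguous */
    for (unsigned i = 0; i < n; i++) {
        ring[i].addr  = pkt_phys_addr[i];               /* packet buffer DMA address */
        ring[i].flags = pkt_len[i] | CMD_EOP | CMD_RS;  /* illustrative flag names */
    }
    write_reg(TDBA, ring_phys_addr);       /* point the NIC at the descriptor list */
    write_reg(TDLEN, n * sizeof(*ring));
    write_reg(TDT, n - 1);                 /* hand all descriptors to the NIC */

    /* to replay the same packets in a loop, periodically reset the cursor: */
    write_reg(TDT, read_reg(TDH) - 1);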

We do this routinely when we want to generate a large amount of traffic
with few resources, typically when generating load using spare capacity of
a device under test. (I have sample code but it is not based on DPDK.)

If you want all of your packets to be unique then you have to be a bit more
clever. For example you could poll to see the DMA progress: let half the
packets be sent, then rewrite those while the other half are sent, and so
on. Kind of like the way video games tracked the progress of the display
scan beam to update parts of the frame buffer that were not being DMA'd.

This method may impose other limitations that are not acceptable for your
application of course. But if not then it can drastically reduce the number
of instructions and cache footprint required to generate load. You don't
have to touch mbufs or descriptors at all. You just update the payload and
update the DMA register every millisecond or so.

Cheers,
-Luke


[dpdk-dev] [PATCH] enic: add support for enic in nic_uio driver for FreeBSD

2015-05-11 Thread Thomas Monjalon
2015-05-07 12:57, David Marchand:
> On Thu, May 7, 2015 at 11:23 AM, Bruce Richardson wrote:
> > On Thu, May 07, 2015 at 09:19:09AM +0530, Sujith Sankar wrote:
> > > This patch adds support for enic in the nic_uio driver so that enic
> > could be used on FreeBSD.
> > >
> > > Signed-off-by: Sujith Sankar 
> >
> > Acked-by: Bruce Richardson 
> 
> Well this is not really bsd specific, as people who rely on
> rte_pci_dev_ids.h header to find devices that must be bound to igb_uio and
> consort, will also benefit from this fix.
> By the way, I am working on removing these device ids from the eal, since
> the pmds should be the only one that maintain their devices ids list.

Agree

> Will send some patches soon.
> 
> Acked-by: David Marchand 

Applied, thanks


[dpdk-dev] [PATCH v6 0/6] enicpmd: Cisco Systems Inc. VIC Ethernet PMD

2015-05-11 Thread Thomas Monjalon
Hi Sujith,

2015-02-27 08:09, Sujith Sankar:
> Hi Thomas,
> 
> No update on it from my side :-(
> It would take some more time for me to start working on it (and flow
> director api) as a few other things are keeping me busy.

[...]

Documentation was split to better welcome new NICs:
http://dpdk.org/doc/guides/nics/index.html
It would be nice to have some insights about features, design and performance
numbers for enic.

Thanks

> >>2015-01-21 05:03, Sujith Sankar:
> >>> Hi David,
> >>> 
> >>> Apologies for the delay.  I was not able to find quality time to finish
> >>>it
> >>> as a few other things have been keeping me busy.  But I shall work on
> >>>it
> >>> and provide the doc and the perf details soon.
> >>> In the mean time, it would be great if you could point me to some
> >>>resources
> >>> on running pktgen-dpdk as I was stuck on it.
> >>> 
> >>> Thanks,
> >>> -Sujith
> >>> 
> >>> From: David Marchand
> >>>mailto:david.marchand at 6wind.com>>
> >>> Date: Tuesday, 20 January 2015 4:55 pm
> >>> > Hello Sujith,
> >>> > 
> >>> > Any news on the documentation and the performance numbers you said
> >>>you
> >>> > would send ?
> >>> > 
> >>> > Thanks.
> >>> > 
> >>> > --
> >>> > David Marchand
> >>> > 
> >>> > On Thu, Nov 27, 2014 at 4:31 PM, Thomas Monjalon
> >>> > mailto:thomas.monjalon at 6wind.com>> 
> >>> > wrote:
> >>> > > 2014-11-27 04:27, Sujith Sankar:
> >>> > > > Thanks Thomas, David and Neil !
> >>> > > > 
> >>> > > > I shall work on finishing the documentation.
> >>> > > > About that, you had mentioned that you wanted it in doc/drivers/
> >>>path.
> >>> > > > Could I send a patch with documentation in the path
> >>> > > > doc/drivers/enicpmd/
> >>> > > > ?
> >>> > > 
> >>> > > Yes.
> >>> > > I'd prefer doc/drivers/enic/ but it's a detail ;)
> >>> > > The format must be sphinx rst to allow web publishing.
> >>> > > 
> >>> > > It would be great to have some design documentation of every
> >>>drivers
> >>> > > in doc/drivers.
> >>> > > 
> >>> > > Thanks
> >>> > > --
> >>> > > Thomas
> >
> 




[dpdk-dev] [RFC PATCH 0/2] Move PMDs out of lib directory

2015-05-11 Thread Thomas Monjalon
2015-05-07 16:35, Bruce Richardson:
> The "lib" directory is getting very crowded, with both general libs and 
> poll mode drivers in it. This patch set proposes to move the PMDs out of the
> lib folder and to put them in a separate "pmds" folder. This should help
> with code browse-ability as the number of libs, and pmds increases.
> 
> Comments or objections?

When someone is looking for a driver implementation and checks what is done in
DPDK, it will be easier to open a directory named "drivers" rather than "pmds".
I agree that they are not really libs as they are used as plugins. So they
deserve a separate directory at the top level.
Moreover, I suspect that the dataplane managed by DPDK can be extended to
crypto and storage devices in the near future.

So, I would suggest
drivers/net
drivers/crypto
drivers/storage
This kind of split could help to clearly define the responsibilities of some
new git subtrees.

Don't you think we could also remove the librte_pmd_ prefix for these new
directories?
Ultimately, I'd like to see the subdirectories for e1000, ixgbe and i40e renamed
to base/.


[dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.

2015-05-11 Thread Ananyev, Konstantin
Hi Ravi,

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Ravi Kerur
> Sent: Friday, May 08, 2015 11:55 PM
> To: Matt Laswell
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE 
> instructions.
> 
> On Fri, May 8, 2015 at 3:29 PM, Matt Laswell  
> wrote:
> 
> >
> >
> > On Fri, May 8, 2015 at 4:19 PM, Ravi Kerur  wrote:
> >
> >> This patch replaces memcmp in librte_hash with rte_memcmp which is
> >> implemented with AVX/SSE instructions.
> >>
> >> +static inline int
> >> +rte_memcmp(const void *_src_1, const void *_src_2, size_t n)
> >> +{
> >> +   const uint8_t *src_1 = (const uint8_t *)_src_1;
> >> +   const uint8_t *src_2 = (const uint8_t *)_src_2;
> >> +   int ret = 0;
> >> +
> >> +   if (n & 0x80)
> >> +   return rte_cmp128(src_1, src_2);
> >> +
> >> +   if (n & 0x40)
> >> +   return rte_cmp64(src_1, src_2);
> >> +
> >> +   if (n & 0x20) {
> >> +   ret = rte_cmp32(src_1, src_2);
> >> +   n -= 0x20;
> >> +   src_1 += 0x20;
> >> +   src_2 += 0x20;
> >> +   }
> >>
> >>
> > Pardon me for butting in, but this seems incorrect for the first two cases
> > listed above, as the function as written will only compare the first 128 or
> > 64 bytes of each source and return the result.  The pattern expressed in
> > the 32 byte case appears more correct, as it compares the first 32 bytes
> > and then lets later pieces of the function handle the smaller remaining
> > bits of the sources. Also, if this function is to handle arbitrarily large
> > source data, the 128 byte case needs to be in a loop.
> >
> > What am I missing?
> >
> 
> Current max hash key length supported is 64 bytes, hence no comparison is
> done after 64 bytes. 128 bytes comparison is added to measure performance
> only and there is no use-case as of now. With the current use-cases its not
> required but if there is a need to handle large arbitrary data upto 128
> bytes it can be modified.

So on x86, let's say rte_memcmp(k1, k2, 65) might produce invalid results, right?
While on PPC it will work as expected (as it calls memcmp underneath)? 
That looks really weird to me.
If you plan to use rte_memcmp only for hash comparisons, then probably
you should put it somewhere into librte_hash and name it accordingly: 
rte_hash_key_cmp() or something. 
And put a big comment around it saying that it only works with particular lengths.
If you want it to be a generic function inside EAL, then it probably needs to 
handle different lengths properly
on all supported architectures. 
Konstantin
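
To illustrate: a generic variant would have to loop over the full length,
roughly like this (a sketch reusing the rte_cmp32() helper from the patch
above, not a proposed implementation):

    static inline int
    rte_memcmp_generic(const void *_s1, const void *_s2, size_t n)
    {
        const uint8_t *s1 = (const uint8_t *)_s1;
        const uint8_t *s2 = (const uint8_t *)_s2;
        int ret;

        while (n >= 32) {                  /* one 32-byte vector chunk at a time */
            ret = rte_cmp32(s1, s2);
            if (ret != 0)
                return ret;
            s1 += 32;
            s2 += 32;
            n -= 32;
        }
        return n ? memcmp(s1, s2, n) : 0;  /* scalar fallback for the tail */
    }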

> 
> >
> > --
> > Matt Laswell
> > infinite io, inc.
> > laswell at infiniteio.com
> >
> >


[dpdk-dev] [PATCH] librte_eal:Using compiler memory barrier for IA processor's rte_wmb/rte_rmb.

2015-05-11 Thread Ananyev, Konstantin
Hi Dong,

> -Original Message-
> From: Wang Dong [mailto:dong.wang.pro at hotmail.com]
> Sent: Saturday, May 09, 2015 11:24 AM
> To: Ananyev, Konstantin; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] librte_eal:Using compiler memory barrier for 
> IA processor's rte_wmb/rte_rmb.
> 
> Hi Konstantin,
> 
> >
> > Hi Dong,
> >
> >> -Original Message-
> >> From: Wang Dong [mailto:dong.wang.pro at hotmail.com]
> >> Sent: Thursday, May 07, 2015 4:28 PM
> >> To: Ananyev, Konstantin; dev at dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH] librte_eal:Using compiler memory barrier 
> >> for IA processor's rte_wmb/rte_rmb.
> >>
> >> Hi Konstantin,
> >>
> >>> Hi Dong,
> >>>
>  -Original Message-
>  From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of WangDong
>  Sent: Tuesday, May 05, 2015 4:38 PM
>  To: dev at dpdk.org
>  Subject: [dpdk-dev] [PATCH] librte_eal:Using compiler memory barrier for 
>  IA processor's rte_wmb/rte_rmb.
> 
>  The current implementation of rte_wmb/rte_rmb for x86 is using processor 
>  memory barrier. It's unnessary for IA processor,
> >> compiler
>  memory barrier is enough.
> >>>
> >>> I wouldn't say they are 'unnecessary'.
> >>> There are situations, even on IA, when you need _fence_ isntructions.
> >>> So, please leave rte_*mb() macros unmodified.
> >> OK, leave them unmodified, but I really can't find a situation to use
> >> sfence and lfence instructions.
> >
> > For example:
> > http://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
> > http://dpdk.org/ml/archives/dev/2014-May/002613.html
> >
> >>
> >>
> >>> I still think that we need to create a new set of architecture dependent 
> >>> macros, as what discussed before.
> >>> Probably by analogy with linux kernel rte_smp_*mb() is a good name for 
> >>> them.
> >>> Though if you have some better name in mind, I am open to suggestions 
> >>> here.
> >> What abount rte_dma_*mb()? I find dma_*mb() in linux-4.0.1, it looks good~~
> >
> > Hmm, but why _dma_?
> > We need same thing for multi-core communication too.
> > If rte_smp_ is not good enough, might be: rte_arch_?
> I want these two macro only used in PMD, so I think _dma_ is better. The
> memory barrier of processor-processor maybe more complex, and I'm not
> familiar with it... Someone can add rte_smp_*mb for multi-core.

Sorry, what are you talking about?
In the end, it will use the same instructions, whatever we name it: _dma_, _smp_, 
_arch_.
Konstantin
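
For context, the classic IA case where a real store fence is still required
(as in the links above) is weakly ordered non-temporal stores, e.g. (an
illustrative fragment; dst, chunk0/chunk1 and flag are made-up names):

    /* streaming stores from emmintrin.h bypass the normal x86 store ordering */
    _mm_stream_si128((__m128i *)&dst[0], chunk0);
    _mm_stream_si128((__m128i *)&dst[1], chunk1);
    _mm_sfence();   /* make the NT stores globally visible ... */
    flag = 1;       /* ... before publishing the "data ready" flag */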

> 
> I think _arch_ is means nothing here, because rte_*mb is already for
> architectures that dpdk supported, they are redefined in these architecture.
> 
> >
> >>
> >>>
>  But if dpdk runing on a AMD processor, maybe we should use processor 
>  memory barrier.
> >>>
> >>> As far as I remember, amd has the same memory ordering model.
> >> It's too hard to find a AMD's software developer manual.
> >
> > There for example:
> > http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/24593_APM_v21.pdf
> > ?
> Search such document on AMD offical website for a long time, this manual
> is what I want, thanks very much!!!
> 
> Dong
> 
> >
> > Konstantin
> >
> >>
> >> Dong
> >>
> >>> So, I don't think we need  #ifdef RTE_ARCH_X86_IA here.
> >>>
> >>> Konstantin
> >>>
>  I add a macro to distinguish them, if we compile DPDK for IA processor, 
>  add the macro (RTE_ARCH_X86_IA) can improve
> >> performance
>  with compiler memory barrier. Or we can add RTE_ARCH_X86_AMD for using 
>  processor memory barrier, in this case, if didn't
> add
> >> the
>  macro, the memory ordering will not be guaranteed. Which macro is better?
>  If this patch applied, the PMD's old implementation of compiler memory 
>  barrier (some volatile variable) can be fixed with
> >> rte_rmb()
>  and rte_wmb() for any architecture.
> 
>  ---
> lib/librte_eal/common/include/arch/x86/rte_atomic.h | 10 ++
> 1 file changed, 10 insertions(+)
> 
>  diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic.h 
>  b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
>  index e93e8ee..52b1e81 100644
>  --- a/lib/librte_eal/common/include/arch/x86/rte_atomic.h
>  +++ b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
>  @@ -49,10 +49,20 @@ extern "C" {
> 
> #define   rte_mb() _mm_mfence()
> 
>  +#ifdef RTE_ARCH_X86_IA
>  +
>  +#define rte_wmb() rte_compiler_barrier()
>  +
>  +#define rte_rmb() rte_compiler_barrier()
>  +
>  +#else
>  +
> #define   rte_wmb() _mm_sfence()
> 
> #define   rte_rmb() _mm_lfence()
> 
>  +#endif
>  +
> /*- 16 bit atomic operations 
>  -*/
> 
> #ifndef RTE_FORCE_INTRINSICS
>  --
>  1.9.1
> >>>


[dpdk-dev] Getting started - sanity check

2015-05-11 Thread Bruce Richardson
On Sat, May 09, 2015 at 04:27:12PM +, Clark, Gilbert wrote:
> 
> Hi folks:
> 
> I'm brand new to DPDK. Read about it off and on occasionally, but never had 
> the chance to sit down and play with things until now. It's been fun so far: 
> just been working on a few toy applications to get myself started.
> 
> I have run into a question, though: when calling rte_eth_tx_burst with a 
> ring-backed PMD I've set up, the mbufs I've sent never seem to be freed. 
> This seems to make some degree of sense, but ... since I'm new, and because 
> the documentation says rte_eth_tx_burst should eventually free mbufs that are 
> sent [1], I wanted to make sure I'm on track and not just misunderstanding 
> the way something works [2].
> 
> Thanks,
> Gilbert Clark
> 
> [1] From http://dpdk.org/doc/api/rte__ethdev_8h.html?:
> 
> It is the responsibility of the rte_eth_tx_burst() function to transparently 
> free the memory buffers of packets previously sent
> 
> [2] From lib/librte_pmd_ring.c:
> 
> static uint16_t
> eth_ring_tx(void *q, struct rte_mbuf **bufs, uint16_t nb_bufs)
> {
> void **ptrs = (void *)&bufs[0];
> struct ring_queue *r = q;
> const uint16_t nb_tx = (uint16_t)rte_ring_enqueue_burst(r->rng,
> ptrs, nb_bufs);
> if (r->rng->flags & RING_F_SP_ENQ) {
> r->tx_pkts.cnt += nb_tx;
> r->err_pkts.cnt += nb_bufs - nb_tx;
> } else {
> rte_atomic64_add(&(r->tx_pkts), nb_tx);
> rte_atomic64_add(&(r->err_pkts), nb_bufs - nb_tx);
> }
> return nb_tx;
> }
> 
> This doesn't ever appear to free a transmitted mbuf ... unless there's code 
> to do that somewhere else that I'm missing?

Indeed it doesn't free the mbufs, because this is not a PMD backed by real 
hardware
so the packets are never actually transmitted anywhere, just passed to the other
end of the ring. To behave strictly like a physical PMD, we would copy the sent
packet to a new buffer on RX, and free the old one. 
However, in this case, we take a shortcut and just pass the same mbuf on RX as
was passed on TX, which saves the cycles for buffer management.

To see buffer freeing on TX occur, I suggest you look at some of the other PMDs,
perhaps the e1000/igb PMD?
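
Roughly, what a hardware PMD's tx path does is recycle completed descriptors
and free the attached mbufs, along these lines (an illustrative sketch with
made-up field names, not the actual e1000/igb code):

    /* walk the descriptors the NIC has finished with and release their mbufs */
    while (txq->nb_tx_free < nb_needed) {
        struct tx_entry *sw = &txq->sw_ring[txq->tx_next_dd];

        if (!tx_desc_done(txq, txq->tx_next_dd))   /* hypothetical DD-bit check */
            break;
        rte_pktmbuf_free(sw->mbuf);                /* back to its mempool */
        sw->mbuf = NULL;
        txq->tx_next_dd = (txq->tx_next_dd + 1) & (txq->nb_tx_desc - 1);
        txq->nb_tx_free++;
    }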

/Bruce


[dpdk-dev] TX performance regression caused by the mbuf cachline split

2015-05-11 Thread Paul Emmerich
Hi Luke,

thanks for your suggestion. I actually looked at how your packet 
generator in SnabbSwitch works before, and it's quite clever. But 
unfortunately that's not what I'm looking for.

I'm looking for a generic solution that works with whatever NIC is 
supported by DPDK and I don't want to write NIC-specific transmit logic.
I don't want to maintain, test, or debug drivers. That's why I chose 
DPDK in the first place.

The DPDK drivers (used to) hit a sweet spot for performance. I can 
usually load about two 10 Gbit/s ports on a reasonably sized CPU core 
without worrying about writing my own device drivers*. This allows for 
packet generation at interesting packet rates on low-end servers (e.g. 
servers with Xeon E3 1230 v2 CPUs and dual-port NICs). Servers with more 
ports usually also have the necessary CPU power to handle it.


I also don't want to be limited to packet generation in the long run. 
For example, I have a student who is working on an IPSec offloading 
application and another student working on a proof-of-concept router.


Paul


*) yes, I still need some NIC-specific low-level code (timestamping) and 
a small patch in the DPDK drivers (flag to disable CRC offloading on a 
per-packet basis) for some features of my packet generator.


[dpdk-dev] Issues met while running openvswitch/dpdk/virtio inside the VM

2015-05-11 Thread Traynor, Kevin

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Pravin Shelar
> Sent: Friday, May 8, 2015 2:20 AM
> To: Oleg Strikov
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] Issues met while running openvswitch/dpdk/virtio
> inside the VM
> 
> On Thu, May 7, 2015 at 9:22 AM, Oleg Strikov 
> wrote:
> > Hi DPDK users and developers,
> >
> > Few weeks ago I came up with the idea to run openvswitch with dpdk backend
> > inside qemu-kvm virtual machine. I don't have enough supported NICs yet and
> > my plan was to start experimenting inside the virtualized environment,
> > achieve functional state of all the components and then switch to the real
> > hardware. Additional useful side-effect of doing things inside the vm is
> > that issues can be easily reproduced by someone else in a different
> > environment.
> >
> > I (fondly) hoped that running openvswitch/dpdk inside the vm would be
> > simpler than running the same set of components on the real hardware.
> > Unfortunately I met a bunch of issues on the way. All these issues lie on a
> > borderline between dpdk and openvswitch but I think that you might be
> > interested in my story. Please note that I still don't have
> > openvswitch/dpdk working inside the vm. I definetely have some progress
> > though.
> >
> Thanks for summarizing all the issues.
> DPDK is testing is done on real hardware and we are planing testing it
> in VM. This will certainly help in fixing issues sooner.
> 
> > Q: Does it sound okay from functional (not performance) standpoint to run
> > openvswitch/dpdk inside the vm? Do we want to be able to do this? Does
> > anyone from the dpdk development team do this?
> >
> > ## Issue 1 ##
> >
> > Openvswitch requires backend pmd driver to provide N_CORES tx queues where
> > N_CORES is the amount of cores available on the machine (openvswitch counts
> > the amount of cpu* entries inside /sys/devices/system/node/node0/ folder).
> > To my understanding it doesn't take into account the actual amount of cores
> > used by dpdk and just allocates tx queue for each available core. You may
> > refer to this chunk of code for details:
> > https://github.com/openvswitch/ovs/blob/master/lib/dpif-netdev.c#L1067
> >
> In case of OVS DPDK, there is no dpdk thread. Therefore all polling
> cores are managed by OVS and there is no need to account cores for
> DPDK. You can assign specific cores for OVS to limit number of cores
> used by OVS.
> 
> > This approach works fine on the real hardware but makes some issues when we
> > run openvswitch/dpdk inside the virtual machine. I tried both emulated
> > e1000 NIC and virtio NIC and neither of them worked just from the box.
> > Emulated e1000 NIC doesn't support multiple tx queues at all (see
> > http://dpdk.org/browse/dpdk/tree/lib/librte_pmd_e1000/em_ethdev.c#n884) and
> > virtio NIC doesn't support multiple tx queues by default. To enable
> > multiple tx queue for virtio NIC I had to add the following line to the
> > interface section of my libvirt config: ''
> >
> Good point. We should document this. Can you send patch to update
> README.DPDK?

Daniele's patch http://openvswitch.org/pipermail/dev/2015-March/052344.html
also allows for having a limited set of queues available. The documentation
patch is a good idea too.

> 
> > ## Issue 2 ##
> >
> > Openvswitch calls rte_eth_tx_queue_setup() twice for the same
> > port_id/queue_id. First call takes place during device initialization (see
> > call to dpdk_eth_dev_init() inside netdev_dpdk_init():
> > https://github.com/openvswitch/ovs/blob/master/lib/netdev-dpdk.c#L522).
> > Second call takes place when openvswitch tries to add more tx queues to the
> > device (see call to dpdk_eth_dev_init() inside netdev_dpdk_set_multiq():
> > https://github.com/openvswitch/ovs/blob/master/lib/netdev-dpdk.c#L697).
> > Second call not only initialized new queues but tries to re-initialize
> > existing ones.
> >
> > Unfortunately virtio driver can't handle second call of
> > rte_eth_tx_queue_setup() and returns error here:
> > http://dpdk.org/browse/dpdk/tree/lib/librte_pmd_virtio/virtio_ethdev.c#n316
> > This happens because memzone with the name portN_tvqN already exists when
> > second call takes place (memzone has been created during the first call).
> > To deal with this issue I had to manually add rte_memzone_lookup-based
> > check for this situation and avoid allocation of a new memzone if it
> > already exists.
> >
> This sounds like issue with virtIO driver. I think we need to fix DPDK
> upstream for this to work correctly.
> 
> > Q: Is it okay that openvswitch calls rte_eth_tx_queue_setup() twice? Right
> > now I can't understand if it's the issue with the virtio pmd driver or
> > incorrect API usage by openvswitch? Could someone shed some light on this
> > so I can move forward and maybe propose a fix.
> >
> > ## Issue 3 ##
> >
> > This issue is also (somehow) related to the fact that openvswitch calls
> > rte_eth_tx_queue_setup() twice. I

[dpdk-dev] [PATCH 2/6] rte_sched: expand scheduler hierarchy for more VLAN's

2015-05-11 Thread Thomas Monjalon
2015-04-29 10:04, Stephen Hemminger:
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -285,7 +285,10 @@ struct rte_mbuf {
>   /**< First 4 flexible bytes or FD ID, dependent on
>PKT_RX_FDIR_* flag in ol_flags. */
>   } fdir;   /**< Filter identifier if FDIR enabled */
> - uint32_t sched;   /**< Hierarchical scheduler */
> + struct {
> + uint32_t lo;
> + uint32_t hi;
> + } sched;  /**< Hierarchical scheduler */

Please don't use tabs to align a comment.


[dpdk-dev] [PATCH 4/6] rte_sched: allow reading without clearing

2015-05-11 Thread Thomas Monjalon
2015-04-29 10:04, Stephen Hemminger:
> The rte_sched statistics API should allow reading statistics without
> clearing. Make auto-clear optional.
> 
> Signed-off-by: Stephen Hemminger 
> ---
>  app/test/test_sched.c|  4 ++--
>  examples/qos_sched/stats.c   | 16 +++-
>  lib/librte_sched/rte_sched.c | 44 
> ++--
>  lib/librte_sched/rte_sched.h | 18 ++
[...]

This API change needs more adjustments in the example app:

examples/qos_sched/stats.c: In function 'subport_stat':
examples/qos_sched/stats.c:263:9: error: too few arguments to function 
'rte_sched_subport_read_stats'
 rte_sched_subport_read_stats(port, subport_id, &stats, tc_ov);
 ^
examples/qos_sched/stats.c: In function 'pipe_stat':
examples/qos_sched/stats.c:309:25: error: too few arguments to function 
'rte_sched_queue_read_stats'
 rte_sched_queue_read_stats(port, queue_id + (i * 
RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS + j), &stats, &qlen);
 ^

[...]
> --- a/lib/librte_sched/rte_sched.h
> +++ b/lib/librte_sched/rte_sched.h
> @@ -308,14 +308,15 @@ rte_sched_port_get_memory_footprint(struct 
> rte_sched_port_params *params);
>   * @param tc_ov
>   *   Pointer to pre-allocated 4-entry array where the oversubscription 
> status for
>   *   each of the 4 subport traffic classes should be stored.
> + * @parm clear
> + *   Reset statistics after read
>   * @return
>   *   0 upon success, error code otherwise
>   */
>  int
> -rte_sched_subport_read_stats(struct rte_sched_port *port,
> - uint32_t subport_id,
> - struct rte_sched_subport_stats *stats,
> - uint32_t *tc_ov);
> +rte_sched_subport_read_stats(struct rte_sched_port *port, uint32_t 
> subport_id,
> +  struct rte_sched_subport_stats *stats,
> +  uint32_t *tc_ov, int clear);
>  
>  /**
>   * Hierarchical scheduler queue statistics read
> @@ -329,14 +330,15 @@ rte_sched_subport_read_stats(struct rte_sched_port 
> *port,
>   *   counters should be stored
>   * @param qlen
>   *   Pointer to pre-allocated variable where the current queue length should 
> be stored.
> + * @parm clear
> + *   Reset statistics after read
>   * @return
>   *   0 upon success, error code otherwise
>   */
>  int
> -rte_sched_queue_read_stats(struct rte_sched_port *port,
> - uint32_t queue_id,
> - struct rte_sched_queue_stats *stats,
> - uint16_t *qlen);
> +rte_sched_queue_read_stats(struct rte_sched_port *port, uint32_t queue_id,
> +struct rte_sched_queue_stats *stats,
> +uint16_t *qlen, int clear);
>  
>  /*
>   * Run-time
> 

What about ABI versioning? compatibility?



[dpdk-dev] [PATCH 1/4] pci: allow access to PCI config space

2015-05-11 Thread Neil Horman
On Thu, May 07, 2015 at 04:25:32PM -0700, Stephen Hemminger wrote:
> From: Stephen Hemminger 
> 
> Some drivers need ability to access PCI config (for example for power
> management). This adds an abstraction to do this; only implemented
> on Linux, but should be possible on BSD.
> 
> Signed-off-by: Stephen Hemminger 
> ---
>  lib/librte_eal/common/include/rte_pci.h | 28 +++
>  lib/librte_eal/linuxapp/eal/eal_pci.c   | 48 
> +
>  lib/librte_eal/linuxapp/eal/eal_pci_init.h  | 11 ++
>  lib/librte_eal/linuxapp/eal/eal_pci_uio.c   | 14 
>  lib/librte_eal/linuxapp/eal/eal_pci_vfio.c  | 16 +
>  lib/librte_eal/linuxapp/eal/rte_eal_version.map |  2 ++
>  6 files changed, 119 insertions(+)
> 
> diff --git a/lib/librte_eal/common/include/rte_pci.h 
> b/lib/librte_eal/common/include/rte_pci.h
> index 223d3cd..cea982a 100644
> --- a/lib/librte_eal/common/include/rte_pci.h
> +++ b/lib/librte_eal/common/include/rte_pci.h
> @@ -393,6 +393,34 @@ void rte_eal_pci_register(struct rte_pci_driver *driver);
>   */
>  void rte_eal_pci_unregister(struct rte_pci_driver *driver);
>  
> +/**
> + * Read PCI config space.
> + *
> + * @param device
> + *   A pointer to a rte_pci_device structure describing the device
> + *   to use
> + * @param buf
> + *   A data buffer where the bytes should be read into
> + * @param size
> + *   The length of the data buffer.
> + */
> +int rte_eal_pci_read_config(const struct rte_pci_device *device,
> + void *buf, size_t len, off_t offset);
> +
> +/**
> + * Write PCI config space.
> + *
> + * @param device
> + *   A pointer to a rte_pci_device structure describing the device
> + *   to use
> + * @param buf
> + *   A data buffer containing the bytes should be written
> + * @param size
> + *   The length of the data buffer.
> + */
> +int rte_eal_pci_write_config(const struct rte_pci_device *device,
> +  const void *buf, size_t len, off_t offset);
> +
I still think this needs a BSD implementation before we pull the whole thing in.
Only partially implementing infrastructure like this is bad practice, and will
lead to complicated build procedures (i.e. developers will require institutional
knowledge to know that bnx2x can't build on BSD because someone still needs to
implement these functions on BSD). Even having them just return -ENOTSUPP would
be preferable, so that you get a proper run-time error rather than a build
break, though I don't think that's even necessary, as PCI passthrough is
possible in projects like bhyve (implying user-space PCI access).

Neil
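
Something along those lines for the BSD side could be as small as (a sketch,
not a submitted patch):

    /* bsdapp stubs: keep the API present so drivers build, fail at run time */
    int rte_eal_pci_read_config(const struct rte_pci_device *device __rte_unused,
                                void *buf __rte_unused, size_t len __rte_unused,
                                off_t offset __rte_unused)
    {
        return -ENOTSUP;  /* PCI config space access not implemented on BSD yet */
    }

    int rte_eal_pci_write_config(const struct rte_pci_device *device __rte_unused,
                                 const void *buf __rte_unused, size_t len __rte_unused,
                                 off_t offset __rte_unused)
    {
        return -ENOTSUP;
    }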

>  #ifdef __cplusplus
>  }
>  #endif
> diff --git a/lib/librte_eal/linuxapp/eal/eal_pci.c 
> b/lib/librte_eal/linuxapp/eal/eal_pci.c
> index d2adc66..6d79a08 100644
> --- a/lib/librte_eal/linuxapp/eal/eal_pci.c
> +++ b/lib/librte_eal/linuxapp/eal/eal_pci.c
> @@ -756,6 +756,54 @@ rte_eal_pci_close_one_driver(struct rte_pci_driver *dr 
> __rte_unused,
>  }
>  #endif /* RTE_LIBRTE_EAL_HOTPLUG */
>  
> +/* Read PCI config space. */
> +int rte_eal_pci_read_config(const struct rte_pci_device *device,
> + void *buf, size_t len, off_t offset)
> +{
> + const struct rte_intr_handle *intr_handle = &device->intr_handle;
> +
> + switch (intr_handle->type) {
> + case RTE_INTR_HANDLE_UIO:
> + return pci_uio_read_config(intr_handle, buf, len, offset);
> +
> +#ifdef VFIO_PRESENT
> + case RTE_INTR_HANDLE_VFIO_MSIX:
> + case RTE_INTR_HANDLE_VFIO_MSI:
> + case RTE_INTR_HANDLE_VFIO_LEGACY:
> + return pci_vfio_read_config(intr_handle, buf, len, offset);
> +#endif
> + default:
> + RTE_LOG(ERR, EAL,
> + "Unknown handle type of fd %d\n",
> + intr_handle->fd);
> + return -1;
> + }
> +}
> +
> +/* Write PCI config space. */
> +int rte_eal_pci_write_config(const struct rte_pci_device *device,
> +  const void *buf, size_t len, off_t offset)
> +{
> + const struct rte_intr_handle *intr_handle = &device->intr_handle;
> +
> + switch (intr_handle->type) {
> + case RTE_INTR_HANDLE_UIO:
> + return pci_uio_write_config(intr_handle, buf, len, offset);
> +
> +#ifdef VFIO_PRESENT
> + case RTE_INTR_HANDLE_VFIO_MSIX:
> + case RTE_INTR_HANDLE_VFIO_MSI:
> + case RTE_INTR_HANDLE_VFIO_LEGACY:
> + return pci_vfio_write_config(intr_handle, buf, len, offset);
> +#endif
> + default:
> + RTE_LOG(ERR, EAL,
> + "Unknown handle type of fd %d\n",
> + intr_handle->fd);
> + return -1;
> + }
> +}
> +
>  /* Init the PCI EAL subsystem */
>  int
>  rte_eal_pci_init(void)
> diff --git a/lib/librte_eal/linuxapp/eal/eal_pci_init.h 
> b/lib/librte_eal/linuxapp/eal/eal_pci_init.h
> index aa7b755..c28e5b0 100644
> --- a/lib/librte_eal/linuxapp/eal/eal_pci_init.h
> +++ b

[dpdk-dev] [RFC PATCH 0/8] reduce header dependency on rte_mbuf.h

2015-05-11 Thread Thomas Monjalon
> > A large number of our header files and libraries are dependent on one 
> > another, 
> > which can lead to problems with circular dependencies if trying to tie some 
> > of
> > those libraries together, e.g. when prototyping with pktdev, or other 
> > schemes
> > to get a common API for ethdev/rings/KNI. :-)
> > 
> > One small way to reduce issues when doing this is to eliminate #includes 
> > when
> > they are not needed. While most includes in our headers are necessary, one 
> > common pattern seen is where a library just takes mbufs as part of its API,
> > but does not de-reference those in the header file. In cases like this, it's
> > not necessary to include the whole mbuf header file just to allow pointers 
> > to
> > mbuf structures - a forward declaration of "struct rte_mbuf" will do.
> > Including the mbuf header file, also triggers inclusion of the mempool 
> > headers
> > which causes the inclusion of the ring headers amongst others.
> > 
> > Therefore, I propose changing the header files for our libraries to just use
> > the forward declaration instead of the full header inclusion where possible.
> 
> Series
> Acked-by: Olivier Matz 

Applied, thanks


[dpdk-dev] [PATCH] eal/bsdapp: fix compilation on FreeBSD

2015-05-11 Thread Thomas Monjalon
> Fixes: 6065355a "pci: make device id tables const"
> 
> Following the above commit, compilation on FreeBSD with clang was broken,
> giving the error message:
> 
> .../lib/librte_eal/bsdapp/eal/eal_pci.c:438:16: fatal error: assigning to
>   'struct rte_pci_id *' from 'const struct rte_pci_id *' discards 
> qualifiers
>   [-Wincompatible-pointer-types-discards-qualifiers]
> for (id_table = dr->id_table ; id_table->vendor_id != 0; id_table++) {
>   ^ 
> 
> This patch fixes the issue by adding "const" to the type of id_table.
> 
> Signed-off-by: Bruce Richardson 

Applied, thanks


[dpdk-dev] [PATCH] lib: syntax cleanup

2015-05-11 Thread Ferruh Yigit
Remove extra parentheses from return statements.

Signed-off-by: Ferruh Yigit 
---
 lib/librte_cmdline/cmdline_parse_etheraddr.c   | 10 ++---
 lib/librte_cmdline/cmdline_parse_ipaddr.c  | 38 -
 lib/librte_cmdline/cmdline_parse_num.c | 24 +--
 lib/librte_cmdline/cmdline_parse_portlist.c| 14 +++
 lib/librte_cmdline/cmdline_socket.c|  2 +-
 lib/librte_eal/bsdapp/contigmem/contigmem.c| 22 +-
 lib/librte_eal/bsdapp/eal/eal.c|  4 +-
 lib/librte_eal/bsdapp/eal/eal_pci.c| 12 +++---
 lib/librte_eal/bsdapp/nic_uio/nic_uio.c| 10 ++---
 lib/librte_eal/common/eal_common_memzone.c |  2 +-
 lib/librte_eal/common/include/rte_common.h |  2 +-
 lib/librte_eal/common/include/rte_pci.h|  6 +--
 lib/librte_eal/linuxapp/eal/eal.c  |  4 +-
 lib/librte_eal/linuxapp/eal/eal_interrupts.c   |  4 +-
 lib/librte_eal/linuxapp/eal/eal_memory.c   |  4 +-
 .../linuxapp/kni/ethtool/igb/igb_procfs.c  |  2 +-
 lib/librte_eal/linuxapp/kni/kni_net.c  |  8 ++--
 lib/librte_ip_frag/ip_frag_internal.c  | 12 +++---
 lib/librte_ip_frag/rte_ip_frag_common.c|  6 +--
 lib/librte_ip_frag/rte_ipv4_reassembly.c   |  8 ++--
 lib/librte_ip_frag/rte_ipv6_fragmentation.c|  8 ++--
 lib/librte_lpm/rte_lpm.c   |  2 +-
 lib/librte_mbuf/rte_mbuf.h | 16 
 lib/librte_mempool/rte_dom0_mempool.c  |  2 +-
 lib/librte_mempool/rte_mempool.c   | 18 
 lib/librte_net/rte_ip.h|  2 +-
 lib/librte_pmd_e1000/em_ethdev.c   | 36 
 lib/librte_pmd_e1000/em_rxtx.c | 48 +++---
 lib/librte_pmd_e1000/igb_ethdev.c  | 26 ++--
 lib/librte_pmd_e1000/igb_rxtx.c| 30 +++---
 lib/librte_pmd_fm10k/fm10k_ethdev.c| 30 +++---
 lib/librte_pmd_i40e/i40e_rxtx.c| 14 +++
 lib/librte_pmd_ixgbe/ixgbe_82599_bypass.c  |  4 +-
 lib/librte_pmd_ixgbe/ixgbe_bypass.c|  2 +-
 lib/librte_pmd_ixgbe/ixgbe_ethdev.c| 42 +--
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c  | 36 
 lib/librte_pmd_virtio/virtio_ethdev.c  |  4 +-
 37 files changed, 257 insertions(+), 257 deletions(-)

diff --git a/lib/librte_cmdline/cmdline_parse_etheraddr.c 
b/lib/librte_cmdline/cmdline_parse_etheraddr.c
index 64ae86c..dbfe4a6 100644
--- a/lib/librte_cmdline/cmdline_parse_etheraddr.c
+++ b/lib/librte_cmdline/cmdline_parse_etheraddr.c
@@ -105,32 +105,32 @@ my_ether_aton(const char *a)
errno = 0;
o[i] = strtoul(a, &end, 16);
if (errno != 0 || end == a || (end[0] != ':' && end[0] != 0))
-   return (NULL);
+   return NULL;
a = end + 1;
} while (++i != sizeof (o) / sizeof (o[0]) && end[0] != 0);

/* Junk at the end of line */
if (end[0] != 0)
-   return (NULL);
+   return NULL;

/* Support the format XX:XX:XX:XX:XX:XX */
if (i == ETHER_ADDR_LEN) {
while (i-- != 0) {
if (o[i] > UINT8_MAX)
-   return (NULL);
+   return NULL;
ether_addr.ea_oct[i] = (uint8_t)o[i];
}
/* Support the format :: */
} else if (i == ETHER_ADDR_LEN / 2) {
while (i-- != 0) {
if (o[i] > UINT16_MAX)
-   return (NULL);
+   return NULL;
ether_addr.ea_oct[i * 2] = (uint8_t)(o[i] >> 8);
ether_addr.ea_oct[i * 2 + 1] = (uint8_t)(o[i] & 0xff);
}
/* unknown format */
} else
-   return (NULL);
+   return NULL;

return (struct ether_addr *)&ether_addr;
 }
diff --git a/lib/librte_cmdline/cmdline_parse_ipaddr.c 
b/lib/librte_cmdline/cmdline_parse_ipaddr.c
index 7f33599..d3d3e04 100644
--- a/lib/librte_cmdline/cmdline_parse_ipaddr.c
+++ b/lib/librte_cmdline/cmdline_parse_ipaddr.c
@@ -135,12 +135,12 @@ my_inet_pton(int af, const char *src, void *dst)
 {
switch (af) {
case AF_INET:
-   return (inet_pton4(src, dst));
+   return inet_pton4(src, dst);
case AF_INET6:
-   return (inet_pton6(src, dst));
+   return inet_pton6(src, dst);
default:
errno = EAFNOSUPPORT;
-   return (-1);
+   return -1;
}
/* NOTREACHED */
 }
@@ -172,26 +172,2

[dpdk-dev] [PATCH v7 08/10] ixgbe: enable rx queue interrupts for both PF and VF

2015-05-11 Thread Stephen Hemminger
On Mon, 11 May 2015 13:31:04 +0800
"Liang, Cunming"  wrote:

> > Since MSI-X vectors are limited on many hardware platforms, this whole API
> > should be changed so that max_intr is based on number of rx_queues actually
> > used by the application.  That means the setup needs to move from init to 
> > configure.  
> [LCM] When MSI-X is not used, intr_vec and max_intr are unused, so this
> doesn't matter for non-MSI-X mode.
> Since the sequence "dev_stop->dev_reconfig->dev_start" is allowed, the
> number of queues actually in use may change.
> So allocating only in dev_init and releasing only in dev_close keeps it
> simple. During configure_msix, the real number of queues in use is applied
> to the queue/vector mapping; see xxx_configure_msix().

The problem is that if a customer has 16 NICs with 32 MSI vectors per NIC,
it may be that the MSI table in the south bridge gets full. That is why the ixgbe
driver for Linux limits itself to num_online_cpu() + 1 MSI interrupts.



[dpdk-dev] [PATCHv2] app/ and examples/ fix default mbuf size

2015-05-11 Thread Thomas Monjalon
> > v2 changes:
> > - add a new macro into rte_mbuf.h
> > - make samples to use that new macro
> >
> >
> > Fixes: dfb03bbe2b ("app/testpmd: use standard functions to initialize
> > mbufs and mbuf pool").
> > Latest mbuf changes (priv_size addition and related fixes)
> > exposed small problem with testpmd and few other sample apps:
> > when mbuf size is exactly 2KB or less, that causes
> > ixgbe PMD to select scattered RX even for configs with 'normal'
> > max packet length (max_rx_pkt_len == ETHER_MAX_LEN).
> > To overcome that problem and unify the code, new macro was created
> > to represent recommended minimal buffer length for mbuf.
> > When appropriate, samples are updated to use that macro.
> >
> > Signed-off-by: Konstantin Ananyev 
> 
> Acked-by: Olivier Matz 

Applied, thanks


[dpdk-dev] [PATCH] vfio: eventfd should be non-block and not inherited

2015-05-11 Thread Thomas Monjalon
> > Set internal event file descriptor to be non-block and not inherited across
> > exec.  This prevents accidental hangs and passing the descriptor to another thread.
> > 
> > Signed-off-by: Stephen Hemminger 
> 
> Acked-by: Anatoly  Burakov 

Applied, thanks


[dpdk-dev] [PATCH 1/4] pci: allow access to PCI config space

2015-05-11 Thread Stephen Hemminger
Ok will stub out the pci_config stuff for BSD.
But I don't have time or resources to do real BSD support.

Also, the whole bnx2x driver loads firmware and that probably has dependencies
that are different on BSD


[dpdk-dev] DPDK Community Call - Beyond DPDK 2.0

2015-05-11 Thread O'Driscoll, Tim
This is just a reminder that this call is tomorrow, at the following times,
just under 24 hours from now.

Dublin (Ireland) - Tuesday, May 12, 2015 at 4:00:00 PM IST UTC+1 hour
San Francisco (U.S.A. - California) - Tuesday, May 12, 2015 at 8:00:00 AM PDT 
UTC-7 hours
Phoenix (U.S.A. - Arizona) - Tuesday, May 12, 2015 at 8:00:00 AM MST UTC-7 hours
Boston (U.S.A. - Massachusetts) - Tuesday, May 12, 2015 at 11:00:00 AM EDT 
UTC-4 hours
New York (U.S.A. - New York) - Tuesday, May 12, 2015 at 11:00:00 AM EDT UTC-4 
hours
Ottawa (Canada - Ontario) - Tuesday, May 12, 2015 at 11:00:00 AM EDT UTC-4 hours
London (United Kingdom - England) - Tuesday, May 12, 2015 at 4:00:00 PM BST 
UTC+1 hour
Paris (France) - Tuesday, May 12, 2015 at 5:00:00 PM CEST UTC+2 hours
Tel Aviv (Israel) - Tuesday, May 12, 2015 at 6:00:00 PM IDT UTC+3 hours
Moscow (Russia) - Tuesday, May 12, 2015 at 6:00:00 PM MSK UTC+3 hours
New Delhi (India - Delhi) - Tuesday, May 12, 2015 at 8:30:00 PM IST UTC+5:30 
hours
Shanghai (China - Shanghai Municipality) - Tuesday, May 12, 2015 at 11:00:00 PM 
CST
UTC+8 hours Corresponding UTC (GMT) Tuesday, May 12, 2015 at 15:00:00


It would be good to have as much representation as possible from DPDK 
contributors and users, so that the discussion reflects the views and needs of 
the community.


Tim

> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of O'Driscoll, Tim
> 
> > From: Dave Neary [mailto:dneary at redhat.com]
> >
> > When were you thinking of having the call?
> 
> I put the day and time at the end of the email, but it probably should
> have been at the start! Apologies that this wasn't clear.
> 
> Dublin (Ireland) - Tuesday, May 12, 2015 at 4:00:00 PM IST UTC+1 hour
> San Francisco (U.S.A. - California) - Tuesday, May 12, 2015 at 8:00:00
> AM PDT UTC-7 hours
> Phoenix (U.S.A. - Arizona) - Tuesday, May 12, 2015 at 8:00:00 AM MST
> UTC-7 hours
> Boston (U.S.A. - Massachusetts) - Tuesday, May 12, 2015 at 11:00:00 AM
> EDT UTC-4 hours
> New York (U.S.A. - New York) - Tuesday, May 12, 2015 at 11:00:00 AM EDT
> UTC-4 hours
> Ottawa (Canada - Ontario) - Tuesday, May 12, 2015 at 11:00:00 AM EDT
> UTC-4 hours
> London (United Kingdom - England) - Tuesday, May 12, 2015 at 4:00:00 PM
> BST UTC+1 hour
> Paris (France) - Tuesday, May 12, 2015 at 5:00:00 PM CEST UTC+2 hours
> Tel Aviv (Israel) - Tuesday, May 12, 2015 at 6:00:00 PM IDT UTC+3 hours
> Moscow (Russia) - Tuesday, May 12, 2015 at 6:00:00 PM MSK UTC+3 hours
> New Delhi (India - Delhi) - Tuesday, May 12, 2015 at 8:30:00 PM IST
> UTC+5:30 hours
> Shanghai (China - Shanghai Municipality) - Tuesday, May 12, 2015 at
> 11:00:00 PM CST
> UTC+8 hours Corresponding UTC (GMT) Tuesday, May 12, 2015 at 15:00:00
> 
> > It's not been explicit, but can I assume that this call will also be
> > promoted among potential supporters of the project who may not be on
> > this list? I would be interested to get the perspective from the
> people
> > who are perhaps not developers who decide whether their organization
> > engages strategically with a project or not.
> 
> I think this is a great idea. Anybody should feel free to pass this on
> to other interested parties. If we have more contributors to the
> discussion then the output should be more representative of current and
> future community needs.
> 
> 
> Tim



[dpdk-dev] [PATCH 1/4] pci: allow access to PCI config space

2015-05-11 Thread Neil Horman
On Mon, May 11, 2015 at 08:23:59AM -0700, Stephen Hemminger wrote:
> Ok will stub out the pci_config stuff for BSD.
> But I don't have time or resources to do real BSD support.
> 
> Also, the whole bnx2x driver loads firmware and that probably has dependencies
> that are different on BSD
> 
That's a fair point.  Is the implication here that, even with the stub functions
for pci config space, bnx2x won't build on BSD?

Neil



[dpdk-dev] [RFC PATCHv2 0/2] pktdev as wrapper type

2015-05-11 Thread Bruce Richardson
Hi all,

after a small amount of offline discussion with Marc Sune, here is an
alternative proposal for a higher-level interface - aka pktdev - to allow a
common Rx/Tx API across device types handling mbufs [for now, ethdev, ring
and KNI]. The key code is in the first patch fo the set - the second is an
example of a trivial usecase.

What is different about this compared to previous proposals:
* wrapper class, so no changes to any existing ring or ethdev implementations
* use of function pointers for RX/TX with an API that maps to ethdev
  - this means there is little/no additional overhead for ethdev calls
  - inline special case for rings, to accelerate that. Since we are at a
higher level, we can special-case some processing if appropriate. This
means the impact to ring ops is one (predictable) branch per burst
* elimination of the queue abstraction. For the ring and KNI, there is no
  concept of queues, so we just wrap the functions directly (no need even for
  wrapper functions, the APIs match so we can call directly). This also
  means:
  - adding in per-queue features is far easier as we don't need to worry about
having arrays of multiple queues. For example:
  - adding in buffering on TX (or RX) is easier since again we only have a
single queue.
* thread safety is made easier using a wrapper. For an MP ring, we can create
  multiple pktdevs around it, and each thread will then be able to use its
  own copy, with its own buffering etc.

However, at this point, I'm just looking for general feedback on this as an
approach. I think it's quite flexible - even more so than the earlier proposal
we had. It's less prescriptive and doesn't make any demands on any other libs.

Comments/thoughts welcome.
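
To make the shape of the wrapper concrete, here is a rough sketch of the idea
(my illustration only -- the names and layout below are not taken from the
rte_pktdev.h in the patch):

#include <stdint.h>

struct rte_mbuf;

/* Burst function pointers with an ethdev-like signature. */
typedef uint16_t (*pkt_rx_burst_t)(void *dev, struct rte_mbuf **pkts, uint16_t n);
typedef uint16_t (*pkt_tx_burst_t)(void *dev, struct rte_mbuf **pkts, uint16_t n);

/* A pktdev just wraps an existing object (ethdev queue, ring, KNI context). */
struct pktdev {
        pkt_rx_burst_t rx_burst;
        pkt_tx_burst_t tx_burst;
        void *dev;              /* the wrapped object */
};

static inline uint16_t
pkt_rx_burst(struct pktdev *p, struct rte_mbuf **pkts, uint16_t n)
{
        /* a ring could be special-cased inline here; the default path is a
         * single indirect call, similar in cost to an ethdev rx burst */
        return p->rx_burst(p->dev, pkts, n);
}

static inline uint16_t
pkt_tx_burst(struct pktdev *p, struct rte_mbuf **pkts, uint16_t n)
{
        return p->tx_burst(p->dev, pkts, n);
}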

Bruce Richardson (2):
  Add example pktdev implementation
  example app showing pktdevs used in a chain

 config/common_bsdapp   |   5 +
 config/common_linuxapp |   5 +
 examples/pktdev/Makefile   |  57 +++
 examples/pktdev/basicfwd.c | 221 +
 lib/Makefile   |   1 +
 lib/librte_pktdev/Makefile |  53 ++
 lib/librte_pktdev/rte_pktdev.h | 200 +
 7 files changed, 542 insertions(+)
 create mode 100644 examples/pktdev/Makefile
 create mode 100644 examples/pktdev/basicfwd.c
 create mode 100644 lib/librte_pktdev/Makefile
 create mode 100644 lib/librte_pktdev/rte_pktdev.h

-- 
2.1.0



[dpdk-dev] [RFC PATCHv2 2/2] example app showing pktdevs used in a chain

2015-05-11 Thread Bruce Richardson
This is a trivial example showing code that uses ethdevs and rings
in a neutral manner, with the same piece of pipeline code passing mbufs
along a chain without ever having to query the source or destination
type.

Signed-off-by: Bruce Richardson 
---
 examples/pktdev/Makefile   |  57 
 examples/pktdev/basicfwd.c | 221 +
 2 files changed, 278 insertions(+)
 create mode 100644 examples/pktdev/Makefile
 create mode 100644 examples/pktdev/basicfwd.c

diff --git a/examples/pktdev/Makefile b/examples/pktdev/Makefile
new file mode 100644
index 000..4a5d99f
--- /dev/null
+++ b/examples/pktdev/Makefile
@@ -0,0 +1,57 @@
+#   BSD LICENSE
+#
+#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+#   All rights reserved.
+#
+#   Redistribution and use in source and binary forms, with or without
+#   modification, are permitted provided that the following conditions
+#   are met:
+#
+# * Redistributions of source code must retain the above copyright
+#   notice, this list of conditions and the following disclaimer.
+# * Redistributions in binary form must reproduce the above copyright
+#   notice, this list of conditions and the following disclaimer in
+#   the documentation and/or other materials provided with the
+#   distribution.
+# * Neither the name of Intel Corporation nor the names of its
+#   contributors may be used to endorse or promote products derived
+#   from this software without specific prior written permission.
+#
+#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+ifeq ($(RTE_SDK),)
+$(error "Please define RTE_SDK environment variable")
+endif
+
+# Default target, can be overridden by command line or environment
+RTE_TARGET ?= x86_64-native-linuxapp-gcc
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+# binary name
+APP = basicfwd
+
+# all source are stored in SRCS-y
+SRCS-y := basicfwd.c
+
+CFLAGS += $(WERROR_FLAGS)
+
+# workaround for a gcc bug with noreturn attribute
+# http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12603
+ifeq ($(CONFIG_RTE_TOOLCHAIN_GCC),y)
+CFLAGS_main.o += -Wno-return-type
+endif
+
+EXTRA_CFLAGS += -O3 -g -Wfatal-errors
+
+include $(RTE_SDK)/mk/rte.extapp.mk
diff --git a/examples/pktdev/basicfwd.c b/examples/pktdev/basicfwd.c
new file mode 100644
index 000..91c0c3b
--- /dev/null
+++ b/examples/pktdev/basicfwd.c
@@ -0,0 +1,221 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ *   notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ *   notice, this list of conditions and the following disclaimer in
+ *   the documentation and/or other materials provided with the
+ *   distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ *   contributors may be used to endorse or promote products derived
+ *   from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define RX_RI

[dpdk-dev] [RFC PATCHv2 1/2] Add example pktdev implementation

2015-05-11 Thread Bruce Richardson
This commit demonstrates what a minimal API for all packet handling
might look like. It provides common APIs for RX and TX by wrapping
the types as appropriate. Implementations are provided for ring, ethdev
and KNI.

Signed-off-by: Bruce Richardson 
---
 config/common_bsdapp   |   5 ++
 config/common_linuxapp |   5 ++
 lib/Makefile   |   1 +
 lib/librte_pktdev/Makefile |  53 +++
 lib/librte_pktdev/rte_pktdev.h | 200 +
 5 files changed, 264 insertions(+)
 create mode 100644 lib/librte_pktdev/Makefile
 create mode 100644 lib/librte_pktdev/rte_pktdev.h

diff --git a/config/common_bsdapp b/config/common_bsdapp
index c2374c0..64fcdc8 100644
--- a/config/common_bsdapp
+++ b/config/common_bsdapp
@@ -132,6 +132,11 @@ CONFIG_RTE_LIBRTE_EAL_VMWARE_TSC_MAP_SUPPORT=y
 CONFIG_RTE_LIBRTE_KVARGS=y

 #
+# Compile generic packet handling device library
+#
+CONFIG_RTE_LIBRTE_PKTDEV=y
+
+#
 # Compile generic ethernet library
 #
 CONFIG_RTE_LIBRTE_ETHER=y
diff --git a/config/common_linuxapp b/config/common_linuxapp
index 0078dc9..399f15d 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -129,6 +129,11 @@ CONFIG_RTE_LIBRTE_EAL_VMWARE_TSC_MAP_SUPPORT=y
 CONFIG_RTE_LIBRTE_KVARGS=y

 #
+# Compile generic packet handling device library
+#
+CONFIG_RTE_LIBRTE_PKTDEV=y
+
+#
 # Compile generic ethernet library
 #
 CONFIG_RTE_LIBRTE_ETHER=y
diff --git a/lib/Makefile b/lib/Makefile
index d94355d..4db5ee0 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -32,6 +32,7 @@
 include $(RTE_SDK)/mk/rte.vars.mk

 DIRS-y += librte_compat
+DIRS-$(CONFIG_RTE_LIBRTE_PKTDEV) += librte_pktdev
 DIRS-$(CONFIG_RTE_LIBRTE_EAL) += librte_eal
 DIRS-$(CONFIG_RTE_LIBRTE_MALLOC) += librte_malloc
 DIRS-$(CONFIG_RTE_LIBRTE_RING) += librte_ring
diff --git a/lib/librte_pktdev/Makefile b/lib/librte_pktdev/Makefile
new file mode 100644
index 000..858d3e3
--- /dev/null
+++ b/lib/librte_pktdev/Makefile
@@ -0,0 +1,53 @@
+#   BSD LICENSE
+#
+#   Copyright(c) 2015 Intel Corporation. All rights reserved.
+#   All rights reserved.
+#
+#   Redistribution and use in source and binary forms, with or without
+#   modification, are permitted provided that the following conditions
+#   are met:
+#
+# * Redistributions of source code must retain the above copyright
+#   notice, this list of conditions and the following disclaimer.
+# * Redistributions in binary form must reproduce the above copyright
+#   notice, this list of conditions and the following disclaimer in
+#   the documentation and/or other materials provided with the
+#   distribution.
+# * Neither the name of Intel Corporation nor the names of its
+#   contributors may be used to endorse or promote products derived
+#   from this software without specific prior written permission.
+#
+#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = libpktdev.a
+
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+
+EXPORT_MAP := rte_pktdev_version.map
+
+LIBABIVER := 1
+
+#
+# Export include files
+#
+SYMLINK-y-include += rte_pktdev.h
+
+DEPDIRS-y += lib/librte_ring lib/librte_kni lib/librte_ether
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_pktdev/rte_pktdev.h b/lib/librte_pktdev/rte_pktdev.h
new file mode 100644
index 000..eba7989
--- /dev/null
+++ b/lib/librte_pktdev/rte_pktdev.h
@@ -0,0 +1,200 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2015 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ *   notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ *   notice, this list of conditions and the following disclaimer in
+ *   the documentation and/or other materials provided with the
+ *   distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ *   contributors may be used to endorse

[dpdk-dev] [PATCH v3 0/6] rte_sched: cleanups and API changes

2015-05-11 Thread Stephen Hemminger
This is the 3rd revision of the DPDK QoS scheduler changes.
The change from the last revision is to fix the example and whitespace.

Stephen Hemminger (6):
  rte_sched: make RED optional at runtime
  rte_sched: expand scheduler hierarchy for more VLAN's
  rte_sched: keep track of RED drops
  rte_sched: allow reading without clearing
  rte_sched: don't put tabs in log messages
  rte_sched: use correct log level

 app/test/test_sched.c|   4 +-
 examples/qos_sched/stats.c   |  22 ++---
 lib/librte_mbuf/rte_mbuf.h   |   5 +-
 lib/librte_sched/rte_sched.c | 113 ---
 lib/librte_sched/rte_sched.h |  62 +++-
 5 files changed, 134 insertions(+), 72 deletions(-)

-- 
2.1.4



[dpdk-dev] [PATCH 1/6] rte_sched: make RED optional at runtime

2015-05-11 Thread Stephen Hemminger
From: Stephen Hemminger 

We want to be able to build with RTE_SCHED_RED enabled but
allow disabling RED on a per-queue basis at runtime.

RED is disabled unless the min/max thresholds are set.
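
Usage-wise (a sketch of mine, reusing the red_params field names visible in
the diff below; the maxp_inv/wq_log2 values are arbitrary examples):

struct rte_sched_port_params params = { /* ... */ };

#ifdef RTE_SCHED_RED
/* Thresholds left at zero: RED stays disabled for this TC/color. */
params.red_params[0][e_RTE_METER_GREEN].min_th = 0;
params.red_params[0][e_RTE_METER_GREEN].max_th = 0;

/* Non-zero thresholds: RED is configured as before. */
params.red_params[1][e_RTE_METER_GREEN].min_th = 32;
params.red_params[1][e_RTE_METER_GREEN].max_th = 64;
params.red_params[1][e_RTE_METER_GREEN].maxp_inv = 10;
params.red_params[1][e_RTE_METER_GREEN].wq_log2 = 9;
#endif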

Signed-off-by: Stephen Hemminger 
---
 lib/librte_sched/rte_sched.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/lib/librte_sched/rte_sched.c b/lib/librte_sched/rte_sched.c
index 95dee27..3b5acd1 100644
--- a/lib/librte_sched/rte_sched.c
+++ b/lib/librte_sched/rte_sched.c
@@ -636,6 +636,12 @@ rte_sched_port_config(struct rte_sched_port_params *params)
uint32_t j;

for (j = 0; j < e_RTE_METER_COLORS; j++) {
+   /* if min/max are both zero, then RED is disabled */
+   if ((params->red_params[i][j].min_th |
+params->red_params[i][j].max_th) == 0) {
+   continue;
+   }
+
if (rte_red_config_init(&port->red_config[i][j],
params->red_params[i][j].wq_log2,
params->red_params[i][j].min_th,
@@ -1069,6 +1075,9 @@ rte_sched_port_red_drop(struct rte_sched_port *port, 
struct rte_mbuf *pkt, uint3
color = rte_sched_port_pkt_read_color(pkt);
red_cfg = &port->red_config[tc_index][color];

+   if ((red_cfg->min_th | red_cfg->max_th) == 0)
+   return 0;
+
qe = port->queue_extra + qindex;
red = &qe->red;

-- 
2.1.4



[dpdk-dev] [PATCH 2/6] rte_sched: expand scheduler hierarchy for more VLAN's

2015-05-11 Thread Stephen Hemminger
From: Stephen Hemminger 

The QoS subport is limited to 8 bits in the original code.
But customers demanded the ability to support the full number of VLANs (4096),
therefore use the full tag field of the mbuf.

Resize the pipe field as well to allow for more pipes in the future and
avoid expensive bitfield access.

Signed-off-by: Stephen Hemminger 
---
 lib/librte_mbuf/rte_mbuf.h   |  5 -
 lib/librte_sched/rte_sched.h | 38 --
 2 files changed, 28 insertions(+), 15 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index ab6de67..cc0658d 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -295,7 +295,10 @@ struct rte_mbuf {
/**< First 4 flexible bytes or FD ID, dependent on
 PKT_RX_FDIR_* flag in ol_flags. */
} fdir;   /**< Filter identifier if FDIR enabled */
-   uint32_t sched;   /**< Hierarchical scheduler */
+   struct {
+   uint32_t lo;
+   uint32_t hi;
+   } sched;  /**< Hierarchical scheduler */
uint32_t usr; /**< User defined tags. See 
rte_distributor_process() */
} hash;   /**< hash information */

diff --git a/lib/librte_sched/rte_sched.h b/lib/librte_sched/rte_sched.h
index e6bba22..bf5ef8d 100644
--- a/lib/librte_sched/rte_sched.h
+++ b/lib/librte_sched/rte_sched.h
@@ -195,16 +195,20 @@ struct rte_sched_port_params {
 #endif
 };

-/** Path through the scheduler hierarchy used by the scheduler enqueue 
operation to
-identify the destination queue for the current packet. Stored in the field 
hash.sched
-of struct rte_mbuf of each packet, typically written by the classification 
stage and read by
-scheduler enqueue.*/
+/*
+ * Path through the scheduler hierarchy used by the scheduler enqueue
+ * operation to identify the destination queue for the current
+ * packet. Stored in the field pkt.hash.sched of struct rte_mbuf of
+ * each packet, typically written by the classification stage and read
+ * by scheduler enqueue.
+ */
 struct rte_sched_port_hierarchy {
-   uint32_t queue:2;/**< Queue ID (0 .. 3) */
-   uint32_t traffic_class:2;/**< Traffic class ID (0 .. 3)*/
-   uint32_t pipe:20;/**< Pipe ID */
-   uint32_t subport:6;  /**< Subport ID */
-   uint32_t color:2;/**< Color */
+   uint16_t queue:2;/**< Queue ID (0 .. 3) */
+   uint16_t traffic_class:2;/**< Traffic class ID (0 .. 3)*/
+   uint16_t color:2;/**< Color */
+   uint16_t unused:10;
+   uint16_t subport;/**< Subport ID */
+   uint32_t pipe;   /**< Pipe ID */
 };

 /*
@@ -350,12 +354,15 @@ rte_sched_queue_read_stats(struct rte_sched_port *port,
  */
 static inline void
 rte_sched_port_pkt_write(struct rte_mbuf *pkt,
-   uint32_t subport, uint32_t pipe, uint32_t traffic_class, uint32_t 
queue, enum rte_meter_color color)
+uint32_t subport, uint32_t pipe,
+uint32_t traffic_class,
+uint32_t queue, enum rte_meter_color color)
 {
-   struct rte_sched_port_hierarchy *sched = (struct 
rte_sched_port_hierarchy *) &pkt->hash.sched;
+   struct rte_sched_port_hierarchy *sched
+   = (struct rte_sched_port_hierarchy *) &pkt->hash.sched;

-   sched->color = (uint32_t) color;
sched->subport = subport;
+   sched->color = (uint32_t) color;
sched->pipe = pipe;
sched->traffic_class = traffic_class;
sched->queue = queue;
@@ -379,9 +386,12 @@ rte_sched_port_pkt_write(struct rte_mbuf *pkt,
  *
  */
 static inline void
-rte_sched_port_pkt_read_tree_path(struct rte_mbuf *pkt, uint32_t *subport, 
uint32_t *pipe, uint32_t *traffic_class, uint32_t *queue)
+rte_sched_port_pkt_read_tree_path(struct rte_mbuf *pkt, uint32_t *subport,
+ uint32_t *pipe, uint32_t *traffic_class,
+ uint32_t *queue)
 {
-   struct rte_sched_port_hierarchy *sched = (struct 
rte_sched_port_hierarchy *) &pkt->hash.sched;
+   struct rte_sched_port_hierarchy *sched
+   = (struct rte_sched_port_hierarchy *) &pkt->hash.sched;

*subport = sched->subport;
*pipe = sched->pipe;
-- 
2.1.4



[dpdk-dev] [PATCH 3/6] rte_sched: keep track of RED drops

2015-05-11 Thread Stephen Hemminger
From: Stephen Hemminger 

Add a new statistic to keep track of drops due to RED.

Signed-off-by: Stephen Hemminger 
---
 lib/librte_sched/rte_sched.c | 31 ++-
 lib/librte_sched/rte_sched.h |  6 ++
 2 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/lib/librte_sched/rte_sched.c b/lib/librte_sched/rte_sched.c
index 3b5acd1..c044c09 100644
--- a/lib/librte_sched/rte_sched.c
+++ b/lib/librte_sched/rte_sched.c
@@ -1028,7 +1028,9 @@ rte_sched_port_update_subport_stats(struct rte_sched_port 
*port, uint32_t qindex
 }

 static inline void
-rte_sched_port_update_subport_stats_on_drop(struct rte_sched_port *port, 
uint32_t qindex, struct rte_mbuf *pkt)
+rte_sched_port_update_subport_stats_on_drop(struct rte_sched_port *port,
+   uint32_t qindex,
+   struct rte_mbuf *pkt, uint32_t red)
 {
struct rte_sched_subport *s = port->subport + (qindex / 
rte_sched_port_queues_per_subport(port));
uint32_t tc_index = (qindex >> 2) & 0x3;
@@ -1036,6 +1038,9 @@ rte_sched_port_update_subport_stats_on_drop(struct 
rte_sched_port *port, uint32_

s->stats.n_pkts_tc_dropped[tc_index] += 1;
s->stats.n_bytes_tc_dropped[tc_index] += pkt_len;
+#ifdef RTE_SCHED_RED
+   s->stats.n_pkts_red_dropped[tc_index] += red;
+#endif
 }

 static inline void
@@ -1049,13 +1054,18 @@ rte_sched_port_update_queue_stats(struct rte_sched_port 
*port, uint32_t qindex,
 }

 static inline void
-rte_sched_port_update_queue_stats_on_drop(struct rte_sched_port *port, 
uint32_t qindex, struct rte_mbuf *pkt)
+rte_sched_port_update_queue_stats_on_drop(struct rte_sched_port *port,
+ uint32_t qindex,
+ struct rte_mbuf *pkt, uint32_t red)
 {
struct rte_sched_queue_extra *qe = port->queue_extra + qindex;
uint32_t pkt_len = pkt->pkt_len;

qe->stats.n_pkts_dropped += 1;
qe->stats.n_bytes_dropped += pkt_len;
+#ifdef RTE_SCHED_RED
+   qe->stats.n_pkts_red_dropped += red;
+#endif
 }

 #endif /* RTE_SCHED_COLLECT_STATS */
@@ -1206,12 +1216,23 @@ rte_sched_port_enqueue_qwa(struct rte_sched_port *port, 
uint32_t qindex, struct
qlen = q->qw - q->qr;

/* Drop the packet (and update drop stats) when queue is full */
-   if (unlikely(rte_sched_port_red_drop(port, pkt, qindex, qlen) || (qlen 
>= qsize))) {
+   if (unlikely(rte_sched_port_red_drop(port, pkt, qindex, qlen))) {
+#ifdef RTE_SCHED_COLLECT_STATS
+   rte_sched_port_update_subport_stats_on_drop(port, qindex,
+   pkt, 1);
+   rte_sched_port_update_queue_stats_on_drop(port, qindex, pkt, 1);
+#endif
rte_pktmbuf_free(pkt);
+   return 0;
+   }
+
+   if (qlen >= qsize) {
 #ifdef RTE_SCHED_COLLECT_STATS
-   rte_sched_port_update_subport_stats_on_drop(port, qindex, pkt);
-   rte_sched_port_update_queue_stats_on_drop(port, qindex, pkt);
+   rte_sched_port_update_subport_stats_on_drop(port, qindex,
+   pkt, 0);
+   rte_sched_port_update_queue_stats_on_drop(port, qindex, pkt, 0);
 #endif
+   rte_pktmbuf_free(pkt);
return 0;
}

diff --git a/lib/librte_sched/rte_sched.h b/lib/librte_sched/rte_sched.h
index bf5ef8d..3fd1fe1 100644
--- a/lib/librte_sched/rte_sched.h
+++ b/lib/librte_sched/rte_sched.h
@@ -140,6 +140,9 @@ struct rte_sched_subport_stats {
  subport for each traffic class*/
uint32_t n_bytes_tc_dropped[RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE]; /**< 
Number of bytes dropped by the current
   subport for each traffic class due 
to subport queues being full or congested */
+#ifdef RTE_SCHED_RED
+   uint32_t n_pkts_red_dropped[RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE]; /**< 
Number of packets dropped by red */
+#endif
 };

 /** Pipe configuration parameters. The period and credits_per_period 
parameters are measured
@@ -168,6 +171,9 @@ struct rte_sched_queue_stats {
/* Packets */
uint32_t n_pkts; /**< Number of packets successfully 
written to current queue */
uint32_t n_pkts_dropped; /**< Number of packets dropped due to 
current queue being full or congested */
+#ifdef RTE_SCHED_RED
+   uint32_t n_pkts_red_dropped;
+#endif

/* Bytes */
uint32_t n_bytes;/**< Number of bytes successfully 
written to current queue */
-- 
2.1.4



[dpdk-dev] [PATCH 4/6] rte_sched: allow reading without clearing

2015-05-11 Thread Stephen Hemminger
The rte_sched statistics API should allow reading statistics without
clearing. Make auto-clear optional.
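
For example (a sketch against the prototype in this patch), a monitoring loop
can now poll queue stats without resetting them:

struct rte_sched_queue_stats stats;
uint16_t qlen;

/* Peek at the counters without clearing them. */
rte_sched_queue_read_stats(port, queue_id, &stats, &qlen, 0);

/* Read and clear, as the old API always did. */
rte_sched_queue_read_stats(port, queue_id, &stats, &qlen, 1);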

Signed-off-by: Stephen Hemminger 
---
 app/test/test_sched.c|  4 ++--
 examples/qos_sched/stats.c   | 22 +++---
 lib/librte_sched/rte_sched.c | 44 ++--
 lib/librte_sched/rte_sched.h | 18 ++
 4 files changed, 49 insertions(+), 39 deletions(-)

diff --git a/app/test/test_sched.c b/app/test/test_sched.c
index c7239f8..1526ad7 100644
--- a/app/test/test_sched.c
+++ b/app/test/test_sched.c
@@ -198,13 +198,13 @@ test_sched(void)

struct rte_sched_subport_stats subport_stats;
uint32_t tc_ov;
-   rte_sched_subport_read_stats(port, SUBPORT, &subport_stats, &tc_ov);
+   rte_sched_subport_read_stats(port, SUBPORT, &subport_stats, &tc_ov, 1);
 #if 0
TEST_ASSERT_EQUAL(subport_stats.n_pkts_tc[TC-1], 10, "Wrong subport 
stats\n");
 #endif
struct rte_sched_queue_stats queue_stats;
uint16_t qlen;
-   rte_sched_queue_read_stats(port, QUEUE, &queue_stats, &qlen);
+   rte_sched_queue_read_stats(port, QUEUE, &queue_stats, &qlen, 1);
 #if 0
TEST_ASSERT_EQUAL(queue_stats.n_pkts, 10, "Wrong queue stats\n");
 #endif
diff --git a/examples/qos_sched/stats.c b/examples/qos_sched/stats.c
index b4db7b5..a6d05ab 100644
--- a/examples/qos_sched/stats.c
+++ b/examples/qos_sched/stats.c
@@ -61,7 +61,7 @@ qavg_q(uint8_t port_id, uint32_t subport_id, uint32_t 
pipe_id, uint8_t tc, uint8
 average = 0;

 for (count = 0; count < qavg_ntimes; count++) {
-rte_sched_queue_read_stats(port, queue_id, &stats, &qlen);
+rte_sched_queue_read_stats(port, queue_id, &stats, &qlen, 1);
 average += qlen;
 usleep(qavg_period);
 }
@@ -99,7 +99,9 @@ qavg_tcpipe(uint8_t port_id, uint32_t subport_id, uint32_t 
pipe_id, uint8_t tc)
 for (count = 0; count < qavg_ntimes; count++) {
 part_average = 0;
 for (i = 0; i < RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS; i++) {
-rte_sched_queue_read_stats(port, queue_id + (tc * 
RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS + i), &stats, &qlen);
+rte_sched_queue_read_stats(port,
+  queue_id + (tc * 
RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS + i),
+  &stats, &qlen, 1);
 part_average += qlen;
 }
 average += part_average / RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS;
@@ -138,7 +140,8 @@ qavg_pipe(uint8_t port_id, uint32_t subport_id, uint32_t 
pipe_id)
 for (count = 0; count < qavg_ntimes; count++) {
 part_average = 0;
 for (i = 0; i < RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE * 
RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS; i++) {
-rte_sched_queue_read_stats(port, queue_id + i, &stats, 
&qlen);
+rte_sched_queue_read_stats(port, queue_id + i,
+  &stats, &qlen, 1);
 part_average += qlen;
 }
 average += part_average / (RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE 
* RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS);
@@ -178,7 +181,9 @@ qavg_tcsubport(uint8_t port_id, uint32_t subport_id, 
uint8_t tc)
 queue_id = RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE * 
RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS * (subport_id * 
port_params.n_pipes_per_subport + i);

 for (j = 0; j < RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS; 
j++) {
-rte_sched_queue_read_stats(port, queue_id + 
(tc * RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS + j), &stats, &qlen);
+rte_sched_queue_read_stats(port,
+  queue_id + (tc * 
RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS + j),
+  &stats, &qlen, 1);
 part_average += qlen;
 }
 }
@@ -220,7 +225,8 @@ qavg_subport(uint8_t port_id, uint32_t subport_id)
 queue_id = RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE * 
RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS * (subport_id * 
port_params.n_pipes_per_subport + i);

 for (j = 0; j < RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE * 
RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS; j++) {
-rte_sched_queue_read_stats(port, queue_id + j, 
&stats, &qlen);
+rte_sched_queue_read_stats(port, queue_id + j,
+  &stats, &qlen, 1);
 part_average += qlen;
 }
 }
@@ -254,7 +260,7 @@ subport_stat(uint8_t port_id, uint32_t subport_id)
 port = qos_conf[i].sched_port;
memset (tc_ov, 0, sizeof(t

[dpdk-dev] [PATCH 5/6] rte_sched: don't put tabs in log messages

2015-05-11 Thread Stephen Hemminger
From: Stephen Hemminger 

syslog does not like tabs in log messages; tab gets translated to #011

Signed-off-by: Stephen Hemminger 
---
 lib/librte_sched/rte_sched.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/lib/librte_sched/rte_sched.c b/lib/librte_sched/rte_sched.c
index 74b3111..b8d036a 100644
--- a/lib/librte_sched/rte_sched.c
+++ b/lib/librte_sched/rte_sched.c
@@ -495,10 +495,10 @@ rte_sched_port_log_pipe_profile(struct rte_sched_port 
*port, uint32_t i)
struct rte_sched_pipe_profile *p = port->pipe_profiles + i;

RTE_LOG(INFO, SCHED, "Low level config for pipe profile %u:\n"
-   "\tToken bucket: period = %u, credits per period = %u, size = 
%u\n"
-   "\tTraffic classes: period = %u, credits per period = [%u, %u, 
%u, %u]\n"
-   "\tTraffic class 3 oversubscription: weight = %hhu\n"
-   "\tWRR cost: [%hhu, %hhu, %hhu, %hhu], [%hhu, %hhu, %hhu, 
%hhu], [%hhu, %hhu, %hhu, %hhu], [%hhu, %hhu, %hhu, %hhu]\n",
+   "Token bucket: period = %u, credits per period = %u, size = 
%u\n"
+   "Traffic classes: period = %u, credits per period = [%u, 
%u, %u, %u]\n"
+   "Traffic class 3 oversubscription: weight = %hhu\n"
+   "WRR cost: [%hhu, %hhu, %hhu, %hhu], [%hhu, %hhu, %hhu, 
%hhu], [%hhu, %hhu, %hhu, %hhu], [%hhu, %hhu, %hhu, %hhu]\n",
i,

/* Token bucket */
@@ -716,9 +716,9 @@ rte_sched_port_log_subport_config(struct rte_sched_port 
*port, uint32_t i)
struct rte_sched_subport *s = port->subport + i;

RTE_LOG(INFO, SCHED, "Low level config for subport %u:\n"
-   "\tToken bucket: period = %u, credits per period = %u, size = 
%u\n"
-   "\tTraffic classes: period = %u, credits per period = [%u, %u, 
%u, %u]\n"
-   "\tTraffic class 3 oversubscription: wm min = %u, wm max = 
%u\n",
+   "Token bucket: period = %u, credits per period = %u, size = 
%u\n"
+   "Traffic classes: period = %u, credits per period = [%u, 
%u, %u, %u]\n"
+   "Traffic class 3 oversubscription: wm min = %u, wm max = 
%u\n",
i,

/* Token bucket */
-- 
2.1.4



[dpdk-dev] [PATCH 6/6] rte_sched: use correct log level

2015-05-11 Thread Stephen Hemminger
The setup messages should be at DEBUG level since they are not
important for normal operation of the system. The messages about
problems should be at NOTICE or ERR level.

Signed-off-by: Stephen Hemminger 
---
 lib/librte_sched/rte_sched.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/lib/librte_sched/rte_sched.c b/lib/librte_sched/rte_sched.c
index b8d036a..ec55f67 100644
--- a/lib/librte_sched/rte_sched.c
+++ b/lib/librte_sched/rte_sched.c
@@ -448,7 +448,8 @@ rte_sched_port_get_memory_footprint(struct 
rte_sched_port_params *params)

status = rte_sched_port_check_params(params);
if (status != 0) {
-   RTE_LOG(INFO, SCHED, "Port scheduler params check failed 
(%d)\n", status);
+   RTE_LOG(NOTICE, SCHED,
+   "Port scheduler params check failed (%d)\n", status);

return 0;
}
@@ -494,7 +495,7 @@ rte_sched_port_log_pipe_profile(struct rte_sched_port 
*port, uint32_t i)
 {
struct rte_sched_pipe_profile *p = port->pipe_profiles + i;

-   RTE_LOG(INFO, SCHED, "Low level config for pipe profile %u:\n"
+   RTE_LOG(DEBUG, SCHED, "Low level config for pipe profile %u:\n"
"Token bucket: period = %u, credits per period = %u, size = 
%u\n"
"Traffic classes: period = %u, credits per period = [%u, 
%u, %u, %u]\n"
"Traffic class 3 oversubscription: weight = %hhu\n"
@@ -688,7 +689,7 @@ rte_sched_port_config(struct rte_sched_port_params *params)
bmp_mem_size = rte_bitmap_get_memory_footprint(n_queues_per_port);
port->bmp = rte_bitmap_init(n_queues_per_port, port->bmp_array, 
bmp_mem_size);
if (port->bmp == NULL) {
-   RTE_LOG(INFO, SCHED, "Bitmap init error\n");
+   RTE_LOG(ERR, SCHED, "Bitmap init error\n");
return NULL;
}
for (i = 0; i < RTE_SCHED_PORT_N_GRINDERS; i ++) {
@@ -715,7 +716,7 @@ rte_sched_port_log_subport_config(struct rte_sched_port 
*port, uint32_t i)
 {
struct rte_sched_subport *s = port->subport + i;

-   RTE_LOG(INFO, SCHED, "Low level config for subport %u:\n"
+   RTE_LOG(DEBUG, SCHED, "Low level config for subport %u:\n"
"Token bucket: period = %u, credits per period = %u, size = 
%u\n"
"Traffic classes: period = %u, credits per period = [%u, 
%u, %u, %u]\n"
"Traffic class 3 oversubscription: wm min = %u, wm max = 
%u\n",
@@ -857,7 +858,8 @@ rte_sched_pipe_config(struct rte_sched_port *port,
s->tc_ov = s->tc_ov_rate > subport_tc3_rate;

if (s->tc_ov != tc3_ov) {
-   RTE_LOG(INFO, SCHED, "Subport %u TC3 oversubscription 
is OFF (%.4lf >= %.4lf)\n",
+   RTE_LOG(DEBUG, SCHED,
+   "Subport %u TC3 oversubscription is OFF (%.4lf 
>= %.4lf)\n",
subport_id, subport_tc3_rate, s->tc_ov_rate);
}
 #endif
@@ -896,7 +898,8 @@ rte_sched_pipe_config(struct rte_sched_port *port,
s->tc_ov = s->tc_ov_rate > subport_tc3_rate;

if (s->tc_ov != tc3_ov) {
-   RTE_LOG(INFO, SCHED, "Subport %u TC3 oversubscription 
is ON (%.4lf < %.4lf)\n",
+   RTE_LOG(DEBUG, SCHED,
+   "Subport %u TC3 oversubscription is ON (%.4lf < 
%.4lf)\n",
subport_id, subport_tc3_rate, s->tc_ov_rate);
}
p->tc_ov_period_id = s->tc_ov_period_id;
-- 
2.1.4



[dpdk-dev] [PATCH 2/6] rte_sched: expand scheduler hierarchy for more VLAN's

2015-05-11 Thread Neil Horman
On Mon, May 11, 2015 at 10:07:47AM -0700, Stephen Hemminger wrote:
> From: Stephen Hemminger 
> 
> The QoS subport is limited to 8 bits in the original code.
> But customers demanded the ability to support the full number of VLANs (4096),
> therefore use the full tag field of the mbuf.
> 
> Resize the pipe field as well to allow for more pipes in the future and
> avoid expensive bitfield access.
> 
> Signed-off-by: Stephen Hemminger 
> ---
>  lib/librte_mbuf/rte_mbuf.h   |  5 -
>  lib/librte_sched/rte_sched.h | 38 --
>  2 files changed, 28 insertions(+), 15 deletions(-)
> 
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index ab6de67..cc0658d 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -295,7 +295,10 @@ struct rte_mbuf {
>   /**< First 4 flexible bytes or FD ID, dependent on
>PKT_RX_FDIR_* flag in ol_flags. */
>   } fdir;   /**< Filter identifier if FDIR enabled */
> - uint32_t sched;   /**< Hierarchical scheduler */
> + struct {
> + uint32_t lo;
> + uint32_t hi;
> + } sched;  /**< Hierarchical scheduler */
>   uint32_t usr; /**< User defined tags. See 
> rte_distributor_process() */
>   } hash;   /**< hash information */
>  
> diff --git a/lib/librte_sched/rte_sched.h b/lib/librte_sched/rte_sched.h
> index e6bba22..bf5ef8d 100644
> --- a/lib/librte_sched/rte_sched.h
> +++ b/lib/librte_sched/rte_sched.h
> @@ -195,16 +195,20 @@ struct rte_sched_port_params {
>  #endif
>  };
>  
> -/** Path through the scheduler hierarchy used by the scheduler enqueue 
> operation to
> -identify the destination queue for the current packet. Stored in the field 
> hash.sched
> -of struct rte_mbuf of each packet, typically written by the classification 
> stage and read by
> -scheduler enqueue.*/
> +/*
> + * Path through the scheduler hierarchy used by the scheduler enqueue
> + * operation to identify the destination queue for the current
> + * packet. Stored in the field pkt.hash.sched of struct rte_mbuf of
> + * each packet, typically written by the classification stage and read
> + * by scheduler enqueue.
> + */
>  struct rte_sched_port_hierarchy {
> - uint32_t queue:2;/**< Queue ID (0 .. 3) */
> - uint32_t traffic_class:2;/**< Traffic class ID (0 .. 3)*/
> - uint32_t pipe:20;/**< Pipe ID */
> - uint32_t subport:6;  /**< Subport ID */
> - uint32_t color:2;/**< Color */
> + uint16_t queue:2;/**< Queue ID (0 .. 3) */
> + uint16_t traffic_class:2;/**< Traffic class ID (0 .. 3)*/
> + uint16_t color:2;/**< Color */
> + uint16_t unused:10;
> + uint16_t subport;/**< Subport ID */
> + uint32_t pipe;   /**< Pipe ID */
>  };
Have you run this through the ABI checker?  Seems like this would alter lots of
pointer offsets.
Neil

>  
>  /*
> @@ -350,12 +354,15 @@ rte_sched_queue_read_stats(struct rte_sched_port *port,
>   */
>  static inline void
>  rte_sched_port_pkt_write(struct rte_mbuf *pkt,
> - uint32_t subport, uint32_t pipe, uint32_t traffic_class, uint32_t 
> queue, enum rte_meter_color color)
> +  uint32_t subport, uint32_t pipe,
> +  uint32_t traffic_class,
> +  uint32_t queue, enum rte_meter_color color)
>  {
> - struct rte_sched_port_hierarchy *sched = (struct 
> rte_sched_port_hierarchy *) &pkt->hash.sched;
> + struct rte_sched_port_hierarchy *sched
> + = (struct rte_sched_port_hierarchy *) &pkt->hash.sched;
>  
> - sched->color = (uint32_t) color;
>   sched->subport = subport;
> + sched->color = (uint32_t) color;
>   sched->pipe = pipe;
>   sched->traffic_class = traffic_class;
>   sched->queue = queue;
> @@ -379,9 +386,12 @@ rte_sched_port_pkt_write(struct rte_mbuf *pkt,
>   *
>   */
>  static inline void
> -rte_sched_port_pkt_read_tree_path(struct rte_mbuf *pkt, uint32_t *subport, 
> uint32_t *pipe, uint32_t *traffic_class, uint32_t *queue)
> +rte_sched_port_pkt_read_tree_path(struct rte_mbuf *pkt, uint32_t *subport,
> +   uint32_t *pipe, uint32_t *traffic_class,
> +   uint32_t *queue)
>  {
> - struct rte_sched_port_hierarchy *sched = (struct 
> rte_sched_port_hierarchy *) &pkt->hash.sched;
> + struct rte_sched_port_hierarchy *sched
> + = (struct rte_sched_port_hierarchy *) &pkt->hash.sched;
>  
>   *subport = sched->subport;
>   *pipe = sched->pipe;
> -- 
> 2.1.4
> 
> 


[dpdk-dev] [PATCH 1/4] pci: allow access to PCI config space

2015-05-11 Thread Stephen Hemminger
On Mon, 11 May 2015 11:37:08 -0400
Neil Horman  wrote:

> On Mon, May 11, 2015 at 08:23:59AM -0700, Stephen Hemminger wrote:
> > Ok will stub out the pci_config stuff for BSD.
> > But I don't have time or resources to do real BSD support.
> > 
> > Also, the whole bnx2x driver loads firmware and that probably has 
> > dependencies
> > that are different on BSD
> > 
> That's a fair point.  Is the implication here that, even with the stub 
> functions
> for pci config space, bnx2x won't build on BSD?
> 
> Neil
> 

It will build but not run since it looks for firmware in /lib/firmware


[dpdk-dev] [PATCH 1/4] pci: allow access to PCI config space

2015-05-11 Thread Stephen Hemminger
On Mon, 11 May 2015 12:54:54 +
Neil Horman  wrote:

> On Thu, May 07, 2015 at 04:25:32PM -0700, Stephen Hemminger wrote:
> > From: Stephen Hemminger 
> > 
> > Some drivers need ability to access PCI config (for example for power
> > management). This adds an abstraction to do this; only implemented
> > on Linux, but should be possible on BSD.
> > 

Could someone who has BSD infrastructure try this? I'm not sure if it will even
build.

diff --git a/lib/librte_eal/bsdapp/eal/eal_pci.c 
b/lib/librte_eal/bsdapp/eal/eal_pci.c
index 61e8921..8ba5b13 100644
--- a/lib/librte_eal/bsdapp/eal/eal_pci.c
+++ b/lib/librte_eal/bsdapp/eal/eal_pci.c
@@ -490,6 +490,76 @@ rte_eal_pci_probe_one_driver(struct rte_pci_driver *dr, 
struct rte_pci_device *d
return 1;
 }

+/* Read PCI config space. */
+int rte_eal_pci_read_config(const struct rte_pci_device *dev,
+   void *buf, size_t len, off_t offset)
+{
+   int fd = -1;
+
+   fd = open("/dev/pci", O_RDONLY);
+   if (fd < 0) {
+   RTE_LOG(ERR, EAL, "%s(): error opening /dev/pci\n", __func__);
+   goto error;
+   }
+
+   struct pci_io pi = {
+   .pi_sel = {
+   .pc_domain = dev->addr.domain,
+   .pc_bus = dev->addr.bus,
+   .pc_dev = dev->addr.devid,
+   .pc_func = dev->addr.function,
+   },
+   .pi_reg = offset,
+   .pi_width = len,   /* access width: 1, 2 or 4 bytes */
+   };
+
+   if (ioctl(fd, PCIIOCREAD, &pi) < 0)
+   goto error;
+
+   /* the value read is returned in pi_data; copy it to the caller's buffer */
+   memcpy(buf, &pi.pi_data, len);
+   close(fd);
+   return 0;
+
+error:
+   if (fd >= 0)
+   close(fd);
+   return -1;
+}
+
+/* Write PCI config space. */
+int rte_eal_pci_write_config(const struct rte_pci_device *dev,
+const void *buf, size_t len, off_t offset)
+{
+   int fd = -1;
+
+   fd = open("/dev/pci", O_RDONLY);
+   if (fd < 0) {
+   RTE_LOG(ERR, EAL, "%s(): error opening /dev/pci\n", __func__);
+   goto error;
+   }
+
+   struct pci_io pi = {
+   .pi_sel = {
+   .pc_domain = dev->addr.domain,
+   .pc_bus = dev->addr.bus,
+   .pc_dev = dev->addr.devid,
+   .pc_func = dev->addr.function,
+   },
+   .pi_reg = offset,
+   .pi_width = len,   /* access width: 1, 2 or 4 bytes */
+   };
+
+   /* the value to write goes in pi_data; copy it in from the caller's buffer */
+   memcpy(&pi.pi_data, buf, len);
+
+   if (ioctl(fd, PCIIOCWRITE, &pi) < 0)
+   goto error;
+   close(fd);
+   return 0;
+
+error:
+   if (fd >= 0)
+   close(fd);
+   return -1;
+}
+
 /* Init the PCI EAL subsystem */
 int
 rte_eal_pci_init(void)


[dpdk-dev] [PATCH 2/6] rte_sched: expand scheduler hierarchy for more VLAN's

2015-05-11 Thread Stephen Hemminger
On Mon, 11 May 2015 17:20:07 +
Neil Horman  wrote:

> Have you run this through the ABI checker?  Seems like this would alter lots 
> of
> pointer offsets.
> Neil

No, I have not run it through the ABI checker.
It would change the ABI for applications using qos_sched but will not
change the layout of the mbuf.

But my assumption was that, as part of the release process, the ABI version
would change rather than doing so for each patch that gets merged.


[dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.

2015-05-11 Thread Ravi Kerur
Hi Konstantin,


On Mon, May 11, 2015 at 2:51 AM, Ananyev, Konstantin <
konstantin.ananyev at intel.com> wrote:

> Hi Ravi,
>
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Ravi Kerur
> > Sent: Friday, May 08, 2015 11:55 PM
> > To: Matt Laswell
> > Cc: dev at dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE
> instructions.
> >
> > On Fri, May 8, 2015 at 3:29 PM, Matt Laswell 
> wrote:
> >
> > >
> > >
> > > On Fri, May 8, 2015 at 4:19 PM, Ravi Kerur  wrote:
> > >
> > >> This patch replaces memcmp in librte_hash with rte_memcmp which is
> > >> implemented with AVX/SSE instructions.
> > >>
> > >> +static inline int
> > >> +rte_memcmp(const void *_src_1, const void *_src_2, size_t n)
> > >> +{
> > >> +   const uint8_t *src_1 = (const uint8_t *)_src_1;
> > >> +   const uint8_t *src_2 = (const uint8_t *)_src_2;
> > >> +   int ret = 0;
> > >> +
> > >> +   if (n & 0x80)
> > >> +   return rte_cmp128(src_1, src_2);
> > >> +
> > >> +   if (n & 0x40)
> > >> +   return rte_cmp64(src_1, src_2);
> > >> +
> > >> +   if (n & 0x20) {
> > >> +   ret = rte_cmp32(src_1, src_2);
> > >> +   n -= 0x20;
> > >> +   src_1 += 0x20;
> > >> +   src_2 += 0x20;
> > >> +   }
> > >>
> > >>
> > > Pardon me for butting in, but this seems incorrect for the first two
> cases
> > > listed above, as the function as written will only compare the first
> 128 or
> > > 64 bytes of each source and return the result.  The pattern expressed
> in
> > > the 32 byte case appears more correct, as it compares the first 32
> bytes
> > > and then lets later pieces of the function handle the smaller remaining
> > > bits of the sources. Also, if this function is to handle arbitrarily
> large
> > > source data, the 128 byte case needs to be in a loop.
> > >
> > > What am I missing?
> > >
> >
> > The current max hash key length supported is 64 bytes, hence no comparison
> > is done beyond 64 bytes. The 128-byte comparison was added only to measure
> > performance; there is no use case for it as of now. With the current use
> > cases it's not required, but if there is a need to handle arbitrary data of
> > up to 128 bytes it can be modified.
>
> So on x86, let's say rte_memcmp(k1, k2, 65) might produce invalid results,
> right?
> While on PPC it will work as expected (as it calls memcmp underneath)?
> That looks really weird to me.
> If you plan to use rte_memcmp only for hash comparisons, then probably
> you should put it somewhere into librte_hash and name it accordingly:
> rte_hash_key_cmp() or something.
> And put a big comment around it, that it only works with particular
> lengths.
> If you want it to be a generic function inside EAL, then it probably needs
> to handle different lengths properly
> on all supported architectures.
> Konstantin
>
>
Let me just explain it here and probably add it to document as well.

rte_memcmp is not

1. a replacement to memcmp

2.  restricted to hash key comparison

rte_memcmp is

1. an optimized comparison for 16 to 128 bytes; the v1 patch series had
this support. Some of the logic changed in v2 due to concerns that there
are no use-cases beyond 64-byte comparison. With minor tuning over the
weekend I am able to get better performance for anything between 16 and
128 bytes.

2. specific to DPDK, i.e. currently all memcmp usage in DPDK is for
equality or inequality, so a "less than"/"greater than" implementation in
rte_memcmp doesn't make sense and will be removed in subsequent patches;
it will return 0 or 1 for the equal/unequal cases.

rte_hash will be the first candidate to move to rte_memcmp, followed by
rte_lpm6, which uses 16-byte comparisons.

Later on, RING_SIZE, which uses larger sizes for comparison, will be moved.
I am currently studying that logic and will make changes to rte_memcmp to
support it.

I don't want to make a lot of changes in one shot and see the patch series
die a slow death with no takers.

Thanks,
Ravi

>
> > >
> > > --
> > > Matt Laswell
> > > infinite io, inc.
> > > laswell at infiniteio.com
> > >
> > >
>
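
For illustration, a minimal sketch (not the submitted patch) of the loop
structure Matt describes, so that arbitrary lengths are handled: plain
memcmp() on fixed-size blocks stands in for the rte_cmp128()/rte_cmp64()
SSE/AVX helpers, and only equal/not-equal is reported, matching the DPDK
use-cases Ravi describes.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Compare buffers of arbitrary length: loop over full 128-byte chunks and
 * then handle the remainder, instead of returning after the first chunk.
 * Returns 0 if equal, 1 otherwise. */
static inline int
cmp_any_len(const void *_src_1, const void *_src_2, size_t n)
{
	const uint8_t *src_1 = _src_1;
	const uint8_t *src_2 = _src_2;

	while (n >= 128) {
		/* memcmp() stands in for a vectorized 128-byte compare. */
		if (memcmp(src_1, src_2, 128) != 0)
			return 1;
		src_1 += 128;
		src_2 += 128;
		n -= 128;
	}

	/* Remaining 0..127 bytes. */
	return n ? (memcmp(src_1, src_2, n) != 0) : 0;
}

Swapping the memcmp() calls for the vector block helpers keeps the same
control flow while handling any length correctly.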


[dpdk-dev] [PATCH 2/6] rte_sched: expand scheduler hierarchy for more VLAN's

2015-05-11 Thread Neil Horman
On Mon, May 11, 2015 at 10:32:59AM -0700, Stephen Hemminger wrote:
> On Mon, 11 May 2015 17:20:07 +
> Neil Horman  wrote:
> 
> > Have you run this through the ABI checker?  Seems like this would alter 
> > lots of
> > pointer offsets.
> > Neil
> 
> No, I have not run it through the ABI checker.
> It would change the ABI for applications using qos_sched but will not
> change the layout of the mbuf.
> 
> But my assumption was that, as part of the release process, the ABI version
> would change rather than doing so for each patch that gets merged.
> 

You're correct that the ABI version can change, but the process is to make an
update to doc/guides/rel_notes/abi.rst documenting the proposed change, wait
for that to be published in an official release, and then make the change in
the following release.  That way downstream adopters have some lead time to
prepare for upstream changes.

Neil



[dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.

2015-05-11 Thread Ananyev, Konstantin

Hi Ravi,

> 
> From: Ravi Kerur [mailto:rkerur at gmail.com]
> Sent: Monday, May 11, 2015 6:43 PM
> To: Ananyev, Konstantin
> Cc: Matt Laswell; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE 
> instructions.
> 
> Hi Konstantin,
> 
> 
> On Mon, May 11, 2015 at 2:51 AM, Ananyev, Konstantin  intel.com> wrote:
> Hi Ravi,
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Ravi Kerur
> > Sent: Friday, May 08, 2015 11:55 PM
> > To: Matt Laswell
> > Cc: dev at dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE 
> > instructions.
> >
> > On Fri, May 8, 2015 at 3:29 PM, Matt Laswell  
> > wrote:
> >
> > >
> > >
> > > On Fri, May 8, 2015 at 4:19 PM, Ravi Kerur  wrote:
> > >
> > >> This patch replaces memcmp in librte_hash with rte_memcmp which is
> > >> implemented with AVX/SSE instructions.
> > >>
> > >> +static inline int
> > >> +rte_memcmp(const void *_src_1, const void *_src_2, size_t n)
> > >> +{
> > >> +   const uint8_t *src_1 = (const uint8_t *)_src_1;
> > >> +   const uint8_t *src_2 = (const uint8_t *)_src_2;
> > >> +   int ret = 0;
> > >> +
> > >> +   if (n & 0x80)
> > >> +   return rte_cmp128(src_1, src_2);
> > >> +
> > >> +   if (n & 0x40)
> > >> +   return rte_cmp64(src_1, src_2);
> > >> +
> > >> +   if (n & 0x20) {
> > >> +   ret = rte_cmp32(src_1, src_2);
> > >> +   n -= 0x20;
> > >> +   src_1 += 0x20;
> > >> +   src_2 += 0x20;
> > >> +   }
> > >>
> > >>
> > > Pardon me for butting in, but this seems incorrect for the first two cases
> > > listed above, as the function as written will only compare the first 128 
> > > or
> > > 64 bytes of each source and return the result.  The pattern expressed in
> > > the 32 byte case appears more correct, as it compares the first 32 bytes
> > > and then lets later pieces of the function handle the smaller remaining
> > > bits of the sources. Also, if this function is to handle arbitrarily large
> > > source data, the 128 byte case needs to be in a loop.
> > >
> > > What am I missing?
> > >
> >
> > Current max hash key length supported is 64 bytes, hence no comparison is
> > done after 64 bytes. 128 bytes comparison is added to measure performance
> > only and there is no use-case as of now. With the current use-cases its not
> > required but if there is a need to handle large arbitrary data upto 128
> > bytes it can be modified.
> So on x86, let's say, rte_memcmp(k1, k2, 65) might produce invalid results, right?
> While on PPC it will work as expected (as it calls memcmp underneath)?
> That looks really weird to me.
> If you plan to use rte_memcmp only for hash comparisons, then probably
> you should put it somewhere into librte_hash and name it accordingly: 
> rte_hash_key_cmp() or something.
> And put a big comment around it, that it only works with particular lengths.
> If you want it to be a generic function inside EAL, then it probably need to 
> handle different lengths properly
> on all supported architectures.
> Konstantin
> 
> 
> Let me just explain it here and probably add it to document as well.
> 
> rte_memcmp is not
> 
> 1. a replacement to memcmp
> 
> 2. restricted to hash key comparison
> 
> rte_memcmp is
> 
> 1. optimized comparison for 16 to 128 bytes, v1 patch series had this 
> support. Changed some of the logic in v2 due to concerns raised
> for unavailable use-cases beyond 64 bytes comparison.

From what I see, in v2 it is supposed to work correctly for len in [0,64] and
len=128, right?
Not sure I get it: so for v1 it was able to handle any length correctly, but
then you removed that?
If so, I wonder what was the reason? To make it faster?

Another thing that looks strange to me:
while all the rte_cmp*() functions use the actual data values for their
comparison results, the rte_memcmp_remainder() return value depends not only
on the data values but also on the data locations:

+static inline int
+rte_memcmp_remainder(const uint8_t *src_1u, const uint8_t *src_2u, size_t n)
+{
...
exit:
+
+   return src_1u < src_2u ? -1 : 1;
+}

If you just test for equal/not equal that doesn't really matter.
If this is supposed to be a 'proper' comparison function, then the result is 
sort of unpredictable.
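
For contrast, a sketch (not part of the patch) of a tail comparison whose
result depends only on the data, in the spirit of memcmp(): the first
differing byte pair determines the sign, rather than the src_1u/src_2u
pointer values used in the quoted exit path.

#include <stddef.h>
#include <stdint.h>

/* Compare the remaining n bytes and derive the sign from the first
 * differing bytes, not from the buffer addresses. */
static inline int
cmp_remainder_by_value(const uint8_t *src_1u, const uint8_t *src_2u, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (src_1u[i] != src_2u[i])
			return src_1u[i] < src_2u[i] ? -1 : 1;
	}
	return 0;
}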

> With minor tuning over the weekend I am able to get better performance for
> anything between 16 to 128 bytes comparison.
> 
> 2. will be specific to DPDK i.e. currently all memcmp usage in DPDK are for 
> equality or inequality hence "less than" or "greater than"
> implementation in rte_memcmp doesn't make sense and will be removed in 
> subsequent patches, it will return 0 or 1 for
> equal/unequal cases.

If you don't plan for your function to follow memcmp() semantics and syntax,
why name it rte_memcmp()?
I think that will create a lot of confusion.
Why not name it differently (and put a clear comment in the declaration, of
course)?

> 
> rte_hash will be the first candidate to m

[dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.

2015-05-11 Thread Ravi Kerur
Hi Konstantin,


On Mon, May 11, 2015 at 12:35 PM, Ananyev, Konstantin <
konstantin.ananyev at intel.com> wrote:

>
> Hi Ravi,
>
> >
> > From: Ravi Kerur [mailto:rkerur at gmail.com]
> > Sent: Monday, May 11, 2015 6:43 PM
> > To: Ananyev, Konstantin
> > Cc: Matt Laswell; dev at dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE
> instructions.
> >
> > Hi Konstantin,
> >
> >
> > On Mon, May 11, 2015 at 2:51 AM, Ananyev, Konstantin <
> konstantin.ananyev at intel.com> wrote:
> > Hi Ravi,
> >
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Ravi Kerur
> > > Sent: Friday, May 08, 2015 11:55 PM
> > > To: Matt Laswell
> > > Cc: dev at dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE
> instructions.
> > >
> > > On Fri, May 8, 2015 at 3:29 PM, Matt Laswell 
> wrote:
> > >
> > > >
> > > >
> > > > On Fri, May 8, 2015 at 4:19 PM, Ravi Kerur  wrote:
> > > >
> > > >> This patch replaces memcmp in librte_hash with rte_memcmp which is
> > > >> implemented with AVX/SSE instructions.
> > > >>
> > > >> +static inline int
> > > >> +rte_memcmp(const void *_src_1, const void *_src_2, size_t n)
> > > >> +{
> > > >> +   const uint8_t *src_1 = (const uint8_t *)_src_1;
> > > >> +   const uint8_t *src_2 = (const uint8_t *)_src_2;
> > > >> +   int ret = 0;
> > > >> +
> > > >> +   if (n & 0x80)
> > > >> +   return rte_cmp128(src_1, src_2);
> > > >> +
> > > >> +   if (n & 0x40)
> > > >> +   return rte_cmp64(src_1, src_2);
> > > >> +
> > > >> +   if (n & 0x20) {
> > > >> +   ret = rte_cmp32(src_1, src_2);
> > > >> +   n -= 0x20;
> > > >> +   src_1 += 0x20;
> > > >> +   src_2 += 0x20;
> > > >> +   }
> > > >>
> > > >>
> > > > Pardon me for butting in, but this seems incorrect for the first two
> cases
> > > > listed above, as the function as written will only compare the first
> 128 or
> > > > 64 bytes of each source and return the result.  The pattern
> expressed in
> > > > the 32 byte case appears more correct, as it compares the first 32
> bytes
> > > > and then lets later pieces of the function handle the smaller
> remaining
> > > > bits of the sources. Also, if this function is to handle arbitrarily
> large
> > > > source data, the 128 byte case needs to be in a loop.
> > > >
> > > > What am I missing?
> > > >
> > >
> > > Current max hash key length supported is 64 bytes, hence no comparison
> is
> > > done after 64 bytes. 128 bytes comparison is added to measure
> performance
> > > only and there is no use-case as of now. With the current use-cases
> its not
> > > required but if there is a need to handle large arbitrary data upto 128
> > > bytes it can be modified.
> > So on x86, let's say, rte_memcmp(k1, k2, 65) might produce invalid results,
> right?
> > While on PPC it will work as expected (as it calls memcmp underneath)?
> > That looks really weird to me.
> > If you plan to use rte_memcmp only for hash comparisons, then probably
> > you should put it somewhere into librte_hash and name it accordingly:
> rte_hash_key_cmp() or something.
> > And put a big comment around it, that it only works with particular
> lengths.
> > If you want it to be a generic function inside EAL, then it probably
> need to handle different lengths properly
> > on all supported architectures.
> > Konstantin
> >
> >
> > Let me just explain it here and probably add it to document as well.
> >
> > rte_memcmp is not
> >
> > 1. a replacement to memcmp
> >
> > 2.  restricted to hash key comparison
> >
> > rte_memcmp is
> >
> > 1. optimized comparison for 16 to 128 bytes, v1 patch series had this
> support. Changed some of the logic in v2 due to concerns raised
> > for unavailable use-cases beyond 64 bytes comparison.
>
> From what I see, in v2 it is supposed to work correctly for len in [0,64] and
> len=128, right?
> Not sure I get it: so for v1 it was able to handle any length correctly,
> but then you removed it?
> If so, I wonder what was the reason? Make it faster?
>

My initial discussion was with Zhilong (John) from Intel, and we decided to
implement comparison for up to 128 bytes and use rte_hash and rte_lpm6 as
candidates for testing. When I sent out the v1 patch, Bruce's comments
questioned the use-case for 128-byte comparison and whether it was really
required. Hence I decided in v2 to support only up to 64 bytes and added
the 128-byte case only for performance measurement.

Personally, I think support for comparison of up to 128 bytes is required;
there might not be use-cases today, but it will definitely be useful.


> Another thing that looks strange to me:
> While all rte_cmp*() uses actual data values for comparison results,
> rte_memcmp_remainder() return value depends not only on data values but
> also on data locations:
>
> +static inline int
> +rte_memcmp_remainder(const uint8_t *src_1u, const uint8_t *src_2u, size_t
> n)
> +{
> ...
> exit:
> +
> +   return src_1u < src_2u ? -1 : 

[dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.

2015-05-11 Thread Don Provan
I probably shouldn't stick my nose into this, but I can't help myself.

An experienced programmer will tend to ignore the documentation for
a routine named "blahblah_memcmp" and just assume it functions like
memcmp. Whether or not there's currently a use case in DPDK is
completely irrelevant because as soon as there *is* a use case, some
poor DPDK developer will try to use rte_memcmp for that and may or
may not have a test case that reveals their mistake.

The term "compare" suggests checking for larger or smaller.
If you want to check for equality, use "equal" or "eq" in the name
and return true if they're equal. But personally, I'd compare unless
there was a good reason not to. Indeed, I would just implement
full memcmp functionality and be done with it, even if that meant
using my fancy new assembly code for the cases I handle and then
calling memcmp itself for the cases I didn't.
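
A rough sketch of that approach (illustrative only, SSE2, not the patch
under review): keep full memcmp() semantics, use a vector fast path for the
one fixed size handled here (16 bytes), and hand everything else to libc
memcmp().

#include <emmintrin.h>
#include <stddef.h>
#include <string.h>

/* Vector fast path for 16-byte equality; memcmp() fallback for every other
 * length and for producing the proper ordering result. */
static inline int
cmp_with_fallback(const void *a, const void *b, size_t n)
{
	if (n == 16) {
		__m128i x = _mm_loadu_si128((const __m128i *)a);
		__m128i y = _mm_loadu_si128((const __m128i *)b);

		/* 0xffff means all 16 byte lanes compared equal. */
		if (_mm_movemask_epi8(_mm_cmpeq_epi8(x, y)) == 0xffff)
			return 0;
		/* Not equal: fall through so memcmp() supplies the sign. */
	}
	return memcmp(a, b, n);
}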

If a routine that appears to take an arbitrary size doesn't, the name
should in some manner reflect what sizes it takes. Better would be
for a routine that only handles specific sizes to be split into versions
that only take fixed sizes, but I don't know enough about your use
cases to say whether that makes sense here.

-don provan
dprovan at bivio.net

-Original Message-
From: Ravi Kerur [mailto:rke...@gmail.com] 
Sent: Monday, May 11, 2015 1:47 PM
To: Ananyev, Konstantin
Cc: dev at dpdk.org
Subject: Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.

...
Following memcmp semantics is not hard, but there are no use-cases for it in
DPDK currently. Keeping it specific to DPDK usage simplifies the code as well.
I can change the name to "rte_compare" and add comments to the function.
Will that work?
...



[dpdk-dev] [PATCH v7 1/2] Simplify the ifdefs in rte.app.mk

2015-05-11 Thread Keith Wiles
Simplify the ifdefs in rte.app.mk to make the code more readable and
maintainable by moving the LDLIBS variable to the same LDLIBS-y style
used in the rest of the code.

Add a new variable called EXTRA_LDLIBS to be used by example apps
instead of using LDLIBS directly. The new internal variable _LDLIBS
should not be used outside of the rte.app.mk file.

Signed-off-by: Keith Wiles 
---
 mk/rte.app.mk | 242 +++---
 1 file changed, 60 insertions(+), 182 deletions(-)

diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 62a76ae..b8030d2 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -1,7 +1,7 @@
 #   BSD LICENSE
 #
-#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
-#   Copyright(c) 2014 6WIND S.A.
+#   Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
+#   Copyright(c) 2014-2015 6WIND S.A.
 #   All rights reserved.
 #
 #   Redistribution and use in source and binary forms, with or without
@@ -51,7 +51,7 @@ LDSCRIPT = $(RTE_LDSCRIPT)
 endif

 # default path for libs
-LDLIBS += -L$(RTE_SDK_BIN)/lib
+_LDLIBS-y += -L$(RTE_SDK_BIN)/lib

 #
 # Include libraries depending on config if NO_AUTOLIBS is not set
@@ -59,215 +59,93 @@ LDLIBS += -L$(RTE_SDK_BIN)/lib
 #
 ifeq ($(NO_AUTOLIBS),)

-LDLIBS += --whole-archive
+_LDLIBS-y += --whole-archive

-ifeq ($(CONFIG_RTE_BUILD_COMBINE_LIBS),y)
-LDLIBS += -l$(RTE_LIBNAME)
-endif
+_LDLIBS-$(CONFIG_RTE_BUILD_COMBINE_LIBS)+= -l$(RTE_LIBNAME)

 ifeq ($(CONFIG_RTE_BUILD_COMBINE_LIBS),n)

-ifeq ($(CONFIG_RTE_LIBRTE_DISTRIBUTOR),y)
-LDLIBS += -lrte_distributor
-endif
-
-ifeq ($(CONFIG_RTE_LIBRTE_REORDER),y)
-LDLIBS += -lrte_reorder
-endif
+_LDLIBS-$(CONFIG_RTE_LIBRTE_DISTRIBUTOR)+= -lrte_distributor
+_LDLIBS-$(CONFIG_RTE_LIBRTE_REORDER)+= -lrte_reorder

-ifeq ($(CONFIG_RTE_LIBRTE_KNI),y)
 ifeq ($(CONFIG_RTE_EXEC_ENV_LINUXAPP),y)
-LDLIBS += -lrte_kni
-endif
+_LDLIBS-$(CONFIG_RTE_LIBRTE_KNI)+= -lrte_kni
+_LDLIBS-$(CONFIG_RTE_LIBRTE_IVSHMEM)+= -lrte_ivshmem
 endif

-ifeq ($(CONFIG_RTE_LIBRTE_IVSHMEM),y)
-ifeq ($(CONFIG_RTE_EXEC_ENV_LINUXAPP),y)
-LDLIBS += -lrte_ivshmem
-endif
-endif
+_LDLIBS-$(CONFIG_RTE_LIBRTE_PIPELINE)   += -lrte_pipeline
+_LDLIBS-$(CONFIG_RTE_LIBRTE_TABLE)  += -lrte_table
+_LDLIBS-$(CONFIG_RTE_LIBRTE_PORT)   += -lrte_port
+_LDLIBS-$(CONFIG_RTE_LIBRTE_TIMER)  += -lrte_timer
+_LDLIBS-$(CONFIG_RTE_LIBRTE_HASH)   += -lrte_hash
+_LDLIBS-$(CONFIG_RTE_LIBRTE_JOBSTATS)   += -lrte_jobstats
+_LDLIBS-$(CONFIG_RTE_LIBRTE_LPM)+= -lrte_lpm
+_LDLIBS-$(CONFIG_RTE_LIBRTE_POWER)  += -lrte_power
+_LDLIBS-$(CONFIG_RTE_LIBRTE_ACL)+= -lrte_acl
+_LDLIBS-$(CONFIG_RTE_LIBRTE_METER)  += -lrte_meter

-ifeq ($(CONFIG_RTE_LIBRTE_PIPELINE),y)
-LDLIBS += -lrte_pipeline
-endif
+_LDLIBS-$(CONFIG_RTE_LIBRTE_SCHED)  += -lrte_sched
+_LDLIBS-$(CONFIG_RTE_LIBRTE_SCHED)  += -lm
+_LDLIBS-$(CONFIG_RTE_LIBRTE_SCHED)  += -lrt

-ifeq ($(CONFIG_RTE_LIBRTE_TABLE),y)
-LDLIBS += -lrte_table
-endif
-
-ifeq ($(CONFIG_RTE_LIBRTE_PORT),y)
-LDLIBS += -lrte_port
-endif
-
-ifeq ($(CONFIG_RTE_LIBRTE_TIMER),y)
-LDLIBS += -lrte_timer
-endif
-
-ifeq ($(CONFIG_RTE_LIBRTE_HASH),y)
-LDLIBS += -lrte_hash
-endif
-
-ifeq ($(CONFIG_RTE_LIBRTE_JOBSTATS),y)
-LDLIBS += -lrte_jobstats
-endif
-
-ifeq ($(CONFIG_RTE_LIBRTE_LPM),y)
-LDLIBS += -lrte_lpm
-endif
-
-ifeq ($(CONFIG_RTE_LIBRTE_POWER),y)
-LDLIBS += -lrte_power
-endif
-
-ifeq ($(CONFIG_RTE_LIBRTE_ACL),y)
-LDLIBS += -lrte_acl
-endif
-
-ifeq ($(CONFIG_RTE_LIBRTE_METER),y)
-LDLIBS += -lrte_meter
-endif
-
-ifeq ($(CONFIG_RTE_LIBRTE_SCHED),y)
-LDLIBS += -lrte_sched
-LDLIBS += -lm
-LDLIBS += -lrt
-endif
-
-ifeq ($(CONFIG_RTE_LIBRTE_VHOST), y)
-LDLIBS += -lrte_vhost
-endif
+_LDLIBS-$(CONFIG_RTE_LIBRTE_VHOST)  += -lrte_vhost

 endif # ! CONFIG_RTE_BUILD_COMBINE_LIBS

-ifeq ($(CONFIG_RTE_LIBRTE_PMD_PCAP),y)
-LDLIBS += -lpcap
-endif
+_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_PCAP)   += -lpcap

-ifeq ($(CONFIG_RTE_LIBRTE_VHOST)$(CONFIG_RTE_LIBRTE_VHOST_USER),yn)
-LDLIBS += -lfuse
+ifeq ($(CONFIG_RTE_LIBRTE_VHOST_USER),n)
+_LDLIBS-$(CONFIG_RTE_LIBRTE_VHOST)  += -lfuse
 endif

-ifeq ($(CONFIG_RTE_LIBRTE_MLX4_PMD),y)
-LDLIBS += -libverbs
-endif
+_LDLIBS-$(CONFIG_RTE_LIBRTE_MLX4_PMD)   += -libverbs

-LDLIBS += --start-group
+_LDLIBS-y += --start-group

 ifeq ($(CONFIG_RTE_BUILD_COMBINE_LIBS),n)

-ifeq ($(CONFIG_RTE_LIBRTE_KVARGS),y)
-LDLIBS += -lrte_kvargs
-endif
-
-ifeq ($(CONFIG_RTE_LIBRTE_MBUF),y)
-LDLIBS += -lrte_mbuf
-endif
-
-ifeq ($(CONFIG_RTE_LIBRTE_IP_FRAG),y)
-LDLIBS += -lrte_ip_frag
-endif
-
-ifeq ($(CONFIG_RTE_LIBRTE_ETHER),y)
-LDLIBS += -lethdev
-endif
-
-ifeq ($(CONFIG_RTE_LIBRTE_MALLOC),y)
-LDLIBS += -lrte_malloc
-endif
-
-ifeq ($(CONFIG_RTE_LIBRTE_MEMPOOL),y)
-LDLIBS += -lrte_mempool
-endif
-
-ifeq ($(CONFIG_RTE_LIBRTE_RING),y)
-LDLIBS += -lrte_ring
-endif
-
-ifeq ($(CONFIG_RTE_LIBRTE_EAL),y)
-LDLIBS += 
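
To illustrate the pattern the patch moves to (a sketch, not part of the
patch; CONFIG_RTE_LIBRTE_FOO and -lcrypto are placeholder names): each
library is appended to _LDLIBS-$(CONFIG_...), so it lands in _LDLIBS-y when
the option is enabled and in an unused _LDLIBS-n bucket otherwise, and
example apps pass extra libraries through EXTRA_LDLIBS instead of touching
LDLIBS directly.

# Conditional append: with CONFIG_RTE_LIBRTE_FOO=y this adds -lrte_foo to
# _LDLIBS-y; with CONFIG_RTE_LIBRTE_FOO=n it goes to _LDLIBS-n, which the
# link step is expected to ignore.
_LDLIBS-$(CONFIG_RTE_LIBRTE_FOO) += -lrte_foo

# An example app requests additional libraries in its Makefile:
EXTRA_LDLIBS += -lcrypto
# or on the command line:
#   make EXTRA_LDLIBS="-lcrypto"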

[dpdk-dev] [PATCH v7 2/2] Update Docs for new EXTRA_LDLIBS variable

2015-05-11 Thread Keith Wiles
Signed-off-by: Keith Wiles 
---
 doc/build-sdk-quick.txt  | 1 +
 doc/guides/prog_guide/dev_kit_build_system.rst   | 2 ++
 doc/guides/prog_guide/dev_kit_root_make_help.rst | 2 +-
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/doc/build-sdk-quick.txt b/doc/build-sdk-quick.txt
index 041a40e..26d5442 100644
--- a/doc/build-sdk-quick.txt
+++ b/doc/build-sdk-quick.txt
@@ -13,6 +13,7 @@ Build variables
EXTRA_CPPFLAGS   preprocessor options
EXTRA_CFLAGS compiler options
EXTRA_LDFLAGSlinker options
+   EXTRA_LDLIBS linker library options
RTE_KERNELDIRlinux headers path
CROSS toolchain prefix
V verbose
diff --git a/doc/guides/prog_guide/dev_kit_build_system.rst 
b/doc/guides/prog_guide/dev_kit_build_system.rst
index 5bfef58..50bfe34 100644
--- a/doc/guides/prog_guide/dev_kit_build_system.rst
+++ b/doc/guides/prog_guide/dev_kit_build_system.rst
@@ -413,6 +413,8 @@ Variables that Can be Set/Overridden by the User in a 
Makefile or Command Line

 *   EXTRA_LDFLAGS: The content of this variable is appended after LDFLAGS when 
linking.

+*   EXTRA_LDLIBS: The content of this variable is appended after LDLIBS when 
linking.
+
 *   EXTRA_ASFLAGS: The content of this variable is appended after ASFLAGS when 
assembling.

 *   EXTRA_CPPFLAGS: The content of this variable is appended after CPPFLAGS 
when using a C preprocessor on assembly files.
diff --git a/doc/guides/prog_guide/dev_kit_root_make_help.rst 
b/doc/guides/prog_guide/dev_kit_root_make_help.rst
index 333b007..e522c12 100644
--- a/doc/guides/prog_guide/dev_kit_root_make_help.rst
+++ b/doc/guides/prog_guide/dev_kit_root_make_help.rst
@@ -218,7 +218,7 @@ The following variables can be specified on the command 
line:

 Enable dependency debugging. This provides some useful information about 
why a target is built or not.

-*   EXTRA_CFLAGS=, EXTRA_LDFLAGS=, EXTRA_ASFLAGS=, EXTRA_CPPFLAGS=
+*   EXTRA_CFLAGS=, EXTRA_LDFLAGS=, EXTRA_LDLIBS=, EXTRA_ASFLAGS=, 
EXTRA_CPPFLAGS=

 Append specific compilation, link or asm flags.

-- 
2.3.0