[dpdk-dev] DPDK and custom memory

2014-09-19 Thread Saygin, Artur
FWIW: rte_mempool_xmem_create turned out to be exactly what the use case 
requires. It's not without limitations but is probably better than having to 
copy buffers between device and DPDK memory.

-Original Message-
From: Neil Horman [mailto:nhor...@tuxdriver.com] 
Sent: Wednesday, September 03, 2014 3:04 AM
To: Saygin, Artur
Cc: Alex Markuze; Thomas Monjalon; dev at dpdk.org
Subject: Re: [dpdk-dev] DPDK and custom memory

On Wed, Sep 03, 2014 at 01:17:53AM +, Saygin, Artur wrote:
> Thanks for prompt responses!
> 
> To clarify, the question is not about accessing a NIC, but about a NIC
> accessing a very specific block of physical memory, possibly non-kernel
> managed.
> 
Still not sure what you mean here by non-kernel managed.  If memory can be
accessed from the CPU, then the kernel can allocate, free and access it; that's
it.  If the memory isn't accessible from the CPU, then this is out of our hands
anyway.  The only question is how you access it.

> Per my understanding memory that rte_mempool_create API obtains is kernel 
> managed, grabbed by DPDK via HUGETLBFS, with address selection being outside 
> of application control. Is there a way around that? As in have DPDK allocate 
> buffer memory from address XYZ only...
Nope, the DPDK allocates blocks of memory without regard to the operation of the
NIC.  If you have some odd NIC that requires access to a specific physical
memory range, then it is your responsibility to reserve that memory and author
the PMD in such a way that it communicates with the NIC via that memory.
Usually this is done via a combination of operating system facilities (e.g. the
Linux kernel command-line option memmap or the runtime mmap operation on the
/dev/mem device).
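
A minimal sketch of the mmap half of that approach, assuming the range was
already reserved at boot (for example with memmap=) and that the process is
allowed to open /dev/mem. The helper names map_phys_range and unmap_phys_range
are hypothetical and not part of the DPDK; the offset passed in must be page
aligned.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `len` bytes starting at byte offset `phys` of the
 * file at `path`. With path = "/dev/mem" the offset is a physical address.
 * Returns NULL if the file cannot be opened or the mapping fails.
 */
static void *map_phys_range(const char *path, off_t phys, size_t len)
{
    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;

    void *va = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, phys);
    close(fd); /* the mapping stays valid after the descriptor is closed */

    return va == MAP_FAILED ? NULL : va;
}

/* Hypothetical helper: release a mapping obtained from map_phys_range(). */
static void unmap_phys_range(void *va, size_t len)
{
    if (va != NULL)
        munmap(va, len);
}
```

A PMD written this way would call map_phys_range("/dev/mem", base, len) once at
init time and carve its descriptor rings and buffers out of the returned
region; how the NIC is told the matching physical addresses is device specific.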

Regards
Neil

> 
> If VFIO / IOMMU is still the answer - I'll poke in that direction. If not - 
> any additional insight is appreciated.
> 
> -Original Message-
> From: Alex Markuze [mailto:alex at weka.io] 
> Sent: Sunday, August 31, 2014 1:27 AM
> To: Thomas Monjalon
> Cc: Saygin, Artur; dev at dpdk.org
> Subject: Re: [dpdk-dev] DPDK and custom memory
> 
> Artur, I don't have the details of what you are trying to achieve, but
> it sounds like something that is covered by IOMMU, SW or HW.  The
> IOMMU creates an IOVA (I/O virtual address) that the NIC can access; the
> range is controlled with flags passed to the dma_map functions.
> 
> So I understand your question this way: how does the DPDK work on an
> IOMMU-enabled system, and can you influence the mapping?
> 
> 
> On Sat, Aug 30, 2014 at 4:03 PM, Thomas Monjalon
>  wrote:
> > Hello,
> >
> > 2014-08-29 18:40, Saygin, Artur:
> >> Imagine a PMD for an FPGA-based NIC that is limited to accessing certain
> >> memory regions.
> >
> > Does it mean Intel is making an FPGA-based NIC?
> >
> >> Is there a way to make DPDK use that exact memory?
> >
> > Maybe I don't understand the question well, because it doesn't seem really
> > different from what other PMDs do.
> > Assuming your NIC is PCI, you can access it via uio (igb_uio) or VFIO.
> >
> >> Perhaps this is more of a hugetlbfs question than DPDK but I thought I'd
> >> start here.
> >
> > It's a pleasure to receive new drivers.
> > Welcome here :)
> >
> > --
> > Thomas


[dpdk-dev] [PATCH v2 0/3] add i40e RSS support in VF

2014-09-19 Thread Helin Zhang
As hardware supports RSS in VF, the patches add that support
in the driver. In addition, minor improvements are made to how
macros are defined with constants.

v2 changes:
* Remove support for updating/querying the redirection table, as it
  will be implemented in later patches.
* Remove changes in testpmd, as they are not needed at all for
  supporting RSS in VF.

Helin Zhang (3):
  ethdev: improvement for constant usage
  i40e: extern two functions and relevant macros
  i40evf: support of RSS in VF

 lib/librte_ether/rte_ethdev.h|  47 ++--
 lib/librte_pmd_i40e/i40e_ethdev.c|   4 +-
 lib/librte_pmd_i40e/i40e_ethdev.h|  40 +-
 lib/librte_pmd_i40e/i40e_ethdev_vf.c | 142 +++
 4 files changed, 207 insertions(+), 26 deletions(-)

-- 
1.8.1.4



[dpdk-dev] [PATCH v2 1/3] ethdev: improvement for constant usage

2014-09-19 Thread Helin Zhang
A forced type conversion is not needed to define a macro with a
constant. The alternative is to let the compiler use the default width,
or to specify the width with a suffix such as 'U', 'UL' or 'ULL'.
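
As a small illustration of that point (a sketch with hypothetical DEMO_ names,
not the macros from rte_ethdev.h): a shift count below the width of int works
with a plain constant, while a shift count of 63 needs a 64-bit constant, which
the 'ULL' suffix provides. Note also that (uint16_t)1 is promoted to int before
the shift, so that cast never produced a 16-bit result.

```c
#include <stdint.h>

/* Hypothetical shift values chosen for the example only. */
#define DEMO_RSS_IPV4_SHIFT        0
#define DEMO_RSS_L2_PAYLOAD_SHIFT  63

/* Shift count below the width of int: the default constant type is enough. */
#define DEMO_RSS_IPV4        (1 << DEMO_RSS_IPV4_SHIFT)

/* Shift count of 63: the constant itself must be at least 64 bits wide. */
#define DEMO_RSS_L2_PAYLOAD  (1ULL << DEMO_RSS_L2_PAYLOAD_SHIFT)

/* Combine both demo flags into one 64-bit RSS-style mask. */
static uint64_t demo_rss_mask(void)
{
    return (uint64_t)DEMO_RSS_IPV4 | DEMO_RSS_L2_PAYLOAD;
}
```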

Signed-off-by: Helin Zhang 
Reviewed-by: Cunming Liang 
Reviewed-by: Jijiang Liu 
---
 lib/librte_ether/rte_ethdev.h | 47 ++-
 1 file changed, 24 insertions(+), 23 deletions(-)

diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
index 50df654..3a0b33b 100644
--- a/lib/librte_ether/rte_ethdev.h
+++ b/lib/librte_ether/rte_ethdev.h
@@ -362,30 +362,31 @@ struct rte_eth_rss_conf {
 #define ETH_RSS_L2_PAYLOAD_SHIFT  63

 /* for 1G & 10G */
-#define ETH_RSS_IPV4        ((uint16_t)1 << ETH_RSS_IPV4_SHIFT)
-#define ETH_RSS_IPV4_TCP    ((uint16_t)1 << ETH_RSS_IPV4_TCP_SHIFT)
-#define ETH_RSS_IPV6        ((uint16_t)1 << ETH_RSS_IPV6_SHIFT)
-#define ETH_RSS_IPV6_EX     ((uint16_t)1 << ETH_RSS_IPV6_EX_SHIFT)
-#define ETH_RSS_IPV6_TCP    ((uint16_t)1 << ETH_RSS_IPV6_TCP_SHIFT)
-#define ETH_RSS_IPV6_TCP_EX ((uint16_t)1 << ETH_RSS_IPV6_TCP_EX_SHIFT)
-#define ETH_RSS_IPV4_UDP    ((uint16_t)1 << ETH_RSS_IPV4_UDP_SHIFT)
-#define ETH_RSS_IPV6_UDP    ((uint16_t)1 << ETH_RSS_IPV6_UDP_SHIFT)
-#define ETH_RSS_IPV6_UDP_EX ((uint16_t)1 << ETH_RSS_IPV6_UDP_EX_SHIFT)
+#define ETH_RSS_IPV4        (1 << ETH_RSS_IPV4_SHIFT)
+#define ETH_RSS_IPV4_TCP    (1 << ETH_RSS_IPV4_TCP_SHIFT)
+#define ETH_RSS_IPV6        (1 << ETH_RSS_IPV6_SHIFT)
+#define ETH_RSS_IPV6_EX     (1 << ETH_RSS_IPV6_EX_SHIFT)
+#define ETH_RSS_IPV6_TCP    (1 << ETH_RSS_IPV6_TCP_SHIFT)
+#define ETH_RSS_IPV6_TCP_EX (1 << ETH_RSS_IPV6_TCP_EX_SHIFT)
+#define ETH_RSS_IPV4_UDP    (1 << ETH_RSS_IPV4_UDP_SHIFT)
+#define ETH_RSS_IPV6_UDP    (1 << ETH_RSS_IPV6_UDP_SHIFT)
+#define ETH_RSS_IPV6_UDP_EX (1 << ETH_RSS_IPV6_UDP_EX_SHIFT)
 /* for 40G only */
-#define ETH_RSS_NONF_IPV4_UDP   ((uint64_t)1 << ETH_RSS_NONF_IPV4_UDP_SHIFT)
-#define ETH_RSS_NONF_IPV4_TCP   ((uint64_t)1 << ETH_RSS_NONF_IPV4_TCP_SHIFT)
-#define ETH_RSS_NONF_IPV4_SCTP  ((uint64_t)1 << ETH_RSS_NONF_IPV4_SCTP_SHIFT)
-#define ETH_RSS_NONF_IPV4_OTHER ((uint64_t)1 << ETH_RSS_NONF_IPV4_OTHER_SHIFT)
-#define ETH_RSS_FRAG_IPV4       ((uint64_t)1 << ETH_RSS_FRAG_IPV4_SHIFT)
-#define ETH_RSS_NONF_IPV6_UDP   ((uint64_t)1 << ETH_RSS_NONF_IPV6_UDP_SHIFT)
-#define ETH_RSS_NONF_IPV6_TCP   ((uint64_t)1 << ETH_RSS_NONF_IPV6_TCP_SHIFT)
-#define ETH_RSS_NONF_IPV6_SCTP  ((uint64_t)1 << ETH_RSS_NONF_IPV6_SCTP_SHIFT)
-#define ETH_RSS_NONF_IPV6_OTHER ((uint64_t)1 << ETH_RSS_NONF_IPV6_OTHER_SHIFT)
-#define ETH_RSS_FRAG_IPV6       ((uint64_t)1 << ETH_RSS_FRAG_IPV6_SHIFT)
-#define ETH_RSS_FCOE_OX         ((uint64_t)1 << ETH_RSS_FCOE_OX_SHIFT) /* not used */
-#define ETH_RSS_FCOE_RX         ((uint64_t)1 << ETH_RSS_FCOE_RX_SHIFT) /* not used */
-#define ETH_RSS_FCOE_OTHER      ((uint64_t)1 << ETH_RSS_FCOE_OTHER_SHIFT) /* not used */
-#define ETH_RSS_L2_PAYLOAD      ((uint64_t)1 << ETH_RSS_L2_PAYLOAD_SHIFT)
+#define ETH_RSS_NONF_IPV4_UDP   (1ULL << ETH_RSS_NONF_IPV4_UDP_SHIFT)
+#define ETH_RSS_NONF_IPV4_TCP   (1ULL << ETH_RSS_NONF_IPV4_TCP_SHIFT)
+#define ETH_RSS_NONF_IPV4_SCTP  (1ULL << ETH_RSS_NONF_IPV4_SCTP_SHIFT)
+#define ETH_RSS_NONF_IPV4_OTHER (1ULL << ETH_RSS_NONF_IPV4_OTHER_SHIFT)
+#define ETH_RSS_FRAG_IPV4   (1ULL << ETH_RSS_FRAG_IPV4_SHIFT)
+#define ETH_RSS_NONF_IPV6_UDP   (1ULL << ETH_RSS_NONF_IPV6_UDP_SHIFT)
+#define ETH_RSS_NONF_IPV6_TCP   (1ULL << ETH_RSS_NONF_IPV6_TCP_SHIFT)
+#define ETH_RSS_NONF_IPV6_SCTP  (1ULL << ETH_RSS_NONF_IPV6_SCTP_SHIFT)
+#define ETH_RSS_NONF_IPV6_OTHER (1ULL << ETH_RSS_NONF_IPV6_OTHER_SHIFT)
+#define ETH_RSS_FRAG_IPV6   (1ULL << ETH_RSS_FRAG_IPV6_SHIFT)
+/* FCOE relevant should not be used */
+#define ETH_RSS_FCOE_OX (1ULL << ETH_RSS_FCOE_OX_SHIFT)
+#define ETH_RSS_FCOE_RX (1ULL << ETH_RSS_FCOE_RX_SHIFT)
+#define ETH_RSS_FCOE_OTHER  (1ULL << ETH_RSS_FCOE_OTHER_SHIFT)
+#define ETH_RSS_L2_PAYLOAD  (1ULL << ETH_RSS_L2_PAYLOAD_SHIFT)

 #define ETH_RSS_IP ( \
ETH_RSS_IPV4 | \
-- 
1.8.1.4



[dpdk-dev] [PATCH v2 2/3] i40e: extern two functions and relevant macros

2014-09-19 Thread Helin Zhang
To reuse code, 'i40e_config_hena()' and 'i40e_parse_hena()' and
their relevant macros are made extern, so that they can be used by
both the PF and VF parts.

Signed-off-by: Helin Zhang 
Reviewed-by: Cunming Liang 
Reviewed-by: Jijiang Liu 
---
 lib/librte_pmd_i40e/i40e_ethdev.c |  4 ++--
 lib/librte_pmd_i40e/i40e_ethdev.h | 34 +-
 2 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/lib/librte_pmd_i40e/i40e_ethdev.c b/lib/librte_pmd_i40e/i40e_ethdev.c
index 4e65ca4..f7a40a9 100644
--- a/lib/librte_pmd_i40e/i40e_ethdev.c
+++ b/lib/librte_pmd_i40e/i40e_ethdev.c
@@ -3919,7 +3919,7 @@ DONE:
 }

 /* Configure hash enable flags for RSS */
-static uint64_t
+uint64_t
 i40e_config_hena(uint64_t flags)
 {
uint64_t hena = 0;
@@ -3954,7 +3954,7 @@ i40e_config_hena(uint64_t flags)
 }

 /* Parse the hash enable flags */
-static uint64_t
+uint64_t
 i40e_parse_hena(uint64_t flags)
 {
uint64_t rss_hf = 0;
diff --git a/lib/librte_pmd_i40e/i40e_ethdev.h b/lib/librte_pmd_i40e/i40e_ethdev.h
index 64deef2..c801345 100644
--- a/lib/librte_pmd_i40e/i40e_ethdev.h
+++ b/lib/librte_pmd_i40e/i40e_ethdev.h
@@ -68,6 +68,36 @@
   I40E_FLAG_HEADER_SPLIT_ENABLED | \
   I40E_FLAG_FDIR)

+#define I40E_RSS_OFFLOAD_ALL ( \
+   ETH_RSS_NONF_IPV4_UDP | \
+   ETH_RSS_NONF_IPV4_TCP | \
+   ETH_RSS_NONF_IPV4_SCTP | \
+   ETH_RSS_NONF_IPV4_OTHER | \
+   ETH_RSS_FRAG_IPV4 | \
+   ETH_RSS_NONF_IPV6_UDP | \
+   ETH_RSS_NONF_IPV6_TCP | \
+   ETH_RSS_NONF_IPV6_SCTP | \
+   ETH_RSS_NONF_IPV6_OTHER | \
+   ETH_RSS_FRAG_IPV6 | \
+   ETH_RSS_L2_PAYLOAD)
+
+/* All bits of RSS hash enable */
+#define I40E_RSS_HENA_ALL ( \
+   (1ULL << I40E_FILTER_PCTYPE_NONF_IPV4_UDP) | \
+   (1ULL << I40E_FILTER_PCTYPE_NONF_IPV4_TCP) | \
+   (1ULL << I40E_FILTER_PCTYPE_NONF_IPV4_SCTP) | \
+   (1ULL << I40E_FILTER_PCTYPE_NONF_IPV4_OTHER) | \
+   (1ULL << I40E_FILTER_PCTYPE_FRAG_IPV4) | \
+   (1ULL << I40E_FILTER_PCTYPE_NONF_IPV6_UDP) | \
+   (1ULL << I40E_FILTER_PCTYPE_NONF_IPV6_TCP) | \
+   (1ULL << I40E_FILTER_PCTYPE_NONF_IPV6_SCTP) | \
+   (1ULL << I40E_FILTER_PCTYPE_NONF_IPV6_OTHER) | \
+   (1ULL << I40E_FILTER_PCTYPE_FRAG_IPV6) | \
+   (1ULL << I40E_FILTER_PCTYPE_FCOE_OX) | \
+   (1ULL << I40E_FILTER_PCTYPE_FCOE_RX) | \
+   (1ULL << I40E_FILTER_PCTYPE_FCOE_OTHER) | \
+   (1ULL << I40E_FILTER_PCTYPE_L2_PAYLOAD))
+
 struct i40e_adapter;

 TAILQ_HEAD(i40e_mac_filter_list, i40e_mac_filter);
@@ -310,8 +340,10 @@ int i40e_dev_link_update(struct rte_eth_dev *dev,
 void i40e_vsi_queues_bind_intr(struct i40e_vsi *vsi);
 void i40e_vsi_queues_unbind_intr(struct i40e_vsi *vsi);
 int i40e_vsi_vlan_pvid_set(struct i40e_vsi *vsi,
-   struct i40e_vsi_vlan_pvid_info *info);
+  struct i40e_vsi_vlan_pvid_info *info);
 int i40e_vsi_config_vlan_stripping(struct i40e_vsi *vsi, bool on);
+uint64_t i40e_config_hena(uint64_t flags);
+uint64_t i40e_parse_hena(uint64_t flags);

 /* I40E_DEV_PRIVATE_TO */
 #define I40E_DEV_PRIVATE_TO_PF(adapter) \
-- 
1.8.1.4



[dpdk-dev] [PATCH v2 3/3] i40evf: support of RSS in VF

2014-09-19 Thread Helin Zhang
i40e hardware supports RSS in VF; these code changes add
that support to the driver.

v2 changes:
Updating/querying the redirection table has been removed, as it
will come in later patches that support different redirection
table sizes.

Signed-off-by: Helin Zhang 
Reviewed-by: Cunming Liang 
Reviewed-by: Jijiang Liu 
---
 lib/librte_pmd_i40e/i40e_ethdev.h|   6 ++
 lib/librte_pmd_i40e/i40e_ethdev_vf.c | 142 +++
 2 files changed, 148 insertions(+)

diff --git a/lib/librte_pmd_i40e/i40e_ethdev.h b/lib/librte_pmd_i40e/i40e_ethdev.h
index c801345..1d42cd2 100644
--- a/lib/librte_pmd_i40e/i40e_ethdev.h
+++ b/lib/librte_pmd_i40e/i40e_ethdev.h
@@ -283,6 +283,8 @@ struct i40e_vf_tx_queues {
  * Structure to store private data specific for VF instance.
  */
 struct i40e_vf {
+   struct i40e_adapter *adapter; /* The adapter this VF associate to */
+   struct rte_eth_dev_data *dev_data; /* Pointer to the device data */
uint16_t num_queue_pairs;
uint16_t max_pkt_len; /* Maximum packet length */
bool promisc_unicast_enabled;
@@ -393,6 +395,10 @@ i40e_get_vsi_from_adapter(struct i40e_adapter *adapter)
 #define I40E_PF_TO_ADAPTER(pf) \
((struct i40e_adapter *)pf->adapter)

+/* I40E_VF_TO */
+#define I40E_VF_TO_HW(vf) \
+   (&(((struct i40e_vf *)vf)->adapter->hw))
+
 static inline void
 i40e_init_adminq_parameter(struct i40e_hw *hw)
 {
diff --git a/lib/librte_pmd_i40e/i40e_ethdev_vf.c b/lib/librte_pmd_i40e/i40e_ethdev_vf.c
index d8552ad..840ae65 100644
--- a/lib/librte_pmd_i40e/i40e_ethdev_vf.c
+++ b/lib/librte_pmd_i40e/i40e_ethdev_vf.c
@@ -125,10 +125,19 @@ static void i40evf_dev_allmulticast_disable(struct rte_eth_dev *dev);
 static int i40evf_get_link_status(struct rte_eth_dev *dev,
  struct rte_eth_link *link);
 static int i40evf_init_vlan(struct rte_eth_dev *dev);
+static int i40evf_config_rss(struct i40e_vf *vf);
+static int i40evf_dev_rss_hash_update(struct rte_eth_dev *dev,
+ struct rte_eth_rss_conf *rss_conf);
+static int i40evf_dev_rss_hash_conf_get(struct rte_eth_dev *dev,
+   struct rte_eth_rss_conf *rss_conf);
 static int i40evf_dev_rx_queue_start(struct rte_eth_dev *, uint16_t);
 static int i40evf_dev_rx_queue_stop(struct rte_eth_dev *, uint16_t);
 static int i40evf_dev_tx_queue_start(struct rte_eth_dev *, uint16_t);
 static int i40evf_dev_tx_queue_stop(struct rte_eth_dev *, uint16_t);
+
+/* Default hash key buffer for RSS */
+static uint32_t rss_key_default[I40E_VFQF_HKEY_MAX_INDEX + 1];
+
 static struct eth_dev_ops i40evf_eth_dev_ops = {
.dev_configure= i40evf_dev_configure,
.dev_start= i40evf_dev_start,
@@ -152,6 +161,8 @@ static struct eth_dev_ops i40evf_eth_dev_ops = {
.rx_queue_release = i40e_dev_rx_queue_release,
.tx_queue_setup   = i40e_dev_tx_queue_setup,
.tx_queue_release = i40e_dev_tx_queue_release,
+   .rss_hash_update  = i40evf_dev_rss_hash_update,
+   .rss_hash_conf_get= i40evf_dev_rss_hash_conf_get,
 };

 static int
@@ -978,6 +989,8 @@ i40evf_init_vf(struct rte_eth_dev *dev)
struct i40e_hw *hw = I40E_DEV_PRIVATE_TO_HW(dev->data->dev_private);
struct i40e_vf *vf = I40EVF_DEV_PRIVATE_TO_VF(dev->data->dev_private);

+   vf->adapter = I40E_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
+   vf->dev_data = dev->data;
err = i40evf_set_mac_type(hw);
if (err) {
PMD_INIT_LOG(ERR, "set_mac_type failed: %d\n", err);
@@ -1334,11 +1347,13 @@ i40evf_vlan_filter_set(struct rte_eth_dev *dev, uint16_t vlan_id, int on)
 static int
 i40evf_rx_init(struct rte_eth_dev *dev)
 {
+   struct i40e_vf *vf = I40EVF_DEV_PRIVATE_TO_VF(dev->data->dev_private);
uint16_t i;
struct i40e_rx_queue **rxq =
(struct i40e_rx_queue **)dev->data->rx_queues;
struct i40e_hw *hw = I40E_DEV_PRIVATE_TO_HW(dev->data->dev_private);

+   i40evf_config_rss(vf);
for (i = 0; i < dev->data->nb_rx_queues; i++) {
rxq[i]->qrx_tail = hw->hw_addr + I40E_QRX_TAIL1(i);
I40E_PCI_REG_WRITE(rxq[i]->qrx_tail, rxq[i]->nb_rx_desc - 1);
@@ -1573,3 +1588,130 @@ i40evf_dev_close(struct rte_eth_dev *dev)
i40evf_reset_vf(hw);
i40e_shutdown_adminq(hw);
 }
+
+static int
+i40evf_hw_rss_hash_set(struct i40e_hw *hw, struct rte_eth_rss_conf *rss_conf)
+{
+   uint32_t *hash_key;
+   uint8_t hash_key_len;
+   uint64_t rss_hf, hena;
+
+   hash_key = (uint32_t *)(rss_conf->rss_key);
+   hash_key_len = rss_conf->rss_key_len;
+   if (hash_key != NULL && hash_key_len >=
+   (I40E_VFQF_HKEY_MAX_INDEX + 1) * sizeof(uint32_t)) {
+   uint16_t i;
+
+   for (i = 0; i <= I40E_VFQF_HKEY_MAX_INDEX; i++)
+   I40E_WRITE_REG(hw, I40E_VFQF_HKEY(i), hash_key[i]);
+   }
+
+   rs

[dpdk-dev] Maximum possible throughput with the KNI DPDK Application

2014-09-19 Thread Zhang, Helin
Hi

Sure, multiple queues can be used in any KNI app; actually the current "KNI
example app = l2fwd app + KNI support". So you can do in the KNI app whatever
can be done in l2fwd. But you will need to check whether it really works with
multiple queues in the current example KNI app, as I have not tested it that
way.
Actually the KNI library just provides a way to exchange packets between kernel
space and user space, no matter how the packets are received and transmitted in
user space.

Regards,
Helin

From: Malveeka Tewari [mailto:malve...@gmail.com]
Sent: Friday, September 19, 2014 7:15 AM
To: Zhang, Helin; dev at dpdk.org
Subject: Re: [dpdk-dev] Maximum possible throughput with the KNI DPDK 
Application

[+dev at dpdk.org]

Sure, I understand that.
The 7Gb/s performance with iperf that I was getting was with one end-host using 
the KNI  app and the other host running the traditional linux stack.
With both end hosts running the KNI app, I see about 2.75Gb/s which is 
understandable because the TSO/LRO and other hardware NIC features are turned 
off.

I have another related question.
Is it possible to use multiple traffic queues with the KNI app?
I tried creating different queues using tc for the vEth0_0 device but that gave
me an error.

>$ sudo tc qdisc add dev vEth0_0 root handle 1: multiq
>$ RTNETLINK answers: Operation not supported

If I wanted to add support for multiple tc queues with the KNI app, where 
should I start making my changes?
I looked at the "lib/librte_kni/rte_kni_fifo.h" but it wasn't clear how I can 
add support for different queues for the KNI app.
Any pointers would be extremely helpful.

Thanks!

On Thu, Sep 18, 2014 at 3:28 PM, Malveeka Tewari <malveeka at gmail.com> wrote:
Sure, I understand that.
The 7Gb/s performance with iperf that I was getting was with one end-host using 
the DPDK framework and the other host running the traditional linux stack.
With both end hosts using DPDK, I see about 2.75Gb/s which is understandable 
because the TSO/LRO and other hardware NIC features are turned off.

I have another KNI related question.
Is it possible to use multiple traffic queues with the KNI app?
I tried creating different queues using tc for the vEth0_0 device but that gave
me an error.

>$ sudo tc qdisc add dev vEth0_0 root handle 1: multiq
>$ RTNETLINK answers: Operation not supported

If I wanted to add support for multiple tc queues with the KNI app, where 
should I start making my changes?
I looked at the "lib/librte_kni/rte_kni_fifo.h" but it wasn't clear how I can 
add support for different queues for the KNI app.
Any pointers would be extremely helpful.

Thanks!
Malveeka

On Wed, Sep 17, 2014 at 10:47 PM, Zhang, Helin <helin.zhang at intel.com> wrote:
Hi Malveeka

The KNI loopback function can provide good enough performance, and more
queues/threads can provide better performance. For formal KNI, it needs to talk
to the kernel stack and bridge, etc., so the performance bottleneck is no longer
in the DPDK part. You can try more queues/threads to see if performance is
better. But do not expect too much!

Regards,
Helin

From: Malveeka Tewari [mailto:malveeka at gmail.com]
Sent: Thursday, September 18, 2014 12:56 PM
To: Zhang, Helin
Cc: dev at dpdk.org
Subject: Re: [dpdk-dev] Maximum possible throughput with the KNI DPDK 
Application

Thanks Helin!

I am actually working on a project to quantify the overhead of user-space to
kernel-space data copying in the case of conventional socket-based applications.
My understanding is that the KNI application involves a user-space -> kernel-space
-> user-space data copy again to send to the igb_uio driver.
I wanted to find out if the 7Gb/s throughput is the maximum throughput
achievable by the KNI application or if someone has been able to achieve
higher rates by using more cores or some other configuration.

Regards,
Malveeka




On Wed, Sep 17, 2014 at 6:01 PM, Zhang, Helin <helin.zhang at intel.com> wrote:
Hi Malveeka

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On 
> Behalf Of Malveeka Tewari
> Sent: Thursday, September 18, 2014 6:51 AM
> To: dev at dpdk.org
> Subject: [dpdk-dev] Maximum possible throughput with the KNI DPDK
> Application
>
> Hi all
>
> I've been playing with the DPDK API to send out packets using the l2fwd app
> and the Kernel Network Interface app with a single Intel 82599 NIC on an Intel
> Xeon E5-2630
>
> With the l2fwd application, I've been able to achieve 14.88 Mpps with minimum
> sized packets.
> However, running iperf with the KNI application gives me only ~7Gb/s
> peak throughput.

KNI is quite different from other DPDK applications; it is not for fast-path
forwarding, as it passes the packets received in user space to kernel space,
and possibly the kernel stack. So don't expect much higher performance. I
think 7Gb/s might be a good enough d

[dpdk-dev] i40e: Steps and required configurations of how to achieve the best performance!

2014-09-19 Thread Zhang, Helin
Hi David

I agree with you that we need to rethink these two configurations. Thank
you very much for the good comments!

My idea on it could be,

1.   Write a script that uses "setpci" to change the PCI configuration. The end
user can decide which PCI device needs to be changed.

2.   Add code to change that PCI configuration in the i40e PMD only, as it
seems nobody else has needed it till now.

Regards,
Helin

From: David Marchand [mailto:david.march...@6wind.com]
Sent: Thursday, September 18, 2014 4:58 PM
To: Zhang, Helin
Cc: dev at dpdk.org
Subject: Re: [dpdk-dev] i40e: Steps and required configurations of how to 
achieve the best performance!

Hello Helin,

On Thu, Sep 18, 2014 at 4:39 AM, Zhang, Helin <helin.zhang at intel.com> wrote:
Hi David

From: David Marchand [mailto:david.marchand at 6wind.com]
Sent: Wednesday, September 17, 2014 10:03 PM
To: Zhang, Helin
Cc: dev at dpdk.org
Subject: Re: [dpdk-dev] i40e: Steps and required configurations of how to 
achieve the best performance!

On Wed, Sep 17, 2014 at 10:50 AM, Zhang, Helin <helin.zhang at intel.com> wrote:
For the "extended tag", it was defined in the PCIe spec, but actually not all
BIOSes implement it. Enabling it in BIOS or at runtime are two ways of doing the
same thing. I don't think it can be configured per PCI device in BIOS, so we
don't need to do that per PCI device in DPDK. Right? Actually we don't want to
touch PCIe settings in DPDK code; that's why we want to leave the BIOS config as
it is by default. If there is no better choice, we can do it in DPDK by changing
configurations.

- Ok, then if we can make a runtime decision (at dpdk level), there is no need 
for bios configuration and there is no need for a build option.
Why don't we get rid of this option ?

[Helin] Initially, we wanted to do that on behalf of the BIOS, if a specific
BIOS does not implement it. That way it needs to be set once during
initialization for each PCI device. Sure, that might not be the best option,
but it is the easiest way. For Linux end users, the best option could be using
the "setpci" command. It can enable "extended_tag" per PCI device.

I am not sure I can see how easy it is since you are forcing this in a build 
option.
Anyway, all this knowledge should be in the documentation and not in an obscure 
build option that looks to be useless in the end.

The more I look at this, the more I think we did not have good enough argument 
for this change in eal / igb_uio yet.

We have something that gives "better performance" on "some server" with "some 
bios".




As far as the per-device runtime configuration is concerned, I want to make 
sure this pci configuration will not break other "igb_uio" pci devices.
If Intel can tell for sure this won't break other devices, then fine, we can go 
and enable this for all "igb_uio" pci devices.

[Helin] It is in the PCIe specification, and enabling it generally provides
better performance. But I cannot confirm that it would not break any other
devices, as I have not validated all devices. If you are really concerned about
it, "setpci" can be the best option for you. We can add a script for that later.
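
As a sketch of what such a script or PMD code would change: in the PCIe Device
Control register (offset 0x08 inside the PCI Express capability), bit 8 is the
Extended Tag Field Enable bit. The helper below is hypothetical and only
computes the new register value; locating the capability and reading or writing
the register on a real device is left out.

```c
#include <stdint.h>

/* Device Control register offset, relative to the PCI Express capability. */
#define PCIE_DEVCTL_OFFSET      0x08u
/* Extended Tag Field Enable is bit 8 of Device Control. */
#define PCIE_DEVCTL_EXT_TAG_EN  (1u << 8)

/* Hypothetical helper: return `devctl` with the extended tag bit set or cleared. */
static uint16_t devctl_set_ext_tag(uint16_t devctl, int enable)
{
    if (enable)
        return (uint16_t)(devctl | PCIE_DEVCTL_EXT_TAG_EN);
    return (uint16_t)(devctl & ~PCIE_DEVCTL_EXT_TAG_EN);
}
```

Keeping the change to a read-modify-write of one bit leaves every other Device
Control field of the device untouched.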

Why not a script, but documentation is important too: I would say that we need 
an explicit list of platforms and nics which support this.



- By the way, there is also the CONFIG_MAX_READ_REQUEST_SIZE option that seems 
to be disabled (or at least its value 0 seems to tell so).
What is its purpose ?

[Helin] Yes, it was added for performance tuning a long time ago. But now it
seems to contribute little or nothing to the performance number, so I just skip
it. The default value does nothing to the PCIe registers; they are kept as is.
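
For reference, a sketch of how a max read request size maps onto the PCIe
Device Control register: bits 14:12 hold an encoded value where 0 means 128
bytes and each step doubles the size, up to 5 for 4096 bytes. The helper name
is hypothetical, and treating 0 as "leave the register alone" is an assumption
that mirrors the default described above.

```c
#include <stdint.h>

/* Max_Read_Request_Size occupies bits 14:12 of the Device Control register. */
#define PCIE_DEVCTL_MRRS_SHIFT  12
#define PCIE_DEVCTL_MRRS_MASK   (0x7u << PCIE_DEVCTL_MRRS_SHIFT)

/*
 * Hypothetical helper: return `devctl` with the read request size set to
 * `bytes` (128, 256, 512, 1024, 2048 or 4096). A value of 0, or any size
 * that is not encodable, returns `devctl` unchanged.
 */
static uint16_t devctl_set_mrrs(uint16_t devctl, unsigned bytes)
{
    unsigned code = 0;
    unsigned size = 128;

    if (bytes == 0)
        return devctl;

    /* Walk the encodable sizes: code 0 = 128 bytes ... code 5 = 4096 bytes. */
    while (size < bytes && code < 5) {
        size <<= 1;
        code++;
    }
    if (size != bytes)
        return devctl;

    return (uint16_t)((devctl & ~PCIE_DEVCTL_MRRS_MASK) |
                      (code << PCIE_DEVCTL_MRRS_SHIFT));
}
```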

Not so long ago to dpdk.org (somewhere around 1.7.0 ...).
If this code had no use for "so long", why did it end up on dpdk.org ?
Why should we keep it ?


Thanks.

--
David Marchand


[dpdk-dev] [PATCH v3 00/20] cleanup logs in main PMDs

2014-09-19 Thread Thomas Monjalon
> Here is a patchset that reworks the log macro in e1000, ixgbe and i40e PMDs.
> The idea behind this is to make it easier to debug some init failures and to
> be sure of the datapath selected in these PMDs (rx / tx handlers selection).
> 
> The PMDs changes involve adding more debug messages in the default build.
> A new eal option has been added to set the default log level, so that you can
> render the eal a little less noisy.
> 
> I did not change the default log level for now, as some eal log messages are
> marked as DEBUG while being interesting (from my point of view).
> I suppose we can change the default log level later once the eal has been
> cleaned up.
> 
> Changes since v2:
> - just a respin with Jay's comments in mind
> * don't introduce \n in one commit then remove them
> * indent only the impacted parts before removing \n (so split previous patches)
> * remove some "" garbage
> 
> Changes since v1:
> - continue clean up by always using PMD_*_LOG when logging something in
>   PMD (i.e. no more printf, RTE_LOG, DEBUGOUT)
> - introduce PMD_DRV_LOG_RAW macro for use by shared driver code
> - adopt 'second approach': no more \n in PMD_*_LOG callers. This means that we
>   will enforce a 'no \n' policy in logs for PMD.
> 
> David Marchand (20):
>   ixgbe: use the right debug macro
>   ixgbe/base: add a raw macro for use by shared code
>   ixgbe: indent logs sections
>   ixgbe: clean log messages
>   ixgbe: always log init messages
>   ixgbe: add a message when forcing scatter mode
>   ixgbe: add log messages when rx bulk mode is not usable
>   i40e: use the right debug macro
>   i40e/base: add a raw macro for use by shared code
>   i40e: indent logs sections
>   i40e: clean log messages
>   i40e: always log init messages
>   i40e: add log messages when rx bulk mode is not usable
>   e1000: use the right debug macro
>   e1000/base: add a raw macro for use by shared code
>   e1000: indent logs sections
>   e1000: clean log messages
>   e1000: always log init messages
>   e1000: add a message when forcing scatter mode
>   eal: set log level from command line

Applied for version 1.8.0.

Thanks
-- 
Thomas


[dpdk-dev] [PATCH 0/4] Add DSO symbol versioning to support backwards compatibility

2014-09-19 Thread Richardson, Bruce
> -Original Message-
> From: Neil Horman [mailto:nhorman at tuxdriver.com]
> Sent: Thursday, September 18, 2014 8:14 PM
> To: Thomas Monjalon
> Cc: dev at dpdk.org; Richardson, Bruce
> Subject: Re: [PATCH 0/4] Add DSO symbol versioning to support backwards
> compatibility
> 
> On Thu, Sep 18, 2014 at 08:23:36PM +0200, Thomas Monjalon wrote:
> > Hi Neil,
> >
> > 2014-09-15 15:23, Neil Horman:
> > > The DPDK ABI develops and changes quickly, which makes it difficult for
> > > applications to keep up with the latest version of the library,
> > > especially when it (the DPDK) is built as a set of shared objects, as
> > > applications may be built against an older version of the library.
> > >
> > > To mitigate this, this patch series introduces support for library and
> > > symbol versioning when the DPDK is built as a DSO.  Specifically, it does
> > > 4 things:
> > >
> > > 1) Adds initial support for library versioning.  Each library now has a
> > > version map that explicitly calls out what symbols are exported to using
> > > applications, and assigns version(s) to them
> > >
> > > 2) Adds support macros so that when libraries create incompatible ABI's,
> > > multiple versions may be supported so that applications linked against
> > > older DPDK releases can continue to function
> > >
> > > 3) Adds library soname versioning suffixes so that when ABI's must be
> > > broken in a fashion that requires a rebuild of older applications, they
> > > will break at load time, rather than cause unexpected issues at run time.
> > >
> > > 4) Adds documentation for ABI policy, and provides space to document
> > > deprecated ABI versions, so that applications might be warned of
> > > impending changes.
> > >
> > > With these elements in place the DPDK has some support to allow for the
> > > extended maintenance of older API's while still allowing the freedom to
> > > develop new and improved API's.
> > >
> > > Implementing this feature will require some additional effort on the part
> > > of developers and reviewers.  When reviewing patches, they must be checked
> > > against existing exports to ensure that the function prototypes are not
> > > changing.  If they are, the versioning macros must be used, and the
> > > library export map should be updated to reflect the new version of the
> > > function.
> > >
> > > When data structures change, if those structures are application
> > > accessible, apis that accept or return instances of those data structures
> > > should have new versions created so that users of the old data structure
> > > version might co-exist at the same time.
> >
> > Thanks for your efforts.
> > But I feel this change has too many constraints for the current status of
> > the DPDK. It's probably too early to adopt such policy.
> >
> I think you may be misunderstanding something.  What constraints do you
> believe that this patch imposes?  Note it doesn't in any way prevent changes
> to the ABI of the DPDK, but rather gives us infrastructure to support multiple
> ABI revisions at the same time, so that applications built against DPDK shared
> libraries can continue to function properly at least for some time until we
> decide to deprecate that ABI level.
> 

I view all this as a positive step. I consider backward compatibility as 
something that should always be encouraged, and I agree with Neil that this 
should allow us to guarantee compatibility for our customers while still having 
a path open to us to change things if we really need to.

> This is all based on the versioning strategy outlined here:
> http://www.akkadia.org/drepper/dsohowto.pdf
> 
> That may help clarify things for you.
> 
> > By the way, this versioning doesn't cover structure changes.
> No, it doesn't.  No link-time mechanism does so.
> 
> > How could it be managed?
> That's a subject that is open to discussion, but my initial thinking is that we
> need to handle it on a case by case basis:
> 
> * For minor updates, where allocation of a structure is done on the heap and
> new fields need to be added, appending them to the end of a structure and
> providing an initial value is sufficient.
> 
> * For major changes, where fields need to be removed, or re-arranged, most
> likely the API surfaces which accept or return those structures as
> inputs/outputs will need to have new versions written to accept the new
> version of the structure, and internally we will have to support both formats
> for a time (according to the policy I documented, that is currently a single
> major release).  I.e. if you want to change struct foo, which is accepted as a
> parameter for the function bar(struct foo *ptr), then for a release we would
> need to create struct foo_v2 with the new format, map a new function foo_v2
> to the exported foo@@DPDK_1.(X+1), and internally make the foo functions
> understand both the original and v2 versions of
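
The "minor update" case in the first bullet can be sketched as follows,
assuming a structure that only the library allocates. All names (demo_conf_v1,
demo_conf_v2, demo_conf_alloc, burst_size) are hypothetical. Because the new
field is appended, the offsets of the existing fields do not move, so an
application built against the old layout keeps reading the same bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Layout that an older application was compiled against. */
struct demo_conf_v1 {
    uint32_t nb_queues;
    uint32_t flags;
};

/* Newer layout: same leading fields, one field appended at the end. */
struct demo_conf_v2 {
    uint32_t nb_queues;
    uint32_t flags;
    uint32_t burst_size; /* new in v2 */
};

/*
 * The library allocates the structure on the heap, so it always has the
 * newer size, and gives the appended field an initial value that old
 * callers never need to set.
 */
static struct demo_conf_v2 *demo_conf_alloc(void)
{
    struct demo_conf_v2 *c = calloc(1, sizeof(*c));

    if (c != NULL)
        c->burst_size = 32;
    return c;
}
```

This only holds while the application never allocates or copies the structure
by its own sizeof; once it does, the case falls under the second bullet.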

[dpdk-dev] rx_eth_tx_burst not work in slave app.

2014-09-19 Thread zengxg14
Hi all.

I wrote a program that receives and sends packets on the same port.
After initializing the port, I use the code below to receive and send packets. It 
works OK.

Then I tried multi-process mode: I initialize the port in the master app and run 
the receive/send loop in the slave app.
In this mode rte_eth_tx_burst() always returns 0.

while (1) {
    int nb_rx, t;
    int tmp;

    nb_rx = rte_eth_rx_burst(1, 0, pkt_mbuf, BURST_COUNT);
    if (nb_rx == 0) {
        rte_delay_us(20);
        continue;
    }
    t = 0;
    while (t < nb_rx) {
        tmp = rte_eth_tx_burst(1, 0, &pkt_mbuf[t], nb_rx - t);
        t += tmp;
    }
}
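
A side note on the loop above: when the transmit call keeps returning 0, the inner loop never terminates. The following is a minimal sketch of a bounded retry, not a fix for the multi-process problem itself. It deliberately avoids DPDK so that it compiles on its own: struct pkt, tx_burst_fn, send_burst_bounded() and the two stub transmit functions are all hypothetical names introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a packet buffer; not the DPDK rte_mbuf. */
struct pkt { int id; };

/* Hypothetical transmit callback shaped like a TX burst call:
 * returns how many of the n packets were accepted (0..n). */
typedef uint16_t (*tx_burst_fn)(struct pkt **pkts, uint16_t n);

/*
 * Try to send nb packets, giving up after max_retries consecutive
 * calls that accept nothing. Returns the number of packets sent;
 * the caller still owns pkts[sent..nb-1].
 */
static uint16_t
send_burst_bounded(tx_burst_fn tx, struct pkt **pkts, uint16_t nb,
                   unsigned max_retries)
{
    uint16_t sent = 0;
    unsigned idle = 0;

    assert(tx != NULL);
    while (sent < nb && idle < max_retries) {
        uint16_t n = tx(&pkts[sent], (uint16_t)(nb - sent));

        if (n == 0)
            idle++;     /* no progress in this round */
        else
            idle = 0;   /* progress made: reset the retry budget */
        sent = (uint16_t)(sent + n);
    }
    return sent;
}

/* Stub queue that never accepts anything (the failure seen above). */
static uint16_t
tx_stuck(struct pkt **pkts, uint16_t n)
{
    (void)pkts;
    (void)n;
    return 0;
}

/* Stub queue that accepts at most two packets per call. */
static uint16_t
tx_two_at_a_time(struct pkt **pkts, uint16_t n)
{
    (void)pkts;
    return n < 2 ? n : 2;
}
```

With tx_stuck the helper returns 0 after max_retries attempts instead of spinning forever. In a real DPDK application the unsent mbufs would then typically be released with rte_pktmbuf_free().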

What I found is:
In single-process mode, tx_pkt_burst is set to ixgbe_xmit_pkts_vec.
In multi-process mode, after the master app is up, it shows tx_pkt_burst is 
ixgbe_xmit_pkts_vec.
But when the slave app is running, "perf top" shows ixgbe_xmit_pkts running, 
not ixgbe_xmit_pkts_vec.

How can I make the slave app still call ixgbe_xmit_pkts_vec?

zengxg


[dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32

2014-09-19 Thread Richardson, Bruce
> -Original Message-
> From: Neil Horman [mailto:nhorman at tuxdriver.com]
> Sent: Thursday, September 18, 2014 7:09 PM
> To: Thomas Monjalon
> Cc: Richardson, Bruce; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32
> 
> On Thu, Sep 18, 2014 at 07:13:52PM +0200, Thomas Monjalon wrote:
> > 2014-09-18 15:53, Richardson, Bruce:
> > > > > --- a/app/test-pmd/testpmd.c
> > > > > +++ b/app/test-pmd/testpmd.c
> > > > > @@ -225,7 +225,7 @@ struct rte_eth_thresh tx_thresh = {
> > > > >  /*
> > > > >   * Configurable value of RX free threshold.
> > > > >   */
> > > > > -uint16_t rx_free_thresh = 0; /* Immediately free RX descriptors by
> default. */
> > > > > +uint16_t rx_free_thresh = 32; /* Refill RX descriptors once every 32
> packets
> > > > */
> > > > >
> > > >
> > > > Why 32?  Was that an experimentally determined value?
> > > > Does it hold true for all PMD's?
> > >
> > > This is primarily for the ixgbe PMD, which is right now the most
> > > highly tuned driver, but it works fine for all other ones too,
> > > as far as I'm aware.
> >
> > Yes, you are changing this value for all PMDs but you're targetting
> > only one.
> > These thresholds are dependent of the PMD implementation. There's
> > something wrong here.
> >
> I agree. Its fine to do this, but it does seem like the sample application
> should document why it does this and make note of the fact that other PMDs
> may
> have a separate optimal value.
> 
> > > Basically, this is the minimum setting needed to enable either the
> > > bulk alloc or vector RX routines inside the ixgbe driver, so it's
> > > best made the default for that reason. Please see
> > > "check_rx_burst_bulk_alloc_preconditions()" in ixgbe_rxtx.c, and
> > > RX function assignment logic in "ixgbe_dev_rx_queue_setup()" in
> > > the same file.
> >
> > Since this parameter is so important, it could be a default value somewhere.
> >
> > I think we should split generic tuning parameters and tuning parameters
> > related to driver implementation or specific hardware.
> > Then we should provide some good default values for each of them.
> > At last, if needed, applications should be able to easily tune the
> > pmd-specific parameters.
> >
> I like this idea.  I've not got an idea of how much work it is to do so, but 
> in
> principle it makes sense.
> 
> Perhaps for the immediate need, since rte_eth_rx_queue_setup allows the
> config
> struct to get passed directly to PMDs, we can create a reserved value
> RTE_ETH_RX_FREE_THRESH_OPTIMAL, that instructs the pmd to select
> whatever
> threshold is optimal for its own hardware?
> 
> Neil
> 
Actually, looking at the code, I would suggest a couple of options, some of 
which may be used together.
1) we make NULL a valid value for the rxconf structure parameter to 
rte_eth_rx_queue_setup. There is little information in it that should really 
need to be passed in by applications to the drivers, and that would allow the 
drivers to be completely free to select the best options for their own 
operation. 
2) As a companion to that (or as an alternative), we could also allow 
each driver to provide its own functions for rte_eth_get_rxconf_default, and 
rte_eth_get_txconf_default, to be used by applications that want to use 
known-good values for thresholds but also want to tweak one of the other values 
e.g. for rx, set the drop_en bit, and for tx set the txqflags to disable 
offloads.
3) Lastly, we could also consider removing the threshold and other 
not-generally-used values from the rxconf and txconf structures and make those 
removed fields completely driver-set values. Optionally, we could provide an 
alternate API to tune them, but I don't really see this being useful in most 
cases, and I'd probably omit it unless someone can prove a need for such APIs.
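
To make options 1 and 2 concrete, here is a minimal sketch under stated assumptions. It is not DPDK code: struct rxconf is a cut-down hypothetical structure that only borrows the field names mentioned in this thread, and rxconf_default_fn, fast_nic_rxconf_default() and rx_queue_setup() are invented names standing in for the proposed per-driver default hook and for a queue setup call that accepts NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down RX queue configuration; field names echo the
 * thread (rx_free_thresh, drop_en) but this is not the DPDK struct. */
struct rxconf {
    uint16_t rx_free_thresh;
    uint8_t  drop_en;
};

/* Hypothetical per-driver hook returning known-good defaults (option 2). */
typedef void (*rxconf_default_fn)(struct rxconf *out);

/* Defaults for an imaginary driver that wants a free threshold of 32. */
static void
fast_nic_rxconf_default(struct rxconf *out)
{
    out->rx_free_thresh = 32;   /* the value discussed for ixgbe */
    out->drop_en = 0;
}

/* Hypothetical queue setup: a NULL conf means "driver, choose for me"
 * (option 1); otherwise the application's values are used as given. */
static struct rxconf
rx_queue_setup(rxconf_default_fn drv_default, const struct rxconf *conf)
{
    struct rxconf effective;

    assert(drv_default != NULL);
    drv_default(&effective);
    if (conf != NULL)
        effective = *conf;
    return effective;
}
```

An application that only wants to set drop_en would call the default hook first, change that one field, and pass the result to setup, so the driver's threshold is kept without the application knowing its value.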

Regards,
/Bruce


[dpdk-dev] [PATCH 2/4] Provide initial versioning for all DPDK libraries

2014-09-19 Thread Bruce Richardson
On Mon, Sep 15, 2014 at 03:23:49PM -0400, Neil Horman wrote:
> Add linker version script files to each DPDK library to put a stake in the
> ground from which we can start cleaning up API's
> 
> Signed-off-by: Neil Horman 
> CC: Thomas Monjalon 
> CC: "Richardson, Bruce" 
> ---
>  <... snip for brevity ...>
>
> diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
> index 65e566d..1f96645 100644
> --- a/lib/librte_acl/Makefile
> +++ b/lib/librte_acl/Makefile
> @@ -37,6 +37,8 @@ LIB = librte_acl.a
>  CFLAGS += -O3
>  CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR)
>  
> +EXPORT_MAP := $(RTE_SDK)/lib/librte_acl/rte_acl_version.map
> +
>  # all source are stored in SRCS-y
>  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += tb_mem.c
>  
> diff --git a/lib/librte_acl/rte_acl_version.map 
> b/lib/librte_acl/rte_acl_version.map
> new file mode 100644
> index 000..4480690
> --- /dev/null
> +++ b/lib/librte_acl/rte_acl_version.map
> @@ -0,0 +1,19 @@
> +DPDK_1.8 {
> + global:
> + rte_acl_create;
> + rte_acl_find_existing;
> + rte_acl_free;
> + rte_acl_add_rules;
> + rte_acl_reset_rules;
> + rte_acl_build;
> + rte_acl_reset;
> + rte_acl_classify;
> + rte_acl_dump;
> + rte_acl_list_dump;
> + rte_acl_ipv4vlan_add_rules;
> + rte_acl_ipv4vlan_build;
> + rte_acl_classify_scalar;
> +
> + local: *;
> +};
> +

Looking at this versioning, it strikes me that this looks like the perfect 
opportunity to go to a 2.0 version number.

My reasoning:
* We have already got fairly significant ABI and indeed API changes in this 
  release due to the mbuf rework. That alone makes it a logical point to 
  bump the Intel DPDK major version number to 2.0
* Having the API versioning start at a 2.0 looks neater than having it at 
  1.8, since .0 is a nice round version number to start with. Also if we 
  decide in the near future for whatever reasons to go to a 2.0 release, the 
  ABIs are probably still going to be 1.8. [Again, if we ever want to go to 
  2.0, now looks the perfect time]
* For the naming of the .so files, starting with them at a .2 now seems 
  reasonable to me, denoting a clean break with the older releases which did 
  have a different ABI. [Though again it makes more sense if you consider 
  that we may want to move to a 2.0 in future].

What do people think?

/Bruce


[dpdk-dev] DPDK and custom memory

2014-09-19 Thread Neil Horman
On Fri, Sep 19, 2014 at 12:13:55AM +, Saygin, Artur wrote:
> FWIW: rte_mempool_xmem_create turned out to be exactly what the use case 
> requires. It's not without limitations but is probably better than having to 
> copy buffers between device and DPDK memory.
> 
Ah, so it's not non-kernel managed memory you were after; it was a way to make
non-DPDK managed memory get managed by DPDK.  That makes more sense.
Neil

> -Original Message-
> From: Neil Horman [mailto:nhorman at tuxdriver.com] 
> Sent: Wednesday, September 03, 2014 3:04 AM
> To: Saygin, Artur
> Cc: Alex Markuze; Thomas Monjalon; dev at dpdk.org
> Subject: Re: [dpdk-dev] DPDK and custom memory
> 
> On Wed, Sep 03, 2014 at 01:17:53AM +, Saygin, Artur wrote:
> > Thanks for prompt responses!
> > 
> > To clarify, the questions is not about accessing a NIC, but about a NIC 
> > accessing a very specific block of physical memory, possibly non-kernel 
> > managed.
> > 
> Still not sure what you mean here by non-kernel managed.  If memory can be
> accessed from the CPU, then the kernel can allocate, free and access it, thats
> it.  If the memory isn't accessible from the cpu, then this is out of our 
> hands
> anyway.  The only question is, how do you access it.
> 
> > Per my understanding memory that rte_mempool_create API obtains is kernel 
> > managed, grabbed by DPDK via HUGETLBFS, with address selection being 
> > outside of application control. Is there a way around that? As in have DPDK 
> > allocate buffer memory from address XYZ only...
> Nope, the DPDK allocates blocks of memory without regard to the operation of 
> the
> NIC.  If you have some odd NIC that requires access to a specific physical
> memory range, then it is your responsibility to reserve that memory and author
> the PMD in such a way that it communicates with the NIC via that memory.
> Usually this is done via a combination of operating system facilities (e.g. 
> the
> linux kernel commanline option memmap or the runtime mmap operation on the
> /dev/mem device).
> 
> Regards
> Neil
> 
> > 
> > If VFIO / IOMMU is still the answer - I'll poke in that direction. If not - 
> > any additional insight is appreciated.
> > 
> > -Original Message-
> > From: Alex Markuze [mailto:alex at weka.io] 
> > Sent: Sunday, August 31, 2014 1:27 AM
> > To: Thomas Monjalon
> > Cc: Saygin, Artur; dev at dpdk.org
> > Subject: Re: [dpdk-dev] DPDK and custom memory
> > 
> > Artur, I don't have the details of what you are trying to achieve, but
> > it sounds like something that is covered by IOMMU, SW or HW.  The
> > IOMMU creates an iova (I/O Virtual address) the nic can access the
> > range is controlled with flags passed to the dma_map functions.
> > 
> > So I understand your question this way, How does the DPDK work with
> > IOMMU enabled system and can you influence the mapping?
> > 
> > 
> > On Sat, Aug 30, 2014 at 4:03 PM, Thomas Monjalon
> >  wrote:
> > > Hello,
> > >
> > > 2014-08-29 18:40, Saygin, Artur:
> > >> Imagine a PMD for an FPGA-based NIC that is limited to accessing certain
> > >> memory regions .
> > >
> > > Does it mean Intel is making an FPGA-based NIC?
> > >
> > >> Is there a way to make DPDK use that exact memory?
> > >
> > > Maybe I don't understand the question well, because it doesn't seem really
> > > different of what other PMDs do.
> > > Assuming your NIC is PCI, you can access it via uio (igb_uio) or VFIO.
> > >
> > >> Perhaps this is more of a hugetlbfs question than DPDK but I thought I'd
> > >> start here.
> > >
> > > It's a pleasure to receive new drivers.
> > > Welcome here :)
> > >
> > > --
> > > Thomas
> 


[dpdk-dev] [PATCH 2/4] Provide initial versioning for all DPDK libraries

2014-09-19 Thread Neil Horman
On Fri, Sep 19, 2014 at 10:45:38AM +0100, Bruce Richardson wrote:
> On Mon, Sep 15, 2014 at 03:23:49PM -0400, Neil Horman wrote:
> > Add linker version script files to each DPDK library to put a stake in the
> > ground from which we can start cleaning up API's
> > 
> > Signed-off-by: Neil Horman 
> > CC: Thomas Monjalon 
> > CC: "Richardson, Bruce" 
> > ---
> >  <... snip for brevity ...>
> >
> > diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
> > index 65e566d..1f96645 100644
> > --- a/lib/librte_acl/Makefile
> > +++ b/lib/librte_acl/Makefile
> > @@ -37,6 +37,8 @@ LIB = librte_acl.a
> >  CFLAGS += -O3
> >  CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR)
> >  
> > +EXPORT_MAP := $(RTE_SDK)/lib/librte_acl/rte_acl_version.map
> > +
> >  # all source are stored in SRCS-y
> >  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += tb_mem.c
> >  
> > diff --git a/lib/librte_acl/rte_acl_version.map 
> > b/lib/librte_acl/rte_acl_version.map
> > new file mode 100644
> > index 000..4480690
> > --- /dev/null
> > +++ b/lib/librte_acl/rte_acl_version.map
> > @@ -0,0 +1,19 @@
> > +DPDK_1.8 {
> > +   global:
> > +   rte_acl_create;
> > +   rte_acl_find_existing;
> > +   rte_acl_free;
> > +   rte_acl_add_rules;
> > +   rte_acl_reset_rules;
> > +   rte_acl_build;
> > +   rte_acl_reset;
> > +   rte_acl_classify;
> > +   rte_acl_dump;
> > +   rte_acl_list_dump;
> > +   rte_acl_ipv4vlan_add_rules;
> > +   rte_acl_ipv4vlan_build;
> > +   rte_acl_classify_scalar;
> > +
> > +   local: *;
> > +};
> > +
> 
> Looking at this versionning, it strikes me that this looks like the perfect 
> opportunity to go to a 2.0 version number.
> 
> My reasoning:
> * We have already got fairly significant ABI and indeed API changes in this 
>   release due to the mbuf rework. That allow makes it a logical point to 
>   bump the Intel DPDK major version number to 2.0
> * Having the API versioning start at a 2.0 looks neater than having it at 
>   1.8, since .0 is a nice round version number to start with. Also if we 
>   decide in the near future for whatever reasons to go to a 2.0 release, the 
>   ABIs are probably still going to be 1.8. [Again, if we ever want to go to 
>   2.0, now looks the perfect time]
> * For the naming of the .so files, starting with them at a .2 now seems 
>   reasonable to me, denoting a clean break with the older releases which did 
>   have a different ABI. [Though again it makes more sense if you consider 
>   that we may want to move to a 2.0 in future].
> 
> What do people think?
> 
I'm fine with it.  Just so that we're clear, this patch treats versions as
arbitrary strings (the file structure denotes version ordinality), so 1.8 vs 2.0
makes absolutely no difference as far as the mechanism goes.  The exported
version value is a matter of policy, but I'm fine with making that adjustment.
Neil

> /Bruce
> 


[dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32

2014-09-19 Thread Neil Horman
On Fri, Sep 19, 2014 at 09:18:26AM +, Richardson, Bruce wrote:
> > -Original Message-
> > From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > Sent: Thursday, September 18, 2014 7:09 PM
> > To: Thomas Monjalon
> > Cc: Richardson, Bruce; dev at dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32
> > 
> > On Thu, Sep 18, 2014 at 07:13:52PM +0200, Thomas Monjalon wrote:
> > > 2014-09-18 15:53, Richardson, Bruce:
> > > > > > --- a/app/test-pmd/testpmd.c
> > > > > > +++ b/app/test-pmd/testpmd.c
> > > > > > @@ -225,7 +225,7 @@ struct rte_eth_thresh tx_thresh = {
> > > > > >  /*
> > > > > >   * Configurable value of RX free threshold.
> > > > > >   */
> > > > > > -uint16_t rx_free_thresh = 0; /* Immediately free RX descriptors by
> > default. */
> > > > > > +uint16_t rx_free_thresh = 32; /* Refill RX descriptors once every 
> > > > > > 32
> > packets
> > > > > */
> > > > > >
> > > > >
> > > > > Why 32?  Was that an experimentally determined value?
> > > > > Does it hold true for all PMD's?
> > > >
> > > > This is primarily for the ixgbe PMD, which is right now the most
> > > > highly tuned driver, but it works fine for all other ones too,
> > > > as far as I'm aware.
> > >
> > > Yes, you are changing this value for all PMDs but you're targetting
> > > only one.
> > > These thresholds are dependent of the PMD implementation. There's
> > > something wrong here.
> > >
> > I agree. Its fine to do this, but it does seem like the sample application
> > should document why it does this and make note of the fact that other PMDs
> > may
> > have a separate optimal value.
> > 
> > > > Basically, this is the minimum setting needed to enable either the
> > > > bulk alloc or vector RX routines inside the ixgbe driver, so it's
> > > > best made the default for that reason. Please see
> > > > "check_rx_burst_bulk_alloc_preconditions()" in ixgbe_rxtx.c, and
> > > > RX function assignment logic in "ixgbe_dev_rx_queue_setup()" in
> > > > the same file.
> > >
> > > Since this parameter is so important, it could be a default value 
> > > somewhere.
> > >
> > > I think we should split generic tuning parameters and tuning parameters
> > > related to driver implementation or specific hardware.
> > > Then we should provide some good default values for each of them.
> > > At last, if needed, applications should be able to easily tune the
> > > pmd-specific parameters.
> > >
> > I like this idea.  I've not got an idea of how much work it is to do so, 
> > but in
> > principle it makes sense.
> > 
> > Perhaps for the immediate need, since rte_eth_rx_queue_setup allows the
> > config
> > struct to get passed directly to PMDs, we can create a reserved value
> > RTE_ETH_RX_FREE_THRESH_OPTIMAL, that instructs the pmd to select
> > whatever
> > threshold is optimal for its own hardware?
> > 
> > Neil
> > 
> Actually, looking at the code, I would suggest a couple of options, some of 
> which may be used together.
>   1) we make NULL a valid value for the rxconf structure parameter to 
> rte_eth_rx_queue_setup. There is little information in it that should really 
> need to be passed in by applications to the drivers, and that would allow the 
> drivers to be completely free to select the best options for their own 
> operation. 
>   2) As a companion to that (or as an alternative), we could also allow 
> each driver to provide its own functions for rte_eth_get_rxconf_default, and 
> rte_eth_get_txconf_default, to be used by applications that want to use 
> known-good values for thresholds but also want to tweak one of the other 
> values e.g. for rx, set the drop_en bit, and for tx set the txqflags to 
> disable offloads.
>   3) Lastly, we could also consider removing the threshold and other 
> not-generally-used values from the rxconf and txconf structures and make 
> those removed fields completely driver-set values. Optionally, we could 
> provide an alternate API to tune them, but I don't really see this being 
> useful in most cases, and I'd probably omit it unless someone can prove a 
> need for such APIs.
> 
These all sound fairly reasonable to me.
Neil

> Regards,
> /Bruce
> 


[dpdk-dev] Porting DPDK to ARM platform

2014-09-19 Thread Mukesh Dua
Hi,

Has anyone tried porting DPDK to the ARM platform?
If so, please provide some guidance on the changes required.

Regards,
Mukesh


[dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32

2014-09-19 Thread Richardson, Bruce


> -Original Message-
> From: Neil Horman [mailto:nhorman at tuxdriver.com]
> Sent: Friday, September 19, 2014 11:25 AM
> To: Richardson, Bruce
> Cc: Thomas Monjalon; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32
> 
> On Fri, Sep 19, 2014 at 09:18:26AM +, Richardson, Bruce wrote:
> > > -Original Message-
> > > From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > > Sent: Thursday, September 18, 2014 7:09 PM
> > > To: Thomas Monjalon
> > > Cc: Richardson, Bruce; dev at dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32
> > >
> > > On Thu, Sep 18, 2014 at 07:13:52PM +0200, Thomas Monjalon wrote:
> > > > 2014-09-18 15:53, Richardson, Bruce:
> > > > > > > --- a/app/test-pmd/testpmd.c
> > > > > > > +++ b/app/test-pmd/testpmd.c
> > > > > > > @@ -225,7 +225,7 @@ struct rte_eth_thresh tx_thresh = {
> > > > > > >  /*
> > > > > > >   * Configurable value of RX free threshold.
> > > > > > >   */
> > > > > > > -uint16_t rx_free_thresh = 0; /* Immediately free RX descriptors 
> > > > > > > by
> > > default. */
> > > > > > > +uint16_t rx_free_thresh = 32; /* Refill RX descriptors once 
> > > > > > > every 32
> > > packets
> > > > > > */
> > > > > > >
> > > > > >
> > > > > > Why 32?  Was that an experimentally determined value?
> > > > > > Does it hold true for all PMD's?
> > > > >
> > > > > This is primarily for the ixgbe PMD, which is right now the most
> > > > > highly tuned driver, but it works fine for all other ones too,
> > > > > as far as I'm aware.
> > > >
> > > > Yes, you are changing this value for all PMDs but you're targetting
> > > > only one.
> > > > These thresholds are dependent of the PMD implementation. There's
> > > > something wrong here.
> > > >
> > > I agree. Its fine to do this, but it does seem like the sample application
> > > should document why it does this and make note of the fact that other PMDs
> > > may
> > > have a separate optimal value.
> > >
> > > > > Basically, this is the minimum setting needed to enable either the
> > > > > bulk alloc or vector RX routines inside the ixgbe driver, so it's
> > > > > best made the default for that reason. Please see
> > > > > "check_rx_burst_bulk_alloc_preconditions()" in ixgbe_rxtx.c, and
> > > > > RX function assignment logic in "ixgbe_dev_rx_queue_setup()" in
> > > > > the same file.
> > > >
> > > > Since this parameter is so important, it could be a default value
> somewhere.
> > > >
> > > > I think we should split generic tuning parameters and tuning parameters
> > > > related to driver implementation or specific hardware.
> > > > Then we should provide some good default values for each of them.
> > > > At last, if needed, applications should be able to easily tune the
> > > > pmd-specific parameters.
> > > >
> > > I like this idea.  I've not got an idea of how much work it is to do so, 
> > > but in
> > > principle it makes sense.
> > >
> > > Perhaps for the immediate need, since rte_eth_rx_queue_setup allows the
> > > config
> > > struct to get passed directly to PMDs, we can create a reserved value
> > > RTE_ETH_RX_FREE_THRESH_OPTIMAL, that instructs the pmd to select
> > > whatever
> > > threshold is optimal for its own hardware?
> > >
> > > Neil
> > >
> > Actually, looking at the code, I would suggest a couple of options, some of
> which may be used together.
> > 1) we make NULL a valid value for the rxconf structure parameter to
> rte_eth_rx_queue_setup. There is little information in it that should really 
> need
> to be passed in by applications to the drivers, and that would allow the 
> drivers to
> be completely free to select the best options for their own operation.
> > 2) As a companion to that (or as an alternative), we could also allow
> each driver to provide its own functions for rte_eth_get_rxconf_default, and
> rte_eth_get_txconf_default, to be used by applications that want to use known-
> good values for thresholds but also want to tweak one of the other values e.g.
> for rx, set the drop_en bit, and for tx set the txqflags to disable offloads.
> > 3) Lastly, we could also consider removing the threshold and other not-
> generally-used values from the rxconf and txconf structures and make those
> removed fields completely driver-set values. Optionally, we could provide an
> alternate API to tune them, but I don't really see this being useful in most 
> cases,
> and I'd probably omit it unless someone can prove a need for such APIs.
> >
> These all sound fairly reasonable to me.
> Neil

On further thought, it seems to me that option 1 doesn't really go very far, so 
the choice falls between 2 and 3. Any preference between them?

/Bruce


[dpdk-dev] [RFC] PMD for performance measurement

2014-09-19 Thread muk...@igel.co.jp
From: Tetsuya Mukawa 

Hi,

I want to measure throughput in cases like the following.

- path connected by RING PMDs
- path connected by librte_vhost and virtio-net PMD
- path connected by MEMNIC PMDs
- .

Does anyone else want to do the same thing?

Anyway, I guess those throughputs may be too high for some devices like Ixia.
But it's a bit of a pain to write or fix applications just for measuring.

I guess a PMD like '/dev/null' and the testpmd application will help.
This patch set is an RFC of a PMD like '/dev/null'.
Please see the first commit of this patch set.
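
As a rough illustration of what a '/dev/null'-like PMD does, here is a minimal sketch that runs without DPDK. It is not the code from the patch: struct buf, struct null_dev, null_rx() and null_tx() are hypothetical names, plain malloc()/free() stand in for mbuf allocation, and the pkt_size and copy fields only mirror the size and copy options listed in the patch description.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical packet buffer; a stand-in for an mbuf, not the DPDK type. */
struct buf {
    uint16_t len;
    uint8_t *data;
};

/* Hypothetical device settings mirroring a "size" and a "copy" option. */
struct null_dev {
    uint16_t pkt_size;
    int      copy;
    uint8_t *dummy;     /* pkt_size bytes used as copy source and sink */
};

/* RX: allocate up to n buffers of pkt_size bytes and hand them out. */
static uint16_t
null_rx(struct null_dev *dev, struct buf **bufs, uint16_t n)
{
    uint16_t i;

    assert(dev != NULL && bufs != NULL);
    for (i = 0; i < n; i++) {
        struct buf *b = malloc(sizeof(*b));

        if (b == NULL)
            break;
        b->data = malloc(dev->pkt_size);
        if (b->data == NULL) {
            free(b);
            break;
        }
        b->len = dev->pkt_size;
        if (dev->copy)
            memcpy(b->data, dev->dummy, dev->pkt_size);
        bufs[i] = b;
    }
    return i;   /* number of buffers actually produced */
}

/* TX: optionally copy the payload out, then free every buffer. */
static uint16_t
null_tx(struct null_dev *dev, struct buf **bufs, uint16_t n)
{
    uint16_t i;

    for (i = 0; i < n; i++) {
        if (dev->copy) {
            /* never copy more than the dummy area can hold */
            uint16_t len = bufs[i]->len < dev->pkt_size ?
                           bufs[i]->len : dev->pkt_size;
            memcpy(dev->dummy, bufs[i]->data, len);
        }
        free(bufs[i]->data);
        free(bufs[i]);
        bufs[i] = NULL;
    }
    return n;   /* everything is "sent" */
}
```

Because neither direction touches hardware, the measured rate reflects only the path under test plus the allocation (and optional copy) cost.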


Here is my plan for using this PMD:

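+---------------------------+
|         testpmd1          |
+-------------+--+----------+
| Target PMD1 |  | null PMD |
+------++-----+  +----------+
       ||
       || Target path
       ||
+------++-----+  +----------+
| Target PMD2 |  | null PMD |
+-------------+--+----------+
|         testpmd2          |
+---------------------------+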

Either testpmd1 or testpmd2 will be started with "start tx_first". This causes a
huge number of transfers.

The result is not the throughput of PMD1 or PMD2 alone, but the throughput
between PMD1 and PMD2. But I guess that is enough to know the rough throughput.
It is also nice for measuring that the same environment can be reused.

Any suggestions or comments?

Thanks,
Tetsuya

--
Tetsuya Mukawa (1):
  librte_pmd_null: Add null PMD

 config/common_bsdapp   |   5 +
 config/common_linuxapp |   5 +
 lib/Makefile   |   1 +
 lib/librte_pmd_null/Makefile   |  58 +
 lib/librte_pmd_null/rte_eth_null.c | 474 +
 5 files changed, 543 insertions(+)
 create mode 100644 lib/librte_pmd_null/Makefile
 create mode 100644 lib/librte_pmd_null/rte_eth_null.c

-- 
1.9.1



[dpdk-dev] [RFC] librte_pmd_null: Add null PMD

2014-09-19 Thread muk...@igel.co.jp
From: Tetsuya Mukawa 

'null PMD' is a virtual device driver particularly designed to measure
performance of DPDK applications and DPDK PMDs. When an application calls rx,
null PMD just allocates mbufs and returns them. On tx, the PMD just frees
mbufs.

The PMD has the following options.
- size: specify packet size allocated by RX. Default packet size is 64.
- copy: specify 1 or 0 to enable or disable copy during RX and TX.
Default value is 0 (disabled).
This option is used for emulating more realistic data transfer.
Copy size is equal to packet size.

Signed-off-by: Tetsuya Mukawa 
---
 config/common_bsdapp   |   5 +
 config/common_linuxapp |   5 +
 lib/Makefile   |   1 +
 lib/librte_pmd_null/Makefile   |  58 +
 lib/librte_pmd_null/rte_eth_null.c | 474 +
 5 files changed, 543 insertions(+)
 create mode 100644 lib/librte_pmd_null/Makefile
 create mode 100644 lib/librte_pmd_null/rte_eth_null.c

diff --git a/config/common_bsdapp b/config/common_bsdapp
index 645949f..a86321f 100644
--- a/config/common_bsdapp
+++ b/config/common_bsdapp
@@ -226,6 +226,11 @@ CONFIG_RTE_LIBRTE_PMD_PCAP=y
 CONFIG_RTE_LIBRTE_PMD_BOND=y

 #
+# Compile null PMD
+#
+CONFIG_RTE_LIBRTE_PMD_NULL=y
+
+#
 # Do prefetch of packet data within PMD driver receive function
 #
 CONFIG_RTE_PMD_PACKET_PREFETCH=y
diff --git a/config/common_linuxapp b/config/common_linuxapp
index 5bee910..e3bd8c0 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -254,6 +254,11 @@ CONFIG_RTE_LIBRTE_PMD_BOND=y
 CONFIG_RTE_LIBRTE_PMD_XENVIRT=n

 #
+# Compile null PMD
+#
+CONFIG_RTE_LIBRTE_PMD_NULL=y
+
+#
 # Do prefetch of packet data within PMD driver receive function
 #
 CONFIG_RTE_PMD_PACKET_PREFETCH=y
diff --git a/lib/Makefile b/lib/Makefile
index 10c5bb3..61d6ed1 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -50,6 +50,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += librte_pmd_pcap
 DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += librte_pmd_virtio
 DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += librte_pmd_vmxnet3
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += librte_pmd_xenvirt
+DIRS-$(CONFIG_RTE_LIBRTE_PMD_NULL) += librte_pmd_null
 DIRS-$(CONFIG_RTE_LIBRTE_HASH) += librte_hash
 DIRS-$(CONFIG_RTE_LIBRTE_LPM) += librte_lpm
 DIRS-$(CONFIG_RTE_LIBRTE_ACL) += librte_acl
diff --git a/lib/librte_pmd_null/Makefile b/lib/librte_pmd_null/Makefile
new file mode 100644
index 000..e017918
--- /dev/null
+++ b/lib/librte_pmd_null/Makefile
@@ -0,0 +1,58 @@
+#   BSD LICENSE
+#
+#   Copyright (C) 2014 Nippon Telegraph and Telephone Corporation.
+#   All rights reserved.
+#
+#   Redistribution and use in source and binary forms, with or without
+#   modification, are permitted provided that the following conditions
+#   are met:
+#
+# * Redistributions of source code must retain the above copyright
+#   notice, this list of conditions and the following disclaimer.
+# * Redistributions in binary form must reproduce the above copyright
+#   notice, this list of conditions and the following disclaimer in
+#   the documentation and/or other materials provided with the
+#   distribution.
+# * Neither the name of Intel Corporation nor the names of its
+#   contributors may be used to endorse or promote products derived
+#   from this software without specific prior written permission.
+#
+#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = librte_pmd_null.a
+
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+
+#
+# all source are stored in SRCS-y
+#
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_NULL) += rte_eth_null.c
+
+#
+# Export include files
+#
+SYMLINK-y-include +=
+
+# this lib depends upon:
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_NULL) += lib/librte_mbuf
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_NULL) += lib/librte_ether
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_NULL) += lib/librte_malloc
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_NULL) += lib/librte_kvargs
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_pmd_null/rte_eth_null.c 
b/lib/librte_pmd_null/rte_eth_null.c
new file mode 100644
index 000..1a81843
--- /dev/null
+++ b/lib/librte_pmd_null/rte_eth_null.c
@@ -0,0 +1,474 @@
+

[dpdk-dev] [PATCH 2/2] bond: add mode 4 support

2014-09-19 Thread Wodkowski, PawelX
> -Original Message-
> From: Neil Horman [mailto:nhorman at tuxdriver.com]
> Sent: Thursday, September 18, 2014 18:03
> To: Wodkowski, PawelX
> Cc: dev at dpdk.org; Jastrzebski, MichalX K; Doherty, Declan
> Subject: Re: [dpdk-dev] [PATCH 2/2] bond: add mode 4 support
> 
> On Thu, Sep 18, 2014 at 08:07:31AM +, Wodkowski, PawelX wrote:
> > > > +int
> > > > +bond_mode_8023ad_deactivate_slave(struct rte_eth_dev *bond_dev,
> > > > +   uint8_t slave_pos)
> > > > +{
> > > > +   struct bond_dev_private *internals = 
> > > > bond_dev->data->dev_private;
> > > > +   struct mode8023ad_data *data = &internals->mode4;
> > > > +   struct port *port;
> > > > +   uint8_t i;
> > > > +
> > > > +   bond_mode_8023ad_stop(bond_dev);
> > > > +
> > > > +   /* Exclude slave from transmit policy. If this slave is an 
> > > > aggregator
> > > > +* make all aggregated slaves unselected to force sellection 
> > > > logic
> > > > +* to select suitable aggregator for this port   */
> > > > +   for (i = 0; i < internals->active_slave_count; i++) {
> > > > +   port = &data->port_list[slave_pos];
> > > > +   if (port->used_agregator_idx == slave_pos) {
> > > > +   port->selected = UNSELECTED;
> > > > +   port->actor_state &= ~(STATE_SYNCHRONIZATION |
> > > STATE_DISTRIBUTING |
> > > > +   STATE_COLLECTING);
> > > > +
> > > > +   /* Use default aggregator */
> > > > +   port->used_agregator_idx = i;
> > > > +   }
> > > > +   }
> > > > +
> > > > +   port = &data->port_list[slave_pos];
> > > > +   timer_cancel(&port->current_while_timer);
> > > > +   timer_cancel(&port->periodic_timer);
> > > > +   timer_cancel(&port->wait_while_timer);
> > > > +   timer_cancel(&port->tx_machine_timer);
> > > > +
> > > These all seem rather racy.  Alarm callbacks are executed with the alarm 
> > > list
> > > locks not held.  So there is every possibility that you could execute 
> > > these (or
> > > any timer_cancel calls in this PMD in parallel with the internal state 
> > > machine
> > > timer callback, and leave either with a corrupted timer list (resulting 
> > > from a
> > > double free between here, and the actual callback site),
> >
> > I don't think so. Yes, callbacks are executed with alarm list locks not held,
> > but this is not the issue because access to the list itself is guarded by the
> > lock and the ap->executing variable. So the list will not be trashed. Check the
> > source of eal_alarm_callback(), rte_eal_alarm_set() and rte_eal_alarm_cancel().
> >
> Yes, you're right, the list is probably safe with the executing bit.
> 
> > > or a timer that is
> > > actually still pending when a slave is removed.
> > >
> > This is not the issue either, but the problem might be similar. I assumed that
> > alarms are atomic, but when I looked at rte alarms more closely I saw a race
> > condition between rte_eal_alarm_cancel() from bond_mode_8023ad_stop() and
> > rte_eal_alarm_set() from the state machine's callback. This needs to be
> > reworked in some way.
> 
> Yes, this is what I was referring to:
> 
> CPU0                            CPU1
> rte_eal_alarm_callback          bond_8023ad_deactivate_slave
> -bond_8023_ad_periodic_cb       timer_cancel
> timer_set
> 
> If those timer functions operate on the same timer, the result is that you can
> leave the stop/deactivate slave paths with a timer function for that slave
> still pending. The bonding mode needs some internal state to serialize those
> operations and determine if the timer should be reactivated.
> 
> Neil

I rethought the issue, and the problem is much simpler than it looks. I did the
following:
1. Changed the internal state machine alarms to use rte_rdtsc(). This makes all
   mode 4 internal timer_*() functions unaffected by any race condition.
2. Added a busy loop when canceling the main callback timer, which spins until
   the cancel is successful.
This should take care of the race condition. Do you agree?

Here is the part involving timers that I have changed:

 static void
-timer_expired_cb(void *arg)
+timer_stop(uint64_t *timer)
 {
-   enum timer_state *timer = arg;
-
-   BOND_ASSERT(*timer == TIMER_RUNNING);
-   *timer = TIMER_EXPIRED;
+   *timer = 0;
 }

 static void
-timer_cancel(enum timer_state *timer)
+timer_set(uint64_t *timer, uint64_t timeout_ms)
 {
-   rte_eal_alarm_cancel(&timer_expired_cb, timer);
-   *timer = TIMER_NOT_STARTED;
+   *timer = rte_rdtsc() + timeout_ms * rte_get_tsc_hz() / 1000;
 }

+/* Forces given timer to be in expired state. */
 static void
-timer_set(enum timer_state *timer, uint64_t timeout)
+timer_force_expired(uint64_t *timer)
 {
-   rte_eal_alarm_cancel(&timer_expired_cb, timer);
-   rte_eal_alarm_set(timeout * 1000, &timer_expired_cb, timer);
-   *timer = TIMER_RUNNING;
+   *timer = rte_rdtsc();
 }

 static bool
-timer_is_expir
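
For readers following the diff above, here is a minimal standalone sketch of
the deadline-style timer idea: a timer is just a stored tick value, so stopping
or re-arming it is a plain store and there is no alarm list to race against.
This is not the patch itself. The deadline_* names are made up for
illustration, the tick counter and frequency are passed in as plain arguments
instead of calling rte_rdtsc() and rte_get_tsc_hz(), and the expiry check is my
own guess since the quoted diff is cut off before showing it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A timer is a deadline in ticks; 0 means "stopped" (assumption of this
 * sketch, mirroring timer_stop() in the diff above). */

static void deadline_stop(uint64_t *timer)
{
    *timer = 0;
}

/* now: current tick count, hz: ticks per second (stand-ins for
 * rte_rdtsc() and rte_get_tsc_hz()). */
static void deadline_set(uint64_t *timer, uint64_t now, uint64_t hz,
                         uint64_t timeout_ms)
{
    *timer = now + timeout_ms * hz / 1000;
}

/* Force the timer into the expired state by making "now" the deadline. */
static void deadline_force_expired(uint64_t *timer, uint64_t now)
{
    *timer = now;
}

/* Hypothetical expiry check: running and the deadline has been reached. */
static bool deadline_expired(const uint64_t *timer, uint64_t now)
{
    return *timer != 0 && *timer <= now;
}
```

With hz = 1000 (one tick per millisecond), deadline_set(&t, 100, 1000, 50)
stores 150, and the timer reads as expired from tick 150 onwards. Note that a
forced expiry at tick 0 would be indistinguishable from a stopped timer in this
sketch.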

[dpdk-dev] [PATCH 0/4] Add DSO symbol versioning to support backwards compatibility

2014-09-19 Thread Venkatesan, Venky
On 9/18/2014 12:14 PM, Neil Horman wrote:
> On Thu, Sep 18, 2014 at 08:23:36PM +0200, Thomas Monjalon wrote:
>> Hi Neil,
>>
>> 2014-09-15 15:23, Neil Horman:
>>> The DPDK ABI develops and changes quickly, which makes it difficult for
>>> applications to keep up with the latest version of the library, especially
>>> when it (the DPDK) is built as a set of shared objects, as applications may
>>> be built against an older version of the library.
>>>
>>> To mitigate this, this patch series introduces support for library and
>>> symbol versioning when the DPDK is built as a DSO.  Specifically, it does
>>> 4 things:
>>>
>>> 1) Adds initial support for library versioning.  Each library now has a
>>> version map that explicitly calls out what symbols are exported to using
>>> applications, and assigns version(s) to them
>>>
>>> 2) Adds support macros so that when libraries create incompatible ABIs,
>>> multiple versions may be supported so that applications linked against older
>>> DPDK releases can continue to function
>>>
>>> 3) Adds library soname versioning suffixes so that when ABIs must be broken
>>> in a fashion that requires a rebuild of older applications, they will break
>>> at load time, rather than cause unexpected issues at run time.
>>>
>>> 4) Adds documentation for ABI policy, and provides space to document
>>> deprecated ABI versions, so that applications might be warned of impending
>>> changes.
>>>
>>> With these elements in place the DPDK has some support to allow for the
>>> extended maintenance of older APIs while still allowing the freedom to
>>> develop new and improved APIs.
>>>
>>> Implementing this feature will require some additional effort on the part of
>>> developers and reviewers.  When reviewing patches, changes must be checked
>>> against existing exports to ensure that the function prototypes are not
>>> changing.  If they are, the versioning macros must be used, and the library
>>> export map should be updated to reflect the new version of the function.
>>>
>>> When data structures change, if those structures are application accessible,
>>> APIs that accept or return instances of those data structures should have
>>> new versions created so that users of the old data structure version might
>>> co-exist at the same time.
>> Thanks for your efforts.
>> But I feel this change has too many constraints for the current status of
>> the DPDK. It's probably too early to adopt such a policy.
>>
> I think you may be misunderstanding something.  What constraints do you
> believe that this patch imposes?  Note it doesn't in any way prevent changes
> to the ABI of the DPDK, but rather gives us infrastructure to support multiple
> ABI revisions at the same time, so that applications built against DPDK shared
> libraries can continue to function properly at least for some time until we
> decide to deprecate that ABI level.
>
> This is all based on the versioning strategy outlined here:
> http://www.akkadia.org/drepper/dsohowto.pdf
>
> That may help clarify things for you.
>
>> By the way, this versioning doesn't cover structure changes.
> No, it doesn't.  No link-time mechanism does so.
>
>> How could it be managed?
> That's a subject that is open to discussion, but my initial thinking is that
> we need to handle it on a case by case basis:
>
> * For minor updates, where allocation of a structure is done on the heap and
> new fields need to be added, appending them to the end of a structure and
> providing an initial value is sufficient.
>
> * For major changes, where fields need to be removed or re-arranged, most
> likely the API surfaces which accept or return those structures as
> inputs/outputs will need to have new versions written to accept the new
> version of the structure, and internally we will have to support both formats
> for a time (according to the policy I documented, that is currently a single
> major release).  I.e. if you want to change struct foo, which is accepted as a
> parameter for the function bar(struct foo *ptr), then for a release we would
> need to create struct foo_v2 with the new format, map a new function bar_v2 to
> the exported bar@@DPDK_1.(X+1), and internally make the bar functions
> understand both the original and v2 versions of the structure.  Then in DPDK
> release 1.(X+2), we can remove the old version after posting a deprecation
> notice with version 1.(X+1)
>
>> Don't you think it would be more reliable if managed by packaging?
> Solving this with packaging defeats the purpose of having shared libraries at
> all.  Packaging each version of the dpdk separately is a possible stopgap
> solution, in that it allows applications to link to differing versions of the
> library independently, but that negates any expectation of timely bugfixes for
> any given version of the DPDK.  That is to say, if you package things this
> way, and wind up with several parallel versions of t
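
To make the "major change" case quoted above concrete, here is a rough sketch
of keeping an old and a new entry point side by side for one release. The
struct fields (len, flags) and what bar() computes are invented for
illustration; only the names foo, foo_v2 and bar come from the discussion.
Binding bar_v1 and bar_v2 to versioned names such as bar@DPDK_1.X and
bar@@DPDK_1.(X+1) would additionally need the versioning macros and a linker
version map, which are left out so the snippet builds on its own.

```c
#include <assert.h>
#include <stdint.h>

/* Original layout, as seen by applications built against release 1.X.
 * Field names are hypothetical. */
struct foo_v1 {
    uint16_t len;
    uint16_t flags;
};

/* New layout for 1.(X+1): fields widened and re-arranged, i.e. a change
 * that old binaries cannot tolerate. */
struct foo_v2 {
    uint32_t flags;
    uint32_t len;
};

/* New entry point; in a DSO this would become the default version. */
uint32_t bar_v2(const struct foo_v2 *ptr)
{
    return ptr->len + ptr->flags;
}

/* Old entry point, kept for one release: convert the old layout and
 * forward to the new implementation so both formats are understood. */
uint32_t bar_v1(const struct foo_v1 *ptr)
{
    struct foo_v2 converted = { .flags = ptr->flags, .len = ptr->len };

    return bar_v2(&converted);
}
```

An application built against the old headers keeps calling the v1 entry point
with the old layout and gets the same answer as a new application calling v2,
which is the co-existence property the policy is after.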

[dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32

2014-09-19 Thread Neil Horman
On Fri, Sep 19, 2014 at 10:28:31AM +, Richardson, Bruce wrote:
> 
> 
> > -Original Message-
> > From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > Sent: Friday, September 19, 2014 11:25 AM
> > To: Richardson, Bruce
> > Cc: Thomas Monjalon; dev at dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32
> > 
> > On Fri, Sep 19, 2014 at 09:18:26AM +, Richardson, Bruce wrote:
> > > > -Original Message-
> > > > From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > > > Sent: Thursday, September 18, 2014 7:09 PM
> > > > To: Thomas Monjalon
> > > > Cc: Richardson, Bruce; dev at dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 
> > > > 32
> > > >
> > > > On Thu, Sep 18, 2014 at 07:13:52PM +0200, Thomas Monjalon wrote:
> > > > > 2014-09-18 15:53, Richardson, Bruce:
> > > > > > > > --- a/app/test-pmd/testpmd.c
> > > > > > > > +++ b/app/test-pmd/testpmd.c
> > > > > > > > @@ -225,7 +225,7 @@ struct rte_eth_thresh tx_thresh = {
> > > > > > > >  /*
> > > > > > > >   * Configurable value of RX free threshold.
> > > > > > > >   */
> > > > > > > > -uint16_t rx_free_thresh = 0; /* Immediately free RX descriptors by default. */
> > > > > > > > +uint16_t rx_free_thresh = 32; /* Refill RX descriptors once every 32 packets */
> > > > > > > >
> > > > > > >
> > > > > > > Why 32?  Was that an experimentally determined value?
> > > > > > > Does it hold true for all PMD's?
> > > > > >
> > > > > > This is primarily for the ixgbe PMD, which is right now the most
> > > > > > highly tuned driver, but it works fine for all other ones too,
> > > > > > as far as I'm aware.
> > > > >
> > > > > Yes, you are changing this value for all PMDs but you're targeting
> > > > > only one.
> > > > > These thresholds are dependent on the PMD implementation. There's
> > > > > something wrong here.
> > > > >
> > > > I agree. It's fine to do this, but it does seem like the sample
> > > > application should document why it does this and make note of the fact
> > > > that other PMDs may have a separate optimal value.
> > > >
> > > > > > Basically, this is the minimum setting needed to enable either the
> > > > > > bulk alloc or vector RX routines inside the ixgbe driver, so it's
> > > > > > best made the default for that reason. Please see
> > > > > > "check_rx_burst_bulk_alloc_preconditions()" in ixgbe_rxtx.c, and
> > > > > > RX function assignment logic in "ixgbe_dev_rx_queue_setup()" in
> > > > > > the same file.
> > > > >
> > > > > Since this parameter is so important, it could be a default value
> > > > > somewhere.
> > > > >
> > > > > I think we should split generic tuning parameters and tuning
> > > > > parameters related to driver implementation or specific hardware.
> > > > > Then we should provide some good default values for each of them.
> > > > > Finally, if needed, applications should be able to easily tune the
> > > > > pmd-specific parameters.
> > > > >
> > > > I like this idea.  I've not got an idea of how much work it is to do
> > > > so, but in principle it makes sense.
> > > >
> > > > Perhaps for the immediate need, since rte_eth_rx_queue_setup allows the
> > > > config struct to get passed directly to PMDs, we can create a reserved
> > > > value RTE_ETH_RX_FREE_THRESH_OPTIMAL, that instructs the pmd to select
> > > > whatever threshold is optimal for its own hardware?
> > > >
> > > > Neil
> > > >
> > > Actually, looking at the code, I would suggest a couple of options, some
> > > of which may be used together.
> > >   1) We make NULL a valid value for the rxconf structure parameter to
> > > rte_eth_rx_queue_setup. There is little information in it that should
> > > really need to be passed in by applications to the drivers, and that
> > > would allow the drivers to be completely free to select the best options
> > > for their own operation.
> > >   2) As a companion to that (or as an alternative), we could also allow
> > > each driver to provide its own functions for rte_eth_get_rxconf_default
> > > and rte_eth_get_txconf_default, to be used by applications that want to
> > > use known-good values for thresholds but also want to tweak one of the
> > > other values, e.g. for rx, set the drop_en bit, and for tx set the
> > > txqflags to disable offloads.
> > >   3) Lastly, we could also consider removing the threshold and other
> > > not-generally-used values from the rxconf and txconf structures and make
> > > those removed fields completely driver-set values. Optionally, we could
> > > provide an alternate API to tune them, but I don't really see this being
> > > useful in most cases, and I'd probably omit it unless someone can prove a
> > > need for such APIs.
> > >
> > These all sound fairly reasonable to me.
> > Neil
> 
> Further thinking seems to me like 1 doesn't really go very far, so it falls 
> between 2 and 3. Any 

[dpdk-dev] [PATCH 2/2] bond: add mode 4 support

2014-09-19 Thread Neil Horman
On Fri, Sep 19, 2014 at 12:47:35PM +, Wodkowski, PawelX wrote:
> > -Original Message-
> > From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > Sent: Thursday, September 18, 2014 18:03
> > To: Wodkowski, PawelX
> > Cc: dev at dpdk.org; Jastrzebski, MichalX K; Doherty, Declan
> > Subject: Re: [dpdk-dev] [PATCH 2/2] bond: add mode 4 support
> > 
> > On Thu, Sep 18, 2014 at 08:07:31AM +, Wodkowski, PawelX wrote:
> > > > > +int
> > > > > +bond_mode_8023ad_deactivate_slave(struct rte_eth_dev *bond_dev,
> > > > > + uint8_t slave_pos)
> > > > > +{
> > > > > + struct bond_dev_private *internals = bond_dev->data->dev_private;
> > > > > + struct mode8023ad_data *data = &internals->mode4;
> > > > > + struct port *port;
> > > > > + uint8_t i;
> > > > > +
> > > > > + bond_mode_8023ad_stop(bond_dev);
> > > > > +
> > > > > + /* Exclude slave from transmit policy. If this slave is an aggregator
> > > > > +  * make all aggregated slaves unselected to force selection logic
> > > > > +  * to select suitable aggregator for this port   */
> > > > > + for (i = 0; i < internals->active_slave_count; i++) {
> > > > > +     port = &data->port_list[slave_pos];
> > > > > +     if (port->used_agregator_idx == slave_pos) {
> > > > > +         port->selected = UNSELECTED;
> > > > > +         port->actor_state &= ~(STATE_SYNCHRONIZATION |
> > > > > +             STATE_DISTRIBUTING | STATE_COLLECTING);
> > > > > +
> > > > > +         /* Use default aggregator */
> > > > > +         port->used_agregator_idx = i;
> > > > > +     }
> > > > > + }
> > > > > +
> > > > > + port = &data->port_list[slave_pos];
> > > > > + timer_cancel(&port->current_while_timer);
> > > > > + timer_cancel(&port->periodic_timer);
> > > > > + timer_cancel(&port->wait_while_timer);
> > > > > + timer_cancel(&port->tx_machine_timer);
> > > > > +
> > > > These all seem rather racy.  Alarm callbacks are executed with the alarm
> > > > list locks not held.  So there is every possibility that you could execute
> > > > these (or any timer_cancel calls in this PMD) in parallel with the internal
> > > > state machine timer callback, and leave either with a corrupted timer list
> > > > (resulting from a double free between here and the actual callback site),
> > >
> > > I don't think so. Yes, callbacks are executed with alarm list locks not held,
> > > but this is not the issue because access to the list itself is guarded by the
> > > lock and the ap->executing variable. So the list will not be trashed. Check the
> > > source of eal_alarm_callback(), rte_eal_alarm_set() and rte_eal_alarm_cancel().
> > >
> > Yes, you're right, the list is probably safe with the executing bit.
> > 
> > > > or a timer that is
> > > > actually still pending when a slave is removed.
> > > >
> > > This is not the issue either, but the problem might be similar. I assumed that
> > > alarms are atomic, but when I looked at rte alarms more closely I saw a race
> > > condition between rte_eal_alarm_cancel() from bond_mode_8023ad_stop() and
> > > rte_eal_alarm_set() from the state machine's callback. This needs to be
> > > reworked in some way.
> > 
> > Yes, this is what I was referring to:
> > 
> > CPU0                            CPU1
> > rte_eal_alarm_callback          bond_8023ad_deactivate_slave
> > -bond_8023_ad_periodic_cb       timer_cancel
> > timer_set
> > 
> > If those timer functions operate on the same timer, the result is that you can
> > leave the stop/deactivate slave paths with a timer function for that slave
> > still pending. The bonding mode needs some internal state to serialize those
> > operations and determine if the timer should be reactivated.
> > 
> > Neil
> 
> I rethought the issue, and the problem is much simpler than it looks. I did the
> following:
> 1. Changed the internal state machine alarms to use rte_rdtsc(). This makes all
>    mode 4 internal timer_*() functions unaffected by any race condition.
> 2. Added a busy loop when canceling the main callback timer, which spins until
>    the cancel is successful.
> This should take care of the race condition. Do you agree?
> 
I think that will work, but I believe you're making it more complicated (and
less reusable) than it needs to be.  What I think you really need to do is
create a new rte api call, rte_eal_alarm_cancel_sync (something like the
equivalent of del_timer_sync in linux), that wraps up the
while (rte_eal_alarm_cancel(...) == 0) { rte_pause(); } loop in its own function
(so other call sites can use it, as I don't think this is an uncommon problem).
Then just create a bonding-internal state flag to signal the periodic callback
that it shouldn't re-arm the timer.  That way all you have to do is set the
flag, and call rte_eal_alarm_cancel_sync, and you're done.  And other
applications will be able to handl
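
A small mock of the "set a stop flag, then cancel synchronously" idea described
above, for illustration only. Nothing here is the DPDK alarm API: the alarm is
simulated by a struct, and the cancel primitive is assumed to report "callback
currently executing" separately from "nothing pending". That distinction is an
assumption of this sketch; a loop that only spins while a cancel call returns
zero could spin forever once the callback has seen the flag and declined to
re-arm, because there would then be nothing left to cancel. Real code would
also need the flag to be an atomic and would pause the CPU inside the loop.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result codes for a cancel primitive. */
enum cancel_result { CANCEL_NONE, CANCEL_DONE, CANCEL_BUSY };

/* Mock periodic timer state. */
struct periodic {
    bool stop;        /* tells the callback not to re-arm */
    bool armed;       /* an alarm is pending */
    int  busy_polls;  /* mock: cancel attempts during which the callback runs */
};

/* End of the mock callback: re-arm only if nobody asked us to stop. */
static void periodic_cb_finish(struct periodic *p)
{
    p->armed = !p->stop;
}

/* Mock cancel: busy while the callback runs, otherwise cancel if armed. */
static enum cancel_result mock_cancel(struct periodic *p)
{
    if (p->busy_polls > 0) {
        if (--p->busy_polls == 0)
            periodic_cb_finish(p);
        return CANCEL_BUSY;
    }
    if (!p->armed)
        return CANCEL_NONE;
    p->armed = false;
    return CANCEL_DONE;
}

/* Set the flag first, then retry cancel until the callback is not running.
 * Returns the number of spins, purely so the behaviour can be observed. */
static int periodic_stop_sync(struct periodic *p)
{
    int spins = 0;

    p->stop = true;
    while (mock_cancel(p) == CANCEL_BUSY)
        spins++;
    return spins;
}
```

Because the flag is set before the first cancel attempt, a callback that is
mid-flight finishes without re-arming, and the caller leaves with no alarm
pending, which is the state the stop/deactivate paths need.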

[dpdk-dev] [PATCH 0/4] Add DSO symbol versioning to support backwards compatibility

2014-09-19 Thread Neil Horman
On Fri, Sep 19, 2014 at 07:18:36AM -0700, Venkatesan, Venky wrote:
> On 9/18/2014 12:14 PM, Neil Horman wrote:
> >On Thu, Sep 18, 2014 at 08:23:36PM +0200, Thomas Monjalon wrote:
> >>Hi Neil,
> >>
> >>2014-09-15 15:23, Neil Horman:
> >>>The DPDK ABI develops and changes quickly, which makes it difficult for
> >>>applications to keep up with the latest version of the library, especially
> >>>when it (the DPDK) is built as a set of shared objects, as applications may
> >>>be built against an older version of the library.
> >>>
> >>>To mitigate this, this patch series introduces support for library and
> >>>symbol versioning when the DPDK is built as a DSO.  Specifically, it does
> >>>4 things:
> >>>
> >>>1) Adds initial support for library versioning.  Each library now has a
> >>>version map that explicitly calls out what symbols are exported to using
> >>>applications, and assigns version(s) to them
> >>>
> >>>2) Adds support macros so that when libraries create incompatible ABIs,
> >>>multiple versions may be supported so that applications linked against
> >>>older DPDK releases can continue to function
> >>>
> >>>3) Adds library soname versioning suffixes so that when ABIs must be broken
> >>>in a fashion that requires a rebuild of older applications, they will break
> >>>at load time, rather than cause unexpected issues at run time.
> >>>
> >>>4) Adds documentation for ABI policy, and provides space to document
> >>>deprecated ABI versions, so that applications might be warned of impending
> >>>changes.
> >>>
> >>>With these elements in place the DPDK has some support to allow for the
> >>>extended maintenance of older APIs while still allowing the freedom to
> >>>develop new and improved APIs.
> >>>
> >>>Implementing this feature will require some additional effort on the part
> >>>of developers and reviewers.  When reviewing patches, changes must be
> >>>checked against existing exports to ensure that the function prototypes are
> >>>not changing.  If they are, the versioning macros must be used, and the
> >>>library export map should be updated to reflect the new version of the
> >>>function.
> >>>
> >>>When data structures change, if those structures are application
> >>>accessible, APIs that accept or return instances of those data structures
> >>>should have new versions created so that users of the old data structure
> >>>version might co-exist at the same time.
> >>Thanks for your efforts.
> >>But I feel this change has too many constraints for the current status of
> >>the DPDK. It's probably too early to adopt such a policy.
> >>
> >I think you may be misunderstanding something.  What constraints do you
> >believe that this patch imposes?  Note it doesn't in any way prevent changes
> >to the ABI of the DPDK, but rather gives us infrastructure to support multiple
> >ABI revisions at the same time, so that applications built against DPDK shared
> >libraries can continue to function properly at least for some time until we
> >decide to deprecate that ABI level.
> >
> >This is all based on the versioning strategy outlined here:
> >http://www.akkadia.org/drepper/dsohowto.pdf
> >
> >That may help clarify things for you.
> >
> >>By the way, this versioning doesn't cover structure changes.
> >No, it doesn't.  No link-time mechanism does so.
> >
> >>How could it be managed?
> >That's a subject that is open to discussion, but my initial thinking is that
> >we need to handle it on a case by case basis:
> >
> >* For minor updates, where allocation of a structure is done on the heap and
> >new fields need to be added, appending them to the end of a structure and
> >providing an initial value is sufficient.
> >
> >* For major changes, where fields need to be removed or re-arranged, most
> >likely the API surfaces which accept or return those structures as
> >inputs/outputs will need to have new versions written to accept the new
> >version of the structure, and internally we will have to support both formats
> >for a time (according to the policy I documented, that is currently a single
> >major release).  I.e. if you want to change struct foo, which is accepted as a
> >parameter for the function bar(struct foo *ptr), then for a release we would
> >need to create struct foo_v2 with the new format, map a new function bar_v2 to
> >the exported bar@@DPDK_1.(X+1), and internally make the bar functions
> >understand both the original and v2 versions of the structure.  Then in DPDK
> >release 1.(X+2), we can remove the old version after posting a deprecation
> >notice with version 1.(X+1)
> >
> >>Don't you think it would be more reliable if managed by packaging?
> >Solving this with packaging defeats the purpose of having shared libraries at
> >all.  Packaging each version of the dpdk separately is a possible stopgap
> >solution, in that it allows applications to link to differing version

[dpdk-dev] Porting DPDK to ARM platform

2014-09-19 Thread Aaro Koskinen
Hi,

On Fri, Sep 19, 2014 at 03:57:52PM +0530, Mukesh Dua wrote:
> Did someone try porting DPDK to the ARM platform?

Maybe check here: https://wiki.linaro.org/LNG/Engineering/DPDK

A.