Re: configuration of memseg lists number
Thanks, it makes sense. I'll get around to it "eventually". On Thu, 2023-11-02 at 11:04 +0100, Thomas Monjalon wrote: > Hello, > > While looking at Seastar, I see it uses this patch on top of DPDK: > > build: add meson options of max_memseg_lists > > RTE_MAX_MEMSEG_LISTS = 128 is not enough for high-memory > machines, > in our case, we need to increase it to 8192. > so add an option so user can override it. > > https://github.com/scylladb/dpdk/commit/cafaa3cf457584de > > I think we could allow to configure this at runtime, > as we did already for RTE_MAX_MEMZONE: > we've added rte_memzone_max_set() / rte_memzone_max_get(). > > Opinions, comments, volunteers? > >
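For reference, the memzone precedent mentioned above is a pre-init setter/getter pair. A minimal sketch of what the analogous memseg-list interface might look like — rte_memseg_list_max_set()/rte_memseg_list_max_get() are hypothetical names, only the memzone variants exist today:

#include <stddef.h>
#include <errno.h>

/* hypothetical API, modeled on rte_memzone_max_set()/rte_memzone_max_get() */
static size_t memseg_list_max = 128;   /* today's RTE_MAX_MEMSEG_LISTS default */
static int memseg_initialized;         /* set once EAL memory init has run */

int
rte_memseg_list_max_set(size_t max)
{
	/* must be called before rte_eal_init(), like the memzone variant */
	if (memseg_initialized)
		return -EINVAL;
	memseg_list_max = max;
	return 0;
}

size_t
rte_memseg_list_max_get(void)
{
	return memseg_list_max;
}

An application such as Seastar would call the setter before rte_eal_init(), exactly as with rte_memzone_max_set(), instead of patching the build-time constant.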
[dpdk-dev] [PATCH v3 1/5] mk: remove combined library and related options
On 04/09/2015 11:33 AM, Gonzalez Monroy, Sergio wrote: > On 08/04/2015 19:26, Stephen Hemminger wrote: >> On Wed, 8 Apr 2015 16:07:21 +0100 >> Sergio Gonzalez Monroy wrote: >> >>> Currently, the target/rules to build combined libraries is different >>> than the one to build individual libraries. >>> >>> By removing the combined library option as a build configuration option >>> we simplify the build process by having a single point for >>> linking/archiving >>> libraries in DPDK. >>> >>> This patch removes the CONFIG_RTE_BUILD_COMBINE_LIB build config option and >>> removes the makefiles associated with building a combined library. >>> >>> The CONFIG_RTE_LIBNAME config option is kept as it will be used to >>> always generate a linker script that acts as a single combined library. >>> >>> Signed-off-by: Sergio Gonzalez Monroy >>> >> No. We use the combined library and it greatly simplifies the application >> linking process. >> > After all the opposition this patch had in v2, I did explain the > current issues > (see http://dpdk.org/ml/archives/dev/2015-March/015366.html ) and this > was the agreed solution. > > As I mention in the cover letter (also see patch 2/5), building DPDK > (after applying this patch series) will always generate a very simple > linker script that behaves as a combined library. > I encourage you to apply this patch series and try to build your app > (which links against the combined lib). > Your app should build without problem unless I messed up somewhere and > it needs fixing. Is it possible to generate a pkg-config file (dpdk.pc) that contains all of the settings needed to compile and link with DPDK? That will greatly simplify usage. A linker script is just too esoteric.
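For context, the "very simple linker script" Sergio describes is a libdpdk.so that is not an ELF object at all, just a text file of linker directives pulling in the individual libraries; a sketch (the library list is abridged and illustrative):

/* libdpdk.so: expanded by the linker in place of a real shared object */
GROUP ( librte_eal.so librte_mempool.so librte_ring.so librte_mbuf.so librte_ethdev.so )

Applications keep linking with -ldpdk; the linker reads the script and adds each member as if it had been named on the command line.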
[dpdk-dev] [PATCH v3 1/5] mk: remove combined library and related options
On 04/09/2015 02:19 PM, Neil Horman wrote: > On Thu, Apr 09, 2015 at 12:06:47PM +0300, Avi Kivity wrote: >> >> On 04/09/2015 11:33 AM, Gonzalez Monroy, Sergio wrote: >>> On 08/04/2015 19:26, Stephen Hemminger wrote: >>>> On Wed, 8 Apr 2015 16:07:21 +0100 >>>> Sergio Gonzalez Monroy wrote: >>>> >>>>> Currently, the target/rules to build combined libraries is different >>>>> than the one to build individual libraries. >>>>> >>>>> By removing the combined library option as a build configuration option >>>>> we simplify the build process by having a single point for >>>>> linking/archiving >>>>> libraries in DPDK. >>>>> >>>>> This patch removes the CONFIG_RTE_BUILD_COMBINE_LIB build config option and >>>>> removes the makefiles associated with building a combined library. >>>>> >>>>> The CONFIG_RTE_LIBNAME config option is kept as it will be used to >>>>> always generate a linker script that acts as a single combined library. >>>>> >>>>> Signed-off-by: Sergio Gonzalez Monroy >>>>> >>>> No. We use the combined library and it greatly simplifies the application >>>> linking process. >>>> >>> After all the opposition this patch had in v2, I did explain the current >>> issues >>> (see http://dpdk.org/ml/archives/dev/2015-March/015366.html ) and this was >>> the agreed solution. >>> >>> As I mention in the cover letter (also see patch 2/5), building DPDK >>> (after applying this patch series) will always generate a very simple >>> linker script that behaves as a combined library. >>> I encourage you to apply this patch series and try to build your app >>> (which links against the combined lib). >>> Your app should build without problem unless I messed up somewhere and it >>> needs fixing. >> Is it possible to generate a pkg-config file (dpdk.pc) that contains all of >> the settings needed to compile and link with DPDK? That will greatly >> simplify usage. >> >> A linker script is just too esoteric. >> > Why esoteric? We're not talking about a linker script in the sense of a > binary > layout file, we're talking about a prewritten/generated libdpdk_core.so that > contains linker directives to include the appropriate libraries. You link it > just like you do any other library, but it lets you ignore how they are broken > up. You mean DT_NEEDED? That's great, but it shouldn't be called a linker script. > We could certainly do a pkg-config file, but I don't think that's any more > advantageous than this solution. It solves more problems -- cflags etc. Of course having the right DT_NEEDED is a good thing regardless.
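A dpdk.pc of the kind Avi asks for would carry compile flags as well as the link line; a hypothetical hand-written sketch (paths, version, and flags illustrative — DPDK's later meson build does generate a libdpdk.pc):

# dpdk.pc -- hypothetical example; values are illustrative
prefix=/usr/local
libdir=${prefix}/lib
includedir=${prefix}/include/dpdk

Name: dpdk
Description: Data Plane Development Kit
Version: 2.0.0
Cflags: -I${includedir} -include rte_config.h
Libs: -L${libdir} -ldpdk -lpthread -ldl

An application would then build with gcc app.c $(pkg-config --cflags --libs dpdk), which addresses the cflags problem a bare linker script cannot.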
Re: [dpdk-dev] [PATCH v15 0/8] add Tx preparation
On 01/04/2017 09:41 PM, Thomas Monjalon wrote: 2016-12-23 19:40, Tomasz Kulasek: v15 changes: - marked rte_eth_tx_prepare api as experimental - improved doxygen comments for nb_seg_max and nb_mtu_seg_max fields - removed unused "uint8_t tx_prepare" declaration from testpmd No you didn't remove this useless declaration. I did it for you. This feature is now applied! Thanks and congratulations :) Congrats and thanks! This will allow us to remove some hacks from seastar.
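The pattern the applied API establishes is prepare-then-burst; a minimal usage sketch (handle_bad_pkt() is a placeholder for application policy, and the integer types follow current DPDK):

#include <rte_ethdev.h>
#include <rte_errno.h>

/* fix up checksums/TSO metadata and validate device limits, then
 * transmit only the packets the device can actually handle */
static uint16_t
send_burst(uint16_t port_id, uint16_t queue_id,
           struct rte_mbuf **pkts, uint16_t nb_pkts)
{
	uint16_t nb_prep = rte_eth_tx_prepare(port_id, queue_id, pkts, nb_pkts);

	if (nb_prep < nb_pkts)
		/* pkts[nb_prep] is malformed for this device; rte_errno says why */
		handle_bad_pkt(pkts[nb_prep], rte_errno);

	return rte_eth_tx_burst(port_id, queue_id, pkts, nb_prep);
}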
[dpdk-dev] [PATCH] mbuf: fix incompatibility with C++ in header file
C++ doesn't allow implicit conversion from void * to another pointer type, so supply an explicit cast. Signed-off-by: Avi Kivity --- lib/librte_mbuf/rte_mbuf.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h index c3b8c98..8c2db1b 100644 --- a/lib/librte_mbuf/rte_mbuf.h +++ b/lib/librte_mbuf/rte_mbuf.h @@ -882,7 +882,7 @@ static inline uint16_t rte_pktmbuf_priv_size(struct rte_mempool *mp); static inline struct rte_mbuf * rte_mbuf_from_indirect(struct rte_mbuf *mi) { - return RTE_PTR_SUB(mi->buf_addr, sizeof(*mi) + mi->priv_size); + return (struct rte_mbuf *)RTE_PTR_SUB(mi->buf_addr, sizeof(*mi) + mi->priv_size); } /** -- 2.4.3
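The language difference in miniature — the first initialization is valid C but rejected by every C++ compiler, which is why headers meant for both need the cast:

void *p = rte_malloc(NULL, sizeof(struct rte_mbuf), 0);
struct rte_mbuf *m1 = p;                    /* C: fine; C++: invalid conversion from 'void*' */
struct rte_mbuf *m2 = (struct rte_mbuf *)p; /* compiles as both C and C++ */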
[dpdk-dev] [PATCH] mempool: fix incompatibility with C++ in header file
C++ doesn't allow implicit conversion from void * to another pointer type, so supply an explicit cast. Signed-off-by: Avi Kivity --- lib/librte_mempool/rte_mempool.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h index 075bcdf..8abeca9 100644 --- a/lib/librte_mempool/rte_mempool.h +++ b/lib/librte_mempool/rte_mempool.h @@ -268,7 +268,7 @@ struct rte_mempool { /* return the header of a mempool object (internal) */ static inline struct rte_mempool_objhdr *__mempool_get_header(void *obj) { - return RTE_PTR_SUB(obj, sizeof(struct rte_mempool_objhdr)); + return (struct rte_mempool_objhdr *)RTE_PTR_SUB(obj, sizeof(struct rte_mempool_objhdr)); } /** @@ -290,7 +290,7 @@ static inline struct rte_mempool *rte_mempool_from_obj(void *obj) static inline struct rte_mempool_objtlr *__mempool_get_trailer(void *obj) { struct rte_mempool *mp = rte_mempool_from_obj(obj); - return RTE_PTR_ADD(obj, mp->elt_size); + return (struct rte_mempool_objtlr *)RTE_PTR_ADD(obj, mp->elt_size); } /** -- 2.4.3
[dpdk-dev] [PATCH] mempool: fix incompatibility with C++ in header file
(adding list+Thomas back to cc) On 08/17/2015 11:33 AM, Olivier MATZ wrote: > Hi, > > On 08/14/2015 10:33 AM, Avi Kivity wrote: >> C++ doesn't allow implied casting from void * to another pointer, so >> supply an explicit cast. >> >> Signed-off-by: Avi Kivity > For Thomas: > This fix is already submitted in > http://dpdk.org/dev/patchwork/patch/6750/ > > > Thanks Avi
[dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598
On 08/25/2015 08:33 PM, Ananyev, Konstantin wrote: > Hi Vlad, > >> -Original Message- >> From: Vlad Zolotarov [mailto:vladz at cloudius-systems.com] >> Sent: Thursday, August 20, 2015 10:07 AM >> To: Ananyev, Konstantin; Lu, Wenzhuo >> Cc: dev at dpdk.org >> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 >> for all NICs but 82598 >> >> >> >> On 08/20/15 12:05, Vlad Zolotarov wrote: >>> >>> On 08/20/15 11:56, Vlad Zolotarov wrote: On 08/20/15 11:41, Ananyev, Konstantin wrote: > Hi Vlad, > >> -Original Message- >> From: Vlad Zolotarov [mailto:vladz at cloudius-systems.com] >> Sent: Wednesday, August 19, 2015 11:03 AM >> To: Ananyev, Konstantin; Lu, Wenzhuo >> Cc: dev at dpdk.org >> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh >> above 1 for all NICs but 82598 >> >> >> >> On 08/19/15 10:43, Ananyev, Konstantin wrote: >>> Hi Vlad, >>> Sorry for delay with review, I am OOO till next week. >>> Meanwhile, few questions/comments from me. >> Hi, Konstantin, long time no see... ;) >> This patch fixes the Tx hang we were constantly hitting with a seastar-based application on x540 NIC. >>> Could you help to share with us how to reproduce the tx hang >>> issue, with using >>> typical DPDK examples? >> Sorry. I'm not very familiar with the typical DPDK examples to >> help u >> here. However this is quite irrelevant since without this this >> patch >> ixgbe PMD obviously abuses the HW spec as has been explained >> above. >> >> We saw the issue when u stressed the xmit path with a lot of >> highly >> fragmented TCP frames (packets with up to 33 fragments with >> non-headers >> fragments as small as 4 bytes) with all offload features enabled. >>> Could you provide us with the pcap file to reproduce the issue? >> Well, the thing is it takes some time to reproduce it (a few >> minutes of >> heavy load) therefore a pcap would be quite large. > Probably you can upload it to some place, from which we will be able > to download it? I'll see what I can do but no promises... >>> On a second thought pcap file won't help u much since in order to >>> reproduce the issue u have to reproduce exactly the same structure of >>> clusters i give to HW and it's not what u see on wire in a TSO case. >> And not only in a TSO case... ;) > I understand that, but my thought was you can add some sort of TX callback > for the rte_eth_tx_burst() > into your code that would write the packet into pcap file and then re-run > your hang scenario. > I know that it means extra work for you - but I think it would be very > helpful if we would be able to reproduce your hang scenario: > - if HW guys would confirm that setting RS bit for every EOP packet is not > really required, >then we probably have to look at what else can cause it. > - it might be added to our validation cycle, to prevent hitting similar > problem in future. > Thanks > Konstantin > I think if you send packets with random fragment chains up to 32 mbufs you might see this. TSO was not required to trigger this problem.
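A reproducer along these lines can be built without a pcap: hand-assemble chains of tiny segments and push them through rte_eth_tx_burst(). A hedged sketch (pool setup and error-path cleanup omitted; header and fragment sizes illustrative):

#include <rte_mbuf.h>

/* build a packet of nb_frags tiny segments to stress the xmit path;
 * returns NULL on allocation failure (freeing the partial chain is
 * omitted for brevity) */
static struct rte_mbuf *
make_fragmented_pkt(struct rte_mempool *mp, unsigned nb_frags)
{
	struct rte_mbuf *head = rte_pktmbuf_alloc(mp);
	struct rte_mbuf *prev = head;
	unsigned i;

	if (head == NULL)
		return NULL;
	head->data_len = 14 + 20 + 20;	/* eth + ip + tcp headers */
	head->pkt_len = head->data_len;
	head->nb_segs = 1;
	for (i = 1; i < nb_frags; i++) {
		struct rte_mbuf *seg = rte_pktmbuf_alloc(mp);
		if (seg == NULL)
			return NULL;
		seg->data_len = 4;	/* 4-byte payload fragments, as in the report */
		prev->next = seg;
		head->pkt_len += seg->data_len;
		head->nb_segs++;
		prev = seg;
	}
	return head;
}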
[dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598
On 08/25/2015 10:16 PM, Zhang, Helin wrote: > >> -Original Message- >> From: Vlad Zolotarov [mailto:vladz at cloudius-systems.com] >> Sent: Tuesday, August 25, 2015 11:53 AM >> To: Zhang, Helin >> Cc: Lu, Wenzhuo; dev at dpdk.org >> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for >> all NICs but 82598 >> >> >> >> On 08/25/15 21:43, Zhang, Helin wrote: >>> Hi Vlad >>> >>> I think this could possibly be the root cause of your TX hang issue. >>> Please try to limit the number to 8 or less, and then see if the issue >>> will still be there or not? >>> >> Helin, the issue has been seen on x540 devices. Pls., see a chapter >> 7.2.1.1 of x540 devices spec: >> >> A packet (or multiple packets in transmit segmentation) can span any number >> of >> buffers (and their descriptors) up to a limit of 40 minus WTHRESH minus 2 >> (see >> Section 7.2.3.3 for Tx Ring details and section Section 7.2.3.5.1 for WTHRESH >> details). For best performance it is recommended to minimize the number of >> buffers as possible. >> >> Could u, pls., clarify why do u think that the maximum number of data >> buffers is >> limited by 8? > OK, i40e hardware is 8, so I'd assume x540 could have a similar one. Yes, in > your case, > the limit could be around 38, right? > Could you help to make sure there is no packet to be transmitted uses more > than > 38 descriptors? > I heard that there is a similar hang issue on X710 if using more than 8 > descriptors for > a single packet. I am wondering if the issue is similar on x540. > > I believe that the ixgbe Linux driver does not limit packets to 8 fragments, so apparently the hardware is capable.
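The datasheet limit quoted above reduces to a per-packet guard in the transmit path; a hedged sketch, assuming the default WTHRESH of 0 (hence 38 segments):

#include <errno.h>
#include <rte_mbuf.h>
#include <rte_errno.h>

/* 40 - WTHRESH - 2 from the x540 datasheet, section 7.2.1.1; WTHRESH assumed 0 */
#define IXGBE_TX_MAX_SEG (40 - 0 - 2)

/* hypothetical guard, applied per packet before descriptors are posted */
static inline int
tx_pkt_ok(const struct rte_mbuf *m)
{
	if (m->nb_segs > IXGBE_TX_MAX_SEG) {
		rte_errno = EINVAL;	/* or: drop and bump a stats counter */
		return 0;
	}
	return 1;
}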
[dpdk-dev] [PATCH 1/2] eal/persistent: new library to hold memory region after program exit
On 07/06/2015 04:28 PM, leeopop wrote: > Some NICs use host memory region as their scratch area. > When DPDK user applications terminate, all the memory regions are lost, > re-initialized (memzone), which causes HW faults. > This library maintains shared memory regions that are persistent across > multiple executions and terminations of user level applications. > It also manages physically contiguous memory regions. > > Signed-off-by: leeopop > Does dpdk accept anonymous signoffs? DCO usually requires a real name.
[dpdk-dev] [PATCH v1] eal: remove use of 'register' keyword
The 'register' keyword does nothing, and has been removed in C++17. Remove it for compatibility. Signed-off-by: Avi Kivity --- lib/librte_eal/common/include/arch/arm/rte_byteorder.h | 2 +- lib/librte_eal/common/include/arch/x86/rte_byteorder.h | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/lib/librte_eal/common/include/arch/arm/rte_byteorder.h b/lib/librte_eal/common/include/arch/arm/rte_byteorder.h index 0a29f4bb4..8af0a39ad 100644 --- a/lib/librte_eal/common/include/arch/arm/rte_byteorder.h +++ b/lib/librte_eal/common/include/arch/arm/rte_byteorder.h @@ -48,11 +48,11 @@ extern "C" { /* fix missing __builtin_bswap16 for gcc older then 4.8 */ #if !(__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 8)) static inline uint16_t rte_arch_bswap16(uint16_t _x) { - register uint16_t x = _x; + uint16_t x = _x; asm volatile ("rev16 %w0,%w1" : "=r" (x) : "r" (x) ); diff --git a/lib/librte_eal/common/include/arch/x86/rte_byteorder.h b/lib/librte_eal/common/include/arch/x86/rte_byteorder.h index 1b8ed5f99..56b0a31e2 100644 --- a/lib/librte_eal/common/include/arch/x86/rte_byteorder.h +++ b/lib/librte_eal/common/include/arch/x86/rte_byteorder.h @@ -22,11 +22,11 @@ extern "C" { * * Do not use this function directly. The preferred function is rte_bswap16(). */ static inline uint16_t rte_arch_bswap16(uint16_t _x) { - register uint16_t x = _x; + uint16_t x = _x; asm volatile ("xchgb %b[x1],%h[x2]" : [x1] "=Q" (x) : [x2] "0" (x) ); return x; @@ -37,11 +37,11 @@ static inline uint16_t rte_arch_bswap16(uint16_t _x) * * Do not use this function directly. The preferred function is rte_bswap32(). */ static inline uint32_t rte_arch_bswap32(uint32_t _x) { - register uint32_t x = _x; + uint32_t x = _x; asm volatile ("bswap %[x]" : [x] "+r" (x) ); return x; } -- 2.14.3
[dpdk-dev] [PATCH v1] eal: remove another use of register keyword
The 'register' keyword does nothing, and has been removed in C++17. Remove it for compatibility, like commit 0d5f2ed12f9eb. Signed-off-by: Avi Kivity --- lib/librte_eal/common/include/arch/x86/rte_byteorder_64.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/librte_eal/common/include/arch/x86/rte_byteorder_64.h b/lib/librte_eal/common/include/arch/x86/rte_byteorder_64.h index 6289404a3..8c6cf285b 100644 --- a/lib/librte_eal/common/include/arch/x86/rte_byteorder_64.h +++ b/lib/librte_eal/common/include/arch/x86/rte_byteorder_64.h @@ -18,11 +18,11 @@ * Do not use this function directly. The preferred function is rte_bswap64(). */ /* 64-bit mode */ static inline uint64_t rte_arch_bswap64(uint64_t _x) { - register uint64_t x = _x; + uint64_t x = _x; asm volatile ("bswap %[x]" : [x] "+r" (x) ); return x; } -- 2.14.3
Re: [dpdk-dev] [PATCH v1] eal: remove another use of register keyword
On 01/15/2018 08:00 PM, Thomas Monjalon wrote: 15/01/2018 12:33, Avi Kivity: The 'register' keyword does nothing, and has been removed in C++17. Remove it for compatibility, like commit 0d5f2ed12f9eb. Fixes: 0d5f2ed12f9e ("eal: remove use of register keyword") Signed-off-by: Avi Kivity Applied, thanks. Note that "register" is used in some drivers too: git grep -l '\<register\>.*;' drivers/ | sed -r 's,([^/]*/[^/]*/[^/]*/).*,\1,' | sort -u drivers/bus/dpaa/ drivers/crypto/dpaa2_sec/ drivers/crypto/qat/ drivers/event/sw/ drivers/net/ark/ drivers/net/avp/ drivers/net/bnx2x/ drivers/net/bnxt/ drivers/net/e1000/ drivers/net/i40e/ drivers/net/ixgbe/ drivers/net/qede/ drivers/net/sfc/ drivers/net/vhost/ I think those aren't a problem, since they aren't exposed to C++ programs.
Re: [dpdk-dev] [PATCH 01/38] eal: add support for 24 40 and 48 bit operations
On 06/16/2017 08:40 AM, Shreyansh Jain wrote: From: Hemant Agrawal Bit Swap and LE<=>BE conversions for 24, 40 and 48 bit width Signed-off-by: Hemant Agrawal --- .../common/include/generic/rte_byteorder.h | 78 ++ 1 file changed, 78 insertions(+) diff --git a/lib/librte_eal/common/include/generic/rte_byteorder.h b/lib/librte_eal/common/include/generic/rte_byteorder.h index e00bccb..8903ff6 100644 --- a/lib/librte_eal/common/include/generic/rte_byteorder.h +++ b/lib/librte_eal/common/include/generic/rte_byteorder.h @@ -122,6 +122,84 @@ rte_constant_bswap64(uint64_t x) ((x & 0xff00000000000000ULL) >> 56); } +/* + * An internal function to swap bytes of a 48-bit value. + */ +static inline uint64_t +rte_constant_bswap48(uint64_t x) +{ + return ((x & 0x0000000000ffULL) << 40) | + ((x & 0x00000000ff00ULL) << 24) | + ((x & 0x000000ff0000ULL) << 8) | + ((x & 0x0000ff000000ULL) >> 8) | + ((x & 0x00ff00000000ULL) >> 24) | + ((x & 0xff0000000000ULL) >> 40); +} + Won't something like bswap64(x << 16) be much more efficient? Two instructions for the non-constant case, compared to 15-20 here.
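Avi's alternative in code — shift the 48-bit value into the top six bytes and let the full 64-bit swap do the work (a sketch; the function name is hypothetical and rte_bswap64() is assumed from rte_byteorder.h):

static inline uint64_t
rte_bswap48_shift(uint64_t x)
{
	/* x << 16 puts the 48-bit value in bytes 2..7; bswap64 reverses
	 * all eight bytes, so the two zero bytes land on top and the
	 * value's bytes come out reversed in bytes 0..5 */
	return rte_bswap64(x << 16);
}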
[dpdk-dev] [PATCH 1/3] kcp: add kernel control path kernel module
On 01/27/2016 06:24 PM, Ferruh Yigit wrote: > This kernel module is based on KNI module, but this one is stripped > version of it and only for control messages, no data transfer > functionality provided. > > This Linux kernel module helps userspace application create virtual > interfaces and when a control command issued into that virtual > interface, module pushes the command to the userspace and gets the > response back for the caller application. > > The Linux tools like ethtool/ifconfig/ip can be used on virtual > interfaces but not ones for related data, like tcpdump. > > In long term this patch intends to replace the KNI and KNI will be > depreciated. Instead of adding yet another out-of-tree kernel module, why not extend the existing in-tree tap driver? This will make everyone's life easier. Since tap also supports data transfer, an application can also forward packets not intended to it to the kernel, and forward packets from the kernel through the device. > Signed-off-by: Ferruh Yigit > --- > config/common_linuxapp | 6 + > lib/librte_eal/linuxapp/Makefile | 5 +- > lib/librte_eal/linuxapp/eal/Makefile | 3 +- > .../linuxapp/eal/include/exec-env/rte_kcp_common.h | 86 +++ > lib/librte_eal/linuxapp/kcp/Makefile | 58 + > lib/librte_eal/linuxapp/kcp/kcp_dev.h | 65 + > lib/librte_eal/linuxapp/kcp/kcp_ethtool.c | 261 +++ > lib/librte_eal/linuxapp/kcp/kcp_misc.c | 282 > + > lib/librte_eal/linuxapp/kcp/kcp_net.c | 209 +++ > lib/librte_eal/linuxapp/kcp/kcp_nl.c | 194 ++ > 10 files changed, 1167 insertions(+), 2 deletions(-) > create mode 100644 > lib/librte_eal/linuxapp/eal/include/exec-env/rte_kcp_common.h > create mode 100644 lib/librte_eal/linuxapp/kcp/Makefile > create mode 100644 lib/librte_eal/linuxapp/kcp/kcp_dev.h > create mode 100644 lib/librte_eal/linuxapp/kcp/kcp_ethtool.c > create mode 100644 lib/librte_eal/linuxapp/kcp/kcp_misc.c > create mode 100644 lib/librte_eal/linuxapp/kcp/kcp_net.c > create mode 100644 lib/librte_eal/linuxapp/kcp/kcp_nl.c > > diff --git a/config/common_linuxapp b/config/common_linuxapp > index 74bc515..5d5e3e4 100644 > --- a/config/common_linuxapp > +++ b/config/common_linuxapp > @@ -503,6 +503,12 @@ CONFIG_RTE_KNI_VHOST_DEBUG_RX=n > CONFIG_RTE_KNI_VHOST_DEBUG_TX=n > > # > +# Compile librte_ctrl_if > +# > +CONFIG_RTE_KCP_KMOD=y > +CONFIG_RTE_KCP_KO_DEBUG=n > + > +# > # Compile vhost library > # fuse-devel is needed to run vhost-cuse. > # fuse-devel enables user space char driver development > diff --git a/lib/librte_eal/linuxapp/Makefile > b/lib/librte_eal/linuxapp/Makefile > index d9c5233..d1fa3a3 100644 > --- a/lib/librte_eal/linuxapp/Makefile > +++ b/lib/librte_eal/linuxapp/Makefile > @@ -1,6 +1,6 @@ > # BSD LICENSE > # > -# Copyright(c) 2010-2014 Intel Corporation. All rights reserved. > +# Copyright(c) 2010-2016 Intel Corporation. All rights reserved. > # All rights reserved. 
> # > # Redistribution and use in source and binary forms, with or without > @@ -38,6 +38,9 @@ DIRS-$(CONFIG_RTE_LIBRTE_EAL_LINUXAPP) += eal > ifeq ($(CONFIG_RTE_KNI_KMOD),y) > DIRS-$(CONFIG_RTE_LIBRTE_EAL_LINUXAPP) += kni > endif > +ifeq ($(CONFIG_RTE_KCP_KMOD),y) > +DIRS-$(CONFIG_RTE_LIBRTE_EAL_LINUXAPP) += kcp > +endif > ifeq ($(CONFIG_RTE_LIBRTE_XEN_DOM0),y) > DIRS-$(CONFIG_RTE_LIBRTE_EAL_LINUXAPP) += xen_dom0 > endif > diff --git a/lib/librte_eal/linuxapp/eal/Makefile > b/lib/librte_eal/linuxapp/eal/Makefile > index 26eced5..dded8cb 100644 > --- a/lib/librte_eal/linuxapp/eal/Makefile > +++ b/lib/librte_eal/linuxapp/eal/Makefile > @@ -1,6 +1,6 @@ > # BSD LICENSE > # > -# Copyright(c) 2010-2015 Intel Corporation. All rights reserved. > +# Copyright(c) 2010-2016 Intel Corporation. All rights reserved. > # All rights reserved. > # > # Redistribution and use in source and binary forms, with or without > @@ -116,6 +116,7 @@ CFLAGS_eal_thread.o += -Wno-return-type > endif > > INC := rte_interrupts.h rte_kni_common.h rte_dom0_common.h > +INC += rte_kcp_common.h > > SYMLINK-$(CONFIG_RTE_LIBRTE_EAL_LINUXAPP)-include/exec-env := \ > $(addprefix include/exec-env/,$(INC)) > diff --git a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kcp_common.h > b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kcp_common.h > new file mode 100644 > index 000..b3a6ee3 > --- /dev/null > +++ b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kcp_common.h > @@ -0,0 +1,86 @@ > +/*- > + * This file is provided under a dual BSD/LGPLv2 license. When using or > + * redistributing this file, you may do so under either license. > + * > + * GNU LESSER GENERAL PUBLIC LICENSE > + * > + * Copyright(c) 2016 Intel Corporation. All rights reserved. > + * > + * This program is fr
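For comparison, the in-tree tap driver already hands userspace a virtual interface with a data path in a few lines — the standard TUNSETIFF sequence, sketched here with minimal error handling:

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/if.h>
#include <linux/if_tun.h>

/* create (or attach to) a tap interface with the given name; returns an
 * fd whose read()/write() carry raw ethernet frames to/from the kernel */
static int
tap_open(const char *name)
{
	struct ifreq ifr;
	int fd = open("/dev/net/tun", O_RDWR);

	if (fd < 0)
		return -1;
	memset(&ifr, 0, sizeof(ifr));
	ifr.ifr_flags = IFF_TAP | IFF_NO_PI;	/* ethernet frames, no extra header */
	strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
	if (ioctl(fd, TUNSETIFF, &ifr) < 0)
		return -1;
	return fd;
}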
[dpdk-dev] [PATCH 1/3] kcp: add kernel control path kernel module
On 02/28/2016 10:16 PM, Ferruh Yigit wrote: > On 2/28/2016 3:34 PM, Avi Kivity wrote: >> On 01/27/2016 06:24 PM, Ferruh Yigit wrote: >>> This kernel module is based on KNI module, but this one is stripped >>> version of it and only for control messages, no data transfer >>> functionality provided. >>> >>> This Linux kernel module helps userspace application create virtual >>> interfaces and when a control command issued into that virtual >>> interface, module pushes the command to the userspace and gets the >>> response back for the caller application. >>> >>> The Linux tools like ethtool/ifconfig/ip can be used on virtual >>> interfaces but not ones for related data, like tcpdump. >>> >>> In long term this patch intends to replace the KNI and KNI will be >>> depreciated. >> Instead of adding yet another out-of-tree kernel module, why not extend >> the existing in-tree tap driver? This will make everyone's life easier. >> >> Since tap also supports data transfer, an application can also forward >> packets not intended to it to the kernel, and forward packets from the >> kernel through the device. >> > Hi Avi, > > KDP (Kernel Data Path) does what you have described, it is implemented > as PMD and it benefits from tap driver to data transfer through the > kernel. It also support custom kernel module for better performance. > > For KCP (Kernel Control Path), network driver forwards control commands > to the userspace driver, I doubt this is something wanted for tun/tap > driver, so extending tun/tap driver like this can be hard to upstream. Have you tried asking? Maybe if you explain it they will be open to the extension. Certainly it will be better to have KCP and KDP use the same kernel interface name; so we'll need to either add data path support to kcp (causing duplication with tap), or add control path support to tap. I think the latter is preferable. > We are investigating about adding a native support to Linux kernel for > KCP, but there is no task started for this right now, any support is > welcome. > >
[dpdk-dev] [PATCH 1/3] kcp: add kernel control path kernel module
On 02/29/2016 12:43 PM, Ferruh Yigit wrote: > On 2/29/2016 9:43 AM, Avi Kivity wrote: >> On 02/28/2016 10:16 PM, Ferruh Yigit wrote: >>> On 2/28/2016 3:34 PM, Avi Kivity wrote: >>>> On 01/27/2016 06:24 PM, Ferruh Yigit wrote: >>>>> This kernel module is based on KNI module, but this one is stripped >>>>> version of it and only for control messages, no data transfer >>>>> functionality provided. >>>>> >>>>> This Linux kernel module helps userspace application create virtual >>>>> interfaces and when a control command issued into that virtual >>>>> interface, module pushes the command to the userspace and gets the >>>>> response back for the caller application. >>>>> >>>>> The Linux tools like ethtool/ifconfig/ip can be used on virtual >>>>> interfaces but not ones for related data, like tcpdump. >>>>> >>>>> In long term this patch intends to replace the KNI and KNI will be >>>>> depreciated. >>>> Instead of adding yet another out-of-tree kernel module, why not extend >>>> the existing in-tree tap driver? This will make everyone's life easier. >>>> >>>> Since tap also supports data transfer, an application can also forward >>>> packets not intended to it to the kernel, and forward packets from the >>>> kernel through the device. >>>> >>> Hi Avi, >>> >>> KDP (Kernel Data Path) does what you have described, it is implemented >>> as PMD and it benefits from tap driver to data transfer through the >>> kernel. It also support custom kernel module for better performance. >>> >>> For KCP (Kernel Control Path), network driver forwards control commands >>> to the userspace driver, I doubt this is something wanted for tun/tap >>> driver, so extending tun/tap driver like this can be hard to upstream. >> Have you tried asking? Maybe if you explain it they will be open to the >> extension. >> > Not communicated but tun/tap already doing something different. > For KCP, created interface is map of the DPDK port. All data interface > shows coming from DPDK port. For example if you get stats information > with ifconfig, the values you observe are DPDK port statistics -not > statistics of data between userspace and kernelspace, statistics of data > forwarded between DPDK ports. If you down the interface, DPDK port > stopped, etc... > > If you extend the tun/tap, it won't be map of the DPDK port, and if you > get statistics information from that interface, what do you expect to > see, the data transferred between kernel and userspace, or underlying > DPDK port forwarding statistics? Good point. But you really have to involve netdev on this, or you'll live out-of-tree forever. > Extending tun/tap in a way we want, forwarding all control commands to > userspace, will break the current tun/tap, this doesn't looks like a > valid option to me. It's possible to enhance it while preserving backwards compatibility, by enabling a feature flag (statistics from userspace). > For data path, using tun/tap is OK and we are already doing it, for the > control path I believe we need a new driver. > >> Certainly it will be better to have KCP and KDP use the same kernel >> interface name; so we'll need to either add data path support to kcp >> (causing duplication with tap), or add control path support to tap. I >> think the latter is preferable. >> > Why it is better to have same interface? Anyone who is not interested > with kernel data path may want to control DPDK ports using common tools, > or want to get some basic information and stats using ethtool or > ifconfig. Why we need to bind two different functionality together? 
Having two interfaces will be confusing for the user. If I wish to firewall data packets coming from the dpdk port, do I set firewall rules on dpdk0 or tap0? I don't think it matters whether you extend tap, or add a data path to kcp, but if you want to upstream it, it needs to be blessed by netdev. > >>> We are investigating about adding a native support to Linux kernel for >>> KCP, but there is no task started for this right now, any support is >>> welcome. >>> >>>
[dpdk-dev] [PATCH 1/3] kcp: add kernel control path kernel module
On 02/29/2016 01:27 PM, Ferruh Yigit wrote: > On 2/29/2016 10:58 AM, Avi Kivity wrote: >> >> On 02/29/2016 12:43 PM, Ferruh Yigit wrote: >>> On 2/29/2016 9:43 AM, Avi Kivity wrote: >>>> On 02/28/2016 10:16 PM, Ferruh Yigit wrote: >>>>> On 2/28/2016 3:34 PM, Avi Kivity wrote: >>>>>> On 01/27/2016 06:24 PM, Ferruh Yigit wrote: >>>>>>> This kernel module is based on KNI module, but this one is stripped >>>>>>> version of it and only for control messages, no data transfer >>>>>>> functionality provided. >>>>>>> >>>>>>> This Linux kernel module helps userspace application create virtual >>>>>>> interfaces and when a control command issued into that virtual >>>>>>> interface, module pushes the command to the userspace and gets the >>>>>>> response back for the caller application. >>>>>>> >>>>>>> The Linux tools like ethtool/ifconfig/ip can be used on virtual >>>>>>> interfaces but not ones for related data, like tcpdump. >>>>>>> >>>>>>> In long term this patch intends to replace the KNI and KNI will be >>>>>>> depreciated. >>>>>> Instead of adding yet another out-of-tree kernel module, why not >>>>>> extend >>>>>> the existing in-tree tap driver? This will make everyone's life >>>>>> easier. >>>>>> >>>>>> Since tap also supports data transfer, an application can also forward >>>>>> packets not intended to it to the kernel, and forward packets from the >>>>>> kernel through the device. >>>>>> >>>>> Hi Avi, >>>>> >>>>> KDP (Kernel Data Path) does what you have described, it is implemented >>>>> as PMD and it benefits from tap driver to data transfer through the >>>>> kernel. It also support custom kernel module for better performance. >>>>> >>>>> For KCP (Kernel Control Path), network driver forwards control commands >>>>> to the userspace driver, I doubt this is something wanted for tun/tap >>>>> driver, so extending tun/tap driver like this can be hard to upstream. >>>> Have you tried asking? Maybe if you explain it they will be open to the >>>> extension. >>>> >>> Not communicated but tun/tap already doing something different. >>> For KCP, created interface is map of the DPDK port. All data interface >>> shows coming from DPDK port. For example if you get stats information >>> with ifconfig, the values you observe are DPDK port statistics -not >>> statistics of data between userspace and kernelspace, statistics of data >>> forwarded between DPDK ports. If you down the interface, DPDK port >>> stopped, etc... >>> >>> If you extend the tun/tap, it won't be map of the DPDK port, and if you >>> get statistics information from that interface, what do you expect to >>> see, the data transferred between kernel and userspace, or underlying >>> DPDK port forwarding statistics? >> Good point. But you really have to involve netdev on this, or you'll >> live out-of-tree forever. >> > Why do we need to touch netdev? By netdev, I meant the mailing list. If you don't touch it, your driver will remain out-of-tree forever. > A simple network driver, similar to kcp, can be solution. > > This driver implements all net_device_ops and ethtool_ops in a way to > forward everything to the userspace via netlink. All needs to know about > userspace driver is it's unique id. Any userspace application, not only > DPDK drivers, can listen the netlink messages and response to the > requests come to itself. > > This kind of driver is not big or complicated, kcp already does %90 of > what described above. I am not arguing against kcp. It fulfills an important need. This is my argument: 1. 
having multiple interfaces for the control and data path is bad for the user 2. therefore, we need to either add tap functionality to kcp, or add kcp functionality to tap 3. netdev@ is more likely (IMO) to accept additional functionality to tap than a new driver, but the only way to know is to engage with them > >>> Extending tun/tap in a way we want, forwarding all control commands to >>> userspace, will break the current tun/tap, this doesn't look like a >>> valid option to me. >> It's possible to enhance it while preserving backwards compatibility, by >> enabling a feature flag (statistics from userspace).
[dpdk-dev] Appropriate DPDK data structures for TCP sockets
On 02/23/2015 11:16 PM, Matthew Hall wrote: > On Mon, Feb 23, 2015 at 08:48:57AM -0600, Matt Laswell wrote: >> Apologies in advance for likely being a bit long-winded. > Long winded is great, helps me get context. > >> First, you really need to take cache performance into account when you're >> choosing a data structure. Something like a balanced tree can seem awfully >> appealing at first blush > Agreed. I did some amount of DPDK stuff before but without TCP. This is why I > was figuring a packet-hash is better than a tree. > >> Second, rather than synchronizing (perhaps with locks, perhaps with >> lockless data structures), it's often beneficial to create multiple >> threads, each of which holds a fraction of your connection tracking data. > Yes, I REALLY REALLY REALLY wanted to do RSS. But the virtio-net and other > VM's don't support RSS, unlike the classic PCIe NIC's. In order to get the > community to use my app I have to give them a "batteries included" > environment, where the system can still work even with no RSS. For an example of a tcp stack on top of dpdk please see seastar [1]. It supports hardware RSS, software RSS, or a combination (if the number of hardware queues is smaller than the number of cores). >> Third, it's very worthwhile to have a cache for the most recently accessed >> connection. First, because network traffic is bursty, and you'll >> frequently see multiple packets from the same connection in succession. >> Second, because it can make life easier for your application code. If you >> have multiple places that need to access connection data, you don't have to >> worry so much about the cost of repeated searches. Again, this may or may >> not matter for your particular application. But for ones I've worked on, >> it's been a win. > Yes, this sounds like a really good idea. One advantage in my product, I am > only doing TCP Syslog, so I don't have an arbitrary zillion connections like > FW or IPS would want. I could cap it at something like 8192 or 16384 and be > good enough for some time until a better solution is worked out. > > I could make some capped array or linked list of the X most recent ones for > cheap access. It's just socket pointers so it doesn't hardly cost anything to > copy a couple pointers into a cache and quickly invalidate when the connection > closes. A simple per-core hash table is sufficient in our experience. Yes, you will take a cache miss, but it's not the end of the world. [1] https://github.com/cloudius-systems/seastar
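A hedged sketch of the per-core table described above, keyed on the 5-tuple with DPDK's rte_hash (one table per lcore, so no locking on the fast path; struct conn and the capacity are placeholders):

#include <stdio.h>
#include <rte_hash.h>
#include <rte_jhash.h>
#include <rte_lcore.h>

struct flow_key {		/* 5-tuple */
	uint32_t src_ip, dst_ip;
	uint16_t src_port, dst_port;
	uint8_t proto;
} __attribute__((packed));

/* one table per lcore: no locks, no atomic RMW operations */
static struct rte_hash *conn_tab[RTE_MAX_LCORE];

static int
conn_tab_init(unsigned lcore)
{
	char name[32];
	struct rte_hash_parameters params = {
		.entries = 16384,	/* capped, as discussed above */
		.key_len = sizeof(struct flow_key),
		.hash_func = rte_jhash,
		.socket_id = rte_lcore_to_socket_id(lcore),
	};

	snprintf(name, sizeof(name), "conns_%u", lcore);
	params.name = name;
	conn_tab[lcore] = rte_hash_create(&params);
	return conn_tab[lcore] != NULL ? 0 : -1;
}

/* lookup on the owning core; struct conn is application-defined */
static struct conn *
conn_lookup(const struct flow_key *key)
{
	void *data;

	if (rte_hash_lookup_data(conn_tab[rte_lcore_id()], key, &data) < 0)
		return NULL;
	return data;
}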
[dpdk-dev] [PATCH v1 5/5] ixgbe: Add LRO support
On 03/04/2015 02:33 AM, Stephen Hemminger wrote: > On Tue, 3 Mar 2015 21:48:43 +0200 > Vlad Zolotarov wrote: > >> + * TODO: >> + *- Get rid of "volatile" crap and let the compiler do its >> + * job. >> + *- Use the proper memory barrier (rte_rmb()) to ensure the >> + * memory ordering below. > This comment screams "this is broken". > Why not get proper architecture independent barriers in DPDK first. C11 has architecture-independent memory barriers, so this can be as simple as -std=gnu11 (the default in gcc 5, anyway). Not only do we get the barriers for free, but they are also properly integrated with the compiler: for example, a release barrier won't stop the compiler from hoisting later accesses to before the store, or cause spurious reloads, the way an asm memory clobber does.
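What that looks like in C11: an acquire load on the descriptor status orders the subsequent payload reads, with no volatile and no compiler-wide clobber. A sketch — the struct layout is illustrative, not the real ixgbe descriptor:

#include <stdatomic.h>
#include <stdint.h>

struct rx_desc {
	uint64_t addr;
	_Atomic uint32_t status;	/* device sets DD when the slot is done */
	uint32_t length;
};

#define DD_BIT 0x1u

/* returns nonzero when the descriptor (and the data it points to) is
 * safe to read; the acquire load orders all subsequent reads after it,
 * without the blanket clobber an asm barrier would impose */
static inline int
desc_done(const struct rx_desc *d)
{
	return atomic_load_explicit(&d->status, memory_order_acquire) & DD_BIT;
}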
[dpdk-dev] Beyond DPDK 2.0
On Wed, Apr 22, 2015 at 6:11 PM, O'Driscoll, Tim wrote: > Does anybody have any input or comments on this? > > > > -Original Message- > > From: O'Driscoll, Tim > > Sent: Thursday, April 16, 2015 11:39 AM > > To: dev at dpdk.org > > Subject: Beyond DPDK 2.0 > > > > Following the launch of DPDK by Intel as an internal development > > project, the launch of dpdk.org by 6WIND in 2013, and the first DPDK RPM > > packages for Fedora in 2014, 6WIND, Red Hat and Intel would like to > > prepare for future releases after DPDK 2.0 by starting a discussion on > > its evolution. Anyone is welcome to join this initiative. > > > > Since then, the project has grown significantly: > > -The number of commits and mailing list posts has increased > > steadily. > > -Support has been added for a wide range of new NICs (Mellanox > > support submitted by 6WIND, Cisco VIC, Intel i40e and fm10k etc.). > > -DPDK is now supported on multiple architectures (IBM Power support > > in DPDK 1.8, Tile support submitted by EZchip but not yet reviewed or > > applied). > > > > While this is great progress, we need to make sure that the project is > > structured in a way that enables it to continue to grow. To achieve > > this, 6WIND, Red Hat and Intel would like to start a discussion about > > the future of the project, so that we can agree and establish processes > > that satisfy the needs of the current and future DPDK community. > > > > We're very interested in hearing the views of everybody in the > > community. In addition to debate on the mailing list, we'll also > > schedule community calls to discuss this. > > > > > > Project Goals > > - > > > > Some topics to be considered for the DPDK project include: > > -Project Charter: The charter of the DPDK project should be clearly > > defined, and should explain the limits of DPDK (what it does and does > > not cover). This does not mean that we would be stuck with a singular > > charter for all time, but the direction and intent of the project should > > be well understood. > One problem we've seen with dpdk is that it is a framework, not a library: it wants to create threads, manage memory, and generally take over. This is a problem for us, as we are writing a framework (seastar, [1]) and need to create threads, manage memory, and generally take over ourselves. Perhaps dpdk can be split into two layers, a library layer that only provides mechanisms, and a framework layer that glues together those mechanisms and applies a policy, trading in generality for ease of use. [1] http://seastar-project.org
[dpdk-dev] Beyond DPDK 2.0
On 05/07/2015 06:27 PM, Wiles, Keith wrote: > > On 5/7/15, 7:02 AM, "Avi Kivity" wrote: > >> On Wed, Apr 22, 2015 at 6:11 PM, O'Driscoll, Tim >> >> wrote: >> >>> Does anybody have any input or comments on this? >>> >>> >>>> -Original Message- >>>> From: O'Driscoll, Tim >>>> Sent: Thursday, April 16, 2015 11:39 AM >>>> To: dev at dpdk.org >>>> Subject: Beyond DPDK 2.0 >>>> >>>> Following the launch of DPDK by Intel as an internal development >>>> project, the launch of dpdk.org by 6WIND in 2013, and the first DPDK >>> RPM >>>> packages for Fedora in 2014, 6WIND, Red Hat and Intel would like to >>>> prepare for future releases after DPDK 2.0 by starting a discussion on >>>> its evolution. Anyone is welcome to join this initiative. >>>> >>>> Since then, the project has grown significantly: >>>> -The number of commits and mailing list posts has increased >>>> steadily. >>>> -Support has been added for a wide range of new NICs (Mellanox >>>> support submitted by 6WIND, Cisco VIC, Intel i40e and fm10k etc.). >>>> -DPDK is now supported on multiple architectures (IBM Power >>> support >>>> in DPDK 1.8, Tile support submitted by EZchip but not yet reviewed or >>>> applied). >>>> >>>> While this is great progress, we need to make sure that the project is >>>> structured in a way that enables it to continue to grow. To achieve >>>> this, 6WIND, Red Hat and Intel would like to start a discussion about >>>> the future of the project, so that we can agree and establish >>> processes >>>> that satisfy the needs of the current and future DPDK community. >>>> >>>> We're very interested in hearing the views of everybody in the >>>> community. In addition to debate on the mailing list, we'll also >>>> schedule community calls to discuss this. >>>> >>>> >>>> Project Goals >>>> - >>>> >>>> Some topics to be considered for the DPDK project include: >>>> -Project Charter: The charter of the DPDK project should be >>> clearly >>>> defined, and should explain the limits of DPDK (what it does and does >>>> not cover). This does not mean that we would be stuck with a singular >>>> charter for all time, but the direction and intent of the project >>> should >>>> be well understood. >> >> One problem we've seen with dpdk is that it is a framework, not a library: >> it wants to create threads, manage memory, and generally take over. This >> is a problem for us, as we are writing a framework (seastar, [1]) and need >> to create threads, manage memory, and generally take over ourselves. >> >> Perhaps dpdk can be split into two layers, a library layer that only >> provides mechanisms, and a framework layer that glues together those >> mechanisms and applies a policy, trading in generality for ease of use. > The DPDK system is somewhat divided now between the EAL, PMDS and utility > functions like malloc/rings/? > > The problem I see is the PMDs need a framework to be usable and the EAL > plus the ethdev layers provide that support today. Setting up and > initializing the DPDK system is pretty clean just call the EAL init > routines along with the pool creates and the basic configs for the > PMDs/hardware. Once the system is inited one can create new threads and > not requiring anyone to use DPDK launch routines. Maybe I am not > understanding your needs can you explain more? An initialization routine that accepts argc/argv can hardly be called clean. In seastar, we have our own malloc() (since seastar is sharded we can provide a faster thread-unsafe malloc implementation). 
We also have our own threading, and since dpdk is an optional component in seastar, dpdk support requires code duplication. I would like to launch my own threads, pin them where I like, and call PMD drivers to send and receive packets. Practically everything else that dpdk does gets in my way, including mbuf pools. I'd much prefer to allocate mbufs myself. >> [1] http://seastar-project.org
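Concretely, the usage model being asked for looks something like this sketch — pin a plain pthread and drive the PMD directly (assumes the port and queue were configured elsewhere, glosses over the parts of DPDK that still expect EAL-managed lcores, and process() is application-defined):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <rte_ethdev.h>

struct rxq { uint16_t port, queue; unsigned cpu; };

static void *
rx_thread(void *arg)
{
	struct rxq *q = arg;
	struct rte_mbuf *pkts[32];
	cpu_set_t set;

	/* pin ourselves: no rte_eal_remote_launch() involved */
	CPU_ZERO(&set);
	CPU_SET(q->cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	for (;;) {
		uint16_t n = rte_eth_rx_burst(q->port, q->queue, pkts, 32);
		process(pkts, n);	/* application-defined */
	}
	return NULL;
}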
[dpdk-dev] Beyond DPDK 2.0
On 05/07/2015 06:49 PM, Wiles, Keith wrote: > > On 5/7/15, 8:33 AM, "Avi Kivity" wrote: > >> On 05/07/2015 06:27 PM, Wiles, Keith wrote: >>> On 5/7/15, 7:02 AM, "Avi Kivity" wrote: >>> >>>> On Wed, Apr 22, 2015 at 6:11 PM, O'Driscoll, Tim >>>> >>>> wrote: >>>> >>>>> Does anybody have any input or comments on this? >>>>> >>>>> >>>>>> -Original Message- >>>>>> From: O'Driscoll, Tim >>>>>> Sent: Thursday, April 16, 2015 11:39 AM >>>>>> To: dev at dpdk.org >>>>>> Subject: Beyond DPDK 2.0 >>>>>> >>>>>> Following the launch of DPDK by Intel as an internal development >>>>>> project, the launch of dpdk.org by 6WIND in 2013, and the first DPDK >>>>> RPM >>>>>> packages for Fedora in 2014, 6WIND, Red Hat and Intel would like to >>>>>> prepare for future releases after DPDK 2.0 by starting a discussion >>>>>> on >>>>>> its evolution. Anyone is welcome to join this initiative. >>>>>> >>>>>> Since then, the project has grown significantly: >>>>>> -The number of commits and mailing list posts has increased >>>>>> steadily. >>>>>> -Support has been added for a wide range of new NICs (Mellanox >>>>>> support submitted by 6WIND, Cisco VIC, Intel i40e and fm10k etc.). >>>>>> -DPDK is now supported on multiple architectures (IBM Power >>>>> support >>>>>> in DPDK 1.8, Tile support submitted by EZchip but not yet reviewed or >>>>>> applied). >>>>>> >>>>>> While this is great progress, we need to make sure that the project >>>>>> is >>>>>> structured in a way that enables it to continue to grow. To achieve >>>>>> this, 6WIND, Red Hat and Intel would like to start a discussion about >>>>>> the future of the project, so that we can agree and establish >>>>> processes >>>>>> that satisfy the needs of the current and future DPDK community. >>>>>> >>>>>> We're very interested in hearing the views of everybody in the >>>>>> community. In addition to debate on the mailing list, we'll also >>>>>> schedule community calls to discuss this. >>>>>> >>>>>> >>>>>> Project Goals >>>>>> - >>>>>> >>>>>> Some topics to be considered for the DPDK project include: >>>>>> -Project Charter: The charter of the DPDK project should be >>>>> clearly >>>>>> defined, and should explain the limits of DPDK (what it does and does >>>>>> not cover). This does not mean that we would be stuck with a singular >>>>>> charter for all time, but the direction and intent of the project >>>>> should >>>>>> be well understood. >>>> One problem we've seen with dpdk is that it is a framework, not a >>>> library: >>>> it wants to create threads, manage memory, and generally take over. >>>> This >>>> is a problem for us, as we are writing a framework (seastar, [1]) and >>>> need >>>> to create threads, manage memory, and generally take over ourselves. >>>> >>>> Perhaps dpdk can be split into two layers, a library layer that only >>>> provides mechanisms, and a framework layer that glues together those >>>> mechanisms and applies a policy, trading in generality for ease of use. >>> The DPDK system is somewhat divided now between the EAL, PMDS and >>> utility >>> functions like malloc/rings/? >>> >>> The problem I see is the PMDs need a framework to be usable and the EAL >>> plus the ethdev layers provide that support today. Setting up and >>> initializing the DPDK system is pretty clean just call the EAL init >>> routines along with the pool creates and the basic configs for the >>> PMDs/hardware. Once the system is inited one can create new threads and >>> not requiring anyone to use DPDK launch routines. 
Maybe I am not >>> understanding your needs can you explain more? >> An initialization routine that accepts argc/argv can hardly be called >> clean. > You want a config file or structure
[dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
On 10/06/2015 10:33 AM, Stephen Hemminger wrote: > Other than implementation objections, so far the two main arguments > against this reduce to: >1. If you allow UIO ioctl then it opens an API hook for all the crap out > of tree UIO drivers to do what they want. >2. If you allow UIO MSI-X then you are expanding the usage of userspace > device access in an insecure manner. > > Another alternative which I explored was making a version of VFIO that > works without IOMMU. It solves #1 but actually increases the likely negative > response to arguent #2. This would keep same API, and avoid having to > modify UIO. But we would still have the same (if not more resistance) > from IOMMU developers who believe all systems have to be secure against > root. vfio's charter was explicitly aiming for modern setups with iommus. This could be revisited, but I agree it will have even more resistance, justified IMO. btw, (2) doesn't really add any insecurity. The user could already poke at the msix tables (as well as perform DMA); they just couldn't get a useful interrupt out of them. Maybe a module parameter "allow_insecure_dma" can be added to uio_pci_generic. Without the parameter, bus mastering and msix is disabled, with the parameter it is allowed. This requires the sysadmin to take a positive step in order to make use of their hardware.
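The proposed knob is a one-liner in kernel C; a hypothetical sketch of how uio_pci_generic might expose it (no such parameter exists upstream):

#include <linux/module.h>
#include <linux/pci.h>

/* hypothetical knob for uio_pci_generic */
static bool allow_insecure_dma;
module_param(allow_insecure_dma, bool, 0444);
MODULE_PARM_DESC(allow_insecure_dma,
	"Allow bus mastering and MSI-X without an IOMMU (unsafe)");

/* ...then, wherever DMA-capable features would be enabled: */
static int maybe_enable_dma(struct pci_dev *pdev)
{
	if (!allow_insecure_dma)
		return -EPERM;		/* default: refuse, as today */
	pci_set_master(pdev);		/* the sysadmin opted in explicitly */
	return 0;
}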
[dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
On 10/06/2015 05:07 PM, Michael S. Tsirkin wrote: > On Tue, Oct 06, 2015 at 03:15:57PM +0300, Avi Kivity wrote: >> btw, (2) doesn't really add any insecurity. The user could already poke at >> the msix tables (as well as perform DMA); they just couldn't get a useful >> interrupt out of them. > Poking at msix tables won't cause memory corruption unless msix and bus > mastering is enabled. It's a given that bus mastering is enabled. It's true that msix is unlikely to be enabled, unless msix support is added. >It's true root can enable msix and bus mastering > through sysfs - but that's easy to block or detect. Even if you don't > buy a security story, it seems less likely to trigger as a result > of a userspace bug. If you're doing DMA, that's the least of your worries. Still, zero-mapping the msix space seems reasonable, and can protect userspace from silly stuff. It can't be considered to have anything to do with security though, as long as users can simply DMA to every bit of RAM in the system they want to.
[dpdk-dev] Network Stack discussion notes from 2015 DPDK Userspace
On 10/10/2015 02:19 AM, Wiles, Keith wrote:
> Here are some notes from the DPDK Network Stack discussion, as best I can remember; please help me fill in anything I missed.
>
> Items I remember we talked about:
>
> * The only reason for a DPDK TCP/IP stack is performance and possibly lower latency
>   * Meaning the developer is willing to re-write or write his application to get the best performance.
> * A TCP/IPv4/v6 stack is the minimum stack we need to support applications linked with DPDK.
>   * SCTP is another protocol that may be required
>   * TCP is the primary protocol, the usage model for most use cases
>   * The stack must be able to terminate TCP traffic to an application linked to DPDK
> * For DPDK the customer is looking for fast applications and is willing to write the application just for the DPDK network stack
>   * Converting an existing application could be done, but the design is for performance and may require a lot of changes to an application
>   * Using an application API that is not Socket is fine for high performance and may be the only way we get the best performance.
>   * Need to supply a socket layer interface as an option if the customer is willing to take a performance hit instead of rewriting the application
> * Native application acceleration is desired, but not required when using the DPDK network stack
> * We have two projects related to a network stack in DPDK
>   * The first one is porting some TCP/IP stack to DPDK; it needs to give a reasonable performance increase over native Linux applications
>     * The stack code needs to be BSD/MIT-like licensed (open sourced)
>     * The stack should be up to date with the latest RFCs, or at least close
>     * A stack could be written for DPDK (not using an existing code base) and its environment for best performance
>     * Need to be able to configure the DPDK stack(s) from the Linux command line tools if possible
>     * Need a DPDK-specific application layer API for applications to interface with the network stack
>     * Could have a socket layer API on top of the specific API for applications needing to use sockets (not expected to be the best performance)
>   * The second item is figuring out a new IPC for East/West traffic within the same system.
>     * The design needs to improve performance between applications and be transparent to the application when the remote end is not on the same system.
>     * The new IPC path should be agnostic to local or remote end points
>     * Needs to be very fast compared to current Linux IPC designs. (Will OVS work here?)

Basically, seastar [1] matches this exactly. Its TCP stack, unlike most stacks, is sharded -- there is a separate stack running on each core (but with a single IP address), no locking, zero-copy for both transmit and receive. It has a fast IPC between cores (all data sharing in seastar is via IPC queues; locks or atomic RMW operations are not used). There is also an RPC subsystem that can be used for inter-node communications. We've seen 7X performance improvements over the Linux TCP stack when coding a simple HTTP server.

Of course, it's not all roses. Seastar is written in C++, and the higher layers are asynchronous, so there's a high barrier to entry for dpdk developers. Maybe it can't be merged outright, but perhaps it can provide some inspiration. (seastar supports subsets of TCP, UDP, ICMP, and DHCP over IPv4; no IPv6 support)

[1] https://github.com/scylladb/seastar

> Did I miss any details or comments? Please reply and help me correct the comments or my understanding.
>
> Thanks for everyone attending and packing into a small space.
>
> Regards,
> ++Keith Wiles
> Intel Corporation
[dpdk-dev] Broken RSS hash computation on Intel 82574L
On 09/01/2015 05:47 PM, Matthew Hall wrote: > On Tue, Sep 01, 2015 at 04:37:18PM +0200, Martin Drašar wrote: >> On 1.9.2015 at 15:45, De Lara Guarch, Pablo wrote: >>> 82574L NIC uses em PMD, which does not support more than 1 queue. >>> Therefore RSS is disabled in the NIC and then you cannot have RSS hashes. >>> >>> Thanks, >>> Pablo >> Hi Pablo, >> >> that is interesting information. I read the rationale in em_ethdev.c >> and I was wondering what would have to be done to enable RSS hash >> computation on that card. I can live with just one RX queue, but hashes >> would help me a lot. The computer which is using those NICs is not that >> powerful and every bit of offloaded computation counts... >> >> Thanks, >> Martin > RSS calculations are used to direct packets across multiple RX queues. With > only one RX queue, enabling it cannot possibly increase performance. > As an example, seastar uses the RSS hash computed by the NIC to select a core to process on, if the number of hardware queues is smaller than the number of cores.
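As a sketch of that idea: core_for_packet() below is a made-up helper (seastar's real mapping goes through an indirection table), but hash.rss is the mbuf field PMDs fill in when RSS is enabled.

    #include <stdint.h>
    #include <rte_mbuf.h>

    /* Pick a worker core from the NIC-computed RSS hash when there are
     * fewer HW RX queues than cores. Any mapping that is stable per flow
     * works; modulo is the simplest. */
    static inline unsigned
    core_for_packet(const struct rte_mbuf *m, unsigned n_cores)
    {
        return m->hash.rss % n_cores;
    }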
[dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598
On 09/11/2015 05:25 PM, didier.pallard wrote: > On 08/25/2015 08:52 PM, Vlad Zolotarov wrote: >> >> Helin, the issue has been seen on x540 devices. Pls., see a chapter >> 7.2.1.1 of x540 devices spec: >> >> A packet (or multiple packets in transmit segmentation) can span any >> number of >> buffers (and their descriptors) up to a limit of 40 minus WTHRESH >> minus 2 (see >> Section 7.2.3.3 for Tx Ring details and section Section 7.2.3.5.1 for >> WTHRESH >> details). For best performance it is recommended to minimize the >> number of buffers >> as possible. >> >> Could u, pls., clarify why do u think that the maximum number of data >> buffers is limited by 8? >> >> thanks, >> vlad > > Hi vlad, > > Documentation states that a packet (or multiple packets in transmit > segmentation) can span any number of > buffers (and their descriptors) up to a limit of 40 minus WTHRESH > minus 2. > > Shouldn't there be a test in transmit function that drops properly the > mbufs with a too large number of > segments, while incrementing a statistic; otherwise transmit function > may be locked by the faulty packet without > notification. > What we proposed is that the pmd expose to dpdk, and dpdk expose to the application, an mbuf check function. This way applications that can generate complex packets can verify that the device will be able to process them, and applications that only generate simple mbufs can avoid the overhead by not calling the function.
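A rough sketch of the proposed flow; rte_eth_tx_pkt_ok() is a hypothetical name standing in for the per-PMD check function, not an existing DPDK API:

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    /* Check-before-send pattern: applications that build complex chains
     * validate each mbuf first; forwarding applications skip the check. */
    static uint16_t
    tx_burst_checked(uint8_t port, uint16_t queue,
                     struct rte_mbuf **pkts, uint16_t n)
    {
        uint16_t i;

        for (i = 0; i < n; i++) {
            if (!rte_eth_tx_pkt_ok(port, queue, pkts[i]))   /* hypothetical */
                return i;   /* caller linearizes pkts[i], resubmits the rest */
        }
        return rte_eth_tx_burst(port, queue, pkts, n);
    }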
[dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598
On 09/11/2015 06:12 PM, Vladislav Zolotarov wrote: > > > On Sep 11, 2015 5:55 PM, "Thomas Monjalon" <thomas.monjalon at 6wind.com> wrote: > > > > 2015-09-11 17:47, Avi Kivity: > > > On 09/11/2015 05:25 PM, didier.pallard wrote: > > > > On 08/25/2015 08:52 PM, Vlad Zolotarov wrote: > > > >> > > > >> Helin, the issue has been seen on x540 devices. Pls., see a chapter > > > >> 7.2.1.1 of x540 devices spec: > > > >> > > > >> A packet (or multiple packets in transmit segmentation) can > span any > > > >> number of > > > >> buffers (and their descriptors) up to a limit of 40 minus WTHRESH > > > >> minus 2 (see > > > >> Section 7.2.3.3 for Tx Ring details and section Section > 7.2.3.5.1 for > > > >> WTHRESH > > > >> details). For best performance it is recommended to minimize the > > > >> number of buffers > > > >> as possible. > > > >> > > > >> Could u, pls., clarify why do u think that the maximum number > of data > > > >> buffers is limited by 8? > > > >> > > > >> thanks, > > > >> vlad > > > > > > > > Hi vlad, > > > > > > > > Documentation states that a packet (or multiple packets in transmit > > > > segmentation) can span any number of > > > > buffers (and their descriptors) up to a limit of 40 minus WTHRESH > > > > minus 2. > > > > > > > > Shouldn't there be a test in transmit function that drops > properly the > > > > mbufs with a too large number of > > > > segments, while incrementing a statistic; otherwise transmit > function > > > > may be locked by the faulty packet without > > > > notification. > > > > > > > > > > What we proposed is that the pmd expose to dpdk, and dpdk expose > to the > > > application, an mbuf check function. This way applications that can > > > generate complex packets can verify that the device will be able to > > > process them, and applications that only generate simple mbufs can > avoid > > > the overhead by not calling the function. > > > > More than a check, it should be exposed as a capability of the port. > > Anyway, if the application sends too many segments, the driver must > > drop it to avoid a hang, and maintain a dedicated statistic counter to > allow > > easy debugging. > > I agree with Thomas - this should not be optional. Malformed packets > should be dropped. In the ixgbe case it's a very simple test - it's a > single branch per packet so I doubt that it could impose any > measurable performance degradation. > > A drop allows the application no chance to recover. The driver must either provide the ability for the application to know that it cannot accept the packet, or it must fix it up itself.
[dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598
On 09/11/2015 07:07 PM, Richardson, Bruce wrote: > >> -----Original Message----- >> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Vladislav Zolotarov >> Sent: Friday, September 11, 2015 5:04 PM >> To: Avi Kivity >> Cc: dev at dpdk.org >> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 >> for all NICs but 82598 >> >> On Sep 11, 2015 6:43 PM, "Avi Kivity" wrote: >>> On 09/11/2015 06:12 PM, Vladislav Zolotarov wrote: >>>> >>>> On Sep 11, 2015 5:55 PM, "Thomas Monjalon" >>>> <thomas.monjalon at 6wind.com> wrote: >>>>> 2015-09-11 17:47, Avi Kivity: >>>>>> On 09/11/2015 05:25 PM, didier.pallard wrote: >>>>>>> On 08/25/2015 08:52 PM, Vlad Zolotarov wrote: >>>>>>>> Helin, the issue has been seen on x540 devices. Pls., see a >> chapter >>>>>>>> 7.2.1.1 of x540 devices spec: >>>>>>>> >>>>>>>> A packet (or multiple packets in transmit segmentation) can >>>>>>>> span >> any >>>>>>>> number of >>>>>>>> buffers (and their descriptors) up to a limit of 40 minus >>>>>>>> WTHRESH minus 2 (see Section 7.2.3.3 for Tx Ring details and >>>>>>>> section Section 7.2.3.5.1 >> for >>>>>>>> WTHRESH >>>>>>>> details). For best performance it is recommended to minimize >>>>>>>> the number of buffers as possible. >>>>>>>> >>>>>>>> Could u, pls., clarify why do u think that the maximum number >>>>>>>> of >> data >>>>>>>> buffers is limited by 8? >>>>>>>> >>>>>>>> thanks, >>>>>>>> vlad >>>>>>> Hi vlad, >>>>>>> >>>>>>> Documentation states that a packet (or multiple packets in >>>>>>> transmit >>>>>>> segmentation) can span any number of buffers (and their >>>>>>> descriptors) up to a limit of 40 minus WTHRESH minus 2. >>>>>>> >>>>>>> Shouldn't there be a test in transmit function that drops >>>>>>> properly >> the >>>>>>> mbufs with a too large number of segments, while incrementing a >>>>>>> statistic; otherwise transmit >> function >>>>>>> may be locked by the faulty packet without notification. >>>>>>> >>>>>> What we proposed is that the pmd expose to dpdk, and dpdk expose >>>>>> to >> the >>>>>> application, an mbuf check function. This way applications that >>>>>> can generate complex packets can verify that the device will be >>>>>> able to process them, and applications that only generate simple >>>>>> mbufs can >> avoid >>>>>> the overhead by not calling the function. >>>>> More than a check, it should be exposed as a capability of the port. >>>>> Anyway, if the application sends too many segments, the driver must >>>>> drop it to avoid a hang, and maintain a dedicated statistic counter >>>>> to >> allow >>>>> easy debugging. >>>> I agree with Thomas - this should not be optional. Malformed packets >> should be dropped. In the ixgbe case it's a very simple test - it's a >> single branch per packet so I doubt that it could impose any measurable >> performance degradation. >>>> >>> A drop allows the application no chance to recover. The driver must >> either provide the ability for the application to know that it cannot >> accept the packet, or it must fix it up itself. >> >> An appropriate statistics counter would be a perfect tool to detect such >> issues. Knowingly sending a packet that will cause a HW to hang is not >> acceptable. > I would agree. Drivers should provide a function to query the max number of > segments they can accept and the driver should be able to discard any packets > exceeding that number, and just track it via a stat. > There is no such max number of segments. The i40e card, as an extreme example, allows 8 fragments per packet, but that is after TSO segmentation.
So if the header is in three fragments, that leaves 5 data fragments per packet. Another card (ixgbe) has a 38-fragment pre-TSO limit. With such a variety of limitations, the only generic way to expose them is via a function.
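To illustrate why a function (rather than a single number) is needed, here is a sketch of a post-TSO check in the i40e style. It flags packets where some run of max_bufs consecutive buffers carries less than one MSS of payload; an MSS-sized window starting at the first of those buffers would then span more than max_bufs descriptors. This is a simplified detector, not the PMD's actual logic -- a complete check would also account for windows that start mid-buffer.

    #include <stdbool.h>
    #include <stdint.h>
    #include <rte_mbuf.h>

    /* Returns false when the packet certainly violates an i40e-style
     * "at most max_bufs buffers per MSS of payload" rule. */
    static bool
    tso_window_ok(const struct rte_mbuf *m, uint16_t mss, unsigned max_bufs)
    {
        uint32_t win[max_bufs];     /* lengths of the current buffer window */
        unsigned n = 0, head = 0;
        uint32_t sum = 0;

        for (; m != NULL; m = m->next) {
            if (n == max_bufs) {                    /* evict the oldest */
                sum -= win[head];
                head = (head + 1) % max_bufs;
                n--;
            }
            win[(head + n) % max_bufs] = m->data_len;
            sum += m->data_len;
            n++;
            if (n == max_bufs && sum < mss)
                return false;   /* max_bufs buffers can't hold one MSS */
        }
        return true;
    }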
[dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598
On 09/11/2015 07:08 PM, Thomas Monjalon wrote: > 2015-09-11 18:43, Avi Kivity: >> On 09/11/2015 06:12 PM, Vladislav Zolotarov wrote: >>> On Sep 11, 2015 5:55 PM, "Thomas Monjalon" >>> <thomas.monjalon at 6wind.com> wrote: >>>> 2015-09-11 17:47, Avi Kivity: >>>>> On 09/11/2015 05:25 PM, didier.pallard wrote: >>>>>> Hi vlad, >>>>>> >>>>>> Documentation states that a packet (or multiple packets in transmit >>>>>> segmentation) can span any number of >>>>>> buffers (and their descriptors) up to a limit of 40 minus WTHRESH >>>>>> minus 2. >>>>>> >>>>>> Shouldn't there be a test in transmit function that drops >>> properly the >>>>>> mbufs with a too large number of >>>>>> segments, while incrementing a statistic; otherwise transmit >>> function >>>>>> may be locked by the faulty packet without >>>>>> notification. >>>>>> >>>>> What we proposed is that the pmd expose to dpdk, and dpdk expose >>> to the >>>>> application, an mbuf check function. This way applications that can >>>>> generate complex packets can verify that the device will be able to >>>>> process them, and applications that only generate simple mbufs can >>> avoid >>>>> the overhead by not calling the function. >>>> More than a check, it should be exposed as a capability of the port. >>>> Anyway, if the application sends too many segments, the driver must >>>> drop it to avoid a hang, and maintain a dedicated statistic counter to >>>> allow easy debugging. >>> I agree with Thomas - this should not be optional. Malformed packets >>> should be dropped. In the ixgbe case it's a very simple test - it's a >>> single branch per packet so I doubt that it could impose any >>> measurable performance degradation. >> A drop allows the application no chance to recover. The driver must >> either provide the ability for the application to know that it cannot >> accept the packet, or it must fix it up itself. > I have the feeling that everybody agrees on the same thing: > the application must be able to make a well formed packet by checking > limitations of the port. What about a field rte_eth_dev_info.max_tx_segs? It is not generic enough. i40e has a limit that it imposes post-TSO. > In case the application fails in its checks, the driver must drop it and > notify the user via a stat counter. > The driver can also remove the hardware limitation by gathering the segments > but it may be hard to implement and would be a slow operation. I think that to satisfy both the 64b full line rate applications and the more complicated full stack applications, this must be made optional. In particular, an application that only forwards packets will never hit a NIC's limits, so it need not take any action. That's why I think a verification function is ideal; a forwarding application can ignore it, and a complex application can call it, and if it fails the packet, it can linearize it itself, removing complexity from dpdk itself.
[dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598
On 09/13/2015 02:47 PM, Ananyev, Konstantin wrote: > >> -----Original Message----- >> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Avi Kivity >> Sent: Friday, September 11, 2015 6:48 PM >> To: Thomas Monjalon; Vladislav Zolotarov; didier.pallard >> Cc: dev at dpdk.org >> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 >> for all NICs but 82598 >> >> On 09/11/2015 07:08 PM, Thomas Monjalon wrote: >>> 2015-09-11 18:43, Avi Kivity: >>>> On 09/11/2015 06:12 PM, Vladislav Zolotarov wrote: >>>>> On Sep 11, 2015 5:55 PM, "Thomas Monjalon" >>>>> <thomas.monjalon at 6wind.com> wrote: >>>>>> 2015-09-11 17:47, Avi Kivity: >>>>>>> On 09/11/2015 05:25 PM, didier.pallard wrote: >>>>>>>> Hi vlad, >>>>>>>> >>>>>>>> Documentation states that a packet (or multiple packets in transmit >>>>>>>> segmentation) can span any number of >>>>>>>> buffers (and their descriptors) up to a limit of 40 minus WTHRESH >>>>>>>> minus 2. >>>>>>>> >>>>>>>> Shouldn't there be a test in transmit function that drops >>>>> properly the >>>>>>>> mbufs with a too large number of >>>>>>>> segments, while incrementing a statistic; otherwise transmit >>>>> function >>>>>>>> may be locked by the faulty packet without >>>>>>>> notification. >>>>>>>> >>>>>>> What we proposed is that the pmd expose to dpdk, and dpdk expose >>>>> to the >>>>>>> application, an mbuf check function. This way applications that can >>>>>>> generate complex packets can verify that the device will be able to >>>>>>> process them, and applications that only generate simple mbufs can >>>>> avoid >>>>>>> the overhead by not calling the function. >>>>>> More than a check, it should be exposed as a capability of the port. >>>>>> Anyway, if the application sends too many segments, the driver must >>>>>> drop it to avoid a hang, and maintain a dedicated statistic counter to >>>>>> allow easy debugging. >>>>> I agree with Thomas - this should not be optional. Malformed packets >>>>> should be dropped. In the ixgbe case it's a very simple test - it's a >>>>> single branch per packet so I doubt that it could impose any >>>>> measurable performance degradation. >>>> A drop allows the application no chance to recover. The driver must >>>> either provide the ability for the application to know that it cannot >>>> accept the packet, or it must fix it up itself. >>> I have the feeling that everybody agrees on the same thing: >>> the application must be able to make a well formed packet by checking >>> limitations of the port. What about a field rte_eth_dev_info.max_tx_segs? >> It is not generic enough. i40e has a limit that it imposes post-TSO. >> >> >>> In case the application fails in its checks, the driver must drop it and >>> notify the user via a stat counter. >>> The driver can also remove the hardware limitation by gathering the segments >>> but it may be hard to implement and would be a slow operation. >> I think that to satisfy both the 64b full line rate applications and the >> more complicated full stack applications, this must be made optional. >> In particular, an application that only forwards packets will never hit >> a NIC's limits, so it need not take any action. That's why I think a >> verification function is ideal; a forwarding application can ignore it, >> and a complex application can call it, and if it fails the packet, it >> can linearize it itself, removing complexity from dpdk itself. > I think that's a good approach to that problem.
> As I remember we discussed something similar a while ago - > A function (tx_prep() or something) that would check nb_segs and probably > some other HW specific restrictions, > calculate pseudo-header checksum, reset ip header len, etc. > > From other hand we also can add two more fields into rte_eth_dev_info: > 1) Max num of segs per TSO packet (tx_max_seg ?). > 2) Max num of segs per single packet/TSO segment (tx_max_mtu_seg ?). > So for ixgbe both will have value 40 - wthresh, > while for i40e 1) would be UINT8_MAX and 2) will be 8. > Then upper layer can use that information to select an optimal size for its > TX buffers. > > This will break whenever the fevered imagination of hardware designers comes up with a new limit. We can have an internal function that accepts these two parameters, and then the driver-specific function can call this internal function:

    static bool i40e_validate_packet(mbuf* m)
    {
        return rte_generic_validate_packet(m, 0, 8);
    }

    static bool ixgbe_validate_packet(mbuf* m)
    {
        return rte_generic_validate_packet(m, 40, 2);
    }

this way, the application is isolated from changes in how invalid packets are detected.
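For what it's worth, a sketch of what the shared helper might look like. The reading of the two parameters (a pre-TSO chain limit and a per-packet/per-TSO-segment limit, 0 meaning unlimited) is an assumption; the message above doesn't define them.

    #include <stdbool.h>
    #include <rte_mbuf.h>

    /* Hypothetical shared checker; parameter semantics are assumed. */
    static bool
    rte_generic_validate_packet(const struct rte_mbuf *m,
                                unsigned max_pre_tso_segs,
                                unsigned max_post_tso_segs)
    {
        if ((m->ol_flags & PKT_TX_TCP_SEG) == 0)
            return max_post_tso_segs == 0 ||
                   m->nb_segs <= max_post_tso_segs;
        if (max_pre_tso_segs != 0 && m->nb_segs > max_pre_tso_segs)
            return false;
        /* A full implementation would also walk the chain and bound the
         * number of buffers used by each MSS-sized payload window. */
        return true;
    }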
[dpdk-dev] [PATCH v2] doc: announce ABI change for rte_eth_dev structure
On 07/21/2016 06:24 PM, Tomasz Kulasek wrote: > This is an ABI deprecation notice for DPDK 16.11 in librte_ether about > changes in rte_eth_dev and rte_eth_desc_lim structures. > > As discussed in that thread: > > http://dpdk.org/ml/archives/dev/2015-September/023603.html > > Different NIC models depending on HW offload requested might impose > different requirements on packets to be TX-ed in terms of: > > - Max number of fragments per packet allowed > - Max number of fragments per TSO segments > - The way pseudo-header checksum should be pre-calculated > - L3/L4 header fields filling > - etc. > > > MOTIVATION: > --- > > 1) Some work cannot (and should not) be done in rte_eth_tx_burst. > However, this work is sometimes required, and now, it's an > application issue. > > 2) Different hardware may have different requirements for TX offloads, > a different subset may be supported, and so on. > > 3) Some parameters (eg. number of segments in ixgbe driver) may hang > the device. These parameters may vary for different devices. > > For example i40e HW allows 8 fragments per packet, but that is after > TSO segmentation. While ixgbe has a 38-fragment pre-TSO limit. > > 4) Fields in packet may require different initialization (e.g. > pseudo-header checksum precalculation, sometimes done in a > different way depending on packet type, and so on). Now the application > needs to take care of it. > > 5) Using an additional API (rte_eth_tx_prep) before rte_eth_tx_burst lets > the application prepare a packet burst in a form acceptable to the > specific device. > > 6) Some additional checks may be done in debug mode keeping tx_burst > implementation clean. Thanks a lot for this. Seastar suffered from this issue and had to apply NIC-specific workarounds. The proposal will work well for seastar. > > PROPOSAL: > - > > To help the user deal with all these varieties we propose to: > > 1. Introduce rte_eth_tx_prep() function to do necessary preparations of > packet burst to be safely transmitted on device for desired HW > offloads (set/reset checksum field according to the hardware > requirements) and check HW constraints (number of segments per > packet, etc). > > While the limitations and requirements may differ for devices, it > requires extending the rte_eth_dev structure with a new function pointer > "tx_pkt_prep" which can be implemented in the driver to prepare and > verify packets, in a device-specific way, before the burst, which should > prevent the application from sending malformed packets. > > 2. Also new fields will be introduced in rte_eth_desc_lim: > nb_seg_max and nb_mtu_seg_max, providing information about the max > number of segments in TSO and non-TSO packets acceptable by the device. > > This information is useful for the application to avoid creating > malformed packets. > > > APPLICATION (CASE OF USE): > -- > > 1) The application should initialize the burst of packets to send, set > required tx offload flags and required fields, like l2_len, l3_len, > l4_len, and tso_segsz > > 2) The application passes the burst to rte_eth_tx_prep to check conditions > required to send packets through the NIC. > > 3) The result of rte_eth_tx_prep can be used to send valid packets > and/or restore the invalid ones if the function fails. > > eg.
>
> for (i = 0; i < nb_pkts; i++) {
>
> 	/* initialize or process packet */
>
> 	bufs[i]->tso_segsz = 800;
> 	bufs[i]->ol_flags = PKT_TX_TCP_SEG | PKT_TX_IPV4
> 			| PKT_TX_IP_CKSUM;
> 	bufs[i]->l2_len = sizeof(struct ether_hdr);
> 	bufs[i]->l3_len = sizeof(struct ipv4_hdr);
> 	bufs[i]->l4_len = sizeof(struct tcp_hdr);
> }
>
> /* Prepare burst of TX packets */
> nb_prep = rte_eth_tx_prep(port, 0, bufs, nb_pkts);
>
> if (nb_prep < nb_pkts) {
> 	printf("tx_prep failed\n");
>
> 	/* drop or restore invalid packets */
>
> }
>
> /* Send burst of TX packets */
> nb_tx = rte_eth_tx_burst(port, 0, bufs, nb_prep);
>
> /* Free any unsent packets. */
>
> Signed-off-by: Tomasz Kulasek
> ---
>  doc/guides/rel_notes/deprecation.rst | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
> index f502f86..485aacb 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -41,3 +41,10 @@ Deprecation Notices
>  * The mempool functions for single/multi producer/consumer are deprecated and
>    will be removed in 16.11.
>    It is replaced by rte_mempool_generic_get/put functions.
> +
> +* In 16.11 ABI changes are planned: the ``rte_eth_dev`` structure will be
> +  extended with new function pointer ``tx_pkt_prep`` allowing verification
> +  and processing of packet burst to meet HW specific requirements.
[dpdk-dev] [PATCH v2] doc: announce ABI change for rte_eth_dev structure
On 07/28/2016 02:38 PM, Jerin Jacob wrote: > On Thu, Jul 28, 2016 at 10:36:07AM +0000, Ananyev, Konstantin wrote: >>> If it does not cope up then it can skip tx'ing in the actual tx burst >>> itself and move the "skipped" tx packets to end of the list in the tx >>> burst so that application can take the action on "skipped" packet after >>> the tx burst >> Sorry, that's too cryptic for me. >> Can you reword it somehow? > OK. > 1) lets say application requests 32 packets to send it using tx_burst. > 2) packets are from p0 to p31 > 3) in driver due to some reason, it is not able to send the packets due to some > constraints in the driver (say, except p2 and p16, everything else was sent > successfully by the driver) > 4) driver can move p2 and p16 at pkt[0] and pkt[1] on tx_burst and > return 30 > 5) application can take action on p2 and p16 based on the return value of > 30 (i.e. 32-30 = 2 packets need to be handled at pkt[0] and pkt[1]) That can cause reordering; while it is legal, it reduces tcp performance. Better to preserve the application-provided order. > >>> Instead it just sets up the ol_flags, fills tx_offload fields and calls tx_prep(). Please read the original Tomasz's patch, I think he explained possible use-cases with a lot of detail. >>> Sorry, it is not very clear in terms of use cases. >> Ok, what I meant to say: >> Right now, if user wants to use HW TX cksum/TSO offloads he might have to: >> - setup ipv4 header cksum field. >> - calculate the pseudo header checksum >> - setup tcp/udp cksum field. >> >> Rules how these calculations need to be done and which fields need to be updated, >> may vary depending on HW underneath and requested offloads. >> tx_prep() is supposed to hide all these nuances from the user and allow him to use TX HW offloads >> in a transparent way. > Not sure I understand it completely. A bit contradictory with the statement below > |We would document what tx_prep() supposed to do, and in what cases user > |don't need it. > > How about introducing a new ethdev generic eal command-line mode OR > new ethdev_configure hint that PMD driver is in "tx_prep->tx_burst" mode > instead of just tx_burst? That way no fast-path performance degradation > for the PMD that does not need it > > >> Another main purpose of tx_prep(): for multi-segment packets is to check >> that number of segments doesn't exceed the HW limit. >> Again right now users have to do that on their own. >> >>> In HW perspective, it tries to avoid the illegal state. But not sure, >>> calling "back to back" tx prepare and then tx burst, how does it improve the >>> situation if the illegal state check is introduced in the actual tx burst >>> itself. >>> >>> In SW perspective, it tries to avoid sending malformed packets. In my >>> view the same can be achieved with the existing tx burst itself as the PMD is the >>> one that finally sends the packets on the wire. >> Ok, so your question is: why not put that functionality into >> tx_burst() itself, right? >> For a few reasons: >> 1. putting that functionality into tx_burst() would introduce unnecessary >> slowdown for cases when that functionality is not needed >> (one segment per packet, no HW offloads). > These parameters can be configured at init time > >> 2. User might not want to use tx_prep() - he/she might have his/her >> own analog, which he/she believes is faster/smarter, etc. > That's the current mode. Right? >> 3.
Having it as a separate function would allow the user to control when/where >>to call it, let's say only for some packets, or probably call tx_prep() >>on one core, and do actual tx_burst() for these packets on the other. > Why process it under tx_prep(), as the application can always process the > packet on one core > >>> proposal quote: >>> >>> 1. Introduce rte_eth_tx_prep() function to do necessary preparations of >>> packet burst to be safely transmitted on device for desired HW >>> offloads (set/reset checksum field according to the hardware >>> requirements) and check HW constraints (number of segments per >>> packet, etc). >>> >>> While the limitations and requirements may differ for devices, it >>> requires extending the rte_eth_dev structure with a new function pointer >>> "tx_pkt_prep" which can be implemented in the driver to prepare and >>> verify packets, in a device-specific way, before the burst, which should >>> prevent the application from sending malformed packets. >>> >>> > and what if the PMD does not implement that callback? Then it is a waste > of cycles. Right? If by lost cycles you mean something like: RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->tx_prep, -ENOTSUP); then yes. Though compared to the actual work that needs to be done for most HW TX offloads, I think it is negligible. >>> Not sure. >>> Again, as I said before, it is totally voluntary for the application. >>> Not according to the proposal. It can't be, though, as the application has no idea >>> what the PMD driver does w
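Jerin's compaction scheme from earlier in this message, expressed as code. Note this is hypothetical driver behavior: today's rte_eth_tx_burst() leaves unsent packets at the tail of the array, not the head.

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    /* Assumed semantics: the driver compacts rejected mbufs to the front
     * of the array and returns the number it actually sent. */
    static void
    send_with_compaction(uint8_t port, struct rte_mbuf **bufs, uint16_t nb_pkts)
    {
        uint16_t sent = rte_eth_tx_burst(port, 0, bufs, nb_pkts);
        uint16_t i;

        for (i = 0; i < nb_pkts - sent; i++)
            rte_pktmbuf_free(bufs[i]);      /* rejected: fix up or drop */
    }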
[dpdk-dev] [PATCH] vfio: Include No-IOMMU mode
On 11/16/2015 07:06 PM, Alex Williamson wrote: > On Wed, 2015-10-28 at 15:21 -0600, Alex Williamson wrote: >> There is really no way to safely give a user full access to a DMA >> capable device without an IOMMU to protect the host system. There is >> also no way to provide DMA translation, for use cases such as device >> assignment to virtual machines. However, there are still those users >> that want userspace drivers even under those conditions. The UIO >> driver exists for this use case, but does not provide the degree of >> device access and programming that VFIO has. In an effort to avoid >> code duplication, this introduces a No-IOMMU mode for VFIO. >> >> This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling >> the "enable_unsafe_noiommu_mode" option on the vfio driver. This >> should make it very clear that this mode is not safe. Additionally, >> CAP_SYS_RAWIO privileges are necessary to work with groups and >> containers using this mode. Groups making use of this support are >> named /dev/vfio/noiommu-$GROUP and can only make use of the special >> VFIO_NOIOMMU_IOMMU for the container. Use of this mode, specifically >> binding a device without a native IOMMU group to a VFIO bus driver >> will taint the kernel and should therefore not be considered >> supported. This patch includes no-iommu support for the vfio-pci bus >> driver only. >> >> Signed-off-by: Alex Williamson >> --- >> >> This is pretty well the same as RFCv2, I've changed the pr_warn to a >> dev_warn and added another, printing the pid and comm of the task when >> it actually opens the device. If Stephen can port the driver code >> over and prove that this actually works sometime next week, and there >> aren't any objections to this code, I'll include it in a pull request >> for the next merge window. MST, I dropped your ack due to the >> changes, but I'll be happy to add it back if you like. Thanks, >> >> Alex >> >> drivers/vfio/Kconfig | 15 +++ >> drivers/vfio/pci/vfio_pci.c | 8 +- >> drivers/vfio/vfio.c | 186 ++- >> include/linux/vfio.h | 3 + >> include/uapi/linux/vfio.h | 7 ++ >> 5 files changed, 209 insertions(+), 10 deletions(-) > FYI, this is now in v4.4-rc1 (the slightly modified v2 version). I want > to give fair warning though that while we seem to agree on this idea, it > hasn't been proven with a userspace driver port. I've opted to include > it in this merge window rather than delaying it until v4.5, but I really > need to see a user for this before the end of the v4.4 cycle or I think > we'll need to revert and revisit for v4.5 anyway. I don't really have > any interest in adding and maintaining code that has no users. Please > keep me informed of progress with a dpdk port. Thanks, > > Thanks Alex. Copying the dpdk mailing list, where the users live. dpdk-ers: vfio-noiommu is a replacement for uio_pci_generic and igb_uio. It supports MSI-X and so can be used on SR/IOV VF devices. The intent is that you can use dpdk without an external module, using vfio, whether you are on bare metal with an iommu, bare metal without an iommu, or virtualized. However, dpdk needs modification to support this.
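On the dpdk side, probing for the new mode could look roughly like this. VFIO_NOIOMMU_IOMMU is the extension the patch above introduces; the container open/ioctl sequence is the standard VFIO API.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    int main(void)
    {
        int container = open("/dev/vfio/vfio", O_RDWR);

        if (container < 0) {
            perror("open /dev/vfio/vfio");
            return 1;
        }
        /* > 0 means the kernel was built with CONFIG_VFIO_NOIOMMU and
         * enable_unsafe_noiommu_mode is set; groups then show up as
         * /dev/vfio/noiommu-$GROUP and using them taints the kernel. */
        if (ioctl(container, VFIO_CHECK_EXTENSION, VFIO_NOIOMMU_IOMMU) > 0)
            printf("VFIO no-IOMMU mode available\n");
        return 0;
    }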
[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
On 10/01/2015 11:44 AM, Michael S. Tsirkin wrote: > On Wed, Sep 30, 2015 at 11:40:16PM +0300, Michael S. Tsirkin wrote: >>> And for what, to prevent >>> root from touching memory via dma that they can access in a million other >>> ways? >> So one can be reasonably sure a kernel oops is not a result of a >> userspace bug. > Actually, I thought about this overnight, and it should be possible to > drive it securely from userspace, without hypervisor changes. Also without the performance that was the whole reason for doing it in userspace in the first place. I still don't understand your objection to the patch: > MSI messages are memory writes so any generic device capable > of MSI is capable of corrupting kernel memory. > This means that a bug in userspace will lead to kernel memory corruption > and crashes. This is something distributions can't support. If a distribution feels it can't support this configuration, it can disable the uio_pci_generic driver, or refuse to support tainted kernels. If it feels it can (and many distributions are starting to support dpdk), then you're just denying it the ability to serve its users. > See > > https://mid.gmane.org/20151001104505-mutt-send-email-mst at redhat.com
[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
On 10/01/2015 11:52 AM, Avi Kivity wrote: > > > On 10/01/2015 11:44 AM, Michael S. Tsirkin wrote: >> On Wed, Sep 30, 2015 at 11:40:16PM +0300, Michael S. Tsirkin wrote: >>>> And for what, to prevent >>>> root from touching memory via dma that they can access in a million other >>>> ways? >>> So one can be reasonably sure a kernel oops is not a result of a >>> userspace bug. >> Actually, I thought about this overnight, and it should be possible to >> drive it securely from userspace, without hypervisor changes. > > Also without the performance that was the whole reason from doing it > in userspace in the first place. > > I still don't understand your objection to the patch: > >> MSI messages are memory writes so any generic device capable >> of MSI is capable of corrupting kernel memory. >> This means that a bug in userspace will lead to kernel memory corruption >> and crashes. This is something distributions can't support. > And this: > What userspace can't be allowed to do: > > access BAR > write rings > It can access the BAR by mmap()ing the resourceN files under sysfs. You're not denying userspace the ability to oops the kernel, just the ability to do useful things with hardware.
[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
On 10/01/2015 12:15 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 11:52:26AM +0300, Avi Kivity wrote: >> I still don't understand your objection to the patch: >> >> >> MSI messages are memory writes so any generic device capable >> of MSI is capable of corrupting kernel memory. >> This means that a bug in userspace will lead to kernel memory corruption >> and crashes. This is something distributions can't support. >> >> >> If a distribution feels it can't support this configuration, it can disable >> the >> uio_pci_generic driver, or refuse to support tainted kernels. If it feels it >> can (and many distributions are starting to support dpdk), then you're just >> denying it the ability to serve its users. > I don't, and can't deny users anything. I merely think upstream should > avoid putting this driver in-tree. By doing this, driver writers will > be pushed to develop solutions that can't crash kernel. > > I pointed out one way to build it, there are sure to be more. And I pointed out that your solution is unworkable. It's easy to claim that a solution is around the corner, only no one was looking for it, but the reality is that kernel bypass has been a solution for years for high performance users, that it cannot be made safe without an iommu, and that iommus are not available everywhere; and even when they are some users prefer to avoid the performance penalty. > As far as I could see, without this kind of motivation, people do not > even want to try. You are mistaken. The problem is a lot harder than you think. People didn't go and write userspace drivers because they were lazy. They wrote them because there was no other way.
[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
On 10/01/2015 12:29 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 12:15:49PM +0300, Avi Kivity wrote: >> What userspace can't be allowed to do: >> >> access BAR >> write rings >> >> >> >> >> It can access the BAR by mmap()ing the resourceN files under sysfs. You're >> not >> denying userspace the ability to oops the kernel, just the ability to do >> useful >> things with hardware. > > This interface has to stay there to support existing applications. A > variety of measures (selinux, secureboot) can be used to make sure > modern ones to not get to touch it. By all means, secure the driver with selinux as well. > Most distributions enable > some or all of these by default. There is no problem accessing the BARs on the most modern and secure enterprise distribution I am aware of (CentOS 7.1). > > And it doesn't mean modern drivers can do this kind of thing. > > Look, without an IOMMU, sysfs can not be used securely: > you need some other interface. This is what this driver is missing. What is this magical missing interface? It simply cannot be done. You either have an iommu, or you accept that userspace can access anything via DMA. The sad thing is that you can do this since forever on a non-virtualized system, or on a virtualized system if you don't need interrupt support. All you're doing is blocking interrupt support on virtualized systems.
[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
On 10/01/2015 12:42 PM, Vincent JARDIN wrote: > On 01/10/2015 11:22, Avi Kivity wrote: >>> As far as I could see, without this kind of motivation, people do not >>> even want to try. >> >> You are mistaken. The problem is a lot harder than you think. >> >> People didn't go and write userspace drivers because they were lazy. >> They wrote them because there was no other way. > > I disagree, it is possible to write a 'partial' userspace driver. > > Here it is an example: > http://dpdk.org/browse/dpdk/tree/drivers/net/mlx4 > > It benefits from the kernel's capabilities while the userland manages > only the I/Os. > That is because the device itself contains an iommu. > There were some attempts to get it for other (older) drivers, named > 'bifurcated drivers', but they stalled. IIRC they still exposed the ring to userspace.
[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
On 10/01/2015 12:42 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote: >> even when they are some users >> prefer to avoid the performance penalty. > I don't think there's a measurable penalty from passing through the > IOMMU, as long as mappings are mostly static (i.e. iommu=pt). I sure > never saw any numbers that show such. > Maybe not. But again, virtualized setups will not have a guest iommu and therefore can't use it; and those happen to be exactly the setups you're blocking. Non-virtualized setups have an iommu available, but they can also use uio_pci_generic without patching if they like. The virtualized setups have no other option; you're leaving them out in the cold.
[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
On 10/01/2015 12:48 PM, Vincent JARDIN wrote: > On 01/10/2015 11:43, Avi Kivity wrote: >> >> That is because the device itself contains an iommu. > > Yes. > > It could be an option: > - we could flag the Linux system unsafe when the device does not > have any IOMMU > - we flag the Linux system safe when the device has an IOMMU This already exists; it's called the tainted flag. I don't know if uio_pci_generic already taints the kernel; it certainly should with DMA-capable devices.
[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote: >> It's easy to claim that >> a solution is around the corner, only no one was looking for it, but the >> reality is that kernel bypass has been a solution for years for high >> performance users, > I never said that it's trivial. > > It's probably a lot of work. It's definitely more work than just abusing > sysfs. > > But it looks like a write system call into an eventfd is about 1.5 > microseconds on my laptop. Even with a system call per packet, system > call overhead is not what makes DPDK drivers outperform Linux ones. > 1.5 us = a 0.67 Mpps per-core limit. dpdk performance is in the tens of millions of packets per second per system. It's not just the lack of system calls, of course; the architecture is completely different.
[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
On 10/01/2015 01:07 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 12:38:51PM +0300, Avi Kivity wrote: >> The sad thing is that you can do this since forever on a non-virtualized >> system, or on a virtualized system if you don't need interrupt support. All >> you're doing is blocking interrupt support on virtualized systems. > True, Linux could do more to prevent this kind of abuse. > In fact IIRC, if you enable secureboot, it does exactly that. > > A generic uio driver isn't a good interface because it relies on these > sysfs files. We are luckly it doesn't work for VFs, I don't think we > should do anything that relies on this interface in future applications. > I agree that uio is not a good solution. But for some users, which we are discussing now, it is the only solution. A bad solution is better than no solution.
[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
On 10/01/2015 01:14 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 12:43:53PM +0300, Avi Kivity wrote: >>> There were some tentative to get it for other (older) drivers, named >>> 'bifurcated drivers', but it is stalled. >> IIRC they still exposed the ring to userspace. > How much would a ring write syscall cost? 1-2 microseconds, isn't it? > Measureable, but it's not the end of the world. Plus a page table walk per packet fragment (dpdk has the physical address prepared in the mbuf IIRC). The 10Mpps+ users of dpdk should comment on whether the performance hit is acceptable (my use case is much more modest). > ring read might be safe to allow. > Should buy us enough time until hypervisors support IOMMU. All the relevant drivers need to be converted to support ring translation, and exposing the ring to userspace in the new API. It shouldn't take more than 3-4 years. Meanwhile, users of virtualized systems that need interrupt support cannot use their machines, while non-virtualized users are free to DMA wherever they like, in the name of security. btw, an API like you describe already exists -- vhost. Of course the virtio protocol is nowhere near fast enough, but at least it's an example.
[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
On 10/01/2015 01:17 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 12:53:14PM +0300, Avi Kivity wrote: >> Non-virtualized setups have an iommu available, but they can also use >> uio_pci_generic without patching if they like. > Not with VFs, they can't. > They can and they do (I use it myself).
[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
On 10/01/2015 01:24 PM, Avi Kivity wrote: > On 10/01/2015 01:17 PM, Michael S. Tsirkin wrote: >> On Thu, Oct 01, 2015 at 12:53:14PM +0300, Avi Kivity wrote: >>> Non-virtualized setups have an iommu available, but they can also use >>> pci_uio_generic without patching if they like. >> Not with VFs, they can't. >> > > They can and they do (I use it myself). I mean with a PF. Why use a VF on a non-virtualized system?
[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
On 10/01/2015 01:38 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 12:59:47PM +0300, Avi Kivity wrote: >> >> On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote: >>> On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote: >>>> It's easy to claim that >>>> a solution is around the corner, only no one was looking for it, but the >>>> reality is that kernel bypass has been a solution for years for high >>>> performance users, >>> I never said that it's trivial. >>> >>> It's probably a lot of work. It's definitely more work than just abusing >>> sysfs. >>> >>> But it looks like a write system call into an eventfd is about 1.5 >>> microseconds on my laptop. Even with a system call per packet, system >>> call overhead is not what makes DPDK drivers outperform Linux ones. >>> >> 1.5 us = a 0.67 Mpps per-core limit. > Oh, I calculated it incorrectly. It's 0.15 us. So 6Mpps. You also trimmed the extra work that needs to be done, that I mentioned. Maybe your ring proxy can work, maybe it can't. In any case it's a hefty chunk of work. Should this work block users from using their VFs, if they happen to need interrupt support? > But for RX, you can batch a lot of packets. > > You can see by now I'm not that good at benchmarking. > Here's what I wrote:
>
> #include <stdint.h>
> #include <stdio.h>
> #include <unistd.h>
> #include <sys/eventfd.h>
>
> int main(int argc, char **argv)
> {
> 	int e = eventfd(0, 0);
> 	uint64_t v = 1;
> 	int i;
>
> 	for (i = 0; i < 10000000; ++i) {
> 		write(e, &v, sizeof v);
> 	}
> }
>
> This takes 1.5 seconds to run on my laptop: > > $ time ./a.out > > real 0m1.507s > user 0m0.179s > sys 0m1.328s > > >> dpdk performance is in the tens of >> millions of packets per second per system. > I think that's with a bunch of batching though. Yes, it's also with their application code running as well. They didn't reach this kind of performance by spending cycles unnecessarily. I'm not saying that the ring proxy is not workable; just that we don't know whether it is or not, and I don't think that a patch that enables _existing functionality_ for VFs should be blocked in favor of a new and unproven approach. > >> It's not just the lack of system calls, of course; the architecture is >> completely different. > Absolutely - I'm not saying move all of DPDK into kernel. > We just need to protect the RX rings so hardware does > not corrupt kernel memory. > > > Thinking about it some more, many devices > have separate rings for DMA: TX (device reads memory) > and RX (device writes memory). > With such devices, a mode where userspace can write TX ring > but not RX ring might make sense. I'm sure you can cause havoc just by reading, if you read from I/O memory. > > This will mean userspace might read kernel memory > through the device, but can not corrupt it. > > That's already a big win! > > And RX buffers do not have to be added one at a time. > If we assume 0.2usec per system call, batching some 100 buffers per > system call gives you 2 nano seconds overhead. That seems quite > reasonable. You're ignoring the page table walk and other per-descriptor processing. Again^2, maybe this can work. But it shouldn't block a patch enabling interrupt support of VFs. After the ring proxy is available and proven for a few years, we can deprecate bus mastering from uio, and after a few more years remove it.
[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
On 10/01/2015 01:44 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 01:25:17PM +0300, Avi Kivity wrote: >> Why use a VF on a non-virtualized system? > 1. So a userspace bug does not destroy your hardware > (PFs generally assume trusted non-buggy drivers, VFs > generally don't). People who use dpdk trust their drivers (those drivers are the reason for the system to exist in the first place). > 2. So you can use a PF or another VF for regular networking. This is valid, but usually those systems have a separate management network. Unfortunately VFs limit the number of queues you can expose, making them less performant than PFs. The "bifurcated drivers" were meant as a way of enabling this functionality without resorting to VFs, but it seems they are stalled. > 3. So you can manage this system, to some level. > Again existing practice doesn't follow this.
[dpdk-dev] [PATCH 0/2] uio_msi: device driver
On 10/01/2015 01:28 AM, Stephen Hemminger wrote: > This is a new UIO device driver to allow supporting MSI-X and MSI devices > in userspace. It has been used in environments like VMware and older versions > of QEMU/KVM where no IOMMU support is available. Why not add msi/msix support to uio_pci_generic? > Stephen Hemminger (2): > uio: add support for ioctls > uio: new driver to support PCI MSI-X > > drivers/uio/Kconfig | 9 ++ > drivers/uio/Makefile | 1 + > drivers/uio/uio.c | 15 ++ > drivers/uio/uio_msi.c | 378 +++ > include/linux/uio_driver.h | 3 + > include/uapi/linux/Kbuild | 1 + > include/uapi/linux/uio_msi.h | 22 +++ > 7 files changed, 429 insertions(+) > create mode 100644 drivers/uio/uio_msi.c > create mode 100644 include/uapi/linux/uio_msi.h
[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
On 10/01/2015 02:09 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 01:50:10PM +0300, Avi Kivity wrote: >>>> It's not just the lack of system calls, of course; the architecture is >>>> completely different. >>> Absolutely - I'm not saying move all of DPDK into kernel. >>> We just need to protect the RX rings so hardware does >>> not corrupt kernel memory. >>> >>> >>> Thinking about it some more, many devices >>> have separate rings for DMA: TX (device reads memory) >>> and RX (device writes memory). >>> With such devices, a mode where userspace can write TX ring >>> but not RX ring might make sense. >> I'm sure you can cause havoc just by reading, if you read from I/O memory. > Not talking about I/O memory here. These are device rings in RAM. Right. But you program them with DMA addresses, so the device can read another device's memory. >>> This will mean userspace might read kernel memory >>> through the device, but can not corrupt it. >>> >>> That's already a big win! >>> >>> And RX buffers do not have to be added one at a time. >>> If we assume 0.2usec per system call, batching some 100 buffers per >>> system call gives you 2 nano seconds overhead. That seems quite >>> reasonable. >> You're ignoring the page table walk > Some caching strategy might work here. It may, or it may not. I'm not against this. I'm against blocking users' access to their hardware, using an existing, established interface, for a small subset of setups. It doesn't help you in any way (you can still get reports of oopses due to buggy userspace drivers on physical machines, or on virtual machines that don't require interrupts), and it harms them. >> and other per-descriptor processing. > You probably can let userspace pre-format it all, > just validate addresses. You have to figure out if the descriptor contains an address or not (many devices have several descriptor formats, some with addresses and some without, which are intermixed). You also have to parse the descriptor size and see if it crosses a page boundary or not. > >> Again^2, maybe this can work. But it shouldn't block a patch enabling >> interrupt support of VFs. After the ring proxy is available and proven for >> a few years, we can deprecate bus mastering from uio, and after a few more >> years remove it. > We are talking about DPDK patches posted in June 2015. It's not some > software proven for years. dpdk has been used for years; it just won't work on VFs if you need interrupt support. > If Linux keeps enabling hacks, no one will > bother doing the right thing. Upstream inclusion is the only carrot > Linux has to make people do the right thing. It's not a carrot, it's a stick. Implementing your scheme will take a huge effort, is not guaranteed to provide the performance needed, and will not be available for years. Meanwhile exactly the same thing on physical machines is supported. People will just use out of tree drivers (dpdk has several already). It's a pain, but nowhere near what you are proposing.
[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
On 10/01/2015 02:27 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 02:20:37PM +0300, Avi Kivity wrote: >> People will just use out of tree drivers (dpdk has several already). It's a >> pain, but nowhere near what you are proposing. > What's the issue with that? Out of tree drivers have to be compiled on the target system (cannot ship a binary package), and occasionally break. dkms helps with that, as do distributions that promise binary compatibility, but it is still a pain, compared to just shipping a userspace package. > We already agreed this kernel > is going to be tainted, and unsupportable. Yes. So your only motivation in rejecting the patch is to get the author to write the ring translation patch and port it to all relevant drivers instead?
[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
On 10/01/2015 02:31 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 02:20:37PM +0300, Avi Kivity wrote: >> >> On 10/01/2015 02:09 PM, Michael S. Tsirkin wrote: >>> On Thu, Oct 01, 2015 at 01:50:10PM +0300, Avi Kivity wrote: >>>>>> It's not just the lack of system calls, of course, the architecture is >>>>>> completely different. >>>>> Absolutely - I'm not saying move all of DPDK into kernel. >>>>> We just need to protect the RX rings so hardware does >>>>> not corrupt kernel memory. >>>>> >>>>> >>>>> Thinking about it some more, many devices >>>>> have separate rings for DMA: TX (device reads memory) >>>>> and RX (device writes memory). >>>>> With such devices, a mode where userspace can write TX ring >>>>> but not RX ring might make sense. >>>> I'm sure you can cause havoc just by reading, if you read from I/O memory. >>> Not talking about I/O memory here. These are device rings in RAM. >> Right. But you program them with DMA addresses, so the device can read >> another device's memory. > It can't if host has limited it to only DMA into guest RAM, which is > pretty common. > Ok. So yes, the tx ring can be mapped R/W.
[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
On 10/01/2015 06:01 PM, Stephen Hemminger wrote: > On Thu, 1 Oct 2015 14:32:19 +0300 > Avi Kivity wrote: > >> On 10/01/2015 02:27 PM, Michael S. Tsirkin wrote: >>> On Thu, Oct 01, 2015 at 02:20:37PM +0300, Avi Kivity wrote: >>>> People will just use out of tree drivers (dpdk has several already). It's >>>> a >>>> pain, but nowhere near what you are proposing. >>> What's the issue with that? >> Out of tree drivers have to be compiled on the target system (cannot >> ship a binary package), and occasionally break. >> >> dkms helps with that, as do distributions that promise binary >> compatibility, but it is still a pain, compared to just shipping a >> userspace package. >> >>>We already agreed this kernel >>> is going to be tainted, and unsupportable. >> Yes. So your only motivation in rejecting the patch is to get the >> author to write the ring translation patch and port it to all relevant >> drivers instead? > The per-driver ring method is what netmap did. > The problem with that is that it forces infrastructure into already > complex network driver. It never was accepted. There were also still > security issues like time of check/time of use with the ring. There would have to be two rings, with the driver picking up descriptors from the software ring, translating virtual addresses, and pushing them into the hardware ring. I'm not familiar enough with the truly high performance dpdk applications to estimate the performance impact. Seastar/scylla gets a huge benefit from dpdk, but is still nowhere near line rate.
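In the two-ring scheme, the kernel-side per-descriptor work would look something like this. Every name below is made up; the point is only to show where the translation cost lands.

    #include <stdint.h>
    #include <errno.h>

    /* Hypothetical descriptor layouts. */
    struct sw_desc { uint64_t user_addr; uint32_t len; };
    struct hw_desc { uint64_t phys_addr; uint32_t len; };

    /* Hypothetical helper: walk the page tables for user_addr, verify the
     * buffer lies in memory this process may DMA to, return 0 on failure. */
    extern uint64_t translate_and_check(uint64_t user_addr, uint32_t len);

    static int
    proxy_desc(const struct sw_desc *sw, struct hw_desc *hw)
    {
        uint64_t phys = translate_and_check(sw->user_addr, sw->len);

        if (phys == 0)
            return -EFAULT;     /* reject DMA outside permitted memory */
        hw->phys_addr = phys;   /* this walk is the per-fragment cost */
        hw->len = sw->len;
        return 0;
    }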
[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
On 10/01/2015 06:11 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 02:32:19PM +0300, Avi Kivity wrote: >>> We already agreed this kernel >>> is going to be tainted, and unsupportable. >> Yes. So your only motivation in rejecting the patch is to get the author to >> write the ring translation patch and port it to all relevant drivers >> instead? > Not only that. > > To make sure users are aware they are doing insecure > things when using software poking at device BARs in sysfs. I don't think you need to worry about that. People who program DMA are aware of the damage it can cause. If you want to be extra sure, have uio taint the kernel when bus mastering is enabled. > To avoid giving virtualization a bad name for security. There is no security issue here. Those VMs run a single application, and cannot attack the host or other VMs. > To get people to work on safe, maintainable solutions. That's a great goal, but I don't think it can be achieved without sacrificing performance, which is the only reason for dpdk's existence. If safe and maintainable were the only requirements, people would not bypass the kernel. The only thing you are really achieving by blocking this is causing pain.
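The taint suggestion is nearly a one-liner in kernel terms. A rough, untested sketch of what a uio_pci_generic-style driver could do when it enables bus mastering; TAINT_USER is borrowed here only because no dedicated flag exists, and a real patch might add one:

#include <linux/pci.h>
#include <linux/kernel.h>

/* Sketch: once bus mastering is on, the device can DMA anywhere the
 * platform allows, so mark the kernel tainted to keep later oopses
 * from being mistaken for kernel bugs. */
static int sketch_enable_userspace_dma(struct pci_dev *pdev)
{
    int err = pci_enable_device(pdev);

    if (err)
        return err;

    pci_set_master(pdev);                 /* DMA possible from here on */
    add_taint(TAINT_USER, LOCKDEP_STILL_OK);
    dev_warn(&pdev->dev, "userspace DMA enabled, kernel tainted\n");
    return 0;
}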
[dpdk-dev] RFC: i40e xmit path HW limitation
On 07/30/2015 07:17 PM, Stephen Hemminger wrote: > On Thu, 30 Jul 2015 17:57:33 +0300 > Vlad Zolotarov wrote: > >> Hi, Konstantin, Helin, >> there is a documented limitation of xl710 controllers (i40e driver) >> which is not handled in any way by a DPDK driver. >> From the datasheet chapter 8.4.1: >> >> "• A single transmit packet may span up to 8 buffers (up to 8 data >> descriptors per packet including >> both the header and payload buffers). >> • The total number of data descriptors for the whole TSO (explained later on >> in this chapter) is >> unlimited as long as each segment within the TSO obeys the previous rule (up >> to 8 data descriptors >> per segment for both the TSO header and the segment payload buffers)." >> >> This means that, for instance, a long cluster with small fragments has to >> be linearized before it may be placed on the HW ring. >> In more standard environments like Linux or FreeBSD drivers the solution >> is straightforward - call skb_linearize()/m_collapse() respectively. >> In the non-conformist environment like DPDK life is not that easy - >> there is no easy way to collapse the cluster into a linear buffer from >> inside the device driver >> since device driver doesn't allocate memory in a fast path and utilizes >> the user allocated pools only. >> >> Here are two proposals for a solution: >> >> 1. We may provide a callback that would return a user TRUE if a given >> cluster has to be linearized and it should always be called before >> rte_eth_tx_burst(). Alternatively it may be called from inside the >> rte_eth_tx_burst() and rte_eth_tx_burst() is changed to return some >> error code for a case when one of the clusters it's given has to be >> linearized. >> 2. Another option is to allocate a mempool in the driver with the >> elements consuming a single page each (standard 2KB buffers would >> do). Number of elements in the pool should be the Tx ring length >> multiplied by "64KB/(linear data length of the buffer in the pool >> above)". Here I use 64KB as a maximum packet length and not taking >> into account esoteric things like "Giant" TSO mentioned in the >> spec above. Then we may actually go and linearize the cluster if >> needed on top of the buffers from the pool above, post the buffer >> from the mempool above on the HW ring, link the original cluster to >> that new cluster (using the private data) and release it when the >> send is done. > Or just silently drop heavily scattered packets (and increment oerrors) > with a PMD_TX_LOG debug message. > > I think a DPDK driver doesn't have to accept all possible mbufs and do > extra work. It seems reasonable to expect caller to be well behaved > in this restricted ecosystem. > How can the caller know what's well behaved? It's device dependent.
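For the non-TSO half of the rule, the callback from proposal 1 is nearly trivial against DPDK's rte_mbuf; a sketch, with the constant and function name invented here. The per-TSO-segment half is the trickier part; see further down the thread.

#include <stdbool.h>
#include <rte_mbuf.h>

#define XL710_MAX_DESC_PER_PKT 8  /* datasheet 8.4.1, non-TSO case */

/* Hypothetical proposal-1 callback: returns true if the chain must be
 * linearized before it can be posted to an xl710 TX ring. Each chained
 * segment consumes one data descriptor, so nb_segs is the whole test. */
static bool
xl710_needs_linearize(const struct rte_mbuf *m)
{
    return m->nb_segs > XL710_MAX_DESC_PER_PKT;
}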
[dpdk-dev] RFC: i40e xmit path HW limitation
On 07/30/2015 08:01 PM, Stephen Hemminger wrote: > On Thu, 30 Jul 2015 19:50:27 +0300 > Vlad Zolotarov wrote: > >> >> On 07/30/15 19:20, Avi Kivity wrote: >>> >>> On 07/30/2015 07:17 PM, Stephen Hemminger wrote: >>>> On Thu, 30 Jul 2015 17:57:33 +0300 >>>> Vlad Zolotarov wrote: >>>> >>>>> Hi, Konstantin, Helin, >>>>> there is a documented limitation of xl710 controllers (i40e driver) >>>>> which is not handled in any way by a DPDK driver. >>>>> From the datasheet chapter 8.4.1: >>>>> >>>>> "• A single transmit packet may span up to 8 buffers (up to 8 data >>>>> descriptors per packet including >>>>> both the header and payload buffers). >>>>> • The total number of data descriptors for the whole TSO (explained >>>>> later on in this chapter) is >>>>> unlimited as long as each segment within the TSO obeys the previous >>>>> rule (up to 8 data descriptors >>>>> per segment for both the TSO header and the segment payload buffers)." >>>>> >>>>> This means that, for instance, a long cluster with small fragments has to >>>>> be linearized before it may be placed on the HW ring. >>>>> In more standard environments like Linux or FreeBSD drivers the >>>>> solution >>>>> is straightforward - call skb_linearize()/m_collapse() respectively. >>>>> In the non-conformist environment like DPDK life is not that easy - >>>>> there is no easy way to collapse the cluster into a linear buffer from >>>>> inside the device driver >>>>> since device driver doesn't allocate memory in a fast path and utilizes >>>>> the user allocated pools only. >>>>> >>>>> Here are two proposals for a solution: >>>>> >>>>> 1. We may provide a callback that would return a user TRUE if a given >>>>> cluster has to be linearized and it should always be called before >>>>> rte_eth_tx_burst(). Alternatively it may be called from inside the >>>>> rte_eth_tx_burst() and rte_eth_tx_burst() is changed to return >>>>> some >>>>> error code for a case when one of the clusters it's given has >>>>> to be >>>>> linearized. >>>>> 2. Another option is to allocate a mempool in the driver with the >>>>> elements consuming a single page each (standard 2KB buffers would >>>>> do). Number of elements in the pool should be the Tx ring length >>>>> multiplied by "64KB/(linear data length of the buffer in the pool >>>>> above)". Here I use 64KB as a maximum packet length and not taking >>>>> into account esoteric things like "Giant" TSO mentioned in the >>>>> spec above. Then we may actually go and linearize the cluster if >>>>> needed on top of the buffers from the pool above, post the buffer >>>>> from the mempool above on the HW ring, link the original >>>>> cluster to >>>>> that new cluster (using the private data) and release it when the >>>>> send is done. >>>> Or just silently drop heavily scattered packets (and increment oerrors) >>>> with a PMD_TX_LOG debug message. >>>> >>>> I think a DPDK driver doesn't have to accept all possible mbufs and do >>>> extra work. It seems reasonable to expect caller to be well behaved >>>> in this restricted ecosystem. >>>> >>> How can the caller know what's well behaved? It's device dependent. >> +1 >> >> Stephen, how do you imagine this well-behaved application? Having switch >> case by an underlying device type and then "well-behaving" correspondingly? >> Not to mention that to "well-behave" the application writer has to read >> HW specs and understand them, which would limit the amount of DPDK >> developers to a very small amount of people... 
;) Not to mention that >> the mentioned above switch-case would be a super ugly thing to be found >> in an application that would raise a big question about the >> justification of a DPDK existence as an SDK providing device drivers >> interface. ;) > Either have a RTE_MAX_MBUF_SEGMENTS that is global or > a mbuf_linearize function? Driver already can stash the > mbuf pool used for Rx and reuse it for the transient Tx buffers. > The pass/fail criteria are much more complicated than that. You might have a packet with 340 fragments successfully transmitted (64k/1500*8) or a packet with 9 fragments fail. What's wrong with exposing the pass/fail criteria as a driver-supplied function? If the application is sure that its mbufs pass, it can choose not to call it. A less constrained application will call it, and linearize the packet itself if it fails the test.
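To illustrate how device-dependent the criteria get, here is a sketch of the windowed walk that the xl710's per-TSO-segment rule implies: slide an MSS-sized window over the buffer chain and check that no window touches more than 8 buffers. It is a conservative approximation (header-descriptor accounting is omitted) and all names are invented; the point is that no single global RTE_MAX_MBUF_SEGMENTS captures it.

#include <stdbool.h>
#include <stdint.h>
#include <rte_mbuf.h>

#define XL710_MAX_DESC_PER_SEG 8  /* per TSO segment, datasheet 8.4.1 */

static bool
xl710_tso_needs_linearize(const struct rte_mbuf *m, uint16_t mss)
{
    const struct rte_mbuf *seg = m;
    uint32_t window = 0;   /* payload bytes gathered toward one segment */
    unsigned int bufs = 0; /* buffers the current segment touches */

    if (mss == 0)
        return true;

    while (seg != NULL) {
        bufs++;
        window += seg->data_len;
        if (bufs > XL710_MAX_DESC_PER_SEG)
            return true;   /* some segment spans too many buffers */
        while (window >= mss) {
            /* one full segment fits; the next one may start in the
             * tail of this buffer, so it already touches one buffer */
            window -= mss;
            bufs = 1;
        }
        seg = seg->next;
    }
    return false;
}

A packet of roughly 340 fragments can pass when the fragments happen to line up so that each MSS window touches at most 8 buffers, while a 9-fragment packet whose fragments all fall inside one MSS fails; that matches the asymmetry described above.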
[dpdk-dev] [ANNOUNCE] ScyllaDB: new NoSQL database powered by DPDK
Hello dpdk-ers, We are pleased to announce Scylla, a new open-source NoSQL database powered by DPDK. Scylla's performance (one million transactions per second per node) derives in part from a user-space TCP/IP stack, using DPDK to drive the network cards. Scylla is open source and can be found at https://github.com/scylladb/scylla. Scylla's TCP stack and DPDK integration are part of a server-side framework, Seastar, which can be found at https://github.com/scylladb/seastar. For more information, visit http://scylladb.com.
[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
On 09/30/2015 03:27 PM, Michael S. Tsirkin wrote: > On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote: >> >> On 09/30/15 15:03, Michael S. Tsirkin wrote: >>> On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote: >>>> On 09/30/15 14:41, Michael S. Tsirkin wrote: >>>>> On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote: >>>>>> The whole idea is to bypass kernel. Especially for networking... >>>>> ... on dumb hardware that doesn't support doing that securely. >>>> On a very capable HW that supports whatever security requirements needed >>>> (e.g. 82599 Intel's SR-IOV VF devices). >>> Network card type is irrelevant as long as you do not have an IOMMU, >>> otherwise you would just use e.g. VFIO. >> Sorry, but I don't follow your logic here - Amazon EC2 environment is an >> example where there *is* iommu but it's not virtualized >> and thus VFIO is >> useless and there is an option to use directly assigned SR-IOV networking >> device there where using the kernel drivers impose a performance impact >> compared to user space UIO-based user space kernel bypass mode of usage. How >> is it irrelevant? Could u, pls, clarify your point? >> > So it's not even dumb hardware, it's another piece of software > that forces an "all or nothing" approach where either > device has access to all VM memory, or none. > And this, unfortunately, leaves you with no secure way to > allow userspace drivers. Some setups don't need security (they are single-user, single application), but they do need a lot of performance (like 5x-10x). An example is OpenVSwitch: security doesn't help it at all, and if you force it to use the kernel drivers you cripple it. Also, I'm root. I can do anything I like, including loading a patched pci_uio_generic. You're not providing _any_ security, you're simply making life harder for users. > So it makes even less sense to add insecure work-arounds in the kernel. > It seems quite likely that by the time the new kernel reaches > production X years from now, EC2 will have a virtual iommu. I can adopt a new kernel tomorrow. I have no influence on EC2.
[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
On 09/30/2015 05:39 PM, Michael S. Tsirkin wrote: > On Wed, Sep 30, 2015 at 04:05:40PM +0300, Avi Kivity wrote: >> >> On 09/30/2015 03:27 PM, Michael S. Tsirkin wrote: >>> On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote: >>>> On 09/30/15 15:03, Michael S. Tsirkin wrote: >>>>> On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote: >>>>>> On 09/30/15 14:41, Michael S. Tsirkin wrote: >>>>>>> On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote: >>>>>>>> The whole idea is to bypass kernel. Especially for networking... >>>>>>> ... on dumb hardware that doesn't support doing that securely. >>>>>> On a very capable HW that supports whatever security requirements needed >>>>>> (e.g. 82599 Intel's SR-IOV VF devices). >>>>> Network card type is irrelevant as long as you do not have an IOMMU, >>>>> otherwise you would just use e.g. VFIO. >>>> Sorry, but I don't follow your logic here - Amazon EC2 environment is an >>>> example where there *is* iommu but it's not virtualized >>>> and thus VFIO is >>>> useless and there is an option to use directly assigned SR-IOV networking >>>> device there where using the kernel drivers impose a performance impact >>>> compared to user space UIO-based user space kernel bypass mode of usage. >>>> How >>>> is it irrelevant? Could u, pls, clarify your point? >>>> >>> So it's not even dumb hardware, it's another piece of software >>> that forces an "all or nothing" approach where either >>> device has access to all VM memory, or none. >>> And this, unfortunately, leaves you with no secure way to >>> allow userspace drivers. >> Some setups don't need security (they are single-user, single application), >> but they do need a lot of performance (like 5x-10x). An example is >> OpenVSwitch: security doesn't help it at all, and if you force it to use the >> kernel drivers you cripple it. > We'd have to see there are actual users that need this. So far, dpdk > seems like the only one, dpdk is a whole class of users. It's not a specific application. > and it wants to use UIO for slow path stuff > like polling link status. Why this needs kernel bypass support, I don't > know. I asked, and got no answer. First, it's more than link status. dpdk also has an interrupt mode, which applications can fall back to when the load is light in order to save power (and in order not to get support calls about 100% cpu when idle). Even for link status, you don't want to poll for that, because accessing device registers is expensive. An interrupt is the best approach for rare events like link changes. > >> Also, I'm root. I can do anything I like, including loading a patched >> pci_uio_generic. You're not providing _any_ security, you're simply making >> life harder for users. > Maybe that's true on your system. But I guess you know that's not true > for everyone, not in 2015. Why is it not true? If I'm root, I can do anything I like to my system, and everyone is root in 2015. I can access the BARs directly and program DMA, how am I more secure by uio not allowing me to set up msix? Non-root users are already secured by their inability to load the module, and by the device permissions. > >>> So it makes even less sense to add insecure work-arounds in the kernel. >>> It seems quite likely that by the time the new kernel reaches >>> production X years from now, EC2 will have a virtual iommu. >> I can adopt a new kernel tomorrow. I have no influence on EC2. >> >> > Xen grant tables sound like they could be the right interface > for EC2. 
> google search for "grant tables iommu" immediately gives me: > http://lists.xenproject.org/archives/html/xen-devel/2014-04/msg00963.html > Maybe latest Xen is already doing the right thing, and it's just the > question of making VFIO use that. Grant tables only work for virtual devices, not physical devices.
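Since slow-path interrupts keep coming up: for link status specifically, DPDK already exposes this through the LSC event callback, so no polling of device registers is needed. A rough sketch against the 2015-era API (error handling trimmed; assumes the PMD supports intr_conf.lsc):

#include <stdio.h>
#include <rte_ethdev.h>

/* Called from the interrupt thread on a link status change. */
static void
lsc_event_cb(uint8_t port_id, enum rte_eth_event_type type, void *arg)
{
    struct rte_eth_link link;

    (void)type;
    (void)arg;
    rte_eth_link_get_nowait(port_id, &link);
    printf("port %u link %s\n", port_id,
           link.link_status ? "up" : "down");
}

static int
setup_lsc(uint8_t port_id, struct rte_eth_conf *conf)
{
    conf->intr_conf.lsc = 1;  /* ask the PMD to enable the LSC interrupt */
    /* ... followed by the usual rte_eth_dev_configure()/start() ... */
    return rte_eth_dev_callback_register(port_id, RTE_ETH_EVENT_INTR_LSC,
                                         lsc_event_cb, NULL);
}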
[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
On 09/30/2015 06:21 PM, Michael S. Tsirkin wrote: > On Wed, Sep 30, 2015 at 05:53:54PM +0300, Avi Kivity wrote: >> On 09/30/2015 05:39 PM, Michael S. Tsirkin wrote: >>> On Wed, Sep 30, 2015 at 04:05:40PM +0300, Avi Kivity wrote: >>>> On 09/30/2015 03:27 PM, Michael S. Tsirkin wrote: >>>>> On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote: >>>>>> On 09/30/15 15:03, Michael S. Tsirkin wrote: >>>>>>> On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote: >>>>>>>> On 09/30/15 14:41, Michael S. Tsirkin wrote: >>>>>>>>> On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote: >>>>>>>>>> The whole idea is to bypass kernel. Especially for networking... >>>>>>>>> ... on dumb hardware that doesn't support doing that securely. >>>>>>>> On a very capable HW that supports whatever security requirements >>>>>>>> needed >>>>>>>> (e.g. 82599 Intel's SR-IOV VF devices). >>>>>>> Network card type is irrelevant as long as you do not have an IOMMU, >>>>>>> otherwise you would just use e.g. VFIO. >>>>>> Sorry, but I don't follow your logic here - Amazon EC2 environment is an >>>>>> example where there *is* iommu but it's not virtualized >>>>>> and thus VFIO is >>>>>> useless and there is an option to use directly assigned SR-IOV networking >>>>>> device there where using the kernel drivers impose a performance impact >>>>>> compared to user space UIO-based user space kernel bypass mode of usage. >>>>>> How >>>>>> is it irrelevant? Could u, pls, clarify your point? >>>>>> >>>>> So it's not even dumb hardware, it's another piece of software >>>>> that forces an "all or nothing" approach where either >>>>> device has access to all VM memory, or none. >>>>> And this, unfortunately, leaves you with no secure way to >>>>> allow userspace drivers. >>>> Some setups don't need security (they are single-user, single application), >>>> but they do need a lot of performance (like 5x-10x). An example is >>>> OpenVSwitch: security doesn't help it at all, and if you force it to use the >>>> kernel drivers you cripple it. >>> We'd have to see there are actual users that need this. So far, dpdk >>> seems like the only one, >> dpdk is a whole class of users. It's not a specific application. >> >>> and it wants to use UIO for slow path stuff >>> like polling link status. Why this needs kernel bypass support, I don't >>> know. I asked, and got no answer. >> First, it's more than link status. dpdk also has an interrupt mode, which >> applications can fall back to when the load is light in order to save >> power (and in order not to get support calls about 100% cpu when idle). > Aha, looks like it appeared in June. Interesting, thanks for the info. > >> Even for link status, you don't want to poll for that, because accessing >> device registers is expensive. An interrupt is the best approach for rare >> events like link changes. > Yea, but you probably can get by with a timer for that, even if it's ugly. Maybe you can, but (a) why increase link status change detection latency, and (b) link status change detection has not been the only user of the feature since June. >>>> Also, I'm root. I can do anything I like, including loading a patched >>>> pci_uio_generic. You're not providing _any_ security, you're simply making >>>> life harder for users. >>> Maybe that's true on your system. But I guess you know that's not true >>> for everyone, not in 2015. >> Why is it not true? If I'm root, I can do anything I like to my >> system, and everyone is root in 2015. 
>> I can access the BARs directly >> and program DMA, how am I more secure by uio not allowing me to set up >> msix? > That's not the point. The point always was that using uio for these > devices (capable of DMA, in particular of msix) isn't possible in a > secure way. uio is used today for DMA-capable devices. Some users are perfectly willing to give up security for functionality (that's all users who have root access to their machines, not just uio users). You aren't adding any security by disallowing uio, you're just removing functionality.
[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
On 09/30/2015 11:40 PM, Michael S. Tsirkin wrote: > On Wed, Sep 30, 2015 at 06:36:17PM +0300, Avi Kivity wrote: >> As it happens, you're removing the functionality from the users who have no >> other option. They can't use vfio because it doesn't work on virtualized >> setups. > ... > >> Root can already do anything. > I think there's a contradiction between the two claims above. Yes, root can replace the current kernel with a patched kernel. In that sense, root can do anything, and the kernel is complete. Now let's stop playing word games. >> So what security issue is there? > A buggy userspace can and will corrupt kernel memory. > > ... > >> And for what, to prevent >> root from touching memory via dma that they can access in a million other >> ways? > So one can be reasonably sure a kernel oops is not a result of a > userspace bug. > That's not security. It's a legitimate concern though, one that is addressed by tainting the kernel.
[dpdk-dev] VMXNET3 on vmware, ping delay
On 06/25/2015 06:18 PM, Matthew Hall wrote: > On Thu, Jun 25, 2015 at 09:14:53AM +, Vass, Sandor (Nokia - HU/Budapest) > wrote: >> According to my understanding each packet should go >> through BR as fast as possible, but it seems that the rte_eth_rx_burst >> retrieves packets only when there are at least 2 packets on the RX queue of >> the NIC. At least most of the times as there are cases (rarely - according >> to my console log) when it can retrieve 1 packet also and sometimes only 3 >> packets can be retrieved... > By default DPDK is optimized for throughput not latency. Try a test with > heavier traffic. > > There is also some work going on now for DPDK interrupt-driven mode, which > will work more like traditional Ethernet drivers instead of polling mode > Ethernet drivers. > > Though I'm not an expert on it, there is also a series of ways to optimize for > latency, which hopefully some others could discuss... or maybe search the > archives / web site / Intel tuning documentation. > What would be useful is a runtime switch between polling and interrupt modes. This way, if the load is low you use interrupts, and as mitigation you switch to poll mode until the load drops again.
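The switch being described maps onto the RX interrupt API that went into DPDK in mid-2015. A rough sketch of the idle-detection loop, with the threshold and structure invented here (assumes the Linux EAL, a port configured with intr_conf.rxq = 1, and the queue registered via rte_eth_dev_rx_intr_ctl_q()):

#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_interrupts.h>

#define IDLE_POLLS_BEFORE_SLEEP 256  /* arbitrary threshold */
#define BURST_SIZE 32

static void
rx_loop(uint8_t port, uint16_t queue)
{
    struct rte_mbuf *pkts[BURST_SIZE];
    struct rte_epoll_event ev;
    unsigned int idle = 0;

    for (;;) {
        uint16_t i, n = rte_eth_rx_burst(port, queue, pkts, BURST_SIZE);

        if (n > 0) {
            idle = 0;              /* busy: stay in poll mode */
            for (i = 0; i < n; i++)
                rte_pktmbuf_free(pkts[i]);  /* stand-in for real work */
            continue;
        }
        if (++idle < IDLE_POLLS_BEFORE_SLEEP)
            continue;

        /* idle long enough: arm the interrupt and sleep until traffic */
        rte_eth_dev_rx_intr_enable(port, queue);
        rte_epoll_wait(RTE_EPOLL_PER_THREAD, &ev, 1, -1 /* no timeout */);
        rte_eth_dev_rx_intr_disable(port, queue);
        idle = 0;
    }
}

One race is worth noting: packets can arrive between the last empty poll and arming the interrupt, so a production loop would typically poll once more between rte_eth_dev_rx_intr_enable() and rte_epoll_wait() before committing to sleep.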
[dpdk-dev] VMXNET3 on vmware, ping delay
On 06/25/2015 09:44 PM, Thomas Monjalon wrote: > 2015-06-25 18:46, Avi Kivity: >> On 06/25/2015 06:18 PM, Matthew Hall wrote: >>> On Thu, Jun 25, 2015 at 09:14:53AM +, Vass, Sandor (Nokia - >>> HU/Budapest) wrote: >>>> According to my understanding each packet should go >>>> through BR as fast as possible, but it seems that the rte_eth_rx_burst >>>> retrieves packets only when there are at least 2 packets on the RX queue of >>>> the NIC. At least most of the times as there are cases (rarely - according >>>> to my console log) when it can retrieve 1 packet also and sometimes only 3 >>>> packets can be retrieved... >>> By default DPDK is optimized for throughput not latency. Try a test with >>> heavier traffic. >>> >>> There is also some work going on now for DPDK interrupt-driven mode, which >>> will work more like traditional Ethernet drivers instead of polling mode >>> Ethernet drivers. >>> >>> Though I'm not an expert on it, there is also a series of ways to optimize >>> for >>> latency, which hopefully some others could discuss... or maybe search the >>> archives / web site / Intel tuning documentation. >>> >> What would be useful is a runtime switch between polling and interrupt >> modes. This way, if the load is low you use interrupts, and as >> mitigation you switch to poll mode until the load drops again. > DPDK is not a stack. It's up to the DPDK application to poll or use interrupts > when needed. As long as DPDK provides a mechanism for a runtime switch, the application can do that.