[dpdk-dev] Pktgen-DPDK compile error on Ubuntu 14.04

2014-07-24 Thread Luke Gorrie
Howdy!

I am having trouble building Pktgen-DPDK from Github on Ubuntu 14.04. Is
this supported? If so, would anybody tell me how to get the build working?

I have tried to be faithful to the instructions in README.md and I have
tested with both master and the latest release tag (pktgen-2.7.1).

The compiler output is pasted below and also in this Gist:
https://gist.github.com/lukego/bc8f7ef97ad0110c60e1

Any ideas appreciated :).

  CC [M]  /home/luke/github/Pktgen-DPDK.release/dpdk/x86_64-pktgen-linuxapp-gcc/build/lib/librte_eal/linuxapp/kni/igb_main.o

/home/luke/github/Pktgen-DPDK.release/dpdk/x86_64-pktgen-linuxapp-gcc/build/lib/librte_eal/linuxapp/kni/igb_main.c:
In function 'igb_rx_hash':

/home/luke/github/Pktgen-DPDK.release/dpdk/x86_64-pktgen-linuxapp-gcc/build/lib/librte_eal/linuxapp/kni/igb_main.c:7379:3:
error: implicit declaration of function 'skb_set_hash'
[-Werror=implicit-function-declaration]

   skb_set_hash(skb, le32_to_cpu(rx_desc->wb.lower.hi_dword.rss),

   ^

/home/luke/github/Pktgen-DPDK.release/dpdk/x86_64-pktgen-linuxapp-gcc/build/lib/librte_eal/linuxapp/kni/igb_main.c:7380:9:
error: 'PKT_HASH_TYPE_L3' undeclared (first use in this function)

 PKT_HASH_TYPE_L3);

 ^

/home/luke/github/Pktgen-DPDK.release/dpdk/x86_64-pktgen-linuxapp-gcc/build/lib/librte_eal/linuxapp/kni/igb_main.c:7380:9:
note: each undeclared identifier is reported only once for each function it
appears in

cc1: all warnings being treated as errors

make[10]: ***
[/home/luke/github/Pktgen-DPDK.release/dpdk/x86_64-pktgen-linuxapp-gcc/build/lib/librte_eal/linuxapp/kni/igb_main.o]
Error 1


[dpdk-dev] Pktgen-DPDK compile error on Ubuntu 14.04

2014-07-24 Thread Luke Gorrie
Hi Pablo,

On 24 July 2014 12:33, De Lara Guarch, Pablo  wrote:

> I think you are seeing the same error as other people are seeing for
> DPDK-1.7 on Ubuntu 14.04.
> Are you using kernel 3.13.0-24 or 3.13.0-30/32?
>

Thanks for the quick response. I'm currently using kernel 3.13.0-24.


[dpdk-dev] [snabb-devel] Re: memory barriers in virtq.lua?

2015-04-07 Thread Luke Gorrie
Hi Michael,

I'm writing to follow up the previous discussion about memory barriers in
virtio-net device implementations, and Cc'ing the DPDK list because I
believe this is relevant to them too.

First, thanks again for getting in touch and reviewing our code.

I have now found a missed case where we *do* require a hardware memory
barrier on x86 in our vhost/virtio-net device. That is when checking the
interrupt suppression flag after updating used->idx. This is needed because
x86 can reorder the write to used->idx after the read from avail->flags,
and that causes the guest to see a stale value of used->idx after it
toggles interrupt suppression.

If I may spell out my mental model, for the sake of being corrected and/or
as an example of how third party developers are reading and interpreting
the Virtio-net spec:

Relating this to Virtio 1.0, the most relevant section is 3.2.1 (Supplying
Buffers to the Device) which calls for two "suitable memory barriers". The
spec talks about these from the driver perspective, but they are both
relevant to the device side too.

The first barrier (write to descriptor table before write to used->idx) is
implicit on x86 because writes by the same core are not reordered. This
means that no explicit hardware barrier is needed. (A compiler barrier may
be needed, however.)

The second memory barrier (write to used->idx before reading avail->flags)
is not implicit on x86 because stores are reordered after loads. So an
explicit hardware memory barrier is needed.

I hope that is a correct assessment of the situation. (Forgive my
x86centricity, I am sure that seems very foreign to kernel hackers.)

If this assessment is correct then the DPDK developers might also want to
review librte_vhost/vhost_rxtx.c and consider adding a hardware memory
barrier between writing used->idx and reading avail->flags.
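To make the suggestion concrete, here is a minimal sketch of the device-side
sequence I have in mind, using the standard vring layout from
<linux/virtio_ring.h>. It is illustrative only, not the librte_vhost code, and
the helper name is made up:

    #include <stdint.h>
    #include <linux/virtio_ring.h>

    /* Publish one used entry, then decide whether to interrupt the guest.
     * The mfence between the used->idx store and the avail->flags load is
     * the barrier discussed above; without it x86 may hoist the load above
     * the store and the guest can observe a stale used->idx. */
    static int vring_publish_used(struct vring *vq, uint32_t desc_id, uint32_t len)
    {
        uint16_t idx = vq->used->idx;

        vq->used->ring[idx % vq->num].id  = desc_id;
        vq->used->ring[idx % vq->num].len = len;

        /* First barrier: descriptor write before the used->idx update.
         * Store-store ordering on x86 makes a compiler barrier sufficient. */
        asm volatile("" ::: "memory");
        vq->used->idx = idx + 1;

        /* Second barrier: used->idx store before the avail->flags load.
         * Store-load reordering is allowed on x86, so a real fence is needed. */
        asm volatile("mfence" ::: "memory");

        return !(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT);
    }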

Cheers,
-Luke

P.S. I notice that the Linux virtio-net driver does not seem to tolerate
spurious interrupts, even though the Virtio 1.0 spec requires this
("must"). On 3.13.11-ckt15 I see them trigger an "irq nobody cared" kernel
log message and then the irq is disabled. If that sounds suspicious I can
supply more information.


[dpdk-dev] [snabb-devel] Re: memory barriers in virtq.lua?

2015-04-08 Thread Luke Gorrie
On 7 April 2015 at 17:30, Michael S. Tsirkin  wrote:

> Just guessing from the available info:
>
> I think you refer to this:
> The driver MUST handle spurious interrupts from the device.
>
> The intent is to be able to handle some spurious interrupts once in a
> while.  AFAIK linux triggers the message if it gets a huge number of
> spurious interrupts for an extended period of time.
> For example, this will trigger if the device does not clear interrupt
> line after interrupt register read.
>

Thanks for that info.

The only spurious interrupt that I think we need is one when vhost-user
reconnects. That would be to cover the case where the vswitch is restarted
after writing used->idx but before sending the interrupt.

Or perhaps there is a better solution to that case?

Looking forward to getting an upstream vhost-user reconnect. One thing at a
time... :)

Cheers,
-Luke


[dpdk-dev] [snabb-devel] Re: memory barriers in virtq.lua?

2015-04-09 Thread Luke Gorrie
Howdy,

On 8 April 2015 at 17:15, Xie, Huawei  wrote:

> luke:
> 1. host reads the flag. 2. guest toggles the flag. 3. guest checks used.
> 4. host updates used.
> Is this your case?
>

Yep, that is exactly the case I mean.

Cheers,
-Luke


[dpdk-dev] [PATCH] Implement memcmp using AVX/SSE instructio

2015-04-23 Thread Luke Gorrie
On 23 April 2015 at 10:11, Bruce Richardson 
wrote:

> Also, if I read your quoted performance numbers in your earlier mail
> correctly,
> we are only looking at a 1-4% performance increase. Is the additional code
> to
> maintain worth the benefit?
>

... and if so, how would one decide whether it is better to add this to
DPDK vs contribute it to GNU libc?

Pawel noted that this is not compatible with memcmp(3). It is very similar
to the legacy function bcmp(3) though so perhaps libc would accept it as
such.
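To illustrate the semantic gap (a trivial example of my own, not part of the
patch under review): memcmp(3) must report ordering, while bcmp(3) only
promises equal/not-equal, which is all a SIMD equality test naturally gives
you.

    #include <stdio.h>
    #include <string.h>
    #include <strings.h>   /* legacy bcmp(3) */

    int main(void)
    {
        const char a[] = "abcd", b[] = "abce";

        /* memcmp must say which argument sorts first: negative here. */
        printf("memcmp(a, b, 4) = %d\n", memcmp(a, b, 4));

        /* bcmp only reports zero (equal) or non-zero (different). */
        printf("bcmp(a, b, 4) != 0 -> %d\n", bcmp(a, b, 4) != 0);
        return 0;
    }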

Cheers,
-Luke


[dpdk-dev] [PATCH] vhost: flush used->idx update before reading avail->flags

2015-04-23 Thread Luke Gorrie
On 22 April 2015 at 18:33, Huawei Xie  wrote:

> update of used->idx and read of avail->flags could be reordered.
> memory fence should be used to ensure the order, otherwise guest could see
> a stale used->idx value after it toggles the interrupt suppression flag.
>

This patch looks right to me.


[dpdk-dev] [PATCH] vhost: flush used->idx update before reading avail->flags

2015-04-24 Thread Luke Gorrie
On 24 April 2015 at 03:01, Linhaifeng  wrote:

> If not add memory fence what would happen? Packets loss or interrupt
> loss?How to test it ?
>

You should be able to test it like this:

1. Boot two Linux kernel (e.g. 3.13) guests.
2. Connect them via vhost switch.
3. Run continuous traffic between them (e.g. iperf).

I would expect that within a reasonable timeframe (< 1 hour) one of the
guests' network interfaces will hang indefinitely due to a missed interrupt.

You won't be able to reproduce this using DPDK guests because they are not
using the same interrupt suppression method.

This is a serious real-world problem. I wouldn't deploy the vhost
implementation without this fix.

Cheers,
-Luke


[dpdk-dev] Beyond DPDK 2.0

2015-04-24 Thread Luke Gorrie
Hi Tim,

On 16 April 2015 at 12:38, O'Driscoll, Tim  wrote:

> Following the launch of DPDK by Intel as an internal development project,
> the launch of dpdk.org by 6WIND in 2013, and the first DPDK RPM packages
> for Fedora in 2014, 6WIND, Red Hat and Intel would like to prepare for
> future releases after DPDK 2.0 by starting a discussion on its evolution.
> Anyone is welcome to join this initiative.
>

Thank you for the open invitation.

I have a couple of questions about the long term of DPDK:

1. How will DPDK manage overlap with other projects over time?

In some ways DPDK is growing more overlap with other projects e.g.
forking/rewriting functionality from Linux (e.g. ixgbe), FreeBSD (e.g.
Broadcom PMD), GLIBC (e.g. memcpy).

In other ways DPDK is delegating functionality to external systems instead
e.g. the bifurcated driver (delegate to kernel) and Mellanox PMD (delegate
to vendor shared library).

How is this going to play out over the long term? And is there an
existential risk that it will end up being easier to port the good bits of
DPDK into the kernel than the rest of the good bits of the kernel into DPDK?

2. How will DPDK users justify contributing to DPDK upstream?

Engineers in network equipment vendors want to contribute to open source,
but what is the incentive for the companies to support this? This would be
easy if DPDK were GPL'd (they are compelled) or if everybody were
dynamically linking with the upstream libdpdk (can't have private patches).
However, in a world where DPDK is BSD-licensed and statically linked, is it
not both cheaper and competitively advantageous to keep fixes and
optimizations in house?

Today the community is benefiting immensely from the contributions of
companies like 6WIND and Brocade, but I wonder if this is going to be the
exception or the rule.

That's all from me. Thanks for listening :-).

Cheers,
-Luke


[dpdk-dev] Beyond DPDK 2.0

2015-04-26 Thread Luke Gorrie
Hi Neil,

Thanks for taking the time to reflect on my ideas.

On 24 April 2015 at 19:00, Neil Horman  wrote:

> DPDK will always be something of a niche market for users to whom every
> last ounce of performance is the primary requirement


This does seem like an excellent position. It is succinct, it sets
expectations for users, and it tells developers how to resolve trade-offs
(performance takes priority over FOO, for all values of FOO). I agree that
this niche will always be there and so it seems like there is a permanent
place in the world for DPDK.

This focus on performance also makes DPDK useful as a reference for other
projects. People making trade-offs between performance and other factors
(portability, compatibility, simplicity, etc) can use DPDK as a yardstick
to estimate what this costs. This benefits everybody doing networking on
x86.

I suppose that a separate discussion would be how to increase participation
from people who are using DPDK as a reference but not as a software
dependency. That is perhaps a less pressing topic for the future.

> OVS is a great example here. If we can make it easy for them to use DPDK
> to get better performance, I think we'll see a larger uptake in adoption.
>

I will be interested to see how this plays out.

I agree it is a great opportunity for DPDK and a chance to take it
mainstream.

I also think it is fundamentally a missed opportunity of the kernel. OVS
would be just fine with a kernel data plane that performs adequately. OVS
users don't seem to be in the "maximum performance at any cost" niche
defined above. Many of them benefit a lot from the kernel integration.
However, if the kernel can't promise to meet their performance
requirements then DPDK does seem like a knight in shining armour.

It's an exciting time in open source networking :-)

Cheers,
-Luke


[dpdk-dev] Possible bug in mlx5_tx_burst_mpw?

2016-09-14 Thread Luke Gorrie
Howdy,

Just noticed a line of code that struck me as odd and so I am writing just
in case it is a bug:

http://dpdk.org/browse/dpdk/tree/drivers/net/mlx5/mlx5_rxtx.c#n1014

Specifically the check "(mpw.length != length)" in mlx5_tx_burst_mpw() looks
like a descriptor-format optimization for the special case where
consecutive packets on the wire are exactly the same size. This would
strike me as peculiar.
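To spell out my reading of it (a pseudocode paraphrase, not the actual mlx5
code; the session/helper names here are invented):

    /* A multi-packet write (MPW) session is only extended while each new
     * packet has the same byte length as the previous ones; a different
     * length closes the session and opens a new one. So a stream of
     * equal-sized packets shares descriptor overhead, mixed sizes do not. */
    if (!session_open || session.length != pkt_length) {
            if (session_open)
                    mpw_close(txq, &session);   /* flush what we have */
            mpw_open(txq, &session, pkt_length);
            session_open = 1;
    }
    mpw_append(txq, &session, pkt);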

Just wanted to check, is that interpretation correct and if so then is this
intentional?

Cheers,
-Luke


[dpdk-dev] Possible bug in mlx5_tx_burst_mpw?

2016-09-14 Thread Luke Gorrie
Hi Adrien,

On 14 September 2016 at 16:30, Adrien Mazarguil 
wrote:

> Your interpretation is correct (this is intentional and not a bug).
>

Thanks very much for clarifying.

This is interesting to me because I am also working on a ConnectX-4 (Lx)
driver based on the newly released driver interface specification [1] and I
am wondering how interested I should be in this MPW feature that is
currently not documented.

In the event successive packets share a few properties (length, number of
> segments, offload flags), these can be factored out as an optimization to
> lower the amount of traffic on the PCI bus. This feature is currently
> supported by the ConnectX-4 Lx family of adapters.
>

I have a concern here that I hope you will forgive me for voicing.

This optimization seems to run the risk of inflating scores on
constant-packet-size IXIA-style benchmarks like [2] and making them less
useful for predicting real-world performance. That seems like a negative to
me as an application developer. I wonder if I am overlooking some practical
benefits that motivate implementing this in silicon and in the driver and
enabling it by default?

Cheers,
-Luke

[1]
http://www.mellanox.com/related-docs/user_manuals/Ethernet_Adapters_Programming_Manual.pdf
[2]
https://www.mellanox.com/blog/2016/06/performance-beyond-numbers-stephen-curry-style-server-io/


[dpdk-dev] Possible bug in mlx5_tx_burst_mpw?

2016-09-16 Thread Luke Gorrie
Hi Adrien,

Thanks for taking the time to write a detailed reply. This indeed sounds
reasonable to me. Users will need to take these special-cases into account
when predicting performance on their own anticipated workloads, which is a
bit tricky, but then that is life when dealing with complex new technology.
I am eager to see what new techniques come down the pipeline for
efficiently moving packets and descriptors across PCIe.

Thanks again for the detailed reply.

Cheers!
-Luke


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-09-26 Thread Luke Gorrie
On 22 September 2016 at 11:01, Jianbo Liu  wrote:

> Tested with testpmd, host: txonly, guest: rxonly
> size (bytes) improvement (%)
> 64             4.12
> 128            6
> 256            2.65
> 512           -1.12
> 1024          -7.02
>

Have you considered testing with more diverse workloads e.g. mixed packet
sizes that are not always multiples of the cache line & register sizes?


[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-22 Thread Luke Gorrie
Howdy!

This memcpy discussion is absolutely fascinating. Glad to be a fly on the
wall!

On 21 January 2015 at 22:25, Jim Thompson  wrote:

>
> The differences with DPDK are that a) entire cores (including the AVX/SSE
> units and even AES-NI (FPU)) are dedicated to DPDK, and b) DPDK is a library,
> and the resulting networking applications are exactly that, applications.
> The "operating system" is now a control plane.
>
>
Here is another thought: when is it time to start thinking of packet copy
as a cheap unit-time operation?

Packets are shrinking exponentially when measured in:

- Cache lines
- Cache load/store operations needed to copy
- Number of vector move instructions needed to copy

because those units are all based on exponentially growing quantities,
while the byte size of packets stays the same for many applications.

So when is it time to stop caring?

(Are we already there, even, for certain conditions? How about Haswell CPU,
data already exclusively in our L1 cache, start and end both known to be
cache-line-aligned?)
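Here is the arithmetic I have in mind, spelled out (a throwaway sketch of my
own: 64-byte cache lines, and 16/32/64-byte vector registers standing in for
SSE/AVX2/AVX-512):

    #include <stdio.h>

    int main(void)
    {
        const int pkt[]   = { 64, 256, 1500 };   /* packet sizes in bytes */
        const int width[] = { 16, 32, 64 };      /* bytes moved per vector register */

        for (int i = 0; i < 3; i++) {
            printf("%4d B packet: %2d cache lines", pkt[i], (pkt[i] + 63) / 64);
            for (int j = 0; j < 3; j++)
                printf(", %2d moves @ %2d B",
                       (pkt[i] + width[j] - 1) / width[j], width[j]);
            printf("\n");
        }
        return 0;
    }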

Cheers,
-Luke (eagerly awaiting arrival of Haswell server...)


[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-22 Thread Luke Gorrie
On 22 January 2015 at 14:29, Jay Rolette  wrote:

> Microseconds matter. Scaling up to 100GbE, nanoseconds matter.
>

True. Is there a cut-off point though? Does one nanosecond matter?

AVX512 will fit a 64-byte packet in one register and move that to or from
memory with one instruction. L1/L2 cache bandwidth per server is growing on
a double-exponential curve (both bandwidth per core and cores per CPU). I
wonder if moving data around in cache will soon be too cheap for us to
justify worrying about.

I suppose that 1500 byte wide registers are still a ways off though ;-)

Cheers!
-Luke (begging your indulgence for wandering off on a tangent)


[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-25 Thread Luke Gorrie
Hi John,

On 19 January 2015 at 02:53,  wrote:

> This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
> It also extends memcpy test coverage with unaligned cases and more test
> points.
>

I am really interested in this work you are doing on memory copies
optimized for packet data. I would like to understand it in more depth. I
have a lot of questions and ideas but let me try to keep it simple for now
:-)

How do you benchmark? where does the "factor of 2-8" cited elsewhere in the
thread come from? how can I reproduce? what results are you seeing compared
with libc?

I did a quick benchmark this weekend based on cachebench. This seems like
a fairly weak benchmark (always L1 cache, always same alignment, always
predictable branches). Do you think this is relevant? How does this compare
with your results?

I compared:
  rte_memcpy (the new optimized one compiled with gcc-4.9 and -march=native
and -O3)
  memcpy from glibc 2.19 (ubuntu 14.04)
  memcpy from glibc 2.20 (arch linux)

on hardware:
  E5-2620v3 (Haswell)
  E5-2650 (Sandy Bridge)

running cachebench like this:

./cachebench -p -e1 -x1 -m14


rte_memcpy.h on Haswell:

Memory Copy Library Cache Test

C Size  Nanosec MB/sec  % Chnge
--- --- --- ---
256     0.01     89191.88    1.00
384     0.01     96505.43    0.92
512     0.01     96509.19    1.00
768     0.01     91475.72    1.06
1024    0.01     96293.82    0.95
1536    0.01     96521.66    1.00
2048    0.01     96522.87    1.00
3072    0.01     96525.53    1.00
4096    0.01     96522.79    1.00
6144    0.01     96507.71    1.00
8192    0.01     94584.41    1.02
12288   0.01     95062.80    0.99
16384   0.01     80493.46    1.18


libc 2.20 on Haswell:

Memory Copy Library Cache Test

C Size  Nanosec MB/sec  % Chnge
--- --- --- ---
256     0.01     65978.64    1.00
384     0.01    100249.01    0.66
512     0.01    123476.55    0.81
768     0.01    144699.86    0.85
1024    0.01    159459.88    0.91
1536    0.01    168001.92    0.95
2048    0.01     80738.31    2.08
3072    0.01     80270.02    1.01
4096    0.01     84239.84    0.95
6144    0.01     90600.13    0.93
8192    0.01     89767.94    1.01
12288   0.01     92085.98    0.97
16384   0.01     92719.95    0.99


libc 2.19 on Haswell:

Memory Copy Library Cache Test

C Size  Nanosec MB/sec  % Chnge
--- --- --- ---
256     0.02     59871.69    1.00
384     0.01     68545.94    0.87
512     0.01     72674.23    0.94
768     0.01     79257.47    0.92
1024    0.01     79740.43    0.99
1536    0.01     85483.67    0.93
2048    0.01     87703.68    0.97
3072    0.01     86685.71    1.01
4096    0.01     87147.84    0.99
6144    0.01     68622.96    1.27
8192    0.01     70591.25    0.97
12288   0.01     72621.28    0.97
16384   0.01     67713.63    1.07


rte_memcpy on Sandy Bridge:

Memory Copy Library Cache Test

C Size Nanosec MB/sec % Chnge
--- --- --- ---
256     0.02     62158.19    1.00
384     0.01     73256.41    0.85
512     0.01     82032.16    0.89
768     0.01     73919.92    1.11
1024    0.01     75937.51    0.97
1536    0.01     78280.20    0.97
2048    0.01     79562.54    0.98
3072    0.01     80800.93    0.98
4096    0.01     81453.71    0.99
6144    0.01     81915.84    0.99
8192    0.01     82427.98    0.99
12288   0.01     82789.82    1.00
16384   0.01     67519.66    1.23



libc 2.20 on Sandy Bridge:

Memory Copy Library Cache Test

C Size Nanosec MB/sec % Chnge
--- --- --- ---
256     0.02     48651.20    1.00
384     0.02     57653.91    0.84
512     0.01     67909.77    0.85
768     0.01     71177.75    0.95
1024    0.01

[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-26 Thread Luke Gorrie
On 26 January 2015 at 02:30, Wang, Zhihong  wrote:

>  Hi Luke,
>
>
>
> I'm very glad that you're interested in this work. :)
>

Great :).

> I never published any performance data, and haven't run cachebench.
>
> We use test_memcpy_perf.c in DPDK to do the test mainly, because it's the
> environment that DPDK runs in. You can also find the performance comparison
> there with glibc.
>
> It can be launched in /app/test: memcpy_perf_autotest.
>

Could you give me a command-line example to run this please? (Sorry if this
should be obvious.)


>   Finally, inline can bring benefits based on practice, constant value
> unrolling for example, and for DPDK we need all possible optimization.
>

Do we need to think about code size and potential instruction cache
thrashing?

For me one call to rte_memcpy compiles to 3520 instructions in 20KB of object
code. That's more than half the size of the Haswell instruction cache
(32KB) per call.

glibc 2.20's memcpy_avx_unaligned is only 909 bytes shared/total and also
seems to have basically excellent performance on Haswell.

So I am concerned about the code size of rte_memcpy, especially when
inlined, and meta-concerned about the nonlinear impact of nested inlined
functions on both compile time and object code size.


There is another issue that I am concerned about:

The Intel Optimization Guide suggests that rep movs is very efficient
starting in Ivy Bridge. In practice though it seems to be much slower than
using vector instructions, even though it is faster than it used to be in
Sandy Bridge. Is that true?

This could have a substantial impact on off-the-shelf memcpy. glibc 2.20's
memcpy uses movs for sizes >= 2048 and that is where performance takes a
dive for me (in microbenchmarks). GCC will also emit inline string move
instructions for certain constant-size memcpy calls at certain optimization
levels.
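For anyone who wants to check this on their own parts: this is the kind of
drop-in I have been swapping into microbenchmarks in place of
memcpy/rte_memcpy (x86-64 GCC/Clang inline asm; a sketch for timing, not a
tuned routine):

    #include <stddef.h>

    /* Copy n bytes with "rep movsb" so the string-move path can be timed
     * directly against vector-based memcpy implementations. */
    static inline void *copy_rep_movsb(void *dst, const void *src, size_t n)
    {
        void *ret = dst;
        asm volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
        return ret;
    }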


So I feel like I haven't yet found the right memcpy for me, and we haven't
even started to look at the interesting parts like cache-coherence
behaviour when sharing data between cores (vhost) and whether streaming
load/store can be used to defend the state of cache lines between cores.


Do I make any sense? What do I miss?


Cheers,
-Luke


[dpdk-dev] [snabb-devel] RE: [PATCH 0/4] DPDK memcpy optimization

2015-01-27 Thread Luke Gorrie
Hi again John,

Thank you for the patient answers :-)

Thank you for pointing this out: I was mistakenly testing your Sandy Bridge
code on Haswell (lacking -DRTE_MACHINE_CPUFLAG_AVX2).

Correcting that, your code is both the fastest and the smallest in my
humble micro benchmarking tests.

Looks like you have done great work! You probably knew that already :-) but
thank you for walking me through it.

The code compiles to 745 bytes of object code (smaller than glibc 2.20
memcpy) and cachebenches like this:

Memory Copy Library Cache Test

C Size  Nanosec MB/sec  % Chnge
--- --- --- ---
256     0.01     97587.60    1.00
384     0.01     97628.83    1.00
512     0.01     97613.95    1.00
768     0.01    147811.44    0.66
1024    0.01    158938.68    0.93
1536    0.01    168487.49    0.94
2048    0.01    174278.83    0.97
3072    0.01    156922.58    1.11
4096    0.01    145811.59    1.08
6144    0.01    157388.27    0.93
8192    0.01    149616.95    1.05
12288   0.01    149064.26    1.00
16384   0.01    107895.06    1.38

The key difference from my perspective is that glibc 2.20 memcpy
performance goes way down for >= 2048 bytes when they switch from vector
moves to string moves, while your code stays consistent.

I will take it for a spin in a real application.

Cheers,
-Luke


[dpdk-dev] SIMD checksum

2015-03-12 Thread Luke Gorrie
Howdy,

I am writing to share some SIMD (SSE2 and AVX2) IP checksum routines. The
commit log for rte_ip.h said that this was an area of future interest for
DPDK.

Code:
https://github.com/lukego/snabbswitch/blob/ipchecksum-simd/src/lib/checksum.c

Feedback welcome. We are currently reviewing and integrating this
ourselves. Great if it is of use to other people too. The performance seems
to be better and I see this as very valuable for reducing variance between
offloaded and non-offloaded configurations. (I would like to replace NIC
offload with SIMD offload completely to simplify the end-user's mental
model but it remains to be seen if I will get away with this.)
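For anyone skimming: the non-SIMD baseline in question is essentially the
classic RFC 1071 ones-complement sum, something like the reference sketch
below (my own illustration, not the SIMD code linked above):

    #include <stdint.h>
    #include <stddef.h>

    /* Serial IP checksum: sum 16-bit big-endian words in ones-complement
     * arithmetic, fold the carries, and return the complement. */
    static uint16_t ip_checksum_ref(const uint8_t *data, size_t len)
    {
        uint64_t sum = 0;
        while (len > 1) {
            sum += (uint16_t)((data[0] << 8) | data[1]);
            data += 2;
            len  -= 2;
        }
        if (len)                     /* odd trailing byte */
            sum += (uint16_t)(data[0] << 8);
        while (sum >> 16)            /* fold carries back in */
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }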

Sorry that this is not sent as a pull request. We use DPDK as a reference
implementation but not as a software dependency. The rest of our
implementation (tests, runtime CPU feature dispatching) would not fit into
the DPDK code base and would have to be ported.

In the perfect universe I would love to see useful routines like this
living in a small repository of their own that everyboy could share. "Small
and stand-alone subroutines for userspace networking." Perhaps we will
collectively break up our big libraries like this over time.

(I do really appreciate the fact that many DPDK library routines are easy
to excerpt. You can see that we are using DPDK's IP checksum as the
non-SIMD fallback, albeit we forked it to fit it into our project.)

Cheers,
-Luke


[dpdk-dev] Beyond DPDK 2.0

2015-05-07 Thread Luke Gorrie
On 7 May 2015 at 16:02, Avi Kivity  wrote:

> One problem we've seen with dpdk is that it is a framework, not a library:
> it wants to create threads, manage memory, and generally take over.  This
> is a problem for us, as we are writing a framework (seastar, [1]) and need
> to create threads, manage memory, and generally take over ourselves.
>

That is also broadly why we don't currently use DPDK in Snabb Switch [1].

There is a bunch of functionality in DPDK that would be tempting for us to
use and contribute back to: device drivers, SIMD routines, data structures,
and so on. I think that we would do this if they were available piecemeal
as stand-alone libi40e, libsimd, liblpm, etc.

The whole DPDK platform/framework is too much for us to adopt though. Some
aspects of it are in conflict with our goals and it is an all-or-nothing
proposition. So for now we are staying self-sufficient even when it means
writing our own ixgbe replacement, etc.

Having said that we are able to share code that doesn't require linking
into our address space e.g. vhost-user and potentially the bifurcated
drivers in the future. That seems like a nice direction for things to be
going in and a way to collaborate even without our directly linking with
DPDK.

[1] https://github.com/lukego/snabbswitch/blob/README/README.md


[dpdk-dev] Beyond DPDK 2.0

2015-05-08 Thread Luke Gorrie
On 8 May 2015 at 06:16, Wiles, Keith  wrote:

> The PMDs or drivers would not be useful without DPDK MBUFS IMO
>

Surprisingly perhaps, I would find them very useful.

To me there are two parts to a driver: the hardware setup and the
transmit/receive.

The hardware setup is complex and generic. You have to read a thousand-page
data sheet and then write code to initialize the hardware, setup queues,
enable promisc/multicast, enable features you want like vmdq or flow
director, and so on. You need to accumulate workarounds for hard-to-test
problems like cards being discovered with unsuitable values in their
EEPROM. There is not much intellectual value in this code being written
more than once.

I would like to see this hardware setup code shared between many projects.
That code does not depend on a specific mbuf struct. Sharing could be done
with an embeddable PMD library, with a bifurcated driver in the kernel,
with the SR-IOV PF/VF model, or surely other ways too. These all have
limited applicability today.

The transmit/receive part, on the other hand, seems very
application-dependent. This part depends on the specific mbuf struct and
the way you are developing your application around it. You will need to
write code to suit your design for using scatter/gather, allowed sizes of
individual buffers, the granularity at which you are keeping track of
checksum validity, how you use TSO/LRO, how you use interrupts, how you
batch work together, and so on. This is easy or hard depending on how
simple or complex the application is.

I am not so interested in sharing this code. I think that different
applications will legitimately have different designs - including mbuf
structs - and they all need code that suits their own design. I think there
is a lot of value in people being creative in these areas and trying
different things.

So while Avi might only mean that he wants to allocate the bytes for his
mbufs himself, on our side we want to design our own mbuf struct. The cost
of that today is to write our own device drivers from scratch but for now
that seems justified. Going forward if there were a simpler mechanism that
reduced our workload and gave us access to more hardware - libixgbe,
libi40e, etc - that would be extremely interesting to us.

I suppose that another background question is whether the DPDK community
are chiefly concerned with advancing DPDK as a platform and a brand or are
broadly keen to develop and share code that is useful in diverse networking
projects. (Is this whole discussion off-topic for dpdk-devel?)

This is one of the many reasons why I would love to use parts of DPDK but
do not want to use all of it. (We also allocate our HugeTLBs differently,
etc, because we have different priorities.)


[dpdk-dev] Beyond DPDK 2.0

2015-05-08 Thread Luke Gorrie
Hi Bruce,

On 8 May 2015 at 11:06, Bruce Richardson  wrote:

> For the Intel NIC drivers, the hardware setup part used in DPDK is based off
> the other Intel drivers for other OS's. The code you are interested in should
> therefore be contained within the subfolders off each individual PMD. As you
> point out below, the mbuf specific part is only present in the files in the
> top-level PMD folder with the DPDK-specific RX/TX and queue setup routines.


Interesting!

How could one embed these Intel drivers (igb, ixgbe, i40e, ...) into new
programs?

If there is documentation, a platform-agnostic master repository, etc, that
would be really interesting.

I have the impression as an outsider that the various incarnations of these
drivers (Linux, FreeBSD, DPDK) are loosely synchronized forks maintained at
considerable effort by each project. If there is actually a common core
that is easy to adopt, I am interested!

(If dpdk-devel is the wrong mailing list for this discussion then perhaps
you could reply with Cc: to a more suitable one and I will subscribe there.)

Cheers,
-Luke


[dpdk-dev] Beyond DPDK 2.0

2015-05-08 Thread Luke Gorrie
On 8 May 2015 at 11:42, Bruce Richardson  wrote:

> The code in those directories is "common" code that is maintained by Intel -
> which is why you see repeated comments about not modifying it for DPDK. It is
> just contained in its own subfolder in each DPDK driver for easier updating
> off the internal Intel baseline.
>

Thanks for pointing this out to me, Bruce. Food for thought.

Cheers,
-Luke


[dpdk-dev] TX performance regression caused by the mbuf cachline split

2015-05-11 Thread Luke Gorrie
Hi Paul,

On 11 May 2015 at 02:14, Paul Emmerich  wrote:

> Another possible solution would be a more dynamic approach to mbufs:


Let me suggest a slightly more extreme idea for your consideration. This
method can easily do > 100 Mpps with one very lightly loaded core. I don't
know if it works for your application or not but I share it just in case.

Background: Load generators are specialist applications and can benefit
from specialist transmit mechanisms.

You can instruct the NIC to send up to 32K packets with one operation: load
the address of a descriptor list into the TDBA register (Transmit
Descriptor Base Address).

The descriptor list is a simple series of 64-bit values: addr0, flags0,
addr1, flags1, ... etc. It is easy to construct by hand.

The NIC can also be made to play the packets in a loop. You just have to
periodically reset the DMA cursor to make all the packets valid again. That
is a simple register poke: TDT = TDH-1.

We do this routinely when we want to generate a large amount of traffic
with few resources, typically when generating load using spare capacity of
a device under test. (I have sample code but it is not based on DPDK.)

If you want all of your packets to be unique then you have to be a bit more
clever. For example you could poll to see the DMA progress: let half the
packets be sent, then rewrite those while the other half are sent, and so
on. Kind of like the way video games tracked the progress of the display
scan beam to update parts of the frame buffer that were not being DMA'd.

This method may impose other limitations that are not acceptable for your
application of course. But if not then it can drastically reduce the number
of instructions and cache footprint required to generate load. You don't
have to touch mbufs or descriptors at all. You just update the payload and
update the DMA register every millisecond or so.
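Roughly, in code (a sketch only: TDBAL/TDBAH/TDLEN/TDH/TDT are the register
names from the 82599 data sheet, but please check the offsets yourself, and
the wr32/rd32 MMIO helpers are my own, not DPDK APIs):

    #include <stdint.h>

    /* TX queue 0 register offsets per the 82599 data sheet (verify!). */
    #define TDBAL0  0x6000   /* TX Descriptor Base Address Low  */
    #define TDBAH0  0x6004   /* TX Descriptor Base Address High */
    #define TDLEN0  0x6008   /* TX Descriptor ring length in bytes */
    #define TDH0    0x6010   /* TX Descriptor Head (hardware cursor) */
    #define TDT0    0x6018   /* TX Descriptor Tail (software cursor) */

    static inline void wr32(volatile uint8_t *bar0, uint32_t reg, uint32_t val)
    {
        *(volatile uint32_t *)(bar0 + reg) = val;
    }

    static inline uint32_t rd32(volatile uint8_t *bar0, uint32_t reg)
    {
        return *(volatile uint32_t *)(bar0 + reg);
    }

    /* Point the NIC at a pre-built descriptor ring (addr/flags pairs) and
     * hand it (almost) the whole ring in one go; tail may not equal head. */
    static void start_replay(volatile uint8_t *bar0, uint64_t ring_phys, uint32_t ndesc)
    {
        wr32(bar0, TDBAL0, (uint32_t)ring_phys);
        wr32(bar0, TDBAH0, (uint32_t)(ring_phys >> 32));
        wr32(bar0, TDLEN0, ndesc * 16);   /* 16 bytes per descriptor */
        wr32(bar0, TDT0, ndesc - 1);
    }

    /* Periodically rewind the tail to just behind the head (TDT = TDH - 1,
     * modulo the ring size) so the same packets become valid again. */
    static void rewind_tail(volatile uint8_t *bar0, uint32_t ndesc)
    {
        uint32_t head = rd32(bar0, TDH0);
        wr32(bar0, TDT0, (head + ndesc - 1) % ndesc);
    }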

Cheers,
-Luke


[dpdk-dev] bifurcated driver

2014-11-24 Thread Luke Gorrie
On 5 November 2014 at 14:00, Thomas Monjalon 
wrote:

> It seems to be close to the bifurcated driver needs.
> Not sure if it can solve the security issues if there is no dedicated MMU
> in the NIC.
>
> I feel we should sum up pros and cons of
> - igb_uio
> - uio_pci_generic
> - VFIO
> - ibverbs
> - bifurcated driver
>

I am also curious about the pros and cons of the bifurcated driver compared
with SR-IOV.

What are the practical differences between running a bifurcated driver vs.
running SR-IOV mode where the kernel owns the PF and userspace applications
own the VFs?

Specifically, could I run the ixgbe driver in the kernel (max_vfs=N),
control it via ethtool, and then access the queues via userspace VF
drivers? If so, how would this differ from the bifurcated driver?

Cheers,
-Luke


[dpdk-dev] [dpdk-announce] DPDK Features for Q1 2015

2014-10-22 Thread Luke Gorrie
Hi Tim,

On 22 October 2014 15:48, O'driscoll, Tim  wrote:

> 2.0 (Q1 2015) DPDK Features:
> Bifurcated Driver: With the Bifurcated Driver, the kernel will retain
> direct control of the NIC, and will assign specific queue pairs to DPDK.
> Configuration of the NIC is controlled by the kernel via ethtool.
>

That sounds awesome and potentially really useful for other people writing
userspace data planes too. If I understand correctly, this way the messy
details can be contained in one place (kernel) and the application (DPDK
PMD or otherwise) will access the NIC TX/RX queue via the ABI defined in
the hardware data sheet.

Single Virtio Driver: Merge existing Virtio drivers into a single
> implementation, incorporating the best features from each of the existing
> drivers.
>

Cool. Do you have a strategy in mind already for zero-copy optimisation
with VMDq? I have seen some patches floating around for this and it's an
area of active interest for myself and others. I see a lot of potential for
making this work more effectively with some modest extensions to Virtio and
guest behaviour, and would love to meet kindred spirits who are thinking
along these lines too.


[dpdk-dev] FW: Vhost user no connection vm2vm

2015-05-22 Thread Luke Gorrie
On 22 May 2015 at 10:05, Maciej Grochowski 
wrote:

> What I'm going to do today is to compile newest kernel for vhost and guest
> and debug where packet flow stuck, I will report the result
>

Compiling the guest virtio-net driver with debug printouts enabled can be
really helpful in these situations too.


[dpdk-dev] [PATCH] vhost: flush used->idx update before reading avail->flags

2015-06-09 Thread Luke Gorrie
On 9 June 2015 at 09:04, Linhaifeng  wrote:

> On 2015/4/24 15:27, Luke Gorrie wrote:
> > You should be able to test it like this:
> >
> > 1. Boot two Linux kernel (e.g. 3.13) guests.
> > 2. Connect them via vhost switch.
> > 3. Run continuous traffic between them (e.g. iperf).
> >
> > I would expect that within a reasonable timeframe (< 1 hour) one of the
> > guests' network interfaces will hang indefinitely due to a missed
> interrupt.
> >
> > You won't be able to reproduce this using DPDK guests because they are
> not
> > using the same interrupt suppression method.
>
> I think this patch can't resolve this problem. On the other hand we still
> would miss interrupts.
>

For what it is worth, we were able to reproduce the problem as described
above with older Snabb Switch releases and we were also able to verify that
inserting a memory barrier fixes this problem.

This is the relevant commit in the snabbswitch repo for reference:
https://github.com/SnabbCo/snabbswitch/commit/c33cdd8704246887e11d7c353f773f7b488a47f2

In a nutshell, we added an MFENCE instruction after writing used->idx and
before checking VRING_F_NO_INTERRUPT.

I have not tested this case under DPDK myself and so I am not really
certain which memory barrier operations are sufficient/insufficient in that
context. I hope that our experience is relevant/helpful though and I am
happy to explain more about that if I have missed any important details.

Cheers,
-Luke


[dpdk-dev] [PATCH] vhost: flush used->idx update before reading avail->flags

2015-06-10 Thread Luke Gorrie
On 9 June 2015 at 10:46, Michael S. Tsirkin  wrote:

> By the way, similarly, host side must re-check avail idx after writing
> used flags. I don't see where snabbswitch does it - is that a bug
> in snabbswitch?


Good question.

Snabb Switch does not use interrupts from the guest. We always set
VRING_F_NO_NOTIFY to tell the guest that it need not interrupt us. Then we
run in poll mode and in practice check the avail ring for new descriptors
every 20us or so.
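In sketch form the device-side loop looks like this (illustrative only, not
the snabbswitch or DPDK code; I am using the vring definitions from
<linux/virtio_ring.h>, which spell the flag VRING_USED_F_NO_NOTIFY):

    #include <stdint.h>
    #include <linux/virtio_ring.h>

    /* Poll-mode discovery of new buffers: guest notifications are suppressed
     * and new descriptors are found purely by re-reading avail->idx. */
    static void poll_avail_once(struct vring *vq, uint16_t *last_avail_idx)
    {
        vq->used->flags |= VRING_USED_F_NO_NOTIFY;   /* "don't kick me" */

        uint16_t avail = vq->avail->idx;             /* plain read, no notification needed */
        while (*last_avail_idx != avail) {
            uint16_t head = vq->avail->ring[*last_avail_idx % vq->num];
            /* ... process the descriptor chain starting at 'head' ... */
            (void)head;
            (*last_avail_idx)++;
        }
    }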

So the argument for not needing this check in both Snabb Switch and DPDK is
that we are running poll mode and don't notice whether interrupts are being
sent or not.

Is that a solid argument or do I misunderstand what the race condition is?

Cheers,
-Luke