On Thu, Oct 01, 2015 at 12:08:07PM +0100, Bruce Richardson wrote:
> On Thu, Oct 01, 2015 at 01:38:37PM +0300, Michael S. Tsirkin wrote:
> > On Thu, Oct 01, 2015 at 12:59:47PM +0300, Avi Kivity wrote:
> > >
> > > On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote:
> > > > On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
> > > > > It's easy to claim that a solution is around the corner, only no one
> > > > > was looking for it, but the reality is that kernel bypass has been a
> > > > > solution for years for high performance users,
> > > >
> > > > I never said that it's trivial.
> > > >
> > > > It's probably a lot of work. It's definitely more work than just
> > > > abusing sysfs.
> > > >
> > > > But it looks like a write system call into an eventfd is about 1.5
> > > > microseconds on my laptop. Even with a system call per packet, system
> > > > call overhead is not what makes DPDK drivers outperform Linux ones.
> > >
> > > 1.5 us = 0.6 Mpps per core limit.
> >
> > Oh, I calculated it incorrectly. It's 0.15 us. So 6 Mpps.
> > But for RX, you can batch a lot of packets.
> >
> > You can see by now I'm not that good at benchmarking.
> > Here's what I wrote:
> >
> > #include <stdbool.h>
> > #include <sys/eventfd.h>
> > #include <inttypes.h>
> > #include <unistd.h>
> >
> > int main(int argc, char **argv)
> > {
> > 	int e = eventfd(0, 0);
> > 	uint64_t v = 1;
> > 	int i;
> >
> > 	for (i = 0; i < 10000000; ++i) {
> > 		write(e, &v, sizeof v);
> > 	}
> > }
> >
> > This takes 1.5 seconds to run on my laptop:
> >
> > $ time ./a.out
> >
> > real	0m1.507s
> > user	0m0.179s
> > sys	0m1.328s
> >
> > > dpdk performance is in the tens of millions of packets per system.
> >
> > I think that's with a bunch of batching though.
> >
> > > It's not just the lack of system calls, of course; the architecture is
> > > completely different.
> >
> > Absolutely - I'm not saying move all of DPDK into the kernel.
> > We just need to protect the RX rings so hardware does not corrupt
> > kernel memory.
> >
> > Thinking about it some more, many devices have separate rings for DMA:
> > TX (device reads memory) and RX (device writes memory).
> > With such devices, a mode where userspace can write the TX ring but
> > not the RX ring might make sense.
> >
> > This will mean userspace might read kernel memory through the device,
> > but cannot corrupt it.
> >
> > That's already a big win!
> >
> > And RX buffers do not have to be added one at a time.
> > If we assume 0.2 usec per system call, batching some 100 buffers per
> > system call gives you 2 nanoseconds of overhead per packet. That seems
> > quite reasonable.
>
> Hi,
>
> just to jump in a bit on this.
>
> Batching of 100 packets is a very large batch, and will add to latency.
This is not on the transmit or receive path! This is only for re-adding
buffers to the receive ring, so this batching should not add latency at
all:

process rx:
	get packet
	packets[n] = alloc packet
	if (++n > 100) {
		system call: add_bufs(packets, n);
	}

> The standard batch size in DPDK right now is 32, and even that may be
> too high for applications in certain domains.
>
> However, even with that 2ns of overhead calculation, I'd make a few
> additional points.
> * For DPDK, we are reasonably close to being able to do 40GB of IO -
> both RX and TX on a single thread. 10GB of IO doesn't really stress a
> core any more. For 40GB of small packet traffic, the packet arrival
> rate is 16.8ns, so even with a huge batch size of 100 packets, your
> system call overhead on RX is taking almost 12% of our processing
> time. For a batch size of 32 this overhead would rise to over 35% of
> our packet processing time.

As I said, yes, measurable, but not breaking the bank, and that's with
40GB, which is still not widespread. With 10GB and 100-packet batches,
the overhead is only 3%.

> For 100G line rate, the packet arrival rate is just 6.7ns...

Hypervisors still have time to get their act together and support
IOMMUs by the time 100G systems become widespread.

> * As well as this overhead from the system call itself, you are also
> omitting the overhead of scanning the RX descriptors.

I omit it because scanning descriptors can still be done in userspace:
just write-protect the RX ring page.

> This in itself is going to use up a good proportion of the processing
> time, as well as that we have to spend cycles copying the descriptors
> from one ring in memory to another. Given that right now with the
> vector ixgbe driver, the cycle cost per packet of RX is just a few
> dozen cycles on modern cores, every additional cycle (fraction of a
> nanosecond) has an impact.
>
> Regards,
> /Bruce

See above. There is no need for that on the data path. Only re-adding
buffers requires a system call.

-- 
MST