On 19.04.2012 22:46, Luigi Rizzo wrote:
On Thu, Apr 19, 2012 at 10:05:37PM +0200, Andre Oppermann wrote:
On 19.04.2012 15:30, Luigi Rizzo wrote:
I have been running some performance tests on UDP sockets,
using the netsend program in tools/tools/netrate/netsend
and instrumenting the source code and the kernel to return at
various points of the path. Here are some results which
I hope you find interesting.
Jumping over very interesting analysis...
- the next expensive operation, consuming another 100ns,
is the mbuf allocation in m_uiotombuf(). Nevertheless, the allocator
seems to scale decently at least with 4 cores. The copyin() is
relatively inexpensive (not reported in the data below, but
disabling it saves only 15-20ns for a short packet).
I have not followed the details, but the allocator calls the zone
allocator, where there is at least one critical_enter()/critical_exit()
pair, and the highly modular architecture invokes long chains of
indirect function calls on both allocation and release.
It might make sense to keep a small pool of mbufs attached to the
socket buffer instead of going to the zone allocator.
Or defer the actual encapsulation to the
(*so->so_proto->pr_usrreqs->pru_send)(), which is called inline anyway.
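Purely as an illustration of that idea (none of these names exist in the
tree; the struct, fields and sb_alloc_mbuf() are invented for the sketch),
a small per-socket-buffer pool consulted before falling back to UMA could
look something like this:

    /*
     * Hypothetical sketch only: a tiny cache of spare mbufs hung off
     * the socket buffer, checked before going to the zone allocator.
     */
    struct sb_mcache {
        struct mbuf     *head;          /* singly linked spare mbufs */
        u_int            count;
    };

    static struct mbuf *
    sb_alloc_mbuf(struct sockbuf *sb, struct sb_mcache *mc)
    {
        struct mbuf *m;

        SOCKBUF_LOCK_ASSERT(sb);        /* caller already holds sb lock */
        if ((m = mc->head) != NULL) {
            mc->head = m->m_next;
            mc->count--;
            m->m_next = NULL;
            return (m);
        }
        return (m_get(M_NOWAIT, MT_DATA));      /* fall back to the zone */
    }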
The UMA mbuf allocator is certainly not perfect, but it is rather good.
It has a per-CPU cache of mbufs that is very fast to allocate from.
Once that cache is exhausted it has to refill from the global pool,
which happens from time to time and may show up in the averages.
Indeed, I was pleased to see no difference between 1 and 4 threads.
This also suggests that the global pool is accessed very seldom,
and only for short times; otherwise you'd see the effect with 4 threads.
Robert did the per-CPU mbuf allocator pools a few years ago.
Excellent engineering.
What might be moderately expensive are the critical_enter()/critical_exit()
calls around individual allocations.
We can't get away from those, as a thread must not migrate to another
CPU while manipulating the per-CPU mbuf pool.
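To illustrate (a simplified shape of a per-CPU cache access, not the actual
uma_zalloc() internals; the struct and function names are invented): the
critical section only pins the thread to its current CPU, so that curcpu,
and the per-CPU bucket it indexes, stay valid for the whole manipulation.
critical_enter()/critical_exit() are much cheaper than taking a mutex, since
they mostly just bump the thread's nesting counter, but they are not free.

    /* Invented per-CPU cache type, for illustration only. */
    struct pcpu_mcache {
        void    *items[32];
        int      count;
    };

    static void *
    pcpu_cache_alloc(struct pcpu_mcache *pc)    /* pc: array indexed by CPU */
    {
        void *item = NULL;

        critical_enter();               /* no preemption/migration from here */
        if (pc[curcpu].count > 0)
            item = pc[curcpu].items[--pc[curcpu].count];
        critical_exit();                /* preemption allowed again */
        return (item);
    }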
The allocation happens while the code already holds an exclusive
lock on so->snd_buf, so a pool of fresh buffers could be attached
there.
Ah, but there it is not necessary to hold the snd_buf lock while
doing the allocate+copyin. With soreceive_stream() (which is
experimental and not enabled by default) I did just that for the
receive path. It's quite a significant gain there.
IMHO it is better to resolve the locking order than to juggle yet
another mbuf sink.
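The ordering described here looks roughly like this on the send side (a
sketch under the assumption that the stock SOCKBUF_LOCK/sbappend_locked()
interfaces are used; this is not the actual soreceive_stream() code):

    /*
     * Sketch: allocate + copyin first, without the socket buffer lock,
     * then take the lock only for the brief append.
     */
    static int
    send_copyin_then_append(struct socket *so, struct uio *uio, int space)
    {
        struct mbuf *m;

        m = m_uiotombuf(uio, M_WAITOK, space, max_hdr, 0);  /* no sb lock held */
        if (m == NULL)
            return (EFAULT);

        SOCKBUF_LOCK(&so->so_snd);
        sbappend_locked(&so->so_snd, m);        /* lock held only to append */
        SOCKBUF_UNLOCK(&so->so_snd);
        return (0);
    }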
But the other consideration is that one could defer the mbuf allocation
to a later time, when the packet is actually built (or at least
right before the thread returns).
What I envision (and this would fit nicely with netmap) is the following:
- have a (possibly read-only) template for the headers (MAC+IP+UDP)
attached to the socket, built on demand, cached and managed
with invalidation rules similar to those used by fastforward;
That would require cross-pointering the rtentry and related structures
again. We want to get away from that to untangle the (locking) mess that
eventually results from it.
- possibly extend the pru_send interface so one can pass down the uio
instead of the mbuf;
- make an opportunistic buffer allocation somewhere downstream,
where the code already holds an exclusive lock on some resource (could be
the snd_buf, the interface, ...), so the allocation comes for free.
ETOOCOMPLEXOVERTIME.
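For concreteness, a purely hypothetical sketch of the first two points above
(a cached header template on the socket, plus a uio-taking send hook); none
of these names exist in the tree:

    /*
     * Hypothetical only.  A pre-built MAC+IP+UDP header template hung
     * off the socket, invalidated by a generation counter much like
     * fastforward invalidates its cached state.
     */
    struct so_hdrtmpl {
        uint32_t        ht_genid;       /* bumped on route/ARP change */
        u_int           ht_len;         /* valid bytes in ht_hdr[] */
        char            ht_hdr[ETHER_HDR_LEN + sizeof(struct ip) +
                            sizeof(struct udphdr)];
    };

    /*
     * Hypothetical companion to pru_send: the uio is passed down so the
     * mbuf allocation can be deferred to a layer that may already hold
     * the locks it needs.
     */
    int (*pru_send_uio)(struct socket *so, int flags, struct uio *uio,
        struct sockaddr *addr, struct thread *td);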
- another big bottleneck is the route lookup in ip_output()
(between entries 51 and 56). Not only does it eat another
100ns+ on an empty routing table, it also
causes huge contention when multiple cores
are involved.
This is indeed a big problem. I'm working (rough edges remain) on
changing the routing table locking to an rmlock (read-mostly), which
should take care of the contention on the read/lookup side.
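For reference, the rmlock read-side pattern looks roughly like this (the
lock name and wrapper functions are illustrative; the read lock needs a
small per-invocation tracker on the reader's stack and stays cheap because
concurrent readers don't bounce a shared cache line):

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/rmlock.h>

    static struct rmlock rt_rmlock;             /* illustrative name */

    static void
    rt_lock_init(void)
    {
        rm_init(&rt_rmlock, "rtable");
    }

    static void
    rt_lookup_example(void)
    {
        struct rm_priotracker tracker;          /* lives on reader's stack */

        rm_rlock(&rt_rmlock, &tracker);
        /* ... radix tree lookup under the read lock ... */
        rm_runlock(&rt_rmlock, &tracker);
    }

    static void
    rt_update_example(void)
    {
        rm_wlock(&rt_rmlock);                   /* writers are exclusive */
        /* ... modify the table ... */
        rm_wunlock(&rt_rmlock);
    }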
I was wondering, is there a way (and/or any advantage) to use the
fastforward code to look up the route for locally sourced packets?
No. The main advantage/difference of fastforward is the short code
path and processing to completion.
--
Andre