Re: SO_BINDTODEVICE or equivalent?

2012-04-19 Thread Svatopluk Kraus
Hi,

Use the IP_RECVIF socket option.

For IP_SENDIF look at
http://lists.freebsd.org/pipermail/freebsd-net/2007-March/013510.html
I used the patch on my embedded FreeBSD 9.0 boxes and it works fine. I
modified it slightly to match 9.0.

Svata

On Thu, Apr 19, 2012 at 7:41 AM, Attila Nagy  wrote:
>
>   Hi,
>   I want to solve the classic problem of a DHCP server: listening for
>   broadcast UDP packets and figuring out what interface a packet has
>   come in.
>   The Linux solution is SO_BINDTODEVICE, which according to socket(7):
>   SO_BINDTODEVICE
>          Bind this socket to a particular device like "eth0", as
>          specified in the passed interface name. If the name is an empty
>          string or the option length is zero, the socket device binding
>          is removed. The passed option is a variable-length
>          null-terminated interface name string with the maximum size of
>          IFNAMSIZ. If a socket is bound to an interface, only packets
>          received from that particular interface are processed by the
>          socket. Note that this only works for some socket types,
>          particularly AF_INET sockets. It is not supported for packet
>          sockets (use normal [1]bind(2) there).
>
>   This makes it possible to listen on selected interfaces for
>   (broadcast) packets. FreeBSD currently doesn't implement this feature.
>   Any chance that somebody will do this?
>   What alternatives would you recommend? Raw packet access (like BPF and
>   raw sockets) ultimately makes the application do more, mostly useless,
>   work.
>   Are there any other solutions that don't require additional packet
>   parsing?
>   Thanks,
>
> References
>
>   1. http://linux.die.net/man/2/bind
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Watchdog timeout em driver 8.2-R

2012-04-19 Thread Lars Wilke

* Jack Vogel wrote:
>  ok then i guess i will upgrade to 8.3-R, is the driver there reasonably
>  new?
>
> Yes, that should be fine.

Thanks, Jack. BTW, i can quite reliably reproduce this issue,
so if you or anybody else is interested in some data i might
be able to get it.

   --lars


Some performance measurements on the FreeBSD network stack

2012-04-19 Thread Luigi Rizzo
I have been running some performance tests on UDP sockets,
using the netsend program in tools/tools/netrate/netsend
and instrumenting the source code and the kernel to return at
various points along the path. Here are some results which
I hope you find interesting.

Test conditions:
- intel i7-870 CPU running at 2.93 GHz + TurboBoost,
  all 4 cores enabled, no hyperthreading
- FreeBSD HEAD as of 15 April 2012, no ipfw, no other
  pfilter clients, no ipv6 or ipsec.
- userspace running 'netsend 10.0.0.2  18 0 5'
  (output to a physical interface, udp port , small
  frame, no rate limitations, 5sec experiments)
- the 'ns' column reports the total time divided by the
  number of successful transmissions; we report the min and
  max over 5 tests
- 1 to 4 parallel tasks, variable packet sizes
- there are variations in the numbers which become
  larger as we reach the bottom of the stack

Caveats:
- in the table below, clock and pktlen are constant.
  I am including the info here so it is easier to compare
  the results with future experiments

- i have a small number of samples, so i am only reporting
  the min and the max in a handful of experiments.

- i am only measuring average values over millions of
  cycles. I have no information on the variance across
  the various executions.

- from what i have seen, numbers vary significantly on
  different systems, depending on memory speed, caches
  and other things. The big jumps are significant and present
  on all systems, but the small deltas (say < 5%) are
  not even statistically significant.

- if someone is interested in replicating the experiments
  email me and i will post a link to a suitable picobsd image.

- i have not yet instrumented the bottom layers (if_output
  and below).

The results show a few interesting things:

- the packet-sending application is reasonably fast
  and certainly not a bottleneck (over 100Mpps before
  calling the system call);

- the system call is somewhat expensive, about 100ns.
  I am not sure where the time is spent (the amd64 code
  does a few pushes on the stack and then runs "syscall",
  followed by a sysret), nor how much room for improvement
  there is in this area.
  The relevant code is in lib/libc/i386/SYS.h and
  lib/libc/i386/sys/syscall.S (KERNCALL translates
  to "syscall" on amd64, and "int 0x80" on the i386)

- the next expensive operation, consuming another 100ns,
  is the mbuf allocation in m_uiotombuf(). Nevertheless, the allocator
  seems to scale decently at least with 4 cores.  The copyin() is
  relatively inexpensive (not reported in the data below, but
  disabling it saves only 15-20ns for a short packet).

  I have not followed the details, but the allocator calls the zone
  allocator and there is at least one critical_enter()/critical_exit()
  pair, and the highly modular architecture invokes long chains of
  indirect function calls both on allocation and release.

  It might make sense to keep a small pool of mbufs attached to the
  socket buffer instead of going to the zone allocator.
  Or defer the actual encapsulation to the
  (*so->so_proto->pr_usrreqs->pru_send)() which is called inline, anyways.

- another big bottleneck is the route lookup in ip_output()
  (between entries 51 and 56). Not only does it eat another
  100ns+ on an empty routing table, but it also
  causes huge contention when multiple cores
  are involved.

There is other bad stuff occurring in if_output() and
below (on this system it takes about 1300ns to send one
packet even with one core, and only 500-550 are consumed
before the call to if_output()) but i don't have
detailed information yet.


POS CPU clock  pktlen  ns/pkt      EXIT POINT
                       min   max
------------------------------------------------------------
U   1   2934 18 88  userspace, before the send() call
  [ syscall ]
20  1   2934 18   103  107  sys_sendto(): begin
20  4   2934 18   104  107

21  1   2934 18   110  113  sendit(): begin
21  4   2934 18   111  116

22  1   2934 18   110  114  sendit() after getsockaddr(&to, ...)
22  4   2934 18   111  124

23  1   2934 18   111  115  sendit() before kern_sendit
23  4   2934 18   112  120

24  1   2934 18   117  120  kern_sendit() after AUDIT_ARG_FD
24  4   2934 18   117  121

25  1   2934 18   134  140  kern_sendit() before sosend()
25  4   2934 18   134  146

40  1   2934 18   144  149  sosend_dgram(): start
40  4   2934 18   144  151

41  1   2934 18   157  166  sosend_dgram() before m_uiotombuf()
41  4   2934 18   157  168
   [ mbuf allocation and copy. The copy is relatively cheap ]
42  1   2934 18   264  268  sosend_dgram() after m_uiotombuf()
42  4   2934 18   265  269

30  1   2934 18   273  276  udp_send() begin
30  4   2934 18   274  278
   [ here we start seeing some contention with multiple threads ]
31  1   2934 18   323  324  udp_output() before ip_output()
31  4   2934 18   344  34

Re: igb(4) Raising IGB_MAX_TXD ??

2012-04-19 Thread John Baldwin
On Wednesday, April 18, 2012 7:40:17 pm Sean Bruno wrote:
> 
> On Wed, 2012-04-18 at 09:49 -0700, Sean Bruno wrote:
> > ok, good.  that at least confirms that I correctly translated between
> > the driver code and documented specification.  
> > 
> > I will try 8k as a test for now and see how that runs.  
> > 
> > sean
> 
> For now, I've patched one front end server with: 
> /usr/src/sys/dev/e1000/if_igb.h:#define IGB_MAX_RXD 4096 * 4
> 
> And adjusted hw.igb.rxd: 8192
> 
> So far so good, been running in production for a couple of hours so the
> "smoke test" for this setting seems to be happy.
> 
> We'll continue to adjust and test tomorrow during higher load
> conditions.

FWIW, at my current employer we run with both rxd and txd cranked up to 32k 
(we had to patch the driver as you suggested) and have not had any problems 
doing that for a couple of years now.

-- 
John Baldwin


Re: igb(4) Raising IGB_MAX_TXD ??

2012-04-19 Thread Jack Vogel
OH, well that's interesting to know, thanks John.

Jack


On Thu, Apr 19, 2012 at 5:22 AM, John Baldwin  wrote:

> On Wednesday, April 18, 2012 7:40:17 pm Sean Bruno wrote:
> >
> > On Wed, 2012-04-18 at 09:49 -0700, Sean Bruno wrote:
> > > ok, good.  that at least confirms that I correctly translated between
> > > the driver code and documented specification.
> > >
> > > I will try 8k as a test for now and see how that runs.
> > >
> > > sean
> >
> > For now, I've patched one front end server with:
> > /usr/src/sys/dev/e1000/if_igb.h:#define IGB_MAX_RXD 4096 * 4
> >
> > And adjusted hw.igb.rxd: 8192
> >
> > So far so good, been running in production for a couple of hours so the
> > "smoke test" for this setting seems to be happy.
> >
> > We'll continue to adjust and test tomorrow during higher load
> > conditions.
>
> FWIW, at my current employer we run with both rxd and txd cranked up to 32k
> (we had to patch the driver as you suggested) and have not had any problems
> doing that for a couple of years now.
>
> --
> John Baldwin
>


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread Slawa Olhovchenkov
On Thu, Apr 19, 2012 at 03:30:18PM +0200, Luigi Rizzo wrote:

> I have been running some performance tests on UDP sockets,
> using the netsend program in tools/tools/netrate/netsend
> and instrumenting the source code and the kernel to return at
> various points along the path. Here are some results which
> I hope you find interesting.

I did some tests back in 2011.
They may no longer be relevant,
or they may still be.

Initial message 
http://lists.freebsd.org/pipermail/freebsd-performance/2011-January/004156.html
UDP socket in FreeBSD 
http://lists.freebsd.org/pipermail/freebsd-performance/2011-February/004176.html
About 4BSD/ULE 
http://lists.freebsd.org/pipermail/freebsd-performance/2011-February/004181.html



Re: igb(4) Raising IGB_MAX_RXD ??

2012-04-19 Thread Sean Bruno
On Thu, 2012-04-19 at 07:09 -0700, Jack Vogel wrote:
> OH, well that's interesting to know, thanks John.
> 
> Jack
> 


Front end box looks pretty happy today at 8k descriptors.

http://people.freebsd.org/~sbruno/igb_8k_stats.txt

Under peak, we're approaching 20MBytes/sec in and out of the
interface.  :-)  Nifty.

-bash-4.2$ netstat 1
input(Total)   output
   packets  errs idrops  bytespackets  errs  bytes colls
 59542 0 0   18189602  59131 0   19884085 0
 58941 0 0   18036651  58673 0   19702671 0
 58790 0 0   18069235  58422 0   19897858 0
 58226 0 0   17948175  57969 0   19648810 0
 58689 0 0   18167855  58479 0   19909843 0
 58633 0 0   17952951  58437 0   19760197 0
 61019 0 0   18779030  60592 0   20394481 0
 56696 0 0   17647407  56552 0   19261155 0
 58853 0 0   18186019  58530 0   19886197 0
 58739 0 0   18314790  58768 0   20165654 0
 58748 0 0   18267243  58539 0   20016668 0
 58672 0 0   17914657  58378 0   19558833 0
 59885 0 0   18332641  59780 0   20239241 0


We're going to crank one server up to 8 igb queues and
hw.igb.max_rxd/txd to 32k and see what blows up.

Sean




Re: igb(4) Raising IGB_MAX_RXD ??

2012-04-19 Thread Jack Vogel
On Thu, Apr 19, 2012 at 12:26 PM, Sean Bruno  wrote:

> On Thu, 2012-04-19 at 07:09 -0700, Jack Vogel wrote:
> > OH, well that's interesting to know, thanks John.
> >
> > Jack
> >
>
>
> Front end box looks pretty happy today at 8k descriptors.
>
> http://people.freebsd.org/~sbruno/igb_8k_stats.txt
>
> Under peak, we're approaching 20MBytes/sec in and out of the
> interface.  :-)  Nifty.
>
> -bash-4.2$ netstat 1
>input(Total)   output
>   packets  errs idrops  bytespackets  errs  bytes colls
> 59542 0 0   18189602  59131 0   19884085 0
> 58941 0 0   18036651  58673 0   19702671 0
> 58790 0 0   18069235  58422 0   19897858 0
> 58226 0 0   17948175  57969 0   19648810 0
> 58689 0 0   18167855  58479 0   19909843 0
> 58633 0 0   17952951  58437 0   19760197 0
> 61019 0 0   18779030  60592 0   20394481 0
> 56696 0 0   17647407  56552 0   19261155 0
> 58853 0 0   18186019  58530 0   19886197 0
> 58739 0 0   18314790  58768 0   20165654 0
> 58748 0 0   18267243  58539 0   20016668 0
> 58672 0 0   17914657  58378 0   19558833 0
> 59885 0 0   18332641  59780 0   20239241 0
>
>
> We're going to crank one server up to 8 igb queues and
> hw.igb.max_rxd/txd to 32k and see what blows up.
>
> Sean
>
>
>
Great, look forward to the results. Thanks Sean.

Jack


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread Andre Oppermann

On 19.04.2012 15:30, Luigi Rizzo wrote:
> I have been running some performance tests on UDP sockets,
> using the netsend program in tools/tools/netrate/netsend
> and instrumenting the source code and the kernel to return at
> various points along the path. Here are some results which
> I hope you find interesting.

Jumping over very interesting analysis...

> - the next expensive operation, consuming another 100ns,
>   is the mbuf allocation in m_uiotombuf(). Nevertheless, the allocator
>   seems to scale decently at least with 4 cores.  The copyin() is
>   relatively inexpensive (not reported in the data below, but
>   disabling it saves only 15-20ns for a short packet).
>
>   I have not followed the details, but the allocator calls the zone
>   allocator and there is at least one critical_enter()/critical_exit()
>   pair, and the highly modular architecture invokes long chains of
>   indirect function calls both on allocation and release.
>
>   It might make sense to keep a small pool of mbufs attached to the
>   socket buffer instead of going to the zone allocator.
>   Or defer the actual encapsulation to the
>   (*so->so_proto->pr_usrreqs->pru_send)() which is called inline, anyways.

The UMA mbuf allocator is certainly not perfect but rather good.
It has a per-CPU cache of mbufs that are very fast to allocate
from.  Once it has used them it needs to refill from the global
pool, which may happen from time to time and show up in the averages.

> - another big bottleneck is the route lookup in ip_output()
>   (between entries 51 and 56). Not only does it eat another
>   100ns+ on an empty routing table, but it also
>   causes huge contention when multiple cores
>   are involved.

This is indeed a big problem.  I'm working (rough edges remain) on
changing the routing table locking to an rmlock (read-mostly), which
doesn't produce any lock contention or cache pollution.  Also skipping
the per-route lock while the table read-lock is held should help some
more.  All in all this should give a massive gain in high-pps situations
at the expense of costlier routing table changes.  However, changes
range from seldom to essentially never with a single default route.

After that the ARP table will get the same treatment, and the low
stack lock contention points should be gone for good.

--
Andre


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread Luigi Rizzo
On Thu, Apr 19, 2012 at 10:05:37PM +0200, Andre Oppermann wrote:
> On 19.04.2012 15:30, Luigi Rizzo wrote:
> >I have been running some performance tests on UDP sockets,
> >using the netsend program in tools/tools/netrate/netsend
> >and instrumenting the source code and the kernel to return at
> >various points along the path. Here are some results which
> >I hope you find interesting.
> 
> Jumping over very interesting analysis...
> 
> >- the next expensive operation, consuming another 100ns,
> >   is the mbuf allocation in m_uiotombuf(). Nevertheless, the allocator
> >   seems to scale decently at least with 4 cores.  The copyin() is
> >   relatively inexpensive (not reported in the data below, but
> >   disabling it saves only 15-20ns for a short packet).
> >
> >   I have not followed the details, but the allocator calls the zone
> >   allocator and there is at least one critical_enter()/critical_exit()
> >   pair, and the highly modular architecture invokes long chains of
> >   indirect function calls both on allocation and release.
> >
> >   It might make sense to keep a small pool of mbufs attached to the
> >   socket buffer instead of going to the zone allocator.
> >   Or defer the actual encapsulation to the
> >   (*so->so_proto->pr_usrreqs->pru_send)() which is called inline, anyways.
> 
> The UMA mbuf allocator is certainly not perfect but rather good.
> It has a per-CPU cache of mbuf's that are very fast to allocate
> from.  Once it has used them it needs to refill from the global
> pool which may happen from time to time and show up in the averages.

indeed i was pleased to see no difference between 1 and 4 threads.
This also suggests that the global pool is accessed very seldom,
and for short times, otherwise you'd see the effect with 4 threads.

What might be moderately expensive are the critical_enter()/critical_exit()
calls around individual allocations.
The allocation happens while the code already holds an exclusive
lock on so->snd_buf, so a pool of fresh buffers could be attached
there.

But the other consideration is that one could defer the mbuf allocation
to a later time when the packet is actually built (or anyways
right before the thread returns).
What i envision (and this would fit nicely with netmap) is the following:
- have a (possibly readonly) template for the headers (MAC+IP+UDP)
  attached to the socket, built on demand, and cached and managed
  with similar invalidation rules as used by fastforward;
- possibly extend the pru_send interface so one can pass down the uio
  instead of the mbuf;
- make an opportunistic buffer allocation in some place downstream,
  where the code already has an x-lock on some resource (could be
  the snd_buf, the interface, ...) so the allocation comes for free.

> >- another big bottleneck is the route lookup in ip_output()
> >   (between entries 51 and 56). Not only does it eat another
> >   100ns+ on an empty routing table, but it also
> >   causes huge contention when multiple cores
> >   are involved.
> 
> This is indeed a big problem.  I'm working (rough edges remain) on
> changing the routing table locking to an rmlock (read-mostly) which

i was wondering, is there a way (and/or any advantage) to use the
fastforward code to look up the route for locally sourced packets ?

cheers
luigi


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread K. Macy
>> This is indeed a big problem.  I'm working (rough edges remain) on
>> changing the routing table locking to an rmlock (read-mostly) which
>

This only helps if your flows aren't hitting the same rtentry.
Otherwise you still convoy on the lock for the rtentry itself to
increment and decrement the rtentry's reference count.

> i was wondering, is there a way (and/or any advantage) to use the
> fastforward code to look up the route for locally sourced packets ?
>

If the number of peers is bounded then you can use the flowtable. Max
PPS is much higher bypassing routing lookup. However, it doesn't scale
to arbitrary flow numbers.


-Kip


Comment nit

2012-04-19 Thread Sean Bruno
I noted a small nit in the comments of sys/dev/e1000/if_igb.h

Index: if_igb.h
===
--- if_igb.h(revision 234466)
+++ if_igb.h(working copy)
@@ -52,7 +52,7 @@
 #define IGB_MAX_TXD4096
 
 /*
- * IGB_RXD: Maximum number of Transmit Descriptors
+ * IGB_RXD: Maximum number of Receive Descriptors
  *
  *   This value is the number of receive descriptors allocated by the driver.
  *   Increasing this value allows the driver to buffer more incoming packets.



Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread Luigi Rizzo
On Thu, Apr 19, 2012 at 10:34:45PM +0200, K. Macy wrote:
> >> This is indeed a big problem.  I'm working (rough edges remain) on
> >> changing the routing table locking to an rmlock (read-mostly) which
> >
> 
> This only helps if your flows aren't hitting the same rtentry.
> Otherwise you still convoy on the lock for the rtentry itself to
> increment and decrement the rtentry's reference count.
> 
> > i was wondering, is there a way (and/or any advantage) to use the
> > fastforward code to look up the route for locally sourced packets ?

actually, now that i look at the code, both ip_output() and
the ip_fastforward code use the same in_rtalloc_ign(...)

> >
> 
> If the number of peers is bounded then you can use the flowtable. Max
> PPS is much higher bypassing routing lookup. However, it doesn't scale
> to arbitrary flow numbers.

re. flowtable, could you point me to what i should do instead of
calling in_rtalloc_ign() ?

cheers
luigi


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread K. Macy
On Thu, Apr 19, 2012 at 11:22 PM, Luigi Rizzo  wrote:
> On Thu, Apr 19, 2012 at 10:34:45PM +0200, K. Macy wrote:
>> >> This is indeed a big problem.  I'm working (rough edges remain) on
>> >> changing the routing table locking to an rmlock (read-mostly) which
>> >
>>
>> This only helps if your flows aren't hitting the same rtentry.
>> Otherwise you still convoy on the lock for the rtentry itself to
>> increment and decrement the rtentry's reference count.
>>
>> > i was wondering, is there a way (and/or any advantage) to use the
>> > fastforward code to look up the route for locally sourced packets ?
>
> actually, now that i look at the code, both ip_output() and
> the ip_fastforward code use the same in_rtalloc_ign(...)
>
>> >
>>
>> If the number of peers is bounded then you can use the flowtable. Max
>> PPS is much higher bypassing routing lookup. However, it doesn't scale
>> to arbitrary flow numbers.
>
> re. flowtable, could you point me to what i should do instead of
> calling in_rtalloc_ign() ?

If you build with it in your kernel config and enable the sysctl,
ip_output will automatically use it for TCP and UDP connections. If
you're doing forwarding you'll need to patch the forwarding path.
Fabien Thomas has a patch for that; I just identified and fixed a bug
in it for him.

-Kip


-- 
   “The real damage is done by those millions who want to 'get by.'
The ordinary men who just want to be left in peace. Those who don’t
want their little lives disturbed by anything bigger than themselves.
Those with no sides and no causes. Those who won’t take measure of
their own strength, for fear of antagonizing their own weakness. Those
who don’t like to make waves—or enemies.

   Those for whom freedom, honour, truth, and principles are only
literature. Those who live small, love small, die small. It’s the
reductionist approach to life: if you keep it small, you’ll keep it
under control. If you don’t make any noise, the bogeyman won’t find
you.

   But it’s all an illusion, because they die too, those people who
roll up their spirits into tiny little balls so as to be safe. Safe?!
From what? Life is always on the edge of death; narrow streets lead to
the same place as wide avenues, and a little candle burns itself out
just like a flaming torch does.

   I choose my own way to burn.”

   Sophie Scholl


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread Andre Oppermann

On 19.04.2012 22:34, K. Macy wrote:
>>> This is indeed a big problem.  I'm working (rough edges remain) on
>>> changing the routing table locking to an rmlock (read-mostly) which
>
> This only helps if your flows aren't hitting the same rtentry.
> Otherwise you still convoy on the lock for the rtentry itself to
> increment and decrement the rtentry's reference count.

The rtentry lock isn't obtained anymore.  While the rmlock read
lock is held on the rtable, the relevant information, like the ifp
and such, is copied out.  No later referencing is possible.  In the
end any referencing of an rtentry would be forbidden and the rtentry
lock can be removed.  The second step can be optional, though.

>> i was wondering, is there a way (and/or any advantage) to use the
>> fastforward code to look up the route for locally sourced packets ?
>
> If the number of peers is bounded then you can use the flowtable. Max
> PPS is much higher bypassing routing lookup. However, it doesn't scale
> to arbitrary flow numbers.

In theory a rmlock-only lookup into a default-route-only routing
table would be faster than creating a flow table entry for every
destination.  It's a matter of churn, though.  The flowtable isn't
lockless in itself, is it?

--
Andre


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread K. Macy
>> This only helps if your flows aren't hitting the same rtentry.
>> Otherwise you still convoy on the lock for the rtentry itself to
>> increment and decrement the rtentry's reference count.
>
>
> The rtentry lock isn't obtained anymore.  While the rmlock read
> lock is held on the rtable the relevant information like ifp and
> such is copied out.  No later referencing possible.  In the end
> any referencing of an rtentry would be forbidden and the rtentry
> lock can be removed.  The second step can be optional though.

Can you point me to a tree where you've made these changes?

>>> i was wondering, is there a way (and/or any advantage) to use the
>>> fastforward code to look up the route for locally sourced packets ?
>>>
>>
>> If the number of peers is bounded then you can use the flowtable. Max
>> PPS is much higher bypassing routing lookup. However, it doesn't scale
>> to arbitrary flow numbers.
>
>
> In theory a rmlock-only lookup into a default-route only routing
> table would be faster than creating a flow table entry for every
> destination.  It a matter of churn though.  The flowtable isn't
> lockless in itself, is it?

It is. In a steady state, where the working set of peers fits in the
table, it should be just a simple hash of the IP and then a lookup.
-Kip


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread Andre Oppermann

On 19.04.2012 22:46, Luigi Rizzo wrote:

On Thu, Apr 19, 2012 at 10:05:37PM +0200, Andre Oppermann wrote:

On 19.04.2012 15:30, Luigi Rizzo wrote:

I have been running some performance tests on UDP sockets,
using the netsend program in tools/tools/netrate/netsend
and instrumenting the source code and the kernel do return in
various points of the path. Here are some results which
I hope you find interesting.


Jumping over very interesting analysis...


- the next expensive operation, consuming another 100ns,
   is the mbuf allocation in m_uiotombuf(). Nevertheless, the allocator
   seems to scale decently at least with 4 cores.  The copyin() is
   relatively inexpensive (not reported in the data below, but
   disabling it saves only 15-20ns for a short packet).

   I have not followed the details, but the allocator calls the zone
   allocator and there is at least one critical_enter()/critical_exit()
   pair, and the highly modular architecture invokes long chains of
   indirect function calls both on allocation and release.

   It might make sense to keep a small pool of mbufs attached to the
   socket buffer instead of going to the zone allocator.
   Or defer the actual encapsulation to the
   (*so->so_proto->pr_usrreqs->pru_send)() which is called inline, anyways.


The UMA mbuf allocator is certainly not perfect but rather good.
It has a per-CPU cache of mbuf's that are very fast to allocate
from.  Once it has used them it needs to refill from the global
pool which may happen from time to time and show up in the averages.


indeed i was pleased to see no difference between 1 and 4 threads.
This also suggests that the global pool is accessed very seldom,
and for short times, otherwise you'd see the effect with 4 threads.


Robert did the per-CPU mbuf allocator pools a few years ago.
Excellent engineering.


What might be moderately expensive are the critical_enter()/critical_exit()
calls around individual allocations.


Can't get away from those as a thread must not migrate away
when manipulating the per-CPU mbuf pool.


The allocation happens while the code has already an exclusive
lock on so->snd_buf so a pool of fresh buffers could be attached
there.


Ah, there it is not necessary to hold the snd_buf lock while
doing the allocate+copyin.  With soreceive_stream() (which is
experimental not enabled by default) I did just that for the
receive path.  It's quite a significant gain there.

IMHO better resolve the locking order than to juggle yet another
mbuf sink.


But the other consideration is that one could defer the mbuf allocation
to a later time when the packet is actually built (or anyways
right before the thread returns).
What i envision (and this would fit nicely with netmap) is the following:
- have a (possibly readonly) template for the headers (MAC+IP+UDP)
   attached to the socket, built on demand, and cached and managed
   with similar invalidation rules as used by fastforward;


That would require cross-pointering the rtentry and whatnot again.
We want to get away from that to untangle the (locking) mess that
eventually results from it.


- possibly extend the pru_send interface so one can pass down the uio
   instead of the mbuf;
- make an opportunistic buffer allocation in some place downstream,
   where the code already has an x-lock on some resource (could be
   the snd_buf, the interface, ...) so the allocation comes for free.


ETOOCOMPLEXOVERTIME.


- another big bottleneck is the route lookup in ip_output()
   (between entries 51 and 56). Not only does it eat another
   100ns+ on an empty routing table, it also causes huge
   contention when multiple cores are involved.


This is indeed a big problem.  I'm working (rough edges remain) on
changing the routing table locking to an rmlock (read-mostly) which


i was wondering, is there a way (and/or any advantage) to use the
fastforward code to look up the route for locally sourced packets ?


No.  The main advantage/difference of fastforward is the short code
path and processing to completion.

--
Andre
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread Andre Oppermann

On 19.04.2012 23:17, K. Macy wrote:

This only helps if your flows aren't hitting the same rtentry.
Otherwise you still convoy on the lock for the rtentry itself to
increment and decrement the rtentry's reference count.



The rtentry lock isn't obtained anymore.  While the rmlock read
lock is held on the rtable the relevant information like ifp and
such is copied out.  No later referencing possible.  In the end
any referencing of an rtentry would be forbidden and the rtentry
lock can be removed.  The second step can be optional though.


Can you point me to a tree where you've made these changes?


It's not in a public tree.  I just did a 'svn up' and the recent
pf and rtsocket changes created some conflicts.  Have to solve
them before posting.  Timeframe (early) next week.


i was wondering, is there a way (and/or any advantage) to use the
fastforward code to look up the route for locally sourced packets ?



If the number of peers is bounded then you can use the flowtable. Max
PPS is much higher bypassing routing lookup. However, it doesn't scale
to arbitrary flow numbers.



In theory a rmlock-only lookup into a default-route only routing
table would be faster than creating a flow table entry for every
destination.  It's a matter of churn though.  The flowtable isn't
lockless in itself, is it?


It is. In a steady state where the working set of peers fits in the
table it should be just a simple hash of the ip and then a lookup.


Yes, but the lookup requires a lock?  Or is every entry replicated
to every CPU?  So a number of concurrent CPUs sending to the same
UDP destination would contend on that lock?

--
Andre


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread K. Macy
>
> Yes, but the lookup requires a lock?  Or is every entry replicated
> to every CPU?  So a number of concurrent CPUs sending to the same
> UDP destination would contend on that lock?

No. In the default case it's per CPU, thus no serialization is
required. But yes, if your transmitting thread manages to bounce to
every core during send within the flow expiration window you'll have
an extra 12 or however many bytes per peer times the number of cores.
There is usually a fair amount of CPU affinity over a given unit time.


-- 
   “The real damage is done by those millions who want to 'get by.'
The ordinary men who just want to be left in peace. Those who don’t
want their little lives disturbed by anything bigger than themselves.
Those with no sides and no causes. Those who won’t take measure of
their own strength, for fear of antagonizing their own weakness. Those
who don’t like to make waves—or enemies.

   Those for whom freedom, honour, truth, and principles are only
literature. Those who live small, love small, die small. It’s the
reductionist approach to life: if you keep it small, you’ll keep it
under control. If you don’t make any noise, the bogeyman won’t find
you.

   But it’s all an illusion, because they die too, those people who
roll up their spirits into tiny little balls so as to be safe. Safe?!
From what? Life is always on the edge of death; narrow streets lead to
the same place as wide avenues, and a little candle burns itself out
just like a flaming torch does.

   I choose my own way to burn.”

   Sophie Scholl


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread K. Macy
On Thu, Apr 19, 2012 at 11:27 PM, Andre Oppermann  wrote:
> On 19.04.2012 23:17, K. Macy wrote:

 This only helps if your flows aren't hitting the same rtentry.
 Otherwise you still convoy on the lock for the rtentry itself to
 increment and decrement the rtentry's reference count.
>>>
>>>
>>>
>>> The rtentry lock isn't obtained anymore.  While the rmlock read
>>> lock is held on the rtable the relevant information like ifp and
>>> such is copied out.  No later referencing possible.  In the end
>>> any referencing of an rtentry would be forbidden and the rtentry
>>> lock can be removed.  The second step can be optional though.
>>
>>
>> Can you point me to a tree where you've made these changes?
>
>
> It's not in a public tree.  I just did a 'svn up' and the recent
> pf and rtsocket changes created some conflicts.  Have to solve
> them before posting.  Timeframe (early) next week.
>
>

Ok. Keep us posted.

Thanks,
Kip





Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread Luigi Rizzo
On Thu, Apr 19, 2012 at 11:20:00PM +0200, Andre Oppermann wrote:
> On 19.04.2012 22:46, Luigi Rizzo wrote:
...
> >What might be moderately expensive are the critical_enter()/critical_exit()
> >calls around individual allocations.
> 
> Can't get away from those as a thread must not migrate away
> when manipulating the per-CPU mbuf pool.

i understand.

> >The allocation happens while the code has already an exclusive
> >lock on so->snd_buf so a pool of fresh buffers could be attached
> >there.
> 
> Ah, there it is not necessary to hold the snd_buf lock while
> doing the allocate+copyin.  With soreceive_stream() (which is

it is not held in the tx path either -- but there is a short section
before m_uiotombuf() which does

...
SOCKBUF_LOCK(&so->so_snd);
// check for pending errors, sbspace, so_state
SOCKBUF_UNLOCK(&so->so_snd);
...

(some of this is slightly dubious, but that's another story)

> >But the other consideration is that one could defer the mbuf allocation
> >to a later time when the packet is actually built (or anyways
> >right before the thread returns).
> >What i envision (and this would fit nicely with netmap) is the following:
> >- have a (possibly readonly) template for the headers (MAC+IP+UDP)
> >   attached to the socket, built on demand, and cached and managed
> >   with similar invalidation rules as used by fastforward;
> 
> That would require to cross-pointer the rtentry and whatnot again.

i was planning to keep a copy, not a reference. If the copy becomes
temporarily stale, no big deal, as long as you can detect it reasonably
quickly -- routes are not guaranteed to be correct anyway.

> >- possibly extend the pru_send interface so one can pass down the uio
> >   instead of the mbuf;
> >- make an opportunistic buffer allocation in some place downstream,
> >   where the code already has an x-lock on some resource (could be
> >   the snd_buf, the interface, ...) so the allocation comes for free.
> 
> ETOOCOMPLEXOVERTIME.

maybe. But i want to investigate this.

cheers
luigi


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread Andre Oppermann

On 20.04.2012 00:03, Luigi Rizzo wrote:

On Thu, Apr 19, 2012 at 11:20:00PM +0200, Andre Oppermann wrote:

On 19.04.2012 22:46, Luigi Rizzo wrote:

The allocation happens while the code has already an exclusive
lock on so->snd_buf so a pool of fresh buffers could be attached
there.


Ah, there it is not necessary to hold the snd_buf lock while
doing the allocate+copyin.  With soreceive_stream() (which is


it is not held in the tx path either -- but there is a short section
before m_uiotombuf() which does

...
SOCKBUF_LOCK(&so->so_snd);
// check for pending errors, sbspace, so_state
SOCKBUF_UNLOCK(&so->so_snd);
...

(some of this is slightly dubious, but that's another story)


Indeed the lock isn't held across the m_uiotombuf().  You're talking
about filling a sockbuf mbuf cache while holding the lock?


But the other consideration is that one could defer the mbuf allocation
to a later time when the packet is actually built (or anyways
right before the thread returns).
What i envision (and this would fit nicely with netmap) is the following:
- have a (possibly readonly) template for the headers (MAC+IP+UDP)
   attached to the socket, built on demand, and cached and managed
   with similar invalidation rules as used by fastforward;


That would require to cross-pointer the rtentry and whatnot again.


i was planning to keep a copy, not a reference. If the copy becomes
temporarily stale, no big deal, as long as you can detect it reasonably
quiclky -- routes are not guaranteed to be correct, anyways.


Be wary of disappearing interface pointers...


- possibly extend the pru_send interface so one can pass down the uio
   instead of the mbuf;
- make an opportunistic buffer allocation in some place downstream,
   where the code already has an x-lock on some resource (could be
   the snd_buf, the interface, ...) so the allocation comes for free.


ETOOCOMPLEXOVERTIME.


maybe. But i want to investigate this.


I fail to see what passing down the uio would gain you.  The snd_buf lock
isn't obtained again after the copyin.  Not that I want to prevent you
from investigating other ways. ;)

--
Andre


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread Luigi Rizzo
On Fri, Apr 20, 2012 at 12:37:21AM +0200, Andre Oppermann wrote:
> On 20.04.2012 00:03, Luigi Rizzo wrote:
> >On Thu, Apr 19, 2012 at 11:20:00PM +0200, Andre Oppermann wrote:
> >>On 19.04.2012 22:46, Luigi Rizzo wrote:
> >>>The allocation happens while the code has already an exclusive
> >>>lock on so->snd_buf so a pool of fresh buffers could be attached
> >>>there.
> >>
> >>Ah, there it is not necessary to hold the snd_buf lock while
> >>doing the allocate+copyin.  With soreceive_stream() (which is
> >
> >it is not held in the tx path either -- but there is a short section
> >before m_uiotombuf() which does
> >
> > ...
> > SOCKBUF_LOCK(&so->so_snd);
> > // check for pending errors, sbspace, so_state
> > SOCKBUF_UNLOCK(&so->so_snd);
> > ...
> >
> >(some of this is slightly dubious, but that's another story)
> 
> Indeed the lock isn't held across the m_uiotombuf().  You're talking
> about filling an sockbuf mbuf cache while holding the lock?

all i am thinking is that when we have a serialization point we
could use it for multiple related purposes. In this case yes we
could keep a small mbuf cache attached to so_snd. When the cache
is empty either get a new batch (say 10-20 bufs) from the zone
allocator, possibly dropping and regaining the lock if the so_snd
lock must remain a leaf.  Besides, for protocols like TCP (does it use
the same path ?) the mbufs are already there (released by incoming acks)
in the steady state, so it is not even necessary to refill the
cache.

This said, i am not 100% sure that the 100ns I am seeing are all
spent in the zone allocator.  As i said the chain of indirect calls
and other ops is rather long on both acquire and release.

> >>>But the other consideration is that one could defer the mbuf allocation
> >>>to a later time when the packet is actually built (or anyways
> >>>right before the thread returns).
> >>>What i envision (and this would fit nicely with netmap) is the following:
> >>>- have a (possibly readonly) template for the headers (MAC+IP+UDP)
> >>>   attached to the socket, built on demand, and cached and managed
> >>>   with similar invalidation rules as used by fastforward;
> >>
> >>That would require to cross-pointer the rtentry and whatnot again.
> >
> >i was planning to keep a copy, not a reference. If the copy becomes
> >temporarily stale, no big deal, as long as you can detect it reasonably
> >quiclky -- routes are not guaranteed to be correct, anyways.
> 
> Be wary of disappearing interface pointers...

(this reminds me, what prevents a route grabbed from the flowtable
from disappearing and releasing the ifp reference ?)

In any case, it seems better to keep a more persistent ifp reference
in the socket rather than grab and release one on every single
packet transmission.

> >>>- possibly extend the pru_send interface so one can pass down the uio
> >>>   instead of the mbuf;
> >>>- make an opportunistic buffer allocation in some place downstream,
> >>>   where the code already has an x-lock on some resource (could be
> >>>   the snd_buf, the interface, ...) so the allocation comes for free.
> >>
> >>ETOOCOMPLEXOVERTIME.
> >
> >maybe. But i want to investigate this.
> 
> I fail see what passing down the uio would gain you.  The snd_buf lock
> isn't obtained again after the copyin.  Not that I want to prevent you
> from investigating other ways. ;)

maybe it can open the way to other optimizations, such as reducing
the number of places where you need to lock, or save some data
copies, or reduce fragmentation, etc.

cheers
luigi


Question about fixing udp6_input...

2012-04-19 Thread George Neville-Neil
Howdy,

At the moment the prototype for udp6_input() is the following:

int
udp6_input(struct mbuf **mp, int *offp, int proto)

and udp_input() looks like this:

void
udp_input(struct mbuf *m, int off)

As far as I can tell we immediately change **mp to *m and *offp to off
in udp6_input() and we also never use proto in the rest of the function.

Is there any reason to not make udp6_input() look exactly like udp_input() ?

Best,
George



Re: Question about fixing udp6_input...

2012-04-19 Thread Bjoern A. Zeeb

On 20. Apr 2012, at 01:44 , George Neville-Neil wrote:

> Howdy,
> 
> At the moment the prototype for udp6_input() is the following:
> 
> int
> udp6_input(struct mbuf **mp, int *offp, int proto)
> 
> and udp_input() looks like this:
> 
> void
> udp_input(struct mbuf *m, int off)
> 
> As far as I can tell we immediately change **mp to *m and *offp to off
> in udp6_input() and we also never use proto in the rest of the function.
> 
> Is there any reason to not make udp6_input() look exactly like udp_input() ?

I think the answer to this is here:

http://wiki.freebsd.org/IPv6TODO#Remove_ip6protosw

-- 
Bjoern A. Zeeb You have to have visions!
   It does not matter how good you are. It matters what good you do!
