Re: SO_BINDTODEVICE or equivalent?
Hi,

Use the IP_RECVIF option. For IP_SENDIF, look at
http://lists.freebsd.org/pipermail/freebsd-net/2007-March/013510.html
I used the patch on my embedded FreeBSD 9.0 boxes and it works fine. I
modified it slightly to match 9.0.

Svata

On Thu, Apr 19, 2012 at 7:41 AM, Attila Nagy wrote:
>
> Hi,
> I want to solve the classic problem of a DHCP server: listening for
> broadcast UDP packets and figuring out which interface a packet came
> in on.
> The Linux solution is SO_BINDTODEVICE, which according to socket(7):
>
>     SO_BINDTODEVICE
>         Bind this socket to a particular device like "eth0", as
>         specified in the passed interface name. If the name is an empty
>         string or the option length is zero, the socket device binding
>         is removed. The passed option is a variable-length
>         null-terminated interface name string with the maximum size of
>         IFNAMSIZ. If a socket is bound to an interface, only packets
>         received from that particular interface are processed by the
>         socket. Note that this only works for some socket types,
>         particularly AF_INET sockets. It is not supported for packet
>         sockets (use normal [1]bind(2) there).
>
> This makes it possible to listen on selected interfaces for
> (broadcast) packets. FreeBSD currently doesn't implement this feature.
> Any chance that somebody will do this?
> What alternatives would you recommend? Raw packet access (BPF and
> raw sockets) forces the application to do more, mostly useless, work.
> Are there any other solutions which don't require additional packet
> parsing?
> Thanks,
>
> References
>
> 1. http://linux.die.net/man/2/bind

___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: Watchdog timeout em driver 8.2-R
* Jack Vogel wrote:
> > ok then i guess i will upgrade to 8.3-R, is the driver there reasonably
> > new?
>
> Yes, that should be fine.
>
> Jack

thanks, btw. i can quite reliably reproduce this issue. So if you or
anybody else is interested in some data i might be able to get it.

--lars
Some performance measurements on the FreeBSD network stack
I have been running some performance tests on UDP sockets, using the
netsend program in tools/tools/netrate/netsend and instrumenting the
source code and the kernel to return at various points of the path.
Here are some results which I hope you find interesting.

Test conditions:
- intel i7-870 CPU running at 2.93 GHz + TurboBoost, all 4 cores
  enabled, no hyperthreading
- FreeBSD HEAD as of 15 april 2012, no ipfw, no other pfilter clients,
  no ipv6 or ipsec.
- userspace running 'netsend 10.0.0.2 18 0 5' (output to a physical
  interface, udp port , small frame, no rate limitations, 5 sec
  experiments)
- the 'ns' column reports the total time divided by the number of
  successful transmissions; we report the min and max over 5 tests
- 1 to 4 parallel tasks, variable packet sizes
- there are variations in the numbers which become larger as we reach
  the bottom of the stack

Caveats:
- in the table below, clock and pktlen are constant. I am including the
  info here so it is easier to compare the results with future
  experiments.
- i have a small number of samples, so i am only reporting the min and
  the max over a handful of experiments.
- i am only measuring average values over millions of cycles. I have no
  info on the variance between the various executions.
- from what i have seen, numbers vary significantly on different
  systems, depending on memory speed, caches and other things. The big
  jumps are significant and present on all systems, but the small
  deltas (say < 5%) are not even statistically significant.
- if someone is interested in replicating the experiments, email me and
  i will post a link to a suitable picobsd image.
- i have not yet instrumented the bottom layers (if_output and below).

The results show a few interesting things:
- the packet-sending application is reasonably fast and certainly not a
  bottleneck (over 100 Mpps before calling the system call);
- the system call is somewhat expensive, about 100ns.
  I am not sure where the time is spent (the amd64 code does a few
  pushes on the stack and then runs "syscall", followed by a sysret).
  I am not sure how much room for improvement there is in this area.
  The relevant code is in lib/libc/i386/SYS.h and
  lib/libc/i386/sys/syscall.S (KERNCALL translates to "syscall" on
  amd64, and "int 0x80" on the i386).

- the next expensive operation, consuming another 100ns, is the mbuf
  allocation in m_uiotombuf(). Nevertheless, the allocator seems to
  scale decently at least with 4 cores. The copyin() is relatively
  inexpensive (not reported in the data below, but disabling it saves
  only 15-20ns for a short packet).

  I have not followed the details, but the allocator calls the zone
  allocator and there is at least one critical_enter()/critical_exit()
  pair, and the highly modular architecture invokes long chains of
  indirect function calls both on allocation and release.

  It might make sense to keep a small pool of mbufs attached to the
  socket buffer instead of going to the zone allocator. Or defer the
  actual encapsulation to the (*so->so_proto->pr_usrreqs->pru_send)(),
  which is called inline anyways.

- another big bottleneck is the route lookup in ip_output() (between
  entries 51 and 56). Not only does it eat another 100ns+ on an empty
  routing table, but it also causes huge contention when multiple cores
  are involved.

There is other bad stuff occurring in if_output() and below (on this
system it takes about 1300ns to send one packet even with one core, and
only 500-550 are consumed before the call to if_output()) but i don't
have detailed information yet.

POS CPU clock pktlen  ns/pkt      EXIT POINT
                     min   max
----------------------------------------------------------------------
 U   1  2934   18     88          userspace, before the send() call
[ syscall ]
20   1  2934   18    103   107    sys_sendto(): begin
20   4  2934   18    104   107
21   1  2934   18    110   113    sendit(): begin
21   4  2934   18    111   116
22   1  2934   18    110   114    sendit() after getsockaddr(&to, ...)
22   4  2934   18    111   124
23   1  2934   18    111   115    sendit() before kern_sendit
23   4  2934   18    112   120
24   1  2934   18    117   120    kern_sendit() after AUDIT_ARG_FD
24   4  2934   18    117   121
25   1  2934   18    134   140    kern_sendit() before sosend()
25   4  2934   18    134   146
40   1  2934   18    144   149    sosend_dgram(): start
40   4  2934   18    144   151
41   1  2934   18    157   166    sosend_dgram() before m_uiotombuf()
41   4  2934   18    157   168
[ mbuf allocation and copy. The copy is relatively cheap ]
42   1  2934   18    264   268    sosend_dgram() after m_uiotombuf()
42   4  2934   18    265   269
30   1  2934   18    273   276    udp_send() begin
30   4  2934   18    274   278
[ here we start seeing some contention with multiple threads ]
31   1  2934   18    323   324    udp_output() before ip_output()
31   4  2934   18    344   34
Re: igb(4) Raising IGB_MAX_TXD ??
On Wednesday, April 18, 2012 7:40:17 pm Sean Bruno wrote:
> On Wed, 2012-04-18 at 09:49 -0700, Sean Bruno wrote:
> > ok, good. that at least confirms that I correctly translated between
> > the driver code and documented specification.
> >
> > I will try 8k as a test for now and see how that runs.
> >
> > sean
>
> For now, I've patched one front end server with:
> /usr/src/sys/dev/e1000/if_igb.h:#define IGB_MAX_RXD 4096 * 4
>
> And adjusted hw.igb.rxd: 8192
>
> So far so good, been running in production for a couple of hours so the
> "smoke test" for this setting seems to be happy.
>
> We'll continue to adjust and test tomorrow during higher load
> conditions.

FWIW, at my current employer we run with both rxd and txd cranked up to
32k (we had to patch the driver as you suggested) and have not had any
problems doing that for a couple of years now.

-- 
John Baldwin
Re: igb(4) Raising IGB_MAX_TXD ??
OH, well that's interesting to know, thanks John.

Jack

On Thu, Apr 19, 2012 at 5:22 AM, John Baldwin wrote:
> On Wednesday, April 18, 2012 7:40:17 pm Sean Bruno wrote:
> > On Wed, 2012-04-18 at 09:49 -0700, Sean Bruno wrote:
> > > ok, good. that at least confirms that I correctly translated between
> > > the driver code and documented specification.
> > >
> > > I will try 8k as a test for now and see how that runs.
> > >
> > > sean
> >
> > For now, I've patched one front end server with:
> > /usr/src/sys/dev/e1000/if_igb.h:#define IGB_MAX_RXD 4096 * 4
> >
> > And adjusted hw.igb.rxd: 8192
> >
> > So far so good, been running in production for a couple of hours so the
> > "smoke test" for this setting seems to be happy.
> >
> > We'll continue to adjust and test tomorrow during higher load
> > conditions.
>
> FWIW, at my current employer we run with both rxd and txd cranked up to 32k
> (we had to patch the driver as you suggested) and have not had any problems
> doing that for a couple of years now.
>
> --
> John Baldwin
Re: Some performance measurements on the FreeBSD network stack
On Thu, Apr 19, 2012 at 03:30:18PM +0200, Luigi Rizzo wrote:
> I have been running some performance tests on UDP sockets,
> using the netsend program in tools/tools/netrate/netsend
> and instrumenting the source code and the kernel to return at
> various points of the path. Here are some results which
> I hope you find interesting.

I did some tests in 2011. They may or may not still be current.

Initial message
http://lists.freebsd.org/pipermail/freebsd-performance/2011-January/004156.html

UDP socket in FreeBSD
http://lists.freebsd.org/pipermail/freebsd-performance/2011-February/004176.html

About 4BSD/ULE
http://lists.freebsd.org/pipermail/freebsd-performance/2011-February/004181.html
Re: igb(4) Raising IGB_MAX_RXD ??
On Thu, 2012-04-19 at 07:09 -0700, Jack Vogel wrote:
> OH, well that's interesting to know, thanks John.
>
> Jack

Front end box looks pretty happy today at 8k descriptors.

http://people.freebsd.org/~sbruno/igb_8k_stats.txt

Under peak, we're approaching 20 MBytes/sec in and out of the
interface. :-) Nifty.

-bash-4.2$ netstat 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
     59542     0      0   18189602      59131     0   19884085     0
     58941     0      0   18036651      58673     0   19702671     0
     58790     0      0   18069235      58422     0   19897858     0
     58226     0      0   17948175      57969     0   19648810     0
     58689     0      0   18167855      58479     0   19909843     0
     58633     0      0   17952951      58437     0   19760197     0
     61019     0      0   18779030      60592     0   20394481     0
     56696     0      0   17647407      56552     0   19261155     0
     58853     0      0   18186019      58530     0   19886197     0
     58739     0      0   18314790      58768     0   20165654     0
     58748     0      0   18267243      58539     0   20016668     0
     58672     0      0   17914657      58378     0   19558833     0
     59885     0      0   18332641      59780     0   20239241     0

We're going to crank one server up to 8 igb queues and
hw.igb.max_rxd/txd to 32k and see what blows up.

Sean
Re: igb(4) Raising IGB_MAX_RXD ??
On Thu, Apr 19, 2012 at 12:26 PM, Sean Bruno wrote:
> On Thu, 2012-04-19 at 07:09 -0700, Jack Vogel wrote:
> > OH, well that's interesting to know, thanks John.
> >
> > Jack
>
> Front end box looks pretty happy today at 8k descriptors.
>
> http://people.freebsd.org/~sbruno/igb_8k_stats.txt
>
> Under peak, we're approaching 20 MBytes/sec in and out of the
> interface. :-) Nifty.
>
> -bash-4.2$ netstat 1
>             input        (Total)           output
>    packets  errs idrops      bytes    packets  errs      bytes colls
>      59542     0      0   18189602      59131     0   19884085     0
>      58941     0      0   18036651      58673     0   19702671     0
>      58790     0      0   18069235      58422     0   19897858     0
>      58226     0      0   17948175      57969     0   19648810     0
>      58689     0      0   18167855      58479     0   19909843     0
>      58633     0      0   17952951      58437     0   19760197     0
>      61019     0      0   18779030      60592     0   20394481     0
>      56696     0      0   17647407      56552     0   19261155     0
>      58853     0      0   18186019      58530     0   19886197     0
>      58739     0      0   18314790      58768     0   20165654     0
>      58748     0      0   18267243      58539     0   20016668     0
>      58672     0      0   17914657      58378     0   19558833     0
>      59885     0      0   18332641      59780     0   20239241     0
>
> We're going to crank one server up to 8 igb queues and
> hw.igb.max_rxd/txd to 32k and see what blows up.
>
> Sean

Great, look forward to the results. Thanks Sean.

Jack
Re: Some performance measurements on the FreeBSD network stack
On 19.04.2012 15:30, Luigi Rizzo wrote:
> I have been running some performance tests on UDP sockets,
> using the netsend program in tools/tools/netrate/netsend
> and instrumenting the source code and the kernel to return at
> various points of the path. Here are some results which
> I hope you find interesting.

Jumping over very interesting analysis...

> - the next expensive operation, consuming another 100ns, is the mbuf
>   allocation in m_uiotombuf(). Nevertheless, the allocator seems to
>   scale decently at least with 4 cores. The copyin() is relatively
>   inexpensive (not reported in the data below, but disabling it saves
>   only 15-20ns for a short packet).
>
>   I have not followed the details, but the allocator calls the zone
>   allocator and there is at least one critical_enter()/critical_exit()
>   pair, and the highly modular architecture invokes long chains of
>   indirect function calls both on allocation and release.
>
>   It might make sense to keep a small pool of mbufs attached to the
>   socket buffer instead of going to the zone allocator. Or defer the
>   actual encapsulation to the (*so->so_proto->pr_usrreqs->pru_send)()
>   which is called inline, anyways.

The UMA mbuf allocator is certainly not perfect but rather good. It has
a per-CPU cache of mbufs that are very fast to allocate from. Once it
has used them up it needs to refill from the global pool, which may
happen from time to time and show up in the averages.

> - another big bottleneck is the route lookup in ip_output() (between
>   entries 51 and 56). Not only does it eat another 100ns+ on an empty
>   routing table, but it also causes huge contention when multiple
>   cores are involved.

This is indeed a big problem. I'm working (rough edges remain) on
changing the routing table locking to an rmlock (read-mostly), which
doesn't produce any lock contention or cache pollution. Also skipping
the per-route lock while the table read-lock is held should help some
more.

All in all this should give a massive gain in high-pps situations at
the expense of costlier routing table changes. However, changes are
seldom to essentially nonexistent with a single default route.

After that the ARP table will get the same treatment, and the low-stack
lock contention points should be gone for good.

-- 
Andre
Re: Some performance measurements on the FreeBSD network stack
On Thu, Apr 19, 2012 at 10:05:37PM +0200, Andre Oppermann wrote:
> On 19.04.2012 15:30, Luigi Rizzo wrote:
> > I have been running some performance tests on UDP sockets,
> > using the netsend program in tools/tools/netrate/netsend
> > and instrumenting the source code and the kernel to return at
> > various points of the path. Here are some results which
> > I hope you find interesting.
>
> Jumping over very interesting analysis...
>
> > - the next expensive operation, consuming another 100ns,
> >   is the mbuf allocation in m_uiotombuf(). Nevertheless, the
> >   allocator seems to scale decently at least with 4 cores. The
> >   copyin() is relatively inexpensive (not reported in the data
> >   below, but disabling it saves only 15-20ns for a short packet).
> >
> >   I have not followed the details, but the allocator calls the zone
> >   allocator and there is at least one
> >   critical_enter()/critical_exit() pair, and the highly modular
> >   architecture invokes long chains of indirect function calls both
> >   on allocation and release.
> >
> >   It might make sense to keep a small pool of mbufs attached to the
> >   socket buffer instead of going to the zone allocator.
> >   Or defer the actual encapsulation to the
> >   (*so->so_proto->pr_usrreqs->pru_send)() which is called inline,
> >   anyways.
>
> The UMA mbuf allocator is certainly not perfect but rather good.
> It has a per-CPU cache of mbuf's that are very fast to allocate
> from. Once it has used them it needs to refill from the global
> pool which may happen from time to time and show up in the averages.

indeed i was pleased to see no difference between 1 and 4 threads. This
also suggests that the global pool is accessed very seldom, and for
short times, otherwise you'd see the effect with 4 threads.

What might be moderately expensive are the
critical_enter()/critical_exit() calls around individual allocations.

The allocation happens while the code already has an exclusive lock on
so->snd_buf, so a pool of fresh buffers could be attached there.

But the other consideration is that one could defer the mbuf allocation
to a later time when the packet is actually built (or anyways right
before the thread returns).

What i envision (and this would fit nicely with netmap) is the
following:
- have a (possibly readonly) template for the headers (MAC+IP+UDP)
  attached to the socket, built on demand, and cached and managed with
  similar invalidation rules as used by fastforward;
- possibly extend the pru_send interface so one can pass down the uio
  instead of the mbuf;
- make an opportunistic buffer allocation in some place downstream,
  where the code already has an x-lock on some resource (could be the
  snd_buf, the interface, ...) so the allocation comes for free.

> > - another big bottleneck is the route lookup in ip_output()
> >   (between entries 51 and 56). Not only it eats another 100ns+ on
> >   an empty routing table, but it also causes huge contentions when
> >   multiple cores are involved.
>
> This is indeed a big problem. I'm working (rough edges remain) on
> changing the routing table locking to an rmlock (read-mostly) which

i was wondering, is there a way (and/or any advantage) to use the
fastforward code to look up the route for locally sourced packets?

cheers
luigi
Re: Some performance measurements on the FreeBSD network stack
> > This is indeed a big problem. I'm working (rough edges remain) on
> > changing the routing table locking to an rmlock (read-mostly) which

This only helps if your flows aren't hitting the same rtentry.
Otherwise you still convoy on the lock for the rtentry itself to
increment and decrement the rtentry's reference count.

> i was wondering, is there a way (and/or any advantage) to use the
> fastforward code to look up the route for locally sourced packets?

If the number of peers is bounded then you can use the flowtable. Max
PPS is much higher bypassing the routing lookup. However, it doesn't
scale to arbitrary flow numbers.

-Kip
Comment nit
I noted a small nit in the comments of sys/dev/e1000/if_igb.h

Index: if_igb.h
===================================================================
--- if_igb.h	(revision 234466)
+++ if_igb.h	(working copy)
@@ -52,7 +52,7 @@
 #define IGB_MAX_TXD	4096
 
 /*
- * IGB_RXD: Maximum number of Transmit Descriptors
+ * IGB_RXD: Maximum number of Receive Descriptors
  *
  * This value is the number of receive descriptors allocated by the driver.
  * Increasing this value allows the driver to buffer more incoming packets.
Re: Some performance measurements on the FreeBSD network stack
On Thu, Apr 19, 2012 at 10:34:45PM +0200, K. Macy wrote:
> > > This is indeed a big problem. I'm working (rough edges remain) on
> > > changing the routing table locking to an rmlock (read-mostly) which
>
> This only helps if your flows aren't hitting the same rtentry.
> Otherwise you still convoy on the lock for the rtentry itself to
> increment and decrement the rtentry's reference count.
>
> > i was wondering, is there a way (and/or any advantage) to use the
> > fastforward code to look up the route for locally sourced packets ?

actually, now that i look at the code, both ip_output() and
the ip_fastforward code use the same in_rtalloc_ign(...)

> If the number of peers is bounded then you can use the flowtable. Max
> PPS is much higher bypassing routing lookup. However, it doesn't scale
> to arbitrary flow numbers.

re. flowtable, could you point me to what i should do instead of
calling in_rtalloc_ign() ?

cheers
luigi
Re: Some performance measurements on the FreeBSD network stack
On Thu, Apr 19, 2012 at 11:22 PM, Luigi Rizzo wrote:
> On Thu, Apr 19, 2012 at 10:34:45PM +0200, K. Macy wrote:
> > > > This is indeed a big problem. I'm working (rough edges remain) on
> > > > changing the routing table locking to an rmlock (read-mostly) which
> >
> > This only helps if your flows aren't hitting the same rtentry.
> > Otherwise you still convoy on the lock for the rtentry itself to
> > increment and decrement the rtentry's reference count.
> >
> > > i was wondering, is there a way (and/or any advantage) to use the
> > > fastforward code to look up the route for locally sourced packets ?
>
> actually, now that i look at the code, both ip_output() and
> the ip_fastforward code use the same in_rtalloc_ign(...)
>
> > If the number of peers is bounded then you can use the flowtable. Max
> > PPS is much higher bypassing routing lookup. However, it doesn't scale
> > to arbitrary flow numbers.
>
> re. flowtable, could you point me to what i should do instead of
> calling in_rtalloc_ign() ?

If you build with it in your kernel config and enable the sysctl,
ip_output will automatically use it for TCP and UDP connections. If
you're doing forwarding you'll need to patch the forwarding path.
Fabien Thomas has a patch for that that I just fixed/identified a bug
in for him.

-Kip

-- 
“The real damage is done by those millions who want to 'get by.' The
ordinary men who just want to be left in peace. Those who don’t want
their little lives disturbed by anything bigger than themselves. Those
with no sides and no causes. Those who won’t take measure of their own
strength, for fear of antagonizing their own weakness. Those who don’t
like to make waves—or enemies. Those for whom freedom, honour, truth,
and principles are only literature. Those who live small, love small,
die small. It’s the reductionist approach to life: if you keep it
small, you’ll keep it under control. If you don’t make any noise, the
bogeyman won’t find you. But it’s all an illusion, because they die
too, those people who roll up their spirits into tiny little balls so
as to be safe. Safe?! From what? Life is always on the edge of death;
narrow streets lead to the same place as wide avenues, and a little
candle burns itself out just like a flaming torch does. I choose my own
way to burn.”

Sophie Scholl
Re: Some performance measurements on the FreeBSD network stack
On 19.04.2012 22:34, K. Macy wrote:
> > > This is indeed a big problem. I'm working (rough edges remain) on
> > > changing the routing table locking to an rmlock (read-mostly) which
>
> This only helps if your flows aren't hitting the same rtentry.
> Otherwise you still convoy on the lock for the rtentry itself to
> increment and decrement the rtentry's reference count.

The rtentry lock isn't obtained anymore. While the rmlock read lock is
held on the rtable, the relevant information like the ifp and such is
copied out. No later referencing is possible. In the end any
referencing of an rtentry would be forbidden and the rtentry lock can
be removed. The second step can be optional though.

> > i was wondering, is there a way (and/or any advantage) to use the
> > fastforward code to look up the route for locally sourced packets ?
>
> If the number of peers is bounded then you can use the flowtable. Max
> PPS is much higher bypassing routing lookup. However, it doesn't scale
> to arbitrary flow numbers.

In theory a rmlock-only lookup into a default-route-only routing table
would be faster than creating a flowtable entry for every destination.
It's a matter of churn though. The flowtable isn't lockless in itself,
is it?

-- 
Andre
Re: Some performance measurements on the FreeBSD network stack
> > This only helps if your flows aren't hitting the same rtentry.
> > Otherwise you still convoy on the lock for the rtentry itself to
> > increment and decrement the rtentry's reference count.
>
> The rtentry lock isn't obtained anymore. While the rmlock read
> lock is held on the rtable the relevant information like ifp and
> such is copied out. No later referencing possible. In the end
> any referencing of an rtentry would be forbidden and the rtentry
> lock can be removed. The second step can be optional though.

Can you point me to a tree where you've made these changes?

> > > i was wondering, is there a way (and/or any advantage) to use the
> > > fastforward code to look up the route for locally sourced packets ?
> >
> > If the number of peers is bounded then you can use the flowtable.
> > Max PPS is much higher bypassing routing lookup. However, it doesn't
> > scale to arbitrary flow numbers.
>
> In theory a rmlock-only lookup into a default-route only routing
> table would be faster than creating a flow table entry for every
> destination. It's a matter of churn though. The flowtable isn't
> lockless in itself, is it?

It is. In a steady state where the working set of peers fits in the
table it should be just a simple hash of the ip and then a lookup.

-Kip
Re: Some performance measurements on the FreeBSD network stack
On 19.04.2012 22:46, Luigi Rizzo wrote:
> On Thu, Apr 19, 2012 at 10:05:37PM +0200, Andre Oppermann wrote:
> > On 19.04.2012 15:30, Luigi Rizzo wrote:
> > > I have been running some performance tests on UDP sockets,
> > > using the netsend program in tools/tools/netrate/netsend
> > > and instrumenting the source code and the kernel to return at
> > > various points of the path. Here are some results which
> > > I hope you find interesting.
> >
> > Jumping over very interesting analysis...
> >
> > > - the next expensive operation, consuming another 100ns, is the
> > >   mbuf allocation in m_uiotombuf(). Nevertheless, the allocator
> > >   seems to scale decently at least with 4 cores. The copyin() is
> > >   relatively inexpensive (not reported in the data below, but
> > >   disabling it saves only 15-20ns for a short packet).
> > >
> > >   I have not followed the details, but the allocator calls the
> > >   zone allocator and there is at least one
> > >   critical_enter()/critical_exit() pair, and the highly modular
> > >   architecture invokes long chains of indirect function calls
> > >   both on allocation and release.
> > >
> > >   It might make sense to keep a small pool of mbufs attached to
> > >   the socket buffer instead of going to the zone allocator. Or
> > >   defer the actual encapsulation to the
> > >   (*so->so_proto->pr_usrreqs->pru_send)() which is called inline,
> > >   anyways.
> >
> > The UMA mbuf allocator is certainly not perfect but rather good.
> > It has a per-CPU cache of mbuf's that are very fast to allocate
> > from. Once it has used them it needs to refill from the global
> > pool which may happen from time to time and show up in the averages.
>
> indeed i was pleased to see no difference between 1 and 4 threads.
> This also suggests that the global pool is accessed very seldom, and
> for short times, otherwise you'd see the effect with 4 threads.

Robert did the per-CPU mbuf allocator pools a few years ago. Excellent
engineering.

> What might be moderately expensive are the
> critical_enter()/critical_exit() calls around individual allocations.

Can't get away from those, as a thread must not migrate away while
manipulating the per-CPU mbuf pool.

> The allocation happens while the code already has an exclusive lock
> on so->snd_buf so a pool of fresh buffers could be attached there.

Ah, but it is not necessary to hold the snd_buf lock while doing the
allocate+copyin. With soreceive_stream() (which is experimental and not
enabled by default) I did just that for the receive path. It's quite a
significant gain there. IMHO better to resolve the locking order than
to juggle yet another mbuf sink.

> But the other consideration is that one could defer the mbuf
> allocation to a later time when the packet is actually built (or
> anyways right before the thread returns).
>
> What i envision (and this would fit nicely with netmap) is the
> following:
> - have a (possibly readonly) template for the headers (MAC+IP+UDP)
>   attached to the socket, built on demand, and cached and managed
>   with similar invalidation rules as used by fastforward;

That would require cross-pointering the rtentry and whatnot again.
We want to get away from that to untangle the (locking) mess that
eventually results from it.

> - possibly extend the pru_send interface so one can pass down the uio
>   instead of the mbuf;
> - make an opportunistic buffer allocation in some place downstream,
>   where the code already has an x-lock on some resource (could be the
>   snd_buf, the interface, ...) so the allocation comes for free.

ETOOCOMPLEXOVERTIME.

> > > - another big bottleneck is the route lookup in ip_output()
> > >   (between entries 51 and 56). Not only it eats another 100ns+ on
> > >   an empty routing table, but it also causes huge contentions
> > >   when multiple cores are involved.
> >
> > This is indeed a big problem. I'm working (rough edges remain) on
> > changing the routing table locking to an rmlock (read-mostly) which
>
> i was wondering, is there a way (and/or any advantage) to use the
> fastforward code to look up the route for locally sourced packets ?

No. The main advantage/difference of fastforward is the short code path
and processing to completion.

-- 
Andre
Re: Some performance measurements on the FreeBSD network stack
On 19.04.2012 23:17, K. Macy wrote:
>>>> This only helps if your flows aren't hitting the same rtentry.
>>>> Otherwise you still convoy on the lock for the rtentry itself to
>>>> increment and decrement the rtentry's reference count.
>>>
>>> The rtentry lock isn't obtained anymore.  While the rmlock read
>>> lock is held on the rtable the relevant information like ifp and
>>> such is copied out.  No later referencing possible.  In the end
>>> any referencing of an rtentry would be forbidden and the rtentry
>>> lock can be removed.  The second step can be optional though.
>>
>> Can you point me to a tree where you've made these changes?
>
> It's not in a public tree.  I just did a 'svn up' and the recent
> pf and rtsocket changes created some conflicts.  Have to solve
> them before posting.  Timeframe (early) next week.

>>>> i was wondering, is there a way (and/or any advantage) to use the
>>>> fastforward code to look up the route for locally sourced packets ?
>>>
>>> If the number of peers is bounded then you can use the flowtable.
>>> Max PPS is much higher bypassing routing lookup.  However, it
>>> doesn't scale to arbitrary flow numbers.
>>
>> In theory an rmlock-only lookup into a default-route-only routing
>> table would be faster than creating a flowtable entry for every
>> destination.  It's a matter of churn though.  The flowtable isn't
>> lockless in itself, is it?
>
> It is.  In a steady state where the working set of peers fits in
> the table it should be just a simple hash of the ip and then a
> lookup.

Yes, but the lookup requires a lock?  Or is every entry replicated
to every CPU?  So a number of concurrent CPUs sending to the same
UDP destination would contend on that lock?

-- 
Andre
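The per-CPU answer Kip gives below can be illustrated with a toy flow cache: each CPU owns a private hash of peer -> route, so the steady-state lookup touches only CPU-local slots and needs no lock. All names and structures here are invented for illustration and are not the FreeBSD flowtable; `route_slow()` merely stands in for the full routing-table lookup:

```c
#include <assert.h>

#define NCPU    4
#define FT_SIZE 64

struct flow_ent {
	unsigned dst;
	unsigned route;                 /* whatever the route lookup yields */
	int      valid;
};

/* One private table per CPU: the common-case lookup is lock-free
 * because only the owning CPU ever touches its own slots. */
static struct flow_ent flowtable[NCPU][FT_SIZE];

static int slow_path_calls;         /* counts trips to the "real" lookup */

static unsigned
route_slow(unsigned dst)            /* stand-in for the full rtable lookup */
{
	slow_path_calls++;
	return (dst ^ 1);               /* dummy result */
}

static unsigned
ft_hash(unsigned dst)
{
	return ((dst * 2654435761u) % FT_SIZE);
}

static unsigned
flow_lookup(int cpu, unsigned dst)
{
	struct flow_ent *fe = &flowtable[cpu][ft_hash(dst)];

	if (fe->valid && fe->dst == dst)
		return (fe->route);         /* hit: no lock, no refcount */
	fe->dst = dst;                  /* miss: consult the routing table */
	fe->route = route_slow(dst);
	fe->valid = 1;
	return (fe->route);
}
```

The cost is exactly the duplication Kip mentions: a peer that is sent to from every core ends up cached once per core, which is why this only pays off when the working set of peers is bounded.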
Re: Some performance measurements on the FreeBSD network stack
> Yes, but the lookup requires a lock?  Or is every entry replicated
> to every CPU?  So a number of concurrent CPUs sending to the same
> UDP destination would contend on that lock?

No.  In the default case it's per CPU, thus no serialization is
required.  But yes, if your transmitting thread manages to bounce to
every core during send within the flow expiration window you'll have
an extra 12 or however many bytes per peer times the number of cores.
There is usually a fair amount of CPU affinity over a given unit time.

-- 
“The real damage is done by those millions who want to 'get by.'
The ordinary men who just want to be left in peace.  Those who don’t
want their little lives disturbed by anything bigger than themselves.
Those with no sides and no causes.  Those who won’t take measure of
their own strength, for fear of antagonizing their own weakness.
Those who don’t like to make waves—or enemies.  Those for whom
freedom, honour, truth, and principles are only literature.  Those
who live small, love small, die small.  It’s the reductionist
approach to life: if you keep it small, you’ll keep it under control.
If you don’t make any noise, the bogeyman won’t find you.  But it’s
all an illusion, because they die too, those people who roll up their
spirits into tiny little balls so as to be safe.  Safe?!  From what?
Life is always on the edge of death; narrow streets lead to the same
place as wide avenues, and a little candle burns itself out just like
a flaming torch does.  I choose my own way to burn.”
   Sophie Scholl
Re: Some performance measurements on the FreeBSD network stack
On Thu, Apr 19, 2012 at 11:27 PM, Andre Oppermann wrote:
> On 19.04.2012 23:17, K. Macy wrote:
>>>> This only helps if your flows aren't hitting the same rtentry.
>>>> Otherwise you still convoy on the lock for the rtentry itself to
>>>> increment and decrement the rtentry's reference count.
>>>
>>> The rtentry lock isn't obtained anymore.  While the rmlock read
>>> lock is held on the rtable the relevant information like ifp and
>>> such is copied out.  No later referencing possible.  In the end
>>> any referencing of an rtentry would be forbidden and the rtentry
>>> lock can be removed.  The second step can be optional though.
>>
>> Can you point me to a tree where you've made these changes?
>
> It's not in a public tree.  I just did a 'svn up' and the recent
> pf and rtsocket changes created some conflicts.  Have to solve
> them before posting.  Timeframe (early) next week.

Ok.  Keep us posted.

Thanks,
Kip
Re: Some performance measurements on the FreeBSD network stack
On Thu, Apr 19, 2012 at 11:20:00PM +0200, Andre Oppermann wrote:
> On 19.04.2012 22:46, Luigi Rizzo wrote:
...
> > What might be moderately expensive are the
> > critical_enter()/critical_exit() calls around individual
> > allocations.
>
> Can't get away from those as a thread must not migrate away
> when manipulating the per-CPU mbuf pool.

i understand.

> > The allocation happens while the code already has an exclusive
> > lock on so->snd_buf so a pool of fresh buffers could be attached
> > there.
>
> Ah, there it is not necessary to hold the snd_buf lock while
> doing the allocate+copyin.  With soreceive_stream() (which is

it is not held in the tx path either -- but there is a short section
before m_uiotombuf() which does

	...
	SOCKBUF_LOCK(&so->so_snd);
	// check for pending errors, sbspace, so_state
	SOCKBUF_UNLOCK(&so->so_snd);
	...

(some of this is slightly dubious, but that's another story)

> > But the other consideration is that one could defer the mbuf
> > allocation to a later time when the packet is actually built (or
> > anyways right before the thread returns).
> > What i envision (and this would fit nicely with netmap) is the
> > following:
> > - have a (possibly readonly) template for the headers (MAC+IP+UDP)
> >   attached to the socket, built on demand, and cached and managed
> >   with similar invalidation rules as used by fastforward;
>
> That would require to cross-pointer the rtentry and whatnot again.

i was planning to keep a copy, not a reference.  If the copy becomes
temporarily stale, no big deal, as long as you can detect it
reasonably quickly -- routes are not guaranteed to be correct,
anyways.

> > - possibly extend the pru_send interface so one can pass down the
> >   uio instead of the mbuf;
> > - make an opportunistic buffer allocation in some place downstream,
> >   where the code already has an x-lock on some resource (could be
> >   the snd_buf, the interface, ...) so the allocation comes for free.
>
> ETOOCOMPLEXOVERTIME.

maybe.  But i want to investigate this.
cheers
luigi
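Luigi's header-template idea — a prebuilt MAC+IP+UDP header cached on the socket and invalidated by generation rather than by holding an rtentry reference — can be sketched as follows. The names, the layout, and the global generation counter are all assumptions for illustration; this is a toy, not the invalidation scheme fastforward actually uses:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HDR_LEN (14 + 20 + 8)       /* Ethernet + IPv4 + UDP */

static unsigned route_generation;   /* bumped on any routing change */
static int template_rebuilds;

struct hdr_cache {
	uint8_t  tmpl[HDR_LEN];         /* prebuilt MAC+IP+UDP header */
	unsigned gen;                   /* generation it was built at */
	int      valid;
};

static void
build_template(struct hdr_cache *hc)
{
	template_rebuilds++;
	memset(hc->tmpl, 0, sizeof(hc->tmpl));  /* real code fills headers */
	hc->gen = route_generation;
	hc->valid = 1;
}

/* Per-packet path: one generation compare and one memcpy.  A stale
 * copy is detected by the generation mismatch and rebuilt, so no
 * rtentry pointer ever needs to be cached on the socket. */
static void
emit_headers(struct hdr_cache *hc, uint8_t *pkt)
{
	if (!hc->valid || hc->gen != route_generation)
		build_template(hc);
	memcpy(pkt, hc->tmpl, HDR_LEN);
}
```

This is the "copy, not a reference" point: a temporarily stale template is harmless as long as the mismatch is detected reasonably quickly, which is exactly what the generation compare buys.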
Re: Some performance measurements on the FreeBSD network stack
On 20.04.2012 00:03, Luigi Rizzo wrote:
> On Thu, Apr 19, 2012 at 11:20:00PM +0200, Andre Oppermann wrote:
>> On 19.04.2012 22:46, Luigi Rizzo wrote:
>>> The allocation happens while the code already has an exclusive
>>> lock on so->snd_buf so a pool of fresh buffers could be attached
>>> there.
>>
>> Ah, there it is not necessary to hold the snd_buf lock while
>> doing the allocate+copyin.  With soreceive_stream() (which is
>
> it is not held in the tx path either -- but there is a short section
> before m_uiotombuf() which does
>
> 	...
> 	SOCKBUF_LOCK(&so->so_snd);
> 	// check for pending errors, sbspace, so_state
> 	SOCKBUF_UNLOCK(&so->so_snd);
> 	...
>
> (some of this is slightly dubious, but that's another story)

Indeed the lock isn't held across the m_uiotombuf().  You're talking
about filling a sockbuf mbuf cache while holding the lock?

>>> But the other consideration is that one could defer the mbuf
>>> allocation to a later time when the packet is actually built (or
>>> anyways right before the thread returns).
>>> What i envision (and this would fit nicely with netmap) is the
>>> following:
>>> - have a (possibly readonly) template for the headers (MAC+IP+UDP)
>>>   attached to the socket, built on demand, and cached and managed
>>>   with similar invalidation rules as used by fastforward;
>>
>> That would require to cross-pointer the rtentry and whatnot again.
>
> i was planning to keep a copy, not a reference.  If the copy becomes
> temporarily stale, no big deal, as long as you can detect it
> reasonably quickly -- routes are not guaranteed to be correct,
> anyways.

Be wary of disappearing interface pointers...

>>> - possibly extend the pru_send interface so one can pass down the
>>>   uio instead of the mbuf;
>>> - make an opportunistic buffer allocation in some place downstream,
>>>   where the code already has an x-lock on some resource (could be
>>>   the snd_buf, the interface, ...) so the allocation comes for free.
>>
>> ETOOCOMPLEXOVERTIME.
>
> maybe.  But i want to investigate this.

I fail to see what passing down the uio would gain you.  The snd_buf
lock isn't obtained again after the copyin.  Not that I want to
prevent you from investigating other ways. ;)

-- 
Andre
Re: Some performance measurements on the FreeBSD network stack
On Fri, Apr 20, 2012 at 12:37:21AM +0200, Andre Oppermann wrote:
> On 20.04.2012 00:03, Luigi Rizzo wrote:
>> On Thu, Apr 19, 2012 at 11:20:00PM +0200, Andre Oppermann wrote:
>>> On 19.04.2012 22:46, Luigi Rizzo wrote:
>>>> The allocation happens while the code already has an exclusive
>>>> lock on so->snd_buf so a pool of fresh buffers could be attached
>>>> there.
>>>
>>> Ah, there it is not necessary to hold the snd_buf lock while
>>> doing the allocate+copyin.  With soreceive_stream() (which is
>>
>> it is not held in the tx path either -- but there is a short section
>> before m_uiotombuf() which does
>>
>> 	...
>> 	SOCKBUF_LOCK(&so->so_snd);
>> 	// check for pending errors, sbspace, so_state
>> 	SOCKBUF_UNLOCK(&so->so_snd);
>> 	...
>>
>> (some of this is slightly dubious, but that's another story)
>
> Indeed the lock isn't held across the m_uiotombuf().  You're talking
> about filling a sockbuf mbuf cache while holding the lock?

all i am thinking is that when we have a serialization point we could
use it for multiple related purposes.  In this case yes, we could keep
a small mbuf cache attached to so_snd.  When the cache is empty get a
new batch (say 10-20 bufs) from the zone allocator, possibly dropping
and regaining the lock if the so_snd lock must be a leaf in the
locking order.  Besides, for protocols like TCP (does it use the same
path ?) the mbufs are already there (released by incoming acks) in
the steady state, so it is not even necessary to refill the cache.

This said, i am not 100% sure that the 100ns I am seeing are all
spent in the zone allocator.  As i said the chain of indirect calls
and other ops is rather long on both acquire and release.

>>>> But the other consideration is that one could defer the mbuf
>>>> allocation to a later time when the packet is actually built (or
>>>> anyways right before the thread returns).
>>>> What i envision (and this would fit nicely with netmap) is the
>>>> following:
>>>> - have a (possibly readonly) template for the headers (MAC+IP+UDP)
>>>>   attached to the socket, built on demand, and cached and managed
>>>>   with similar invalidation rules as used by fastforward;
>>>
>>> That would require to cross-pointer the rtentry and whatnot again.
>>
>> i was planning to keep a copy, not a reference.  If the copy becomes
>> temporarily stale, no big deal, as long as you can detect it
>> reasonably quickly -- routes are not guaranteed to be correct,
>> anyways.
>
> Be wary of disappearing interface pointers...

(this reminds me, what prevents a route grabbed from the flowtable
from disappearing and releasing the ifp reference ?)

In any case, it seems better to keep a more persistent ifp reference
in the socket rather than grab and release one on every single packet
transmission.

>>>> - possibly extend the pru_send interface so one can pass down the
>>>>   uio instead of the mbuf;
>>>> - make an opportunistic buffer allocation in some place
>>>>   downstream, where the code already has an x-lock on some
>>>>   resource (could be the snd_buf, the interface, ...) so the
>>>>   allocation comes for free.
>>>
>>> ETOOCOMPLEXOVERTIME.
>>
>> maybe.  But i want to investigate this.
>
> I fail to see what passing down the uio would gain you.  The snd_buf
> lock isn't obtained again after the copyin.  Not that I want to
> prevent you from investigating other ways. ;)

maybe it can open the way to other optimizations, such as reducing
the number of places where you need to lock, or saving some data
copies, or reducing fragmentation, etc.

cheers
luigi
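The batched refill Luigi describes — a small free list on the send buffer, replenished from the allocator a batch at a time so the per-packet cost is a pointer pop — can be sketched in userspace. `malloc()` stands in for the zone allocator and the structures are invented; this is not the actual sockbuf code:

```c
#include <assert.h>
#include <stdlib.h>

#define BATCH 16                    /* "say 10-20 bufs" */

struct buf {
	struct buf *next;
	char        data[256];
};

struct snd_cache {
	struct buf *free;               /* small free list on the send buffer */
	int         allocator_trips;    /* slow-path visits, for inspection */
};

/* Per-packet fast path: a pointer pop.  Only when the list runs dry
 * do we pay for a batch of allocations (this is where the snd_buf
 * lock could be dropped and regained if it must stay a leaf). */
static struct buf *
cache_get(struct snd_cache *sc)
{
	struct buf *b;

	if (sc->free == NULL) {
		sc->allocator_trips++;
		for (int i = 0; i < BATCH; i++) {
			b = malloc(sizeof(*b));
			b->next = sc->free;
			sc->free = b;
		}
	}
	b = sc->free;
	sc->free = b->next;
	return (b);
}

/* Buffers coming back (e.g. released by incoming acks) refill the
 * list, so in steady state the allocator is never consulted. */
static void
cache_put(struct snd_cache *sc, struct buf *b)
{
	b->next = sc->free;
	sc->free = b;
}
```

The put path models Luigi's TCP observation: once acks return buffers at roughly the send rate, the cache sustains itself and the zone allocator drops out of the per-packet path entirely.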
Question about fixing udp6_input...
Howdy,

At the moment the prototype for udp6_input() is the following:

	int
	udp6_input(struct mbuf **mp, int *offp, int proto)

and udp_input() looks like this:

	void
	udp_input(struct mbuf *m, int off)

As far as I can tell we immediately change **mp to *m and *offp to off
in udp6_input() and we also never use proto in the rest of the
function.

Is there any reason to not make udp6_input() look exactly like
udp_input() ?

Best,
George
Re: Question about fixing udp6_input...
On 20. Apr 2012, at 01:44 , George Neville-Neil wrote:

> Howdy,
>
> At the moment the prototype for udp6_input() is the following:
>
> 	int
> 	udp6_input(struct mbuf **mp, int *offp, int proto)
>
> and udp_input() looks like this:
>
> 	void
> 	udp_input(struct mbuf *m, int off)
>
> As far as I can tell we immediately change **mp to *m and *offp to off
> in udp6_input() and we also never use proto in the rest of the
> function.
>
> Is there any reason to not make udp6_input() look exactly like
> udp_input() ?

I think the answer to this is here:

http://wiki.freebsd.org/IPv6TODO#Remove_ip6protosw

-- 
Bjoern A. Zeeb
You have to have visions!
   It does not matter how good you are.  It matters what good you do!
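The ip6protosw link hints at why the IPv6 prototype looks the way it does: ip6_input() dispatches the extension-header chain in a loop, and each handler returns the next header to process (or a done marker) and may adjust the offset or replace the mbuf, so even terminal protocols like UDP must fit that shape. A toy model of that loop — every name here is invented for illustration, none of it is kernel code:

```c
#include <assert.h>

#define PROTO_HOPOPTS 0             /* toy next-header numbers */
#define PROTO_UDP     17
#define PROTO_DONE    257           /* "stop processing" marker */

struct pkt {
	int hdrs[4];                    /* fake header chain, proto numbers */
};

typedef int (*input_fn)(struct pkt **pp, int *offp, int proto);

/* An extension-header handler: advances the offset and returns the
 * next header so the dispatch loop can continue. */
static int
hopopts_input(struct pkt **pp, int *offp, int proto)
{
	(void)proto;
	(*offp)++;
	return ((*pp)->hdrs[*offp]);
}

static int udp_deliveries;

/* A terminal handler shaped like udp6_input(): consumes the packet
 * and returns the done marker.  The (struct pkt **, int *, int)
 * signature exists so both kinds of handler fit one dispatch table. */
static int
udp6_input_toy(struct pkt **pp, int *offp, int proto)
{
	(void)pp; (void)offp; (void)proto;
	udp_deliveries++;
	return (PROTO_DONE);
}

static input_fn
dispatch(int proto)
{
	return (proto == PROTO_HOPOPTS ? hopopts_input : udp6_input_toy);
}

/* The ip6_input()-style loop that the int return value exists for. */
static void
ip6_input_toy(struct pkt *p)
{
	int off = 0;
	int nxt = p->hdrs[0];

	while (nxt != PROTO_DONE)
		nxt = dispatch(nxt)(&p, &off, nxt);
}
```

Changing udp6_input() to udp_input()'s shape therefore means changing the dispatch mechanism itself, which is why the wiki's "Remove ip6protosw" task is the prerequisite.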