Network buffer allocations: mbuma, PLEASE TEST
Hi,

If you're running -CURRENT, please test this:

http://people.freebsd.org/~bmilekic/code/mbuma2.diff

It consists of several extensions to UMA, with mbuf and cluster allocation built on top of them. Once you apply the patch from src/, you need to rebuild and reinstall src/usr.bin/netstat and src/usr.bin/systat, and then build a new kernel. When configuring your new kernel, you should remove the NMBCLUSTERS compile-time option; it is no longer needed. Clusters will still be capped according to maxusers (which is itself auto-tuned). Alternatively, if you want a theoretically unlimited number of clusters, you can set the boot-time kern.ipc.nmbclusters tunable to zero.

Unless final issues arise, I'm going to commit this tomorrow morning; it has already been tested quite a bit, and its performance has been evaluated. A paper is available and was presented at BSDCan 2004, in case you missed it:

http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf

It has been looked at for quite some time now. Additional code cleanups may need to occur following the commit. Future work is also possible; see the paper if you're interested in taking some of it on.

Oh, and keep me in the CC; I have no idea if I'm still subscribed to these lists. You should also follow up to this thread on -net and not on -hackers (trim -hackers from the CC in the future).

Thanks and happy hacking!

Regards,
-- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED]
___ [EMAIL PROTECTED] mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: polling(4) rocks!
On Wed, Nov 17, 2004 at 02:08:25PM -0500, Charles Swiger wrote:
> On Nov 17, 2004, at 1:52 PM, Eugene Grosbein wrote:
> >On Wed, Nov 17, 2004 at 09:13:51PM +0300, Yar Tikhiy wrote:
> >[ ...praise of polling(4)... ]
> >Does polling(4) increase latency? It is very important for a router
> >that handles lots of RTP (VoIP) traffic.
>
> Using polling does increase the latency of the traffic, slightly: if
> you follow the recommended configuration and set HZ=1000 or so, the
> added latency typically ends up being around 1ms.
>
> That's not enough to affect VoIP significantly. I've got 12 VoIP phone
> lines going through a FreeBSD 4.10 firewall using polling(4) over a T1
> now. The firewall box is a Dell P3/400MHz or so using fxp and a
> quad-port card using the DEC 21x4x chipset.

Latency is affected only in the sense that, at low packet rates, it is lower with interrupts than with polling. If you plot a latency-versus-pps graph, for example, you might notice that with interrupts the curve keeps growing beyond a certain pps number, whereas with polling it flattens out. This happens because with interrupts, more packets always means more interrupts (assuming your card's coalescing abilities have been exhausted, i.e., very high pps all the time), and more interrupts means more latency. With polling, more packets does not mean more interrupts, and the system will process them at a more or less constant rate.

> --
> -Chuck

-- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED]
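For reference, enabling polling on a system of that era looked roughly like the following. This is a sketch from memory of the 4.x/5.x polling(4) interface; option and sysctl names should be checked against the polling(4) manual page for your release:

```shell
# Kernel configuration options (as documented in polling(4)):
#   options DEVICE_POLLING
#   options HZ=1000
#
# At runtime, polling was toggled globally via sysctl:
sysctl kern.polling.enable=1
# and the fraction of CPU reserved for userland could be tuned, e.g.:
sysctl kern.polling.user_frac=50
```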
Re: Dingo and Perforce
This way he can push changes back into 'dingo' before pushing them back into HEAD. It actually makes a lot of sense to take this sort of 3-tier approach when multiple people are working on different parts of a larger sub-project, like dingo:

HEAD (top, must work) <-- (dingo, pre-production, test all changes together) <-- (your individual branch, single set of changes).

You develop in your individual branch and test your changes. If all is well, you push into dingo, where changes get tested with respect to other dingo-related changes (which have not yet been pushed into HEAD). When it's all ready, everything gets pushed into HEAD at once, or in pieces, but you know that the individual pieces work well together. This works because dingo changes less often than HEAD.

-Bosko

On Sun, Dec 19, 2004 at 06:20:17PM +0100, Max Laier wrote:
> On Sunday 19 December 2004 05:23, [EMAIL PROTECTED] wrote:
> > For those who use PerForce and want to work on Dingo there is
> > now a dingo branch, named "dingo". The dingo branch contains
> > all of src, not just sys, as I suspect there are userland bits
> > we'll want to do. I know I'll be doing userland things.
> > Please make your own sub-branch off of dingo. For example,
> > here is mine (gnn_dingo):
> <...>
> > The main branch (dingo) is locked against changes and, more
> > importantly, accidental deletion. Let me know if there are any issues
> > with any of this.
> >
> > Oh, and the integration from vendor to dingo was done last Saturday
> > night around 23:00 JST.
>
> I am not all that familiar with perforce so I am wondering: What is the
> benefit of having this branch if it will be tracking HEAD anyway, we can just
> branch off the normal vendor branch. Or do you plan to integrate parts back?
>
> I have 'mlaier_carp2' off the vendor branch. What will be the benefit if I
> move it to 'dingo' instead?
> --
> /"\ Best regards, | [EMAIL PROTECTED]
> \ / Max Laier | ICQ #67774661
>  X  http://pf4freebsd.love2party.net/ | [EMAIL PROTECTED]
> / \ ASCII Ribbon Campaign | Against HTML Mail and News

-- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED]
Re: Network Emulation Software [recommendations please?]
On Fri, Jan 21, 2005 at 11:14:48AM +1030, Wilkinson, Alex wrote:
> Hi all,
>
> I would like to find some network simulation software that is similar but
> better than NIST Net [http://www-x.antd.nist.gov/nistnet/].
>
> Can anyone recommend any network simulation software that "is a
> general-purpose tool for emulating performance dynamics in IP
> networks".
>
> Cheers
>
> - aW

If you are open to commercial [non-free] solutions, take a look at OPNet Modeler (http://www.opnet.com/products/modeler/home.html).

-- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED]
Re: Race condition in mb_free_ext()?
On Tue, 1 Mar 2005 13:40:18 -0500, John Baldwin <[EMAIL PROTECTED]> wrote:
> On Monday 28 February 2005 07:04 pm, Kris Kennaway wrote:
> > I'm seeing an easily-provoked livelock on quad-CPU sparc64 machines
> > running RELENG_5. It's hard to get a good trace because the processes
> > running on other CPUs cannot be traced from DDB, but I've been lucky a
> > few times:
> >
> > db> show alllocks
> > Process 15 (swi1: net) thread 0xf8001fb07480 (18)
> > exclusive sleep mutex so_snd r = 0 (0xf800178432a8) locked @ netinet/tcp_input.c:2189
> > exclusive sleep mutex inp (tcpinp) r = 0 (0xf800155c3b08) locked @ netinet/tcp_input.c:744
> > exclusive sleep mutex tcp r = 0 (0xc0bdf788) locked @ netinet/tcp_input.c:617
> > db> wh 15
> > Tracing pid 15 tid 18 td 0xf8001fb07480
> > sab_intr() at sab_intr+0x40
> > psycho_intr_stub() at psycho_intr_stub+0x8
> > intr_fast() at intr_fast+0x88
> > -- interrupt level=0xd pil=0 %o7=0xc01a0040 --
> > mb_free_ext() at mb_free_ext+0x28
> > sbdrop_locked() at sbdrop_locked+0x19c
> > tcp_input() at tcp_input+0x2aa0
> > ip_input() at ip_input+0x964
> > netisr_processqueue() at netisr_processqueue+0x7c
> > swi_net() at swi_net+0x120
> > ithread_loop() at ithread_loop+0x24c
> > fork_exit() at fork_exit+0xd4
> > fork_trampoline() at fork_trampoline+0x8
> > db>
> >
> > That code is here in mb_free_ext():
> >
> > /*
> >  * This is tricky. We need to make sure to decrement the
> >  * refcount in a safe way but to also clean up if we're the
> >  * last reference. This method seems to do it without race.
> >  */
> > while (dofree == 0) {
> >         cnt = *(m->m_ext.ref_cnt);
> >         if (atomic_cmpset_int(m->m_ext.ref_cnt, cnt, cnt - 1)) {
> >                 if (cnt == 1)
> >                         dofree = 1;
> >                 break;
> >         }
> > }
>
> Well, this is obtuse at least. A simpler version would be:
>
> do {
>         cnt = *m->m_ext.ref_cnt;
> } while (atomic_cmpset_int(m->m_ext.ref_cnt, cnt, cnt - 1) == 0);
> dofree = (cnt == 1);
>
> --
> John Baldwin <[EMAIL PROTECTED]> <>< http://www.FreeBSD.org/~jhb/
> "Power Users Use the Power to Serve" = http://www.FreeBSD.org

Your suggestion will always enter the loop and do the atomic operation regardless of what dofree is set to above that code (not shown in Kris' paste):

[...]
/* Account for lazy ref count assign. */
if (m->m_ext.ref_cnt == NULL)
        dofree = 1;
else
        dofree = 0;

/*
 * This is tricky. We need to make sure to decrement the
 * refcount in a safe way but to also clean up if we're the
 * last reference. This method seems to do it without race.
 */
[...]

The segment could still be reworked, but anyway: this does not appear to explain the livelock. What does m->m_ext.ref_cnt point to? And what is the value at the location it points to? Regardless, the livelock itself, assuming it is due to a long time spent spinning in the above loop, should not be caused by underruns or overruns of the reference count (those may only cause leaking of the cluster).

Furthermore, the above code has been around in that form for some time now, and in fact the loop was probably entered *more* often in the past (before the 'dofree' variable was introduced there). Since when have you been able to cause the livelock, and are you sure it is mb_free_ext() that is looping indefinitely?

I do not know sparc64 well, but what are the semantics of atomic_cmpset_int()? I see that it is defined to use the 'casa' instruction; does atomic_cmpset_int() behave the same way as it does on i386?

-Bosko

--
Bosko Milekic - If I were a number, I'd be irrational.
Contact Info: http://bmilekic.unixdaemons.com/contact.txt
Re: Race condition in mb_free_ext()?
P.S.: Not sure if this is related: http://www.mail-archive.com/bk-commits-24@vger.kernel.org/msg00136.html

On Tue, 1 Mar 2005 18:04:27 -0500, Bosko Milekic <[EMAIL PROTECTED]> wrote:
> [...]

--
Bosko Milekic - If I were a number, I'd be irrational.
Contact Info: http://bmilekic.unixdaemons.com/contact.txt
Re: Race condition in mb_free_ext()?
You know, the more we run into these problems specific to how reference counting is performed, the more I wonder whether some cleverly defined macros for dealing with common reference counting operations would be worthwhile. I'm not saying "introduce a refcount_t type and a full-blown abstraction," but some template-like stuff might be useful. It took a while just to get the refcount decrement on the free path free of races on SMP, and that's only in the mbuf code.

-Bosko

On Tue, Mar 01, 2005 at 03:14:38PM -0800, Kris Kennaway wrote:
> On Tue, Mar 01, 2005 at 06:04:27PM -0500, Bosko Milekic wrote:
>
> > This does not appear to explain the livelock.
>
> alc and dwhite tracked it down to a missing volatile causing gcc to
> mis-optimize the loop:
>
> - cnt = *(m->m_ext.ref_cnt);
> + cnt = *(volatile u_int *)(m->m_ext.ref_cnt);
>
> I'm currently testing that.
>
> Kris

-- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED]
Re: Race condition in mb_free_ext()?
On Mon, Feb 28, 2005 at 10:00:25PM -0800, Doug White wrote:
> Forgive me for being naive, but is there a reason you don't do an atomic
> subtraction on the refcount? I can see why it repeats -- if two things
> are warring over the refcount, one or the other keeps trying until one wins
> -- but the subtraction would seem more intuitive.

The subtraction is atomic and is part of the cmpset. If you were to do only a subtraction, you would risk racing when figuring out what the counter value was before the subtraction and making sure it stays consistent afterwards. That is the purpose of the cmpset. The idea is that only the LAST thread to decrement the counter down to exactly 1 frees the cluster. If you look at the CVS history for that routine and its various incarnations (you might need to look at kern/subr_mbuf.c in the Attic, since mb_free_ext() used to be there, iirc), you will see various points in time where we had this wrong.

> --
> Doug White | FreeBSD: The Power to Serve
> [EMAIL PROTECTED] | www.FreeBSD.org

-- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED]
Re: generic network protocols parser ?
On Fri, Mar 04, 2005 at 11:07:34AM -0500, Aziz KEZZOU wrote:
> Hi all,
> I am wondering if anyone knows about a generic parser which takes a
> packet (mbuf) of a certain protocol (e.g. RSVP) as input and generates
> some data structure representing the packet?
>
> I've been searching for a while and found that ethereal and tcpdump,
> for example, use protocol-specific data structures and functions to
> dissect each protocol's packets. Is this the only approach possible?
>
> My supervisor suggested using a TLV (Type/Length/Value) approach
> instead. Any opinions about that?
>
> If no such parser exists, is there any practical reason why?
>
> Thanks,
> Aziz

You can only go so far with generic parsing. Eventually you will want some protocol-specific value to be extracted, and your parser will have to know what the packet looks like. What are you trying to do, exactly?

-- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED]
Re: Giant-free polling [PATCH]
On Fri, Mar 11, 2005 at 03:14:50PM +0100, Pawel Jakub Dawidek wrote:
> On Fri, Mar 11, 2005 at 04:55:25PM +0300, dima wrote:
> +> I thought about using list also, but considered it to bring
> +> too much overhead to the code. The original idea of handling arrays
> +> seems to be very elegant.
>
> Overhead? Did you run any benchmarks to prove it?
> I find the list version much more elegant than using an array.

You should be aware, however, that iterating over an array of references, as opposed to a list, in the polling path is faster on both SMP and UP, if you consider only the cost of the iteration itself. With the list approach, you have to bring at least three items into cache at each iteration:

(1) the list node;
(2) the list node's member (the 'pr' reference data in your patch);
(3) the device's mutex.

With the array approach, you remove (1). If the relative frequency of insertions and removals is low (which I think it is in the polling case), you ought to consider the array approach (given the predictable scheduling nature of polling, you can exploit the cache advantages of the array). You will still be able to implement dynamic removal from and insertion into the polling array; the fact that it is an array might just require re-allocation, which is not a big deal.

> I also don't like the idea of calling handler method with two locks
> held (one sx and one mutex)...

Yeah, this should be re-worked. However, the use of the sx lock is neat. Unfortunately, our sx lock implementation always resorts to using a regular mutex lock, even when doing a shared (read-only) acquisition. Since the window of the underlying sx lock mutex is relatively short, in most cases the performance advantage of using the sx lock is noticeable. However, if the window of the original mutex is also short (compared to the sx lock window), then I wonder if it is worth it. The sx lock is really only designed for optimizing long reads, in our case.
Regards,

-- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED]
Re: Abusing m_ext for a worthy goal.
Hi Alfred,

On Mon, 11 Dec 2000, Alfred Perlstein wrote:
> Ok, part of the mp cleanup got me looking at af_unix sockets.
>
> Here's the deal, descriptors can be called "in-flight" meaning they
> have been passed from one application to another via
> CMG_DATA but not picked up on the other side yet.
>
> The kernel does a bunch of hokey stuff to garbage collect them.
>
> Instead of doing that I'd like to abuse MEXTADD to setup a callback
> to remove the references automagically when the mbuf is free'd.
>
> Here's the problem:
>
> I want to use the mbuf clusters, I don't want to use any special
> external storage. I want my callback to free the cluster.
>
> Any hints on how to do this?
>
> here's what I have so far:
>
> MEXTADD(control, mtod(control, char *), control->m_len, unp_mbfree,
>     cm, M_RDONLY, EXT_CMSG_DATA);
>
> control is an mbuf with a cluster attached that has fd's written in it.
> cm is a pointer to the mbuf cluster.

Here, the general idea is correct, but you'll have to do things this way:

#define EXT_CMSG_DATA 666 /* A hack, for now. */

An allocation:

/* Here make sure you're not holding any other mutex. */
mtx_enter(&mclfree.m_mtx, MTX_DEF);
_MCLALLOC(control->m_ext.ext_buf, M_WAIT); /* Or M_DONTWAIT */
mtx_exit(&mclfree.m_mtx, MTX_DEF);
if (control->m_ext.ext_buf != NULL) {
        /*
         * Assuming you want to M_RDONLY your data.
         * Otherwise, just change M_RDONLY to 0.
         */
        MEXTADD(control, control->m_ext.ext_buf, MCLBYTES, unp_mbfree,
            cm, M_RDONLY, EXT_CMSG_DATA);
        if (control->m_ext.ref_cnt == NULL) {
                _MCLFREE(control->m_ext.ext_buf);
                /*
                 * Allocation of the ref. counter failed, so just fail
                 * gracefully instead of panicking (this should never
                 * happen, but make sure to check).
                 */
                return (ENOBUFS);
        }
} else
        return (ENOBUFS);

> static void
> unp_mbfree(caddr_t mm, void *claddr)
> {
>         struct mbuf *m = (struct mbuf *)mm;
>
>         unp_scan(m, unp_discard);
>         _MCLFREE(claddr);
> }

Actually, your free routine needs to accept the address of the cluster as its first argument and the optional argument as its second. According to your MEXTADD() above, your optional argument will be something called "cm." In your case, what you need to make the optional argument is the actual mbuf "control" (you can cast it to struct mbuf *). Then you can call unp_scan() on that mbuf (second argument) and call _MCLFREE() on the first argument, which will effectively be the address of the cluster.

> does this look right? would you like to abstract the mbuf
> clusters further before i started using the _ macros?

Abstracting mbuf clusters is definitely something I'm looking at doing (as you've probably seen me mention in some posts to some of the lists), but for now, the above should do. I'd rather not rush into abstracting the EXT_CLUSTER type further "just for the sake of abstracting it." I want to make sure we don't lose any performance whatsoever where clusters are concerned (i.e., I'm trying to decide whether the cost of calling a function for clusters as well is worth it at this point). Briefly, the day will come. :-)

> --
> -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
> "I have the heart of a child; I keep it in a jar on my desk."

Cheers,
Bosko Milekic [EMAIL PROTECTED]

To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: Abusing m_ext for a worthy goal.
Something I forgot in my first email...

On Mon, 11 Dec 2000, Bosko Milekic wrote:
[...]
> Actually, your free routine needs to accept the address of the
> cluster as a first argument, and the optional argument as a second
> argument. [...]

Don't forget, after running _MCLFREE() on the ext_buf, to also NULL out the control mbuf's ext_buf and to remove the M_EXT bit from the mbuf if you are planning to re-use it. Basically, make sure to clean up after yourself after you forcibly free the cluster "manually."

[...]

Later,
Bosko Milekic [EMAIL PROTECTED]
Re: Abusing m_ext for a worthy goal.
On Mon, 11 Dec 2000, Alfred Perlstein wrote:
> Hmm, I think instead of doing this sort of abuse, what I should be
> doing is allocating an mbuf header, then allocating a mbuf cluster
> then attaching them using the MEXTADD() macro.

Alfred, I'm surprised you're pointing this out when, in my first post, the code example does exactly this, minus the mbuf allocation. :-)

> I'm going to look at the code to see if there's a clean way to do this,
> if not can you provide an interface for allocating and free'ing
> clusters by themselves?
>
> thanks,
> --
> -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
> "I have the heart of a child; I keep it in a jar on my desk."

Bosko Milekic [EMAIL PROTECTED]
Re: MEXT_IS_REF broken.
On Tue, 12 Dec 2000, Alfred Perlstein wrote:
[...]
> And since Chuck backs me up on this, can we consider this discussion
> over as soon as John commits his code?

I think this is a valid point. This is not the first time the need for a proper refcount "interface" has arisen, or at least been discussed, and I would say that it is better to do this now than later. Clearly, the ref. count code in mbuf.h is suffering from a pretty nasty race. Even if we use the particularities of the mbuf cluster referencing system to fix it one way now, we're bound to encounter similar problems sooner or later.

> thanks,
> --
> -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
> "I have the heart of a child; I keep it in a jar on my desk."

Cheers,
Bosko Milekic [EMAIL PROTECTED]
Re: need mbuf generic free() hook
Alfred,

You've brought this up before. This looks similar to your earlier post(s) about how to "abuse m_ext." Mind you, I may be mistaken, but that would be because this isn't being made awfully clear. Earlier you talked about forcing a custom free routine to be called even in the case of mbuf clusters, and we devised a way. Are you suggesting now to have a custom free routine called when your mbuf is freed (i.e., even when it is a non-M_EXT mbuf)? If so, I would urge you to re-examine the code, because doing something like that would be generally wrong. I say "generally" because I'm always open to exceptions, as you know. So please clarify.

On Tue, 12 Dec 2000, Alfred Perlstein wrote:
> Ok, so I'm trying to rid FreeBSD of the unp_gc() junk in uipc_usrreq.c
> I just realized why I'm leaking file descriptors:
>
> Only mbufs with external storage can have a 'free' function, the
> problem is that for the most part when passing a single fd I'm only
> using a single mbuf to pass my data around, my ext_free routine is
> probably never called, and on top of that I'm probably corrupting
> my mbuf by using it when I shouldn't be.
>
> This is terrible, what should be a simple free() hook is now my
> worst nightmare. A couple of options come to mind:
>
> a) always have a free hook, the extra if (m->m_freehook != NULL)
>    check can't be that expensive.
> b) a macro to make an mbuf into an mbuf cluster that points to itself,
>    it should check to see if it has enough room to shuffle the data
>    and return an error if it can't accommodate both the data and the
>    expanded mbuf header.
>
> I like 'a' but would be willing to deal with 'b' as we've never
> had a need for this, although that would mean copying about the
> size of an mbuf in the worst case when a subsystem needs this
> facility.
>
> Suggestions? Comments?
>
> --
> -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
> "I have the heart of a child; I keep it in a jar on my desk."
Bosko Milekic [EMAIL PROTECTED]
Ratelimit Enhancement patch (Please Review One Last Time!)
Hi,

A while ago (it's been at least two weeks now), Mike Silbersack requested a review for:

http://www.silby.com/patches/ratelimit-enhancement-2.patch

To quote the description on his web page, this diff will:

* ICMP ECHO and TSTAMP replies are now rate-limited.
* RSTs generated due to packets sent to open and unopen ports are now
  separated into separate queues.
* Each rate limiting queue now has its own description, as follows:
  Suppressing udp flood/scan: 212/200 pps
  Suppressing outgoing RST due to port scan: 202/200 pps
  Suppressing outgoing RST due to ACK flood: 19725/200 pps
  Suppressing ping flood: 230/200 pps
  Suppressing icmp tstamp flood: 210/200 pps

While the descriptions for the two RST cases can be accused of oversimplification, they should cut down on questions from users confused by the current terminology. Experienced users can always run a packet sniffer if they need more exact knowledge of what's occurring.

The diff was initially reviewed by me and green, and the recommended changes were mainly stylistic. I want to commit this code, but I'm posting it here in case someone has any final objections or review.

Thanks,
Bosko Milekic [EMAIL PROTECTED]
Re: Ratelimit Enhancement patch (Please Review One Last Time!)
On Wed, 13 Dec 2000, Bill Fumerola wrote:
> On Wed, Dec 13, 2000 at 02:42:53PM -0500, Richard A. Steenbergen wrote:
>
> > It could just as easily be a SYN flood against a single port... or a large
> > number of clients trying to connect to your crashed web server... :P Or
> > it could just as easily be an ack flood against a port without a listener
> > and be showing up in the "not the ack flood" counter.
>
> Exactly. Bikeshedding the millions of possible reasons the queue/ratelimit
> was triggered is silly.
>
> Bosko, please change the descriptions to something very generic before
> committing them ("ratelimiting TCP RST packets: x/y pps" or something)

Mike said he would do it and re-post the diff.

> --
> Bill Fumerola - security yahoo / Yahoo! inc.
> - [EMAIL PROTECTED] / [EMAIL PROTECTED]

Later,
Bosko Milekic [EMAIL PROTECTED]
Re: Updated ratelimit patch
Looks good to me. I'll commit this within the next 24 hours unless someone objects with reason. Thanks, Mike.

On Thu, 14 Dec 2000, Mike Silbersack wrote:
>
> Ok, I've redone the messages in the ratelimiting patch. They are now as
> follows:
>
> Limiting icmp unreach response from 439 to 200 packets per second
> Limiting closed port RST response from 283 to 200 packets per second
> Limiting open port RST response from 18724 to 200 packets per second
> Limiting icmp ping response from 211 to 200 packets per second
> Limiting icmp tstamp response from 394 to 200 packets per second
>
> No other changes have been made, and the updated patch is available at:
> http://www.silby.com/patches/ratelimit-enhancement-3.patch
>
> Mike "Silby" Silbersack

Regards,
Bosko Milekic [EMAIL PROTECTED]
Changing the names of some M_flags
Hello,

Recently there was a bikeshed on one of the lists about whether or not to rename the M_WAIT and M_DONTWAIT flags to something that would "communicate more of the Right Thing" to developers, given that for mbuf allocations M_WAIT may return a NULL pointer when all mbuf resources are depleted and the mbuf_wait time has already been spent waiting. The proposed flag names were/are:

M_WAIT --> M_TRY_WAIT
M_DONTWAIT --> M_DONTBLOCK

I have a diff sitting here:

http://people.freebsd.org/~bmilekic/m_flag_rnm.diff

It's roughly 160K. On one side, personally, I wasn't particularly in favor of the change, as I see it more as bloat to the repo than anything else. But I have to admit that some folks have expressed a significant interest in having the flags changed, and with reason, so I agreed to do it. On this same side, it wasn't such a bad thing to do after all, as it led me to spot some bugs, both plain and obfuscated (and potentially serious in the future), which I have yet to fix. A good example is people mixing up M_WAIT and M_WAITOK in calls to malloc(), which could lead to disaster if we were to alter the values of the flags.

On the other side, I personally do not passionately feel the change "needs" to be made. But, as mentioned, if the majority of developers think that it will avoid needless miscommunication, then I have no problem committing it. I don't want to start a bigger bikeshed with this post than it already is. I'd just like to know if anyone has any reasonable objections to having this change happen.

Regards,
Bosko Milekic [EMAIL PROTECTED]
Re: Changing the names of some M_flags
On Sat, 16 Dec 2000, Alfred Perlstein wrote: > I think M_DONTWAIT is fine as it was, and M_TRYWAIT instead of M_TRY_WAIT. > > Leaving it as M_DONTWAIT should reduce the delta by quite a bit and > M_TRYWAIT vs M_TRY_WAIT because you have M_DONTWAIT/M_DONTBLOCK. > > -Alfred I agree. Anyone else before I re-roll? :-) Cheers, Bosko Milekic [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: Changing the names of some M_flags
On Sat, 16 Dec 2000, Alfred Perlstein wrote: > > Is this just going to make portablity between the various *BSD kernels > > more difficult for what's essentially a cosmetic change? I'm thinking > > of things like KAME, ALTQ, etc. > > I agree, however this argument keeps coming up: > "I thought M_WAIT meant it would wait forever!" > > Personally, I think developers should do a bit more research > and should have noticed all the places where M_WAIT was followed > by a check for NULL and be able to bridge the gap. > > So honestly, I'm against the change, but if it has to be done > then I'd like to see the M_DONTWAIT and M_TRYWAIT. > > -- > -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] > "I have the heart of a child; I keep it in a jar on my desk." This is _EXACTLY_ my opinion and standpoint on this issue. In terms of portability, though, let's be realistic: the name of a flag isn't going to be a major concern. When you port code, you typically have to read/understand what's going on anyway. So, if NetBSD for example, uses malloc() to allocate mbufs and passes the WAIT/NOWAIT flag directly down to malloc(), then it may as well never return if malloc() can't find a page. When porting to FreeBSD, the developer needs to know that that is not the case here and must make appropriate changes anyway. In fact, changing the flag may HELP the developer doing the porting into not making the mistake of assuming that it will be the same. Regards, Bosko Milekic [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
M_flag rename: M_WAIT to M_TRYWAIT
Ok, the new version of the diff is here, and it's a mere 35K (compared to the last one which was 162k): http://people.freebsd.org/~bmilekic/m_flag_rnm.diff Big bloat reduction compared to last time. It renames M_WAIT to M_TRYWAIT, M_DONTWAIT stays M_DONTWAIT, and also fixes a few rather lame mistakes, such as M_WAIT being passed to malloc() (as opposed to M_WAITOK) which may prevent future problems if we were to redefine the values of the malloc/mbuf alloc flags. M_WAIT is maintained and defined as M_TRYWAIT for now, but its use is deprecated. Anyone have final comments before this is committed? Then I can go on and change the man page to reflect it as well and send a small heads-up to -current. I have three small PostIt[tm]'s worth of "to do" fixes following this change which involve developers previously assuming that M_WAIT will never return NULL. Going in and making sure these guys check for the possibility of failure will be done once this goes in (I'm doing it in two parts to avoid merging headaches while working). Please note that this is planned only for -CURRENT at this time. Regards, Bosko Milekic [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: Intel ethernet driver source code
Hi, Please show us how you're manipulating the mbufs. You should not be getting page faults unless you're doing something wrong. A good idea would also be to check the fault virtual address and enable debugging support in your kernel in order to be able to analyze the situation as much as you can. Regards, Bosko. - Original Message - From: "Marcus Vinícius Midena Ramos" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Thursday, December 28, 2000 8:58 AM Subject: Intel ethernet driver source code > Hello, > > I am developing an application to measure the performance of an IP > communications network, accounting for packet loss rate and transmission > delay values between any two sites on a net. This work resembles that of > www.advanced.org and is also based on IETF's IPPM initiative. > > One of the goals is to insert a GPS timestamp on the packets to be > transmitted as late as possible before they are put on the wire. This > means I cannot do this job on the application level. Instead, this has > to be done on the ethernet driver level in order to achieve the maximum > accuracy, specially when the measurement machine is under heavy load. > > I've studied the "if_fxp.c" (Intel ethernet driver) source code plus > Stevens' TCP Illustrated vol. 2, which helped me get a good > understanding of the data structures manipulated by this code. My aim > now is to insert the GPS timestamps on the packets to be transmitted > right before they go to the interface. > > To do this, I am trying to modify the driver source code to achieve the > following objectives: (1) read the contents of the mbufs in order to > identify the ones that hold the packets to be transmitted, and (2) > insert the timestamps on those ellegible for so. > > However, any access to mbufs from inside the driver causes an error > message "page fault while in kernel mode" to be displayed the new kernel > is booted. It then enters a loop, trying to reboot forever. 
> > I guess the mbufs are in an area somehow protected by the kernel, which > prevents even the driver to have access to it. May question is: how can > I go around this problem and be able to read and change the contents of > the mbufs from inside the driver ? > > Thank you very much. > Marcus. > > > > To Unsubscribe: send mail to [EMAIL PROTECTED] > with "unsubscribe freebsd-net" in the body of the message > To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: (forw) Two NICs In FreeBSD
I would go with (a) unless someone is religiously obsessed with having the message stay. Else go with (b) but see if you can fit it in with some "generic verbosity" sysctl knob, as opposed to creating its own. Later, Bosko. > /kernel: arp: 1.2.3.4 is on dc0 but got reply from 00:00:c5:79:d0:0c on dc1 > > Hi can we axe this message already? I see the possibility of using > two 100mbit cards on a switch with a gig uplink to be reason enough > to either axe it, or make it a sysctl to quiet the warnings. > > I've also had this broken config setup as a temporary thing and > my console getting flooded was also quite irritating. > > Anyone going to raise a rukus if I turn it > a) off > b) sysctl default off > c) sysctl default on > > thanks, > -Alfred > > - Forwarded message from Tim Gustafson <[EMAIL PROTECTED]> - > > From: Tim Gustafson <[EMAIL PROTECTED]> > To: [EMAIL PROTECTED] > Subject: Two NICs In FreeBSD > Date: Wed, 03 Jan 2001 23:37:06 -0500 > Message-Id: <[EMAIL PROTECTED]> > X-Mailer: QUALCOMM Windows Eudora Version 5.0 > Sender: [EMAIL PROTECTED] > > Hello > > I just installed two NIC cards into my FreeBSD machine that are on the same > LAN, each with a different IP. However, I keep getting this sort of > message in my syslog: > > /kernel: arp: 1.2.3.4 is on dc0 but got reply from 00:00:c5:79:d0:0c on dc1 > > I realize that the message is because it's getting two ARP replies, but > this is normal (for this network, anyhow). How can I suppress these > messages from appearing in the syslog? Is it OK to have both interfaces > receiving ARP replies? Will this mess anything up? > > Tim > > -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- >Tim Gustafson [EMAIL PROTECTED] >www.falconsoft.com (631)475-6662 > -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- >Share your knowledge - it's a way to achieve immortality. 
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- > > > - End forwarded message - > > -- > -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] > "I have the heart of a child; I keep it in a jar on my desk." To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
lock order violation? (Was: Fw: [patch] quick review)
Hello, As Alfred suggests below, there may be a potential problem with the mbuf wait code calling m_reclaim() [the drain routines] with the mmbfree (mbuf freelist mutex) lock held. Presently, to avoid a race condition, we call m_reclaim() with the lock held and recurse into it from the drain routines, which call m_free{m}() to free mbufs. This was initially written this way; it was partially adopted from BSD/OS (it's done the same way there). This way, the mbuf wait code is sure to be the first one to get the chance to allocate after the drain routines run. Unfortunately, as mentioned, there appears to be a potential lock order problem. Using the BSD/OS code as an example, something like this happens: - In ip_slowtimo(), IPQ_LOCK is obtained and, with the lock held, ip_freef() is called. ip_freef() calls m_freem() which grabs the MBUF_LOCK. The lock order here is IPQ -> MBUF. - In ip_drain(), which is called from m_reclaim() with the MBUF lock already held, the IPQ lock is acquired. Then ip_freef() is called, which does m_freem() [which grabs the MBUF lock]. The lock order here, since ip_drain() is called with MBUF already held, is MBUF -> IPQ. Are we correct to assume that there is a potential for deadlock here? Clearly, if one thread in ip_slowtimo() has IPQ and another thread presently draining has MBUF, there is deadlock. The proposed patch (for FreeBSD) is at the address immediately below. For now, what it does is avoid the lock recursion by calling m_reclaim() without the mbuf lock held. We deal with the race condition. In the future, if we can, we may want to weed out the drain routines and make sure that there are no lock ordering problems; but for now, until we're done with all the locking, I think it's better to do it this way. Thanks, Bosko. P.S.: Please feel free to snip -arch from the CC list, if judged appropriate.
Alfred Perlstein wrote: > * Bosko Milekic <[EMAIL PROTECTED]> [010108 19:13] wrote: > > Hi Alfred, > > > > Can you please review this: > > > > http://people.freebsd.org/~bmilekic/code/no_rec_mbsleep.patch > > > > All it does is in m_mballoc_wait() [wait routine for mbuf allocation] > > drop the mbuf freelist lock before calling the drain routines. It avoids > > recursing into the lock from within the drain routines. We discussed this > > some time ago and you brought up the fact that it should probably be changed; > > although I don't recall exactly why (besides for generally just getting rid > > of the lock recursion and relatively large lock contention)... I recall you > > having an even more important reason, can you remind me, if applicable? > > It's a lock ordering problem, you _can_ unwind the lock in the protocol > drain routines before locking down a socketbuffer or whatnot, but the > fact of the matter is if you have to lock _anything_ you pretty much > have to unwind the lock or have a reversal problem as the mbuf system > should be a leaf-lock. > > You're also blocking the natural free'ing of the mbufs by other code > running. > > So the comment is sorta lacking, it's to avoid: > 1) lock recursion (as you stated) > 2) lock reversal > 3) needing to to explicitly unwind the lock to avoid #2 > > At least I think. :) > > It looks like BSD/os holds the lock during the protocol drain routines, > but I think that's a mistake. > > Would you mind reposting this to -net/-smp, I think I'm right about > this, but some peer review couldn't hurt, I wasn't sure if reposting > was ok considering this came across in private. > > -- > -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] > "I have the heart of a child; I keep it in a jar on my desk." To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: no buffer space available
Does this happen regularly? Check `netstat -m' on the machine (you'll probably have to do it from the console) when this is happening. If it happens regularly, does anything specific happen that "helps reproduce it?" What version of FreeBSD are you running? -Bosko G. Jason Middleton wrote: > I have a problem. > I can telnet into my bsd box from the net and after a while it will just not > let me in. It just drops right off the face of the earth. So when i > actually try to ping something from the box after this happens i get an > error as follows: > > ping: sendto: no buffer space avaialble. > > > when i reboot everything works like it should. > > Got any ideas? > > Regards, > > G. Jason Middleton > University of Maryland Baltimore County To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: no buffer space available (outcome of netstat -m)
What device are you using? What is the interface you ifconfig'd down and back up? Jason, can you also contribute? Boris, if you're reading this, if you see that both these gentlemen have the same interface, perhaps we have a winner? I've been unable to find the problem as of yet (frankly, I've had little time to look too deeply). Regards, Bosko. Wesley Morgan wrote: > When this happened to me once before, I ifconfig'd the interface down and > back up... Viola, working interface. Never figured out what caused it > though :) > > On Sun, 28 Jan 2001, G. Jason Middleton wrote: > > > The original problem was an still is > ping: sendto: no buffer space > > avaialble. > > > > I was able to reproduce the error today while telneting in to the box. I > > was ftping through telnet and i did an "ls" and it got halfway through a > > directory listing and crapped out on my. the box was no where to be found > > on the network/internet. > > > > OK i did the netstat -m and got he following output > > > > 298/352/4096 mbufs in use (current/peak/max) > > 170 mbufs allocated to data > > 128 mbufs allocated to packets headers > > 41/160/1024 mbuf clusters in use (current/peak/max) > > 408 kbytes allocated to network (13% of mb-map in use) > > 0 requests for mem denied > > 0 requests for mem delayed > > 0 calls to protocol drain routines > > > > > > what next? > > > > Jason > > > > > > Does this happen regularly? Check `netstat -m' on the machine (you'll > > probably have to do it from the console) when this is happening. > > If it happens regularly, does anything specific happen that "helps > > reproduce it?" > > What version of FreeBSD are you running? > > > > -Bosko > > > > G. Jason Middleton wrote: > > > > > I have a problem. > > > I can telnet into my bsd box from the net and after a while it will > > just not > > > let me in. It just drops right off the face of the earth. 
So when i > > > actually try to ping something from the box after this happens i get > > an > > > error as follows: > > > > > > ping: sendto: no buffer space avaialble. > > > > > > > > > when i reboot everything works like it should. > > > > > > Got any ideas? > > > > > > Regards, > > > > > > G. Jason Middleton > > > University of Maryland Baltimore County > > > > > > > > To Unsubscribe: send mail to [EMAIL PROTECTED] > > with "unsubscribe freebsd-net" in the body of the message > > > > -- >_ __ ___ ___ ___ ___ > Wesley N Morgan _ __ ___ | _ ) __| \ > [EMAIL PROTECTED] _ __ | _ \._ \ |) | > FreeBSD: The Power To Serve _ |___/___/___/ > 6bone: 3ffe:1ce3:7::b4ff:fe53:c297 > Hi! I'm a .signature virus! Copy me into your ~/.signature to help me spread! > > > > To Unsubscribe: send mail to [EMAIL PROTECTED] > with "unsubscribe freebsd-net" in the body of the message > To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: Sequential mbuf read/write extensions
Boris Popov wrote: [...] > Since currently there isn't many consumers of this code I can > suggest to define an option LIBMBUF in the kernel configuration file and > add KLD libmbuf (with interface libmbuf), so kernel footprint will not be I am in favor of such an option on the condition that it is temporary. In other words, only until we decide "we have converted enough code to use this code, so we should remove the option now." The reason is that otherwise we will be faced with numerous "#ifdef LIBMBUF ... #else ... #endif" code. I assume this is what you meant, anyway, so I have no objections. :-) The API looks great by the way, and I will try to give a more detailed review in the next few days. :-) For now: #define M_TRYWAIT M_WAIT is not right. (M_WAIT is no longer to be used in the mbuf code.) The successful return value is 0; I don't have a problem with this, specifically, but I would assume that this: if (!mb_init(mbp)) ... would be more "logical" (I use the term loosely) if it meant "if initialization fails" (now it means "if initialization is successful"). > significantly affected. The names of source and header files are > questionable too and I would appreciate good suggestions (currently they > are subr_mbuf.c and subr_mbuf.h). Hmmm. Maybe subr_mblib.c and libmb.h? I don't want to turn this into a bikeshed ( :-) ), so I suggest that you decide. Personally, I would prefer that it be something other than "subr_mbuf.c" simply because it may be a little misleading in some cases. > Well, and finally here you will find full source code of proposed > API: http://www.butya.kz/~bp/mbuf/ > > Any comments and suggestions are greatly appreciated. > > -- > Boris Popov > http://www.butya.kz/~bp/ Boris, this is really a great interface and nice looking, clean code. Thank you! Regards, Bosko.
Fw: if_ed.c && BRIDGE
Hi Luigi, -net: There seems to be a problem with the BRIDGE-specific "optimization" in if_ed.c. What it does is avoid getting a packet from the card if bridge_in() says that we're going to drop it. However, the code is broken and something eventually leads to a page fault. Deactivating this portion of if_ed.c rids us of the problem, but doesn't "fix it" properly. Rich has some debugging information including, I believe, a crash dump. Rich, if you can, please post that here (i.e. the backtraces and fault information) as I seem to have lost it (gr, sorry about that). I would rather not just disable the BRIDGE section of if_ed.c to "mask out" the problem. Regards, Bosko. Rich Wales wrote: > Bosko -- > > That "if_ed.c" patch you had me try (commenting out the packet-dropping > optimization) seems to be quite solid. If you wanted to go ahead and > commit it into -STABLE . . . . > > Rich Wales [EMAIL PROTECTED] http://www.webcom.com/richw/
Re: Sequential mbuf read/write extensions
Boris Popov wrote: [...] > Not exactly so. 'option LIBMBUF' will just connect the source file > to kernel makefile. There is no need for any #ifdef's in the code. Right. But I assume LIBMBUF will absolutely be needed if code that uses the routines is compiled. What I just meant to say was: "when the code using these routines grows to be significant enough, then we can just remove the option." > > #define M_TRYWAIT M_WAIT is not right. > > (M_WAIT is no longer to be used in the mbuf code.) > > You omitted the surrounding "#ifndef M_TRYWAIT" which makes this > code portable to RELENG_4 (mind you, this code taken from smbfs). Of > course, this should be stripped before import. I did, you're right. I guess I saw the "ifndef" wrong... I read this with only -CURRENT in mind and was afraid that the mbuf code flags would start mixing in with the malloc code flags -- something I tried to fight off in the past while. > > The succesfull return values are 0, I don't have a problem with this, > > specifically, but I would assume that this: > > if (!mb_init(mbp)) ... would be more "logical" (I use the term > > loosely) if it meant: "if initialization fails" (now it means "if > > initialization is succesful"). > > I'm generally don't like such syntax if function or variable name > do not clearly specify which value it should have/return on success. > Nearly all functions in this file return zero or error code, so the > correct syntax of the above will be: > > error = mb_init(mbp); > if (!error) > > or > > if (error) > return error; > > or > > if (mb_init(mbp) != 0) > return ESOMETHINGEVIL; OK. > > > significantly affected. The names of source and header files are > > > questionable too and I would appreciate good suggestions (currently > > they > > > are subr_mbuf.c and subr_mbuf.h). > > > > Hmmm. Maybe subr_mblib.c and libmb.h ? I don't want to turn this > > into a bikeshed ( :-) ), so I suggest that you decide. 
Personally, I > > would prefer that it be something other than "subr_mbuf.c" simply > > because it may be a little misleading in some cases. > > Good point. > > > Boris, this is really a great interface and nice looking, clean code. > > I'm sure, this code can be significantly improved by mbuf gurus :) > > -- > Boris Popov > http://www.butya.kz/~bp/ Ok, I have a few things to add (although I'm sure you'll be more into reading Ian Dowse's comments) :-) In mb_append_record(), you walk all the "record" mbufs to get to the last "record." How good would the tradeoff be of keeping a pointer to the last pkt in the mbdata structure's mbuf chain? We would grow the structure by a pointer, and we may have to maintain the last record pointer; but isn't the only place where we would have to "maintain it" mb_append_record() anyway? In mb_init(), the m->m_pkthdr.rcvif = NULL; can be omitted, as MGETHDR() will do that. The m->m_len = 0 should stay for now. m_getm() looks like it should belong in uipc_mbuf.c -- it looks quite a bit like the "all or nothing" allocation routine I have sitting here. The difference is that mine doesn't take size as an argument, but rather the actual count of mbufs, and all it does is allocate `count' mbufs and attach a cluster to each one of them. If it can't allocate a cluster or an mbuf at any point, it frees everything and returns. Now that I think about it, I'd much rather have `size' passed in instead: even though some callers may not know the `size' (think of drivers that pre-allocate mbufs + clusters; they typically know the `count'), it turns out that it is cheaper to compute the count from the size than the optimal size from the count, in the mbuf case. If you don't mind, I would strongly recommend moving m_getm() to uipc_mbuf.c.
Code that doesn't know the `size' but knows the `count' (like some driver code) can do: m = m_get(M_TRYWAIT, MT_DATA); if (m == NULL) { /* we can't even allocate one mbuf, we're really low, so don't even bother calling m_getm(). The other option would be to have m_getm() not require us to pre-allocate an mbuf at all and do all the work, but then that may interfere with code like yours which needs to pass in an existing mbuf that has already been allocated. */ /* fail right here */ } else { size = count * (MLEN + MCLBYTES); if (m_getm(m, size) == NULL) { /* everything has been properly freed for us, we don't have to worry about leaking mbufs. */ /* fail right here. */ } } For this to work, though, m_getm() needs to be modified to free all of the `top' chain if it can't get either a cluster or an mbuf. I don't know if this was intentional, but it seems to me that there is a subtle problem in m_getm() as it is now: if (len > MINCLSIZE) { MCLGET(m, M_TRYWAIT); if ((m->m_flags & M_EXT) == 0) { m_freem(m); <-- frees only one mbuf return NULL; } } I think what may happen here is that you will leak your `top' chain if you fail to allocate a cluster.
Re: Sequential mbuf read/write extensions
Boris Popov wrote: [...] > > For this to work, though, m_getm() needs to be modified to free all of > > `top' chain if it can't get either a cluster or an mbuf. I don't know > > if this was intentional, but it seems to me that there is a subtle > > problem in m_getm() as it is now: > > > > if (len > MINCLSIZE) { > > MCLGET(m, M_TRYWAIT); > > if ((m->m_flags & M_EXT) == 0) { > >m_freem(m); <-- frees only one mbuf > ^^ cluster is not in the chain yet, so it have to be > freed. m_free() may be more appropriate than m_freem() then, but see below. > >return NULL; > > } > > } > > > > I think what may happen here is that you will leak your `top' chain if > > you fail to allocate a cluster. > > The original semantic was not to free an entire chain because > m_getm() do not reallocates original (top) mbuf(s) (which may contain > data) and only adds new mbufs/clusters if possible. So, the calls like > m_get(mb->mb_top) will not left the wild pointer. There is also simple way > to deal with such behavior: > > mtop = m_get(...); > if (mtop == NULL) > fail; > if (m_getm(mtop) == NULL) { > m_freem(mtop); > fail; > } > > Probably m_getm() should return error code rather than pointer to > mbuf to avoid confusion. I understand this part, but what I think you missed in my comment is that m_getm() should probably free what it already allocated before finally failing. It may not need to free `top' because of the wild pointer, as you say. But think of this: m_getm() is called with a larger `size' - it decides that, given the `size', it will need to allocate a total of exactly 6 mbufs and 6 clusters (one for each mbuf). It loops and allocates, successfully, 5 of those mbufs and 5 clusters. So the `top' chain has now grown and includes those mbufs.
Then what happens in the last iteration is that it allocates the 6th mbuf OK (it has not yet placed it on the chain) and fails to allocate a cluster, so it frees just that one mbuf (and not the mbufs it allocated in prior iterations and attached to `top' chain) and returns NULL. Your code that calls m_getm() then just fails, leaving `top' with what it could allocate. Note that in my mail I said "assuming this is a leak," thus recognizing the possibility that you did this intentionally. :-) Right now, I'll assume that this _was_ intentional, as that is what I understand from the above. But in any case, if we do move this to uipc_mbuf.c, we need to do one of the following: (a) make m_getm() free what it allocated in previous loop iterations before it failed (as described above) or (b) leave m_getm() the way it is BUT write an additional function that will simply wrap the call to m_getm() and flush properly for it if it fails (EXACTLY like your code snippet above). I'll gladly settle for either, but if we do go with (b), then the m_freem() should be changed to an m_free(), as it reflects the fact that we are only freeing the one mbuf and we should document this behavior, certainly. If you want, I'll roll up a diff in a few days (once I get what is presently dragging in my "commit this" queue out) and commit it. If you prefer to do this yourself, then feel free. :-) > -- > Boris Popov > http://www.butya.kz/~bp/ Regards, Bosko. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: buffer problems with ep
Dan Debertin wrote: > I have a busy 4.1.1-RELEASE nfs server that stops responding at its ep > interface every hour or so. Pings to hosts on that network respond with > "sendto: no buffer space available". I've recompiled with MAXUSERS up at > 128 and NMBCLUSTERS at 32768, both of which seem plenty high. The problem > goes away, to return an hour later, after a quick 'ifconfig ep0 down; > ifconfig ep0 up'. > > Other people seem to have run into this problem as well, but > there is no sign of a fix in the archives anywhere. Can you try placing a "#define EP_LOCAL_STATS" at the _top_ of if_epvar.h and rebuilding+reinstalling the ep driver? See if you get anything printed to the console from the ep driver. If you get something that begins with "Status:" then look under it for "NOMB=" and take note of it. > Here's netstat -m when it's working: > > 194/320/131072 mbufs in use (current/peak/max): > 179 mbufs allocated to data > 1 mbufs allocated to packet headers > 14 mbufs allocated to fragment reassembly queue headers > 176/254/32768 mbuf clusters in use (current/peak/max) > 588 Kbytes allocated to network (68% in use) > 0 requests for memory denied > 0 requests for memory delayed > 0 calls to protocol drain routines > > and when it isn't: > 172/320/131072 mbufs in use (current/peak/max): > 134 mbufs allocated to data > 38 mbufs allocated to packet headers > 52/254/32768 mbuf clusters in use (current/peak/max) > 588 Kbytes allocated to network (25% in use) > 0 requests for memory denied > 0 requests for memory delayed > 0 calls to protocol drain routines > > Thanks for any help you an render, > > Dan Debertin > -- > ++ Unix is the worst operating system, except for all others. > > ++ Dan Debertin > ++ Senior Systems Administrator > ++ Bitstream Underground, LLC > ++ [EMAIL PROTECTED] > ++ (612)321-9290 x108 > ++ GPG Fingerprint: 0BC5 F4D6 649F D0C8 D1A7 CAE4 BEF4 0A5C 300D 2387 Regards, Bosko. 
To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: buffer problems with ep
What type of card is this, exactly? (i.e. name, model number, etc.) There is some trickery in the NetBSD driver that involves "Vortex-based (3c59x pci, eisa) and Boomerang (3c900)" cards, which allow FDDI-sized 4500 byte packets. According to the NetBSD driver's comments, these cards' commands dealing with sizes (such as setting a threshold, for example) upshift the size value by 2 bits in order to accommodate the full-size packet length (because the 11 bits usually provided to store the command argument aren't enough to span the full length). If this is indeed the problem, we can try right (down) shifting all the ORed-in length values by 2 bits, thus expecting the card to interpret the lengths as whatever the value we or'ed in is << 2. If you have one of these cards, let me know and I'll send you a diff off-list. Regards, Bosko. Dan Debertin wrote: > Okay, one more clue. When it's working, the interface flags are as > follows: > > flags=8843 mtu 1500 > > And when it stops, the OACTIVE flag is added: > > flags=8c43 mtu 1500 > > Dan Debertin > -- > ++ Unix is the worst operating system, except for all others. > > ++ Dan Debertin > ++ Senior Systems Administrator > ++ Bitstream Underground, LLC > ++ [EMAIL PROTECTED] > ++ (612)321-9290 x108 > ++ GPG Fingerprint: 0BC5 F4D6 649F D0C8 D1A7 CAE4 BEF4 0A5C 300D 2387
Re: Kernel crush due to frag attack
Adrian Penisoara wrote: > Hi, > > As we are facing a heavy fragments attack (40-60byte packets in a > ~ 1000 pkts/sec flow) I see some sporadic panics. Kernel/world is > 4.2-STABLE as of 18 Jan 2001 -- it's a production machine and I hadn't yet > the chance for another update; if it's been fixed in the mean time I would > be glad to hear it... > > I have attached a gdb trace and a snip of a tcpdump log. When I rebuilt > the kernel with debug options it seemed to crush less often. I remember > that at the time of this panic I had an ipfw rule to deny IP fragments. This is one of those "odd" faults I've seen in -STABLE sometimes. Thanks to the good debugging information you've provided, note the following: #16 0xc014de98 in m_copym (m=0xc07e7c00, off0=0, len=40, wait=1) at ../../kern/uipc_mbuf.c:621 621 n->m_pkthdr.len -= off0; (kgdb) list 616 if (n == 0) 617 goto nospace; 618 if (copyhdr) { 619 M_COPY_PKTHDR(n, m); 620 if (len == M_COPYALL) 621 n->m_pkthdr.len -= off0; <-- fault happens here (XXX) 622 else 623 n->m_pkthdr.len = len; 624 copyhdr = 0; 625 } (kgdb) print n $1 = (struct mbuf *) 0x661c20 (kgdb) print *n cannot read proc at 0 (kgdb) print m $2 = (struct mbuf *) 0xc07e7c00 Where the fault happens (XXX), the possible problem is that the mbuf pointer n is bad, and as printed from the debugger, it does appear to be bad. However, there are two things to note: 1. the fault virtual address displayed in the trap message: Fatal trap 12: page fault while in kernel mode fault virtual address = 0x89c0c800 [...] is different from the one printed in your analysis (even though 0x89c0c800 seems bogus as well, although it is at a correct boundary). 2. Nothing bad happens in M_COPY_PKTHDR() which dereferences an equivalent pointer. Something seriously evil is happening here and, unfortunately, I have no idea what. Does this only happen on this one machine? Or is it reproducible on several different machines?
I used to stress test -STABLE for mbuf starvation and never stumbled upon one of these `spontaneous pointer deaths' myself. I have seen other weird problems reported by other people, but only in RELENG_3. If you cannot reproduce it on any other machines, I would start looking at possibly bad hardware... unless someone else sees something I'm missing. > If you need further data just ask, I'd be glad to help, > Ady (@warpnet.ro) Regards, Bosko. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: kernel: nd6_storelladdr failed, mbuf leak
John Hay wrote: > I have configured a 4-stable machine to be a router, routing ipv4, ipv6 > and ipx. To be able to do Ethernet_II framing on one interface and 802.3 > on another I have used if_ef.ko. > > I then noticed that "... kernel: nd6_storelladdr failed" gets logged > often and after a while all mbufs are used. It turned out that in > sys/net/if_ethersubr.c in ether_output() when nd6_storelladdr() fails, > it does a return(0) and does not free the mbuf. I checked -current > and it is still like that. It should not be freeing the mbuf, because that mbuf is being passed as an argument to ether_output(). It is typically the caller that ought to be responsible for freeing the mbuf in this case. > Now the reason it fails is that the ef(4) device use an ifp->if_type > (IFT_XETHER) that nd6_storelladdr() does not expect. > > Oh as a workaround I have configured route6d to ignore fxp1f0. > > John > -- > John Hay -- [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: kernel: nd6_storelladdr failed, mbuf leak
itojun wrote: > > >> > > I then noticed that "... kernel: nd6_storelladdr failed" gets logged > >> > > often and after a while all mbufs are used. It turned out that in > >> > > sys/net/if_ethersubr.c in ether_output() when nd6_storelladdr() > >> > fails, > >> > > it does a return(0) and does not free the mbuf. I checked -current > >> > > and it is still like that. > > will correct it. thanks for reporting. > > itojun This behavior is absolutely horrible. What ought to be fixed, if ether_output() is supposed to be freeing the passed-in mbuf, is that ether_output() should instead accept a pointer to a pointer, so that after it frees the mbuf it can NULL out the initial pointer. Because similar coding practices have been promoted in this area of the code, what we're seeing is ether_output() effectively returning an mbuf to the free list while the caller still holds a live pointer to that mbuf. Regards, Bosko. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: kernel: nd6_storelladdr failed, mbuf leak
I'll do it right now if itojun doesn't mind (to save him a task :-) ) and get authorization from Jordan to commit. Thanks, Bosko. John Hay wrote: > > > > >> > will correct it. thanks for reporting. > > > > http://www.kame.net/dev/cvsweb.cgi/kame/kame/sys/netinet6/nd6.c.diff?r1=1.135&r2=1.136 > > > > itojun > > > > Any chance of this finding its way into -current and -stable (preferably > before the release)? > > John > -- > John Hay -- [EMAIL PROTECTED] > > To Unsubscribe: send mail to [EMAIL PROTECTED] > with "unsubscribe freebsd-net" in the body of the message > To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: kernel: nd6_storelladdr failed, mbuf leak
Euuh, scratch that. I don't see anything committed to HEAD yet, at all. No use in MFCing something that hasn't yet been committed. :-) Bosko I wrote: > I'll do it right now if itojun doesn't mind (to save him a task :-) ) > and get authorization from Jordan to commit. > > Thanks, > Bosko. > > John Hay wrote: > > > > > > > >> > will correct it. thanks for reporting. > > > > > > > http://www.kame.net/dev/cvsweb.cgi/kame/kame/sys/netinet6/nd6.c.diff?r > 1=1.135&r2=1.136 > > > > > > itojun > > > > > > > Any chance of this finding its way into -current and -stable > (preferably > > before the release)? > > > > John > > -- > > John Hay -- [EMAIL PROTECTED] > > > > To Unsubscribe: send mail to [EMAIL PROTECTED] > > with "unsubscribe freebsd-net" in the body of the message > > > > > To Unsubscribe: send mail to [EMAIL PROTECTED] > with "unsubscribe freebsd-net" in the body of the message > To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: kernel: nd6_storelladdr failed, mbuf leak
John Hay wrote: > > > > >I'll do it right now if itojun doesn't mind (to save him a task :-) ) > > >and get authorization from Jordan to commit. > > > > please go ahead, i can review the diff if you want me to. > > > > Ok, here is a patch for -current. It was taken from the kame code > with minor adjustments to fit into our tree. I have tested it on > -current, but not on -stable yet. > > Itojun will you look it over please? > > I'm gone for the weekend now, so I'll only be able to commit it by > Sunday night. If someone wants to do it before then, you're welcome. I have this one sitting around which is a direct "port" of the KAME patch: http://people.freebsd.org/~bmilekic/code/kame_nd6.diff The differences between the two in terms of functionality are very small, but I prefer your patch because it removes the printf()s from if_ethersubr.c. Since itojun OK'd it, I'll commit this today. I'll look into an immediate MFC. > John > -- > John Hay -- [EMAIL PROTECTED] [...] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: mbuf leak? fxp?
On Thu, Apr 05, 2001 at 10:18:38AM -0700, Archie Cobbs wrote: > Archie Cobbs writes: > > I have this machine that starts running out of mbufs every few days > > ("looutput: mbuf allocation failed") and then crashes, and was wondering > > if anyone else has seen similar behavior... > > > > For example... > > > > Yesterday... > > $ netstat -m > > 461/624/4096 mbufs in use (current/peak/max): > > 459 mbufs allocated to data > > 2 mbufs allocated to packet headers > > 434/490/1024 mbuf clusters in use (current/peak/max) > > 1136 Kbytes allocated to network (36% of mb_map in use) > > 0 requests for memory denied > > 0 requests for memory delayed > > 0 calls to protocol drain routines > > > > Today... > > $ netstat -m > > 947/1072/4096 mbufs in use (current/peak/max): > > 945 mbufs allocated to data > > 2 mbufs allocated to packet headers > > 920/946/1024 mbuf clusters in use (current/peak/max) > > 2160 Kbytes allocated to network (70% of mb_map in use) > > 0 requests for memory denied > > 0 requests for memory delayed > > 0 calls to protocol drain routines > > > > It appears that something is slowly eating up mbuf clusters. > > The machine is on a network with continuous but very low volume > > traffic, including some random multicast, NTP, etc. The machine > > itself is doing hardly anything at all. > > Well, my current guess is that this is simply an NMBCLUSTERS problem. > I increased NMBCLUSTERS to 8192 and it hasn't happened again yet. I kind of doubt that, judging simply from the netstat -m outputs you have posted above. In neither one does the number of clusters allocated meet the maximum number of allocatable clusters. If that were the case, you would likely see some numbers for "requests for memory denied" and/or "requests for memory delayed." In any case, increasing NMBCLUSTERS to the number you mention is not a bad idea. > This machine has 5 ethernet interfaces, which must be probably more > than the default NMBCLUSTERS can handle. 
> > I wonder if we should increase the default NMBCLUSTERS, or document > somewhere that > 4 interfaces requires doing so? Well, the way it should be done is that `maxusers' should be increased, if anything. `maxusers' automatically tunes NMBCLUSTERS and NMBUFS accordingly. Chances are, if you are explicitly declaring `NMBCLUSTERS ' in your kernel configuration file, you are actually lowering the number of clusters/mbufs that would otherwise be allowed with your given `maxusers' value (unless you have an unreasonably low maxusers). > Thanks for all the suggestions... > > -Archie > > __ > Archie Cobbs * Packet Design * http://www.packetdesign.com Regards, -- Bosko Milekic [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: mbuf leak? fxp?
On Thu, Apr 05, 2001 at 09:04:15PM -0700, Archie Cobbs wrote: > Bosko Milekic writes: > > NMBUFS accordingly. Chances are, if you are explicitly declaring > > `NMBCLUSTERS ' in your kernel configuration file, that you are > > actually lowering the number of clusters/mbufs that would otherwise be > > allowed with your given `maxusers' value (unless you have an unreasonably > > low maxusers). > > Mmm.. I don't understand that.. can you explain? Heh. I'm sorry for being so "obscure" about this. After re-reading it, I realize I should have probably just quoted the following: #ifndef NMBCLUSTERS #define NMBCLUSTERS (512 + MAXUSERS * 16) #endif TUNABLE_INT_DECL("kern.ipc.nmbclusters", NMBCLUSTERS, nmbclusters); TUNABLE_INT_DECL("kern.ipc.nmbufs", NMBCLUSTERS * 4, nmbufs); (from src/sys/kern/uipc_mbuf.c) So, for example, for MAXUSERS 256, NMBCLUSTERS is 4608, whereas I have seen people do things like this before: maxusers 256 options NMBCLUSTERS 4096 Thus actually reducing the address space allotted to clusters. > -Archie > > __ > Archie Cobbs * Packet Design * http://www.packetdesign.com Regards, -- Bosko Milekic [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: I call sosend occur error . The error code=22.
On Thu, Apr 19, 2001 at 12:36:11AM +0800, Jessey wrote: > Dear Sir: > I call sosend occur error. The error code=22. > What wrong in my code? > > The code follow: > > socreate(); file://is ok; > > | > } > > top=m_get(MT_DATA,M_DONTWAIT); > > if(top==NULL) > { > log(LOG_INFO, "top=NULL occur error\n"); > goto out; > } > > top->m_next=NULL; > top->m_nextpkt=NULL; > top->m_len=50; > adr=sizeof(struct m_hdr) +sizeof(int)+sizeof(struct ifnet *); > log(LOG_INFO, "adr= %d \n",adr); > > top->m_data=top+adr; ^^ First of all, that looks totally bogus! > log(LOG_INFO, "top= %x, top->m_data= %x,top+adr=%x \n",top > ,top->m_data,top+adr); > > top->m_type=MT_DATA; > top->m_flags=M_PKTHDR; > top->m_pkthdr.len=50; > top->m_pkthdr.rcvif=NULL; You should probably just be using m_gethdr. Is this kernel code? > log(LOG_INFO, "top= %x, Data= %x \n",top ,top->m_data); > > error = sosend(so,(struct sockaddr *) &sin,NULL,top, > 0,MSG_DONTROUTE); Error code 22 is EINVAL. Something sosend() calls is telling you that whatever you're passing to sosend() is probably invalid. Are you sure that what you're trying to do is OK? > Thanks/regards -- Bosko Milekic [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: Free BSD network buffers
On Wed, May 16, 2001 at 01:17:44AM -0400, [EMAIL PROTECTED] wrote: > Hi, > > My doubt is whether mbuf's along with clusters will be used for large data sizes > like 65536 bytes? If what you're talking about has to do with sending data over the network, we attach clusters (or other larger buffers) to mbufs. We won't fragment a huge load of data across 100 chained mbufs, for example (as that would be quite ridiculous). :-) > Kindly reply. > regards > ravi prasad > __ > Get your own FREE, personal Netscape Webmail account today at > http://webmail.netscape.com/ -- Bosko Milekic [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: tcp template removal / scalability patch
> - bzero(ip, sizeof(struct ip)); /* XXX overkill? */ > + bzero(ip, sizeof(struct ip)); /* XXX overkill? */ I pointed this out the first time: you changed what were tabs to spaces. Unless there's a good reason, the change looks unnecessary. Other than that, it looks fine at this end! Cheers, -- Bosko Milekic [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: TCP problems with large window sizes on FreeBSD (GigaTCP)
Hi Stanislav, On Wed, Aug 01, 2001 at 11:32:29PM -0400, stanislav shalunov wrote: > We want to build two or more machines that would be capable of > achieving TCP throughputs of 700-800Mb/s over WAN (with a single TCP > connection). The motivations of this exercise are spelled out on the > referred web page. Additionally, I believe that getting through with > this exercise with FreeBSD as the OS would advance FreeBSD's cause > with network researchers and advanced users of high-performance > networks. > > In order to run at such high throughput over links with RTT of roughly > 70ms we'd need window sizes in the vicinity of 8-16MB. (And, > naturally, the unidirectional loss event probability has to be less > than (.7*MSS/(RTT*bandwidth))^2 = 1e-7. We believe that we have > networks that lose less than one packet in ten million.) > > We have built the boxes now. I have started with back-to-back testing > with large window sizes. Back-to-back testing is believed to be valid > because it's hard to expect that inserting 70ms delay between the host > will make the situation any better. > > I cannot get it to run with window sizes greater than half a megabyte. > > The story, with some very preliminary analysis, is at > http://www.internet2.edu/~shalunov/gigatcp/ > > I'm not reposting it here; there are 29KB of text and 3MB of data > there. I'm adding and updating stuff as I progress. > > The questions that I have for you guys are, in decreasing order of > importance: > > 1. How do I fix the ti driver problem that apparently is holding me >back? What number of jumbo slots would be "good"? I would recommend, seeing as how you're targeting specifically TCP throughput and are not too concerned with lack of physical memory, to increase TI_JSLOTS to at least 500-600. This would have the effect of reserving roughly 5M of physically contiguous memory during driver attachment, which I think you can safely spare. 
If you discover that you need even more than 500-600 jumbo buffers, feel free to experiment with the TI_JSLOTS constant. As far as memory buffer tuning is concerned, I would also recommend, to be on the safe side, the following changes: - in uipc_mbuf.c, increase NCL_INIT to roughly 20. Seeing as how you're using if_ti, which allocates its own buffer space, I don't suspect that you'll be needing many regular clusters. Check with `netstat -m' to see how many are typically in use during a test and increase NCL_INIT to that number. All this will do is pre-allocate the pool at boot time and avoid potentially expensive map allocations while your tests are running. - also in uipc_mbuf.c, increase NMB_INIT to 10240, or any sensible value you have detected is required (again, see with `netstat -m' during testing, try to determine the maximum number of mbufs you'll need). All this will do is, again, allocate them at boot time and speed up memory buffer allocations during performance testing. If you set the number to 10240, remember that each mbuf is merely 256 bytes, so you'll be giving up a mere ~2.5M for the cause, and speeding up allocations altogether. Finally, I noticed at one point in your analysis that you increased NMBCLUSTERS. You'll find that, unless you're actually running out of mbufs and/or clusters, increasing N{MB,CL}_INIT is probably what you want to do instead. > 2. Why doesn't Fast Retransmit kick in? (See the annotated sender's > view of the stalled connection.) > > 3. Is there an off-by-one error in RST handling? (See the end of the > annotated receiver's view of the stalled connection.) I believe jlemon covered these two issues in his post, which very much makes sense as he's the overall stack guru. :-) > -- > Stanislav Shalunov http://www.internet2.edu/~shalunov/ > > "Nuclear war would really set back cable [television]." 
-- Ted Turner -- Bosko Milekic [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
[PATCH] mbtypes stats re-enabled in -CURRENT
Hi, Several weeks ago, Jiangyi Liu <[EMAIL PROTECTED]> sent me an early patch that re-enabled mbtypes statistics under the mbuf allocator in -CURRENT (I had disabled them temporarily while I worked on the allocator semantics itself). Just a couple of days ago, I cleaned up the patch, fixed it up, and rolled a new diff applying to the most recent CURRENT. This patch re-enables mbtypes statistics under CURRENT: http://people.freebsd.org/~bmilekic/code/mb_alloc/mbtypes_stats.diff (17k) To CURRENT users with either SMP or UP: please test, you will need to: 1) cvsup fresh CURRENT sources 2) apply the patch found at the above URL 3) rebuild and reinstall the kernel 4) reboot 5) cp /usr/src/sys/sys/mbuf.h /usr/include/sys/ 6) rebuild and reinstall both /usr/src/usr.bin/netstat and /usr/src/usr.bin/systat 7) go about your business, all network services working as usual 8) Check `netstat -m' and `systat -mbufs' outputs; more specifically, look for the "Allocated mbuf types" output in the mbuf section of `netstat -m' output and look for allocated mbuf types to be displayed following a `systat -mbufs' 9) Email me ([EMAIL PROTECTED]) and let me know how it went. I'd like to commit this within the next week. To those reviewing the patch, here's how it works: Mbuf "types" (e.g. MT_DATA, etc.) are counted with the help of `long' types. Each per-CPU mbuf statistics data structure (mbpstat) contains an array of longs called mb_mbtypes[]. The size of this array is, of course, MT_NTYPES, defined in sys/mbuf.h specifically for this purpose. The general mbuf statistics structure (called mbstat) contains a `short' m_numtypes member which, during mbuf_init(), is set to MT_NTYPES. The general container's mbpstat struct, like the per-CPU containers' mbpstat structs, also contains this mb_mbtypes array. During mbuf allocation, (type), being an argument already passed to m_get(), gets passed down to the mb_alloc allocation routine (mb_alloc()). 
For mbuf clusters, MT_NOTMBUF gets passed down (MT_FREE - no longer used - has been renamed to MT_NOTMBUF). The mb_alloc() code will then increment mb_mbtypes[(type)] if (type) is not MT_NOTMBUF. This manipulation is performed consistently because it is in the per-CPU container and because it is done while the per-CPU container's (or the general container's) lock is already held. During mbuf freeing, mbuf_being_freed->m_type is passed to mb_free() by m_free(). Again, in the case of mbuf clusters, MT_NOTMBUF is passed down instead. So, mb_free() will decrement mb_mbtypes[(type)] if (type) is not MT_NOTMBUF. This manipulation is handled under the given container's lock and is therefore guaranteed to be consistent. The reason this works is that each container has its own mbpstat per-CPU structure and can therefore ensure its consistency via its own container lock. The other reason this works is that mb_mbtypes[] is, in each case, an array of _longs_ and not _unsigned longs_. This allows the counters to go below 0, as is expected in this situation. Consider, for example, that a thread A runs on CPU 0 and allocates an MT_DATA mbuf; then this happens: mbpstat[cpu_0]->mb_mbtypes[MT_DATA]++; If one considers that mbpstat[cpu_0]->mb_mbtypes[MT_DATA] was 0 prior to this manipulation, then its new value is 1. Now consider that another thread B on CPU 1 frees this same mbuf. Then this happens: mbpstat[cpu_1]->mb_mbtypes[MT_DATA]--; If one considers that mbpstat[cpu_1]->mb_mbtypes[MT_DATA] was 0 prior to this manipulation, then its new value is -1. So to get the "real" mbtypes[(type)] value, we do the following: mbtypes[(type)] = mbpstat[cpu_0]->mb_mbtypes[(type)] + ... + mbpstat[cpu_n]->mb_mbtypes[(type)]; We are guaranteed a number >= 0, because of the fact that all the mb_mbtypes[] are manipulated consistently, under some lock, and because we will never have a certain type _decremented_ before it's once _incremented_. 
In other words, we will never have something like "mb_mbtypes[(type)]--" occur unless there was already at least one "mb_mbtypes[(type)]++" that occurred on at least one of the other containers. MCHTYPE() is implemented to call a function m_chtype() which changes the type accordingly and manipulates the mb_mbtypes[] of the general container after acquiring the general container's lock. This function is rarely used, and it should hopefully stay that way. Regards, -- Bosko Milekic [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: performance issues with M_PREPEND on clusters
As I have stated to Luigi in private, and as I stated more than half a year ago when this issue was first brought up by dwmalone, I believe that the correct way to deal with this issue is to have M_LEADINGSPACE and M_TRAILINGSPACE simply return the number of bytes that make up the leading space or trailing space, respectively, regardless of whether the underlying cluster/mbuf/ext_buf is marked writable or not. There is nothing in the names "M_LEADINGSPACE" and "M_TRAILINGSPACE" that implies anything having to do with the writability of the underlying buffer, and therefore these two should not be concerned with writability. The reason the relevant code was commented out when it was imported is a crude hack, probably introduced when support for external buffers other than clusters was implemented. At the time, support for marking said buffers "writable" did not exist, so to minimize checking in code using the M_LEADINGSPACE and M_TRAILINGSPACE macros, they were merely made to return 0 in the event of an M_EXT-marked mbuf. Instead, the correct approach would be, as stated above, to have M_LEADINGSPACE and M_TRAILINGSPACE return what their names imply, and to have the code that uses them in the kernel appropriately check for writability of the ext_buf before it's about to write to it, using the _already existing_ interface, i.e., M_WRITABLE(). Since this interface _exists_ and we have it implemented, there is no reason not to use it properly anymore. Yes, I'm willing to go around and do the work in -CURRENT, and I will do this, hopefully, by this weekend, as the [massive amount of] school work "stabilizes." I think the same approach should be taken in -STABLE. That said, I am not 100% opposed to Luigi's patch, and I think that had he not brought this up, it would have been left long forgotten. 
On Tue, Oct 23, 2001 at 02:06:28PM -0700, Luigi Rizzo wrote: > On Tue, Oct 23, 2001 at 02:00:34PM -0500, Alfred Perlstein wrote: > > > > Yes, you're right, I was mistaken in my paranioa, however you're > > missing the fact that one may want to allocate an EXT buf and still > > have it writeable. > > Yes, but this is the sort of things that are better sorted out > in -current... > > > I like your patch, it's a good idea, go for it. > > good to hear that :) > > cheers > luigi Cheers, -- Bosko Milekic [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: performance issues with M_PREPEND on clusters
On Fri, Oct 26, 2001 at 08:47:27PM +0900, Keiichi SHIMA wrote: > Right. And we will have a problem if someone changes the semantics of > M_LEADINGSPACE. The M_LEADINGSPACE macro of Net/OpenBSD does the same > as the current FreeBSD does, that is, returns 0 if M_EXT is set. > > > When I looked at this question last year I think I found that there > > were few enough places in the kernel which used M_LEADINGSPACE, so > > it should be fairly easy to check them. However, I think several > > of the uses were in KAME code. > > The patch posted first by Luigi is safe because the patch doesn't > change any semantics of M_LEADINGSPACE. I think it (and the patch) > reasonable that M_LEADINGSPACE returns 0 when the mbuf (cluster) is > not writable and the real space when the mbuf (cluster) is writable. > > But, I disagree with the later proposal from Bosko that changes the meaning > of M_LEADINGSPACE. This leads to more ifdefs in the shared code > (ex. KAME, maybe other cross-platform projects) and more code > complexity. Just because M_LEADINGSPACE may be broken in the other systems does not mean that it should be broken in our system as well. I am against sacrificing good code in order to deal with the left-over stupidities (pardon the seemingly harsh vocabulary) in the other systems, when it comes to porting. You should understand that M_LEADINGSPACE was meant to return the number of leading bytes in the underlying data buffer and that it was only broken when someone decided to half-implement ext bufs. This brokenness happened to live on in present-day systems, and unbreaking it should not be put on hold for the sake of pseudo-compatibility, IMHO. > --- > Keiichi SHIMA > IIJ Research Laboratory <[EMAIL PROTECTED]> > KAME Project <[EMAIL PROTECTED]> > -- Bosko Milekic [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: performance issues with M_PREPEND on clusters
On Fri, Oct 26, 2001 at 01:10:34PM +0100, David Malone wrote: > So the M_LEADINGSPACE macro should probably read: > > #define M_LEADINGSPACE(m) \ > (!M_WRITABLE(m) ? 0 : \ > (m)->m_flags & M_EXT ? (m)->m_data - (m)->m_ext.ext_buf : \ > (m)->m_flags & M_PKTHDR ? (m)->m_data - (m)->m_pktdat : \ > (m)->m_data - (m)->m_dat) > > and the comments for M_LEADINGSPACE and M_TRAILINGSPACE should note > the inconsistency where M_LEADINGSPACE returns the number of writable > bytes and M_TRAILINGSPACE returns the number of bytes, but doesn't indicate > if they are writable or not. > > Maybe we should also provide M_{LEAD,TRAIL}INGSPACE_{R,W} macros > with consistent behaviour. The _R macros would be cheaper 'cos > they wouldn't have to check writability. These could be used in > the case where you already know the storage is writable. > > (Here _R would mean "regardless" 'cos it's hard to imagine a > situation where you want to know how many readable bytes are outside > the valid data region in an mbuf ;-) > > Bosko - what do you think of modifying M_LEADINGSPACE as shown > above and then providing these new macros? It would mean that we > can leave shared code alone, but where we want to optimise > M_LEADINGSPACE and M_TRAILINGSPACE we have the macros to do so. I already stated what I thought about it: it's wrong for reasons mentioned in previous emails. It means that we will be adding support for a broken macro, not to mention enlarging the macro, which is bad especially for code that is actually using it properly. > David. -- Bosko Milekic [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: performance issues with M_PREPEND on clusters
On Fri, Oct 26, 2001 at 09:28:40AM -0700, Luigi Rizzo wrote: > Bosko, Keiichi has a valid point. The semantics of an interface > are described by the documentation associated with it, not by its > name or the Oxford Dictionary. And what documentation would that be? Are you referring to the [incomplete] mbuf(9) man page which does not cover M_LEADINGSPACE, or are you referring to "The Implementation of 4.4BSD" (McKusick et al.) which, to my knowledge, doesn't specify anything regarding the writability of the underlying data (I don't know if it specifies anything regarding M_LEADINGSPACE to begin with)? The M_LEADINGSPACE macro always had the line of code that would return the leading space even for ext_bufs commented out, suggesting that the code did exist in the original implementation or was intended to. You yourself noticed that the behavior of M_LEADINGSPACE was incorrect in that it returned 0 for all ext_bufs. Therefore, the issue is not whether or not M_LEADINGSPACE should be changed because, clearly, it should and you yourself are lobbying to change it. The issue is how to change it properly, and I have already presented my opinion on how it should be done. Arguing that OpenBSD and/or NetBSD don't have this behavior is totally ridiculous, seeing as how they don't have the same "writability" macros (the last time I checked) that we do anyway. > As much as it might be misleading that M_LEADINGSPACE checks for > writability, it has probably been like this for a long time, we > have plenty of code that depends on this, and people looking for > documentation will likely see this behaviour documented. Changing No. It has never been like this. > it might introduce insidious bugs in the code -- because in reality, > clusters are almost never shared except for those on the TCP send > queue and for multicast packets (but who uses multicast, anyways), > so there is a chance that we will have buggy code that goes unnoticed > for a long time. 
> > cheers > luigi > > On Fri, Oct 26, 2001 at 12:17:24PM -0400, Bosko Milekic wrote: > > On Fri, Oct 26, 2001 at 08:47:27PM +0900, Keiichi SHIMA / ?$BEg7D0l?(B wrote: > > > Right. And we will have a problem if someone changes the semantics of > > > M_LEADINGSPACE. The M_LEADINGSPACE macro of Net/OpenBSD does the same > ... > > > But, I disagree the later proposal from Bosko that changes the meaning > > > of M_LEADINGSPACE. This leads more ifdefs in the shared code > > > (ex. KAME, maybe other cross-platform projects) and the code > > > complexity. > > > > Just because M_LEADINGSPACE may be broken in the other systems does > > not mean that it should be broken in our system as well. I am against > > sacrificying good code in order to deal with the left-over stupidities > > (pardon the seemingly harsh vocabulary) in the other systems, when it > > comes to porting. > > You should understand that M_LEADINGSPACE was meant to return the > > number of leading bytes in the underlying data buffer and that it was > > only broken when someone decided to half-implement ext bufs. This > > brokeness happened to live on in present-day systems and unbreaking it > > should not be put on hold for the sake of pseudo-compatibility, IMHO. > > > > > --- > > > Keiichi SHIMA > > > IIJ Research Laboratory <[EMAIL PROTECTED]> > > > KAME Project <[EMAIL PROTECTED]> > > > > > > > -- > > Bosko Milekic > > [EMAIL PROTECTED] > > > > > > To Unsubscribe: send mail to [EMAIL PROTECTED] > > with "unsubscribe freebsd-net" in the body of the message > > To Unsubscribe: send mail to [EMAIL PROTECTED] > with "unsubscribe freebsd-net" in the body of the message > -- Bosko Milekic [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: performance issues with M_PREPEND on clusters
On Sat, Oct 27, 2001 at 03:31:48AM +0900, Keiichi SHIMA wrote: > M_LEADINGSPACE might be meant to return as you were saying. But > because of some reasons (maybe a lack of time to implement the > complete code), it behaves as we see now. For a long time, in many > systems, M_LEADINGSPACE had been returning writable space only. As a > result, some code (including KAME) relies on the behavior. This is not true. M_LEADINGSPACE has not been returning writable space only for a long time; it has been returning space in non-M_EXT mbufs for a long time. So, doing what you want to do *is* changing its behavior, whether you like it or not. I don't know how many times I have to repeat this until someone starts to get it. The issue is not whether you're changing the behavior of M_LEADINGSPACE, because Luigi's suggestion obviously does. The issue is how to do it properly, and the way it's being proposed right now is not the proper way. As it turns out, I'm really not up for extending this debate any longer. I'm not going to kill myself because something like this is being done wrongly, seeing as how I'm the only one arguing this side (according to the response hits, it appears I'm alone in this opinion). I will maintain that I believe it is wrong, and for good reason, but I may end up beating myself up if I have to repeat the same thing over and over again in the same thread. > So, I would suggest that we should not change M_LEADINGSPACE behavior, > and write some documentation (or a comment in the source may be enough) > about that. Because one of the reasons for the confusion about > M_LEADINGSPACE is that there was no documentation about it; we had only > the code, and the code said that it returns a writable length. > > (if necessary) we should create a new API to get the real(?) space, as > David has proposed. 
> > Best regards, > > --- > Keiichi SHIMA > IIJ Research Laboratory <[EMAIL PROTECTED]> > KAME Project <[EMAIL PROTECTED]> -- Bosko Milekic [EMAIL PROTECTED]
Re: mbuf / maxfiles / maxsockets / etc autoscaling patch
/uipc_socket2.c > --- sys.old/kern/uipc_socket2.c Sat Dec 8 16:04:50 2001 > +++ sys/kern/uipc_socket2.c Sat Dec 8 16:08:43 2001 > @@ -1026,7 +1026,12 @@ > */ > static void init_maxsockets(void *ignored) > { > + int autosockets, maxuserssockets; > + > + autosockets = physmemMB * SOCKETSPERMB; > + autosockets = min(MAXAUTOSOCKETS, max(MINAUTOSOCKETS, autosockets)); > + maxuserssockets = 2 * (20 + (16 * maxusers)); > + maxsockets = max(maxuserssockets, max(autosockets, nmbclusters)); > TUNABLE_INT_FETCH("kern.ipc.maxsockets", &maxsockets); > - maxsockets = imax(maxsockets, imax(maxfiles, nmbclusters)); > } > SYSINIT(param, SI_SUB_TUNABLES, SI_ORDER_ANY, init_maxsockets, NULL); > diff -u -r sys.old/netinet/tcp_subr.c sys/netinet/tcp_subr.c > --- sys.old/netinet/tcp_subr.cSat Dec 8 16:04:42 2001 > +++ sys/netinet/tcp_subr.cSat Dec 8 16:10:31 2001 > @@ -190,6 +190,7 @@ > tcp_init() > { > int hashsize = TCBHASHSIZE; > + int autohashsize; > > tcp_ccgen = 1; > tcp_cleartaocache(); > @@ -203,6 +204,13 @@ > > LIST_INIT(&tcb); > tcbinfo.listhead = &tcb; > + > + /* Calculate autoscaled hash size, use if > default hash size. 
*/ > + autohashsize = physmemMB * TCBHASHPERMB; > + autohashsize = min(MAXAUTOTCBHASH, max(MINAUTOTCBHASH, autohashsize)); > + while (!powerof2(autohashsize)) > + autohashsize++; > + hashsize = max(hashsize, autohashsize); > TUNABLE_INT_FETCH("net.inet.tcp.tcbhashsize", &hashsize); > if (!powerof2(hashsize)) { > printf("WARNING: TCB hash size not a power of 2\n"); > diff -u -r sys.old/powerpc/powerpc/machdep.c sys/powerpc/powerpc/machdep.c > --- sys.old/powerpc/powerpc/machdep.c Sat Dec 8 16:04:39 2001 > +++ sys/powerpc/powerpc/machdep.c Sat Dec 8 16:48:30 2001 > @@ -436,7 +436,8 @@ > __asm ("mtsprg 0, %0" :: "r"(globalp)); > > /* Init basic tunables, hz etc */ > - init_param(); > + init_hz(); > + init_param(0); /* XXX - needs to be fed physmem for proper autoscaling */ > > /* setup curproc so the mutexes work */ > > diff -u -r sys.old/sparc64/sparc64/machdep.c sys/sparc64/sparc64/machdep.c > --- sys.old/sparc64/sparc64/machdep.c Sat Dec 8 16:04:38 2001 > +++ sys/sparc64/sparc64/machdep.c Sat Dec 8 16:47:29 2001 > @@ -249,10 +249,10 @@ > end = (vm_offset_t)_end; > } > > - /* > - * Initialize tunables. > - */ > - init_param(); > + /* Init hz */ > + init_hz(); > + /* Init basic tuneables - XXX - this needs to be moved once maxmem exists >here. */ > + init_param(0); > > #ifdef DDB > kdb_init(); > diff -u -r sys.old/sys/param.h sys/sys/param.h > --- sys.old/sys/param.h Sat Dec 8 16:04:37 2001 > +++ sys/sys/param.h Sat Dec 8 16:05:28 2001 > @@ -230,6 +230,44 @@ > #define ctodb(db)/* calculates pages to devblks */ \ > ((db) << (PAGE_SHIFT - DEV_BSHIFT)) > > +/* > + * Values used in autoscaling system structures based on RAM size. > + * > + * Although settings are scattered across various subsystems, a > + * common formula is followed. Generally, there are three > + * possible values to choose from: The value suggested by maxusers, > + * the value suggested by the autoscaling formula, and a manually > + * tuned value from loader.conf. 
If a manually tuned value is specified, > + * this value will be used. Otherwise, the maximum of the maxusers > + * and autoscaled setting will be used. > + * > + */ > + > +/* Max processes, files. These are set in subr_param.c */ > +#define PROCPERMB 16 > +#define MINAUTOPROC 256 > +#define MAXAUTOPROC 32000 > +#define FILESPERMB 128 > +#define MINAUTOFILES 1024 > +#define MAXAUTOFILES 65536 > + > +/* Max sockets. These are set in uipc_socket2.c */ > +#define SOCKETSPERMB 64 > +#define MINAUTOSOCKETS 512 > +#define MAXAUTOSOCKETS 32000 > + > +/* Max mbuf clusters, sendfile buffers. These are set in subr_mbuf.c */ > +#define MCLSPERMB 32 > +#define MINAUTOMCLS 512 > +#define MAXAUTOMCLS 32000 > +#define SFBUFPERMB 16 > +#define MINAUTOSFBUF 1024 > +#define MAXAUTOSFBUF 32000 > + > +/* Number of TCP hash buckets. These are set in tcp_subr.c */ > +#define TCBHASHPERMB 8 > +#define MINAUTOTCBHASH 512 > +#define MAXAUTOTCBHASH 8192 > > /* > * Make this available for most of the kernel. There were too many > diff -u -r sys.old/sys/systm.h sys/sys/systm.h > --- sys.old/sys/systm.h Sat Dec 8 16:04:37 2001 > +++ sys/sys/systm.h Sat Dec 8 16:07:45 2001 > @@ -60,6 +60,7 @@ > extern struct cv selwait;/* select conditional variable */ > > extern int physmem; /* physical memory */ > +extern int physmemMB;/* physical memory size in megabytes */ > > extern dev_t dumpdev;/* dump device */ > extern long dumplo; /* offset into dumpdev */ > @@ -121,7 +122,8 @@ > > void cpu_boot __P((int)); > void cpu_rootconf __P((void)); > -void init_param __P((void)); > +void init_hz __P((void)); > +void init_param __P((u_int64_t)); > void tablefull __P((const char *)); > int kvprintf __P((char const *, void (*)(int, void*), void *, int, > _BSD_VA_LIST_)) __printflike(1, 0); -- Bosko Milekic [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: m_reclaim and a protocol drain
On Fri, Dec 21, 2001 at 09:42:30AM -0600, Randall Stewart wrote: > Hi all: > > I have a question. I have been working to test the new > sctp_drain function I am adding and have had a very difficult > time getting the drain function to be called by the mbuf system... > > Now here is what I most observe from some of the test cases > I am building: > > A) All inbound packets get a cluster down in the driver routine. > B) There is a much smaller limit to clusters > C) The cluster allocation routine will NOT call reclaim() et.al. This has changed in -CURRENT and it should be easy to change -STABLE to do the same. -CURRENT now drains the protocols in the cluster starvation case too. > D) Of course since the lower drivers are allocating M_DONTWAIT >even if they did I would not get the routine called. > > Now this brings to light a weakness in my mind on the reclaim > system. > > 1) One of the primary things I thought the drain() functions >help with is to ward off DOS attacks. Well, no, not really. They're just there to `help' out in any starvation case, really. > 2) If drivers all use clusters only and clusters can never >call a drain() function, does this not leave both TCP and >SCTP weak against an attack on the cluster side of the MBUF >system? Well, firstly, all clusters are accompanied by mbufs. Secondly, as mentioned above, -CURRENT drains in both cases. > 3) I can see if we are out of mbufs eventually something sending >down will do a mget(..) with a M_WAIT which can spawn the drains >should we not have something like this for a cluster allocation?? There's no way we can have M_DONTWAIT allocations possibly drain the protocols. It would take far too long for an M_DONTWAIT allocation, especially in light of where we may be going with this in the future (i.e. processing some packets from interrupt context - perhaps). 
What I think you should do in your code is make the calls with M_TRYWAIT (what you call M_WAIT) wherever they are possible and only call with M_DONTWAIT where it's really not possible to wait. The M_TRYWAIT flag does not imply "run slower than M_DONTWAIT," it just means "try harder even if it takes a little longer, since we are able to block." > If we don't it seems to me the utility of the drain() fucnction is > very very limited.. > > Regards > > R > > -- > Randall R. Stewart > [EMAIL PROTECTED] 815-342-5222 (cell phone) -- Bosko Milekic [EMAIL PROTECTED]
Re: m_reclaim and a protocol drain
On Wed, Dec 26, 2001 at 06:13:40AM -0600, Randall Stewart wrote: > > > 1) One of the primary things I thought the drain() functions > > >help with is to ward off DOS attacks. > > > > Well, no, not really. They're just there to `help' out in any > > starvation case, really. > > > > This comment facinates me. The reason we made SACK's in SCTP > revokeable is due to the potential DOS attack that someone > can supposedly lauch if you don't allow the stack to revoke. > > I can actually see the reason that Sally made the comments > and had us change it so that SACK's are revokeable. However > you argue to the contrary and I wonder which is correct. > > If you do not allow revoking it is the same as if a protocol > does not hold a drain() fucntion. A attacker could easily > stuff a lot of out-of-order segments at you and thus > fill up all your mbuf's or clusters (in my current testing > case). This would then yeild a DOS since you could no longer > receive any segments and leave you high and dry I agree that it is important to drain. I didn't say that it wasn't, but merely that the drain routine in the mbuf code exists to `help out' in the starvation case. Nothing prevents one from calling a specific protocol drain routine explicitly from elsewhere. It is relatively expensive to drain all the protocols for a M_DONTWAIT caller so perhaps if the protocol expects draining to occur in tight situations, we can have the drain routines called from a different context as well (i.e. not only when mbuf allocations fail but also when, for example, something noteworthy occurs in the protocol code). > > > 2) If drivers all use clusters only and clusters can never > > >call a drain() function, does this not leave both TCP and > > >SCTP weak against an attack on the cluster side of the MBUF > > >system? > > > > Well, firstly, all clusters are accompanied by mbufs. Secondly, as > > mentionned above, -CURRENT drains in both cases. > > > > Hmm.. I will look at updating this... 
That would be very much appreciated. :-) I feel very bad for not doing this myself because I seem to not be tracking -STABLE much these days. But this is totally my fault and if you don't pick it up, feel free to submit a PR and I swear I will force myself to do it. :-) > > What I think you should do in your code is make the calls with > > M_TRYWAIT (what you call M_WAIT) wherever they are possible and only > > call with M_DONTWAIT where it's really not possible to wait. The > > M_TRYWAIT flag does not imply "run slower than M_DONTWAIT," it just > > means "try harder even if it takes a little longer, since we are able to > > block." > > > > A couple of comments here: > > a) There is NO M_TRYWAIT in the 4.3 strain I am working on.. I am >about to upgrade to 4.4. so maybe it will appear there :> Ah, no there is not. But M_TRYWAIT == better name given in -CURRENT to M_WAIT. Basically, they're the same flag and have the same effect, but the name M_TRYWAIT is more descriptive and thus used in -CURRENT (it means "try harder and wait if necessary, but not necessarily forever"). > b) It is NOT my code that I am talking about. The issue is I am >stuffing a LOT of out-of-order segments up the SCTP stack to >get the drain() function called (so I can test it). Nothing >ever calls the drain function and instead we lose packets at >input since the drivers in the ethernet side of things (as you >mention) do a M_DONTWAIT. My code uses M_WAIT wherever it can... > > > Hmm, maybe thats it... I need to somehow activate a condition where > I send a large segment and do a M_WAIT... Probably easier to build > a test ioctl to call the drain function though.. Hm. Well, at any given time we may get a M_WAIT allocation and that would do it. Note, however, that the drain routines _only_ get called when there are *no* more mbufs (in -STABLE) or clusters (in -CURRENT) left. That means that the mb_map has to be starved and that there are no mbufs on the free lists. 
They do not get called unless there is total starvation. That's why I mentioned above that perhaps it would be worthwhile to have them called from a different context somewhere from the protocol code when/if the protocol feels that they are necessary. > R > > > > If we don't it seems to me the utility of the drain() fucnction is > > > very very limited.. > > > > > > Regards > > > > > > R > > > > > > -- > > > Randall R. Stewart > > > [EMAIL PROTECTED] 815-342-5222 (cell phone) -- Bosko Milekic [EMAIL PROTECTED]
Re: Getting rid of maxsockets.
On Fri, Mar 22, 2002 at 02:59:04AM -0600, Mike Silbersack wrote: > I've looked over vmstat -z with a UMA kernel, it's really nice to know > that everything is coexisting together now. > > There's one big target, though: mbufs. I know that Bosko put a lot of > work into his new mbuf allocator, but if you could find a way to merge > mbufs into the slab allocator the benefits would be huge. Have you > discussed doing this with Bosko yet? > > Mike "Silby" Silbersack As I have previously mentioned, mb_alloc *is* a slab-like allocator, but I am prepared to remove large chunks of it and instead glue into UMA if this can be done in a relatively clean and elegant manner, which, after looking at UMA, I've decided it probably can be. -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED]
Re: Getting rid of maxsockets.
On Thu, Mar 21, 2002 at 11:35:52PM -0500, Jeff Roberson wrote: > On Fri, 22 Mar 2002, Mike Silbersack wrote: > > > There's one big target, though: mbufs. I know that Bosko put a lot of > > work into his new mbuf allocator, but if you could find a way to merge > > mbufs into the slab allocator the benefits would be huge. Have you > > discussed doing this with Bosko yet? > > > > Mike "Silby" Silbersack > > > > We have talked about it quite a bit. I'd love to remove the hard limit on > mbufs. I may do this soon, but I have other uma related work that will > probably come before it. I'm not so sure I like this idea. What would be better (and perhaps what you meant) is: "be able to expand the size of the mbuf allocation `pool' at runtime." In any case, we should not jump to quick conclusions with all data structures right away. Instead, I propose that we first glue-in mbuf allocations to UMA (not too difficult, given that UMA provides an allocation routine stub). If this is done properly [without macro-performance loss] then it should be rather trivial to bring in new functionality. > Jeff -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED]
Re: splimp() during panic?
Archie Cobbs wrote: > Hi, > > I'm trying to debug a mbuf corruption bug in the kernel. I've added > an mbuf sanity check routine which calls panic() if anything is amiss > with the mbuf free list, etc. This function runs at splimp() and if/when > it calls panic() the cpl is still at splimp(). > > My question is: does this guarantee that the mbuf free lists, etc. will > not be modified between the time panic() is called and the time a core > file is generated? For example, if an incoming packet causes a networking > interrupt after panic() has been called but before the core file is > written, will that interrupt be blocked when it calls splimp()? splimp() ensures that no driver handlers will be executed. Further, dumpsys() is called from panic() at splhigh() which would also mean that none of those potentially troublesome handlers will run. > I've been working under this assumption but it seems to not be > valid, because I seem to be seeing panics for situations that are > not true in the core file. Are you seeing invalid stuff from DDB but valid stuff from the core file? Because if so, that's REALLY WEIRD. If you're just seeing two different but invalid things, then perhaps something is happening when Debugger() runs (is it possible that the cpl() is changed after or before a breakpoint()?). > If this is not a valid assumption, is there an easy way to 'freeze' > the mbuf free lists long enough to generate the core file when an > inconsistency is found (other than adding the obvious hack)? To make doubly sure, what you can do is just keep a variable 'foo' which you initialize to 0. Before any mbuf free list manipulations, place an 'if (foo == 0)' check. Atomically set foo to 1 before the panic. See if the inconsistency changes. If you're seeing garbage in both cases, but the garbage is inconsistent, perhaps there's a memory problem or the dump isn't working properly (I've never heard of anything like this before). 
> Thanks, > -Archie > > ______ > Archie Cobbs * Packet Design * http://www.packetdesign.com Regards, -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED]
Re: m_split() considered harmful
On Fri, May 31, 2002 at 11:29:20AM -0700, Archie Cobbs wrote: > Hi, > > There is an inconsistency in the kernel mbuf code around the > "m->m_ext.ext_size" field of an M_EXT mbuf. > > At first glance you might assume that this field is the total amount > of contiguous space available in the external buffer pointed to by > "m->m_ext.ext_buf". For example, look at the M_TRAILINGSPACE() and > MCLGET() macros. But alas, you know where assumptions lead... :-) > > Now look at m_split(), in particular this code: > > if (m->m_flags & M_EXT) { > n->m_flags |= M_EXT; > n->m_ext = m->m_ext; > if(!m->m_ext.ext_ref) > mclrefcnt[mtocl(m->m_ext.ext_buf)]++; > else > (*(m->m_ext.ext_ref))(m->m_ext.ext_buf, > m->m_ext.ext_size); > m->m_ext.ext_size = 0; /* For Accounting XX danger */ > n->m_data = m->m_data + len; > } else { > > Notice the 'XX' which is where the problem lies. The code is > just recklessly setting m->m_ext.ext_size to zero to avoid some > "accounting" problem. This has been there since revision 1.1/rgrimes. > > This is comletely broken and violates the semantics of the > "m->m_ext.ext_size" field implied by M_TRAILINGSPACE() et. al. Good catch. Thanks Archie! > Moreover, I've also done a search of every occurrence of "ext_size" > and everywhere else in the kernel treats this feild with the same > "external buffer size" semantics, and there is no "accounting" that > is done with this field. > > For an example of where this could cause problems, notice that > a call to sbfree() on an mbuf that had gone through an m_split() > could cause a "receive 1" panic (I've seen these occasionally over > the years). > > FYI, m_split() is used in these files: > > netgraph/ng_ppp.c > netinet6/ah_input.c > netinet6/esp_input.c > netinet6/frag6.c > netsmb/smb_rq.c > > If there are no objections, I will apply the patch below. 
> > Thanks, > -Archie > > __ > Archie Cobbs * Packet Design * http://www.packetdesign.com > > --- kern/uipc_mbuf.c.orig Fri May 31 11:17:52 2002 > +++ kern/uipc_mbuf.c Fri May 31 11:27:42 2002 > @@ -1194,6 +1194,10 @@ > * Partition an mbuf chain in two pieces, returning the tail -- > * all but the first len0 bytes. In case of failure, it returns NULL and > * attempts to restore the chain to its original state. > + * > + * Note that the returned mbuf must be treated as read-only, because > + * it will end up sharing an mbuf cluster with the original mbuf if the > + * "breaking point" happens to lie within a cluster mbuf. > */ Can you please apply this to -CURRENT first? -CURRENT has the same problem. Notice that in -CURRENT we have a M_WRITABLE() macro which will actually evaluate to false for both mbufs as they will be referring to a cluster that has a ref count of > 1 so this comment will be implicitly represented in the code. Now all we have to do is actually start using the M_WRITABLE macro more often. :-) > struct mbuf * > m_split(m0, len0, wait) > @@ -1247,7 +1251,6 @@ > else > (*(m->m_ext.ext_ref))(m->m_ext.ext_buf, > m->m_ext.ext_size); > - m->m_ext.ext_size = 0; /* For Accounting XX danger */ > n->m_data = m->m_data + len; > } else { > bcopy(mtod(m, caddr_t) + len, mtod(n, caddr_t), remain); I don't remember why ext_size was set to 0 here originally (as you mention, it was imported that way), but it certainly seems bogus, and you catching it now is hopefully going to solve some really weird and difficult-to-debug problems we've had involving mbufs over the years. Woop! -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED]
Re: m_split() considered harmful
On Fri, May 31, 2002 at 12:54:39PM -0700, Julian Elischer wrote: > > > On Fri, 31 May 2002, Bosko Milekic wrote: > > > > > I don't remember why the ext_size here was this originally (as you > > mention, it was imported that way), but it certainly seems bogus and > > you catching it now is hopefully going to solve some really weird and > > difficult-to-debug problems we've had involving mbufs over the years. > > > It was imported from FreeBSD-1.x by rod. > > The size argument was for two reasons. > 1/ to allow the M_LEADINGSPACE and M_TRAILINGSPACE calculations > to be done if it was a single reference object. > 2/ to allow the free function (in the case of non cluster external > objects) to know what sized object they had in the case that they needed > this information. Yeah, we still need the field to exist because of (2). I just meant that we should remove the explicit setting of 0 of the field in m_split(), and that's what Archie suggested. > I know because I added it, because I needed to do both of these at TRW in > '90-'95 under MACH/BSD and when I moved the code to FreeBSD1.x it cam > along.. there was no M_LEADINGSPACE/M_TRAILINGSPACE macro at that time.. > I did it in my code explicitly. > > It was not used in standard code because only in my own protocol stack > did I know that the damned object was not shared and writable.. > This has improved with time.. > > Having the size set to 0 however stopped users from hitting > cases where the WRITABILITY of the ext objext has not been correctly > tracked as it always returns "not enough space" (sometimes -ve). Right now, in -CURRENT, we have a M_WRITABLE macro, as you noticed, that evaluates to true unless one of these conditions is true: 1) M_RDONLY is set in the mbuf 2) mbuf has external buffer attached and ref. count on that buffer (e.g., cluster, whatever) is > 1. This macro (M_WRITABLE) _is_ currently used in M_LEADINGSPACE. If the mbuf being checked is not writable, M_LEADINGSPACE will evaluate to 0. 
However, I noticed just now that it is not used in M_TRAILINGSPACE, where it probably should be. M_TRAILINGSPACE should probably look like this:

#define M_TRAILINGSPACE(m) \
	((m)->m_flags & M_EXT ? \
	    (M_WRITABLE(m) ? \
		(m)->m_ext.ext_buf + (m)->m_ext.ext_size - \
		    ((m)->m_data + (m)->m_len) : 0) : \
	    &(m)->m_dat[MLEN] - ((m)->m_data + (m)->m_len))

This is the same philosophy adopted by M_LEADINGSPACE. I have no idea how I missed that when we did the M_WRITABLE stuff. Or did we leave it out for some specific reason? Ian Dowse or David Malone would know, as the WRITABLE code was thanks to them. :-) > If we change this we need to audit the WRITABILTY.. > e.g. it is not checked in M_TRAILINGSPACE Right. > Julian -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED]
Re: m_split() considered harmful
On Fri, May 31, 2002 at 01:25:09PM -0700, Archie Cobbs wrote: [...] > As a temporary saftey measure, I'll add M_WRITABLE(m) into the > M_TRAILINGSPACE() macro. However, I see this as a temporary hack; > the correct fix is to put the burden of writability on the caller. > After all, M_TRAILINGSPACE() doesn't modify the mbuf data! > > That is, what we really need is a more general audit for code that > writes into mbufs that might be read-only -- and, as one special case > of tha, code that calls M_TRAILINGSPACE()/M_LEADINGSPACE() before writing > into an mbuf. Agreed. [...] > Updated patch below. > > -Archie [...] Patch looks good. -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED]
Re: Race condition with M_EXT ref count?
First of all, this issue is non-existent in -CURRENT, where all reference counting is done via a consistent interface that provides bus-locked reference count manipulations. Secondly, this may not be as much of a problem as you may think, particularly in the m_split() case. For example, if you're calling m_split() on an mbuf chain that may refer to cluster(s) where none of the clusters has a ref. count greater than 1 -- which is usually the case anyway -- then this is fine; since you have possession of the chain referring to those clusters, and presumably the chain isn't sitting in some queue somewhere (if it is, you'd have to be under the protection of that queue anyway - splimp() or whatever), then you're the only one who has possession of that cluster and you'll be the only candidate incrementing the ref. count, regardless of whether an interrupt comes in or not. Perhaps the real solution would be to do something like: if (ref count > 1) { do atomic inc. } else { do regular inc. } Or, alternately, do what -CURRENT does (and it's probably simpler anyway) and just make it all atomic. I'm sorry to say that I don't develop on -STABLE much anymore (namely, because I don't run it at home) so I cannot make this change any time soon. On Mon, Jun 03, 2002 at 03:08:01PM -0700, Julian Elischer wrote: > well, it's a valid problem form the point of view of some interrupt code > changing the ext reference at teh same time that some other code is > planning on incrementing it, but while Giant is in place it's not > likely to affect anything.. > > > I presume you are talking about 4.x however right? > we can presume that each manipulator has a reference so it's not going to > GA AWAY due to a zero reference being reached, so the link can be followed > safely, which just elaves the atomicity of the addition operation. > > who are the contenders? > 1/ network mid-level code? (protected by: splnet and the BGL. > Only one cpu at a time.. 
> > 2/ Interrupt code running at splimp > probably freeing stuff after transmit. (receivers should have > pre-allocated buffers) > > it IS possible that there could need to be an splimp() added but I am > not clear on what part the 4.4 BGL (Big Giant Lock) plays in this.. > it may make it safe... > > On Mon, 3 Jun 2002, Archie Cobbs wrote: > > > Julian Elischer writes: > > > Not denying that, just that JHB has a preliminary implementation > > > he's been showing around... > > > > > > > > this is YET ANOTHER case for the Atomic reference counting ABI that > > > > > jhb has been talking about... > > > > > > > > I was the initial person to request an atomic_t API. > > > > You guys please stop changing the subject :-) > > > > Can somebody confirm that they think this bug is real/valid? > > > > Thanks, > > -Archie > > > > __ > > Archie Cobbs * Packet Design * http://www.packetdesign.com > > -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED]
Re: Race condition with M_EXT ref count?
On Mon, Jun 03, 2002 at 04:01:29PM -0700, Archie Cobbs wrote: > Bosko Milekic writes: > > Secondly, this may not be as much of a problem as you may think. > > Particularly if you consider the m_split() case, for example. For > > example, if you're calling m_split() on an mbuf chain that may refer > > to cluster(s) where none of the clusters has a ref. count greater than > > 1 -- which is usually the case anyway -- then this is fine; since you > > have possession of the chain referring to those clusters, and > > presumably the chain isn't sitting in some queue somewhere (if it is, > > you'd have to be under the protection of that queue anyway - splimp() > > or whatever), then you're the only one who has possession of that > > The *chain* won't be sitting in a queue, but there may be a different > mbuf in a queue somehwere that points to the same cluster. I did say "if the refcount is exactly 1" (which is often the case in there). > Since mid-level code only increments the ref count, I think the > worst that can happen is the ref count is incorrectly too high, > which could only cause a memory leak rather than a crash. No, the worst case is that it is too low. An increment is really three steps: read, inc, write. Two unprotected increments can interleave like this (count initially 1):

    thread A        thread B
    read 1
    inc 1->2        read 1
    write 2         inc 1->2
                    write 2

    count finally: 2 (it should be 3 now)

Three references are now outstanding but the count says 2, so the second free will release the cluster and the final free will be bogus. > There's not way for the ref count to be too low, or for the cluster > to be free'd twice, because all decrements are protected by splimp(). > > However, once you start adding in custom code then we may not be > so lucky.. > > -Archie > > ______ > Archie Cobbs * Packet Design * http://www.packetdesign.com -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED]
Re: Race condition with M_EXT ref count?
On Mon, Jun 03, 2002 at 03:54:02PM -0700, Archie Cobbs wrote: [...] > If you don't mind, I'll let you deal with the -current case, since > I'm not familiar enough with -current locking. Below is my proposed > patch for -stable. -current does not have this "problem." > It should be clear that in -stable at least, there can be a race > between an interrupt handler that free's a cluster mbuf and mid-level > code that is free'ing the same mbuf, for example. > > -Archie > > __ > Archie Cobbs * Packet Design * http://www.packetdesign.com > > --- sys/kern/uipc_mbuf.c.orig Mon Jun 3 15:43:25 2002 > +++ sys/kern/uipc_mbuf.c Mon Jun 3 15:52:28 2002 > @@ -707,11 +707,16 @@ > n->m_len = min(len, m->m_len - off); > if (m->m_flags & M_EXT) { > n->m_data = m->m_data + off; > - if(!m->m_ext.ext_ref) > - mclrefcnt[mtocl(m->m_ext.ext_buf)]++; > - else > - (*(m->m_ext.ext_ref))(m->m_ext.ext_buf, > - m->m_ext.ext_size); > + if (m->m_ext.ext_ref == NULL) { > + atomic_add_char( > + &mclrefcnt[mtocl(m->m_ext.ext_buf)], 1); > + } else { > + int s = splimp(); > + > + (*m->m_ext.ext_ref)(m->m_ext.ext_buf, > + m->m_ext.ext_size); > + splx(s); > + } > n->m_ext = m->m_ext; > n->m_flags |= M_EXT; > } else > @@ -754,11 +759,15 @@ > n->m_len = m->m_len; > if (m->m_flags & M_EXT) { > n->m_data = m->m_data; > - if(!m->m_ext.ext_ref) > - mclrefcnt[mtocl(m->m_ext.ext_buf)]++; > - else > - (*(m->m_ext.ext_ref))(m->m_ext.ext_buf, > - m->m_ext.ext_size); > + if (m->m_ext.ext_ref == NULL) > + atomic_add_char(&mclrefcnt[mtocl(m->m_ext.ext_buf)], 1); > + else { > + int s = splimp(); > + > + (*m->m_ext.ext_ref)(m->m_ext.ext_buf, > + m->m_ext.ext_size); > + splx(s); > + } > n->m_ext = m->m_ext; > n->m_flags |= M_EXT; > } else { > @@ -777,11 +786,16 @@ > n->m_len = m->m_len; > if (m->m_flags & M_EXT) { > n->m_data = m->m_data; > - if(!m->m_ext.ext_ref) > - mclrefcnt[mtocl(m->m_ext.ext_buf)]++; > - else > - (*(m->m_ext.ext_ref))(m->m_ext.ext_buf, > - m->m_ext.ext_size); > + if (m->m_ext.ext_ref == NULL) { > + 
atomic_add_char( > + &mclrefcnt[mtocl(m->m_ext.ext_buf)], 1); > + } else { > + int s = splimp(); > + > + (*m->m_ext.ext_ref)(m->m_ext.ext_buf, > + m->m_ext.ext_size); > + splx(s); > + } > n->m_ext = m->m_ext; > n->m_flags |= M_EXT; > } else { > @@ -1125,11 +1139,15 @@ > if (m->m_flags & M_EXT) { > n->m_flags |= M_EXT; > n->m_ext = m->m_ext; > - if(!m->m_ext.ext_ref) > - mclrefcnt[mtocl(m->m_ext.ext_buf)]++; > - else > - (*(m->m_ext.ext_ref))(m->m_ext.ext_buf, > - m->m_ext.ext_size); > + if (m->m_ext.ext_ref == NULL) > + atomic_add_char(&mclrefcnt[mtocl(m->m_ext.ext_buf)], 1); > + else { > + int s = splimp(); > + > + (*m->m_ext.ext_ref)(m->m_ext.ext_buf, > + m->m_ext.ext_size); > + splx(s); > + } > m->m_ext.ext_size = 0; /* For Accounting XX danger */ > n->m_data = m->m_data + len; > } else { It would be better if the thing was in a macro to allow for consistent ref. count manipulation, like in -CURRENT. But whatever, I don't really care. -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: m->m_pkthdr.header
On Fri, Jun 07, 2002 at 12:55:53PM -0700, Archie Cobbs wrote:
> Hi,
>
> I'd like to get rid of this mbuf field "m->m_pkthdr.header".
> Here are my reasons:
>
> - It's broken. Consider this code:
>
>       copy = m_copypacket(m);
>       m_freem(m);

Indeed, this is a bug in m_copypacket(). It should copy the data over;
but again, admittedly, this is a very dangerous practice! Ideally, the
copying over ought to be done by the code calling m_copypacket(), right
after the m_copypacket() call but before the m_freem() call. The reason
to do it from the caller code and not from m_copypacket() is that there
is no reasonable way to determine, from m_copypacket(), whether one of
our header/mbuf pointers points to data inside the mbuf or whether it
is safe to do the shallow pointer copy.

>   Typically m->m_pkthdr.header points into the data area of 'm'.
>   So now copy->m_pkthdr.header points at the same data, which has
>   already been freed and possibly reused.
>
> - It is an obscure thing most people don't know about and is
>   ripe for causing hard to find bugs.
>
> - It's only used in two places, (a) for secret communication between
>   the oltr driver and if_ether.c, and (b) as a temporary variable
>   in ip_input.c:
>
>   $ find . -type f -print | xargs grep -w m_pkthdr.header
>   ./contrib/dev/oltr/if_oltr.c: m0->m_pkthdr.header = (void *)th;
>   ./netinet/if_ether.c: th = (struct iso88025_header *)m->m_pkthdr.header;
>   ./netinet/ip_input.c: m->m_pkthdr.header = ip;
>   ./netinet/ip_input.c: #define GETIP(m) ((struct ip*)((m)->m_pkthdr.header))
>
> My proposal:
>
>   - Replace the obfuscating GETIP() macro in ip_input.c with a variable.
>   - Rejigger the oltr driver to pass its "secret" information using
>     an auxiliary mbuf instead of m->m_pkthdr.header.
>
> Any comments/objections?

None from here. :-) Just the standard pointer, which you probably
already know: -CURRENT first, please.

> Thanks,
> -Archie
>
> __
> Archie Cobbs * Packet Design * http://www.packetdesign.com

Regards,
-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]
Re: m->m_pkthdr.header
On Sat, Jun 08, 2002 at 09:16:06AM -0700, John Polstra wrote: > In article <[EMAIL PROTECTED]>, > Archie Cobbs <[EMAIL PROTECTED]> wrote: > > > > I'd like to get rid of this mbuf field "m->m_pkthdr.header". > [...] > > Any comments/objections? > > Yes, please bear in mind that if you change the layout of the mbuf > structures you will probably break binary compatibility with various > commercial network drivers that are shipped in binary form. That's a > big pain for those of us who are forced to use such drivers. For that > reason I don't think you should make this kind of change in -stable. > Current is obviously a different kettle of fish, but even there you > should add fillers to replace any structure elements you remove, if > possible. While I agree that -STABLE is a more sensitive fish (which is partly why I really don't enjoy dealing with it -- this argument always comes up), I think that these types of changes absolutely need to go into -CURRENT. We shouldn't have to worry about "breaking binary compatibility" in our development releases. I can almost guarantee that a lot of stuff isn't going to work without a rebuild when people decide to go from 4.x to 5.0 anyway. Placing dummy-fillers is just annoying. > John > -- > John Polstra > John D. Polstra & Co., Inc. Seattle, Washington USA > "Disappointment is a good sign of basic intelligence." -- Chögyam Trungpa -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: new zero copy sockets snapshot
On Tue, Jun 18, 2002 at 10:36:36PM -0600, Kenneth D. Merry wrote: > > I've released a new zero copy sockets snapshot, against -current from June > 18th, 2002. > > http://people.FreeBSD.org/~ken/zero_copy > > The fixes that went into this snapshot: > > - Take mutex locking out of ti_attach(), it isn't really needed. >As long as we can assume that probes of successive ti(4) instances >happen sequentially, we'll be safe in doing this. Thanks to John >Baldwin for pointing out the solution to that problem. (The lock in >ti_attach() was causing all sorts of WITNESS warnings when >bus_setup_intr() was called.) > > - Added a new routine, vm_object_allocate_wait(). This is a variant of >vm_object_allocate() that allows the user to specify whether the >uma_zalloc() call inside vm_object_allocate_wait() is called with >M_WAITOK or M_NOWAIT. This eliminates a WITNESS warning caused when >jumbo_vm_init() calls vm_object_allocate() with the jumbo lock held, and >potentially gives other callers the option of eliminating the mandatory >wait on the uma_zalloc() call. I think this problem was fixed in recent -CURRENT by JeffR. Notably, the VM system should not allow itself to recurse on itself when called with M_NOWAIT. >(vm_object_allocate() now just calls vm_object_allocate_wait() with the >proper argument.) > > With those fixes, plus several fixes that have gone into -current over the > past week or so, the zero copy sockets code runs without any WITNESS > warnings at all now. > > Ken > -- > Kenneth Merry > [EMAIL PROTECTED] > > To Unsubscribe: send mail to [EMAIL PROTECTED] > with "unsubscribe freebsd-net" in the body of the message > -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: new zero copy sockets snapshot
On Thu, Jun 20, 2002 at 12:25:58PM -0400, Andrew Gallatin wrote:
[...]
> > > Do you think it would be feasible to glue in a new jumbo (10K?)
> > > allocator on top of the existing mbuf and mcl allocators using the
> > > existing mechanisms and the existing MCLBYTES > PAGE_SIZE support
> > > (but broken out into separate functions and macros)?
> >
> > Assuming that you can still play those VM tricks with the pages spit
> > out by mb_alloc (kern/subr_mbuf.c in -CURRENT), then this wouldn't be a
> > problem at all. It's easy to add a new fixed-size type allocation to
> > mb_alloc. In fact, it would be beneficial. mb_alloc uses per-CPU
> > caches and also makes mbuf and cluster allocations share the same
> > per-CPU lock. What could be done is that the jumbo buffer allocations
> > could share the same lock as well (since they will likely usually be
> > allocated right after an mbuf is). This would give us jumbo-cluster
> > support, but it would only be useful for devices clued enough to break
> > up the cluster into PAGE_SIZE chunks and do scatter/gather. For most
> > worthy gigE devices, I don't think this should be a problem.
>
> I'm a bit worried about other devices. Traditionally, mbufs have
> never crossed page boundaries, so most drivers never bother to check
> for a transmit mbuf crossing a page boundary. Using physically
> discontiguous mbufs could lead to a lot of subtle data corruption.

I assume here that when you say "mbuf" you mean "jumbo buffer attached
to an mbuf." In that case, yeah, all that we need to make sure of is
that the driver knows that it's dealing with non-physically-contiguous
pages. For what concerns regular 2K mbuf clusters as well as the
256-byte mbufs themselves, they never cross page boundaries, so this
should not be a problem for those drivers that do not use jumbo
clusters.

> One question. I've observed some really anomalous behaviour under
> -stable with my Myricom GM driver (2Gb/s + 2Gb/s link speed, dual
> 1GHz PIII). When I use 4K mbufs for receives, the best speed I see is
> about 1300Mb/sec. However, if I use private 9K physically contiguous
> buffers I see 1850Mb/sec (iperf TCP).
>
> The obvious conclusion is that there's a lot of overhead in setting up
> the DMA engines, but that's not the case; we have a fairly quick chain
> DMA engine. I've provided a "control" by breaking my contiguous
> buffers down into 4K chunks so that I do the same number of DMAs in
> both cases, and I still see ~1850 Mb/sec for the 9K buffers.
>
> A coworker suggested that the problem was that when doing copyouts to
> userspace, the PIII was doing speculative reads and loading the cache
> with the next page. However, we then start copying from a totally
> different address using discontiguous buffers, so we effectively take
> 2x the number of cache misses we'd need to. Does that sound
> reasonable to you? I need to try malloc'ing virtually contiguous and
> physically discontiguous buffers & see if I get the same (good)
> performance...

I believe that the Intel chips do "virtual page caching" and that the
logic that does the virtual -> physical address translation sits
between the L2 cache and RAM. If that is indeed the case, then your
idea of testing with virtually contiguous pages is a good one.
Unfortunately, I don't know if the PIII is doing speculative
cache-loads, but it could very well be the case. If it is, and if in
fact the chip does caching based on virtual addresses, then providing
it with virtually contiguous address space may yield better results.
If you try this, please let me know. I'm extremely interested in
seeing the results!

> Cheers,
>
> Drew

Regards,
-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]
Re: Should we keep a cache of mbuf+cluster ready for use ?
On Sat, Jun 29, 2002 at 03:46:31PM -0700, Jeffrey Hsu wrote:
> So, what you want is something like a
>       MGETHCL(m, how, type)
>       MHCLFREE(m)
> interface which first looks in a combined freelist before the
> individual mbuf and cluster freelists. I think it's a good idea.

I would prefer to see an interface that just grabs both a cluster and
an mbuf from their respective per-CPU caches (in -CURRENT) while only
grabbing the lock once, if this is really that important to you. [*]
In -CURRENT right now, clusters and mbufs are allocated from different
per-CPU cache lists, but they _share_ per-CPU locks, so this is
entirely do-able. Having yet another cache that holds linked
cluster + mbuf pairs would really complicate allocations.

[*] Admittedly, Alfred was pushing for exactly this not long ago, but
I just didn't feel like doing it because I felt that there are better
things to worry about than a little unlock + relock: when we're just
dropping and re-acquiring the exact same lock, I don't see it as too
big of a deal. However, I'm really not sure how smart architectures
like Intel's are with bus-locked instructions and data cache use. If
they're stupid about it (i.e., if they don't look at caches during a
bus-locked cycle), then it could be worth it to avoid that extra
unlock/relock between allocations. If they're fairly smart about it,
it's really not worth it.

Also, I should mention that in his original message, Luigi mentions
that MGETHDR and MCLGET are rather long macros, but in -CURRENT they
are functions, so there is not really much added code cache pollution.

Regards,
-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]
Re: Should we keep a cache of mbuf+cluster ready for use ?
On Sat, Jun 29, 2002 at 04:11:20PM -0700, Alfred Perlstein wrote: > If the pool is per-device-softc then it doesn't need locks and will > be a lot more efficient than even grabbing from the per cpu pool. We have an allocator that does per-CPU allocations, we don't need to add additional layers of caches on top of that. The only thing you would save would be avoiding having to lock and unlock a cache. Since the lock is a per-CPU lock (because the caches are per-CPU), the contention on it is very low unless, of course, you get pre-empted somewhere during the allocation and then migrate CPUs; but this is not that likely to happen, really (not right now, anyway). > Also, (/me does a little dance) TOLD YOU SO ABOUT THE ALLOCATION CODE! Huh? If you're referring to the grouped mbuf + cluster allocation interface, I haven't yet admitted that it is worth it. I specifically said that it depends on how the architecture deals with bus-locked instructions and cache use. In other words, it depends on how expensive a unlock+relock operation is when we're dealing with the same lock. If we're allowed to go to the cache during the bus-locked unlock+relock ops, then we should be getting cache hits so strictly speaking, them being there is not such a big deal. But again, I have no idea what happens with cache interactions during a bus-locked instruction. > hah! :) > > -- > -Alfred Perlstein [[EMAIL PROTECTED]] > 'Instead of asking why a piece of software is using "1970s technology," > start asking why software is ignoring 30 years of accumulated wisdom.' > Tax deductible donations for FreeBSD: http://www.freebsdfoundation.org/ > > To Unsubscribe: send mail to [EMAIL PROTECTED] > with "unsubscribe freebsd-net" in the body of the message > -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: Mbuf allocator performance (was Should we keep a cache of mbuf+cluster ready for use ?)
There are several problems with your "simple" code that make it simpler
than what's in -CURRENT and in -STABLE and therefore yield [obviously]
seemingly better times.

1) Sleeping in the case of a system exhausted of mbufs and clusters is
now broken, because code that frees to this "combined" list will not
generate wakeups; even if you build this functionality in, you would
have to code the wakeup for a "special" case where one obtains both a
cluster and an mbuf; it'll get complicated.

2) No per-CPU caches: you bring contention (in -CURRENT) from per-CPU
back to global, which you probably don't care about now, but which will
matter when Giant goes completely.

3) You violate the write/read producer/consumer model. In -CURRENT, we
try to not have the code doing the freeing write to any part of the
mbuf, thus hopefully avoiding cache invalidation from occurring during
a free that happens from a thread that doesn't usually write to the
mbuf at all.

I think that because of (1) and (2), especially, and because the kernel
in -CURRENT is the way it is right now, you're getting better
performance numbers. Try exhausting the system and seeing how far you
get. I suspect you'll eventually just hang (because you'll allocate a
bunch of stuff violating the real interface and then won't be able to
wake up and Do The Right Thing).

While, as I said before, I agree that we should have an interface that
allocates both an mbuf and a cluster, I don't think that this is the
right approach. I believe that the correct approach would be to couple
the mbuf and cluster allocation under a single per-CPU lock/unlock
operation, which isn't too difficult to do.

On Tue, Jul 02, 2002 at 12:10:50AM -0700, Luigi Rizzo wrote:
> [Bcc to -current as relevant there]
>
> As a followup to my question below, i did some simple experiment
> on a -current box acting as a router, using two "em" cards and
> DEVICE_POLLING (just for stability).
>
> The whole of the code is basically below -- the receiving
> side tries to grab the mbuf+cluster from the free pool first,
> and falls back to the standard method on failure; the sending
> side tries to attach the freed buffer to the free pool
> if it is the case.
>
> There is a simplification here with respect to what could
> go into an m_freem()/mgethdr_cluster() pair, because the driver
> is already holding a lock, on Giant in this case, so you do not need
> further locking. Still, to come to the data points:
>
>                            CURRENT    STABLE (*)
>
>   fastfwd=1, pool_max=0    276 kpps   365 kpps
>   fastfwd=1, pool_max=50   355 kpps   383 kpps
>
>   fastfwd=0, pool_max=0    195 kpps   142 kpps
>   fastfwd=0, pool_max=50   227 kpps   146 kpps
>
> (*) This version of STABLE that I am using for comparison has some
> proprietary optimizations which make it a bit faster than normal.
> However it still uses the old ipfw code, which is invoked when
> fastfwd=0, and is significantly slower than the new one.
>
> Now this really seems to call for adding this interface into the
> mbuf subsystem. I believe we have to find a name for the allocator
> (the deallocator might well go into m_freem(), depending on how we
> implement the locking) and whether it makes sense to lock
> mgethdr_cluster() under per-cpu locks or under Giant, or even let
> the caller make sure that it holds the proper lock before trying
> to invoke the procedure (as i expect the "producers" or "consumers"
> of these pairs to be located in the network stack, chances are that
> they already hold a lock on Giant).
>
> cheers
> luigi
>
> The code:
>
>     struct mbuf *em_pool;
>
>     static int em_pool_max = 50;
>     SYSCTL_INT(_hw, OID_AUTO, em_pool_max, CTLFLAG_RW,
>         &em_pool_max, 0, "max size of mbuf pool");
>     static int em_pool_now;
>     SYSCTL_INT(_hw, OID_AUTO, em_pool_now, CTLFLAG_RD,
>         &em_pool_now, 0, "Current size of mbuf pool");
>
>     ... in em_get_buf() ...
>
>         if (em_pool) {
>             mp = em_pool;
>             em_pool = mp->m_nextpkt;
>             em_pool_now--;
>             goto have_it;
>         }
>
>     ... in em_clean_transmit_interrupts() ...
>         if ((m = tx_buffer->m_head)) {
>             if (em_pool_now < em_pool_max &&
>                 m->m_next == NULL &&
>                 m->m_flags & M_EXT &&
>                 M_WRITABLE(m) ) {
Re: Mbuf allocator performance (was Should we keep a cache of mbuf+cluster ready for use ?)
On Tue, Jul 02, 2002 at 08:04:23AM -0400, Bosko Milekic wrote:
> There are several problems with your "simple" code that make it simpler
> than what's in -CURRENT and in -STABLE and therefore yield [obviously]
> seemingly better times.
>
> 1) sleeping in the case of a system exhausted of mbufs and clusters is
> now broken because code that frees to this "combined" list will not
> generate wake ups; even if you build this functionality in, you would
> have to code the wakeup for a "special" case where one obtains both a
> cluster and an mbuf; it'll get complicated.
>
> 2) no per-CPU caches, you bring contention (in -CURRENT) from per-CPU
> back to global, which you probably don't care about now, but which will
> matter when Giant goes completely.
>
> 3) you violate write/read producer/consumer model. In -CURRENT, we try
> to not have the code doing the freeing write to any part of the mbuf,
> thus hopefully avoiding cache invalidation from occurring during a free
> that happens from a thread that doesn't usually write to the mbuf at
> all.

Forgot, also, in -CURRENT:

4) Breaks the framework in mb_alloc that groups pages of mbufs and
clusters together, allowing them to be freed in the future if/when we
decide to implement lazy freeing; it would do the same thing if we
moved mbuf allocations to UMA, because grouping together mbufs and
clusters on a separate list will violate that scheme.

> I think that because of (1) and (2), especially, and because the kernel
> in -CURRENT is the way it is right now, you're getting better
> performance numbers. Try exhausting the system and seeing how far you
> get. I suspect you'll eventually just hang (because you'll allocate a
> bunch of stuff violating the real interface and then won't be able to
> wakeup and Do The Right Thing).
>
> While, as I said before, I agree that we should have an interface that
> allocates both an mbuf and a cluster, I don't think that this is the
> right approach.
I believe that the correct approach would be to couple
> the mbuf and cluster allocation under a single per-CPU lock/unlock
> operation, which isn't too difficult to do.
>
> [Luigi's message, with the benchmark table and the em(4) pool code,
> was re-quoted here in full; trimmed.]
Re: Mbuf allocator performance (was Should we keep a cache of mbuf+cluster ready for use ?)
On Tue, Jul 02, 2002 at 10:12:22AM -0700, Luigi Rizzo wrote:
> On Tue, Jul 02, 2002 at 01:01:35PM -0400, Bosko Milekic wrote:
> ...
> > The reason I said "seemingly" is because of the fact that I don't
> > think that it is your "grouping" of both clusters and mbufs that
> > specifically causes the perf. increase to the extent you showed with
> > your tests. What you're doing in the code you posted is also avoiding
>
> could be, i don't know... but this is exactly why we both agree
> that we better provide an interface so that we can work out in
> parallel improvements in the driver and improvements in the mbuf
> code.
>
> > 1) Remove the malloc() for external ref. count allocations; instead,
> > store the ref. count at the end of the cluster like I did when I took
>
> this seems to violate one of your points about cache pollution!
> but i am totally neutral on this one.

The cache line that contains the reference count will be invalidated in
either case (the malloc() case is doubly worse, even), so if there's no
way to avoid the invalidation (it's a per-cluster reference count,
after all), there's not much I can do about it.

> > 2) Introduce an interface that will allocate and free:
> >    (i) an mbuf with a cluster attached to it;
> >    (ii) an M_PKTHDR mbuf with a cluster attached to it;
> > However, this interface would wrap the appropriate alloc/free
> > routines, although it would take care to not drop the allocation
> > lock between the allocations. I don't suspect this to be too
> > difficult to do.
>
> fine with me.

Ok, so we agree on the interface and you've agreed to compromise on the
solution. Now I just have to deliver. :-) I'll send you a patch by the
end of the week. I've decided to haul in one of my -CURRENT test boxes
to work, so hopefully I'll get some work done here.

> > It is not safe if you use it too often. The allocator was designed to
> > allocate; it HAS caches; it doesn't need everyone under the Sun to
> > start keeping their own caches on top of that.
>
> which is what happens when they realise they can do things faster :)

This is really bad philosophy. It is my understanding that to Do The
Right Thing, we ought to make certain compromises in design. We need to
allow for certain functionality to exist, and we cannot do that with
per-driver hacked-up "solutions." Sure, I can hack up a small little
per-softc list from which I may even be able to do allocations WITHOUT
a lock, but that'll introduce a whole new level of complexity (think
about mbufs getting misplaced, etc.). With huge SMP-related changes, we
cannot afford any more of these spaghetti-like changes.

> > Here's what happens:
> >
> > Consumers A and B each keep their own "pools" of mbufs+clusters.
> ...
>
> look, this is exactly what happens with network interfaces. If
> they fail to allocate a new mbuf, they keep recycling
> the one they have from the receive queue instead of freeing it.

Yes, but that's because they _need_ an mbuf, they can't get one, so
they re-use one. If you build a local pool in which you store UNUSED
mbufs, with no real idea of when they'll be used -- only with the
assumption/hope that you'll use them "soon" -- and if you do this in
several different places in the kernel, you are bound to hit brokenness
under heavy load, when you need to block from other locations to get
your network buffers and cannot do that because someone else is
"assuming" they'll need a bunch "sometime soon." This is why we have
per-CPU caches and a global cache. The per-CPU caches load themselves
accordingly, and they'll give you what you're looking for when you need
it, from a cache. Sure, the caches are a tad more expensive to twiddle,
but this is the compromise we make to ensure that the system knows
about our caches and is able to cope even under heavy load.
> cheers > luigi -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: Mbuf allocator performance (was Should we keep a cache of mbuf+cluster ready for use ?)
n pool. A allocates and allocates and allocates from the local pool. C
comes in and allocates "normally" through the allocator. C cannot find
a cluster and is willing to block, so it does. A frees. No wakeup. A
frees. No wakeup. A frees. No wakeup. ... maybe D frees and generates a
wakeup, maybe not.

> #2 given that all i am asking for is defining an interface for this
> type of optimization, when the time comes that Giant goes, and
> assuming I am not retired yet, it will still change nothing for
> the non-SMP case, and for the SMP case we can run our performance
> tests and if it proves to be useful we can put in whatever
> is necessary.

I already told you that I think that the interface is OK; I don't think
that keeping yet another freelist for this sort of thing is OK. The
mbufs and clusters will be on their respective free lists anyway; there
really isn't a point in keeping yet another list around, if you read
the compromise above and assume that the malloc() will disappear from
the common cluster allocation case.

> #3 easy to fix (assuming this is really a problem) -- just have the
> free pool organized as a list of pointers to the mbufs, rather than
> using the m_nextpkt field as the link field. Because the free
> list is short and fixed size you can even use an array to
> hold it.

This is what the existing cache lists do anyway. Again, there is no
point in doing this AGAIN.

> #4 same answer of #2: if/when... we can easily fix things in just
> one place. Until then, we are just pessimizing things for no
> good reason.
>
> In terms of API, since we both agree that we need one (and we do
> need one, otherwise driver writers will just go ahead and do their
> own, replicating work in each driver, as I am sure I^hthey will
> start doing now), may I suggest that we export two -- one where the
> caller is assumed to already hold a lock on the CPU it is running
> on, and one which also does the necessary locking upfront and calls
> the latter afterwards? This way the former will be usable right
> now and efficiently from all contexts which run under Giant, and
> easy to upgrade when/if we need it.
>
> And, I am sorry to say... you raise some valid points, but without
> performance numbers to justify them, it is hard to tell how relevant
> they are in practice.
>
> I have made too many mistakes of this kind myself to believe my
> intuition.
>
> cheers
> luigi
>
> [Bosko's earlier message, with points (1) through (4), quoted in full
> above; trimmed.]

-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]
Re: virtually contig jumbo mbufs (was Re: new zero copy sockets snapshot)
[ -current trimmed ]

On Fri, Jul 05, 2002 at 08:08:47AM -0400, Andrew Gallatin wrote:
> Would this be easier or harder than simple, physically contiguous
> buffers? I think that it's only worth doing if it's easier to manage
> at the system level, otherwise you might as well use physically
> contiguous mbufs. My main goal is to see the per-driver caches of
> physical memory disappear ;)

It would be much easier. The problem with getting physically contiguous
memory is that shortly after the system gets going, memory becomes
fragmented, so from the system's perspective it's very hard to get
physically contiguous pages. This is why you see most drivers that
actually do this sort of thing pre-allocate a pool of such beasts early
on during boot-up. Unfortunately, this means that they'll be eating up
a lot of memory (some of which may stay unused forever) at a very early
stage.

As for the guarantee that the data region will start at a page
boundary: yes, I can ensure that, as long as you don't tamper with
offsetting the m_data field of the mbuf after the allocator hands it to
you. That is, I can guarantee this:

  [ mbuf       ]
  [            ]
  [ m_data ----]--->[ jumbo buf ]
                    [ (page 1)  ]
                    [-----------]
                    [ (page 2)  ]
                    [-----------]
                    [ (page 3)  ]

So, as you see, it is definitely do-able to have all the jumbo bufs
start at a page boundary; however, it may be more worthwhile to have
some of them start later. We would have to implement it first, and we
would have to do some measurements to see what really works best.

What I hate about jumbo bufs is that there's a lot of wastage in the
last (3rd) page. I can't exactly use the last half of that last page
for, say, a 2K cluster, because then I wouldn't be respecting the
bucket "infrastructure" in mb_alloc that allows easy implementation of
page freeing. Pretty much the only "realistic" thing I could do is
allocate jumbo bufs in groups of 3 or 4; this would ensure much less
wastage but would mean that not all of them would start at page
boundaries.
> Drew Regards, -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: virtually contig jumbo mbufs (was Re: new zero copy sockets snapshot)
On Fri, Jul 05, 2002 at 10:14:05AM -0400, Andrew Gallatin wrote: > I think this would be fine, But we'd need to know more about the > hardware limitations of the popular GiGE boards out there. We know > Tigon-II can handle 4 scatters, but are there any that can handle 3 > but not four? Why would you need 4? I can absolutely guarantee that a jumbo buf will not go over 3 pages (on i386, and 2 pages on alpha). > Drew Regards, -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: virtually contig jumbo mbufs (was Re: new zero copy sockets snapshot)
On Fri, Jul 05, 2002 at 10:45:50AM -0400, Andrew Gallatin wrote: > > Bosko Milekic writes: > > > > On Fri, Jul 05, 2002 at 10:14:05AM -0400, Andrew Gallatin wrote: > > > I think this would be fine, But we'd need to know more about the > > > hardware limitations of the popular GiGE boards out there. We know > > > Tigon-II can handle 4 scatters, but are there any that can handle 3 > > > but not four? > > > > Why would you need 4? I can absolutely guarantee that a jumbo buf > > will not go over 3 pages (on i386, and 2 pages on alpha). > > > > Perhaps I misread your earlier message. I thought you wanted to be > free to align them on something other than a page boundary, so as to > not waste so much space. In that case you'd need 4 scatters on i386, > right?

All I meant was that I wouldn't guarantee that the buffer _begins_ at a page boundary. However, it should still span at most 3 pages on i386.

Let's say for the sake of argument that we won't need more than 9212 bytes for a jumbo buffer. Let's make the buffer 4 bytes bigger so that we'll have space for a ref. counter for it too (how convenient), so we'll allocate buffers of 9212 + 4 = 9216 bytes. It's time to allocate from the map: I'll try to grab 9 pages for a total of 4 jumbo buffers, each spanning at most 3 pages (this is i386). I'll start the first buffer at the beginning of the first page, and it'll end ((9216 - 8192) = 1024) bytes into the third page. I'll then start the second jumbo buffer there and finish it 2048 bytes into the 5th page. I would then start the 3rd jumbo buffer there and finish it 3072 bytes into the 7th page. The last jumbo buf would start there and finish at the end of the 9th page.

If my calculations are correct, that means that each jumbo buffer would span exactly 3 pages, but only the first one in the grouping would begin at a page boundary. There is also minimal wastage here. 
Briefly: assuming that size_of_buf >= 2 * PAGE_SIZE, the above amounts to:

    PAGE_SIZE % (size_of_buf - 2 * PAGE_SIZE) == 0

We pick size_of_buf as small as possible but large enough to accommodate the MTU and, if possible, the ref. counter. I think 9216 should be a good number. At the same time, we try to make:

    bufs_per_bucket = PAGE_SIZE (integer DIV) (size_of_buf - 2 * PAGE_SIZE)

fairly reasonable (i.e., not too big). In the above scenario, size_of_buf = 9216 and bufs_per_bucket = 4. Admittedly, the numbers above were cooked up, but I think that they're fairly reasonable. > Drew Cheers, -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: virtually contig jumbo mbufs (was Re: new zero copy sockets snapshot)
On Fri, Jul 05, 2002 at 09:45:01AM -0700, John Polstra wrote: > The BCM570x chips (bge driver) definitely need a single physically > contiguous buffer for each received packet. This is totally ridiculous for gigE hardware, IMO. Do you know of other cards that can't do scatter gather DMA? > John > -- > John Polstra > John D. Polstra & Co., Inc.Seattle, Washington USA > "Disappointment is a good sign of basic intelligence." -- Chögyam Trungpa Regards, -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: the incredible shrinking socket
On Mon, Jul 08, 2002 at 10:16:36AM -0700, Luigi Rizzo wrote: > On Mon, Jul 08, 2002 at 11:56:22AM -0500, Mike Silbersack wrote: > ... > > Speaking of competition, someone should go look at this: > > > > http://mail-index.netbsd.org/current-users/2002/07/03/0011.html > > UDP sockets have the same problem... i posted patches for that > case around dec.2000 which i never ended up committing. I spent a half-hour trying to dig for that thread. Do you recall what the subject of it was? When I saw this come up on DaemonNews, the first thing that came to mind was seeing that discussion; but now I can't find it through the mailing list search thing on the website. :-( > cheers > luigi > > > Sounds interesting. > > > > Mike "Silby" Silbersack Cheers, -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: the incredible shrinking socket
On Mon, Jul 08, 2002 at 12:37:57PM -0500, Jonathan Lemon wrote: > On Mon, Jul 08, 2002 at 10:29:37AM -0700, Luigi Rizzo wrote: > > On Mon, Jul 08, 2002 at 01:22:14PM -0400, Bosko Milekic wrote: > > ... > > > > > Speaking of competition, someone should go look at this: > > > > > > > > > > http://mail-index.netbsd.org/current-users/2002/07/03/0011.html > > > > > > > > UDP sockets have the same problem... i posted patches for that > > > > case around dec.2000 which i never ended up committing. > > > > > > I spent a half-hour trying to dig for that thread. Do you recall what > > > the subject of it was? When I saw this come up on DaemonNews, the > > > > it was "[patch] fast sbappend*, please try..." > > see > > > > >http://docs.freebsd.org/cgi/getmsg.cgi?fetch=366972+0+archive/2001/freebsd-net/20010211.freebsd-net > > > > jlemon had an amended patch for that. > > I think we should revisit this keeping in mind the tcp case as well. > > I still have the amended patch in my tree, I'll dig it out this week. Luigi also mentioned at the end of the discussion that it would be worthwhile to - besides just keeping a pointer to the last mbuf in the sockbuf - keep a pointer to the last mbuf in the packet. Maybe this pointer could be stashed in the m_pkthdr struct. > -- > Jonathan Regards, -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
mbuf external buffer reference counters
Hi,

Right now, in -CURRENT, there is this hack that I introduced that basically just allocates a ref. counter for external buffers attached to mbufs with malloc(9). What this means is that if you do something like allocate an mbuf and then a cluster, there's a malloc() call that is made to allocate a small (usually 4-byte) reference counter for it. That sucks, and even -STABLE doesn't do this.

I changed it this way a long time ago for simplicity's sake, and since then I've been meaning to do something better here. The idea was, for mbuf CLUSTERS, to stash the counter at the end of the 2K buffer area, and to make MCLBYTES = 2048 - sizeof(refcount), which should be more than enough, theoretically, for all cluster users. This is by far the easiest solution (I had it implemented about 10 months ago) and it worked great.

The purpose of this email is to find out if anyone has concrete information on why this wouldn't work (if they think it wouldn't). So, if someone has an example of some broken code somewhere that wouldn't like this, please point it out to me now, before I go off and do this again and commit it.

Thanks, -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: mbuf external buffer reference counters
On Thu, Jul 11, 2002 at 01:38:02PM -0700, Luigi Rizzo wrote: > Hi, > > certainly removing the malloc will improve performance a lot. > > As I already mentioned to Bosko, in principle the available area > in ext.buffers is irrelevant, and i do not believe this will break > anything (and if it does, it will be easy to fix in the kernel), > but some applications might decide to do writes in multiple of 1K > and trimming away the refcount area might easily result in suboptimal > allocation of storage within the kernel. Can you elaborate on the sub-optimal performance comment with, perhaps, an example? I'm sorry but I'm sometimes slow to understand these days (lack of sleep). Just recently, I was wondering why a 10megabit card was being outperformed by a 100megabit card, where the first was plugged into a 10/100 dual speed hub. D'oh. :-( > I'd probably try to use the same technique as -stable (i.e. have > a separate array for all refcounts), if that is not too bad > from other points of view. The problem with this approach is that I'm probably going to be allocating jumbo bufs from the same map, in which case you would have huge `gaps' in your address <-> ref. count location map and, as a result, a lot of wastage. Plus the jumbo bufs will actually store the counter at the end of themselves, so it would be nice if clusters did the same. I don't mind either way, but it's the first reason that compels me to stash them at the end of the cluster.
I changed it this way > > a long time ago for simplicity's sake and since then I've been meaning > > to do something better here. The idea was, for mbuf CLUSTERS, to > > stash the counter at the end of the 2K buffer area, and to make > > MCLBYTES = 2048 - sizeof(refcount), which should be more than enough, > > theoretically, for all cluster users. This is by far the easiest > > solution (I had it implemented about 10 months ago) and it worked > > great. > > > > The purpose of this Email is to find out if anyone has concrete > > information on why this wouldn't work (if they think it wouldn't). > > So, if someone has an example of some broken code somewhere that > > wouldn't like this, please point it out to me now before I go off and > > do this again and commit it. > > > > Thanks, > > -- > > Bosko Milekic > > [EMAIL PROTECTED] > > [EMAIL PROTECTED] Thanks for your feedback. -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: mbuf external buffer reference counters
On Thu, Jul 11, 2002 at 01:56:08PM -0700, Luigi Rizzo wrote: > example: userland does an 8KB write, in the old case this requires > 4 clusters, with the new one you end up using 4 clusters and stuff > the remaining 16 bytes in a regular mbuf, then depending on the > relative producer-consumer speed the next write will try to fill > the mbuf and attach a new cluster, and so on... and when TCP hits > these data-in-mbuf blocks will have to copy rather than reference > the data blocks... > > Maybe it is irrelevant for performance, maybe it is not, > i am not sure. I see what you're saying. I think that what this means is simply that the `optimal' chunk of data to send is just a different size, so instead of it being 8192 bytes, it'll be something like 8176 bytes (8192 - 16, to account for the counters). So, in other words, it really depends on the frequency of exact 8192-byte sends in userland applications. This is a good observation if we're going to be doing benchmarking, but I'm not sure whether the repercussions are that important (unless, as I said, there are a lot of applications that send exactly 8192-byte chunks?). Basically, what we're doing is shifting the optimal send size when using exactly 4 clusters, in this case, to (8192 - 16) bytes. We can still send with exactly 4 clusters, it's just that the optimal send size is a little different, that's all (this produces a small shift in block send benchmark curves, usually). > > The problem with this approach is that I'm probably going to be > > allocating jumbo bufs from the same map, in which case you would have > > huge `gaps' in your address <-> ref. count location map and, as a > > how huge ? and do you really need to use the same map rather than > two different ones ? Well, I can use a different map, I guess (I use a different map for mbufs in order to not let huge cluster allocations eat up all of the address space reserved for mbufs). 
However, it seems that jumbo bufs and clusters are logically equivalent (your driver will either use one or the other) so it would make sense to have them share the same `chunk' of address space. As for the gaps, they are quite huge. I think we calculated a week or so ago when discussing jumbo bufs that we would probably end up allocating them in chunks of 3 or 4 at a time. So that would mean at least ~9 page 'holes' in the address space from which clusters are allocated, so that would mean ~18 counters wasted, at least, for every hole. With the number of jumbo bufs we would have, that can really add up. > cheers > luigi > -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: mbuf external buffer reference counters
On Thu, Jul 11, 2002 at 11:00:56PM +0200, Juan Francisco Rodriguez Hervella wrote: > First of all, let me say that Im newbie with these topics, > I only know a bit about mbufs, but I dont understand how > can an application trim away the refcount if the size is > MCLBYTES = 2040 - sizeof(refcount). > > Don't this limit the room which an app. can write ? No. It just changes it. Also, it's not the application trimming away anything, and the size is not 2040, it's 2048. :-) FWIW, the application has no idea about mbufs or clusters. These are buffers and structures only the kernel knows about. > Thanks. > > JFRH -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: mbuf external buffer reference counters
On Thu, Jul 11, 2002 at 07:31:17PM -0400, Kelly Yancey wrote: > > This is a good observation if we're going to be doing benchmarking, > > but I'm not sure whether the repercussions are that important (unless, > > as I said, there's a lot of applications that send exactly 8192 > > byte chunks?). Basically, what we're doing is shifting the optimal > > send size when using exactly 4 clusters, in this case, to (8192 - 16) > > bytes. We can still send with exactly 4 clusters, it's just that the > > optimal send size is a little different, that's all (this produces a > > small shift in block send benchmark curves, usually). > > > > Are you kidding? Benchmarks, presumably like every other piece of > software produced by someone trying to get the most performance out of > the system, are more likely to have power-of-two write buffers. Are you > willing to risk that they didn't also just happen to pick a multiple of > 2^11? > > Yes, it seems elegant to put the counters in the space that is normally > unused for receive mbuf clusters, but you can't just blow off Luigi's > point regarding the send side. First of all, I'm not "blowing off" anyone's comments. I don't appreciate the fact that you're eagerly instructing me to "not blow off comments" (which I didn't do to begin with) without providing any more constructive feedback. All I pointed out was that the optimal block size is merely changed from an exact 2k, 4k, 8k, etc. to something slightly smaller. What point are *you* trying to put across? Tell me what's bad about that or, better: Do you have a better suggestion to make? What do *you* suggest we do with the external ref. counts? Please, spare me the flame bait. I wasn't being confrontational when I answered Luigi's post and I don't need anyone turning this into something confrontational. Thanks. 
> Kelly > > -- > Kelly Yancey -- kbyanc@{posi.net,FreeBSD.org} -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: mbuf external buffer reference counters
On Thu, Jul 11, 2002 at 04:10:32PM -0700, Julian Elischer wrote: > Don't forget that "external" does not neccesarily mean "cluster". > I still consider the method used in (hmm was it NetBSD or OSF/1?) > to be very good.. > > mbufs that referred to the same object were linked together. > I forget the details exactly. maybe someone else can remember.. > it did it without ref counts somehow.. Yes, this is in NetBSD still and it is very elegant. I remember looking at this a long time ago but to be honest, the reason I didn't implement it then first escaped me. However, thanks to David Malone's awesome commit messages, I found it: rev 1.53 of sys/sys/mbuf.h, extract: [...] "NetBSD's system of linked lists of mbufs was cosidered, but Alfred felt it would have locking issues when the kernel was made more SMP friendly." [...] I think it's almost clear now that there are, in fact, no SMP issues with it (we don't do per-cluster locking, or anything ridiculous like that), so unless Alfred has the reason again, I'll consider that method again instead. Thanks for the constructive feedback. Regards, -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: mbuf external buffer reference counters
I'm sorry. I should have waited before hitting the "send" button. I've had a long and [shitty] day and I shouldn't have blown up here. Sorry. -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: mbuf external buffer reference counters
On Fri, Jul 12, 2002 at 12:10:41AM -0700, Alfred Perlstein wrote: > * Julian Elischer <[EMAIL PROTECTED]> [020712 00:00] wrote: > > > > > > On Thu, 11 Jul 2002, Alfred Perlstein wrote: > > > > > > That's true, but could someone explain how one can safely and > > > effeciently manipulate such a structure in an SMP environment? > > > > what does NetBSD do for that? > > They don't! > > *** waves skull staff exasperatedly *** > > RORWLRLRLLRL Again, Alfred is right. :-) I can't think of a way to ensure that the owner of the other mbuf doesn't manipulate its two forward/backward pointers while we're manipulating ours. The only way that springs to mind is to have them protected by a mutex, but: 1) that would be very expensive and would bloat the mbuf structure a LOT; 2) we would probably run into lock order reversal problems. I see now what Alfred meant when he made his original comment. -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: mbuf external buffer reference counters
On Fri, Jul 12, 2002 at 04:26:53AM -0700, Jon Mini wrote: > On Thu, Jul 11, 2002 at 11:41:04PM -0700, Alfred Perlstein wrote: > > > That's a cool idea.. haven't looked at NetBSD but am imagining the > > > mbufs would be linked in a 'ring'. This works because you never > > > care how many references are, just whether there's one or more than > > > one, and this is easy to tell by examining the ring pointer. > > > I.e., you never have to iterate through the entire ring. > > > > That's true, but could someone explain how one can safely and > > effeciently manipulate such a structure in an SMP environment? > > > > I'm not saying it's impossible, I'm just saying it didn't seem > > intuative to me back then, as well as now. > > I'm probably speaking out of turn here (I have no idea what structure you > all are talking about), but a monodirectional ring can be safely modified > with a compare-and-exchange atomic operation.

The gist of the problem is that when you want to, say, remove yourself from the list, you have to:

1) set your "next"'s back pointer to your "back" pointer;
2) set your "prev"'s next pointer to your "next" pointer.

So that's two operations, but for all you know your "next" or your "back" may be doing the same thing to you at the same time. As far as I know, you can't (intuitively) figure out a way to do both of these atomically. I.e., maybe you'll set your next's back pointer to whatever you have in `back', but then your `back' guy will set your back pointer to whatever he has in `back', and then your next guy's back pointer will be invalid, for example.

So I guess that what we're dealing with isn't really a "monodirectional" ring. Right? > -- > Jonathan Mini <[EMAIL PROTECTED]> > http://www.freebsd.org/ Regards, -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: mbuf external buffer reference counters
On Fri, Jul 12, 2002 at 11:03:45AM -0700, John Polstra wrote: > I've been out of town and I realize I'm coming into this thread late > and that it has evolved a bit. But I still think it's worthwhile to > point out a very big problem with the idea of putting the reference > count at the end of each mbuf cluster. It would have disastrous > consequences for performance because of cache effects. Bear with me > through a little bit of arithmetic. > > Consider a typical PIII CPU that has a 256 kbyte 4-way set-associative > L2 cache with 32-byte cache lines. 4-way means that there are 4 > different cache lines associated with each address. Each group of 4 > is called a set, and each set covers 32 bytes of the address space > (the cache line size). > > The total number of sets is: > > 256 kbytes / 32 bytes per line / 4 lines per set = 2048 sets > > and as mentioned above, each set covers 32 bytes. > > The cache wraps around every 256 kbytes / 4-way = 64 kbytes of address > space. In other words, if address N maps onto a given set, then > addresses N + 64k, N + 128k, etc. all map onto the same set. > > An mbuf cluster is 2 kbytes and all mbuf clusters are well-aligned. > So the wrap around of the cache occurs every 64 kbytes / 2 kbytes per > cluster = 32 clusters. To put it another way, all of the reference > counts would be sharing (i.e., competing for) the same 32 cache sets > and they would never utilize the remaining 2016 sets at all. Only > 1.56% of the cache (32 sets / 2048 sets) would be usable for the > reference counts. This means there would be a lot of cache misses as > reference count updates caused other reference counts to be flushed > from the cache. > > These cache effects are huge, and they are growing all the time as CPU > speeds increase while RAM speeds remain relatively constant. I've thought about the cache issue with regards to the ref. counts before, actually, and initially, I also thought the exact same thing as you bring up here. 
However, there are a few things you need to remember:

1) SMP: counters are typically referenced by several different threads which may be running on different CPUs at any given point in time, and this means that we'll probably end up having the corresponding cache lines invalidated back and forth anyway;

2) Using more cache lines may not be better overall; we may be doing write-backs of other data already there. In any case, we would really have to measure this;

3) By far the most important: all modifications to the ref. count are atomic, bus-locked ops. I spoke to Peter a little about this and although I'm not 100% sure, we think that bus-locked fetch-inc/dec-stores need the bus anyway. If that's the case, then we really don't care about whether or not they get cached, right?

> John > -- > John Polstra > John D. Polstra & Co., Inc. Seattle, Washington USA > "Disappointment is a good sign of basic intelligence." -- Chögyam Trungpa Thanks for the cool info and feedback. Regards, -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: mbuf external buffer reference counters
On Fri, Jul 12, 2002 at 06:55:37PM -0400, Andrew Gallatin wrote: [...] FWIW, BSD/OS also does similar to -STABLE. [...] > I agree with John about where to put the refcnts: I think we should > have a big hunk of memory for the refcnts like in -stable. My > understanding is that the larger virtually contig mbufs are the only > thing that would cause a problem for this, or is that incorrect? > If so, then why not just put their counter elsewhere? > > One concrete example against putting the refcnts into the cluster is > that it would cause NFS servers & clients to use 25% more mbufs for a > typical 8K read or write request. If we decide to allocate jumbo bufs from their own separate map as well, then we have no wastage for the counters for clusters if we keep them in a few pages, like in -STABLE, and it should all work out fine. For the jumbo bufs I still maintain that we should keep the counter for them at the end of the buf, because the math works out (see my post in that thread with the math example) and because their total size is not a power of 2 anyway. They'll also be more randomly spread out and use more cache slots. > Drew Regards, -- Bosko Milekic [EMAIL PROTECTED] [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: Rename of MSIZE kernel option..
Not that my opinion really holds much weight with you guys but for what it's worth, I think the change would be gratuitous. 1. MSIZE has been around forever. 2. The argument that sys/sys/mbuf.h should have MSIZE removed/changed because some other code may use it is fallacious. The "other code" should be careful not to use a constant that has been so named for the longest time I can recall. On Mon, Oct 14, 2002 at 04:15:12PM -0400, John Baldwin wrote: > Would people be open to renaming the 'MSIZE' kernel option to something > more specific such as 'MBUF_SIZE' or 'MBUFSIZE'? Using 'MSIZE' can > break other places in the kernel. For example, ISA device ivars have > an ivar for the size of a memory resource called 'MSIZE' and the kernel > option causes breakage in src/sys/isa/isavar.h: > > ISA_ACCESSOR(msize, MSIZE, int) > > when ISA_ACCESSOR is properly defined via __BUS_ACCESSOR() rather than > homerolling a private copy of __BUS_ACCESSOR(). For now I've fixed it > to rename the ISA ivar to ISA_IVAR_MEMSIZE but MSIZE seems to be too > generic a name for a kernel option and it would be nice to avoid this > problem in the future. > > -- > > John Baldwin <[EMAIL PROTECTED]> <>< http://www.FreeBSD.org/~jhb/ > "Power Users Use the Power to Serve!" - http://www.FreeBSD.org/ -- Bosko Milekic * [EMAIL PROTECTED] * [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: @stake advisory: etherleak
On Tue, Jan 07, 2003 at 02:15:02PM -0800, Nate Lawson wrote: > The short of it is that if a tx packet is < 64 bytes (min ethernet frame > len), data can be leaked if the driver transmits 64 bytes. It seems our > use of mbufs would prevent leakage but I haven't examined any drivers to > verify this. > > http://www.atstake.com/research/advisories/2003/atstake_etherleak_report.pdf > > -Nate Heh! This seems like a pretty long paper for what it's worth. I can see how we could have the same "problem" but keep in mind the scope of the issue (the paper blows it out of proportion, I think, for someone who may not necessarily understand the issue too well): we're talking about very small packets here, under 64 bytes. So the "leaked information" will be AT MOST 63 bytes, but practically somewhere around 10 bytes and would be very occasional. Not to mention that it's totally undefined and random. An "attacker" might as well just rely on temperature to guess at how to interpret what he/she's seeing in those few bytes. The data in our case is probably DMA'd straight out of the mbuf's data region so what you'll probably find in there is just randomness from something before, not necessarily network data. Moreover, as the paper does mention, although the RFCs do appear to specify that for packets that are padded to the minimum ethernet packet size the padded bytes should be zeroed, it's pretty ambiguous as to where the zero-ing should happen. As noted, some cards will do the zeroing onboard. If you really see this as a 'security problem' then feel free to zero the padded bytes manually in the problematic drivers. Doing it universally would be too much in my opinion, especially given that some hardware takes care of it itself. -- Bosko Milekic * [EMAIL PROTECTED] * [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: @stake advisory: etherleak
On Tue, Jan 07, 2003 at 06:09:23PM -0800, Nate Lawson wrote: > On Tue, 7 Jan 2003, Bosko Milekic wrote: > > On Tue, Jan 07, 2003 at 02:15:02PM -0800, Nate Lawson wrote: > > Not to mention that > > it's totally undefined and random. > > Well, you have the guarantee that it's network data since the min mbuf > size is 128 bytes and a packet must be less than 64 bytes to trigger this. No, not really. Yes, the mbufs come from a vmmap that is reserved for mbufs and clusters but it doesn't mean that the underlying physical memory contains network data, necessarily. It is the case if you're getting an mbuf from the allocator cache, but as soon as you make that call to vm you no longer have that guarantee. > > An "attacker" might as well just > > rely on temperature to guess at how to interpret what he/she's seeing > > in those few bytes. The data in our case is probably DMA'd straight > > out of the mbuf's data region so what you'll probably find in there is > > just randomness from something before, not necessarily network data. > > Since the mbuf pool is statically allocated at boot, it's likely only mbuf > hdrs or contents would leak this way. Still, this is data leakage even > though it's a small channel. No. You are leaking contents but it is not necessarily network data. It may be more often than not, but there is no way to determine what the nature/source of the data is remotely. > > Moreover, as the paper does mention, although the RFCs do appear to > > specify that for packets that are padded to the minimum ethernet > > packet size the padded bytes should be zeroed, it's pretty ambiguous > > as to where the zero-ing should happen. As noted, some cards will do > > the zeroing onboard. > > For cards that don't, it's the driver's responsibility to do this. Some > do (xe) and some don't (ie). > > > If you really see this as a 'security problem' then feel free to zero > > the padded bytes manually in the problematic drivers. 
Doing it > > universally would be too much in my opinion, especially given that > > some hardware takes care of it itself. > > I believe this should be done for the same reason that pages allocated to > the user are pre-zeroed. OK, but not in a universal fashion. I think drivers with this problem should implement it locally. Not all drivers have the problem because not all network cards "forget" to pad the data with zeros. > -Nate -- Bosko Milekic * [EMAIL PROTECTED] * [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: Performance tuning hints of gigabit networking?
You're not running out of mbufs or clusters, you're out of RAM. Don't bump up nmbclusters anymore because you don't need to; instead, add more RAM. On Wed, Feb 26, 2003 at 10:05:51PM +0900, CHOI Junho wrote: > > Hi, > > I am looking for a good resource for kernel tuning on very high > bandwidth HTTP servers (avg 500Mbit/sec, peak 950Mbit/sec). Today I > faced a very unusual situation with 950Mbit/sec bandwidth! > > > netstat -m > 16962/93488/262144 mbufs in use (current/peak/max): > 16962 mbufs allocated to data > 16952/65536/65536 mbuf clusters in use (current/peak/max) > 15 Kbytes allocated to network (14% of mb_map in use) > 512627 requests for memory denied > 2614 requests for memory delayed > 0 calls to protocol drain routines > > I set kern.ipc.nmbclusters=65536, but it overflowed. This is a P-IV Xeon > 1.8G, 2GB RAM, one Intel 1000baseSX (em driver) machine running > 4.7-RELEASE-pX. This server is running only one service, HTTP. I use > thttpd, since apache doesn't work under such a high load. thttpd is > amazing; it keeps the load average below 1 at all times. > > I tried to increase kern.ipc.nmbclusters to 131072 or > higher (a multiple of 65536 or 32768; tuning(7) only mentions the 32768 > case...), but the kernel failed to boot at 262144 and panicked under > somewhat higher load at 131072, so I gave up on further changes and fell > back to 65536. > > What is a good way to calculate this value safely? 
Here is another > hint, /etc/sysctl.conf: > > net.link.ether.inet.log_arp_wrong_iface=0 > kern.ipc.maxsockbuf=2048000 > kern.ipc.somaxconn=4096 > kern.ipc.maxsockets=6 > kern.maxfiles=65536 > kern.maxfilesperproc=32768 > net.inet.tcp.rfc1323=1 > net.inet.tcp.delayed_ack=0 > net.inet.tcp.sendspace=65535 > net.inet.tcp.recvspace=65535 > net.inet.udp.recvspace=65535 > net.inet.udp.maxdgram=57344 > net.inet.icmp.drop_redirect=1 > net.inet.icmp.log_redirect=1 > net.inet.ip.redirect=0 > net.inet6.ip6.redirect=0 > net.link.ether.inet.max_age=1200 > net.inet.ip.sourceroute=0 > net.inet.ip.accept_sourceroute=0 > net.inet.icmp.bmcastecho=0 > net.inet.icmp.maskrepl=0 > net.inet.tcp.inflight_enable=1 > > kernel configuration is not specially tuned, except DEVICE_POLLING and > HZ=2000. > > -- > CHOI Junho <http://www.kr.FreeBSD.org/~cjh> KFUG > FreeBSD Project Web Data Bank > Key fingerprint = 1369 7374 A45F F41A F3C0 07E3 4A01 C020 E602 60F5 > > To Unsubscribe: send mail to [EMAIL PROTECTED] > with "unsubscribe freebsd-net" in the body of the message > -- Bosko Milekic * [EMAIL PROTECTED] * [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
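For reference, on the 4.x line discussed in this thread, nmbclusters is best set as a boot-time loader tunable rather than a compile-time option. A minimal sketch follows; the value shown is illustrative, not a recommendation, and the KVA arithmetic assumes the standard 2 KB cluster size:

```
# /boot/loader.conf -- boot-time tunables (FreeBSD 4.x, values illustrative)
kern.ipc.nmbclusters="65536"   # 65536 x 2 KB = 128 MB of kernel address
                               # space reserved for clusters in mb_map
```

Each doubling doubles the reserved kernel address space, which is likely why very large values failed to boot for the poster: the reservation eventually collides with the limited kernel VA on i386.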
Re: mbuf cache
On Wed, Mar 05, 2003 at 01:12:55AM +0200, Petri Helenius wrote: > > > Any comments on the high cpu consumption of mb_free? Or any other places > > > where I should look to improve performance? > > > > What do you mean "high cpu consumption?" The common case of mb_free() > > is this: > > According to profiling, mb_free takes 18.9% of all time consumed in the kernel, > almost double the next most CPU-consuming function. Since I'm looking at how to > optimize the system, it's usually a good idea to start looking at where most CPU > is spent. > > For example, I'm wondering if mbufs get unnecessarily freed and allocated or thrown > around different buckets because the tunables are wrong. I'm not saying that > there must be something wrong with the code itself. > > For example, what does "in use" mean below? There is no way there is enough > traffic on the system to allocate 7075 mbufs when this netstat -m was taken. > > mbuf usage: > GEN cache: 0/0 (in use/in pool) > CPU #0 cache: 7075/8896 (in use/in pool) > CPU #1 cache: 1119/4864 (in use/in pool) > Total: 8194/13760 (in use/in pool) > Mbuf cache high watermark: 8192 > Mbuf cache low watermark: 128 > > > Pete This does look odd... maybe there's a leak somewhere... does "in use" go back down to a much lower number eventually? What kind of test are you running? "in pool" means that that's the number in the cache, while "in use" means that that's the number out of the cache, currently being used by the system; but if you're telling me that there's no way usage could be that high while you ran the netstat, either there's a serious leak somewhere or I got the stats wrong (anyone else notice irregular stats?). Another thing I find odd about those stats is that you've set the high watermark to 8192, which means that on the next free you should be moving buckets to the general cache... see if that's really happening... The low watermark doesn't affect anything right now. 
Can you give me more details on the exact type of test you're running? Also, let's move this to just -current instead of both -current and -net (feel free to trim whichever one you want); getting 3 copies of the same message all the time is kinda annoying. :-( -- Bosko Milekic * [EMAIL PROTECTED] * [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: mbuf cache
On Tue, Mar 04, 2003 at 11:34:11PM +0200, Petri Helenius wrote: > > I did some profiling on -CURRENT from a few days back, and I think I haven't > figured the new tunables out or the code is not doing what it's supposed to > or I'm asking more than it is supposed to do but it seems that mb_free > is being quite wasteful... > > Any pointers to how the new high/low watermark tunables should be used? > > Is it normal that after almost all traffic has been stopped there is still 8k+ > mbufs in "cache"? > > Pete Yes, it's normal. The commit log clearly states that the new watermarks do nothing for now. I have a patch that changes that but I haven't committed it yet because I left for vacation last Sunday and I only returned early this Monday. Since then, I've been too busy to clean up and commit it. In about a week or so you should notice a difference. -- Bosko Milekic * [EMAIL PROTECTED] * [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: mbuf cache
On Wed, Mar 05, 2003 at 12:24:27AM +0200, Petri Helenius wrote: > > > > Yes, it's normal. The commit log clearly states that the new > > watermarks do nothing for now. I have a patch that changes that but I > > haven't committed it yet because I left for vacation last Sunday and I > > only returned early this Monday. Since then, I've been too busy to > > clean up and commit it. In about a week or so you should notice a > > difference. > > > Any comments on the high cpu consumption of mb_free? Or any other places > where I should look to improve performance? What do you mean "high cpu consumption?" The common case of mb_free() is this:

	bucket = mb_list->ml_btable[MB_BUCKET_INDX(m, mb_list)];
	owner = bucket->mb_owner & ~MB_BUCKET_FREE;
	switch (owner) {
	...
	default:
		cnt_lst = MB_GET_PCPU_LIST_NUM(mb_list, owner);
		MB_PUT_OBJECT(m, bucket, cnt_lst);
		MB_MBTYPES_DEC(cnt_lst, type, 1);
		if (bucket->mb_owner & MB_BUCKET_FREE) {
			SLIST_INSERT_HEAD(&(cnt_lst->mb_cont.mc_bhead),
			    bucket, mb_blist);
			bucket->mb_owner = cnt_lst->mb_cont.mc_numowner;
		}
	}

That's the common path, modulo a couple of checks on whether the container being freed to needs to be moved or whether we're in a starvation situation. The only thing that could be done, possibly, is to make the routine smaller by moving those couple of actions to separate routines, but I'm not convinced that would be very useful, given that mb_free() is used only inside subr_mbuf.c. > Pete -- Bosko Milekic * [EMAIL PROTECTED] * [EMAIL PROTECTED] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-net" in the body of the message
Re: Zero Copy Sockets?
On Sun, Jun 22, 2003 at 12:46:26PM -0700, George V. Neville-Neil wrote: > Hi, > > I'm reading over the internals of the network stack in > -CURRENT and I'm wondering if the Zero Copy stuff is actually > in use yet. > > Thanks, > George Yes. But your driver needs to support it. Currently, the only driver that does is if_ti in src/sys/pci/if_ti*.[ch] sendfile(2) is also zero-copy for the most part (for the file part). That's been in use for a long time. src/sys/kern/uipc_syscalls.c Regards, -- Bosko Milekic * [EMAIL PROTECTED] * [EMAIL PROTECTED] TECHNOkRATIS Consulting Services * http://www.technokratis.com/ ___ [EMAIL PROTECTED] mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Fine grained locking at the socket level?
On Sun, Jun 22, 2003 at 04:58:54PM -0700, George V. Neville-Neil wrote: > Hi, > > It would seem that splnet() and friends now simply return 0, > which I figure is part of making the code look like it used > to. What I'm wondering is why the Giant lock is still used in > the socket layer? I thought sockets had had fine grained > locking applied to them. Am I confused? I'm looking at the > bits du jour (-CURRENT). > > Thanks, > George The short answer is: we're not done. The long answer is: we're not done. :-) We can't simply unwind Giant just anywhere yet because there is still code in other layers that requires Giant. Cheers, -- Bosko Milekic * [EMAIL PROTECTED] * [EMAIL PROTECTED] TECHNOkRATIS Consulting Services * http://www.technokratis.com/ ___ [EMAIL PROTECTED] mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Kernel tuning for large maxsockets
On Tue, Jul 15, 2003 at 05:22:55PM -0400, Scot Loach wrote: > Currently, whenever maxsockets is increased, this causes kernel memory to be > preallocated for each type of pcb (tcp, udp, raw, divert). The number of > pcbs preallocated for each of these is always the same as maxsockets. > > This is probably a waste of memory for raw sockets and divert sockets, since > they would not normally be used in large numbers. A large server could > save kvm by reducing the number of divert and raw pcbs preallocated. For > example, on a machine configured for 200,000 maxsockets we would save 75MB > of kvm out of a total of 262MB that would have been preallocated. This kvm > savings can be used to increase maxsockets even more. What version of FreeBSD are you talking about here? In -current, the pcbs come off of zones which are capped at a maximum w.r.t. maxsockets. The kva space comes out of kmem_map and the objects are kept cached in their respective zones. One thing to note is that the zones are set up with UMA_ZONE_NOFREE, so the pages wired down for pcb use are probably never unwired. I don't know why, exactly, UMA_ZONE_NOFREE is required for all of the pcb zones in the netinet code. > Is there any reason I should not modify the kernel code to only let a small, > fixed number of raw and divert pcbs be preallocated instead of having them > scale with maxsockets? > > Next, does this seem like a generally useful thing that could be rolled back > into the source tree? I could make this a kernel option or a tunable sysctl > variable. > > thanks > > Scot Loach -- Bosko Milekic * [EMAIL PROTECTED] * [EMAIL PROTECTED] TECHNOkRATIS Consulting Services * http://www.technokratis.com/ ___ [EMAIL PROTECTED] mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"