Network buffer allocations: mbuma, PLEASE TEST

2004-05-26 Thread Bosko Milekic
Hi,

  If you're running -CURRENT, please test this:

http://people.freebsd.org/~bmilekic/code/mbuma2.diff

  It consists of several extensions to UMA, with mbuf and cluster
  allocation built on top of them.

  Once you apply the patch from src/, you need to rebuild and
  reinstall src/usr.bin/netstat, src/usr.bin/systat, and then
  a new kernel.  When you're configuring your new kernel,
  you should remove the NMBCLUSTERS compile-time option; it's
  no longer needed.  Clusters will still be capped
  according to maxusers (which is auto-tuned itself).
  Alternatively, if you want a theoretically unlimited number
  of clusters, you can set the boot-time kern.ipc.nmbclusters
  tunable to zero.
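  For example, assuming the standard loader configuration file, the
  boot-time tunable could be set like this (illustrative fragment):

```
# /boot/loader.conf -- remove the cap on cluster allocations
# (0 = theoretically unlimited), per the instructions above.
kern.ipc.nmbclusters=0
```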

  Unless final issues arise, I'm going to commit this tomorrow
  morning; it has already been tested quite a bit, with performance
  taken into account.  A paper is available and was presented at
  BSDCan 2004, in case you missed it:

http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf

  It has been under review for quite some time now.  Some additional
  code cleanups may need to occur following the commit.
  Future work is also possible, see the paper if you're
  interested in taking some of it on.

  Oh, and keep me in the CC; I have no idea if I'm
  subscribed to these lists anymore.  You should also follow
  up to this thread on -net and not on -hackers (trim
  -hackers from CC in the future).  Thanks and happy
  hacking!

Regards,
--
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]
 
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: polling(4) rocks!

2004-11-17 Thread Bosko Milekic

On Wed, Nov 17, 2004 at 02:08:25PM -0500, Charles Swiger wrote:
> On Nov 17, 2004, at 1:52 PM, Eugene Grosbein wrote:
> >On Wed, Nov 17, 2004 at 09:13:51PM +0300, Yar Tikhiy wrote:
> >[ ...praise of polling(4)... ]
> >Does polling(4) increase latency? It is very important for a router
> >that handles lots of RTP (VoIP) traffic.
> 
> Using polling does increase the latency of the traffic, slightly: if 
> you follow the recommended configuration and set HZ=1000 or so, the 
> added latency typically ends up being around 1ms.
>
> That's not enough to affect VoIP significantly.  I've got 12 VoIP phone 
> lines going through a FreeBSD 4.10 firewall using polling(4) over a T1 
> now.  The firewall box is a Dell P3/400MHz or so using fxp and a 
> quad-port card using the DEC 21x4x chipset.

  Latency is only affected to the extent that, with interrupts, it
  would be lower than with polling.  If you plot a latency versus pps
  graph, for example, you might notice that with interrupts the curve
  keeps growing beyond a certain pps number, whereas with polling it
  flattens out.  This happens because with interrupts, more packets
  always means more interrupts (assuming your card's coalescing
  abilities have been exhausted, i.e., very high pps, all the time),
  and more interrupts means more latency.  With polling, more packets
  does not mean more interrupts, and the system will process them at a
  more or less constant rate.

> -- 
> -Chuck

-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]


Re: Dingo and PerForce

2004-12-19 Thread Bosko Milekic

This way he can push changes back into 'dingo' before pushing them back
into HEAD.  It makes a lot of sense, actually, to do this sort of 3-tier
approach when multiple people are working on different parts of a larger
sub-project, like for example dingo.

HEAD (top, must work) <-- (dingo, pre-production, test all changes
together) <-- (your individual branch, single set of changes).

You develop in your individual branch, test your changes.  If all is
well, you push into dingo, where changes get tested with respect to
other dingo-related changes (which have not yet been pushed into HEAD).
When it's all ready, everything gets pushed at once into HEAD, or in
pieces, but you know that the individual pieces work well together.
This is because dingo changes less often than HEAD.

-Bosko

On Sun, Dec 19, 2004 at 06:20:17PM +0100, Max Laier wrote:
> On Sunday 19 December 2004 05:23, [EMAIL PROTECTED] wrote:
> > For those who use PerForce and want to work on Dingo there is
> > now a dingo branch, named "dingo".  The dingo branch contains
> > all of src, not just sys, as I suspect there are userland bits
> > we'll want to do.  I know I'll be doing userland things.
> > Please make your own sub-branch off of dingo.  For example,
> > here is mine (gnn_dingo):
> <...>
> > The main branch (dingo) is locked against changes and, more
> > importantly, accidental deletion.  Let me know if there are any issues
> > with any of this.
> >
> > Oh, and the integration from vendor to dingo was done last Saturday
> > night around 23:00 JST.
> 
> I am not all that familiar with perforce so I am wondering: What is the 
> benefit of having this branch if it will be tracking HEAD anyway, we can just 
> branch off the normal vendor branch. Or do you plan to integrate parts back?
> 
> I have 'mlaier_carp2' off the vendor branch. What will be the benefit if I 
> move it to 'dingo' instead?
> 
> -- 
> /"\  Best regards,  | [EMAIL PROTECTED]
> \ /  Max Laier      | ICQ #67774661
>  X   http://pf4freebsd.love2party.net/  | [EMAIL PROTECTED]
> / \  ASCII Ribbon Campaign  | Against HTML Mail and News



-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]


Re: Network Emulation Software [recommendations please ?]

2005-01-21 Thread Bosko Milekic

On Fri, Jan 21, 2005 at 11:14:48AM +1030, Wilkinson, Alex wrote:
> Hi all,
> 
> I would like to find some network simulation software that is similar but
> better than NIST Net [http://www-x.antd.nist.gov/nistnet/].
> 
> Can anyone recommend any network simulation software that "is a
> general-purpose tool for emulating performance dynamics in IP
> networks".
> 
> Cheers
> 
>  - aW

  If you are open to commercial [non-free] solutions, take a look at
  OPNet Modeler (http://www.opnet.com/products/modeler/home.html). 

-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]


Re: Race condition in mb_free_ext()?

2005-03-01 Thread Bosko Milekic
On Tue, 1 Mar 2005 13:40:18 -0500, John Baldwin <[EMAIL PROTECTED]> wrote:
> On Monday 28 February 2005 07:04 pm, Kris Kennaway wrote:
> > I'm seeing an easily-provoked livelock on quad-CPU sparc64 machines
> > running RELENG_5.  It's hard to get a good trace because the processes
> > running on other CPUs cannot be traced from DDB, but I've been lucky a
> > few times:
> >
> > db> show alllocks
> > Process 15 (swi1: net) thread 0xf8001fb07480 (18)
> > exclusive sleep mutex so_snd r = 0 (0xf800178432a8) locked @
> > netinet/tcp_input.c:2189 exclusive sleep mutex inp (tcpinp) r = 0
> > (0xf800155c3b08) locked @ netinet/tcp_input.c:744 exclusive sleep mutex
> > tcp r = 0 (0xc0bdf788) locked @ netinet/tcp_input.c:617 db> wh 15
> > Tracing pid 15 tid 18 td 0xf8001fb07480
> > sab_intr() at sab_intr+0x40
> > psycho_intr_stub() at psycho_intr_stub+0x8
> > intr_fast() at intr_fast+0x88
> > -- interrupt level=0xd pil=0 %o7=0xc01a0040 --
> > mb_free_ext() at mb_free_ext+0x28
> > sbdrop_locked() at sbdrop_locked+0x19c
> > tcp_input() at tcp_input+0x2aa0
> > ip_input() at ip_input+0x964
> > netisr_processqueue() at netisr_processqueue+0x7c
> > swi_net() at swi_net+0x120
> > ithread_loop() at ithread_loop+0x24c
> > fork_exit() at fork_exit+0xd4
> > fork_trampoline() at fork_trampoline+0x8
> > db>
> >
> > That code is here in mb_free_ext():
> >
> > /*
> >  * This is tricky.  We need to make sure to decrement the
> >  * refcount in a safe way but to also clean up if we're the
> >  * last reference.  This method seems to do it without race.
> >  */
> > while (dofree == 0) {
> > cnt = *(m->m_ext.ref_cnt);
> > if (atomic_cmpset_int(m->m_ext.ref_cnt, cnt, cnt - 1)) {
> > if (cnt == 1)
> > dofree = 1;
> > break;
> > }
> > }
> 
> Well, this is obtuse at least.  A simpler version would be:
> 
> do {
> cnt = *m->m_ext.ref_cnt;
> } while (atomic_cmpset_int(m->m_ext.ref_cnt, cnt, cnt - 1) == 0);
> dofree = (cnt == 1);
> 
> --
> John Baldwin <[EMAIL PROTECTED]>  <><  http://www.FreeBSD.org/~jhb/
> "Power Users Use the Power to Serve"  =  http://www.FreeBSD.org

Your suggestion will always enter the loop and do the atomic
regardless of what dofree is set to above that code (not shown in
Kris' paste):

[...]
/* Account for lazy ref count assign. */
if (m->m_ext.ref_cnt == NULL)
        dofree = 1;
else
        dofree = 0;

/*
 * This is tricky.  We need to make sure to decrement the
 * refcount in a safe way but to also clean up if we're the
 * last reference.  This method seems to do it without race.
 */
[...]

The segment could still be reworked, but anyway:

This does not appear to explain the livelock.  What does
m->m_ext.ref_cnt point to?  And what is the value at the location it
points to?  Regardless, though, the livelock itself, assuming it is
due to a long time being spent spinning in the above loop, should not
be caused by underruns or overruns of the reference count (those may
only cause leaking of the cluster).

Furthermore, the above code has been around in that form for some time
now, and in fact the loop was probably entered *more* often in the past
(before the 'dofree' variable was introduced there).  Since when have
you been able to trigger the livelock, and are you sure it is
mb_free_ext() that is looping indefinitely?

I do not know sparc64 well, but what are the semantics of
atomic_cmpset_int()? I see that it is defined to use the 'casa'
instruction; does atomic_cmpset_int() behave the same way as it does
on i386?
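For reference, John's simplified loop can be exercised outside the kernel.  The sketch below is an illustrative stand-alone version: GCC's __sync_bool_compare_and_swap builtin, the function names, and the user-space framing are all assumptions for illustration, not the actual kernel code.

```c
#include <assert.h>

/* User-space stand-in for FreeBSD's atomic_cmpset_int(): compare *p to
 * 'old' and, if equal, atomically store 'new'; nonzero on success. */
static int
cmpset(volatile unsigned int *p, unsigned int old, unsigned int new)
{
	return (__sync_bool_compare_and_swap(p, old, new));
}

/*
 * The simplified decrement loop from the thread: retry the CAS until
 * the observed value is successfully replaced by value - 1.  Returns
 * 1 when the caller held the last reference (observed count was 1).
 */
int
ref_release(volatile unsigned int *ref)
{
	unsigned int cnt;

	do {
		cnt = *ref;		/* volatile read on each retry */
	} while (cmpset(ref, cnt, cnt - 1) == 0);
	return (cnt == 1);
}
```

Note the volatile-qualified pointer: without it, the compiler may hoist the read out of the retry loop, which is exactly the mis-optimization identified later in this thread.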

-Bosko 

-- 
Bosko Milekic - If I were a number, I'd be irrational.
Contact Info: http://bmilekic.unixdaemons.com/contact.txt


Re: Race condition in mb_free_ext()?

2005-03-01 Thread Bosko Milekic
P.S.: Not sure if this is related:
http://www.mail-archive.com/bk-commits-24@vger.kernel.org/msg00136.html




-- 
Bosko Milekic - If I were a number, I'd be irrational.
Contact Info: http://bmilekic.unixdaemons.com/contact.txt


Re: Race condition in mb_free_ext()?

2005-03-01 Thread Bosko Milekic

You know, the more we run into these problems specific to how
reference counting is performed, the more I wonder whether some cleverly
defined macros for dealing with common reference counting operations
would be worthwhile.  I'm not saying "introduce a refcount_t type and a
full-blown abstraction," but some template-like stuff might be useful.
It took a while just to get the refcount decrement on the free path free
of races on SMP, and that's only in the mbuf code.
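As a rough illustration of what such template-like helpers might look like, here is a hedged sketch.  The refcount_* names and the GCC __sync builtins are assumptions for illustration, not an existing kernel API at the time of this thread.

```c
#include <assert.h>

/* Hypothetical reference-count helpers of the kind suggested above. */
typedef volatile unsigned int refcount_t;

static void
refcount_init(refcount_t *rc, unsigned int v)
{
	*rc = v;
}

static void
refcount_acquire(refcount_t *rc)
{
	/* Atomic increment; no return value needed. */
	__sync_fetch_and_add(rc, 1);
}

/* Returns nonzero when the caller dropped the last reference, so the
 * caller knows it is responsible for freeing the object. */
static int
refcount_release(refcount_t *rc)
{
	/* fetch_and_sub returns the value *before* the decrement. */
	return (__sync_fetch_and_sub(rc, 1) == 1);
}
```

Centralizing the operations this way would keep the tricky memory-ordering and race reasoning in one place instead of repeating it in every subsystem.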

-Bosko

On Tue, Mar 01, 2005 at 03:14:38PM -0800, Kris Kennaway wrote:
> On Tue, Mar 01, 2005 at 06:04:27PM -0500, Bosko Milekic wrote:
> 
> > This does not appear to explain the livelock.
> 
> alc and dwhite tracked it down to a missing volatile causing gcc to
> mis-optimize the loop:
> 
> -   cnt = *(m->m_ext.ref_cnt);
> +   cnt = *(volatile u_int *)(m->m_ext.ref_cnt);
> 
> I'm currently testing that.
> 
> Kris


-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]


Re: Race condition in mb_free_ext()?

2005-03-01 Thread Bosko Milekic

On Mon, Feb 28, 2005 at 10:00:25PM -0800, Doug White wrote:
> Forgive me for being naive, but is there a reason you don't do an atomic
> subtraction on the refcount?  I can see why it repeats -- if two things
> are warring over the refcount one or the other keep trying until one wins
> -- but the subtraction would seem more intuitive.

  The subtraction is atomic and is part of the cmpset.  If you were to
  only do a subtraction, you would risk racing on figuring out what the
  counter value was before the subtraction and on making sure that it
  stays consistent afterwards.  That is the purpose of the cmpset.
  The idea is that only the LAST thread to decrement the counter down to
  exactly 1 frees the cluster.

  If you look at the CVS history for that routine and its various
  incarnations (you might need to look at kern/subr_mbuf.c in the attic,
  since mb_free_ext() used to be there, iirc), you will see various
  points in time where we had this wrong.

> -- 
> Doug White|  FreeBSD: The Power to Serve
> [EMAIL PROTECTED]  |  www.FreeBSD.org

-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]


Re: generic network protocols parser ?

2005-03-04 Thread Bosko Milekic

On Fri, Mar 04, 2005 at 11:07:34AM -0500, Aziz KEZZOU wrote:
> Hi all,
> I am wondering if any one knows  about a generic parser which takes a
> packet (mbuf) of a certain protocol (e.g RSVP ) as input and generates
> some data structre representing the packet ?
> 
> I've been searching for a while and found that ethereal and tcpdump
> for example use specific data structres and functions to dissect each
> protocol packets. Is this the only approach possible ?
> 
> My supervisor suggested using a TLV (Type/Length/Value) approach
> instead. Any opinions about that?
> 
> If no such a parser exists is there any practical reason why ?
> 
> Thanks,
> Aziz

  You can only go so far with generic parsing.  Eventually you will want
  some protocol-specific value to be extracted, and your parser will
  have to know what the packet looks like.

  What are you trying to do, exactly?

-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]


Re: Giant-free polling [PATCH]

2005-03-12 Thread Bosko Milekic

On Fri, Mar 11, 2005 at 03:14:50PM +0100, Pawel Jakub Dawidek wrote:
> On Fri, Mar 11, 2005 at 04:55:25PM +0300, dima wrote:
> +> I thought about using list also, but considered it to bring
> +> too much overhead to the code. The original idea of handling arrays
> +> seems to be very elegant.
> 
> Overhead? Did you run any benchmarks to prove it?
> I find list-version much more elegant that using an array.

  You should, however, be aware that iterating over an array of
  references, as opposed to a list, in the polling path is faster on
  both SMP and UP if you only consider the cost of iteration itself.
  With the list approach, you have to bring at least three items into
  the cache at each iteration:

(1) The list node;

(2) The list node's member (the 'pr' reference data in your patch);

(3) The device's mutex.

  With the array approach, you will remove (1).

  If the relative frequency of insertions and removals is low (which I
  think it is in the polling case), you ought to consider the array
  approach (given the predictable scheduling nature of polling, you can
  exploit the cache advantages of the array).

  Also, you will still be able to implement dynamic removal/insertion
  from/to the polling array.  The fact that it is an array might just
  require re-allocation, not a big deal.
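  The cache-line argument above can be made concrete with a small
  user-space sketch; the structure and function names below are
  illustrative assumptions, not the actual polling code.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative type: 'pollrec' stands in for the per-device polling
 * record discussed in the thread; the field names are assumptions. */
struct pollrec {
	void	(*handler)(void *arg);
	void	*arg;
};

/* List version: each iteration touches the list node, then the record
 * it points to (and, in the real code, the device's mutex too). */
struct pollnode {
	struct pollnode	*next;
	struct pollrec	*pr;
};

static int poll_calls;		/* counts handler invocations, for the demo */

static void
count_handler(void *arg)
{
	(void)arg;
	poll_calls++;
}

static int
poll_list(struct pollnode *head)
{
	struct pollnode *pn;
	int n = 0;

	for (pn = head; pn != NULL; pn = pn->next) {
		pn->pr->handler(pn->pr->arg);	/* two dependent loads first */
		n++;
	}
	return (n);
}

/* Array version: the records are contiguous, so the list-node load
 * disappears and iteration becomes cache-friendly. */
static int
poll_array(struct pollrec *pr, int count)
{
	int i;

	for (i = 0; i < count; i++)
		pr[i].handler(pr[i].arg);
	return (count);
}
```

  Both versions invoke the same handlers; the difference is purely in
  how much memory each iteration must pull into the cache, which is the
  point made above.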

> I also don't like the idea of calling handler method with two locks
> held (one sx and one mutex)...

  Yeah, this should be re-worked.  However, the use of the sx lock is
  neat.  Unfortunately, our sx lock implementation always resorts to
  using a regular mutex lock internally, even when doing a shared
  (read-only) acquisition.  Since the window of the sx lock's underlying
  mutex is relatively short, in most cases the performance advantage of
  using the sx lock is noticeable.  However, if the window of the
  original mutex is also short (compared to the sx lock window), then I
  wonder if it is worth it.  The sx lock is really only designed for
  optimizing long reads, in our case.

Regards,
-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]


Re: Abusing m_ext for a worthy goal.

2000-12-11 Thread Bosko Milekic


 Hi Alfred,

On Mon, 11 Dec 2000, Alfred Perlstein wrote:

> Ok, part of the mp cleanup got me looking at af_unix sockets.
> 
> Here's the deal, descriptors can be called "in-flight" meaning they
> have been passed from one application to another via
> CMG_DATA but not picked up on the other side yet.
> 
> The kernel does a bunch of hokey stuff to garbage collect them.
> 
> Instead of doing that I'd like to abuse MEXTADD to setup a callback
> to remove the references automagicaly when the mbuf it free'd.
> 
> Here's the problem:
> 
>   I want to use the mbuf clusters, I don't want to use any special
>   external storage.  I want my callback to free the cluster.
> 
> Any hints on how to do this?
> 
> here's what I have so far:
> 
> MEXTADD(control, mtod(control, char *), control->m_len, unp_mbfree,
> cm, M_RDONLY, EXT_CMSG_DATA);
> 
> control is a mbuf with a cluster attached the has fd's written in it.
> cm is a pointer to the mbuf cluster.

Here, the general idea is correct, but you'll have to do things this
  way:

  #define EXT_CMSG_DATA 666 /* A hack, for now. */

  An allocation:

        /* Here make sure you're not holding any other mutex. */
        mtx_enter(&mclfree.m_mtx, MTX_DEF);
        _MCLALLOC(control->m_ext.ext_buf, M_WAIT); /* Or M_DONTWAIT */
        mtx_exit(&mclfree.m_mtx, MTX_DEF);
        if (control->m_ext.ext_buf != NULL) {
                /* Assuming you want to M_RDONLY your data. Otherwise,
                   just change M_RDONLY to 0. */
                MEXTADD(control, control->m_ext.ext_buf, MCLBYTES,
                    unp_mbfree, cm, M_RDONLY, EXT_CMSG_DATA);
                if (control->m_ext.ref_cnt == NULL) {
                        _MCLFREE(control->m_ext.ext_buf);
                        /* Allocation of the ref. counter failed, so just
                           fail gracefully instead of panicking (this
                           should never happen, but make sure to check). */
                        return (ENOBUFS);
                }
        } else
                return (ENOBUFS);

> static void
> unp_mbfree(caddr_t mm, void *claddr)
> {
> struct mbuf *m = (struct mbuf *)mm;
> 
> unp_scan(m, unp_discard);
> _MCLFREE(claddr);
> }

Actually, your free routine needs to accept the address of the
  cluster as a first argument, and the optional argument as a second
  argument. According to your MEXTADD() above, your optional argument will
  be something called "cm." In your case, what you need to make your
  optional argument be is the actual mbuf "control" (you can cast it to
  struct mbuf *). Then, you can call unp_scan() on that mbuf (second
  argument) and call _MCLFREE() on the first argument, which will
  effectively be the address of the cluster.
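In other words, under the calling convention described above, the free routine would look roughly like this.  The stand-in types and the stub functions are assumptions so the sketch stands alone outside the kernel; only the argument convention comes from the discussion.

```c
#include <assert.h>

/* Reduced stand-ins for kernel types; unp_scan() and _MCLFREE() are
 * replaced by recording stubs for illustration. */
typedef char *caddr_t;
struct mbuf { int m_flags; };

static int scanned;	/* set when the stand-in unp_scan() runs */
static int freed;	/* set when the stand-in _MCLFREE() runs */

static void
unp_scan_stub(struct mbuf *m)
{
	(void)m;
	scanned = 1;
}

static void
mclfree_stub(caddr_t claddr)
{
	(void)claddr;
	freed = 1;
}

/*
 * Per the description above: the first argument is the address of the
 * cluster, the second is the optional argument registered with
 * MEXTADD() -- here, the 'control' mbuf itself, cast back from void *.
 */
static void
unp_mbfree(caddr_t claddr, void *arg)
{
	struct mbuf *control = arg;

	unp_scan_stub(control);	/* discard the in-flight descriptors */
	mclfree_stub(claddr);	/* then release the cluster itself */
}
```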

> does this look right?  would you like to abstract the mbuf
> clusters further before i started using the _ macros?

Abstracting mbuf clusters is definitely something I'm looking at
  doing (as you've probably seen me mention in some posts to some of the
  lists), but for now, the above should do. I'd rather not rush into
  abstracting the EXT_CLUSTER type further for now "just for the sake of
  abstracting it." I want to make sure that we don't lose any performance
  whatsoever where clusters are concerned (i.e. I'm trying to decide
  whether the cost of calling a function for clusters as well is worth it
  at this point). Briefly, the day will come. :-)

> --
> -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
> "I have the heart of a child; I keep it in a jar on my desk."

  Cheers,
  Bosko Milekic
  [EMAIL PROTECTED]




To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-net" in the body of the message



Re: Abusing m_ext for a worthy goal.

2000-12-11 Thread Bosko Milekic


  Something I forgot in my first email...

On Mon, 11 Dec 2000, Bosko Milekic wrote:

[...]
>   Actually, your free routine needs to accept the address of the
>   cluster as a first argument, and the optional argument as a second
>   argument. According to your MEXTADD() above, your optional argument will
>   be something called "cm." In your case, what you need to make your
>   optional argument be is the actual mbuf "control" (you can cast it to
>   struct mbuf *). Then, you can call unp_scan() on that mbuf (second
>   argument) and call _MCLFREE() on the first argument, which will
>   effectively be the address of the cluster.

Don't forget, after running _MCLFREE() on the ext_buf, to
  NULL out the control mbuf's ext_buf and to remove the M_EXT bit from the
  mbuf if you are planning to re-use it. Basically, make sure to clean up
  after yourself after you forcibly free the cluster "manually."
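That cleanup might be sketched as follows; the helper name is hypothetical, and the structures are reduced stand-ins for the real mbuf definitions so the fragment compiles outside the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins; the field and flag names follow the mbuf code discussed
 * in the thread, but the struct layouts here are illustrative only. */
#define M_EXT	0x0001
struct m_ext_stub { char *ext_buf; };
struct mbuf { int m_flags; struct m_ext_stub m_ext; };

/*
 * After forcibly freeing the cluster "manually", detach it from the
 * mbuf so the mbuf can safely be re-used.
 */
static void
mb_detach_cluster(struct mbuf *m)
{
	m->m_ext.ext_buf = NULL;	/* forget the freed cluster */
	m->m_flags &= ~M_EXT;		/* mbuf no longer has ext. storage */
}
```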

[...]

  Later,
  Bosko Milekic
  [EMAIL PROTECTED]







Re: Abusing m_ext for a worthy goal.

2000-12-11 Thread Bosko Milekic


On Mon, 11 Dec 2000, Alfred Perlstein wrote:

> Hmm, I think instead of doing this sort of abuse, what I should be
> doing is allocating an mbuf header, then allocating a mbuf cluster
> then attaching them using the MEXTADD() macro.

Alfred, I'm surprised you're pointing this out when in my first post,
  the code example does exactly this, minus the mbuf allocation. :-)

> I'm going to look at the code to see if there's a clean way to do this,
> if not can you provide an interface for allocating and free'ing
> clusters by themselves?
> 
> thanks,
> -- 
> -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
> "I have the heart of a child; I keep it in a jar on my desk."


  Bosko Milekic
  [EMAIL PROTECTED]







Re: MEXT_IS_REF broken.

2000-12-12 Thread Bosko Milekic


On Tue, 12 Dec 2000, Alfred Perlstein wrote:

[...]
> And since Chuck backs me up on this, can we consider this discussion
> over as soon as John commits his code?

I think this is a valid point.

This is not the first time the need for a proper refcount "interface"
  has arisen, or at least been discussed, and I would say
  that it is better to do this now than later. Clearly, the ref. count code
  in mbuf.h is suffering from a pretty nasty race.
Even if we use the particularities of the mbuf cluster referencing
  system to fix it one way now, we're bound to encounter similar problems
  sooner or later.

> thanks,
> -- 
> -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
> "I have the heart of a child; I keep it in a jar on my desk."

  Cheers,
  Bosko Milekic
  [EMAIL PROTECTED]







Re: need mbuf generic free() hook

2000-12-12 Thread Bosko Milekic


  Alfred,

You've brought this up before. This looks similar to your earlier
  post(s) regarding how to "abuse m_ext." Mind you, I may be mistaken, but
  that would be because this isn't being made awfully clear.
Earlier you talked about forcing a custom free routine to be called
  even in the case of mbuf clusters, and we devised a way to do it.
Are you now suggesting having a custom free routine called whenever
  your mbuf is freed (i.e. even when it is a non-M_EXT mbuf)?
If that's so, I would urge you to re-examine the code, because doing
  something like that would be generally wrong. I say "generally" because
  I'm always open to exceptions, as you know. So please clarify.

On Tue, 12 Dec 2000, Alfred Perlstein wrote:

> Ok, so I'm trying to rid FreeBSD of the unp_gc() junk in uipc_usrreq.c
> I just realized why I'm leaking file descriptors:
> 
> Only mbufs with external storage can have a 'free' function, the
> problem is that for the most part when passing a single fd I'm only
> using a single mbuf to pass my data around, my ext_free routine is
> probably never called, and on top of that I'm probably corrupting
> my mbuf by using it when I shouldn't be.
> 
> This is terrible, what should be a simple free() hook is now my
> worst nightmare.  A couple of options come to mind:
> 
> a) always have a free hook, the extra if (m->m_freehook != NULL)
> check can't be that expensive.
> b) a macro to make an mbuf into an mbuf cluster that points to itself,
> it should check to see if it has enough room to shuffle the data
> and return an error if it can't accomidate both the data and the
> expanded mbug header.
> 
> I like 'a' but would be willing to deal with 'b' as we've never
> had a need for this, although that would mean copying about the
> size of an mbuf in the worst case when a subsystem needs this 
> facility.
> 
> Suggestions? Comments?
> 
> -- 
> -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
> "I have the heart of a child; I keep it in a jar on my desk."
> 
> 


  Bosko Milekic
  [EMAIL PROTECTED]







Ratelimit Enhancement patch (Please Review One Last Time!)

2000-12-13 Thread Bosko Milekic


  Hi,

A while ago (it's been at least two weeks now), Mike Silbersack
  requested a review for:

  http://www.silby.com/patches/ratelimit-enhancement-2.patch

  To quote the description on his web page, this diff will:

  * ICMP ECHO and TSTAMP replies are now rate-limited.
  * RSTs generated due to packets sent to open and unopen ports
are now separated into separate queues.
  * Each rate limiting queue now has its own description, as 
follows:
   Suppressing udp flood/scan: 212/200 pps
   Suppressing outgoing RST due to port scan: 202/200 pps
   Suppressing outgoing RST due to ACK flood: 19725/200 pps
   Suppressing ping flood: 230/200 pps
   Suppressing icmp tstamp flood: 210/200 pps

  While the descriptions for the two RST cases can be accused
  of oversimplification, they should cut down on questions by
  users confused with the current terminology.  Experienced
  users can always run a packet sniffer if they need more
  exact knowledge of what's occurring.

The diff was initially reviewed by me and green, and the recommended
  changes were mainly stylistic. I want to commit this code, but I'm
  posting it up here in case someone has any final objections or review.

  Thanks,
  Bosko Milekic
  [EMAIL PROTECTED]







Re: Ratelimit Enhancement patch (Please Review One Last Time!)

2000-12-13 Thread Bosko Milekic


On Wed, 13 Dec 2000, Bill Fumerola wrote:

> On Wed, Dec 13, 2000 at 02:42:53PM -0500, Richard A. Steenbergen wrote:
> 
> > It could just as easily be a SYN flood against a single port... or a large
> > number of clients trying to connected to your crashed web server... :P Or
> > it could just as easily be an ack flood against a port without a listener
> > and be showing up in the "not the ack flood" counter.
> 
> Exactly. Bikeshedding the millions of possible reasons the queue/ratelimit
> was triggered is silly.
> 
> Bosko, please change the descriptions to something very generic before
> committing them ("ratelimiting TCP RST packets: x/y pps" or something)

Mike said he would do it and re-post the diff.

> -- 
> Bill Fumerola - security yahoo / Yahoo! inc.
>   - [EMAIL PROTECTED] / [EMAIL PROTECTED]

  Later,
  Bosko Milekic
  [EMAIL PROTECTED]







Re: Updated ratelimit patch

2000-12-14 Thread Bosko Milekic


  Looks good to me, I'll commit this within the next 24 hours unless
  someone complains with reason.

  Thanks Mike.

On Thu, 14 Dec 2000, Mike Silbersack wrote:

> 
> Ok, I've redone the messages in the ratelimiting patch.  They are now as
> follows:
> 
> Limiting icmp unreach response from 439 to 200 packets per second
> Limiting closed port RST response from 283 to 200 packets per second
> Limiting open port RST response from 18724 to 200 packets per second
> Limiting icmp ping response from 211 to 200 packets per second
> Limiting icmp tstamp response from 394 to 200 packets per second
> 
> No other changes have been made, and the updated patch is available at:
> http://www.silby.com/patches/ratelimit-enhancement-3.patch
> 
> Mike "Silby" Silbersack

  Regards,
  Bosko Milekic
  [EMAIL PROTECTED]







Changing the names of some M_flags

2000-12-16 Thread Bosko Milekic


  Hello,

Recently, there was a bikeshed on one of the lists dealing with
  whether or not to rename M_WAIT and M_DONTWAIT flags to something else
  that would "communicate more of the Right Thing" to developers
  considering that for mbuf allocations, M_WAIT may return a NULL pointer
  in the case where all mbuf resources are depleted and mbuf_wait time has
  already been spent waiting.

The proposed flag names were/are:

M_WAIT --> M_TRY_WAIT
M_DONTWAIT --> M_DONTBLOCK

I have a diff sitting here:

http://people.freebsd.org/~bmilekic/m_flag_rnm.diff

It's roughly 160K.

On one side, personally, I wasn't particularly in favor of the change,
  as I see it more as bloat in the repo than anything else. But I have to
  admit that some folks have expressed a significant interest in having
  them changed, and with reason, so I agreed to do it. On this same side,
  it wasn't such a bad thing to do after all, as it led me to spot some
  bugs, both obvious and obfuscated (and potentially serious in the
  future), which I have yet to fix. A good example is people mixing up
  M_WAIT and M_WAITOK in calls to malloc(), which could potentially lead
  to disaster if we were to alter the values of the flags.
On the other side, I personally do not passionately feel the change
  "needs" to be made. But, as mentioned, if the majority of developers
  think that it will avoid needless miscommunication, then I have no
  problem committing it.

I don't want to start a bigger bikeshed from this post than it
  already is. I'd just like to know if anyone has any reasonable objections
  to having this change happen.

  Regards,
  Bosko Milekic
  [EMAIL PROTECTED]







Re: Changing the names of some M_flags

2000-12-16 Thread Bosko Milekic


On Sat, 16 Dec 2000, Alfred Perlstein wrote:

> I think M_DONTWAIT is fine as it was, and M_TRYWAIT instead of M_TRY_WAIT.
> 
> Leaving it as M_DONTWAIT should reduce the delta by quite a bit and
> M_TRYWAIT vs M_TRY_WAIT because you have M_DONTWAIT/M_DONTBLOCK.
> 
> -Alfred

I agree. Anyone else before I re-roll? :-)

  Cheers, 
  Bosko Milekic
  [EMAIL PROTECTED]







Re: Changing the names of some M_flags

2000-12-16 Thread Bosko Milekic


On Sat, 16 Dec 2000, Alfred Perlstein wrote:

> > Is this just going to make portablity between the various *BSD kernels
> > more difficult for what's essentially a cosmetic change?  I'm thinking
> > of things like KAME, ALTQ, etc.
> 
> I agree, however this argument keeps coming up:
>   "I thought M_WAIT meant it would wait forever!"
> 
> Personally, I think developers should do a bit more research
> and should have noticed all the places where M_WAIT was followed
> by a check for NULL and be able to bridge the gap.
> 
> So honestly, I'm against the change, but if it has to be done
> then I'd like to see the M_DONTWAIT and M_TRYWAIT.
> 
> -- 
> -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
> "I have the heart of a child; I keep it in a jar on my desk."

This is _EXACTLY_ my opinion and standpoint on this issue.

In terms of portability, though, let's be realistic: the name of a
  flag isn't going to be a major concern. When you port code, you typically
  have to read/understand what's going on anyway. So, if NetBSD, for
  example, uses malloc() to allocate mbufs and passes the WAIT/NOWAIT flag
  directly down to malloc(), then it may as well never return if malloc()
  can't find a page. When porting to FreeBSD, the developer needs to know
  that this is not the case here and must make appropriate changes anyway.
  In fact, changing the flag may HELP the developer doing the porting
  avoid the mistake of assuming that it will be the same.

  Regards,
  Bosko Milekic
  [EMAIL PROTECTED]







M_flag rename: M_WAIT to M_TRYWAIT

2000-12-18 Thread Bosko Milekic


  Ok, the new version of the diff is here, and it's a mere 35K (compared to
  the last one which was 162k):

  http://people.freebsd.org/~bmilekic/m_flag_rnm.diff

  Big bloat reduction compared to last time.

  It renames M_WAIT to M_TRYWAIT, M_DONTWAIT stays M_DONTWAIT, and also
  fixes a few rather lame mistakes, such as M_WAIT being passed to malloc()
  (as opposed to M_WAITOK) which may prevent future problems if we were to
  redefine the values of the malloc/mbuf alloc flags. M_WAIT is maintained
  and defined as M_TRYWAIT for now, but its use is deprecated.
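  The semantics being renamed around are easy to model in plain C. In the
  sketch below, model_m_get() is a purely invented stand-in for the kernel
  allocator; the point is only that an allocator willing to wait
  (M_TRYWAIT) may still come back with NULL, so every caller must check:

```c
#include <stddef.h>

/* Invented flag values, for this model only. */
#define M_DONTWAIT	0
#define M_TRYWAIT	1

static int resources_left = 1;	/* pretend pool with one buffer left */

/*
 * Stand-in for m_get(): even with M_TRYWAIT, once the bounded wait
 * expires and the pool is still empty, the caller gets NULL back.
 */
static void *
model_m_get(int how)
{
	static char mbuf_storage[256];

	(void)how;			/* either flag can come back empty */
	if (resources_left-- > 0)
		return (mbuf_storage);
	return (NULL);
}
```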

  Anyone have final comments before this is committed? Then I can go on and
  change the man page to reflect it as well and send a small heads-up to
  -current. I have three small PostIt[tm]'s worth of "to do" fixes
  following this change which involve developers previously assuming that
  M_WAIT will never return NULL. Going in and making sure these guys check
  for the possibility of failure will be done once this goes in (I'm doing
  it in two parts to avoid merging headaches while working).

  Please note that this is planned only for -CURRENT at this time.

  Regards,
  Bosko Milekic
  [EMAIL PROTECTED]







Re: Intel ethernet driver source code

2000-12-28 Thread Bosko Milekic


  Hi,

Please show us how you're manipulating the mbufs.
  You should not be getting page faults unless you're doing
  something wrong.
A good idea would also be to check the fault virtual
  address and enable debugging support in your kernel in order
  to be able to analyze the situation as much as you can.

Regards,
Bosko.

- Original Message -
From: "Marcus Vinícius Midena Ramos" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, December 28, 2000 8:58 AM
Subject: Intel ethernet driver source code


> Hello,
>
> I am developing an application to measure the performance of an IP
> communications network, accounting for packet loss rate and transmission
> delay values between any two sites on a net. This work resembles that of
> www.advanced.org and is also based on IETF's IPPM initiative.
>
> One of the goals is to insert a GPS timestamp on the packets to be
> transmitted as late as possible before they are put on the wire. This
> means I cannot do this job on the application level. Instead, this has
> to be done on the ethernet driver level in order to achieve the maximum
> accuracy, especially when the measurement machine is under heavy load.
>
> I've studied the "if_fxp.c" (Intel ethernet driver) source code plus
> Stevens' TCP Illustrated vol. 2, which helped me get a good
> understanding of the data structures manipulated by this code. My aim
> now is to insert the GPS timestamps on the packets to be transmitted
> right before they go to the interface.
>
> To do this, I am trying to modify the driver source code to achieve the
> following objectives: (1) read the contents of the mbufs in order to
> identify the ones that hold the packets to be transmitted, and (2)
> insert the timestamps on those eligible.
>
> However, any access to mbufs from inside the driver causes an error
> message "page fault while in kernel mode" to be displayed when the new
> kernel is booted. It then enters a loop, trying to reboot forever.
>
> I guess the mbufs are in an area somehow protected by the kernel, which
> prevents even the driver from having access to it. My question is: how
> can I go around this problem and be able to read and change the contents
> of the mbufs from inside the driver ?
>
> Thank you very much.
> Marcus.
>
>
>
> To Unsubscribe: send mail to [EMAIL PROTECTED]
> with "unsubscribe freebsd-net" in the body of the message
>






Re: (forw) Two NICs In FreeBSD

2001-01-03 Thread Bosko Milekic


I would go with (a) unless someone is religiously obsessed with having the
message stay.

Else go with (b) but see if you can fit it in with some "generic verbosity"
sysctl knob, as opposed to creating its own.

Later,
Bosko.

> /kernel: arp: 1.2.3.4 is on dc0 but got reply from 00:00:c5:79:d0:0c on dc1
>
> Hi can we axe this message already?  I see the possibility of using
> two 100mbit cards on a switch with a gig uplink to be reason enough
> to either axe it, or make it a sysctl to quiet the warnings.
>
> I've also had this broken config setup as a temporary thing and
> my console getting flooded was also quite irritating.
>
> Anyone going to raise a ruckus if I turn it
> a) off
> b) sysctl default off
> c) sysctl default on
>
> thanks,
> -Alfred
>
> - Forwarded message from Tim Gustafson <[EMAIL PROTECTED]> -
>
> From: Tim Gustafson <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Subject: Two NICs In FreeBSD
> Date: Wed, 03 Jan 2001 23:37:06 -0500
> Message-Id: <[EMAIL PROTECTED]>
> X-Mailer: QUALCOMM Windows Eudora Version 5.0
> Sender: [EMAIL PROTECTED]
>
> Hello
>
> I just installed two NIC cards into my FreeBSD machine that are on the same
> LAN, each with a different IP.  However, I keep getting this sort of
> message in my syslog:
>
> /kernel: arp: 1.2.3.4 is on dc0 but got reply from 00:00:c5:79:d0:0c on dc1
>
> I realize that the message is because it's getting two ARP replies, but
> this is normal (for this network, anyhow).  How can I suppress these
> messages from appearing in the syslog?  Is it OK to have both interfaces
> receiving ARP replies?  Will this mess anything up?
>
> Tim
>
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>Tim Gustafson  [EMAIL PROTECTED]
>www.falconsoft.com  (631)475-6662
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>Share your knowledge - it's a way to achieve immortality.
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>
>
> - End forwarded message -
>
> --
> -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
> "I have the heart of a child; I keep it in a jar on my desk."







lock order violation? (Was: Fw: [patch] quick review)

2001-01-08 Thread Bosko Milekic

 Hello,

As Alfred suggests below, there may be a potential problem with the mbuf
wait code calling m_reclaim() [the drain routines] with the mmbfree (mbuf
freelist mutex) lock held. Presently, to avoid a race condition, we call
m_reclaim() with the lock held and recurse into it from the drain routines
which call m_free{m}() to free mbufs. This was initially written this way; it
was partially adopted from BSD/OS (it's done the same way). This way, the
mbuf wait code is sure to be the first one to get the chance to allocate
after the drain routines run. Unfortunately, as mentionned, there appears to
be a potential lock order problem. Using as an example the BSD/OS code,
something like this happens:

- In ip_slowtimo(), IPQ_LOCK is obtained and with the lock held, ip_freef()
is called. ip_freef() calls m_freem() which grabs the MBUF_LOCK. The lock
order here is IPQ -> MBUF.

- In ip_drain(), which is called from m_reclaim() with the MBUF lock already
held, IPQ lock is acquired. Then ip_freef() is called which does m_freem()
[which grabs MBUF lock]. The lock order here, since ip_drain() is called with
MBUF already held is MBUF -> IPQ.

Are we correct to assume that there is a potential for deadlock here?
Clearly, if one thread in ip_slowtimo() has IPQ and another thread presently
draining has MBUF, there is deadlock.
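The reversal can be made concrete with a tiny lock-rank checker
(hypothetical sketch: the kernel uses real mutexes, but ranks are how
lock-order rules are usually stated):

```c
/*
 * Hypothetical lock-order checker.  Give each lock a rank and demand
 * that locks only be acquired in ascending rank order; an acquisition
 * at or below the highest rank already held flags a potential
 * deadlock.  The ranks mirror the ip_slowtimo() path: IPQ before MBUF.
 */
enum { RANK_NONE = 0, RANK_IPQ = 1, RANK_MBUF = 2 };

static int held_rank = RANK_NONE;	/* highest rank currently held */

/* Return 1 if the acquisition respects the ordering, 0 otherwise. */
static int
try_acquire(int rank)
{
	if (rank <= held_rank)		/* out of order: reversal */
		return (0);
	held_rank = rank;
	return (1);
}

static void
release_all(void)
{
	held_rank = RANK_NONE;
}
```

The ip_slowtimo() path (IPQ, then MBUF) passes the check; the
m_reclaim()-with-MBUF-held path (MBUF, then IPQ) trips it, which is
exactly the reversal described above.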

The proposed patch (for FreeBSD) is at the address immediately below. For
now, what it does is avoid the lock recursion by calling m_reclaim() without
the mbuf lock held. We deal with the race condition. In the future, if we
can, we may want to weed out the drain routines and make sure that there are
no lock ordering problems; but for now, until we're done with all the
locking, I think it's better to do it this way.

Thanks,
Bosko.

P.S.: Please feel free to snip -arch from the CC list, if judged appropriate.

Alfred Perlstein wrote:

> * Bosko Milekic <[EMAIL PROTECTED]> [010108 19:13] wrote:
> > Hi Alfred,
> >
> > Can you please review this:
> >
> > http://people.freebsd.org/~bmilekic/code/no_rec_mbsleep.patch
> >
> > All it does is in m_mballoc_wait() [wait routine for mbuf allocation]
> > drop the mbuf freelist lock before calling the drain routines. It avoids
> > recursing into the lock from within the drain routines. We discussed
> > this some time ago and you brought up the fact that it should
> > probably be changed; although I don't recall exactly why (besides
> > generally just getting rid of the lock recursion and relatively
> > large lock contention)... I recall you having an even more important
> > reason, can you remind me, if applicable?
>
> It's a lock ordering problem, you _can_ unwind the lock in the protocol
> drain routines before locking down a socketbuffer or whatnot, but the
> fact of the matter is if you have to lock _anything_ you pretty much
> have to unwind the lock or have a reversal problem as the mbuf system
> should be a leaf-lock.
>
> You're also blocking the natural free'ing of the mbufs by other code
> running.
>
> So the comment is sorta lacking, it's to avoid:
> 1) lock recursion (as you stated)
> 2) lock reversal
> 3) needing to explicitly unwind the lock to avoid #2
>
> At least I think. :)
>
> It looks like BSD/os holds the lock during the protocol drain routines,
> but I think that's a mistake.
>
> Would you mind reposting this to -net/-smp, I think I'm right about
> this, but some peer review couldn't hurt, I wasn't sure if reposting
> was ok considering this came across in private.
>
> --
> -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
> "I have the heart of a child; I keep it in a jar on my desk."







Re: no buffer space available

2001-01-27 Thread Bosko Milekic


Does this happen regularly? Check `netstat -m' on the machine (you'll
probably have to do it from the console) when this is happening.
If it happens regularly, does anything specific happen that "helps
reproduce it?"
What version of FreeBSD are you running?

-Bosko

G. Jason Middleton wrote:

> I have a problem.
> I can telnet into my bsd box from the net and after a while it will
> just not let me in.  It just drops right off the face of the earth.
> So when I actually try to ping something from the box after this
> happens I get an error as follows:
>
> ping: sendto: no buffer space available.
>
>
> when I reboot everything works like it should.
>
> Got any ideas?
>
> Regards,
>
> G. Jason Middleton
> University of Maryland Baltimore County







Re: no buffer space available (outcome of netstat -m)

2001-01-28 Thread Bosko Milekic


What device are you using? What is the interface you ifconfig'd
down and back up?

Jason, can you also contribute?

Boris, if you're reading this, if you see that both these
gentlemen have the same interface, perhaps we have a winner? I've been
unable to find the problem as of yet (frankly, I've had little time to
look too deeply).

Regards,
Bosko.

Wesley Morgan wrote:

> When this happened to me once before, I ifconfig'd the interface down
> and back up... Voila, working interface.  Never figured out what
> caused it though :)
>
> On Sun, 28 Jan 2001, G. Jason Middleton wrote:
>
> > The original problem was and still is: ping: sendto: no buffer
> > space available.
> >
> > I was able to reproduce the error today while telnetting in to the
> > box.  I was ftping through telnet and I did an "ls"; it got halfway
> > through a directory listing and crapped out on me.  The box was
> > nowhere to be found on the network/internet.
> >
> > OK i did the netstat -m and got he following output
> >
> > 298/352/4096 mbufs in use (current/peak/max)
> > 170 mbufs allocated to data
> > 128 mbufs allocated to packets headers
> > 41/160/1024 mbuf clusters in use  (current/peak/max)
> > 408 kbytes allocated to network (13% of mb-map in use)
> > 0 requests for mem denied
> > 0 requests for mem delayed
> > 0 calls to protocol drain routines
> >
> >
> > what next?
> >
> > Jason
> >
> >
> > Does this happen regularly? Check `netstat -m' on the machine
> > (you'll probably have to do it from the console) when this is
> > happening.
> > If it happens regularly, does anything specific happen that "helps
> > reproduce it?"
> > What version of FreeBSD are you running?
> >
> > -Bosko
> >
> > G. Jason Middleton wrote:
> >
> > > I have a problem.
> > > I can telnet into my bsd box from the net and after a while it
> > > will just not let me in.  It just drops right off the face of
> > > the earth.  So when I actually try to ping something from the
> > > box after this happens I get an error as follows:
> > >
> > > ping: sendto: no buffer space available.
> > >
> > >
> > > when I reboot everything works like it should.
> > >
> > > Got any ideas?
> > >
> > > Regards,
> > >
> > > G. Jason Middleton
> > > University of Maryland Baltimore County
> >
> >
> >
> > To Unsubscribe: send mail to [EMAIL PROTECTED]
> > with "unsubscribe freebsd-net" in the body of the message
> >
>
> --
>    Wesley N Morgan
>    [EMAIL PROTECTED]
>    FreeBSD: The Power To Serve
>    6bone: 3ffe:1ce3:7::b4ff:fe53:c297
> Hi! I'm a .signature virus! Copy me into your ~/.signature to help me
> spread!
>
>
>
> To Unsubscribe: send mail to [EMAIL PROTECTED]
> with "unsubscribe freebsd-net" in the body of the message
>






Re: Sequential mbuf read/write extensions

2001-02-06 Thread Bosko Milekic


Boris Popov wrote:

[...]
> Since currently there aren't many consumers of this code I can
> suggest to define an option LIBMBUF in the kernel configuration
> file and add KLD libmbuf (with interface libmbuf), so kernel
> footprint will not be

I am in favor of such an option on the condition that it is
temporary. In other words, only until we decide "we have converted
enough code to use this code so we should remove the option now." The
reason is that otherwise, we will be faced with numerous "#ifdef
LIBMBUF ... #else ... #endif" code. I assume this is what you meant,
anyway, so I have no objections. :-) The API looks great by the way,
and I will try to give a more detailed review in the next few days.
:-)

For now:

#define M_TRYWAIT M_WAIT is not right.
(M_WAIT is no longer to be used in the mbuf code.)

The successful return value is 0. I don't have a problem with this,
specifically, but I would assume that this:
if (!mb_init(mbp))  ... would be more "logical" (I use the term
loosely) if it meant: "if initialization fails" (now it means "if
initialization is successful").

> significantly affected. The names of source and header files are
> questionable too and I would appreciate good suggestions (currently
they
> are subr_mbuf.c and subr_mbuf.h).

Hmmm. Maybe subr_mblib.c and libmb.h ? I don't want to turn this
into a bikeshed ( :-) ), so I suggest that you decide. Personally, I
would prefer that it be something other than "subr_mbuf.c" simply
because it may be a little misleading in some cases.

> Well, and finally here you will find full source code of proposed
> API: http://www.butya.kz/~bp/mbuf/
>
> Any comments and suggestions are greatly appreciated.
>
> --
> Boris Popov
> http://www.butya.kz/~bp/

Boris, this is really a great interface and nice looking, clean code.
Thank you!

Regards,
Bosko.







Fw: if_ed.c && BRIDGE

2001-02-07 Thread Bosko Milekic

Hi Luigi, -net:

There seems to be a problem with the BRIDGE-specific
"optimization" in if_ed.c. What it does is avoid getting a packet from
the card if bridge_in() says that we're going to drop it. However, the
code is broken and something eventually leads to a page fault.
Disactivating this portion of if_ed.c rids us of the problem, but
doesn't "fix it" properly.

Rich has some debugging information including, I believe, a crash
dump. Rich, if you can please post that here (i.e. the backtraces and
fault information) as I seemed to have lost it (gr, sorry about
that).

I would rather not just disable the BRIDGE section of if_ed.c to
"mask out" the problem.

Regards,
Bosko.

Rich Wales wrote:

> Bosko --
>
> That "if_ed.c" patch you had me try (commenting out the
packet-dropping
> optimization) seems to be quite solid.  If you wanted to go ahead
and
> commit it into -STABLE . . . .
>
> Rich Wales [EMAIL PROTECTED]
http://www.webcom.com/richw/
>






Re: Sequential mbuf read/write extensions

2001-02-08 Thread Bosko Milekic


Boris Popov wrote:

[...]
> Not exactly so. 'option LIBMBUF' will just connect the source file
> to kernel makefile. There is no need for any #ifdef's in the code.

Right. But I assume LIBMBUF will absolutely be needed if code that
uses the routines is compiled. What I just meant to say was: "when the
code using these routines grows to be significant enough, then we can
just remove the option."

> > #define M_TRYWAIT M_WAIT is not right.
> > (M_WAIT is no longer to be used in the mbuf code.)
>
> You omitted the surrounding "#ifndef M_TRYWAIT" which makes this
> code portable to RELENG_4 (mind you, this code taken from smbfs). Of
> course, this should be stripped before import.

I did, you're right. I guess I saw the "ifndef" wrong... I read
this with only -CURRENT in mind and was afraid that the mbuf code
flags would start mixing in with the malloc code flags -- something I
tried to fight off in the past while.

> > The succesfull return values are 0, I don't have a problem with
this,
> > specifically, but I would assume that this:
> > if (!mb_init(mbp))  ... would be more "logical" (I use the term
> > loosely) if it meant: "if initialization fails" (now it means "if
> > initialization is succesful").
>
> I'm generally don't like such syntax if function or variable name
> do not clearly specify which value it should have/return on success.
> Nearly all functions in this file return zero or error code, so the
> correct syntax of the above will be:
>
> error = mb_init(mbp);
> if (!error)
>
> or
>
> if (error)
> return error;
>
> or
>
> if (mb_init(mbp) != 0)
> return ESOMETHINGEVIL;

OK.

> > > significantly affected. The names of source and header files are
> > > questionable too and I would appreciate good suggestions
> > > (currently they are subr_mbuf.c and subr_mbuf.h).
> >
> > Hmmm. Maybe subr_mblib.c and libmb.h ? I don't want to turn this
> > into a bikeshed ( :-) ), so I suggest that you decide. Personally,
> > I would prefer that it be something other than "subr_mbuf.c" simply
> > because it may be a little misleading in some cases.
>
> Good point.
>
> > Boris, this is really a great interface and nice looking, clean
> > code.
>
> I'm sure, this code can be significantly improved by mbuf gurus :)
>
> --
> Boris Popov
> http://www.butya.kz/~bp/

Ok, I have a few things to add (although I'm sure you'll be more into
reading Ian Dowse's comments) :-)

in mb_append_record(), you walk all the "record" mbufs to get to the
last "record." How would the tradeoff work out? I.e., keeping a pointer
to the last pkt in the mbdata structure's mbuf chain? We would grow
the structure by a pointer, and we may have to maintain the last
record pointer; but isn't the only place where we would have to
"maintain it" in mb_append_record() anyway?

in mb_init(), the m->m_pkthdr.rcvif = NULL; can be omitted, as
MGETHDR() will do that. The m->m_len = 0 should stay for now.

m_getm() looks like it should belong in uipc_mbuf.c -- it looks quite
a bit like the "all or nothing" allocation routine I have sitting
here. The difference is that mine doesn't take size as an argument,
but rather the actual count of mbufs and all it does is allocate
`count' mbufs and attach a cluster to each one of them. If it can't
allocate a cluster or an mbuf at any point, it frees everything and
returns. Now that I think about it, I'd much rather have `size' passed
in instead, even though some callers may not know the `size' (think
drivers that pre-allocate mbufs + clusters, they typically know the
`count'), it turns out that it is cheaper to compute the count from
the size than the optimal size from the count, in the mbuf case. If
you don't mind, I would strongly recommend moving m_getm() to
uipc_mbuf.c. Code that doesn't know the `size' but knows the `count'
(like some driver code) can do;

m = m_get(M_TRYWAIT, MT_DATA);
if (m == NULL) {
        /* We can't even allocate one mbuf; we're really low, so
           don't even bother calling m_getm().  The other option
           would be to have m_getm() not require us to pre-allocate
           an mbuf at all and do all the work, but then that may
           interfere with code like yours which needs to pass in an
           existing mbuf that has already been allocated.  Note that
           there is nothing to free here, since m_get() failed. */
        /* fail right here */
} else {
        size = count * (MLEN + MCLBYTES);
        if (m_getm(m, size) == NULL) {
                /* everything has been properly freed for us,
                   we don't have to worry about leaking mbufs. */
                /* fail right here. */
        }
}

For this to work, though, m_getm() needs to be modified to free all of
`top' chain if it can't get either a cluster or an mbuf. I don't know
if this was intentional, but it seems to me that there is a subtle
problem in m_getm() as it is now:

if (len > MINCLSIZE) {
MCLGET(m, M_TRYWAIT);
if ((m->m_flags & M_EXT) == 0) {
   m_freem(m); <-- frees only one mbuf
   return NULL;
}
}

I think what may happen here is that you will leak your `top' chain if
you fail to allocate a cluster.

Re: Sequential mbuf read/write extensions

2001-02-08 Thread Bosko Milekic


Boris Popov wrote:

[...]
> > For this to work, though, m_getm() needs to be modified to free
> > all of `top' chain if it can't get either a cluster or an mbuf.
> > I don't know if this was intentional, but it seems to me that
> > there is a subtle problem in m_getm() as it is now:
> >
> > if (len > MINCLSIZE) {
> > MCLGET(m, M_TRYWAIT);
> > if ((m->m_flags & M_EXT) == 0) {
> >m_freem(m); <-- frees only one mbuf
> ^^ cluster is not in the chain yet, so it has to
> be freed.

m_free() may be more appropriate than m_freem() then, but see
below.

> >return NULL;
> > }
> > }
> >
> > I think what may happen here is that you will leak your `top'
> > chain if you fail to allocate a cluster.
>
> The original semantic was not to free an entire chain because
> m_getm() does not reallocate the original (top) mbuf(s) (which may
> contain data) and only adds new mbufs/clusters if possible. So,
> calls like m_get(mb->mb_top) will not leave a wild pointer. There
> is also a simple way to deal with such behavior:
>
> mtop = m_get(...);
> if (mtop == NULL)
> fail;
> if (m_getm(mtop) == NULL) {
> m_freem(mtop);
> fail;
> }
>
> Probably m_getm() should return error code rather than pointer to
> mbuf to avoid confusion.

I understand this part, but what I think you missed in my comment
is that m_getm() should probably free what it already allocated before
finally failing. It may not need to free `top' because of the wild
pointer, as you say. But think of this:

m_getm() is called with a larger `size' - it decides that given the
`size' it will need to allocate a total of exactly 6 mbufs and 6
clusters for each mbuf. It loops and allocates, successfully, 5 of
those mbufs and 5 clusters. So `top' chain has now grown and includes
those mbufs. Then what happens in the last iteration is that it
allocates the 6th mbuf OK (it has not yet placed it on the chain) and
fails to allocate a cluster, so it frees just that one mbuf (and not
the mbufs it allocated in prior iterations and attached to `top'
chain) and returns NULL. Your code that calls m_getm() then just
fails, leaving `top' with what it could allocate. Note that in my mail
I said "assuming this is a leak," thus recognizing the possibility
that you did this intentionally. :-) Right now, I'll assume that this
_was_ intentional, as that is what I understand from the above. But in
any case, if we do move this to uipc_mbuf.c, we need to do one of the
following:

(a) make m_getm() free what it allocated in previous loop iterations
before it failed (as described above) or

(b) leave m_getm() the way it is BUT write an additional function that
will simply wrap the call to m_getm() and flush properly for it if it
fails (EXACTLY like your code snippet above).

I'll gladly settle for either, but if we do go with (b), then the
m_freem() should be changed to an m_free(), as it reflects the fact
that we are only freeing the one mbuf and we should document this
behavior, certainly. If you want, I'll roll up a diff in a few days
(once I get what is presently dragging in my "commit this" queue out)
and commit it. If you prefer to do this yourself, then feel free. :-)
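Option (a) can be sketched in plain C with list nodes standing in for
mbufs and an injectable failure point (all names here are invented for
illustration; this is not the proposed kernel diff):

```c
#include <stdlib.h>

struct node {
	struct node *next;
};

static int fail_at = -1;	/* index at which allocation "fails" */

static struct node *
node_alloc(int i)
{
	if (i == fail_at)
		return (NULL);
	return (calloc(1, sizeof(struct node)));
}

static void
chain_free(struct node *n)
{
	struct node *next;

	while (n != NULL) {
		next = n->next;
		free(n);
		n = next;
	}
}

/*
 * All-or-nothing chain allocator, option (a): on any mid-chain
 * failure, free everything allocated in earlier iterations instead
 * of leaking it, then return NULL.
 */
static struct node *
chain_getm(int count)
{
	struct node *top = NULL, **tail = &top, *n;
	int i;

	for (i = 0; i < count; i++) {
		n = node_alloc(i);
		if (n == NULL) {
			chain_free(top);	/* no leak on failure */
			return (NULL);
		}
		*tail = n;
		tail = &n->next;
	}
	return (top);
}
```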

> --
> Boris Popov
> http://www.butya.kz/~bp/

Regards,
Bosko.







Re: buffer problems with ep

2001-02-15 Thread Bosko Milekic


Dan Debertin wrote:

> I have a busy 4.1.1-RELEASE nfs server that stops responding at its
> ep interface every hour or so.  Pings to hosts on that network
> respond with "sendto: no buffer space available".  I've recompiled
> with MAXUSERS up at 128 and NMBCLUSTERS at 32768, both of which seem
> plenty high.  The problem goes away, to return an hour later, after
> a quick 'ifconfig ep0 down; ifconfig ep0 up'.
>
> Other people seem to have run into this problem as well, but
> there is no sign of a fix in the archives anywhere.

Can you try placing a "#define EP_LOCAL_STATS" at the _top_ of
if_epvar.h and rebuilding+reinstalling the ep driver? See if you get
anything printed to the console from the ep driver. If you get
something that begins with "Status:" then look under it for
"NOMB=" and take note of it.

> Here's netstat -m when it's working:
>
> 194/320/131072 mbufs in use (current/peak/max):
> 179 mbufs allocated to data
> 1 mbufs allocated to packet headers
> 14 mbufs allocated to fragment reassembly queue headers
> 176/254/32768 mbuf clusters in use (current/peak/max)
> 588 Kbytes allocated to network (68% in use)
> 0 requests for memory denied
> 0 requests for memory delayed
> 0 calls to protocol drain routines
>
> and when it isn't:
> 172/320/131072 mbufs in use (current/peak/max):
> 134 mbufs allocated to data
> 38 mbufs allocated to packet headers
> 52/254/32768 mbuf clusters in use (current/peak/max)
> 588 Kbytes allocated to network (25% in use)
> 0 requests for memory denied
> 0 requests for memory delayed
> 0 calls to protocol drain routines
>
> Thanks for any help you an render,
>
> Dan Debertin
> --
> ++ Unix is the worst operating system, except for all others.
>
> ++ Dan Debertin
> ++ Senior Systems Administrator
> ++ Bitstream Underground, LLC
> ++ [EMAIL PROTECTED]
> ++ (612)321-9290 x108
> ++ GPG Fingerprint: 0BC5 F4D6 649F D0C8 D1A7  CAE4 BEF4 0A5C 300D 2387

Regards,
Bosko.







Re: buffer problems with ep

2001-02-16 Thread Bosko Milekic

What type of card is this, exactly? (i.e. name, model number, etc.)

There is some trickery in the NetBSD driver that involves
"Vortex-based (3c59x pci, eisa) and Boomerang (3c900)" cards, which allows
FDDI-sized 4500-byte packets. According to the NetBSD driver's
comments, on these cards the commands dealing with sizes (such as setting
a threshold, for example) upshift the size value by 2 bits in order to
accommodate the full-size packet length (because the 11 bits usually
provided to store the command argument aren't enough to span the full
length).

If this is indeed the problem, we can try right (down) shifting all
the ORed-in length values by 2 bits, thus expecting the card to
interpret each length as the ORed-in value << 2.

If you have one of these cards, let me know and I'll send you a diff
off-list.

Regards,
Bosko.

Dan Debertin wrote:

> Okay, one more clue. When it's working, the interface flags are as
> follows:
>
> flags=8843 mtu 1500
>
> And when it stops, the OACTIVE flag is added:
>
> flags=8c43 mtu 1500
>
> Dan Debertin
> --
> ++ Unix is the worst operating system, except for all others.
>
> ++ Dan Debertin
> ++ Senior Systems Administrator
> ++ Bitstream Underground, LLC
> ++ [EMAIL PROTECTED]
> ++ (612)321-9290 x108
> ++ GPG Fingerprint: 0BC5 F4D6 649F D0C8 D1A7  CAE4 BEF4 0A5C 300D 2387







Re: Kernel crush due to frag attack

2001-02-25 Thread Bosko Milekic


Adrian Penisoara wrote:

> Hi,
>
>   As we are facing a heavy fragments attack (40-60 byte packets in a
> ~ 1000 pkts/sec flow) I see some sporadic panics. Kernel/world is
> 4.2-STABLE as of 18 Jan 2001 -- it's a production machine and I hadn't yet
> the chance for another update; if it's been fixed in the mean time I would
> be glad to hear it...
>
>  I have attached a gdb trace and a snip of a tcpdump log. When I rebuilt
> the kernel with debug options it seemed to crush less often. I remember
> that at the time of this panic I had an ipfw rule to deny IP fragments.

This is one of those "odd" faults I've seen in -STABLE sometimes.
Thanks to good debugging information you've provided, to be noted:

#16 0xc014de98 in m_copym (m=0xc07e7c00, off0=0, len=40, wait=1)
  at ../../kern/uipc_mbuf.c:621
621 n->m_pkthdr.len -= off0;
(kgdb) list
616   if (n == 0)
617     goto nospace;
618   if (copyhdr) {
619     M_COPY_PKTHDR(n, m);
620     if (len == M_COPYALL)
621       n->m_pkthdr.len -= off0;   <-- fault happens here (XXX)
622     else
623       n->m_pkthdr.len = len;
624     copyhdr = 0;
625   }
(kgdb) print n
$1 = (struct mbuf *) 0x661c20
(kgdb) print *n
cannot read proc at 0
(kgdb) print m
$2 = (struct mbuf *) 0xc07e7c00

Where the fault happens (XXX), the possible problem is that the mbuf
pointer n is bad, and as printed from the debugger, it does appear to
be bad. However, there are two things to note:

1. the fault virtual address displayed in the trap message:

Fatal trap 12: page fault while in kernel mode
fault virtual address = 0x89c0c800
[...]

is different from the one printed in your analysis (even though
0x89c0c800 seems bogus as well, although it is at a correct boundary).

2. Nothing bad happens in M_COPY_PKTHDR() which dereferences an
equivalent pointer.

Something seriously evil is happening here and, unfortunately, I have
no idea what.

Does this only happen on this one machine? Or is it reproducible on
several different machines? I used to stress test -STABLE for mbuf
starvation and never stumbled upon one of these `spontaneous pointer
deaths' myself. Although I have seen other weird problems reported by
other people, but only in RELENG_3.

If you cannot reproduce it on any other machines, I would start
looking at possibly bad hardware... unless someone else sees something
I'm not seeing.

>  If you need further data just ask, I'd be glad to help,
>  Ady (@warpnet.ro)

Regards,
Bosko.






Re: kernel: nd6_storelladdr failed, mbuf leak

2001-03-05 Thread Bosko Milekic


John Hay wrote:


> I have configured a 4-stable machine to be a router, routing ipv4, ipv6
> and ipx. To be able to do Ethernet_II framing on one interface and 802.3
> on another I have used if_ef.ko.
>
> I then noticed that "... kernel: nd6_storelladdr failed" gets logged
> often and after a while all mbufs are used. It turned out that in
> sys/net/if_ethersubr.c in ether_output() when nd6_storelladdr() fails,
> it does a return(0) and does not free the mbuf. I checked -current
> and it is still like that.

It should not be freeing the mbuf, because that mbuf is being
passed as an argument to ether_output(). It is typically the caller
that ought to be responsible for freeing the mbuf in this case.

> Now the reason it fails is that the ef(4) device use an ifp->if_type
> (IFT_XETHER) that nd6_storelladdr() does not expect.
>
> Oh as a workaround I have configured route6d to ignore fxp1f0.
>
> John
> --
> John Hay -- [EMAIL PROTECTED]






Re: kernel: nd6_storelladdr failed, mbuf leak

2001-03-06 Thread Bosko Milekic


itojun wrote:

>
> >> > > I then noticed that "... kernel: nd6_storelladdr failed" gets logged
> >> > > often and after a while all mbufs are used. It turned out that in
> >> > > sys/net/if_ethersubr.c in ether_output() when nd6_storelladdr() fails,
> >> > > it does a return(0) and does not free the mbuf. I checked -current
> >> > > and it is still like that.
>
> will correct it.  thanks for reporting.
>
> itojun

This behavior is absolutely horrible. What ought to be fixed, if
ether_input() is supposed to be freeing the passed-in mbuf, is that
ether_input() should instead accept a pointer to a pointer so that
after it frees the mbuf it can NULL out the initial pointer. Because
similar coding practices have been promoted in this area of the
code, what we're seeing is ether_input() effectively returning an mbuf
to the free list while the caller still holds a live pointer to that
mbuf.

Regards,
Bosko.






Re: kernel: nd6_storelladdr failed, mbuf leak

2001-03-08 Thread Bosko Milekic

I'll do it right now if itojun doesn't mind (to save him a task :-) )
and get authorization from Jordan to commit.

Thanks,
Bosko.

John Hay wrote:

> >
> > >> > will correct it.  thanks for reporting.
> >
> >
> > http://www.kame.net/dev/cvsweb.cgi/kame/kame/sys/netinet6/nd6.c.diff?r1=1.135&r2=1.136
> >
> > itojun
> >
>
> Any chance of this finding its way into -current and -stable
(preferably
> before the release)?
>
> John
> --
> John Hay -- [EMAIL PROTECTED]
>
> To Unsubscribe: send mail to [EMAIL PROTECTED]
> with "unsubscribe freebsd-net" in the body of the message
>





Re: kernel: nd6_storelladdr failed, mbuf leak

2001-03-08 Thread Bosko Milekic

Euuh, scratch that. I don't see anything committed to HEAD yet, at
all. No use in MFCing something that hasn't yet been committed. :-)

Bosko

I wrote:

> I'll do it right now if itojun doesn't mind (to save him a task
:-) )
> and get authorization from Jordan to commit.
>
> Thanks,
> Bosko.
>
> John Hay wrote:
>
> > >
> > > >> > will correct it.  thanks for reporting.
> > >
> > >
>
> http://www.kame.net/dev/cvsweb.cgi/kame/kame/sys/netinet6/nd6.c.diff?r1=1.135&r2=1.136
> > >
> > > itojun
> > >
> >
> > Any chance of this finding its way into -current and -stable
> (preferably
> > before the release)?
> >
> > John
> > --
> > John Hay -- [EMAIL PROTECTED]
> >
> > To Unsubscribe: send mail to [EMAIL PROTECTED]
> > with "unsubscribe freebsd-net" in the body of the message
> >
>
>
> To Unsubscribe: send mail to [EMAIL PROTECTED]
> with "unsubscribe freebsd-net" in the body of the message
>





Re: kernel: nd6_storelladdr failed, mbuf leak

2001-03-09 Thread Bosko Milekic


John Hay wrote:

> >
> > >I'll do it right now if itojun doesn't mind (to save him a task
:-) )
> > >and get authorization from Jordan to commit.
> >
> > please go ahead, i can review the diff if you want me to.
> >
>
> Ok, here is a patch for -current. It was taken from the kame code
> with minor adjustments to fit into our tree. I have tested it on
> -current, but not on -stable yet.
>
> Itojun will you look it over please?
>
> I'm gone for the weekend now, so I'll only be able to commit it by
> Sunday night. If someone wants to do it before then, you're welcome.

I have this one sitting around which is a direct "port" of the
KAME patch:

http://people.freebsd.org/~bmilekic/code/kame_nd6.diff

The differences between the two in terms of functionality are very
small, but I prefer your patch because it removes the printf()s from
if_ethersubr.c.

Since itojun OK'd it, I'll commit this today. I'll look into an
immediate MFC.

> John
> --
> John Hay -- [EMAIL PROTECTED]

[...]






Re: mbuf leak? fxp?

2001-04-05 Thread Bosko Milekic


On Thu, Apr 05, 2001 at 10:18:38AM -0700, Archie Cobbs wrote:
> Archie Cobbs writes:
> > I have this machine that starts running out of mbufs every few days
> > ("looutput: mbuf allocation failed") and then crashes, and was wondering
> > if anyone else has seen similar behavior...
> > 
> > For example...
> > 
> > Yesterday...
> > $ netstat -m
> > 461/624/4096 mbufs in use (current/peak/max):
> > 459 mbufs allocated to data
> > 2 mbufs allocated to packet headers
> > 434/490/1024 mbuf clusters in use (current/peak/max)
> > 1136 Kbytes allocated to network (36% of mb_map in use)
> > 0 requests for memory denied
> > 0 requests for memory delayed
> > 0 calls to protocol drain routines
> > 
> > Today...
> > $ netstat -m
> > 947/1072/4096 mbufs in use (current/peak/max):
> > 945 mbufs allocated to data
> > 2 mbufs allocated to packet headers
> > 920/946/1024 mbuf clusters in use (current/peak/max)
> > 2160 Kbytes allocated to network (70% of mb_map in use)
> > 0 requests for memory denied
> > 0 requests for memory delayed
> > 0 calls to protocol drain routines
> > 
> > It appears that something is slowly eating up mbuf clusters.
> > The machine is on a network with continuous but very low volume
> > traffic, including some random multicast, NTP, etc. The machine
> > itself is doing hardly anything at all.
> 
> Well, my current guess is that this is simply an NMBCLUSTERS problem.
> I increased NMBCLUSTERS to 8192 and it hasn't happened again yet.

I kind of doubt that, judging simply from the netstat -m outputs
you have posted above. In neither one does the number of clusters allocated
reach the maximum number of allocatable clusters. If that were the case, you
would likely see some numbers for "requests for memory denied" and/or
"requests for memory delayed."
In any case, increasing NMBCLUSTERS to the number you mention is
not a bad idea.
 
> This machine has 5 ethernet interfaces, which must be probably more
> than the default NMBCLUSTERS can handle.
> 
> I wonder if we should increase the default NMBCLUSTERS, or document
> somewhere that > 4 interfaces requires doing so?

Well, the way it should be done is that `maxusers' should be
increased, if anything. `maxusers' automatically tunes NMBCLUSTERS and
NMBUFS accordingly. Chances are that, if you are explicitly declaring
`NMBCLUSTERS ' in your kernel configuration file, you are
actually lowering the number of clusters/mbufs that would otherwise be
allowed with your given `maxusers' value (unless you have an unreasonably
low maxusers).

> Thanks for all the suggestions...
> 
> -Archie
> 
> __
> Archie Cobbs * Packet Design * http://www.packetdesign.com

Regards,

-- 
Bosko Milekic
[EMAIL PROTECTED]





Re: mbuf leak? fxp?

2001-04-06 Thread Bosko Milekic


On Thu, Apr 05, 2001 at 09:04:15PM -0700, Archie Cobbs wrote:
> Bosko Milekic writes:
> > NMBUFS accordingly. Chances are, if you are explicitly declaring
> > `NMBCLUSTERS ' in your kernel configuration file, that you are
> > actually lowering the number of clusters/mbufs that would otherwise be
> > allowed with your given `maxusers' value (unless you have an unreasonably
> > low maxusers).
> 
> Mmm.. I don't understand that.. can you explain?

Heh. I'm sorry for being so "obscure" about this. After re-reading
it, I realize I should have probably just quoted the following:

#ifndef NMBCLUSTERS
#define NMBCLUSTERS (512 + MAXUSERS * 16)
#endif
TUNABLE_INT_DECL("kern.ipc.nmbclusters", NMBCLUSTERS, nmbclusters);
TUNABLE_INT_DECL("kern.ipc.nmbufs", NMBCLUSTERS * 4, nmbufs);

(from src/sys/kern/uipc_mbuf.c)

So, for example, for MAXUSERS 256, NMBCLUSTERS is 4608, whereas I have
seen people do things like this before:

maxusers 256
options NMBCLUSTERS 4096

Thus actually reducing the address space allotted to clusters.

> -Archie
> 
> __
> Archie Cobbs * Packet Design * http://www.packetdesign.com

Regards,
-- 
Bosko Milekic
[EMAIL PROTECTED]





Re: I call sosend occur error . The error code=22.

2001-04-17 Thread Bosko Milekic


On Thu, Apr 19, 2001 at 12:36:11AM +0800, Jessey wrote:
> Dear Sir:
> I call sosend occur error. The error code=22.
> What wrong in my code?
> 
> The code follow:
> 
> socreate(); // is ok;
> 
> |
> }
> 
> top=m_get(MT_DATA,M_DONTWAIT);
> 
>  if(top==NULL)
>  {
>  log(LOG_INFO, "top=NULL occur error\n");
>  goto out;
>  }
> 
>  top->m_next=NULL;
>  top->m_nextpkt=NULL;
>  top->m_len=50;
>  adr=sizeof(struct m_hdr) +sizeof(int)+sizeof(struct ifnet *);
>  log(LOG_INFO, "adr= %d \n",adr);
> 
>  top->m_data=top+adr;
  ^^
First of all, that looks totally bogus!

>  log(LOG_INFO, "top= %x, top->m_data= %x,top+adr=%x \n",top 
> ,top->m_data,top+adr);
> 
>  top->m_type=MT_DATA;
>  top->m_flags=M_PKTHDR;
>  top->m_pkthdr.len=50;
>  top->m_pkthdr.rcvif=NULL;

You should probably just be using m_gethdr. Is this kernel code?
 
>  log(LOG_INFO, "top= %x, Data= %x \n",top ,top->m_data);
> 
>  error = sosend(so,(struct sockaddr *) &sin,NULL,top, 
> 0,MSG_DONTROUTE);

Error code 22 is EINVAL. Something sosend() calls is telling you that
whatever you're passing to sosend() is probably invalid. Are you sure that
what you're trying to do is OK?

> Thanks/regards

-- 
 Bosko Milekic
 [EMAIL PROTECTED]





Re: Free BSD network buffers

2001-05-16 Thread Bosko Milekic



On Wed, May 16, 2001 at 01:17:44AM -0400, [EMAIL PROTECTED] wrote:
> Hi,
> 
> My doubt is whether mbuf's along with clusters will be used for large
> data sizes like 65536 bytes?

If what you're talking about has to do with sending data over the
network, we attach clusters (or other larger buffers) to mbufs. We won't
fragment a huge load of data across 100 chained mbufs, for example (as that
would be quite ridiculous). :-) 
 
> Kindly reply.
> regards
> ravi prasad
> __
> Get your own FREE, personal Netscape Webmail account today at
> http://webmail.netscape.com/

-- 
 Bosko Milekic
 [EMAIL PROTECTED]





Re: tcp template removal / scalability patch

2001-06-22 Thread Bosko Milekic


> - bzero(ip, sizeof(struct ip));   /* XXX overkill? */
> + bzero(ip, sizeof(struct ip));   /* XXX overkill? */

I pointed this out the first time: you changed what were tabs to
spaces, which looks unnecessary unless there's a good reason for it.
Other than that, looks fine at this end!

Cheers,
-- 
 Bosko Milekic
 [EMAIL PROTECTED]





Re: TCP problems with large window sizes on FreeBSD (GigaTCP)

2001-08-01 Thread Bosko Milekic


Hi Stanislav,

On Wed, Aug 01, 2001 at 11:32:29PM -0400, stanislav shalunov wrote:
> We want to build two or more machines that would be capable of
> achieving TCP throughputs of 700-800Mb/s over WAN (with a single TCP
> connection).  The motivations of this exercise are spelled out on the
> referred web page.  Additionally, I believe that getting through with
> this exercise with FreeBSD as the OS would advance FreeBSD's cause
> with network researchers and advanced users of high-performance
> networks.
> 
> In order to run at such high throughput over links with RTT of roughly
> 70ms we'd need window sizes in the vicinity of 8-16MB.  (And,
> naturally, the unidirectional loss event probability has to be less
> than (.7*MSS/(RTT*bandwidth))^2 = 1e-7.  We believe that we have
> networks that lose less than one packet in ten million.)
> 
> We have built the boxes now.  I have started with back-to-back testing
> with large window sizes.  Back-to-back testing is believed to be valid
> because it's hard to expect that inserting 70ms delay between the host
> will make the situation any better.
> 
> I cannot get it to run with window sizes greater than half a megabyte.
> 
> The story, with some very preliminary analysis, is at
> http://www.internet2.edu/~shalunov/gigatcp/
> 
> I'm not reposting it here; there are 29KB of text and 3MB of data
> there.  I'm adding and updating stuff as I progress.
> 
> The questions that I have for you guys are, in decreasing order of
> importance:
> 
> 1. How do I fix the ti driver problem that apparently is holding me
>back?  What number of jumbo slots would be "good"?

I would recommend, seeing as how you're targeting specifically TCP
throughput and are not too concerned with lack of physical memory, to
increase TI_JSLOTS to at least 500-600. This would have the effect of
reserving roughly 5M of physically contiguous memory during driver attachment,
which I think you can safely spare. If you discover that you need even more
than 500-600 jumbo buffers, feel free to experiment with the TI_JSLOTS
constant.
As far as memory buffer tuning is concerned, I would also recommend
the following changes, to be on the safe side:

 - in uipc_mbuf.c, increase NCL_INIT to roughly 20. Seeing as how you're using
   if_ti which allocates its own buffer space, I don't suspect that you'll
   be needing many regular clusters. Check with `netstat -m' to see how many
   are typically in use during a test and increase NCL_INIT to that number. All
   this will do is pre-allocate the pool at boot time and avoid potentially
   expensive map allocations while your tests are running.

 - also in uipc_mbuf.c, increase NMB_INIT to 10240, or any sensible value you
   have detected is required (again, see with `netstat -m' during testing,
   try to determine the maximum number of mbufs you'll need). All this will
   do is, again, allocate them at boot time and speed up memory buffer
   allocations during performance testing. If you set the number to 10240,
   remember that each mbuf is merely 256 bytes, so you'll be giving up a mere
   ~2.7M for the cause, and speeding up allocations altogether.

Finally, I noticed at one point in your analysis that you increased
NMBCLUSTERS. You'll find that, unless you're actually running out of mbufs
and/or clusters, increasing N{MB,CL}_INIT is probably what you want to
do instead.

> 2. Why doesn't Fast Retransmit kick in?  (See the annotated sender's
>view of the stalled connection.)
> 
> 3. Is there an off-by-one error in RST handling?  (See the end of the
>annotated receiver's view of the stalled connection.)

I believe jlemon covered these two issues in his post, which very
much makes sense as he's the overall stack guru. :-)

> -- 
> Stanislav Shalunovhttp://www.internet2.edu/~shalunov/
> 
> "Nuclear war would really set back cable [television]."  -- Ted Turner

-- 
 Bosko Milekic
 [EMAIL PROTECTED]





[PATCH] mbtypes stats re-enabled in -CURRENT

2001-08-21 Thread Bosko Milekic


Hi,

  Several weeks ago, Jiangyi Liu <[EMAIL PROTECTED]> sent me an early patch that
re-enabled mbtypes statistics under the mbuf allocator in -CURRENT (I had
disabled them temporarily while I worked on the allocator semantics itself).
  Just a couple of days ago, I cleaned up the patch, fixed it up, and rolled
a new diff that applies to the most recent CURRENT. This patch re-enables mbtypes
statistics under CURRENT:

  http://people.freebsd.org/~bmilekic/code/mb_alloc/mbtypes_stats.diff (17k)

  To CURRENT users with either SMP or UP : please test, you will need to:

  1) cvsup fresh CURRENT sources
  2) apply patch found at above URL
  3) rebuild and reinstall kernel
  4) reboot
  5) cp /usr/src/sys/sys/mbuf.h /usr/include/sys/
  6) rebuild and reinstall both /usr/src/usr.bin/netstat and
 /usr/src/usr.bin/systat
  7) go about your business, all network services working as usual
  8) Check `netstat -m' and `systat -mbufs' outputs, more specifically look
 out for "Allocated mbuf types" output in the mbuf section of `netstat -m'
 output and look for allocated mbuf types to be displayed following a
 `systat -mbufs'
  9) Email me ([EMAIL PROTECTED]) and let me know how it went.

  I'd like to commit this within the next week.

  To those reviewing the patch, here's how it works:

  Mbuf "types" (e.g. MT_DATA, etc.) are counted with the help of `long' types.
Each per-CPU mbuf statistics data structure (mbpstat) contains an array of
longs called mb_mbtypes[]. The size of this array is, of course, MT_NTYPES,
defined in sys/mbuf.h specifically for this purpose. The general mbuf statistics
structure (called mbstat) contains a `short' m_numtypes member which, during
mbuf_init(), is set to MT_NTYPES. The general container's mbpstat struct, like
the per-CPU container's mbpstat structs, also contains this mb_mbtypes array.
  During mbuf allocation, (type) being an argument already passed to m_get()
gets passed down to the mb_alloc allocation routine (mb_alloc()). For mbuf
clusters, MT_NOTMBUF gets passed down (MT_FREE - no longer used - has been
renamed to MT_NOTMBUF). The mb_alloc() code will then increment
mb_mbtypes[(type)] if (type) is not MT_NOTMBUF. This manipulation is performed
consistently because it is in the per-CPU container and because it is done
while the per-CPU container's (or the general container's) lock is already
held.
  During mbuf freeing, mbuf_being_freed->m_type is passed to mb_free() by
m_free(). Again, in the case of mbuf clusters, MT_NOTMBUF is passed down
instead. So, mb_free() will decrement mb_mbtypes[(type)] if (type) is not
MT_NOTMBUF. This manipulation is handled under the given container's lock
and is therefore guaranteed to be consistent.
  The reason this works is because each container has its own mbpstat per-CPU
structure and can therefore ensure its consistency via its own container lock.
The other reason this works is because mb_mbtypes[] is, in each case, an
array of _longs_ and not _unsigned longs_. This allows for the counters to
go below 0, as is expected in this situation. Consider, for example, that
a thread A runs on CPU 0 and allocates an MT_DATA mbuf, then this happens:

 mbpstat[cpu_0]->mb_mbtypes[MT_DATA]++;

  If one considers that mbpstat[cpu_0]->mb_mbtypes[MT_DATA] was 0 prior to
this manipulation, then its new value is 1.
  Now consider that another thread B on CPU 1 frees this same mbuf. Then this
happens:

 mbpstat[cpu_1]->mb_mbtypes[MT_DATA]--;

  If one considers that mbpstat[cpu_1]->mb_mbtypes[MT_DATA] was 0 prior to
this manipulation, then its new value is -1. So to get the "real"
mbtypes[(type)] value, we do the following:

 mbtypes[(type)] = mbpstat[cpu_0]->mb_mbtypes[(type)] + ... +
   mbpstat[cpu_n]->mb_mbtypes[(type)];

  We are guaranteed a number >= 0, because of the fact that all the mb_mbtypes[]
are manipulated consistently, under some lock, and because we will never have
a certain type _decremented_ before it's once _incremented_. In other words,
we will never have something like "mb_mbtypes[(type)]--" occur unless there was
already at least one "mb_mbtypes[(type)]++" that occurred on at least one of the
other containers.
  MCHTYPE() is implemented to call a function m_chtype() which changes the type
accordingly and manipulates the mb_mbtypes[] of the general container after
acquiring the general container's lock. This function is rarely used, and it
should hopefully stay that way.

Regards,
-- 
 Bosko Milekic
 [EMAIL PROTECTED]





Re: performance issues with M_PREPEND on clusters

2001-10-23 Thread Bosko Milekic


As I have stated to Luigi in private and as I have stated more than
half a year ago now when this issue was first brought up by dwmalone,
I believe that the correct way to deal with this issue is to have
M_LEADINGSPACE and M_TRAILINGSPACE simply return the number of bytes
that make up the leading space or trailing space, respectively,
regardless of whether the underlying cluster/mbuf/ext_buf is marked
writable or not.

There is nothing in the name "M_LEADINGSPACE" and "M_TRAILINGSPACE"
that implies anything having to do with the writability of the
underlying buffer and therefore, these two should not be concerned
with writability.

The reason the relevant code was commented out in the first place,
when it was imported, is a crude hack that was probably brought in
when support for external buffers other than clusters was
implemented: at the time, support for marking such buffers
"writable" did not exist, and so, to minimize checking in code using
the M_LEADINGSPACE and M_TRAILINGSPACE macros, they were merely made
to return 0 for any M_EXT-marked mbuf.

Instead, the correct approach would be, as stated above, to have
M_LEADINGSPACE and M_TRAILINGSPACE return what their names imply and
have the code that uses them in the kernel appropriately check for
writability of the ext_buf before it's about to write to it, using
the _already existing_ interface, i.e. M_WRITABLE(). Since this
interface _exists_ and we have it implemented, there is no reason
not to use it properly anymore.

Yes, I'm willing to go around and do the work in -CURRENT, and I
will do this, hopefully, by this weekend, as the [massive amount of]
school work "stabilizes." I think the same approach should be taken
in -STABLE. That said, I am not 100% opposed to Luigi's patch and I
think that had not brought this up, it would have been left long
forgotten.  

On Tue, Oct 23, 2001 at 02:06:28PM -0700, Luigi Rizzo wrote:
> On Tue, Oct 23, 2001 at 02:00:34PM -0500, Alfred Perlstein wrote:
> > 
> > Yes, you're right, I was mistaken in my paranioa, however you're
> > missing the fact that one may want to allocate an EXT buf and still
> > have it writeable.
> 
> Yes, but this is the sort of things that are better sorted out
> in -current...
> 
> > I like your patch, it's a good idea, go for it.
> 
> good to hear that :)
> 
>   cheers
>   luigi

Cheers,
-- 
 Bosko Milekic
 [EMAIL PROTECTED]





Re: performance issues with M_PREPEND on clusters

2001-10-26 Thread Bosko Milekic


On Fri, Oct 26, 2001 at 08:47:27PM +0900, Keiichi SHIMA wrote:
> Right.  And we will have a problem if someone changes the semantics of
> M_LEADINGSPACE.  The M_LEADINGSPACE macro of Net/OpenBSD does the same
> as the current FreeBSD does, that is, returns 0 if M_EXT is set.
> 
> > When I looked at this question last year I think I found that there
> > were few enough places in the kernel which used M_LEADINGSPACE, so
> > it should be fairly easy to check them. However, I think several
> > of the uses were in KAME code.
> 
> The patch posted first by Luigi is safe because the patch doesn't
> change any semantics of M_LEADINGSPACE.  I think it (and the patch)
> reasonable that M_LEADINGSPACE returns 0 when the mbuf (cluster) is
> not writable and the real space when the mbuf (cluster) is writable.
> 
> But, I disagree the later proposal from Bosko that changes the meaning
> of M_LEADINGSPACE.  This leads more ifdefs in the shared code
> (ex. KAME, maybe other cross-platform projects) and the code
> complexity.

  Just because M_LEADINGSPACE may be broken in the other systems does
not mean that it should be broken in our system as well. I am against
sacrificing good code in order to deal with the left-over stupidities
(pardon the seemingly harsh vocabulary) in the other systems, when it
comes to porting.
  You should understand that M_LEADINGSPACE was meant to return the
number of leading bytes in the underlying data buffer and that it was
only broken when someone decided to half-implement ext bufs. This
brokenness happened to live on in present-day systems and unbreaking it
should not be put on hold for the sake of pseudo-compatibility, IMHO.

> ---
> Keiichi SHIMA
> IIJ Research Laboratory  <[EMAIL PROTECTED]>
> KAME Project <[EMAIL PROTECTED]>
> 

-- 
 Bosko Milekic
 [EMAIL PROTECTED]





Re: performance issues with M_PREPEND on clusters

2001-10-26 Thread Bosko Milekic


On Fri, Oct 26, 2001 at 01:10:34PM +0100, David Malone wrote:
> So the M_LEADINGSPACE macro should probaly read:
> 
> #define   M_LEADINGSPACE(m)   \
>(!M_WRITABLE(m) ? 0 :  \
>   (m)->m_flags & M_EXT ? (m)->m_data - (m)->m_ext.ext_buf :   \
>   (m)->m_flags & M_PKTHDR ? (m)->m_data - (m)->m_pktdat : \
>   (m)->m_data - (m)->m_dat)
> 
> and the comments for M_LEADINGSPACE and M_TRAILINGSPACE should note
> the inconsistancy where M_LEADINGSPACE returns the number of writable
> bytes and M_TRAILINGSPACE returns the number bytes, but doesn't indicate
> if they are writable or not.
> 
> Maybe we should also provide M_{LEAD,TRAIL}INGSPACE_{R,W} macros
> with consistent behaviour.  The _R macros would be cheaper 'cos
> they wouldn't have to check writability. These could be used in
> the case where you already know the storage is writable.
> 
> (Here _R would mean "regardless" 'cos it's hard to imagine a
> situation where you want to know how many readable bytes are outside
> the valid data region in an mbuf ;-)
> 
> Bosko - what do you think of modifying M_LEADINGSPACE as shown
> above and then providing these new macros? It would mean that we
> can leave shared code alone, but where we want to optimise
> M_LEADINGSPACE and M_TRAILINGSPACE we have the macros to do so.

  I already stated what I thought about it: it's wrong, for reasons
mentioned in previous emails. It means that we will be adding support
for a broken macro, not to mention enlarging the macro, which is bad
especially for code that is actually using it properly.

>   David.

-- 
 Bosko Milekic
 [EMAIL PROTECTED]





Re: performance issues with M_PREPEND on clusters

2001-10-26 Thread Bosko Milekic



On Fri, Oct 26, 2001 at 09:28:40AM -0700, Luigi Rizzo wrote:
> Bosko, Keiichi has a valid point. The semantics of an interface
> are described by the documentation associated with it, not by its
> name or the Oxford Dictionary.

  And what documentation would that be? Are you referring to the
[incomplete] mbuf(9) man page which does not cover M_LEADINGSPACE, or
are you referring to "The Implementation of 4.4BSD" (McKusick et al.)
which, to my knowledge, doesn't specify anything regarding the
writability of the underlying data (I don't know if it specifies
anything regarding M_LEADINGSPACE to begin with)?
  The M_LEADINGSPACE macro always had the line of code that would return
the leading space even for ext_bufs commented out, suggesting that the
code did exist in the original implementation or was intended to. You
yourself noticed that the behavior of M_LEADINGSPACE was incorrect in
that it returned 0 for all ext_bufs. Therefore, the issue is not whether
or not M_LEADINGSPACE should be changed because, clearly, it should and
you yourself are lobbying to change it. The issue is how to change it
properly, and I have already presented my opinion on how it should be
done. Arguing that OpenBSD and/or NetBSD don't have this behavior is
totally ridiculous, seeing as how they don't have the same "writability"
macros (the last time I checked) that we do anyway.

> As much as it might be misleading that M_LEADINGSPACE checks for
> writability, it has probably been like this for a long time, we
> have plenty of code that depends on this, and people looking for
> documentation will likely see this behaviour documented.  Changing

  No. It has never been like this.

> it might introduce insidious bugs in the code -- because in reality,
> clusters are almost never shared except for those on the TCP send
> queue and for multicast packets (but who uses multicast, anyways),
> so there is a chance that we will have buggy code that goes unnoticed
> for a long time.
> 
>   cheers
>   luigi
> 
> On Fri, Oct 26, 2001 at 12:17:24PM -0400, Bosko Milekic wrote:
> > On Fri, Oct 26, 2001 at 08:47:27PM +0900, Keiichi SHIMA wrote:
> > > Right.  And we will have a problem if someone changes the semantics of
> > > M_LEADINGSPACE.  The M_LEADINGSPACE macro of Net/OpenBSD does the same
> ...
> > > But, I disagree the later proposal from Bosko that changes the meaning
> > > of M_LEADINGSPACE.  This leads more ifdefs in the shared code
> > > (ex. KAME, maybe other cross-platform projects) and the code
> > > complexity.
> > 
> >   Just because M_LEADINGSPACE may be broken in the other systems does
> > not mean that it should be broken in our system as well. I am against
> > sacrificying good code in order to deal with the left-over stupidities
> > (pardon the seemingly harsh vocabulary) in the other systems, when it
> > comes to porting.
> >   You should understand that M_LEADINGSPACE was meant to return the
> > number of leading bytes in the underlying data buffer and that it was
> > only broken when someone decided to half-implement ext bufs. This
> > brokeness happened to live on in present-day systems and unbreaking it
> > should not be put on hold for the sake of pseudo-compatibility, IMHO.
> > 
> > > ---
> > > Keiichi SHIMA
> > > IIJ Research Laboratory  <[EMAIL PROTECTED]>
> > > KAME Project <[EMAIL PROTECTED]>
> > > 
> > 
> > -- 
> >  Bosko Milekic
> >  [EMAIL PROTECTED]
> > 
> > 
> > To Unsubscribe: send mail to [EMAIL PROTECTED]
> > with "unsubscribe freebsd-net" in the body of the message
> 
> To Unsubscribe: send mail to [EMAIL PROTECTED]
> with "unsubscribe freebsd-net" in the body of the message
> 

-- 
 Bosko Milekic
 [EMAIL PROTECTED]





Re: performance issues with M_PREPEND on clusters

2001-10-26 Thread Bosko Milekic


On Sat, Oct 27, 2001 at 03:31:48AM +0900, Keiichi SHIMA wrote:
> M_LEADINGSPACE might be meant to return as you were saying.  But
> because of some reasons (maybe a lack of time to implement the
> complete code), it behaves as we see now.  For a long time, in many
> systems, M_LEADINGSPACE had been returning writable space only.  As a
> result, some codes (including KAME) rely on the behavior.

  This is not true. M_LEADINGSPACE has not been returning writable space
only for a long time; it has been returning space in non-M_EXT mbufs for
a long time. So, doing what you want to do *is* changing its behavior,
whether you like it or not. I don't know how many times I have to repeat
this until someone starts to get it. The issue is not whether or not
you're changing the behavior of M_LEADINGSPACE because Luigi's
suggestion obviously does. The issue is how to do it properly and the
way it's being proposed right now is not the proper way.
  As it turns out, I'm really not up for extending this debate any
longer. I'm not going to kill myself over something like this being
done wrongly, seeing as how I appear to be alone in arguing this side
(judging by the responses). I will maintain that I believe it is
wrong, and for good reason, but I would rather not repeat the same
thing over and over again in the same thread.

> So, I would suggest that we should not change M_LEADINGSPACE behavior,
> and write some documents (or comment in the source may be enough)
> about that.  Because one of the reasons of the confusion of
> M_LEADINGSPACE is there were no documentation about that, we had only
> the code and the code said that it returns a writable length.
> 
> (if necessary) we should create a new API to get the real(?) space as
> David have proposed.
> 
> Best regards,
> 
> ---
> Keiichi SHIMA
> IIJ Research Laboratory  <[EMAIL PROTECTED]>
> KAME Project <[EMAIL PROTECTED]>

-- 
 Bosko Milekic
 [EMAIL PROTECTED]





Re: mbuf / maxfiles / maxsockets / etc autoscaling patch

2001-12-08 Thread Bosko Milekic
/uipc_socket2.c
> --- sys.old/kern/uipc_socket2.c   Sat Dec  8 16:04:50 2001
> +++ sys/kern/uipc_socket2.c   Sat Dec  8 16:08:43 2001
> @@ -1026,7 +1026,12 @@
>   */
>  static void init_maxsockets(void *ignored)
>  {
> + int autosockets, maxuserssockets;
> + 
> + autosockets = physmemMB * SOCKETSPERMB;
> + autosockets = min(MAXAUTOSOCKETS, max(MINAUTOSOCKETS, autosockets));
> + maxuserssockets = 2 * (20 + (16 * maxusers)); 
> + maxsockets = max(maxuserssockets, max(autosockets, nmbclusters));
>   TUNABLE_INT_FETCH("kern.ipc.maxsockets", &maxsockets);
> - maxsockets = imax(maxsockets, imax(maxfiles, nmbclusters));
>  }
>  SYSINIT(param, SI_SUB_TUNABLES, SI_ORDER_ANY, init_maxsockets, NULL);
> diff -u -r sys.old/netinet/tcp_subr.c sys/netinet/tcp_subr.c
> --- sys.old/netinet/tcp_subr.cSat Dec  8 16:04:42 2001
> +++ sys/netinet/tcp_subr.cSat Dec  8 16:10:31 2001
> @@ -190,6 +190,7 @@
>  tcp_init()
>  {
>   int hashsize = TCBHASHSIZE;
> + int autohashsize;
>   
>   tcp_ccgen = 1;
>   tcp_cleartaocache();
> @@ -203,6 +204,13 @@
>  
>   LIST_INIT(&tcb);
>   tcbinfo.listhead = &tcb;
> +
> + /* Calculate autoscaled hash size, use if > default hash size. */
> + autohashsize = physmemMB * TCBHASHPERMB;
> + autohashsize = min(MAXAUTOTCBHASH, max(MINAUTOTCBHASH, autohashsize));
> + while (!powerof2(autohashsize))
> + autohashsize++;
> + hashsize = max(hashsize, autohashsize);
>   TUNABLE_INT_FETCH("net.inet.tcp.tcbhashsize", &hashsize);
>   if (!powerof2(hashsize)) {
>   printf("WARNING: TCB hash size not a power of 2\n");
> diff -u -r sys.old/powerpc/powerpc/machdep.c sys/powerpc/powerpc/machdep.c
> --- sys.old/powerpc/powerpc/machdep.c Sat Dec  8 16:04:39 2001
> +++ sys/powerpc/powerpc/machdep.c Sat Dec  8 16:48:30 2001
> @@ -436,7 +436,8 @@
>   __asm ("mtsprg 0, %0" :: "r"(globalp));
>  
>   /* Init basic tunables, hz etc */
> - init_param();
> + init_hz();
> + init_param(0); /* XXX - needs to be fed physmem for proper autoscaling */
>  
>   /* setup curproc so the mutexes work */
>  
> diff -u -r sys.old/sparc64/sparc64/machdep.c sys/sparc64/sparc64/machdep.c
> --- sys.old/sparc64/sparc64/machdep.c Sat Dec  8 16:04:38 2001
> +++ sys/sparc64/sparc64/machdep.c Sat Dec  8 16:47:29 2001
> @@ -249,10 +249,10 @@
>   end = (vm_offset_t)_end;
>   }
>  
> - /*
> -  * Initialize tunables.
> -  */
> - init_param();
> + /* Init hz */
> + init_hz();
> + /* Init basic tuneables - XXX - this needs to be moved once maxmem exists 
>here. */
> + init_param(0);
>  
>  #ifdef DDB
>   kdb_init();
> diff -u -r sys.old/sys/param.h sys/sys/param.h
> --- sys.old/sys/param.h   Sat Dec  8 16:04:37 2001
> +++ sys/sys/param.h   Sat Dec  8 16:05:28 2001
> @@ -230,6 +230,44 @@
>  #define ctodb(db)/* calculates pages to devblks */ \
>   ((db) << (PAGE_SHIFT - DEV_BSHIFT))
>  
> +/*
> + * Values used in autoscaling system structures based on RAM size.
> + *
> + * Although settings are scattered across various subsystems, a
> + * common formula is followed.  Generally, there are three
> + * possible values to choose from:  The value suggested by maxusers,
> + * the value suggested by the autoscaling formula, and a manually
> + * tuned value from loader.conf.  If a manually tuned value is specified,
> + * this value will be used.  Otherwise, the maximum of the maxusers
> + * and autoscaled setting will be used.
> + *
> + */
> +
> +/* Max processes, files.  These are set in subr_param.c */
> +#define PROCPERMB 16
> +#define MINAUTOPROC 256
> +#define MAXAUTOPROC 32000
> +#define FILESPERMB 128
> +#define MINAUTOFILES 1024
> +#define MAXAUTOFILES 65536
> +
> +/* Max sockets.  These are set in uipc_socket2.c */
> +#define SOCKETSPERMB 64
> +#define MINAUTOSOCKETS 512
> +#define MAXAUTOSOCKETS 32000
> +
> +/* Max mbuf clusters, sendfile buffers.  These are set in subr_mbuf.c */
> +#define MCLSPERMB 32
> +#define MINAUTOMCLS 512
> +#define MAXAUTOMCLS 32000
> +#define SFBUFPERMB 16
> +#define MINAUTOSFBUF 1024
> +#define MAXAUTOSFBUF 32000
> +
> +/* Number of TCP hash buckets.  These are set in tcp_subr.c */
> +#define TCBHASHPERMB 8
> +#define MINAUTOTCBHASH 512
> +#define MAXAUTOTCBHASH 8192
>  
>  /*
>   * Make this available for most of the kernel.  There were too many
> diff -u -r sys.old/sys/systm.h sys/sys/systm.h
> --- sys.old/sys/systm.h   Sat Dec  8 16:04:37 2001
> +++ sys/sys/systm.h   Sat Dec  8 16:07:45 2001
> @@ -60,6 +60,7 @@
>  extern struct cv selwait;/* select conditional variable */
>  
>  extern int physmem;  /* physical memory */
> +extern int physmemMB;/* physical memory size in megabytes */
>  
>  extern dev_t dumpdev;/* dump device */
>  extern long dumplo;  /* offset into dumpdev */
> @@ -121,7 +122,8 @@
>  
>  void cpu_boot __P((int));
>  void cpu_rootconf __P((void));
> -void init_param __P((void));
> +void init_hz __P((void));
> +void init_param __P((u_int64_t));
>  void tablefull __P((const char *));
>  int  kvprintf __P((char const *, void (*)(int, void*), void *, int,
> _BSD_VA_LIST_)) __printflike(1, 0);


-- 
 Bosko Milekic
 [EMAIL PROTECTED]





Re: m_reclaim and a protocol drain

2001-12-21 Thread Bosko Milekic


On Fri, Dec 21, 2001 at 09:42:30AM -0600, Randall Stewart wrote:
> Hi all:
> 
> I have a question. I have been working to test the new
> sctp_drain function I am adding and have had a very difficult
> time getting the drain function to be called by the mbuf system...
> 
> Now here is what I most observe from some of the test cases
> I am building:
> 
> A) All inbound packets get a cluster down in the driver routine.
> B) There is a much smaller limit to clusters
> C) The cluster allocation routine will NOT call reclaim() et.al.

  This has changed in -CURRENT and it should be easy to change -STABLE
to do the same. -CURRENT now drains the protocols in the cluster
starvation case too.

> D) Of course since the lower drivers are allocating M_DONTWAIT
>even if they did I would not get the routine called.
> 
> Now this brings to light a weakness in my mind on the reclaim
> system.
> 
> 1) One of the primary things I thought the drain() functions 
>help with is to ward off DOS attacks.

  Well, no, not really. They're just there to `help' out in any
starvation case, really.

> 2) If drivers all use clusters only and clusters can never
>call a drain() function, does this not leave both TCP and
>SCTP weak against an attack on the cluster side of the MBUF
>system?

  Well, firstly, all clusters are accompanied by mbufs. Secondly, as
mentioned above, -CURRENT drains in both cases.

> 3) I can see if we are out of mbufs eventually something sending
>down will do a mget(..) with a M_WAIT which can spawn the drains
>should we not have something like this for a cluster allocation??

  There's no way we can have M_DONTWAIT allocations drain the
protocols. It would take far too long for an M_DONTWAIT allocation,
especially in light of where we may be going with this in the future
(i.e., possibly processing some packets from interrupt context).

  What I think you should do in your code is make the calls with
M_TRYWAIT (what you call M_WAIT) wherever they are possible and only
call with M_DONTWAIT where it's really not possible to wait. The
M_TRYWAIT flag does not imply "run slower than M_DONTWAIT," it just
means "try harder even if it takes a little longer, since we are able to
block."
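To illustrate the distinction, here is a minimal userland sketch of the
policy described above. The names (buf_alloc, the reserve counter
standing in for what a protocol drain could reclaim) are hypothetical
and not the kernel API:

```c
#include <assert.h>

#define M_DONTWAIT 0
#define M_TRYWAIT  1

static int free_bufs;        /* primary free list (empty here) */
static int reserve_bufs = 2; /* stands in for what draining reclaims */

/* M_DONTWAIT answers immediately; M_TRYWAIT "tries harder" by
 * reclaiming the reserve before giving up. */
static int
buf_alloc(int how)
{
    if (free_bufs > 0)
        return (--free_bufs, 1);
    if (how == M_TRYWAIT && reserve_bufs > 0) {
        free_bufs += reserve_bufs; /* reclaim, then retry */
        reserve_bufs = 0;
        return (--free_bufs, 1);
    }
    return (0); /* M_DONTWAIT: fail fast */
}
```

The point of the sketch: M_TRYWAIT is not "slower", it simply has
permission to do extra work when the fast path fails.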
 
> If we don't it seems to me the utility of the drain() fucnction is
> very very limited..
> 
> Regards
> 
> R
> 
> -- 
> Randall R. Stewart
> [EMAIL PROTECTED] 815-342-5222 (cell phone)

-- 
 Bosko Milekic
 [EMAIL PROTECTED]





Re: m_reclaim and a protocol drain

2001-12-26 Thread Bosko Milekic


On Wed, Dec 26, 2001 at 06:13:40AM -0600, Randall Stewart wrote:
> > > 1) One of the primary things I thought the drain() functions
> > >help with is to ward off DOS attacks.
> > 
> >   Well, no, not really. They're just there to `help' out in any
> > starvation case, really.
> > 
> 
> This comment facinates me. The reason we made SACK's in SCTP
> revokeable is due to the potential DOS attack that someone
> can supposedly lauch if you don't allow the stack to revoke.
> 
> I can actually see the reason that Sally made the comments
> and had us change it so that SACK's are revokeable. However
> you argue to the contrary and I wonder which is correct.
> 
> If you do not allow revoking it is the same as if a protocol
> does not hold a drain() fucntion. A attacker could easily
> stuff a lot of out-of-order segments at you and thus
> fill up all your mbuf's or clusters (in my current testing
> case). This would then yeild a DOS since you could no longer
> receive any segments and leave you high and dry

  I agree that it is important to drain. I didn't say that it wasn't,
but merely that the drain routine in the mbuf code exists to `help out'
in the starvation case. Nothing prevents one from calling a specific
protocol drain routine explicitly from elsewhere. It is relatively
expensive to drain all the protocols for an M_DONTWAIT caller, so perhaps
if the protocol expects draining to occur in tight situations, we can
have the drain routines called from a different context as well (i.e.
not only when mbuf allocations fail but also when, for example,
something noteworthy occurs in the protocol code).

> > > 2) If drivers all use clusters only and clusters can never
> > >call a drain() function, does this not leave both TCP and
> > >SCTP weak against an attack on the cluster side of the MBUF
> > >system?
> > 
> >   Well, firstly, all clusters are accompanied by mbufs. Secondly, as
> > mentionned above, -CURRENT drains in both cases.
> > 
> 
> Hmm.. I will look at updating this...

  That would be very much appreciated. :-) I feel very bad for not doing
this myself because I seem to not be tracking -STABLE much these days.
But this is totally my fault and if you don't pick it up, feel free to
submit a PR and I swear I will force myself to do it. :-)

> >   What I think you should do in your code is make the calls with
> > M_TRYWAIT (what you call M_WAIT) wherever they are possible and only
> > call with M_DONTWAIT where it's really not possible to wait. The
> > M_TRYWAIT flag does not imply "run slower than M_DONTWAIT," it just
> > means "try harder even if it takes a little longer, since we are able to
> > block."
> > 
> 
> A couple of comments here:
> 
> a) There is NO M_TRYWAIT in the 4.3 strain I am working on.. I am
>about to upgrade to 4.4. so maybe it will appear there :>

 Ah, no, there is not. M_TRYWAIT is simply the better name given in
-CURRENT to M_WAIT. Basically, they're the same flag with the same
effect, but the name M_TRYWAIT is more descriptive and thus used in
-CURRENT (it means "try harder and wait if necessary, but not
necessarily forever").

> b) It is NOT my code that I am talking about. The issue is I am
>stuffing a LOT of out-of-order segments up the SCTP stack to
>get the drain() function called (so I can test it). Nothing
>ever calls the drain function and instead we lose packets at
>input since the drivers in the ethernet side of things (as you
>mention) do a M_DONTWAIT. My code uses M_WAIT wherever it can... 
>
>
> Hmm, maybe thats it... I need to somehow activate a condition where
> I send a large segment and do a M_WAIT... Probably easier to build
> a test ioctl to call the drain function though..

 Hm. Well, at any given time we may get an M_WAIT allocation and that
would do it. Note, however, that the drain routines _only_ get called
when there are *no* more mbufs (in -STABLE) or clusters (in -CURRENT)
left. That means that the mb_map has to be starved and that there are no
mbufs on the free lists. They do not get called unless there is total
starvation. That's why I mentioned above that perhaps it would be
worthwhile to have them called from a different context in the
protocol code when/if the protocol feels that they are necessary.

> R
> 
> > > If we don't it seems to me the utility of the drain() fucnction is
> > > very very limited..
> > >
> > > Regards
> > >
> > > R
> > >
> > > --
> > > Randall R. Stewart
> > > [EMAIL PROTECTED] 815-342-5222 (cell phone)

-- 
 Bosko Milekic
 [EMAIL PROTECTED]





Re: Getting rid of maxsockets.

2002-03-21 Thread Bosko Milekic


On Fri, Mar 22, 2002 at 02:59:04AM -0600, Mike Silbersack wrote:
> I've looked over vmstat -z with a UMA kernel, it's really nice to know
> that everything is coexisting together now.
> 
> There's one big target, though:  mbufs.  I know that Bosko put a lot of
> work into his new mbuf allocator, but if you could find a way to merge
> mbufs into the slab allocator the benefits would be huge.  Have you
> discussed doing this with Bosko yet?
> 
> Mike "Silby" Silbersack

   As I have previously mentioned, mb_alloc *is* a slab-like allocator,
but I am prepared to remove large chunks of it and instead glue into UMA
if this can be done in a relatively clean and elegant manner; after
looking at UMA, I've decided that it probably can be.

-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: Getting rid of maxsockets.

2002-03-21 Thread Bosko Milekic


On Thu, Mar 21, 2002 at 11:35:52PM -0500, Jeff Roberson wrote:
> On Fri, 22 Mar 2002, Mike Silbersack wrote:
> 
> > There's one big target, though:  mbufs.  I know that Bosko put a lot of
> > work into his new mbuf allocator, but if you could find a way to merge
> > mbufs into the slab allocator the benefits would be huge.  Have you
> > discussed doing this with Bosko yet?
> >
> > Mike "Silby" Silbersack
> >
> 
> We have talked about it quite a bit.  I'd love to remove the hard limit on
> mbufs.  I may do this soon, but I have other uma related work that will
> probably come before it.

  I'm not so sure I like this idea.  What would be better (and perhaps
what you meant) is: "be able to expand the size of the mbuf allocation
`pool' at runtime."  In any case, we should not jump to quick
conclusions with all data structures right away.  Instead, I propose
that we first glue-in mbuf allocations to UMA (not too difficult, given
that UMA provides an allocation routine stub).  If this is done properly
[without macro-performance loss] then it should be rather trivial to
bring in new functionality.

> Jeff

-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: splimp() during panic?

2002-05-24 Thread Bosko Milekic


Archie Cobbs wrote:
> Hi,
> 
> I'm trying to debug a mbuf corruption bug in the kernel. I've added
> an mbuf sanity check routine which calls panic() if anything is amiss
> with the mbuf free list, etc. This function runs at splimp() and if/when
> it calls panic() the cpl is still at splimp().
> 
> My question is: does this guarantee that the mbuf free lists, etc. will
> not be modified between the time panic() is called and the time a core
> file is generated? For example, if an incoming packet causes a networking
> interrupt after panic() has been called but before the core file is
> written, will that interrupt be blocked when it calls splimp()?

  splimp() ensures that no driver handlers will be executed.  Further,
  dumpsys() is called from panic() at splhigh() which would also mean
  that none of those potentially troublesome handlers will run.

> I've been working under this assumption but it seems to not be
> valid, because I seem to be seeing panics for situations that are
> not true in the core file.

  Are you seeing invalid stuff from DDB but valid stuff from the core
  file?  Because if so, that's REALLY WEIRD.  If you're just seeing two
  different but invalid things, then perhaps something is happening when
  Debugger() runs (is it possible that the cpl() is changed after
  or before a breakpoint()?).

> If this is not a valid assumption, is there an easy way to 'freeze'
> the mbuf free lists long enough to generate the core file when an
> inconsistency is found (other than adding the obvious hack)?

  To make doubly-sure, what you can do is just keep a variable 'foo'
  which you initialize to 0.  Before any mbuf free list manipulations,
  place a 'if (foo == 0)' check.  Atomically set foo to 1 before the
  panic.  See if the inconsistency changes.  If you're seeing garbage in
  both cases, but the garbage is inconsistent, perhaps there's a memory
  problem or the dump isn't working properly (I've never heard of
  anything like this before).
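The guard-variable idea above can be sketched like this (names are
illustrative; in the real kernel the flag would be set atomically and
the checks done under splimp()):

```c
#include <assert.h>

/* Once 'frozen' is set, every free-list manipulation becomes a
 * no-op, so the state written to the core file matches the state
 * at the moment of the panic. */
static int frozen;
static int freelist_len = 4;

static int
freelist_take(void)
{
    if (frozen)
        return (-1);        /* refuse: list frozen for post-mortem */
    return (--freelist_len);
}

static void
freeze_lists(void)
{
    frozen = 1;             /* set just before calling panic() */
}
```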

> Thanks,
> -Archie
> 
> ______
> Archie Cobbs * Packet Design * http://www.packetdesign.com

Regards,
-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: m_split() considered harmful

2002-05-31 Thread Bosko Milekic


On Fri, May 31, 2002 at 11:29:20AM -0700, Archie Cobbs wrote:
> Hi,
> 
> There is an inconsistency in the kernel mbuf code around the
> "m->m_ext.ext_size" field of an M_EXT mbuf.
> 
> At first glance you might assume that this field is the total amount
> of contiguous space available in the external buffer pointed to by
> "m->m_ext.ext_buf". For example, look at the M_TRAILINGSPACE() and
> MCLGET() macros. But alas, you know where assumptions lead... :-)
> 
> Now look at m_split(), in particular this code:
> 
> if (m->m_flags & M_EXT) {
>   n->m_flags |= M_EXT;
>   n->m_ext = m->m_ext;
>   if(!m->m_ext.ext_ref)
>   mclrefcnt[mtocl(m->m_ext.ext_buf)]++;
>   else
>   (*(m->m_ext.ext_ref))(m->m_ext.ext_buf,  
>   m->m_ext.ext_size);
>   m->m_ext.ext_size = 0; /* For Accounting XX danger */
>   n->m_data = m->m_data + len;
> } else {
> 
> Notice the 'XX' which is where the problem lies. The code is
> just recklessly setting m->m_ext.ext_size to zero to avoid some
> "accounting" problem. This has been there since revision 1.1/rgrimes.
> 
> This is comletely broken and violates the semantics of the
> "m->m_ext.ext_size" field implied by M_TRAILINGSPACE() et. al.

  Good catch.  Thanks Archie!

> Moreover, I've also done a search of every occurrence of "ext_size"
> and everywhere else in the kernel treats this feild with the same
> "external buffer size" semantics, and there is no "accounting" that
> is done with this field.
> 
> For an example of where this could cause problems, notice that
> a call to sbfree() on an mbuf that had gone through an m_split()
> could cause a "receive 1" panic (I've seen these occasionally over
> the years).
> 
> FYI, m_split() is used in these files:
> 
> netgraph/ng_ppp.c
> netinet6/ah_input.c
> netinet6/esp_input.c
> netinet6/frag6.c
> netsmb/smb_rq.c
> 
> If there are no objections, I will apply the patch below.
>
> Thanks,
> -Archie
> 
> __
> Archie Cobbs * Packet Design * http://www.packetdesign.com
> 
> --- kern/uipc_mbuf.c.orig Fri May 31 11:17:52 2002
> +++ kern/uipc_mbuf.c  Fri May 31 11:27:42 2002
> @@ -1194,6 +1194,10 @@
>   * Partition an mbuf chain in two pieces, returning the tail --
>   * all but the first len0 bytes.  In case of failure, it returns NULL and
>   * attempts to restore the chain to its original state.
> + *
> + * Note that the returned mbuf must be treated as read-only, because
> + * it will end up sharing an mbuf cluster with the original mbuf if the
> + * "breaking point" happens to lie within a cluster mbuf.
>   */

  Can you please apply this to -CURRENT first?  -CURRENT has the same
  problem.  Notice that in -CURRENT we have a M_WRITABLE() macro which
  will actually evaluate to false for both mbufs as they will be
  referring to a cluster that has a ref count of > 1 so this comment
  will be implicitly represented in the code.  Now all we have to do is
  actually start using the M_WRITABLE macro more often. :-)
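A compressed sketch of the writability test in question follows;
mini_mbuf and mini_writable are made-up userland stand-ins for the
kernel's struct mbuf and the M_WRITABLE macro:

```c
#include <assert.h>

#define MF_EXT    0x01      /* external storage attached */
#define MF_RDONLY 0x02      /* data explicitly marked read-only */

struct mini_mbuf {
    int flags;
    int ext_refcnt;         /* refs on the external buffer, if any */
};

/* Writable unless marked read-only, or sharing an external buffer
 * with another mbuf (ref count > 1). */
static int
mini_writable(const struct mini_mbuf *m)
{
    if (m->flags & MF_RDONLY)
        return (0);
    if ((m->flags & MF_EXT) && m->ext_refcnt > 1)
        return (0);
    return (1);
}
```

After m_split(), both halves reference the same cluster with a ref
count of 2, so a test like this correctly reports them read-only.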

>  struct mbuf *
>  m_split(m0, len0, wait)
> @@ -1247,7 +1251,6 @@
>   else
>   (*(m->m_ext.ext_ref))(m->m_ext.ext_buf,
>   m->m_ext.ext_size);
> - m->m_ext.ext_size = 0; /* For Accounting XX danger */
>   n->m_data = m->m_data + len;
>   } else {
>   bcopy(mtod(m, caddr_t) + len, mtod(n, caddr_t), remain);

  I don't remember why ext_size was set this way here originally (as
  you mention, it was imported that way), but it certainly seems bogus,
  and catching it now will hopefully solve some really weird and
  difficult-to-debug mbuf problems we've had over the years.

Woop!
-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: m_split() considered harmful

2002-05-31 Thread Bosko Milekic


On Fri, May 31, 2002 at 12:54:39PM -0700, Julian Elischer wrote:
> 
> 
> On Fri, 31 May 2002, Bosko Milekic wrote:
> 
> > 
> >   I don't remember why the ext_size here was this originally (as you
> >   mention, it was imported that way), but it certainly seems bogus and
> >   you catching it now is hopefully going to solve some really wierd and
> >   difficult-to-debug problems we've had involving mbufs over the years.
> > 
> It was imported from FreeBSD-1.x by rod.
> 
> The size argument was for two reasons.
> 1/ to allow the M_LEADINGSPACE and M_TRAILINGSPACE calculations
> to be done if it was a single reference object.
> 2/ to allow the free function (in the case of non cluster external
> objects) to know what sized object they had in the case that they needed
> this information.

  Yeah, we still need the field to exist because of (2).  I just meant
  that we should remove the explicit setting of 0 of the field in
  m_split(), and that's what Archie suggested.

> I know because I added it, because I needed to do both of these at TRW in
> '90-'95 under MACH/BSD and when I moved the code to FreeBSD1.x it cam
> along.. there was no M_LEADINGSPACE/M_TRAILINGSPACE macro at that time..
> I did it in my code explicitly.
> 
> It was not used in standard code because only in my own protocol stack
> did I know that the damned object was not shared and writable..
> This has improved with time..
> 
> Having the size set to 0 however stopped users from hitting
> cases where the WRITABILITY of the ext objext has not been correctly 
> tracked as it always returns "not enough space" (sometimes -ve).

  Right now, in -CURRENT, we have an M_WRITABLE macro, as you noticed,
  that evaluates to true unless one of these conditions holds:

  1) M_RDONLY is set in the mbuf
  2) mbuf has external buffer attached and ref. count on that buffer
  (e.g., cluster, whatever) is > 1.

  This macro (M_WRITABLE) _is_ currently used in M_LEADINGSPACE.
  If the mbuf being checked is not writable, M_LEADINGSPACE
  will evaluate to 0. However, I noticed just now that it is
  not used in M_TRAILINGSPACE, where it probably should be.

M_TRAILINGSPACE should probably look like this:

#define M_TRAILINGSPACE(m) \
((m)->m_flags & M_EXT ? \
(M_WRITABLE(m) ? \
(m)->m_ext.ext_buf + (m)->m_ext.ext_size - \
((m)->m_data + (m)->m_len) : 0) : \
&(m)->m_dat[MLEN] - ((m)->m_data + (m)->m_len))

This is the same philosophy adopted by M_LEADINGSPACE.
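The arithmetic in the proposed macro reduces to the following userland
sketch (field names are made up for illustration): the data region
starts data_off bytes into an ext_size-byte external buffer and holds
len valid bytes.

```c
#include <assert.h>

struct ext_view {
    int ext_size;  /* size of the external buffer */
    int data_off;  /* offset of m_data within that buffer */
    int len;       /* bytes of valid data at m_data */
    int writable;  /* outcome of the M_WRITABLE-style test */
};

/* Free bytes between the end of valid data and the end of the
 * buffer, forced to 0 when the mbuf is not writable, matching the
 * philosophy of the macro above. */
static int
trailing_space(const struct ext_view *m)
{
    if (!m->writable)
        return (0);
    return (m->ext_size - (m->data_off + m->len));
}
```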

I have no idea how I missed that when we did the M_WRITABLE stuff.  Or
did we leave it out for some specific reason?  Ian Dowse or David Malone
would know, as the WRITABLE code was thanks to them. :-)

> If we change this we need to audit the WRITABILTY..
> e.g. it is not checked in M_TRAILINGSPACE

 Right.

> Julian

-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: m_split() considered harmful

2002-05-31 Thread Bosko Milekic


On Fri, May 31, 2002 at 01:25:09PM -0700, Archie Cobbs wrote:
[...]
> As a temporary saftey measure, I'll add M_WRITABLE(m) into the
> M_TRAILINGSPACE() macro. However, I see this as a temporary hack;
> the correct fix is to put the burden of writability on the caller.
> After all, M_TRAILINGSPACE() doesn't modify the mbuf data!
> 
> That is, what we really need is a more general audit for code that
> writes into mbufs that might be read-only -- and, as one special case
> of tha, code that calls M_TRAILINGSPACE()/M_LEADINGSPACE() before writing
> into an mbuf.

  Agreed.

[...]
> Updated patch below.
> 
> -Archie
[...]

  Patch looks good.

-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: Race condition with M_EXT ref count?

2002-06-03 Thread Bosko Milekic


  First of all, this issue is non-existent in -CURRENT, where all
  reference counting is done via a consistent interface that provides
  bus-locked reference count manipulations.

  Secondly, this may not be as much of a problem as you think.
  Consider the m_split() case: if you're calling m_split() on an mbuf
  chain referring to cluster(s) where none of the clusters has a ref.
  count greater than 1 -- which is usually the case anyway -- then
  this is fine.  Since you have possession of the chain referring to
  those clusters, and presumably the chain isn't sitting in some queue
  somewhere (if it is, you'd have to be under the protection of that
  queue anyway - splimp() or whatever), you're the only one holding
  that cluster and thus the only candidate for incrementing the ref.
  count, regardless of whether an interrupt comes in.  Perhaps the
  real solution would be to do something like:

  if (ref count > 1) {
do atomic inc.
  } else {
do regular inc.
  }

  Or, alternately, do what -CURRENT does (and it's probably simpler
  anyway) and just make it all atomic.  I'm sorry to say that I don't
  develop on -STABLE much anymore (namely, because I don't run it at
  home) so I cannot make this change any time soon.

On Mon, Jun 03, 2002 at 03:08:01PM -0700, Julian Elischer wrote:
> well, it's a valid problem from the point of view of some interrupt code
> changing the ext reference at the same time that some other code is
> planning on incrementing it, but while Giant is in place it's not
> likely to affect anything..
> 
> 
> I presume you are talking about 4.x however right?
> we can presume that each manipulator has a reference so it's not going to
> GO AWAY due to a zero reference being reached, so the link can be followed
> safely, which just leaves the atomicity of the addition operation.
> 
> who are the contenders?
> 1/ network mid-level code? (protected by: splnet and the BGL.
> Only one cpu at a time..
> 
> 2/ Interrupt code running at splimp
> probably freeing stuff after transmit. (receivers should have
> pre-allocated buffers)
> 
> it  IS possible that there could need to be an splimp() added but I am
> not clear on what part the 4.4 BGL (Big Giant Lock) plays in this..
> it may make it safe...
> 
> On Mon, 3 Jun 2002, Archie Cobbs wrote:
> 
> > Julian Elischer writes:
> > > Not denying that, just that JHB has a preliminary implementation
> > > he's been showing around...
> > > 
> > > > > this is YET ANOTHER case for the Atomic reference counting ABI that
> > > > > jhb has been talking about...
> > > > 
> > > > I was the initial person to request an atomic_t API.
> > 
> > You guys please stop changing the subject :-)
> > 
> > Can somebody confirm that they think this bug is real/valid?
> > 
> > Thanks,
> > -Archie
> > 
> > __
> > Archie Cobbs * Packet Design * http://www.packetdesign.com
> > 

-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: Race condition with M_EXT ref count?

2002-06-03 Thread Bosko Milekic


On Mon, Jun 03, 2002 at 04:01:29PM -0700, Archie Cobbs wrote:
> Bosko Milekic writes:
> >   Secondly, this may not be as much of a problem as you may think.
> >   Particularly if you consider the m_split() case, for example.  For
> >   example, if you're calling m_split() on an mbuf chain that may refer
> >   to cluster(s) where none of the clusters has a ref. count greater than
> >   1 -- which is usually the case anyway -- then this is fine;  since you
> >   have posession of the chain referring to those clusters, and
> >   presumably the chain isn't sitting in some queue somewhere (if it is,
> >   you'd have to be under the protection of that queue anyway - splimp()
> >   or whatever), then you're the only one who has posession of that
> 
> The *chain* won't be sitting in a queue, but there may be a different
> mbuf in a queue somewhere that points to the same cluster.

  I did say "if the refcount is exactly 1" (which is often the case
  there).

> Since mid-level code only increments the ref count, I think the
> worst that can happen is the ref count is incorrectly too high,
> which could only cause a memory leak rather than a crash.

  No, the worst case is that it is too low.

  increment:

  read
  inc
  write

  two increments, unprotected (threads A and B racing):

  count initially: 1

  A: read 1
  A: inc 1->2       B: read 1
  A: write 2        B: inc 1->2
                    B: write 2

  count finally: 2 (it should be 3)

  Two frees will now release the cluster, instead of the three frees it
  should take; the third free will operate on an already-freed cluster.

> There's no way for the ref count to be too low, or for the cluster
> to be free'd twice, because all decrements are protected by splimp().
> 
> However, once you start adding in custom code then we may not be
> so lucky..
> 
> -Archie
> 
> ______
> Archie Cobbs * Packet Design * http://www.packetdesign.com

-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: Race condition with M_EXT ref count?

2002-06-03 Thread Bosko Milekic


On Mon, Jun 03, 2002 at 03:54:02PM -0700, Archie Cobbs wrote:
[...]
> If you don't mind, I'll let you deal with the -current case, since
> I'm not familiar enough with -current locking. Below is my proposed
> patch for -stable.

  -current does not have this "problem."

> It should be clear that in -stable at least, there can be a race
> between an interrupt handler that free's a cluster mbuf and mid-level
> code that is free'ing the same mbuf, for example.
> 
> -Archie
> 
> __
> Archie Cobbs * Packet Design * http://www.packetdesign.com
> 
> --- sys/kern/uipc_mbuf.c.orig Mon Jun  3 15:43:25 2002
> +++ sys/kern/uipc_mbuf.c  Mon Jun  3 15:52:28 2002
> @@ -707,11 +707,16 @@
>   n->m_len = min(len, m->m_len - off);
>   if (m->m_flags & M_EXT) {
>   n->m_data = m->m_data + off;
> - if(!m->m_ext.ext_ref)
> - mclrefcnt[mtocl(m->m_ext.ext_buf)]++;
> - else
> - (*(m->m_ext.ext_ref))(m->m_ext.ext_buf,
> - m->m_ext.ext_size);
> + if (m->m_ext.ext_ref == NULL) {
> + atomic_add_char(
> + &mclrefcnt[mtocl(m->m_ext.ext_buf)], 1);
> + } else {
> + int s = splimp();
> +
> + (*m->m_ext.ext_ref)(m->m_ext.ext_buf,
> + m->m_ext.ext_size);
> + splx(s);
> + }
>   n->m_ext = m->m_ext;
>   n->m_flags |= M_EXT;
>   } else
> @@ -754,11 +759,15 @@
>   n->m_len = m->m_len;
>   if (m->m_flags & M_EXT) {
>   n->m_data = m->m_data;
> - if(!m->m_ext.ext_ref)
> - mclrefcnt[mtocl(m->m_ext.ext_buf)]++;
> - else
> - (*(m->m_ext.ext_ref))(m->m_ext.ext_buf,
> - m->m_ext.ext_size);
> + if (m->m_ext.ext_ref == NULL)
> + atomic_add_char(&mclrefcnt[mtocl(m->m_ext.ext_buf)], 1);
> + else {
> + int s = splimp();
> +
> + (*m->m_ext.ext_ref)(m->m_ext.ext_buf,
> + m->m_ext.ext_size);
> + splx(s);
> + }
>   n->m_ext = m->m_ext;
>   n->m_flags |= M_EXT;
>   } else {
> @@ -777,11 +786,16 @@
>   n->m_len = m->m_len;
>   if (m->m_flags & M_EXT) {
>   n->m_data = m->m_data;
> - if(!m->m_ext.ext_ref)
> - mclrefcnt[mtocl(m->m_ext.ext_buf)]++;
> - else
> - (*(m->m_ext.ext_ref))(m->m_ext.ext_buf,
> - m->m_ext.ext_size);
> + if (m->m_ext.ext_ref == NULL) {
> + atomic_add_char(
> + &mclrefcnt[mtocl(m->m_ext.ext_buf)], 1);
> + } else {
> + int s = splimp();
> +
> + (*m->m_ext.ext_ref)(m->m_ext.ext_buf,
> + m->m_ext.ext_size);
> + splx(s);
> + }
>   n->m_ext = m->m_ext;
>   n->m_flags |= M_EXT;
>   } else {
> @@ -1125,11 +1139,15 @@
>   if (m->m_flags & M_EXT) {
>   n->m_flags |= M_EXT;
>   n->m_ext = m->m_ext;
> - if(!m->m_ext.ext_ref)
> - mclrefcnt[mtocl(m->m_ext.ext_buf)]++;
> - else
> - (*(m->m_ext.ext_ref))(m->m_ext.ext_buf,
> - m->m_ext.ext_size);
> + if (m->m_ext.ext_ref == NULL)
> + atomic_add_char(&mclrefcnt[mtocl(m->m_ext.ext_buf)], 1);
> + else {
> + int s = splimp();
> +
> + (*m->m_ext.ext_ref)(m->m_ext.ext_buf,
> + m->m_ext.ext_size);
> + splx(s);
> + }
>   m->m_ext.ext_size = 0; /* For Accounting XX danger */
>   n->m_data = m->m_data + len;
>   } else {

 It would be better if the thing was in a macro to allow for consistent
 ref. count manipulation, like in -CURRENT.  But whatever, I don't
 really care.

-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: m->m_pkthdr.header

2002-06-07 Thread Bosko Milekic


On Fri, Jun 07, 2002 at 12:55:53PM -0700, Archie Cobbs wrote:
> Hi,
> 
> I'd like to get rid of this mbuf field "m->m_pkthdr.header".
> Here are my reasons:
> 
>  - It's broken. Consider this code:
> 
>  copy = m_copypacket(m);
>  m_freem(m);

  Indeed, this is a bug in m_copypacket().  It should copy the data
  over -- but, admittedly, this is a very dangerous practice!

  Ideally, the copying over ought to be done by the code calling
  m_copypacket(), right after the m_copypacket() call but before the
  m_freem() call.  The reason to do it from the caller code and not from
  m_copypacket() is that there is no reasonable way to determine, from
  within m_copypacket(), whether one of our header/mbuf pointers points
  to data inside the mbuf or whether it is safe to do the shallow
  pointer copy.

>Typically m->m_pkthdr.header points into the data area of 'm'.
>So now copy->m_pkthdr.header points at the same data, which has
>already been freed and possibly reused.
>
>  - It is an obscure thing most people don't know about and is
>ripe for causing hard to find bugs.
> 
>  - It's only used in two places, (a) for secret communication between
>the oltr driver and if_ether.c, and (b) as a temporary variable
>in ip_input.c:
> 
> $ find . -type f -print | xargs grep -w m_pkthdr.header
> ./contrib/dev/oltr/if_oltr.c:  m0->m_pkthdr.header = (void *)th;
> ./netinet/if_ether.c:th = (struct iso88025_header *)m->m_pkthdr.header;
> ./netinet/ip_input.c:m->m_pkthdr.header = ip;
> ./netinet/ip_input.c:#define GETIP(m)   ((struct ip*)((m)->m_pkthdr.header))
> 
> My proposal:
> 
>- Replace the obfuscating GETIP() macro in ip_input.c with a variable.
>- Rejigger the oltr driver to pass its "secret" information using
>  an auxiliary mbuf instead of m->m_pkthdr.header.
> 
> Any comments/objections?

  None from here.  :-)

  Just the standard pointer, which you probably already know: -CURRENT
  first, please.

> Thanks,
> -Archie
> 
> __
> Archie Cobbs * Packet Design * http://www.packetdesign.com

Regards,
-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: m->m_pkthdr.header

2002-06-08 Thread Bosko Milekic


On Sat, Jun 08, 2002 at 09:16:06AM -0700, John Polstra wrote:
> In article <[EMAIL PROTECTED]>,
> Archie Cobbs  <[EMAIL PROTECTED]> wrote:
> > 
> > I'd like to get rid of this mbuf field "m->m_pkthdr.header".
> [...]
> > Any comments/objections?
> 
> Yes, please bear in mind that if you change the layout of the mbuf
> structures you will probably break binary compatibility with various
> commercial network drivers that are shipped in binary form.  That's a
> big pain for those of us who are forced to use such drivers.  For that
> reason I don't think you should make this kind of change in -stable.
> Current is obviously a different kettle of fish, but even there you
> should add fillers to replace any structure elements you remove, if
> possible.

  While I agree that -STABLE is a more sensitive fish (which is partly
  why I really don't enjoy dealing with it -- this argument always comes
  up), I think that these types of changes absolutely need to go into
  -CURRENT.  We shouldn't have to worry about "breaking binary
  compatibility" in our development releases.  I can almost guarantee
  that a lot of stuff isn't going to work without a rebuild when people
  decide to go from 4.x to 5.0 anyway.  Placing dummy-fillers is just
  annoying.

> John
> -- 
>   John Polstra
>   John D. Polstra & Co., Inc.        Seattle, Washington USA
>   "Disappointment is a good sign of basic intelligence."  -- Chögyam Trungpa

-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: new zero copy sockets snapshot

2002-06-18 Thread Bosko Milekic


On Tue, Jun 18, 2002 at 10:36:36PM -0600, Kenneth D. Merry wrote:
> 
> I've released a new zero copy sockets snapshot, against -current from June
> 18th, 2002.
> 
> http://people.FreeBSD.org/~ken/zero_copy
> 
> The fixes that went into this snapshot:
> 
>  - Take mutex locking out of ti_attach(), it isn't really needed.
>As long as we can assume that probes of successive ti(4) instances
>happen sequentially, we'll be safe in doing this.  Thanks to John
>Baldwin for pointing out the solution to that problem.  (The lock in
>ti_attach() was causing all sorts of WITNESS warnings when
>bus_setup_intr() was called.)
> 
>  - Added a new routine, vm_object_allocate_wait().  This is a variant of
>vm_object_allocate() that allows the user to specify whether the
>uma_zalloc() call inside vm_object_allocate_wait() is called with
>M_WAITOK or M_NOWAIT.  This eliminates a WITNESS warning caused when
>jumbo_vm_init() calls vm_object_allocate() with the jumbo lock held, and
>potentially gives other callers the option of eliminating the mandatory
>wait on the uma_zalloc() call.

  I think this problem was fixed in recent -CURRENT by JeffR.  Notably,
  the VM system should not allow itself to recurse on itself when called
  with M_NOWAIT.

>(vm_object_allocate() now just calls vm_object_allocate_wait() with the
>proper argument.)
> 
> With those fixes, plus several fixes that have gone into -current over the
> past week or so, the zero copy sockets code runs without any WITNESS
> warnings at all now.
> 
> Ken
> -- 
> Kenneth Merry
> [EMAIL PROTECTED]
> 
> 

-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: new zero copy sockets snapshot

2002-06-20 Thread Bosko Milekic


On Thu, Jun 20, 2002 at 12:25:58PM -0400, Andrew Gallatin wrote:
[...]
>  > > Do you think it would be feasable to glue in a new jumbo (10K?)
>  > > allocator on top of the existing mbuf and mcl allocators using the
>  > > existing mechanisms and the existing MCLBYTES > PAGE_SIZE support
>  > > (but broken out into separte functions and macros)?
>  > 
>  >   Assuming that you can still play those VM tricks with the pages spit
>  > out by mb_alloc (kern/subr_mbuf.c in -CURRENT), then this wouldn't be a
>  > problem at all.  It's easy to add a new fixed-size type allocation to
>  > mb_alloc.  In fact, it would be beneficial.  mb_alloc uses per-CPU
>  > caches and also makes mbuf and cluster allocations share the same
>  > per-CPU lock.  What could be done is that the jumbo buffer allocations
>  > could share the same lock as well (since they will likely usually be
>  > allocated right after an mbuf is).  This would give us jumbo-cluster
>  > support, but it would only be useful for devices clued enough to break
>  > up the cluster into PAGE_SIZE chunks and do scatter/gather.  For most
>  > worthy gigE devices, I don't think this should be a problem.
> 
> I'm a bit worried about other devices.. Traditionally, mbufs have
> never crossed page boundaries so most drivers never bother to check
> for a transmit mbuf crossing a page boundary.  Using physically
> discontiguous mbufs could lead to a lot of subtle data corruption.

  I assume here that when you say "mbuf" you mean "jumbo buffer attached
to an mbuf."  In that case, yeah, all that we need to make sure of is
that the driver knows that it's dealing with non-physically-contiguous
pages.  As for regular 2K mbuf clusters and the 256-byte mbufs
themselves, they never cross page boundaries, so this should not be a
problem for those drivers that do not use jumbo clusters.

> One question.  I've observed some really anomalous behaviour under
> -stable with my Myricom GM driver (2Gb/s + 2Gb/s link speed, Dual 1GHz
> pIII).  When I use 4K mbufs for receives, the best speed I see is
> about 1300Mb/sec.  However, if I use private 9K physically contiguous
> buffers I see 1850Mb/sec (iperf TCP).
> 
> The obvious conclusion is that there's a lot of overhead in setting up
> the DMA engines, but that's not the case; we have a fairly quick chain
> dma engine.  I've provided a "control" by breaking my contiguous
> buffers down into 4K chunks so that I do the same number of DMAs in
> both cases and I still see ~1850 Mb/sec for the 9K buffers.  
> 
> A coworker suggested that the problem was that when doing copyouts to
> userspace, the PIII was doing speculative reads and loading the cache
> with the next page.  However, we then start copying from a totally
> different address using discontigous buffers, so we effectively take
> 2x the number of cache misses we'd need to.  Does that sound
> reasonable to you?  I need to try malloc'ing virtually contiguous and
> physically discontiguous buffers & see if I get the same (good)
> performance...

  I believe that the Intel chips do "virtual page caching" and that the
logic that does the virtual -> physical address translation sits between
the L2 cache and RAM.  If that is indeed the case, then your idea of
testing with virtually contiguous pages is a good one.
  Unfortunately, I don't know if the PIII is doing speculative
cache-loads, but it could very well be the case.  If it is and if in
fact the chip does caching based on virtual addresses, then providing it
with virtually contiguous address space may yield better results.  If
you try this, please let me know.  I'm extremely interested in seeing
the results!

> Cheers,
> 
> Drew

Regards,
-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: Should we keep a cache of mbuf+cluster ready for use ?

2002-06-29 Thread Bosko Milekic


On Sat, Jun 29, 2002 at 03:46:31PM -0700, Jeffrey Hsu wrote:
> So, what you want is something like a
>   MGETHCL(m, how, type)
>   MHCLFREE(m)
> interface which first looks in a combined freelist before the individual
> mbuf and cluster freelists.  I think it's a good idea.

  I would prefer to see an interface that just grabs both a cluster and
  an mbuf from their respective per-CPU caches (in -CURRENT) while only
  grabbing the lock once, if this is really that important to you. [*]

  In -CURRENT right now, clusters and mbufs are allocated from
  different per-CPU cache lists but they _share_ per-CPU
  locks, therefore this is entirely do-able.  Having yet another cache
  that holds linked cluster + mbuf pairs would really complicate allocations.

  [*] Admittedly, Alfred was pushing for exactly this not long ago, but
  I just didn't feel like doing it because I felt that there are better
  things to worry about than a little unlock + relock, when we're just
  dropping and re-acquiring the exact same lock, I don't see it as too
  big of a deal.  However, I'm really not sure as to how smart
  architectures like Intel are with bus-locked instructions and data
  cache use.  If they're stupid about it (i.e., if they don't look at
  caches during a bus-locked cycle), then it could be worth it to avoid
  that extra unlock/relock between allocations.  If they're fairly smart
  about it, it's really not worth it.
  
  Also, I should mention that in his original message, Luigi mentions
  that MGETHDR and MCLGET are rather long macros, but in -CURRENT they
  are functions so there is not really much added code cache pollution.

Regards, 
--
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: Should we keep a cache of mbuf+cluster ready for use ?

2002-06-29 Thread Bosko Milekic


On Sat, Jun 29, 2002 at 04:11:20PM -0700, Alfred Perlstein wrote:
> If the pool is per-device-softc then it doesn't need locks and will
> be a lot more efficient than even grabbing from the per cpu pool.

  We have an allocator that does per-CPU allocations, we don't need to
  add additional layers of caches on top of that.  The only thing you
  would save would be avoiding having to lock and unlock a cache.
  Since the lock is a per-CPU lock (because the caches are per-CPU), the
  contention on it is very low unless, of course, you get pre-empted
  somewhere during the allocation and then migrate CPUs; but this is
  not that likely to happen, really (not right now, anyway).
  
> Also, (/me does a little dance) TOLD YOU SO ABOUT THE ALLOCATION CODE!

  Huh?  If you're referring to the grouped mbuf + cluster allocation
  interface, I haven't yet admitted that it is worth it.  I specifically
  said that it depends on how the architecture deals with bus-locked
  instructions and cache use.  In other words, it depends on how
  expensive a unlock+relock operation is when we're dealing with the
  same lock.  If we're allowed to go to the cache during the bus-locked
  unlock+relock ops, then we should be getting cache hits so strictly
  speaking, them being there is not such a big deal.  But again, I have
  no idea what happens with cache interactions during a bus-locked
  instruction.

> hah! :)
> 
> -- 
> -Alfred Perlstein [[EMAIL PROTECTED]]
> 'Instead of asking why a piece of software is using "1970s technology,"
>  start asking why software is ignoring 30 years of accumulated wisdom.'
> Tax deductible donations for FreeBSD: http://www.freebsdfoundation.org/
> 
> 

-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: Mbuf allocator performance (was Should we keep a cache of mbuf+cluster ready for use ?)

2002-07-02 Thread Bosko Milekic


 There are several problems with your "simple" code that make it simpler
 than what's in -CURRENT and in -STABLE, and that is why it yields
 seemingly better times.

 1) sleeping in the case of a system exhausted of mbufs and clusters is
 now broken because code that frees to this "combined" list will not
 generate wake ups;  even if you build this functionality in, you would
 have to code the wakeup for a "special" case where one obtains both a
 cluster and an mbuf; it'll get complicated.

 2) no per-CPU caches, you bring contention (in -CURRENT) from per-CPU
 back to global, which you probably don't care about now, but which will
 matter when Giant goes completely.

 3) you violate write/read producer/consumer model.  In -CURRENT, we try
 to not have the code doing the freeing write to any part of the mbuf,
 thus hopefully preventing cache invalidation from occurring during a free
 that happens from a thread that doesn't usually write to the mbuf at
 all.

 I think that because of (1) and (2), especially, and because the kernel
 in -CURRENT is the way it is right now, you're getting better
 performance numbers.  Try exhausting the system and seeing how far you
 get.  I suspect you'll eventually just hang (because you'll allocate a
 bunch of stuff violating the real interface and then won't be able to
 wake up and Do The Right Thing).

 While, as I said before, I agree that we should have an interface that
 allocates both an mbuf and a cluster, I don't think that this is the
 right approach.  I believe that the correct approach would be to couple
 the mbuf and cluster allocation under a single per-CPU lock/unlock
 operation, which isn't too difficult to do.
 
On Tue, Jul 02, 2002 at 12:10:50AM -0700, Luigi Rizzo wrote:
> [Bcc to -current as relevant there]
> 
> As a followup to my question below, i did some simple experiment
> on a -current box acting as a router, using two "em" cards and
> DEVICE_POLLING (just for stability).
> 
> The whole of the code is basically below -- the receiving
> side tries to grab the mbuf+cluster from the free pool first,
> and falls back to the standard method on failure; the sending
> side tries to attach the freed buffer to the free pool
> if it is the case.
> 
> There is a simplification here with respect to what could
> go into an m_freem()/mgethdr_cluster() pair because the driver
> is already holding a lock, on Giant in this case, so you do not need
> further locking. Still, to come to the data points:
> 
>                           CURRENT     STABLE (*)
> 
> fastfwd=1, pool_max=0     276 kpps    365 kpps
> fastfwd=1, pool_max=50    355 kpps    383 kpps
> 
> fastfwd=0, pool_max=0     195 kpps    142 kpps
> fastfwd=0, pool_max=50    227 kpps    146 kpps
> 
> (*) This version of STABLE that I am using for comparison has some
> proprietary optimizations which make it a bit faster than normal.
> However it still uses the old ipfw code, which is invoked when
> fastfwd=0, and is significantly slower than the new one.
> 
> Now this really seems to call for adding this interface into the
> mbuf subsystem. I believe we have to find a name for the allocator
> (the deallocator might well go into m_freem(), depending on how we
> implement the locking) and whether it makes sense to lock
> mgethdr_cluster() under per-cpu locks or under Giant, or even let
> the caller make sure that it holds the proper lock before trying
> to invoke the procedure (as i expect the "producers" or "consumers"
> of these pairs to be located in the network stack, chances are that
> they already hold a lock on Giant).
> 
> 
>   cheers
>   luigi
> 
> 
> The code:
> 
>   struct mbuf *em_pool;
> 
>   static int  em_pool_max = 50;
>   SYSCTL_INT(_hw, OID_AUTO, em_pool_max, CTLFLAG_RW,
>   &em_pool_max,0,"max size of mbuf pool");
>   static int em_pool_now;
>   SYSCTL_INT(_hw, OID_AUTO, em_pool_now, CTLFLAG_RD,
>   &em_pool_now,0,"Current size of mbuf pool");
> 
>   ... in em_get_buf() ..
> 
>   if (em_pool) {
> mp = em_pool;
> em_pool = mp->m_nextpkt;
> em_pool_now--;
> goto have_it;
> }
> 
>   ... in em_clean_transmit_interrupts() ...
>   if ((m = tx_buffer->m_head)) {
>   if (em_pool_now < em_pool_max &&
>   m->m_next == NULL &&
>   m->m_flags & M_EXT &&
>   M_WRITABLE(m) ) {

Re: Mbuf allocator performance (was Should we keep a cache of mbuf+cluster ready for use ?)

2002-07-02 Thread Bosko Milekic


On Tue, Jul 02, 2002 at 08:04:23AM -0400, Bosko Milekic wrote:
> 
>  There are several problems with your "simple" code that make it simpler
>  than what's in -CURRENT and in -STABLE and therefore yield [obviously]
>  seemingly better times.
> 
>  1) sleeping in the case of a system exhausted of mbufs and clusters is
>  now broken because code that frees to this "combined" list will not
>  generate wake ups;  even if you build this functionality in, you would
>  have to code the wakeup for a "special" case where one obtains both a
>  cluster and an mbuf; it'll get complicated.
> 
>  2) no per-CPU caches, you bring contention (in -CURRENT) from per-CPU
>  back to global, which you probably don't care about now, but which will
>  matter when Giant goes completely.
> 
>  3) you violate write/read producer/consumer model.  In -CURRENT, we try
>  to not have the code doing the freeing write to any part of the mbuf,
>  thus hopefully preventing cache invalidation from occurring during a free
>  that happens from a thread that doesn't usually write to the mbuf at
>  all.

   Forgot, also, in -CURRENT:

   4) breaks the framework in mb_alloc that groups pages of mbufs and
   clusters together, allowing them to be freed in the future if/when we
   decide to implement lazy freeing;  the same thing would happen if we
   moved mbuf allocations to UMA, because grouping mbufs and clusters
   together on a separate list will violate that scheme.

>  I think that because of (1) and (2), especially, and because the kernel
>  in -CURRENT is the way it is right now, you're getting better
>  performance numbers.  Try exhausting the system and seeing how far you
>  get.  I suspect you'll eventually just hang (because you'll allocate a
>  bunch of stuff violating the real interface and then won't be able to
>  wakeup and Do The Right Thing).
> 
>  While, as I said before, I agree that we should have an interface that
>  allocates both an mbuf and a cluster, I don't think that this is the
>  right approach.  I believe that the correct approach would be to couple
>  the mbuf and cluster allocation under a single per-CPU lock/unlock
>  operation, which isn't too difficult to do.
>  
> On Tue, Jul 02, 2002 at 12:10:50AM -0700, Luigi Rizzo wrote:
> > [Bcc to -current as relevant there]
> > 
> > As a followup to my question below, i did some simple experiment
> > on a -current box acting as a router, using two "em" cards and
> > DEVICE_POLLING (just for stability).
> > 
> > The whole of the code is basically below -- the receiving
> > side tries to grab the mbuf+cluster from the free pool first,
> > and falls back to the standard method on failure; the sending
> > side tries to attach the freed buffer to the free pool
> > if it is the case.
> > 
> > There is a simplification here with respect to what could
> > go into an m_freem()/mgethdr_cluster() pair because the driver
> > is already holding a lock, on Giant in this case, so you do not need
> > further locking. Still, to come to the data points:
> > 
> >                         CURRENT     STABLE (*)
> > 
> > fastfwd=1, pool_max=0   276 kpps    365 kpps
> > fastfwd=1, pool_max=50  355 kpps    383 kpps
> > 
> > fastfwd=0, pool_max=0   195 kpps    142 kpps
> > fastfwd=0, pool_max=50  227 kpps    146 kpps
> > 
> > (*) This version of STABLE that I am using for comparison has some
> > proprietary optimizations which make it a bit faster than normal.
> > However it still uses the old ipfw code, which is invoked when
> > fastfwd=0, and is significantly slower than the new one.
> > 
> > Now this really seems to call for adding this interface into the
> > mbuf subsystem. I believe we have to find a name for the allocator
> > (the deallocator might well go into m_freem(), depending on how we
> > implement the locking) and whether it makes sense to lock
> > mgethdr_cluster() under per-cpu locks or under Giant, or even let
> > the caller make sure that it holds the proper lock before trying
> > to invoke the procedure (as i expect the "producers" or "consumers"
> > of these pairs to be located in the network stack, chances are that
> > they already hold a lock on Giant).
> > 
> > 
> > cheers
> > luigi
> > 
> > 
> > The code:
> > 
> > struct mbuf *em_pool;
> > 
> > static int  em_pool_max = 50;
> > SYSCTL_INT(_hw, OID_AUTO, em_pool_max, CTLFLAG_RW,
> > &em_pool_max,0,&

Re: Mbuf allocator performance (was Should we keep a cache of mbuf+cluster ready for use ?)

2002-07-02 Thread Bosko Milekic


On Tue, Jul 02, 2002 at 10:12:22AM -0700, Luigi Rizzo wrote:
> On Tue, Jul 02, 2002 at 01:01:35PM -0400, Bosko Milekic wrote:
> ...
> >   The reason I said "seemingly" is because of the fact that I don't
> >   think that it is your "grouping" of both clusters and mbufs that
> >   specifically causes the perf. increase to the extent you showed with
> >   your tests.  What you're doing in the code you posted is also avoiding
> 
> could be, i don't know... but this is exactly why we both agree
> that we better provide an interface so that we can work out in
> parallel improvements in the driver and improvements in the mbuf
> code.
> 
> >   1) Remove the malloc() for external ref. count allocations; instead,
> >   store the ref. count at the end of the cluster like I did when I took
> 
> this seems to violate one of your points about cache pollution!
> but i am totally neutral on this one.

  The cache line that contains the reference count will be invalidated
  in either case, so the invalidation cannot be avoided (the malloc()
  case is doubly worse, even);  since it's a per-cluster reference count
  after all, there's not much I can do about it.

> >   2) Introduce an interface that will allocate and free:
> >  (i) an mbuf with a cluster attached to it;
> >  (ii) an M_PKTHDR mbuf with a cluster attached to it;
> >   However, this interface would wrap the appropriate alloc/free
> >   routines, although it would take care to not drop the allocation
> >   lock between the allocations.  I don't suspect this to be too
> >   difficult to do.
> 
> fine with me.

  Ok, so we agree on the interface and you've agreed to compromise on
  the solution.  Now I just have to deliver. :-)  I'll send you a patch
  by the end of the week.  I've decided to haul in one of my -CURRENT
  test boxes to work and get some work done here, hopefully.
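
  Roughly, the combined interface behaves like the user-space sketch
  below (all names -- pcpu_cache, m_getcl_model -- are hypothetical
  stand-ins for illustration, not the actual mb_alloc code): take the
  per-CPU cache lock once, carve out both objects, attach, unlock.

```c
#include <assert.h>
#include <stdlib.h>

/* User-space model of the proposed combined interface: the mbuf and
 * the cluster both come from caches guarded by one per-CPU lock, so
 * the pair costs a single lock/unlock instead of two. */

struct cluster { char data[2048]; };

struct mbuf {
	struct cluster *m_ext;	/* attached external buffer */
};

struct pcpu_cache {
	int lock_held;		/* stand-in for a real mutex */
};

static void cache_lock(struct pcpu_cache *c)   { c->lock_held = 1; }
static void cache_unlock(struct pcpu_cache *c) { c->lock_held = 0; }

/* Allocate an mbuf with a cluster already attached, taking the
 * per-CPU cache lock exactly once. */
static struct mbuf *
m_getcl_model(struct pcpu_cache *c)
{
	struct mbuf *m;
	struct cluster *cl;

	cache_lock(c);
	m  = malloc(sizeof(*m));	/* stands in for a freelist pop */
	cl = (m != NULL) ? malloc(sizeof(*cl)) : NULL;
	if (m == NULL || cl == NULL) {
		free(m);
		cache_unlock(c);
		return (NULL);
	}
	m->m_ext = cl;
	cache_unlock(c);
	return (m);
}
```

  The point is only that both allocations happen inside one critical
  section; the real version would pop from per-CPU free lists instead
  of calling malloc().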

> >   It is not safe if you use it too often.  The allocator was designed to
> >   allocate, it HAS caches, it doesn't need for everyone under the Sun to
> >   start keeping their own caches on top of that.
> 
> which is what happens when they realise they can do things faster :)

  This is really bad philosophy.  It is my understanding that to Do The
  Right Thing, we ought to make certain compromises in design.  We need
  to allow for certain functionality to exist and we cannot do that with
  per-driver hacked-up "solutions."  Sure, I can hack up a small
  per-softc list from which I may even be able to do allocations
  WITHOUT a lock, but that'll introduce a whole new level of complexity
  (think about mbufs getting misplaced, etc.).  With huge SMP-related
  changes underway, we cannot afford any more of these spaghetti-like
  changes.

> >   Here's what happens:
> > 
> >   Consumers A and B each keep their own "pools" of mbufs+clusters.
> >   ...
> 
> look, this is exactly what happens with network interfaces. If
> they fail to allocate a new mbuf, they keep recycling
> the one they have from the receive queue instead of freeing it.

  Yes, but that's because they _need_ an mbuf, they can't get one, so
  they re-use one.  If you build a local pool in which you store UNUSED
  mbufs, with no real idea of when they'll be used - only with the
  assumption/hope that you'll use them "soon" - and if you do this in
  several different places in the kernel, you are bound to hit brokeness
  under heavy load, when you need to block from other locations to get
  your network buffers and cannot do that because someone else is
  "assuming" they'll need a bunch "sometime soon."

  This is why we have per-CPU caches and a global cache.  The per-CPU
  caches load themselves accordingly, and they'll give you what you're
  looking for when you need it, from a cache.  Sure, the caches are a
  tad bit more expensive to twiddle, but this is the compromise we make
  in order to ensure that the system knows about our caches and is able
  to cope even under heavy load.

>   cheers
>   luigi

-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-net" in the body of the message



Re: Mbuf allocator performance (was Should we keep a cache of mbuf+cluster ready for use ?)

2002-07-02 Thread Bosko Milekic
n pool.

  A allocates and allocates and allocates from local pool.

  C comes in and allocates "normally" through the allocator.

  C cannot find a cluster and is willing to block, so it does.

  A frees.  No wakeup.

  A frees.  No wakeup.

  A frees.  No wakeup.

  ...

  maybe D frees and generates wakeup, maybe not. 
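
  To make the no-wakeup hazard concrete, a toy model (all function and
  variable names are hypothetical, for illustration only):

```c
#include <assert.h>

/* The allocator can only wake a blocked consumer on frees that it
 * actually sees.  Frees into a private per-driver pool bypass that
 * path entirely. */

static int global_free  = 0;	/* buffers back on the allocator's list */
static int private_pool = 0;	/* buffers hoarded by consumer A */
static int wakeups      = 0;	/* wakeups delivered to blocked consumers */

static void free_to_allocator(void) { global_free++; wakeups++; }
static void free_to_private(void)   { private_pool++; /* no wakeup! */ }

/* A frees three buffers into its own pool; a consumer blocked inside
 * the allocator never hears about any of them. */
static int
starvation_demo(void)
{
	free_to_private();
	free_to_private();
	free_to_private();
	return (wakeups);	/* zero: three idle buffers, nobody woken */
}

/* Only when some other consumer D frees through the allocator does a
 * wakeup finally happen. */
static int
d_frees_once(void)
{
	free_to_allocator();
	return (wakeups);
}
```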
  
> #2  given that all i am asking for is defining an interface for this
> type of optimization, when the time comes that Giant goes, and
> assuming I am not retired yet, it will still change nothing for
> the non-SMP case, and for the SMP case we can run our performance
> tests and if it proves to be useful we can put in whatever
> is necessary.

  I already told you that I think that the interface is OK; I don't
  think that keeping yet another freelist for this sort of thing is OK.
  The mbufs and clusters will be on their respective free lists anyway,
  there really isn't a point of keeping yet another list around, if you
  read the compromise above and assume that the malloc() will disappear
  from the common cluster allocation case.

> #3  easy to fix (assuming this is really a problem) -- just have the
> free pool organized as a list of pointers to the mbufs, rather than
> using the m_nextpkt field as the link field. Because the free
> list is short and fixed size you can even use an array to
> hold it.

  This is what the existing cache lists do anyway.  Again, there is no
  point of doing this AGAIN.

> #4  same answer of #2: if/when... we can easily fix things in just
> one place. Until then, we are just pessimizing things for no
> good reason.
> 
> In terms of API, since we both agree that we need one (and we do
> need one, otherwise driver writers will just go ahead and do their
> own, replicating work in each driver, as I am sure I^hthey will
> start doing now), may I suggest that we export two-- one where the
> caller is assumed to already hold a lock on the CPU it is running
> on, and one which also does the necessary locking upfront and calls
> the latter afterwards ? This way the former will be usable right
> now and efficiently from all contexts which run under Giant, and
> easy to upgrade when/if we need it.
> 
> And, I am sorry to say... you raise some valid points, but without
> performance numbers to justify them, it is hard to tell how relevant
> they are in practice.
> 
> I have made too many mistakes of this kind myself to believe my
> intuition.
>
>   cheers
>   luigi
> 
> On Tue, Jul 02, 2002 at 09:03:54AM -0400, Bosko Milekic wrote:
> 
> > >  There are several problems with your "simple" code that make it simpler
> > >  than what's in -CURRENT and in -STABLE and therefore yield [obviously]
> > >  seemingly better times.
> > > 
> > >  1) sleeping in the case of a system exhausted of mbufs and clusters is
> > >  now broken because code that frees to this "combined" list will not
> > >  generate wake ups;  even if you build this functionality in, you would
> > >  have to code the wakeup for a "special" case where one obtains both a
> > >  cluster and an mbuf; it'll get complicated.
> > > 
> > >  2) no per-CPU caches, you bring contention (in -CURRENT) from per-CPU
> > >  back to global, which you probably don't care about now, but which will
> > >  matter when Giant goes completely.
> > > 
> > >  3) you violate write/read producer/consumer model.  In -CURRENT, we try
> > >  to not have the code doing the freeing write to any part of the mbuf,
> > >  thus hopefully avoiding cache invalidation from occurring during a free
> > >  that happens from a thread that doesn't usually write to the mbuf at
> > >  all.
> > 
> >Forgot, also, in -CURRENT:
> > 
> >4) breaks framework in mb_alloc that groups pages of mbufs and
> >clusters together allowing them to be freed in the future, if/when we
> >decide to implement lazy freeing;  it would do the same thing if we
> >moved mbuf allocations to UMA, because grouping together mbufs and
> >clusters on a separate list will violate that scheme.
> > 
> > >  I think that because of (1) and (2), especially, and because the kernel
> > >  in -CURRENT is the way it is right now, you're getting better
> > >  performance numbers.  Try exhausting the system and seeing how far you
> > >  get.  I suspect you'll eventually just hang (because you'll allocate a
> > >  bunch of stuff violating the real interface and then won't be able to
> > >  wakeup and Do The Right Thing).
> > > 
> > >  While, as I said before, I agree that we should have an interface that
> > >  allocates both an mbuf and a cluster, I don't think that this is the
> > >  right approach.  I believe that the correct approach would be to couple
> > >  the mbuf and cluster allocation under a single per-CPU lock/unlock
> > >  operation, which isn't too difficult to do.
> 
> 

-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: virtually contig jumbo mbufs (was Re: new zero copy sockets snapshot)

2002-07-05 Thread Bosko Milekic


[ -current trimmed ]

On Fri, Jul 05, 2002 at 08:08:47AM -0400, Andrew Gallatin wrote:
> Would this be easier or harder than simple, physically contiguous
> buffers?  I think that its only worth doing if its easier to manage at
> the system level, otherwise you might as well use physically
> contiguous mbufs.  My main goal is to see the per-driver cache's of
> physical memory disappear ;)

 It would be much easier.  The problem with getting physically
 contiguous memory is that shortly after the system gets going, memory
 becomes fragmented.  So, from a system's perspective, it's very hard to
 get physically contiguous pages.  This is why you see most drivers that
 actually do this sort of thing pre-allocate a pool of such beasts early
 on during boot up.  Unfortunately, this means that they'll be eating up
 a lot of memory (some may stay unused forever) at a very early stage.

 As for the guarantee that the data region will start at a page
 boundary, yes I can ensure that as long as you don't tamper with
 offsetting the m_data field of the mbuf after the allocator hands it to
 you.  That is, I can guarantee this:

  [ mbuf       ]
  [            ]
  [ m_data ----]---> [ jumbo buf  ]
                     [  (page 1)  ]
                     [------------]
                     [  (page 2)  ]
                     [------------]
                     [  (page 3)  ]

 So, as you see, it is definitely doable to have all the jumbo bufs
 start at a page boundary; however, it may be more worthwhile to have
 some of them start later.  We would have to implement it first and we
 would have to do some measurements to see what really works best.

 What I hate about jumbo bufs is that there's a lot of wastage in the
 last (3rd page).  I can't exactly use the last half of that last page
 for, say, a 2K cluster, because then I wouldn't be respecting the
 bucket "infrastructure" in mb_alloc that allows easy implementation of
 page freeing.  Pretty much the only "realistic" thing I could do is
 allocate jumbo bufs in groups of 3 or 4; this would ensure much less
 wastage but would mean that not all of them would start at page
 boundaries.

> Drew

Regards,
--
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: virtually contig jumbo mbufs (was Re: new zero copy sockets snapshot)

2002-07-05 Thread Bosko Milekic


On Fri, Jul 05, 2002 at 10:14:05AM -0400, Andrew Gallatin wrote:
> I think this would be fine, But we'd need to know more about the
> hardware limitations of the popular GiGE boards out there.  We know
> Tigon-II can handle 4 scatters, but are there any that can handle 3
> but not four?

  Why would you need 4?  I can absolutely guarantee that a jumbo buf
  will not go over 3 pages (on i386, and 2 pages on alpha).

> Drew

Regards,
-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: virtually contig jumbo mbufs (was Re: new zero copy sockets snapshot)

2002-07-05 Thread Bosko Milekic


On Fri, Jul 05, 2002 at 10:45:50AM -0400, Andrew Gallatin wrote:
> 
> Bosko Milekic writes:
>  > 
>  > On Fri, Jul 05, 2002 at 10:14:05AM -0400, Andrew Gallatin wrote:
>  > > I think this would be fine, But we'd need to know more about the
>  > > hardware limitations of the popular GiGE boards out there.  We know
>  > > Tigon-II can handle 4 scatters, but are there any that can handle 3
>  > > but not four?
>  > 
>  >   Why would you need 4?  I can absolutely guarantee that a jumbo buf
>  >   will not go over 3 pages (on i386, and 2 pages on alpha).
>  > 
> 
> Perhaps I misread your earlier message.  I thought you wanted to be
> free to align them on something other than a page boundary, so as to
> not waste so much space.  In that case you'd need 4 scatters on i386,
> right? 

  All I meant was that I wouldn't guarantee that the buffer _begins_ at
  a page boundary.  However, it still should only span at most 3 pages
  on i386.  Let's say for the sake of argument that we won't need more
  than 9212 bytes for a jumbo buffer.  Let's make the buffer 4 bytes
  bigger so that we'll have space for a ref. counter for it too (how
  convenient), so we'll allocate buffers of 9212 + 4 = 9216 bytes.  It's
  time to allocate from the map: I'll try to grab 9 pages for a total
  of 4 jumbo buffers each spanning at most 3 pages (this is i386).  I'll
  start the first buffer at the beginning of the first page and it'll
  end ((9216 - 8192) = 1024) bytes into the third page.  I'll then start
  the second jumbo buffer there and finish it 2048 bytes into the 5th page.
  I would then start the 3rd jumbo buffer there and finish it 3072 bytes
  into the 7th page.  The last jumbo buf would start there and finish at
  the end of the 9th page.  If my calculations are correct, that means
  that each jumbo buffer would span exactly 3 pages, but only the first
  one in the grouping would begin at a page boundary.  There is also
  minimum wastage here.  Briefly:

  Assuming that size_of_buf >= 2 * PAGE_SIZE, the scenario above
  satisfies:

  PAGE_SIZE % (size_of_buf - 2 * PAGE_SIZE) == 0

  We pick size_of_buf as small as possible but enough to accomodate the
  MTU and, and if possible, the ref. counter.  I think 9216 should be a
  good number.  At the same time, we try to make:

  bufs_per_bucket = PAGE_SIZE (integer DIV) (size_of_buf - 2 * PAGE_SIZE)

  bufs_per_bucket fairly reasonable (i.e., not too big).  In the above
  scenario, size_of_buf = 9216 and bufs_per_bucket = 4.

  Admittedly, the numbers above were cooked up, but I think that they're
  fairly reasonable.
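
  The layout arithmetic above can be checked mechanically; a small
  stand-alone sketch (constants as in the example above, names
  hypothetical):

```c
#include <assert.h>

#define PAGE_SIZE  4096
#define JUMBO_SIZE 9216	/* 9212-byte buffer + 4-byte refcount */
#define NBUFS      4	/* bufs_per_bucket from the text */

/* With NBUFS jumbo bufs packed back to back starting at a page
 * boundary, how many pages does buffer i touch? */
static int
pages_spanned(int i)
{
	long start = (long)i * JUMBO_SIZE;
	long end   = start + JUMBO_SIZE - 1;

	return (int)(end / PAGE_SIZE - start / PAGE_SIZE + 1);
}

/* Total pages consumed by the whole bucket: should be exactly 9,
 * i.e., no wastage at the tail. */
static int
total_pages(void)
{
	return ((NBUFS * JUMBO_SIZE + PAGE_SIZE - 1) / PAGE_SIZE);
}

/* Does buffer i begin on a page boundary?  Only the first should. */
static int
page_aligned(int i)
{
	return (((long)i * JUMBO_SIZE) % PAGE_SIZE == 0);
}
```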
  
> Drew
 
Cheers,
-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: virtually contig jumbo mbufs (was Re: new zero copy sockets snapshot)

2002-07-05 Thread Bosko Milekic


On Fri, Jul 05, 2002 at 09:45:01AM -0700, John Polstra wrote:
> The BCM570x chips (bge driver) definitely need a single physically
> contiguous buffer for each received packet.

  This is totally ridiculous for gigE hardware, IMO.  Do you know of
  other cards that can't do scatter gather DMA?

> John
> -- 
>   John Polstra
>   John D. Polstra & Co., Inc.Seattle, Washington USA
>   "Disappointment is a good sign of basic intelligence."  -- Chögyam Trungpa

Regards,
-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: the incredible shrinking socket

2002-07-08 Thread Bosko Milekic


On Mon, Jul 08, 2002 at 10:16:36AM -0700, Luigi Rizzo wrote:
> On Mon, Jul 08, 2002 at 11:56:22AM -0500, Mike Silbersack wrote:
> ...
> > Speaking of competition, someone should go look at this:
> > 
> > http://mail-index.netbsd.org/current-users/2002/07/03/0011.html
> 
> UDP sockets have the same problem... i posted patches for that
> case around dec.2000 which i never ended up committing.

  I spent a half-hour trying to dig for that thread.  Do you recall what
  the subject of it was?  When I saw this come up on DaemonNews, the
  first thing that came to mind was seeing that discussion; but now I
  can't find it through the mailing list search thing on the website.
  :-(

>   cheers
>   luigi
> 
> > Sounds interesting.
> > 
> > Mike "Silby" Silbersack

Cheers,
-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: the incredible shrinking socket

2002-07-08 Thread Bosko Milekic


On Mon, Jul 08, 2002 at 12:37:57PM -0500, Jonathan Lemon wrote:
> On Mon, Jul 08, 2002 at 10:29:37AM -0700, Luigi Rizzo wrote:
> > On Mon, Jul 08, 2002 at 01:22:14PM -0400, Bosko Milekic wrote:
> > ...
> > > > > Speaking of competition, someone should go look at this:
> > > > > 
> > > > > http://mail-index.netbsd.org/current-users/2002/07/03/0011.html
> > > > 
> > > > UDP sockets have the same problem... i posted patches for that
> > > > case around dec.2000 which i never ended up committing.
> > > 
> > >   I spent a half-hour trying to dig for that thread.  Do you recall what
> > >   the subject of it was?  When I saw this come up on DaemonNews, the
> > 
> > it was "[patch] fast sbappend*, please try..."
> > see
> > 
> > 
>http://docs.freebsd.org/cgi/getmsg.cgi?fetch=366972+0+archive/2001/freebsd-net/20010211.freebsd-net
> > 
> > jlemon had an amended patch for that.
> > I think we should revisit this keeping in mind the tcp case as well.
> 
> I still have the amended patch in my tree, I'll dig it out this week.

  Luigi also mentioned at the end of the discussion that it would be
  worthwhile to - besides just keeping a pointer to the last mbuf in
  the sockbuf - keep a pointer to the last mbuf in the packet.  Maybe
  this pointer could be stashed in the m_pkthdr struct.

> -- 
> Jonathan

Regards,
-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





mbuf external buffer reference counters

2002-07-11 Thread Bosko Milekic


Hi,

  Right now, in -CURRENT, there is this hack that I introduced that
  basically just allocates a ref. counter for external buffers attached
  to mbufs with malloc(9).  What this means is that if you do something
  like allocate an mbuf and then a cluster, there's a malloc() call that
  is made to allocate a small (usually 4-byte) reference counter for it.

  That sucks, and even -STABLE doesn't do this. I changed it this way
  a long time ago for simplicity's sake and since then I've been meaning
  to do something better here.  The idea was, for mbuf CLUSTERS, to
  stash the counter at the end of the 2K buffer area, and to make
  MCLBYTES = 2048 - sizeof(refcount), which should be more than enough,
  theoretically, for all cluster users.  This is by far the easiest
  solution (I had it implemented about 10 months ago) and it worked
  great.
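
  In other words (a rough user-space sketch of the layout, not the
  kernel code; cl_refcnt and friends are hypothetical names):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CLUSTER_SIZE 2048
/* The usable data area shrinks by the size of the embedded counter. */
#define MCLBYTES (CLUSTER_SIZE - sizeof(uint32_t))

/* The refcount lives in the last 4 bytes of the 2K buffer itself, so
 * no separate malloc(9) call is needed per cluster. */
static uint32_t *
cl_refcnt(char *cluster)
{
	return ((uint32_t *)(cluster + MCLBYTES));
}

static size_t
usable_bytes(void)
{
	return (MCLBYTES);
}

/* Demo: the counter sits flush against the end of the cluster and is
 * readable/writable without touching the data area. */
static int
counter_fits(void)
{
	/* uint32_t backing guarantees alignment for the counter. */
	static uint32_t backing[CLUSTER_SIZE / sizeof(uint32_t)];
	char *cl = (char *)backing;
	uint32_t *rc = cl_refcnt(cl);

	*rc = 3;
	return (((char *)rc + sizeof(*rc) == cl + CLUSTER_SIZE) && *rc == 3);
}
```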

  The purpose of this Email is to find out if anyone has concrete
  information on why this wouldn't work (if they think it wouldn't).
  So, if someone has an example of some broken code somewhere that
  wouldn't like this, please point it out to me now before I go off and
  do this again and commit it.

Thanks,
-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: mbuf external buffer reference counters

2002-07-11 Thread Bosko Milekic


On Thu, Jul 11, 2002 at 01:38:02PM -0700, Luigi Rizzo wrote:
> Hi,
> 
> certainly removing the malloc will improve performance a lot.
> 
> As I already mentioned to Bosko, in principle the available area
> in ext.buffers is irrelevant, and i do not believe this will break
> anything (and if it does, it will be easy to fix in the kernel),
> but some applications might decide to do writes in multiple of 1K
> and trimming away the refcount area might easily result in suboptimal
> allocation of storage within the kernel.

  Can you elaborate on the sub-optimal performance comment with,
  perhaps, an example?  I'm sorry but I'm sometimes slow to understand
  these days (lack of sleep).  Just recently, I was wondering why a
  10-megabit card was being outperformed by a 100-megabit card, where the
  former was plugged into a 10/100 dual-speed hub.  D'oh. :-(

> I'd probably try to use the same technique as -stable (i.e. have
> a separate array for all refcounts), if that is not too bad
> from other points of view.

  The problem with this approach is that I'm probably going to be
  allocating jumbo bufs from the same map, in which case you would have
  huge `gaps' in your address <-> ref. count location map and, as a
  result, a lot of wastage.  Plus the jumbo bufs will actually store the
  counter at the end of themselves, so it would be nice if clusters did
  the same.  I don't mind either way, but it's the first reason that
  compells me to stash them at the end of the cluster.

>   cheers
>   luigi
> 
> On Thu, Jul 11, 2002 at 04:20:26PM -0400, Bosko Milekic wrote:
> > 
> > Hi,
> > 
> >   Right now, in -CURRENT, there is this hack that I introduced that
> >   basically just allocates a ref. counter for external buffers attached
> >   to mbufs with malloc(9).  What this means is that if you do something
> >   like allocate an mbuf and then a cluster, there's a malloc() call that
> >   is made to allocate a small (usually 4-byte) reference counter for it.
> > 
> >   That sucks, and even -STABLE doesn't do this. I changed it this way
> >   a long time ago for simplicity's sake and since then I've been meaning
> >   to do something better here.  The idea was, for mbuf CLUSTERS, to
> >   stash the counter at the end of the 2K buffer area, and to make
> >   MCLBYTES = 2048 - sizeof(refcount), which should be more than enough,
> >   theoretically, for all cluster users.  This is by far the easiest
> >   solution (I had it implemented about 10 months ago) and it worked
> >   great.
> > 
> >   The purpose of this Email is to find out if anyone has concrete
> >   information on why this wouldn't work (if they think it wouldn't).
> >   So, if someone has an example of some broken code somewhere that
> >   wouldn't like this, please point it out to me now before I go off and
> >   do this again and commit it.
> > 
> > Thanks,
> > -- 
> > Bosko Milekic
> > [EMAIL PROTECTED]
> > [EMAIL PROTECTED]

  Thanks for your feedback.

-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: mbuf external buffer reference counters

2002-07-11 Thread Bosko Milekic


On Thu, Jul 11, 2002 at 01:56:08PM -0700, Luigi Rizzo wrote:
> example: userland does an 8KB write, in the old case this requires
> 4 clusters, with the new one you end up using 4 clusters and stuff
> the remaining 16 bytes in a regular mbuf, then depending on the
> relative producer-consumer speed the next write will try to fill
> the mbuf and attach a new cluster, and so on... and when TCP hits
> these data-in-mbuf blocks will have to copy rather than reference
> the data blocks...
> 
> Maybe it is irrelevant for performance, maybe it is not,
> i am not sure.

  I see what you're saying.  I think that what this means is simply that
  the `optimal' chunk of data to send is just a different size, so
  instead of it being 8192 bytes, it'll be something like 8180 bytes or
  something (to account for the counters).  So, in other words, it
  really depends on the frequency of exact 8192 sized sends in userland
  applications.

  This is a good observation if we're going to be doing benchmarking,
  but I'm not sure whether the repercussions are that important (unless,
  as I said, there's a lot of applications that send exactly 8192
  byte chunks?).  Basically, what we're doing is shifting the optimal
  send size when using exactly 4 clusters, in this case, to (8192 - 16)
  bytes.  We can still send with exactly 4 clusters, it's just that the
  optimal send size is a little different, that's all (this produces a
  small shift in block send benchmark curves, usually).
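
  Concretely, assuming a hypothetical 4-byte counter stashed at the end
  of each 2K cluster, the optimal write size for n clusters shifts as
  follows (names are illustrative only):

```c
#include <assert.h>

#define CLUSTER_SIZE 2048
#define REFCNT_SIZE  4	/* counter embedded at the end of each cluster */
#define NEW_MCLBYTES (CLUSTER_SIZE - REFCNT_SIZE)

/* Largest write that still fits in exactly n clusters once each
 * cluster gives up REFCNT_SIZE bytes to the embedded counter. */
static int
optimal_write(int n)
{
	return (n * NEW_MCLBYTES);
}
```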

> >   The problem with this approach is that I'm probably going to be
> >   allocating jumbo bufs from the same map, in which case you would have
> >   huge `gaps' in your address <-> ref. count location map and, as a
> 
> how huge ? and do you really need to use the same map rather than
> two different ones ?

  Well, I can use a different map, I guess (I use a different map for
  mbufs in order to not let huge cluster allocations eat up all of the
  address space reserved for mbufs).  However, it seems that jumbo bufs
  and clusters are logically equivalent (your driver will either use one
  or the other) so it would make sense to have them share the same
  `chunk' of address space.

  As for the gaps, they are quite huge.  I think we calculated a week or
  so ago when discussing jumbo bufs that we would probably end up
  allocating them in chunks of 3 or 4 at a time.  So that would mean
  ~9-page 'holes' in the address space from which clusters are
  allocated, which would mean at least ~18 counters wasted for every
  hole.  With the number of jumbo bufs we would have, that can really
  add up.
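
  The waste estimate is straightforward arithmetic (a sketch with
  hypothetical names; 4K pages and 2K clusters as above):

```c
#include <assert.h>

#define PAGE_SIZE    4096
#define CLUSTER_SIZE 2048

/* With one counter slot per CLUSTER_SIZE of address space, a hole of
 * hole_pages pages (e.g. one jumbo-buf bucket) wastes this many
 * counter slots in an address <-> refcount array. */
static int
wasted_counters(int hole_pages)
{
	return (hole_pages * PAGE_SIZE / CLUSTER_SIZE);
}
```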

>   cheers
>   luigi
> 

-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: mbuf external buffer reference counters

2002-07-11 Thread Bosko Milekic


On Thu, Jul 11, 2002 at 11:00:56PM +0200, Juan Francisco Rodriguez Hervella wrote:
> First of all, let me say that I'm a newbie with these topics;
> I only know a bit about mbufs, but I don't understand how
> an application can trim away the refcount if the size is
> MCLBYTES = 2040 - sizeof(refcount).
> 
> Doesn't this limit the room an app. can write into?

 No.  It just changes it.  Also, it's not the application trimming away
 anything, and the size is not 2040, it's 2048. :-)

 FWIW, the application has no idea about mbufs or clusters.  These are
 buffers and structures only the kernel knows about.

> Thanks.
> 
> JFRH

-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: mbuf external buffer reference counters

2002-07-11 Thread Bosko Milekic


On Thu, Jul 11, 2002 at 07:31:17PM -0400, Kelly Yancey wrote:
> >   This is a good observation if we're going to be doing benchmarking,
> >   but I'm not sure whether the repercussions are that important (unless,
> >   as I said, there's a lot of applications that send exactly 8192
> >   byte chunks?).  Basically, what we're doing is shifting the optimal
> >   send size when using exactly 4 clusters, in this case, to (8192 - 16)
> >   bytes.  We can still send with exactly 4 clusters, it's just that the
> >   optimal send size is a little different, that's all (this produces a
> >   small shift in block send benchmark curves, usually).
> > 
> 
>   Are you kidding?  Benchmarks, presumably like every other piece of
> software produced by someone trying to get the most performance out of
> the system, are more likely to have power-of-two write buffers.  Are you
> willing to risk that they didn't also just happen to pick a multiple of
> 2^11?
> 
>   Yes, it seems elegant to put the counters in the space that is normally
> unused for receive mbuf clusters, but you can't just blow off Luigi's
> point regarding the send side.

 First of all, I'm not "blowing off" anyone's comments.  I don't
 appreciate the fact that you're eagerly instructing me to "not blow off
 comments" (which I didn't do to begin with) without providing any more
 constructive feedback.

 All I pointed out was that the optimal block size is merely changed
 from an exact 2k, 4k, 8k, etc. to something slightly smaller.  What
 point are *you* trying to put across?  Tell me what's bad about that
 or, better: 
 
 Do you have a better suggestion to make?  What do *you* suggest we do
 with the external ref. counts?  Please, spare me the flame bait.  I
 wasn't being confrontational when I answered Luigi's post and I don't
 need anyone turning this into something confrontational.  Thanks.

>   Kelly
> 
> --
> Kelly Yancey -- kbyanc@{posi.net,FreeBSD.org}

-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: mbuf external buffer reference counters

2002-07-11 Thread Bosko Milekic


On Thu, Jul 11, 2002 at 04:10:32PM -0700, Julian Elischer wrote:
> Don't forget that "external" does not neccesarily mean "cluster".
> I still consider the method used in (hmm was it NetBSD or OSF/1?)
> to be very good..
>
> mbufs that referred to the same object were linked together.
> I forget the details exactly.  maybe someone else can remember..
> it did it without ref counts somehow..

 Yes, this is in NetBSD still and it is very elegant.  I remember
 looking at this a long time ago but to be honest, the reason I didn't
 implement it then first escaped me.  However, thanks to David Malone's
 awesome commit messages, I found it:

 rev 1.53 of sys/sys/mbuf.h, extract:
[...]
 "NetBSD's system of linked lists of mbufs was cosidered, but Alfred
 felt it would have locking issues when the kernel was made more
 SMP friendly."
[...]

 I think it's almost clear now that there are, in fact, no SMP issues
 with it (we don't do per-cluster locking, or anything ridiculous like
 that), so unless Alfred has the reason again, I'll consider that method
 again instead.  Thanks for the constructive feedback.

Regards, 
-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: mbuf external buffer reference counters

2002-07-11 Thread Bosko Milekic


I'm sorry.  I should have waited before hitting the "send" button.

I've had a long and [shitty] day and I shouldn't have blown up here.

Sorry.

-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: mbuf external buffer reference counters

2002-07-12 Thread Bosko Milekic

On Fri, Jul 12, 2002 at 12:10:41AM -0700, Alfred Perlstein wrote:
> * Julian Elischer <[EMAIL PROTECTED]> [020712 00:00] wrote:
> > 
> > 
> > On Thu, 11 Jul 2002, Alfred Perlstein wrote:
> > > 
> > > That's true, but could someone explain how one can safely and
> > > efficiently manipulate such a structure in an SMP environment?
> > 
> > what does NetBSD do for that?
> 
> They don't!
> 
>  *** waves skull staff exasperatedly ***
> 
> RORWLRLRLLRL

 Again, Alfred is right. :-)

 I can't think of a way to ensure that the owner of the other mbuf
 doesn't manipulate its two forward/backward pointers while we're
 manipulating ours.  The only way that springs to mind is to have them
 protected by a mutex, but:

 1) that would be very expensive and would bloat the mbuf structure a
LOT;

 2) we would probably run into lock order reversal problems.

 I see now what Alfred meant when he made his original comment.

-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: mbuf external buffer reference counters

2002-07-12 Thread Bosko Milekic


On Fri, Jul 12, 2002 at 04:26:53AM -0700, Jon Mini wrote:
> On Thu, Jul 11, 2002 at 11:41:04PM -0700, Alfred Perlstein wrote:
> > > That's a cool idea.. haven't looked at NetBSD but am imagining the
> > > mbufs would be linked in a 'ring'. This works because you never
> > > care how many references are, just whether there's one or more than
> > > one, and this is easy to tell by examining the ring pointer.
> > > I.e., you never have to iterate through the entire ring.
> > 
> > That's true, but could someone explain how one can safely and
> > efficiently manipulate such a structure in an SMP environment?
> > 
> > I'm not saying it's impossible, I'm just saying it didn't seem
> > intuative to me back then, as well as now.
> 
> I'm probably speaking out of turn here (I have no idea what structure you
> all are talking about), but a monodirectional ring can be safely modified
> with a compare-and-exchange atomic operation.

 The gist of the problem is that when you want to, say, remove yourself
 from the list, you have to set:

 1) your "next"'s back pointer to your "back" pointer;
 2) your "prev"'s next pointer to your "next" pointer.

 So that's two operations but for all you know your "next" or your
 "back" may be doing the same thing to you at the same time.  As far as
 I know, you can't (intuitively) figure out a way to do both of these
 atomically. i.e., maybe you'll set your next's back pointer to whatever
 you have in `back' but then your `back' guy will set your back pointer
 to whatever he has in `back' and then your next guy's back pointer will
 be invalid, for example.

 So I guess that what we're dealing with isn't really a
 "monodirectional" ring.  Right?
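
 For concreteness, here is a user-space sketch of such a doubly-linked
 ring, with the two stores in the remove path marked (field and
 function names are hypothetical, not NetBSD's actual code):

```c
#include <assert.h>

/* Doubly-linked ring of mbufs sharing one external buffer.  Note that
 * ring_remove() takes TWO stores into *other* nodes -- there is no
 * single atomic op covering both, which is the race discussed above. */
struct mext {
	struct mext *next;
	struct mext *prev;
};

static void
ring_init(struct mext *m)
{
	m->next = m->prev = m;
}

static void
ring_insert(struct mext *after, struct mext *m)
{
	m->next = after->next;
	m->prev = after;
	after->next->prev = m;
	after->next = m;
}

static void
ring_remove(struct mext *m)
{
	m->next->prev = m->prev;	/* store #1: touches one neighbour  */
	/* <-- window: a concurrent remove of m->prev sees stale links */
	m->prev->next = m->next;	/* store #2: touches the other one  */
	m->next = m->prev = m;
}

/* Sole reference iff the node points back at itself -- no counter
 * needed, which is the appeal of the scheme. */
static int
ring_is_last(const struct mext *m)
{
	return (m->next == m);
}

static int
ring_demo(void)
{
	struct mext a, b;

	ring_init(&a);
	ring_insert(&a, &b);	/* two mbufs now share the buffer */
	if (ring_is_last(&a))
		return (0);
	ring_remove(&b);	/* drop one reference */
	return (ring_is_last(&a) && ring_is_last(&b));
}
```

 Single-threaded, the ring works beautifully; the race window between
 the two stores is exactly what a lock (or a refcount + atomics) would
 have to close.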
 
> -- 
> Jonathan Mini <[EMAIL PROTECTED]>
> http://www.freebsd.org/

Regards,
-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: mbuf external buffer reference counters

2002-07-12 Thread Bosko Milekic


On Fri, Jul 12, 2002 at 11:03:45AM -0700, John Polstra wrote:
> I've been out of town and I realize I'm coming into this thread late
> and that it has evolved a bit.  But I still think it's worthwhile to
> point out a very big problem with the idea of putting the reference
> count at the end of each mbuf cluster.  It would have disastrous
> consequences for performance because of cache effects.  Bear with me
> through a little bit of arithmetic.
> 
> Consider a typical PIII CPU that has a 256 kbyte 4-way set-associative
> L2 cache with 32-byte cache lines.  4-way means that there are 4
> different cache lines associated with each address.  Each group of 4
> is called a set, and each set covers 32 bytes of the address space
> (the cache line size).
> 
> The total number of sets is:
> 
> 256 kbytes / 32 bytes per line / 4 lines per set = 2048 sets
> 
> and as mentioned above, each set covers 32 bytes.
> 
> The cache wraps around every 256 kbytes / 4-way = 64 kbytes of address
> space.  In other words, if address N maps onto a given set, then
> addresses N + 64k, N + 128k, etc. all map onto the same set.
> 
> An mbuf cluster is 2 kbytes and all mbuf clusters are well-aligned.
> So the wrap around of the cache occurs every 64 kbytes / 2 kbytes per
> cluster = 32 clusters.  To put it another way, all of the reference
> counts would be sharing (i.e., competing for) the same 32 cache sets
> and they would never utilize the remaining 2061 sets at all.  Only
> 1.56% of the cache (32 sets / 2048 sets) would be usable for the
> reference counts.  This means there would be a lot of cache misses as
> reference count updates caused other reference counts to be flushed
> from the cache.
> 
> These cache effects are huge, and they are growing all the time as CPU
> speeds increase while RAM speeds remain relatively constant.

  I've thought about the cache issue with regards to the ref. counts
  before, actually, and initially, I also thought the exact same thing
  as you bring up here.  However, there are a few things you need to
  remember:

 1) SMP; counters are typically referenced by several different threads
 which may be running on different CPUs at any given point in time, and
 this means that we'll probably end up having corresponding cache lines
 invalidated back and forth anyway;

 2) Using more cache lines may not be better overall, we may be doing
 write-backs of other data already there; in any case, we would really
 have to measure this;

 3) By far the most important: all modifications to the ref. count are
 atomic, bus-locked, ops.  I spoke to Peter a little about this and
 although I'm not 100% sure, we think that bus-locked
 fetch-inc/dec-stores need the bus anyway.  If that's the case,
 then we really don't care about whether or not they get cached, right?

> John
> -- 
>   John Polstra
>   John D. Polstra & Co., Inc.    Seattle, Washington USA
>   "Disappointment is a good sign of basic intelligence."  -- Chögyam Trungpa

 Thanks for the cool infos. and feedback.

Regards,
-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: mbuf external buffer reference counters

2002-07-12 Thread Bosko Milekic


On Fri, Jul 12, 2002 at 06:55:37PM -0400, Andrew Gallatin wrote:
[...]

FWIW, BSD/OS also does similar to -STABLE.

[...]
> I agree with John about where to put the refcnts: I think we should
> have a big hunk of memory for the refcnts like in -stable.  My
> understanding is that the larger virtually contig mbufs are the only
> thing that would cause a problem for this, or is that incorrect?
> If so, then why not just put their counter elsewhere?
> 
> One concrete example against putting the refcnts into the cluster is
> that it would cause NFS servers & clients to use 25% more mbufs for a
> typical 8K read or write request.

  If we decide to allocate jumbo bufs from their own separate map as
  well, then we have no wastage for the cluster counters if we keep
  them in a few pages, like in -STABLE, and it should all work out fine.

  For the jumbo bufs I still maintain that we should keep the counter
  for them at the end of the buf because the math works out (see my post
  in that thread with the math example) and because their total size is
  not a power of 2 anyway.  They'll also be more randomly spread out and
  use more cache slots.

> Drew

Regards,
-- 
Bosko Milekic
[EMAIL PROTECTED]
[EMAIL PROTECTED]





Re: Rename of MSIZE kernel option..

2002-10-14 Thread Bosko Milekic


Not that my opinion really holds much weight with you guys, but for what
it's worth, I think the change would be gratuitous.

1. MSIZE has been around forever.

2. The argument that sys/sys/mbuf.h should have MSIZE removed/changed
because some other code may use it is fallacious.  The "other code"
should be careful not to use a constant that has had this name for as
long as I can recall.

On Mon, Oct 14, 2002 at 04:15:12PM -0400, John Baldwin wrote:
> Would people be open to renaming the 'MSIZE' kernel option to something
> more specific such as 'MBUF_SIZE' or 'MBUFSIZE'?  Using 'MSIZE' can
> break other places in the kernel.  For example, ISA device ivars have
> an ivar for the size of a memory resource called 'MSIZE' and the kernel
> option causes breakage in src/sys/isa/isavar.h:
> 
> ISA_ACCESSOR(msize, MSIZE, int)
> 
> when ISA_ACCESSOR is properly defined via __BUS_ACCESSOR() rather than
> homerolling a private copy of __BUS_ACCESSOR().  For now I've fixed it
> to rename the ISA ivar to ISA_IVAR_MEMSIZE but MSIZE seems to be too
> generic a name for a kernel option and it would be nice to avoid this
> problem in the future.
> 
> -- 
> 
> John Baldwin <[EMAIL PROTECTED]>  <><  http://www.FreeBSD.org/~jhb/
> "Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/

-- 
Bosko Milekic * [EMAIL PROTECTED] * [EMAIL PROTECTED]





Re: @stake advisory: etherleak

2003-01-07 Thread Bosko Milekic

On Tue, Jan 07, 2003 at 02:15:02PM -0800, Nate Lawson wrote:
> The short of it is that if a tx packet is < 64 bytes (min ethernet frame
> len), data can be leaked if the driver transmits 64 bytes.  It seems our
> use of mbufs would prevent leakage but I haven't examined any drivers to
> verify this.
> 
> http://www.atstake.com/research/advisories/2003/atstake_etherleak_report.pdf
> 
> -Nate

  Heh!  This seems like a pretty long paper for what it's worth.  I can
  see how we could have the same "problem" but keep in mind the scope of
  the issue (the paper blows it out of proportion, I think, for someone
  who may not necessarily understand the issue too well): we're talking
  about very small packets here, under 64 bytes.  So the "leaked
  information" will be AT MOST 63 bytes, but practically somewhere
  around 10 bytes and would be very occasional.  Not to mention that
  it's totally undefined and random.  An "attacker" might as well just
  rely on temperature to guess at how to interpret what he/she's seeing
  in those few bytes.  The data in our case is probably DMA'd straight
  out of the mbuf's data region so what you'll probably find in there is
  just randomness from something before, not necessarily network data.

  Moreover, as the paper does mention, although the RFCs do appear to
  specify that for packets that are padded to the minimum ethernet
  packet size the padded bytes should be zeroed, it's pretty ambiguous
  as to where the zero-ing should happen.  As noted, some cards will do
  the zeroing onboard.

  If you really see this as a 'security problem' then feel free to zero
  the padded bytes manually in the problematic drivers.  Doing it
  universally would be too much in my opinion, especially given that
  some hardware takes care of it itself.

-- 
Bosko Milekic * [EMAIL PROTECTED] * [EMAIL PROTECTED]





Re: @stake advisory: etherleak

2003-01-08 Thread Bosko Milekic

On Tue, Jan 07, 2003 at 06:09:23PM -0800, Nate Lawson wrote:
> On Tue, 7 Jan 2003, Bosko Milekic wrote:
> > On Tue, Jan 07, 2003 at 02:15:02PM -0800, Nate Lawson wrote:
> >   Not to mention that
> >   it's totally undefined and random.
> 
> Well, you have the guarantee that it's network data since the min mbuf
> size is 128 bytes and a packet must be less than 64 bytes to trigger this.

  No, not really.  Yes, the mbufs come from a vmmap that is reserved for
  mbufs and clusters but it doesn't mean that the underlying physical
  memory contains network data, necessarily.  It is the case if you're
  getting an mbuf from the allocator cache, but as soon as you make that
  call to vm you no longer have that guarantee.

> >  An "attacker" might as well just
> >   rely on temperature to guess at how to interpret what he/she's seeing
> >   in those few bytes.  The data in our case is probably DMA'd straight
> >   out of the mbuf's data region so what you'll probably find in there is
> >   just randomness from something before, not necessarily network data.
> 
> Since the mbuf pool is statically allocated at boot, it's likely only mbuf
> hdrs or contents would leak this way.  Still, this is data leakage even
> though it's a small channel.

  No.  You are leaking contents but it is not necessarily network data.
  It may be more often than not, but there is no way to determine what
  the nature/source of the data is remotely.

> >   Moreover, as the paper does mention, although the RFCs do appear to
> >   specify that for packets that are padded to the minimum ethernet
> >   packet size the padded bytes should be zeroed, it's pretty ambiguous
> >   as to where the zero-ing should happen.  As noted, some cards will do
> >   the zeroing onboard.
> 
> For cards that don't, it's the driver's responsibility to do this.  Some
> do (xe) and some don't (ie).
> 
> >   If you really see this as a 'security problem' then feel free to zero
> >   the padded bytes manually in the problematic drivers.  Doing it
> >   universally would be too much in my opinion, especially given that
> >   some hardware takes care of it itself.
> 
> I believe this should be done for the same reason that pages allocated to
> the user are pre-zeroed.

  OK, but not in a universal fashion.  I think drivers with this problem
  should implement it locally.  Not all drivers have the problem because
  not all network cards "forget" to pad the data with zeros.

> -Nate

-- 
Bosko Milekic * [EMAIL PROTECTED] * [EMAIL PROTECTED]





Re: Performance tuning hints of gigabit networking?

2003-03-03 Thread Bosko Milekic


You're not running out of mbufs or clusters, you're out of RAM.
Don't bump up nmbclusters anymore because you don't need to; instead,
add more RAM.

On Wed, Feb 26, 2003 at 10:05:51PM +0900, CHOI Junho wrote:
> 
> Hi,
> 
> I am looking for a good resource on kernel tuning for very high
> bandwidth HTTP servers (avg 500Mbit/sec, peak 950Mbit/sec).  Today I
> faced a very unusual situation at 950Mbit/sec!
> 
> > netstat -m
> 16962/93488/262144 mbufs in use (current/peak/max):
> 16962 mbufs allocated to data
> 16952/65536/65536 mbuf clusters in use (current/peak/max)
> 15 Kbytes allocated to network (14% of mb_map in use)
> 512627 requests for memory denied
> 2614 requests for memory delayed
> 0 calls to protocol drain routines
> 
> I set kern.ipc.nmbclusters=65536, but it overflowed. This is P-IV Xeon
> 1.8G, 2GB RAM, and one Intel 1000baseSX(em driver) machine running
> 4.7-RELEASE-pX. This server is running only one service, HTTP. I use
> thttpd, since apache doesn't work under such a high load.  thttpd is quite
> amazing, keeping the load average below 1 at all times.
> 
> Once I tried to increase kern.ipc.nmbclusters to 131072 or
> higher (a multiple of 65536 or 32768; tuning(7) only cites the 32768
> case), but the kernel failed to boot at 262144 and panicked under
> somewhat higher load at 131072, so I gave up on further changes and fell
> back to 65536.
> 
> What is a good way to calculate this value safely? Here is another
> hint, /etc/sysctl.conf:
> 
> net.link.ether.inet.log_arp_wrong_iface=0
> kern.ipc.maxsockbuf=2048000
> kern.ipc.somaxconn=4096
> kern.ipc.maxsockets=6
> kern.maxfiles=65536
> kern.maxfilesperproc=32768
> net.inet.tcp.rfc1323=1
> net.inet.tcp.delayed_ack=0
> net.inet.tcp.sendspace=65535
> net.inet.tcp.recvspace=65535
> net.inet.udp.recvspace=65535
> net.inet.udp.maxdgram=57344
> net.inet.icmp.drop_redirect=1
> net.inet.icmp.log_redirect=1
> net.inet.ip.redirect=0
> net.inet6.ip6.redirect=0
> net.link.ether.inet.max_age=1200
> net.inet.ip.sourceroute=0 
> net.inet.ip.accept_sourceroute=0 
> net.inet.icmp.bmcastecho=0
> net.inet.icmp.maskrepl=0
> net.inet.tcp.inflight_enable=1
> 
> kernel configuration is not specially tuned, except DEVICE_POLLING and
> HZ=2000.
> 
> --
> CHOI Junho <http://www.kr.FreeBSD.org/~cjh> KFUG 
> FreeBSD Project     Web Data Bank 
> Key fingerprint = 1369 7374 A45F F41A F3C0  07E3 4A01 C020 E602 60F5
> 
> 

-- 
Bosko Milekic * [EMAIL PROTECTED] * [EMAIL PROTECTED]




Re: mbuf cache

2003-03-04 Thread Bosko Milekic

On Wed, Mar 05, 2003 at 01:12:55AM +0200, Petri Helenius wrote:
> > > Any comments on the high cpu consumption of mb_free? Or any other places
> > > where I should look to improve performance?
> >
> >   What do you mean "high cpu consumption?"  The common case of mb_free()
> >   is this:
> 
> According to profiling, mb_free takes 18.9% of all time consumed in the kernel
> and is almost double the next most CPU-consuming function.  Since I'm looking
> at how to optimize the system, it's usually a good idea to start where most
> CPU is spent.
> 
> For example, I'm wondering if mbufs get unnecessarily freed and allocated or
> thrown around different buckets because the tunables are wrong.  I'm not
> saying that there must be something wrong with the code itself.
> 
> For example what does "in use" mean below: There is no way there is enough
> traffic on the system to allocate 7075 mbufs when this netstat -m was taken.
> 
> mbuf usage:
> GEN cache:  0/0 (in use/in pool)
> CPU #0 cache:   7075/8896 (in use/in pool)
> CPU #1 cache:   1119/4864 (in use/in pool)
> Total:  8194/13760 (in use/in pool)
> Mbuf cache high watermark: 8192
> Mbuf cache low watermark: 128
> 
> 
> Pete

  This does look odd... maybe there's a leak somewhere... does "in use"
  go back down to a much lower number eventually?  What kind of test are
  you running?  "in pool" means that that's the number in the cache
  while "in use" means that that's the number out of the cache
  currently being used by the system; but if you're telling me that
  there's no way usage could be that high while you ran the netstat,
  either there's a serious leak somewhere or I got the stats wrong
  (anyone else notice irregular stats?)

  Another thing I find odd about those stats is that you've set the high
  watermark to 8192, which means that in the next free, you should be
  moving buckets to the general cache... see if that's really
  happening...  The low watermark doesn't affect anything right now.

  Can you give me more details on the exact type of test you're running?
  Let's move this to -current only, instead of both -current and -net,
  please (feel free to trim whichever you want); getting 3 copies of the
  same message all the time is kinda annoying. :-(

-- 
Bosko Milekic * [EMAIL PROTECTED] * [EMAIL PROTECTED]




Re: mbuf cache

2003-03-04 Thread Bosko Milekic

On Tue, Mar 04, 2003 at 11:34:11PM +0200, Petri Helenius wrote:
> 
> I did some profiling on -CURRENT from a few days back, and I think I haven't
> figured the new tunables out, or the code is not doing what it's supposed to,
> or I'm asking more than it is supposed to do, but it seems that mb_free
> is being quite wasteful...
> 
> Any pointers to how the new high/low watermark tunables should be used?
> 
> Is it normal that after almost all traffic has been stopped there are still
> 8k+ mbufs in "cache"?
> 
> Pete
 
  Yes, it's normal.  The commit log clearly states that the new
  watermarks do nothing for now.  I have a patch that changes that but I
  haven't committed it yet because I left for vacation last Sunday and I
  only returned early this Monday.  Since then, I've been too busy to
  clean up and commit it.  In about a week or so you should notice a
  difference.

-- 
Bosko Milekic * [EMAIL PROTECTED] * [EMAIL PROTECTED]




Re: mbuf cache

2003-03-04 Thread Bosko Milekic

On Wed, Mar 05, 2003 at 12:24:27AM +0200, Petri Helenius wrote:
> >  
> >   Yes, it's normal.  The commit log clearly states that the new
> >   watermarks do nothing for now.  I have a patch that changes that but I
> >   haven't committed it yet because I left for vacation last Sunday and I
> >   only returned early this Monday.  Since then, I've been too busy to
> >   clean up and commit it.  In about a week or so you should notice a
> >   difference.
> > 
> Any comments on the high cpu consumption of mb_free? Or any other places
> where I should look to improve performance?

  What do you mean "high cpu consumption?"  The common case of mb_free()
  is this:

  bucket = mb_list->ml_btable[MB_BUCKET_INDX(m, mb_list)];
  owner = bucket->mb_owner & ~MB_BUCKET_FREE;
  switch (owner) {
  ...
  default:
          cnt_lst = MB_GET_PCPU_LIST_NUM(mb_list, owner);
          MB_PUT_OBJECT(m, bucket, cnt_lst);
          MB_MBTYPES_DEC(cnt_lst, type, 1);
          if (bucket->mb_owner & MB_BUCKET_FREE) {
                  SLIST_INSERT_HEAD(&(cnt_lst->mb_cont.mc_bhead),
                      bucket, mb_blist);
                  bucket->mb_owner = cnt_lst->mb_cont.mc_numowner;
          }
  }

  That's the common path, modulo a couple checks on whether or not the
  container being freed to needs to be moved or whether we're in a
  starvation situation.  The only thing to be done, possibly, is to make
  the routine smaller by moving those couple of actions to separate
  routines, but I'm not sure that this is very useful, given that
  mb_free()'s usage is restricted to only inside subr_mbuf.c.

> Pete

-- 
Bosko Milekic * [EMAIL PROTECTED] * [EMAIL PROTECTED]




Re: Zero Copy Sockets?

2003-06-22 Thread Bosko Milekic

On Sun, Jun 22, 2003 at 12:46:26PM -0700, George V. Neville-Neil wrote:
> Hi,
> 
>   I'm reading over the internals of the network stack in
>   -CURRENT and I'm wondering if the Zero Copy stuff is actually
>   in use yet.
> 
> Thanks,
> George

  Yes.  But your driver needs to support it.  Currently, the only driver
  that does is if_ti in src/sys/pci/if_ti*.[ch]

  sendfile(2) is also zero-copy for the most part (for the file part).
  That's been in use for a long time.  src/sys/kern/uipc_syscalls.c

Regards,
-- 
Bosko Milekic  *  [EMAIL PROTECTED]  *  [EMAIL PROTECTED]
TECHNOkRATIS Consulting Services  *  http://www.technokratis.com/


Re: Fine grained locking at the socket level?

2003-06-22 Thread Bosko Milekic

On Sun, Jun 22, 2003 at 04:58:54PM -0700, George V. Neville-Neil wrote:
> Hi,
> 
>   It would seem that splnet() and frieds now simply return 0,
>   which I figure is part of making the code look like it used
>   to.  What I'm wondering is why the Giant lock is still used in
>   the socket layer?  I thought sockets had had fine grained
>   locking applied to them.  Am I confused?  I'm looking at the
>   bits du jour (-CURRENT).
> 
> Thanks,
> George

  The short answer is: we're not done.

  The long answer is: we're not done. :-)  We can't simply unwind Giant
  just anywhere yet because there is still code in other layers that
  requires Giant.

Cheers,
-- 
Bosko Milekic  *  [EMAIL PROTECTED]  *  [EMAIL PROTECTED]
TECHNOkRATIS Consulting Services  *  http://www.technokratis.com/


Re: Kernel tuning for large maxsockets

2003-07-16 Thread Bosko Milekic

On Tue, Jul 15, 2003 at 05:22:55PM -0400, Scot Loach wrote:
> Currently, whenever maxsockets is increased, this causes kernel memory to be
> preallocated for each type of pcb (tcp, udp, raw, divert).  The number of
> pcbs preallocated for each of these is always the same as maxsockets.
> 
> This is probably a waste of memory for raw sockets and divert sockets, since
> they would not normally be used in large numbers.   A large server could
> save kvm by reducing the number of divert and raw pcbs preallocated.  For
> example, on a machine configured for 200,000 maxsockets we would save 75MB
> of kvm out of a total of 262MB that would have been preallocated.  This kvm
> savings can be used to increase maxsockets even more.

  What version of FreeBSD are you talking about here?  In -current, the
  pcbs come off of zones which are capped at a maximum w.r.t.
  maxsockets.  The kva space comes out of kmem_map and the objects are
  kept cached in their respective zones.  One thing to note is that the
  zones are setup with UMA_ZONE_NOFREE, so the pages wired down for pcb
  use are probably never unwired.

  I don't know why, exactly, UMA_ZONE_NOFREE is required for all of the
  pcb zones in the netinet code.

> Is there any reason I should not modify the kernel code to only let a small,
> fixed number of raw and divert pcbs be preallocated instead of having them
> scale with maxsockets?
> 
> Next, does this seem like a generally useful thing that could be rolled back
> into the source tree?  I could make this a kernel option or a tunable sysctl
> variable.
> 
> thanks
> 
> Scot Loach

-- 
Bosko Milekic  *  [EMAIL PROTECTED]  *  [EMAIL PROTECTED]
TECHNOkRATIS Consulting Services  *  http://www.technokratis.com/

