date:20130115

Re: Some questions about the new TCP congestion control code

2013-01-15 Thread Lawrence Stewart

Hi John,

On 01/15/13 08:04, John Baldwin wrote:
> I was looking at TCP congestion control at work recently and noticed a few 

Poor you ;)

> "odd" things in the current code.  First, there is this chunk of code in 
> cc_ack_received() in tcp_input.c:
> 
> static void inline
> cc_ack_received(struct tcpcb *tp, struct tcphdr *th, uint16_t type)
> {
>   INP_WLOCK_ASSERT(tp->t_inpcb);
> 
>   tp->ccv->bytes_this_ack = BYTES_THIS_ACK(tp, th);
>   if (tp->snd_cwnd == min(tp->snd_cwnd, tp->snd_wnd))
>   tp->ccv->flags |= CCF_CWND_LIMITED;
>   else
>   tp->ccv->flags &= ~CCF_CWND_LIMITED;
> 
> 
> Due to hysterical raisins, snd_cwnd and snd_wnd are u_long values, not 
> integers, so the call to min() results in truncation on 64-bit hosts.

Good catch, but I don't think it matters in practice as neither snd_cwnd
or snd_wnd will grow past the 32-bit boundary.

> It should probably be ulmin() instead.  However, this line seems to be a 
> really 
> obfuscated way to just write:
> 
>   if (tp->snd_cwnd <= tp->snd_wnd)

You are correct, though I'd argue the meaning of the existing code as
written is clearer compared to your suggested change.

> If that is correct, I would vote for changing this to use the much simpler 
> logic.

Agreed. While I find the existing code slightly clearer in meaning, it's
not significant enough to warrant keeping it as is when your suggested
change is simpler, fixes a bug and achieves the same thing. Happy for
you to change it or I can do it if you prefer.

> Secondly, in the particular case I was investigating at work (restart of an 
> idle connnection), the newreno congestion control code in 8.x and later uses 
> a 
> different algorithm than in 7.  Specifically, in 7 TCP would reuse the same 
> logic used for an initial cwnd (honoring ss_fltsz).  In 8 this no longer 
> happens (instead, 2 is hardcoded).  A guess at a possible fix might look 
> something like this:
> 
> Index: cc_newreno.c
> ===
> --- cc_newreno.c  (revision 243660)
> +++ cc_newreno.c  (working copy)
> @@ -169,8 +169,21 @@ newreno_after_idle(struct cc_var *ccv)
>   if (V_tcp_do_rfc3390)
>   rw = min(4 * CCV(ccv, t_maxseg),
>   max(2 * CCV(ccv, t_maxseg), 4380));
> +#if 1
>   else
>   rw = CCV(ccv, t_maxseg) * 2;
> +#else
> + /* XXX: This is missing a lot of stuff that used to be in 7. */
> +#ifdef INET6
> + else if ((isipv6 ? in6_localaddr(&CCV(ccv, t_inpcb->in6p_faddr)) :
> + in_localaddr(CCV(ccv, t_inpcb->inp_faddr
> +#else
> + else if (in_localaddr(CCV(ccv, t_inpcb->inp_faddr)))
> +#endif
> + rw = V_ss_fltsz_local * CCV(ccv, t_maxseg);
> + else
> + rw = V_ss_fltsz * CCV(ccv, t_maxseg);
> +#endif
>  
>   CCV(ccv, snd_cwnd) = min(rw, CCV(ccv, snd_cwnd));
>  }
> 
> (But using the #else clause instead of the current #if 1 code).  Was this 
> change in 8 intentional?

It was. Unlike connection initialisation which still honours ss_fltsz in
cc_conn_init(), restarting an idle connection based on ss_fltsz seemed
particularly dubious and as such was omitted from the refactored code.

The ultimate goal was to remove the ss_fltsz hack completely and
implement a smarter mechanism, but that hasn't quite happened yet. The
removal of ss_fltsz from 10.x without providing a replacement mechanism
is not ideal and should probably be addressed.

I'm guessing you're not using rfc3390 because you want to override the
initial window based on specific local knowledge of the path between
sender and receiver?

Cheers,
Lawrence
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: if_vr(4) and DFE520-TX

2013-01-15 Thread Aleksandr Rybalko

On Mon, 14 Jan 2013 15:15:53 +0900
YongHyeon PYUN  wrote:

> On Sat, Jan 12, 2013 at 06:49:13PM +0400, Ruslan Makhmatkhanov wrote:
> > Ok, I got some details. It's an DFE-520TX (/C1 or rev. C1). I crafted an 
> > patch attached, but whenever kldloading the modified if_vr, I got this:
> > 
> > kernel: vr0:  port 0xd100-0xd1ff 
> > mem 0xf7c11000-0xf7c110ff irq 19 at device 0.0 on pci4
> > kernel: vr0: Quirks: 0x0
> > kernel: vr0: Revision: 0x10
> > kernel: vr0: reset never completed!
> > kernel: vr0: attaching PHYs failed
> > kernel: device_attach: vr0 attach returned 6
> > kernel: vr0:  port 0xd000-0xd0ff 
> > mem 0xf7c1-0xf7c100ff irq 16 at device 1.0 on pci4
> > kernel: vr0: Quirks: 0x0
> > kernel: vr0: Revision: 0x10
> > kernel: vr0: reset never completed!
> > kernel: vr0: attaching PHYs failed
> > kernel: device_attach: vr0 attach returned 6
> > 
> > I also tried to apply VR_Q_NEEDALIGN quirk, but nothing is changed. Any 
> > hints?
> 
> I recall D-Link was one of notorious vendor which used to
> completely change its chip set in later revisions without notice. So
> I'm afraid the controller you have may not be a VIA manufactured
> one.
> Could you take a picture of the chip set of controller and let
> others see it? I guess it could be a RealTek 8139 or 8139C+.

Not so hard, just check board H/W revision also :)

> 
> > 
> > 
> > Ruslan Makhmatkhanov wrote on 12.01.2013 15:26:
> > >
> > >Here is also verbose boot log for what it's worth:
> > >http://pastebin.com/SnivrtFr
> > >
> > >Please keep me in cc:, I'm not subscribed. Thanks.
> > >
> > >Ruslan Makhmatkhanov wrote on 12.01.2013 11:28:
> > >>Hello,
> > >>
> > >>I bought two D-link DFE520-TX ethernet adapters that supposed to work
> > >>with if_vr(4) according to man-page. But the driver cannot attach
> > >>(tested in 9.1-R and pfSense 2.0.2/2.1 (8.1-R and 8.3-R respectively)).
> > >>
> > >>none2@pci0:4:0:0:class=0x02 card=0x11031186 chip=0x42001186
> > >>rev=0x10 hdr=0x00
> > >> vendor = 'D-Link System Inc'
> > >> class  = network
> > >> subclass   = ethernet
> > >>
> > >>Can please anybody suggest proper changes for
> > >>/sys/dev/vr/if_vrreg.h|if_vr.c (pci ids would be enought, right?) to
> > >>test if it works. Thanks in advance.
> > >
> > 
> > 
> > -- 
> > Regards,
> > Ruslan
> > 
> > Tinderboxing kills... the drives.
> 
> > diff -uN vr.orig/if_vr.c vr/if_vr.c
> > --- vr.orig/if_vr.c 2013-01-12 13:19:28.0 +0400
> > +++ vr/if_vr.c  2013-01-12 18:42:52.0 +0400
> > @@ -138,6 +138,9 @@
> > { DELTA_VENDORID, DELTA_DEVICEID_RHINE_II,
> > VR_Q_NEEDALIGN,
> > "Delta Electronics Rhine II 10/100BaseTX" },
> > +   { DLINK_VENDORID, DLINK_DEVICEID_RHINE_II,
> > +   0,
> > +"D-Link System Inc 4200 10/100BaseTX" },
> > { ADDTRON_VENDORID, ADDTRON_DEVICEID_RHINE_II,
> > VR_Q_NEEDALIGN,
> > "Addtron Technology Rhine II 10/100BaseTX" },
> > diff -uN vr.orig/if_vrreg.h vr/if_vrreg.h
> > --- vr.orig/if_vrreg.h  2013-01-12 13:19:28.0 +0400
> > +++ vr/if_vrreg.h   2013-01-12 14:29:26.0 +0400
> > @@ -557,6 +557,16 @@
> >  #define DELTA_DEVICEID_RHINE_II0x1320
> >  
> >  /*
> > + * D-Link System Inc device ID.
> > + */
> > +#define DLINK_VENDORID  0x1186
> > +
> > +/*
> > + * D-Link System Inc device IDs.
> > + */
> > +#define DLINK_DEVICEID_RHINE_II 0x4200
> > +
> > +/*
> >   * Addtron vendor ID.
> >   */
> >  #define ADDTRON_VENDORID   0x4033
> ___
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


-- 
Aleksandr Rybalko 
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: [PATCH] Don't imply TCP and UDP socket options are bitmasks

2013-01-15 Thread Lawrence Stewart

On 01/15/13 07:50, John Baldwin wrote:
> The constants used for TCP and UDP socket options (TCP_NODELAY, etc.) are 
> currently defined as hex values that are individual bits.  However, socket 
> options are never masked together, they are used as a simple enumeration of 
> discrete values.  Using a bitmask forces us to run out of bits and makes it 
> harder for vendors to try to use a high range of values for local custom 
> options (hoping that they never conflict with a new option value added in 
> stock FreeBSD).

Yup. Should we be explicitly #defining the boundary between "bits
reserved for FreeBSD" and "bits for private vendor use"?

> The socket options in  do use bitmasks for the low bits because 
> they map directly to bits so_options, but then they start a simple 
> enumeration 
> at 0x1000.  TCP and UDP socket options do not directly map to bits in a flags 
> field in the PCB (e.g. TF_NODELAY != TCP_NODELAY).  I would like to change 
> the 
> representation of the constants to be decimal instead of hex and encourage 
> new 
> options to fill in the gaps between the existing values.  This would preserve 
> the existing ABI but keep things more sane in the future (I believe).  The 
> diff is this:
> 
> Index: netinet/tcp.h
> ===
> --- netinet/tcp.h (revision 245225)
> +++ netinet/tcp.h (working copy)
> @@ -151,18 +151,18 @@
>  /*
>   * User-settable options (used with setsockopt).
>   */
> -#define  TCP_NODELAY 0x01/* don't delay send to coalesce packets 
> */
> +#define  TCP_NODELAY 1   /* don't delay send to coalesce packets 
> */
>  #if __BSD_VISIBLE
> -#define  TCP_MAXSEG  0x02/* set maximum segment size */
> -#define TCP_NOPUSH   0x04/* don't push last block of write */
> -#define TCP_NOOPT0x08/* don't use TCP options */
> -#define TCP_MD5SIG   0x10/* use MD5 digests (RFC2385) */
> -#define  TCP_INFO0x20/* retrieve tcp_info structure */
> -#define  TCP_CONGESTION  0x40/* get/set congestion control algorithm 
> */
> -#define  TCP_KEEPINIT0x80/* N, time to establish connection */
> -#define  TCP_KEEPIDLE0x100   /* L,N,X start keeplives after this 
> period */
> -#define  TCP_KEEPINTVL   0x200   /* L,N interval between keepalives */
> -#define  TCP_KEEPCNT 0x400   /* L,N number of keepalives before 
> close */
> +#define  TCP_MAXSEG  2   /* set maximum segment size */
> +#define TCP_NOPUSH   4   /* don't push last block of write */
> +#define TCP_NOOPT8   /* don't use TCP options */
> +#define TCP_MD5SIG   16  /* use MD5 digests (RFC2385) */
> +#define  TCP_INFO32  /* retrieve tcp_info structure */
> +#define  TCP_CONGESTION  64  /* get/set congestion control algorithm 
> */
> +#define  TCP_KEEPINIT128 /* N, time to establish connection */
> +#define  TCP_KEEPIDLE256 /* L,N,X start keeplives after this 
> period */
> +#define  TCP_KEEPINTVL   512 /* L,N interval between keepalives */
> +#define  TCP_KEEPCNT 1024/* L,N number of keepalives before 
> close */
>  
>  #define  TCP_CA_NAME_MAX 16  /* max congestion control name length */
>  
> Index: netinet/udp.h
> ===
> --- netinet/udp.h (revision 245225)
> +++ netinet/udp.h (working copy)
> @@ -48,7 +48,7 @@
>  /* 
>   * User-settable options (used with setsockopt).
>   */
> -#define  UDP_ENCAP   0x01
> +#define  UDP_ENCAP   1
>  
>  
>  /*
> 

Thumbs up from me, modulo a potential tweak to define the boundary for
FreeBSD/private use.

Cheers,
Lawrence
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: [SOLVED] if_vr(4) and DFE520-TX [working with patched if_rl]

2013-01-15 Thread Ruslan Makhmatkhanov


YongHyeon PYUN wrote on 15.01.2013 10:51:


Hmm, I don't get it.
Diff inlined again.

Index: sys/pci/if_rlreg.h
===
--- sys/pci/if_rlreg.h  (revision 245199)
+++ sys/pci/if_rlreg.h  (working copy)
@@ -1048,6 +1048,11 @@ struct rl_softc {
  #define   DLINK_DEVICEID_530TXPLUS0x1300

  /*
+ * D-Link DFE-520TX rev. C1 device ID
+ */
+#defineDLINK_DEVICEID_520TX_REVC1  0x4200
+
+/*
   * D-Link DFE-5280T device ID
   */
  #define   DLINK_DEVICEID_528T 0x4300
Index: sys/pci/if_rl.c
===
--- sys/pci/if_rl.c (revision 245199)
+++ sys/pci/if_rl.c (working copy)
@@ -148,6 +148,8 @@ static const struct rl_type rl_devs[] = {
"Delta Electronics 8139 10/100BaseTX" },
{ ADDTRON_VENDORID, ADDTRON_DEVICEID_8139, RL_8139,
"Addtron Technology 8139 10/100BaseTX" },
+   { DLINK_VENDORID, DLINK_DEVICEID_520TX_REVC1, RL_8139,
+   "D-Link DFE-520TX (rev. C1) 10/100BaseTX" },
{ DLINK_VENDORID, DLINK_DEVICEID_530TXPLUS, RL_8139,
"D-Link DFE-530TX+ 10/100BaseTX" },
{ DLINK_VENDORID, DLINK_DEVICEID_690TXD, RL_8139,




Hooray! It is working with if_rl with your patch (loader tunable isn't 
used). Thanks a lot for this!


Can this be committed and merged to 8/9?

rl1@pci0:4:1:0:	class=0x02 card=0x11031186 chip=0x42001186 rev=0x10 
hdr=0x00

vendor = 'D-Link System Inc'
class  = network
subclass   = ethernet


rl1: flags=8843 metric 0 mtu 1500
options=2008
ether 90:94:e4:82:d5:e6
inet 192.168.0.208 netmask 0xff00 broadcast 192.168.0.255
nd6 options=29
media: Ethernet autoselect (100baseTX )
status: active

Ping and all other working fine.

--
Regards,
Ruslan

Tinderboxing kills... the drives.
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: To SMP or not to SMP

2013-01-15 Thread Konstantin Belousov

On Mon, Jan 14, 2013 at 04:12:09PM -0600, Bryan Venteicher wrote:
> 
> 
> - Original Message -
> > From: "John Baldwin" 
> > To: freebsd-net@freebsd.org
> > Cc: "Konstantin Belousov" , "Bryan Venteicher" 
> > , "Peter Jeremy"
> > 
> > Sent: Monday, January 14, 2013 3:57:58 PM
> > Subject: Re: To SMP or not to SMP
> > 
> > On Monday, January 14, 2013 4:07:56 pm Konstantin Belousov wrote:
> > > On Mon, Jan 14, 2013 at 03:07:50PM -0500, John Baldwin wrote:
> > > > On Sunday, January 13, 2013 1:15:13 am Bryan Venteicher wrote:
> > > > > 
> > > > > - Original Message -
> > > > > > From: "John Baldwin" 
> > > > > > To: freebsd-net@freebsd.org
> > > > > > Cc: "Barney Cordoba" , "Peter
> > > > > > Jeremy"
> > 
> > > > > > Sent: Friday, January 11, 2013 9:39:17 AM
> > > > > > Subject: Re: To SMP or not to SMP
> > > > > > 
> > > > > > On Thursday, January 10, 2013 02:36:59 PM Peter Jeremy wrote:
> > > > > > > On 2013-Jan-07 18:25:58 -0800, Barney Cordoba
> > > > > > > 
> > > > > > wrote:
> > > > > > > >I have a situation where I have to run 9.1 on an old
> > > > > > > >single core
> > > > > > > >box. Does anyone have a handle on whether it's better to
> > > > > > > >build a
> > > > > > > >non
> > > > > > > >SMP kernel or to just use a standard SMP build with just
> > > > > > > >the one
> > > > > > > >core?
> > > > > > > 
> > > > > > > Another input for this decision is kern/173322.  Currently
> > > > > > > on x86,
> > > > > > > atomic operations within kernel modules are implemented
> > > > > > > using calls
> > > > > > > to code in the kernel, which do or don't use lock prefixes
> > > > > > > depending
> > > > > > > on whethur the kernel was built as SMP.  My proposed change
> > > > > > > changes
> > > > > > > kernel modules to inline atomic operations but always
> > > > > > > include lock
> > > > > > > prefixes (effectively reverting r4).  I'm appreciate
> > > > > > > anyone who
> > > > > > > feels like testing the impact of this change.
> > > > > > 
> > > > > > Presumably a locked atomic op is cheaper than a function call
> > > > > > then?
> > > > > >  The
> > > > > > current setup assumes the opposite.
> > > > > > 
> > > > > > I think we should actually do this for atomics in modules on
> > > > > > x86:
> > > > > > 
> > > > > > 1) If a module is built standalone, it should do whichever is
> > > > > > cheaper:
> > > > > >a function call or always use "LOCK".
> > > > > > 
> > > > > > 2) If a module is built as part of the kernel build, it
> > > > > > should use
> > inlined
> > > > > >atomics that match what the kernel does.  Thus, modules
> > > > > >built with
> > a
> > > > > >non-SMP kernel would use inlined atomic ops that do not
> > > > > >use LOCK.
> > We
> > > > > >have a way to detect this now (some HAVE_FOO #define added
> > > > > >in the
> > past
> > > > > >few years) that we didn't back when this bit of atomic.h
> > > > > >was
> > > > > >written.
> > > > > >
> > > > > 
> > > > > It would be nice to have the LOCK variants available even on UP
> > > > > kernels in non-hackish way. For VirtIO, we need to handle an
> > > > > guest
> > > > > UP kernel running on an SMP host. Whether this is an #define
> > > > > that
> > > > > forces the SMP atomics to be inlined, or if they're exposed
> > > > > with
> > > > > an _smp suffix.
> > > Could you please, clarify why does UP kernel needs it ?
> > > Shouldn't the hypervisor context switching provide neccessary
> > > serialization
> > > anyway ?
> > 
> > I thought this, too, but in the case of virtio you are presumably
> > sychronizing with other threads in the hypervisor itself which might
> > be running concurrently on another physical CPU.
> > 
> 
> Yes, that is the case to be concerned about. Although, thinking
> about this a bit more, in VirtIO (at least the current spec), all
> the shared fields are updated by either the host or guest, not
> both, so a UP kernel can get by without the LOCK, correct?
> 
I did not see the spec, so I cannot argue. On the other hands, the
barriers have nothing to do with shared access to the same memory
location. Barriers prevent seeing paradoxical results from memory
accesses, in particular, they ensure that compiler and hardware do
not reorder the memory access sequences in the unwanted way.

That said, I think that a model where some self-contained blob is
only writen by one agent, and only read by another, does not require
any barriers on x86 at all. The architecture specifies that the only
reordering the hardware is allowed to not hide are reads which could
pass writes.


pgpwn9yo8SvPq.pgp
Description: PGP signature

static kernel with mod_cc?

2013-01-15 Thread Eggert, Lars

Hi,

mod_cc(4) says:

 Algorithm modules can be compiled into the kernel or loaded as
 kernel modules using the kld(4) facility.

Maybe I'm dense, but I can't figure out how to statically compile mod_cc 
modules into the kernel? (I'm using a PAE kernel w/o modules.)

Hints appreciated.

Thanks,
Lars
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: static kernel with mod_cc?

2013-01-15 Thread Lawrence Stewart

Hi Lars,

On 01/15/13 23:47, Eggert, Lars wrote:
> Hi,
> 
> mod_cc(4) says:
> 
> Algorithm modules can be compiled into the kernel or loaded as kernel
> modules using the kld(4) facility.
> 
> Maybe I'm dense, but I can't figure out how to statically compile
> mod_cc modules into the kernel? (I'm using a PAE kernel w/o
> modules.)
> 
> Hints appreciated.

You're not dense - the build glue to allow an algorithm to be specified
in a kernel config file doesn't exist. It probably should.

The hacky way to achieve what you want would be to edit
/sys/conf/files and manually add a line like so below the
cc_newreno.c line:

netinet/cc/cc_.c  optional inet | inet6

That will compile the module into the kernel, assuming "options INET" or
"options INET6" is in your kernel config file.

Cheers,
Lawrence
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: static kernel with mod_cc?

2013-01-15 Thread Eggert, Lars

Hi,

On Jan 15, 2013, at 14:09, Lawrence Stewart  wrote:
> You're not dense - the build glue to allow an algorithm to be specified
> in a kernel config file doesn't exist.

ah, that explains it. I guess it doesn't exist for siftr either?

> The hacky way to achieve what you want would be to edit
> /sys/conf/files and manually add a line like so below the
> cc_newreno.c line:
> 
> netinet/cc/cc_.c  optional inet | inet6
> 
> That will compile the module into the kernel, assuming "options INET" or
> "options INET6" is in your kernel config file.

Thanks, will try!

Lars
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: To SMP or not to SMP

2013-01-15 Thread Bryan Venteicher



- Original Message -
> From: "Konstantin Belousov" 
> To: "Bryan Venteicher" 
> Cc: "John Baldwin" , "Peter Jeremy" , 
> freebsd-net@freebsd.org
> Sent: Tuesday, January 15, 2013 4:42:16 AM
> Subject: Re: To SMP or not to SMP
> 
> On Mon, Jan 14, 2013 at 04:12:09PM -0600, Bryan Venteicher wrote:
> > 
> > 
> > - Original Message -
> > > From: "John Baldwin" 
> > > To: freebsd-net@freebsd.org
> > > Cc: "Konstantin Belousov" , "Bryan
> > > Venteicher" , "Peter Jeremy"
> > > 
> > > Sent: Monday, January 14, 2013 3:57:58 PM
> > > Subject: Re: To SMP or not to SMP
> > > 
> > > On Monday, January 14, 2013 4:07:56 pm Konstantin Belousov wrote:
> > > > On Mon, Jan 14, 2013 at 03:07:50PM -0500, John Baldwin wrote:
> > > > > On Sunday, January 13, 2013 1:15:13 am Bryan Venteicher
> > > > > wrote:
> > > > > > 
> > > > > > - Original Message -
> > > > > > > From: "John Baldwin" 
> > > > > > > To: freebsd-net@freebsd.org
> > > > > > > Cc: "Barney Cordoba" , "Peter
> > > > > > > Jeremy"
> > > 
> > > > > > > Sent: Friday, January 11, 2013 9:39:17 AM
> > > > > > > Subject: Re: To SMP or not to SMP
> > > > > > > 
> > > > > > > On Thursday, January 10, 2013 02:36:59 PM Peter Jeremy
> > > > > > > wrote:
> > > > > > > > On 2013-Jan-07 18:25:58 -0800, Barney Cordoba
> > > > > > > > 
> > > > > > > wrote:
> > > > > > > > >I have a situation where I have to run 9.1 on an old
> > > > > > > > >single core
> > > > > > > > >box. Does anyone have a handle on whether it's better
> > > > > > > > >to
> > > > > > > > >build a
> > > > > > > > >non
> > > > > > > > >SMP kernel or to just use a standard SMP build with
> > > > > > > > >just
> > > > > > > > >the one
> > > > > > > > >core?
> > > > > > > > 
> > > > > > > > Another input for this decision is kern/173322.
> > > > > > > >  Currently
> > > > > > > > on x86,
> > > > > > > > atomic operations within kernel modules are implemented
> > > > > > > > using calls
> > > > > > > > to code in the kernel, which do or don't use lock
> > > > > > > > prefixes
> > > > > > > > depending
> > > > > > > > on whethur the kernel was built as SMP.  My proposed
> > > > > > > > change
> > > > > > > > changes
> > > > > > > > kernel modules to inline atomic operations but always
> > > > > > > > include lock
> > > > > > > > prefixes (effectively reverting r4).  I'm
> > > > > > > > appreciate
> > > > > > > > anyone who
> > > > > > > > feels like testing the impact of this change.
> > > > > > > 
> > > > > > > Presumably a locked atomic op is cheaper than a function
> > > > > > > call
> > > > > > > then?
> > > > > > >  The
> > > > > > > current setup assumes the opposite.
> > > > > > > 
> > > > > > > I think we should actually do this for atomics in modules
> > > > > > > on
> > > > > > > x86:
> > > > > > > 
> > > > > > > 1) If a module is built standalone, it should do
> > > > > > > whichever is
> > > > > > > cheaper:
> > > > > > >a function call or always use "LOCK".
> > > > > > > 
> > > > > > > 2) If a module is built as part of the kernel build, it
> > > > > > > should use
> > > inlined
> > > > > > >atomics that match what the kernel does.  Thus,
> > > > > > >modules
> > > > > > >built with
> > > a
> > > > > > >non-SMP kernel would use inlined atomic ops that do
> > > > > > >not
> > > > > > >use LOCK.
> > > We
> > > > > > >have a way to detect this now (some HAVE_FOO #define
> > > > > > >added
> > > > > > >in the
> > > past
> > > > > > >few years) that we didn't back when this bit of
> > > > > > >atomic.h
> > > > > > >was
> > > > > > >written.
> > > > > > >
> > > > > > 
> > > > > > It would be nice to have the LOCK variants available even
> > > > > > on UP
> > > > > > kernels in non-hackish way. For VirtIO, we need to handle
> > > > > > an
> > > > > > guest
> > > > > > UP kernel running on an SMP host. Whether this is an
> > > > > > #define
> > > > > > that
> > > > > > forces the SMP atomics to be inlined, or if they're exposed
> > > > > > with
> > > > > > an _smp suffix.
> > > > Could you please, clarify why does UP kernel needs it ?
> > > > Shouldn't the hypervisor context switching provide neccessary
> > > > serialization
> > > > anyway ?
> > > 
> > > I thought this, too, but in the case of virtio you are presumably
> > > sychronizing with other threads in the hypervisor itself which
> > > might
> > > be running concurrently on another physical CPU.
> > > 
> > 
> > Yes, that is the case to be concerned about. Although, thinking
> > about this a bit more, in VirtIO (at least the current spec), all
> > the shared fields are updated by either the host or guest, not
> > both, so a UP kernel can get by without the LOCK, correct?
> > 
> I did not see the spec, so I cannot argue. On the other hands, the
> barriers have nothing to do with shared access to the same memory
> location. Barriers prevent seeing paradoxical results from memory
> accesses, in particular, they ensure that compiler and hardware do
> not reorder the memory ac

Re: [PATCH] Don't imply TCP and UDP socket options are bitmasks

2013-01-15 Thread John Baldwin

On Tuesday, January 15, 2013 3:49:33 am Lawrence Stewart wrote:
> On 01/15/13 07:50, John Baldwin wrote:
> > The constants used for TCP and UDP socket options (TCP_NODELAY, etc.) are 
> > currently defined as hex values that are individual bits.  However, socket 
> > options are never masked together, they are used as a simple enumeration of 
> > discrete values.  Using a bitmask forces us to run out of bits and makes it 
> > harder for vendors to try to use a high range of values for local custom 
> > options (hoping that they never conflict with a new option value added in 
> > stock FreeBSD).
> 
> Yup. Should we be explicitly #defining the boundary between "bits
> reserved for FreeBSD" and "bits for private vendor use"?

Oh, we could if you wanted.  I'm using 0x1000 locally for both TCP and UDP,
but those are completely arbitrary values.  Saner ones might be 0x800 if
we want to do that explicitly.  We could perhaps just say that is true for all
socket option levels (that is, just define one SO_VENDOR constant or some such
but say it applies to all levels)?

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: Some questions about the new TCP congestion control code

2013-01-15 Thread John Baldwin

On Tuesday, January 15, 2013 3:29:51 am Lawrence Stewart wrote:
> Hi John,
> 
> On 01/15/13 08:04, John Baldwin wrote:
> > I was looking at TCP congestion control at work recently and noticed a few 
> 
> Poor you ;)
> 
> > "odd" things in the current code.  First, there is this chunk of code in 
> > cc_ack_received() in tcp_input.c:
> > 
> > static void inline
> > cc_ack_received(struct tcpcb *tp, struct tcphdr *th, uint16_t type)
> > {
> > INP_WLOCK_ASSERT(tp->t_inpcb);
> > 
> > tp->ccv->bytes_this_ack = BYTES_THIS_ACK(tp, th);
> > if (tp->snd_cwnd == min(tp->snd_cwnd, tp->snd_wnd))
> > tp->ccv->flags |= CCF_CWND_LIMITED;
> > else
> > tp->ccv->flags &= ~CCF_CWND_LIMITED;
> > 
> > 
> > Due to hysterical raisins, snd_cwnd and snd_wnd are u_long values, not 
> > integers, so the call to min() results in truncation on 64-bit hosts.
> 
> Good catch, but I don't think it matters in practice as neither snd_cwnd
> or snd_wnd will grow past the 32-bit boundary.

I have a psyhcotic case using cc_cubic where it seems to grow without bound,
though that is a bug in and of itself (and this change did not fix that
issue).  I ended up not using cc_cubic (more below) and haven't been able
to track down the root cause of the delay.  I can probably provide a test case
to reproduce this if you are interested.

> > It should probably be ulmin() instead.  However, this line seems to be a 
> > really 
> > obfuscated way to just write:
> > 
> > if (tp->snd_cwnd <= tp->snd_wnd)
> 
> You are correct, though I'd argue the meaning of the existing code as
> written is clearer compared to your suggested change.
> 
> > If that is correct, I would vote for changing this to use the much simpler 
> > logic.
> 
> Agreed. While I find the existing code slightly clearer in meaning, it's
> not significant enough to warrant keeping it as is when your suggested
> change is simpler, fixes a bug and achieves the same thing. Happy for
> you to change it or I can do it if you prefer.

I'll leave that to you, thanks.

> > Secondly, in the particular case I was investigating at work (restart of an 
> > idle connnection), the newreno congestion control code in 8.x and later 
> > uses a 
> > different algorithm than in 7.  Specifically, in 7 TCP would reuse the same 
> > logic used for an initial cwnd (honoring ss_fltsz).  In 8 this no longer 
> > happens (instead, 2 is hardcoded).  A guess at a possible fix might look 
> > something like this:
> > 
> > Index: cc_newreno.c
> > ===
> > --- cc_newreno.c(revision 243660)
> > +++ cc_newreno.c(working copy)
> > @@ -169,8 +169,21 @@ newreno_after_idle(struct cc_var *ccv)
> > if (V_tcp_do_rfc3390)
> > rw = min(4 * CCV(ccv, t_maxseg),
> > max(2 * CCV(ccv, t_maxseg), 4380));
> > +#if 1
> > else
> > rw = CCV(ccv, t_maxseg) * 2;
> > +#else
> > +   /* XXX: This is missing a lot of stuff that used to be in 7. */
> > +#ifdef INET6
> > +   else if ((isipv6 ? in6_localaddr(&CCV(ccv, t_inpcb->in6p_faddr)) :
> > +   in_localaddr(CCV(ccv, t_inpcb->inp_faddr
> > +#else
> > +   else if (in_localaddr(CCV(ccv, t_inpcb->inp_faddr)))
> > +#endif
> > +   rw = V_ss_fltsz_local * CCV(ccv, t_maxseg);
> > +   else
> > +   rw = V_ss_fltsz * CCV(ccv, t_maxseg);
> > +#endif
> >  
> > CCV(ccv, snd_cwnd) = min(rw, CCV(ccv, snd_cwnd));
> >  }
> > 
> > (But using the #else clause instead of the current #if 1 code).  Was this 
> > change in 8 intentional?
> 
> It was. Unlike connection initialisation which still honours ss_fltsz in
> cc_conn_init(), restarting an idle connection based on ss_fltsz seemed
> particularly dubious and as such was omitted from the refactored code.
> 
> The ultimate goal was to remove the ss_fltsz hack completely and
> implement a smarter mechanism, but that hasn't quite happened yet. The
> removal of ss_fltsz from 10.x without providing a replacement mechanism
> is not ideal and should probably be addressed.
> 
> I'm guessing you're not using rfc3390 because you want to override the
> initial window based on specific local knowledge of the path between
> sender and receiver?

Correct, in 7.x we had cranked ss_fltsz up to a really high number to prevent
the congestion window from collapsing when the connection was idle.  We have
a bit of a unique workload in that we are using TCP to reliably forward a
latency-sensitive datagram stream across a WAN connection with high bandwidth
and high RTT.  Most of congestion control seems tuned to bulk transfers rather
than this sort of use case.  The solution we have settled on here is to add a
new TCP socket option to disable idle handling so that when an idle connection
restarts it keeps its prior congestion window.

One other thing I noticed which is may or may not be odd during this, is that
if you have a connection with TCP_NODELAY enabled and you fill your cwnd and
then you get

Re: [PATCH] Don't imply TCP and UDP socket options are bitmasks

2013-01-15 Thread John Baldwin

On Monday, January 14, 2013 5:17:12 pm Alfred Perlstein wrote:
> On 1/14/13 4:56 PM, John Baldwin wrote:
> > On Monday, January 14, 2013 4:42:16 pm Alfred Perlstein wrote:
> >> Wouldn't a comment over the code suffice?
> >>
> >> Something like your email as a header would actually work very nicely!
> >>
> >> I think just using decimal would be more confusing than explicitly
> >> calling it out like:
> >>
> >> /* begin enumerated (not bitmask) socket option specifiers */
> >> #defineTCP_MAXSEG  0x02/* set maximum segment size */
> >> #define TCP_NOPUSH 0x04/* don't push last block of write */
> >> #define TCP_NOOPT  0x08/* don't use TCP options */
> >> #define TCP_MD5SIG 0x10/* use MD5 digests (RFC2385) */
> >> /* end enumerated socket option specifiers */
> > I have a patch I'll post next which will add a new option as '3'.  I think 
> > that
> > will make it more obvious and avoid having new options follow the old 
> > pattern.
> >
> Any objection to adding the contents of that email as a comment 
> section?  It really would help.

We don't generally do this for other enumerations like ioctl values as it
is generally obvious to the reader.  I think for UDP having one constant
called '1' should be obvious enough.  TCP might indeed warrany a comment
since it has several existing values that are powers of 2.  How about
this:

/*
 * User-settable options (used with setsockopt).  These are discrete
 * values and should not be masked together.  Many values appear to be
 * bitmasks for legacy reasons.
 */


-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: [PATCH] Don't imply TCP and UDP socket options are bitmasks

2013-01-15 Thread Eitan Adler

On 14 January 2013 15:50, John Baldwin  wrote:
> Using a bitmask forces us to run out of bits and makes it
> harder for vendors to try to use a high range of values for local custom
> options (hoping that they never conflict with a new option value added in
> stock FreeBSD).

We should explicitly decide and #define the boundary value for custom options.


-- 
Eitan Adler
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

two problems in dev/e1000/if_lem.c::lem_handle_rxtx()

2013-01-15 Thread Luigi Rizzo

Hi,
i found a couple of problems in
dev/e1000/if_lem.c::lem_handle_rxtx() ,
(compare with dev/e1000/if_em.c::em_handle_que() for better understanding):

1. in if_em.c::em_handle_que(), when em_rxeof() exceeds the
  rx_process_limit, the task is rescheduled so it can complete the work.
  Conversely, in if_lem.c::lem_handle_rxtx() the lem_rxeof() is
  only run once, and if there are more pending packets the only
  chance to drain them is to receive (many) more interrupts.

  This is a relatively serious problem, because the receiver has
  a hard time recovering.

  I'd like to commit a fix to this same as it is done in e1000.

2. in if_em.c::em_handle_que(), interrupts are reenabled unconditionally,
   whereas lem_handle_rxtx() only enables them if IFF_DRV_RUNNING is set.

   I cannot really tell what is the correct way here, so I'd like
   to put a comment there unless there is a good suggestion on
   what to do.

   Accesses to the intr register are race-prone anyways
   (disabled in fastintr, enabled in the rxtx task without
   holding any lock, and generally accessed under EM_CORE_LOCK
   in other places), and presumably enabling/disabling the
   interrupts around activations of the taks is just an
   optimization (and on a VM, it is actually a pessimization
   due to the huge cost of VM exits).

cheers
luigi
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: [SOLVED] if_vr(4) and DFE520-TX [working with patched if_rl]

2013-01-15 Thread YongHyeon PYUN

On Tue, Jan 15, 2013 at 01:04:49PM +0400, Ruslan Makhmatkhanov wrote:
> YongHyeon PYUN wrote on 15.01.2013 10:51:
> >
> >Hmm, I don't get it.
> >Diff inlined again.
> >
> >Index: sys/pci/if_rlreg.h
> >===
> >--- sys/pci/if_rlreg.h   (revision 245199)
> >+++ sys/pci/if_rlreg.h   (working copy)
> >@@ -1048,6 +1048,11 @@ struct rl_softc {
> >  #defineDLINK_DEVICEID_530TXPLUS0x1300
> >
> >  /*
> >+ * D-Link DFE-520TX rev. C1 device ID
> >+ */
> >+#define DLINK_DEVICEID_520TX_REVC1  0x4200
> >+
> >+/*
> >   * D-Link DFE-5280T device ID
> >   */
> >  #defineDLINK_DEVICEID_528T 0x4300
> >Index: sys/pci/if_rl.c
> >===
> >--- sys/pci/if_rl.c  (revision 245199)
> >+++ sys/pci/if_rl.c  (working copy)
> >@@ -148,6 +148,8 @@ static const struct rl_type rl_devs[] = {
> > "Delta Electronics 8139 10/100BaseTX" },
> > { ADDTRON_VENDORID, ADDTRON_DEVICEID_8139, RL_8139,
> > "Addtron Technology 8139 10/100BaseTX" },
> >+{ DLINK_VENDORID, DLINK_DEVICEID_520TX_REVC1, RL_8139,
> >+"D-Link DFE-520TX (rev. C1) 10/100BaseTX" },
> > { DLINK_VENDORID, DLINK_DEVICEID_530TXPLUS, RL_8139,
> > "D-Link DFE-530TX+ 10/100BaseTX" },
> > { DLINK_VENDORID, DLINK_DEVICEID_690TXD, RL_8139,
> >>
> 
> Hooray! It is working with if_rl with your patch (loader tunable isn't 
> used). Thanks a lot for this!
> 
> Can this be committed and merged to 8/9?
> 

Yes, committed in r245485.
I will MFC to stable 9/8 after a week.

> rl1@pci0:4:1:0:   class=0x02 card=0x11031186 chip=0x42001186 
> rev=0x10 hdr=0x00
> vendor = 'D-Link System Inc'
> class  = network
> subclass   = ethernet
> 
> 
> rl1: flags=8843 metric 0 mtu 1500
>   options=2008
>   ether 90:94:e4:82:d5:e6
>   inet 192.168.0.208 netmask 0xff00 broadcast 192.168.0.255
>   nd6 options=29
>   media: Ethernet autoselect (100baseTX )
>   status: active
> 
> Ping and all other working fine.
> 

Thanks for testing!

> -- 
> Regards,
> Ruslan
> 
> Tinderboxing kills... the drives.
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: two problems in dev/e1000/if_lem.c::lem_handle_rxtx()

2013-01-15 Thread Jack Vogel

OK, will look at this as soon as I can.

Jack


On Tue, Jan 15, 2013 at 5:23 PM, Luigi Rizzo  wrote:

> Hi,
> i found a couple of problems in
> dev/e1000/if_lem.c::lem_handle_rxtx() ,
> (compare with dev/e1000/if_em.c::em_handle_que() for better understanding):
>
> 1. in if_em.c::em_handle_que(), when em_rxeof() exceeds the
>   rx_process_limit, the task is rescheduled so it can complete the work.
>   Conversely, in if_lem.c::lem_handle_rxtx() the lem_rxeof() is
>   only run once, and if there are more pending packets the only
>   chance to drain them is to receive (many) more interrupts.
>
>   This is a relatively serious problem, because the receiver has
>   a hard time recovering.
>
>   I'd like to commit a fix to this same as it is done in e1000.
>
> 2. in if_em.c::em_handle_que(), interrupts are reenabled unconditionally,
>whereas lem_handle_rxtx() only enables them if IFF_DRV_RUNNING is set.
>
>I cannot really tell what is the correct way here, so I'd like
>to put a comment there unless there is a good suggestion on
>what to do.
>
>Accesses to the intr register are race-prone anyways
>(disabled in fastintr, enabled in the rxtx task without
>holding any lock, and generally accessed under EM_CORE_LOCK
>in other places), and presumably enabling/disabling the
>interrupts around activations of the taks is just an
>optimization (and on a VM, it is actually a pessimization
>due to the huge cost of VM exits).
>
> cheers
> luigi
>
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: static kernel with mod_cc?

2013-01-15 Thread Lawrence Stewart

On 01/16/13 02:12, Eggert, Lars wrote:
> Hi,
> 
> On Jan 15, 2013, at 14:09, Lawrence Stewart  wrote:
>> You're not dense - the build glue to allow an algorithm to be specified
>> in a kernel config file doesn't exist.
> 
> ah, that explains it. I guess it doesn't exist for siftr either?

Correct, though I believe it should be trivial to fix for both siftr and
the mod_cc modules. I might have a go and post a patch for testing.

Cheers,
Lawrence
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: Some questions about the new TCP congestion control code

Re: if_vr(4) and DFE520-TX

Re: [PATCH] Don't imply TCP and UDP socket options are bitmasks

Re: [SOLVED] if_vr(4) and DFE520-TX [working with patched if_rl]

Re: To SMP or not to SMP

static kernel with mod_cc?

Re: static kernel with mod_cc?

Re: static kernel with mod_cc?

Re: To SMP or not to SMP

Re: [PATCH] Don't imply TCP and UDP socket options are bitmasks

Re: Some questions about the new TCP congestion control code

Re: [PATCH] Don't imply TCP and UDP socket options are bitmasks

Re: [PATCH] Don't imply TCP and UDP socket options are bitmasks

two problems in dev/e1000/if_lem.c::lem_handle_rxtx()

Re: [SOLVED] if_vr(4) and DFE520-TX [working with patched if_rl]

Re: two problems in dev/e1000/if_lem.c::lem_handle_rxtx()

Re: static kernel with mod_cc?

17 matches

Site Navigation

Mail list logo

Footer information