Re: Some questions about the new TCP congestion control code
Hi John, On 01/15/13 08:04, John Baldwin wrote: > I was looking at TCP congestion control at work recently and noticed a few Poor you ;) > "odd" things in the current code. First, there is this chunk of code in > cc_ack_received() in tcp_input.c: > > static void inline > cc_ack_received(struct tcpcb *tp, struct tcphdr *th, uint16_t type) > { > INP_WLOCK_ASSERT(tp->t_inpcb); > > tp->ccv->bytes_this_ack = BYTES_THIS_ACK(tp, th); > if (tp->snd_cwnd == min(tp->snd_cwnd, tp->snd_wnd)) > tp->ccv->flags |= CCF_CWND_LIMITED; > else > tp->ccv->flags &= ~CCF_CWND_LIMITED; > > > Due to hysterical raisins, snd_cwnd and snd_wnd are u_long values, not > integers, so the call to min() results in truncation on 64-bit hosts. Good catch, but I don't think it matters in practice as neither snd_cwnd or snd_wnd will grow past the 32-bit boundary. > It should probably be ulmin() instead. However, this line seems to be a > really > obfuscated way to just write: > > if (tp->snd_cwnd <= tp->snd_wnd) You are correct, though I'd argue the meaning of the existing code as written is clearer compared to your suggested change. > If that is correct, I would vote for changing this to use the much simpler > logic. Agreed. While I find the existing code slightly clearer in meaning, it's not significant enough to warrant keeping it as is when your suggested change is simpler, fixes a bug and achieves the same thing. Happy for you to change it or I can do it if you prefer. > Secondly, in the particular case I was investigating at work (restart of an > idle connnection), the newreno congestion control code in 8.x and later uses > a > different algorithm than in 7. Specifically, in 7 TCP would reuse the same > logic used for an initial cwnd (honoring ss_fltsz). In 8 this no longer > happens (instead, 2 is hardcoded). A guess at a possible fix might look > something like this: > > Index: cc_newreno.c > === > --- cc_newreno.c (revision 243660) > +++ cc_newreno.c (working copy) > @@ -169,8 +169,21 @@ newreno_after_idle(struct cc_var *ccv) > if (V_tcp_do_rfc3390) > rw = min(4 * CCV(ccv, t_maxseg), > max(2 * CCV(ccv, t_maxseg), 4380)); > +#if 1 > else > rw = CCV(ccv, t_maxseg) * 2; > +#else > + /* XXX: This is missing a lot of stuff that used to be in 7. */ > +#ifdef INET6 > + else if ((isipv6 ? in6_localaddr(&CCV(ccv, t_inpcb->in6p_faddr)) : > + in_localaddr(CCV(ccv, t_inpcb->inp_faddr > +#else > + else if (in_localaddr(CCV(ccv, t_inpcb->inp_faddr))) > +#endif > + rw = V_ss_fltsz_local * CCV(ccv, t_maxseg); > + else > + rw = V_ss_fltsz * CCV(ccv, t_maxseg); > +#endif > > CCV(ccv, snd_cwnd) = min(rw, CCV(ccv, snd_cwnd)); > } > > (But using the #else clause instead of the current #if 1 code). Was this > change in 8 intentional? It was. Unlike connection initialisation which still honours ss_fltsz in cc_conn_init(), restarting an idle connection based on ss_fltsz seemed particularly dubious and as such was omitted from the refactored code. The ultimate goal was to remove the ss_fltsz hack completely and implement a smarter mechanism, but that hasn't quite happened yet. The removal of ss_fltsz from 10.x without providing a replacement mechanism is not ideal and should probably be addressed. I'm guessing you're not using rfc3390 because you want to override the initial window based on specific local knowledge of the path between sender and receiver? Cheers, Lawrence ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: if_vr(4) and DFE520-TX
On Mon, 14 Jan 2013 15:15:53 +0900 YongHyeon PYUN wrote: > On Sat, Jan 12, 2013 at 06:49:13PM +0400, Ruslan Makhmatkhanov wrote: > > Ok, I got some details. It's an DFE-520TX (/C1 or rev. C1). I crafted an > > patch attached, but whenever kldloading the modified if_vr, I got this: > > > > kernel: vr0: port 0xd100-0xd1ff > > mem 0xf7c11000-0xf7c110ff irq 19 at device 0.0 on pci4 > > kernel: vr0: Quirks: 0x0 > > kernel: vr0: Revision: 0x10 > > kernel: vr0: reset never completed! > > kernel: vr0: attaching PHYs failed > > kernel: device_attach: vr0 attach returned 6 > > kernel: vr0: port 0xd000-0xd0ff > > mem 0xf7c1-0xf7c100ff irq 16 at device 1.0 on pci4 > > kernel: vr0: Quirks: 0x0 > > kernel: vr0: Revision: 0x10 > > kernel: vr0: reset never completed! > > kernel: vr0: attaching PHYs failed > > kernel: device_attach: vr0 attach returned 6 > > > > I also tried to apply VR_Q_NEEDALIGN quirk, but nothing is changed. Any > > hints? > > I recall D-Link was one of notorious vendor which used to > completely change its chip set in later revisions without notice. So > I'm afraid the controller you have may not be a VIA manufactured > one. > Could you take a picture of the chip set of controller and let > others see it? I guess it could be a RealTek 8139 or 8139C+. Not so hard, just check board H/W revision also :) > > > > > > > Ruslan Makhmatkhanov wrote on 12.01.2013 15:26: > > > > > >Here is also verbose boot log for what it's worth: > > >http://pastebin.com/SnivrtFr > > > > > >Please keep me in cc:, I'm not subscribed. Thanks. > > > > > >Ruslan Makhmatkhanov wrote on 12.01.2013 11:28: > > >>Hello, > > >> > > >>I bought two D-link DFE520-TX ethernet adapters that supposed to work > > >>with if_vr(4) according to man-page. But the driver cannot attach > > >>(tested in 9.1-R and pfSense 2.0.2/2.1 (8.1-R and 8.3-R respectively)). > > >> > > >>none2@pci0:4:0:0:class=0x02 card=0x11031186 chip=0x42001186 > > >>rev=0x10 hdr=0x00 > > >> vendor = 'D-Link System Inc' > > >> class = network > > >> subclass = ethernet > > >> > > >>Can please anybody suggest proper changes for > > >>/sys/dev/vr/if_vrreg.h|if_vr.c (pci ids would be enought, right?) to > > >>test if it works. Thanks in advance. > > > > > > > > > -- > > Regards, > > Ruslan > > > > Tinderboxing kills... the drives. > > > diff -uN vr.orig/if_vr.c vr/if_vr.c > > --- vr.orig/if_vr.c 2013-01-12 13:19:28.0 +0400 > > +++ vr/if_vr.c 2013-01-12 18:42:52.0 +0400 > > @@ -138,6 +138,9 @@ > > { DELTA_VENDORID, DELTA_DEVICEID_RHINE_II, > > VR_Q_NEEDALIGN, > > "Delta Electronics Rhine II 10/100BaseTX" }, > > + { DLINK_VENDORID, DLINK_DEVICEID_RHINE_II, > > + 0, > > +"D-Link System Inc 4200 10/100BaseTX" }, > > { ADDTRON_VENDORID, ADDTRON_DEVICEID_RHINE_II, > > VR_Q_NEEDALIGN, > > "Addtron Technology Rhine II 10/100BaseTX" }, > > diff -uN vr.orig/if_vrreg.h vr/if_vrreg.h > > --- vr.orig/if_vrreg.h 2013-01-12 13:19:28.0 +0400 > > +++ vr/if_vrreg.h 2013-01-12 14:29:26.0 +0400 > > @@ -557,6 +557,16 @@ > > #define DELTA_DEVICEID_RHINE_II0x1320 > > > > /* > > + * D-Link System Inc device ID. > > + */ > > +#define DLINK_VENDORID 0x1186 > > + > > +/* > > + * D-Link System Inc device IDs. > > + */ > > +#define DLINK_DEVICEID_RHINE_II 0x4200 > > + > > +/* > > * Addtron vendor ID. > > */ > > #define ADDTRON_VENDORID 0x4033 > ___ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org" -- Aleksandr Rybalko ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: [PATCH] Don't imply TCP and UDP socket options are bitmasks
On 01/15/13 07:50, John Baldwin wrote: > The constants used for TCP and UDP socket options (TCP_NODELAY, etc.) are > currently defined as hex values that are individual bits. However, socket > options are never masked together, they are used as a simple enumeration of > discrete values. Using a bitmask forces us to run out of bits and makes it > harder for vendors to try to use a high range of values for local custom > options (hoping that they never conflict with a new option value added in > stock FreeBSD). Yup. Should we be explicitly #defining the boundary between "bits reserved for FreeBSD" and "bits for private vendor use"? > The socket options in do use bitmasks for the low bits because > they map directly to bits so_options, but then they start a simple > enumeration > at 0x1000. TCP and UDP socket options do not directly map to bits in a flags > field in the PCB (e.g. TF_NODELAY != TCP_NODELAY). I would like to change > the > representation of the constants to be decimal instead of hex and encourage > new > options to fill in the gaps between the existing values. This would preserve > the existing ABI but keep things more sane in the future (I believe). The > diff is this: > > Index: netinet/tcp.h > === > --- netinet/tcp.h (revision 245225) > +++ netinet/tcp.h (working copy) > @@ -151,18 +151,18 @@ > /* > * User-settable options (used with setsockopt). > */ > -#define TCP_NODELAY 0x01/* don't delay send to coalesce packets > */ > +#define TCP_NODELAY 1 /* don't delay send to coalesce packets > */ > #if __BSD_VISIBLE > -#define TCP_MAXSEG 0x02/* set maximum segment size */ > -#define TCP_NOPUSH 0x04/* don't push last block of write */ > -#define TCP_NOOPT0x08/* don't use TCP options */ > -#define TCP_MD5SIG 0x10/* use MD5 digests (RFC2385) */ > -#define TCP_INFO0x20/* retrieve tcp_info structure */ > -#define TCP_CONGESTION 0x40/* get/set congestion control algorithm > */ > -#define TCP_KEEPINIT0x80/* N, time to establish connection */ > -#define TCP_KEEPIDLE0x100 /* L,N,X start keeplives after this > period */ > -#define TCP_KEEPINTVL 0x200 /* L,N interval between keepalives */ > -#define TCP_KEEPCNT 0x400 /* L,N number of keepalives before > close */ > +#define TCP_MAXSEG 2 /* set maximum segment size */ > +#define TCP_NOPUSH 4 /* don't push last block of write */ > +#define TCP_NOOPT8 /* don't use TCP options */ > +#define TCP_MD5SIG 16 /* use MD5 digests (RFC2385) */ > +#define TCP_INFO32 /* retrieve tcp_info structure */ > +#define TCP_CONGESTION 64 /* get/set congestion control algorithm > */ > +#define TCP_KEEPINIT128 /* N, time to establish connection */ > +#define TCP_KEEPIDLE256 /* L,N,X start keeplives after this > period */ > +#define TCP_KEEPINTVL 512 /* L,N interval between keepalives */ > +#define TCP_KEEPCNT 1024/* L,N number of keepalives before > close */ > > #define TCP_CA_NAME_MAX 16 /* max congestion control name length */ > > Index: netinet/udp.h > === > --- netinet/udp.h (revision 245225) > +++ netinet/udp.h (working copy) > @@ -48,7 +48,7 @@ > /* > * User-settable options (used with setsockopt). > */ > -#define UDP_ENCAP 0x01 > +#define UDP_ENCAP 1 > > > /* > Thumbs up from me, modulo a potential tweak to define the boundary for FreeBSD/private use. Cheers, Lawrence ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: [SOLVED] if_vr(4) and DFE520-TX [working with patched if_rl]
YongHyeon PYUN wrote on 15.01.2013 10:51: Hmm, I don't get it. Diff inlined again. Index: sys/pci/if_rlreg.h === --- sys/pci/if_rlreg.h (revision 245199) +++ sys/pci/if_rlreg.h (working copy) @@ -1048,6 +1048,11 @@ struct rl_softc { #define DLINK_DEVICEID_530TXPLUS0x1300 /* + * D-Link DFE-520TX rev. C1 device ID + */ +#defineDLINK_DEVICEID_520TX_REVC1 0x4200 + +/* * D-Link DFE-5280T device ID */ #define DLINK_DEVICEID_528T 0x4300 Index: sys/pci/if_rl.c === --- sys/pci/if_rl.c (revision 245199) +++ sys/pci/if_rl.c (working copy) @@ -148,6 +148,8 @@ static const struct rl_type rl_devs[] = { "Delta Electronics 8139 10/100BaseTX" }, { ADDTRON_VENDORID, ADDTRON_DEVICEID_8139, RL_8139, "Addtron Technology 8139 10/100BaseTX" }, + { DLINK_VENDORID, DLINK_DEVICEID_520TX_REVC1, RL_8139, + "D-Link DFE-520TX (rev. C1) 10/100BaseTX" }, { DLINK_VENDORID, DLINK_DEVICEID_530TXPLUS, RL_8139, "D-Link DFE-530TX+ 10/100BaseTX" }, { DLINK_VENDORID, DLINK_DEVICEID_690TXD, RL_8139, Hooray! It is working with if_rl with your patch (loader tunable isn't used). Thanks a lot for this! Can this be committed and merged to 8/9? rl1@pci0:4:1:0: class=0x02 card=0x11031186 chip=0x42001186 rev=0x10 hdr=0x00 vendor = 'D-Link System Inc' class = network subclass = ethernet rl1: flags=8843 metric 0 mtu 1500 options=2008 ether 90:94:e4:82:d5:e6 inet 192.168.0.208 netmask 0xff00 broadcast 192.168.0.255 nd6 options=29 media: Ethernet autoselect (100baseTX ) status: active Ping and all other working fine. -- Regards, Ruslan Tinderboxing kills... the drives. ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: To SMP or not to SMP
On Mon, Jan 14, 2013 at 04:12:09PM -0600, Bryan Venteicher wrote: > > > - Original Message - > > From: "John Baldwin" > > To: freebsd-net@freebsd.org > > Cc: "Konstantin Belousov" , "Bryan Venteicher" > > , "Peter Jeremy" > > > > Sent: Monday, January 14, 2013 3:57:58 PM > > Subject: Re: To SMP or not to SMP > > > > On Monday, January 14, 2013 4:07:56 pm Konstantin Belousov wrote: > > > On Mon, Jan 14, 2013 at 03:07:50PM -0500, John Baldwin wrote: > > > > On Sunday, January 13, 2013 1:15:13 am Bryan Venteicher wrote: > > > > > > > > > > - Original Message - > > > > > > From: "John Baldwin" > > > > > > To: freebsd-net@freebsd.org > > > > > > Cc: "Barney Cordoba" , "Peter > > > > > > Jeremy" > > > > > > > > Sent: Friday, January 11, 2013 9:39:17 AM > > > > > > Subject: Re: To SMP or not to SMP > > > > > > > > > > > > On Thursday, January 10, 2013 02:36:59 PM Peter Jeremy wrote: > > > > > > > On 2013-Jan-07 18:25:58 -0800, Barney Cordoba > > > > > > > > > > > > > wrote: > > > > > > > >I have a situation where I have to run 9.1 on an old > > > > > > > >single core > > > > > > > >box. Does anyone have a handle on whether it's better to > > > > > > > >build a > > > > > > > >non > > > > > > > >SMP kernel or to just use a standard SMP build with just > > > > > > > >the one > > > > > > > >core? > > > > > > > > > > > > > > Another input for this decision is kern/173322. Currently > > > > > > > on x86, > > > > > > > atomic operations within kernel modules are implemented > > > > > > > using calls > > > > > > > to code in the kernel, which do or don't use lock prefixes > > > > > > > depending > > > > > > > on whethur the kernel was built as SMP. My proposed change > > > > > > > changes > > > > > > > kernel modules to inline atomic operations but always > > > > > > > include lock > > > > > > > prefixes (effectively reverting r4). I'm appreciate > > > > > > > anyone who > > > > > > > feels like testing the impact of this change. > > > > > > > > > > > > Presumably a locked atomic op is cheaper than a function call > > > > > > then? > > > > > > The > > > > > > current setup assumes the opposite. > > > > > > > > > > > > I think we should actually do this for atomics in modules on > > > > > > x86: > > > > > > > > > > > > 1) If a module is built standalone, it should do whichever is > > > > > > cheaper: > > > > > >a function call or always use "LOCK". > > > > > > > > > > > > 2) If a module is built as part of the kernel build, it > > > > > > should use > > inlined > > > > > >atomics that match what the kernel does. Thus, modules > > > > > >built with > > a > > > > > >non-SMP kernel would use inlined atomic ops that do not > > > > > >use LOCK. > > We > > > > > >have a way to detect this now (some HAVE_FOO #define added > > > > > >in the > > past > > > > > >few years) that we didn't back when this bit of atomic.h > > > > > >was > > > > > >written. > > > > > > > > > > > > > > > > It would be nice to have the LOCK variants available even on UP > > > > > kernels in non-hackish way. For VirtIO, we need to handle an > > > > > guest > > > > > UP kernel running on an SMP host. Whether this is an #define > > > > > that > > > > > forces the SMP atomics to be inlined, or if they're exposed > > > > > with > > > > > an _smp suffix. > > > Could you please, clarify why does UP kernel needs it ? > > > Shouldn't the hypervisor context switching provide neccessary > > > serialization > > > anyway ? > > > > I thought this, too, but in the case of virtio you are presumably > > sychronizing with other threads in the hypervisor itself which might > > be running concurrently on another physical CPU. > > > > Yes, that is the case to be concerned about. Although, thinking > about this a bit more, in VirtIO (at least the current spec), all > the shared fields are updated by either the host or guest, not > both, so a UP kernel can get by without the LOCK, correct? > I did not see the spec, so I cannot argue. On the other hands, the barriers have nothing to do with shared access to the same memory location. Barriers prevent seeing paradoxical results from memory accesses, in particular, they ensure that compiler and hardware do not reorder the memory access sequences in the unwanted way. That said, I think that a model where some self-contained blob is only writen by one agent, and only read by another, does not require any barriers on x86 at all. The architecture specifies that the only reordering the hardware is allowed to not hide are reads which could pass writes. pgpwn9yo8SvPq.pgp Description: PGP signature
static kernel with mod_cc?
Hi, mod_cc(4) says: Algorithm modules can be compiled into the kernel or loaded as kernel modules using the kld(4) facility. Maybe I'm dense, but I can't figure out how to statically compile mod_cc modules into the kernel? (I'm using a PAE kernel w/o modules.) Hints appreciated. Thanks, Lars ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: static kernel with mod_cc?
Hi Lars, On 01/15/13 23:47, Eggert, Lars wrote: > Hi, > > mod_cc(4) says: > > Algorithm modules can be compiled into the kernel or loaded as kernel > modules using the kld(4) facility. > > Maybe I'm dense, but I can't figure out how to statically compile > mod_cc modules into the kernel? (I'm using a PAE kernel w/o > modules.) > > Hints appreciated. You're not dense - the build glue to allow an algorithm to be specified in a kernel config file doesn't exist. It probably should. The hacky way to achieve what you want would be to edit /sys/conf/files and manually add a line like so below the cc_newreno.c line: netinet/cc/cc_.c optional inet | inet6 That will compile the module into the kernel, assuming "options INET" or "options INET6" is in your kernel config file. Cheers, Lawrence ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: static kernel with mod_cc?
Hi, On Jan 15, 2013, at 14:09, Lawrence Stewart wrote: > You're not dense - the build glue to allow an algorithm to be specified > in a kernel config file doesn't exist. ah, that explains it. I guess it doesn't exist for siftr either? > The hacky way to achieve what you want would be to edit > /sys/conf/files and manually add a line like so below the > cc_newreno.c line: > > netinet/cc/cc_.c optional inet | inet6 > > That will compile the module into the kernel, assuming "options INET" or > "options INET6" is in your kernel config file. Thanks, will try! Lars ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: To SMP or not to SMP
- Original Message - > From: "Konstantin Belousov" > To: "Bryan Venteicher" > Cc: "John Baldwin" , "Peter Jeremy" , > freebsd-net@freebsd.org > Sent: Tuesday, January 15, 2013 4:42:16 AM > Subject: Re: To SMP or not to SMP > > On Mon, Jan 14, 2013 at 04:12:09PM -0600, Bryan Venteicher wrote: > > > > > > - Original Message - > > > From: "John Baldwin" > > > To: freebsd-net@freebsd.org > > > Cc: "Konstantin Belousov" , "Bryan > > > Venteicher" , "Peter Jeremy" > > > > > > Sent: Monday, January 14, 2013 3:57:58 PM > > > Subject: Re: To SMP or not to SMP > > > > > > On Monday, January 14, 2013 4:07:56 pm Konstantin Belousov wrote: > > > > On Mon, Jan 14, 2013 at 03:07:50PM -0500, John Baldwin wrote: > > > > > On Sunday, January 13, 2013 1:15:13 am Bryan Venteicher > > > > > wrote: > > > > > > > > > > > > - Original Message - > > > > > > > From: "John Baldwin" > > > > > > > To: freebsd-net@freebsd.org > > > > > > > Cc: "Barney Cordoba" , "Peter > > > > > > > Jeremy" > > > > > > > > > > Sent: Friday, January 11, 2013 9:39:17 AM > > > > > > > Subject: Re: To SMP or not to SMP > > > > > > > > > > > > > > On Thursday, January 10, 2013 02:36:59 PM Peter Jeremy > > > > > > > wrote: > > > > > > > > On 2013-Jan-07 18:25:58 -0800, Barney Cordoba > > > > > > > > > > > > > > > wrote: > > > > > > > > >I have a situation where I have to run 9.1 on an old > > > > > > > > >single core > > > > > > > > >box. Does anyone have a handle on whether it's better > > > > > > > > >to > > > > > > > > >build a > > > > > > > > >non > > > > > > > > >SMP kernel or to just use a standard SMP build with > > > > > > > > >just > > > > > > > > >the one > > > > > > > > >core? > > > > > > > > > > > > > > > > Another input for this decision is kern/173322. > > > > > > > > Currently > > > > > > > > on x86, > > > > > > > > atomic operations within kernel modules are implemented > > > > > > > > using calls > > > > > > > > to code in the kernel, which do or don't use lock > > > > > > > > prefixes > > > > > > > > depending > > > > > > > > on whethur the kernel was built as SMP. My proposed > > > > > > > > change > > > > > > > > changes > > > > > > > > kernel modules to inline atomic operations but always > > > > > > > > include lock > > > > > > > > prefixes (effectively reverting r4). I'm > > > > > > > > appreciate > > > > > > > > anyone who > > > > > > > > feels like testing the impact of this change. > > > > > > > > > > > > > > Presumably a locked atomic op is cheaper than a function > > > > > > > call > > > > > > > then? > > > > > > > The > > > > > > > current setup assumes the opposite. > > > > > > > > > > > > > > I think we should actually do this for atomics in modules > > > > > > > on > > > > > > > x86: > > > > > > > > > > > > > > 1) If a module is built standalone, it should do > > > > > > > whichever is > > > > > > > cheaper: > > > > > > >a function call or always use "LOCK". > > > > > > > > > > > > > > 2) If a module is built as part of the kernel build, it > > > > > > > should use > > > inlined > > > > > > >atomics that match what the kernel does. Thus, > > > > > > >modules > > > > > > >built with > > > a > > > > > > >non-SMP kernel would use inlined atomic ops that do > > > > > > >not > > > > > > >use LOCK. > > > We > > > > > > >have a way to detect this now (some HAVE_FOO #define > > > > > > >added > > > > > > >in the > > > past > > > > > > >few years) that we didn't back when this bit of > > > > > > >atomic.h > > > > > > >was > > > > > > >written. > > > > > > > > > > > > > > > > > > > It would be nice to have the LOCK variants available even > > > > > > on UP > > > > > > kernels in non-hackish way. For VirtIO, we need to handle > > > > > > an > > > > > > guest > > > > > > UP kernel running on an SMP host. Whether this is an > > > > > > #define > > > > > > that > > > > > > forces the SMP atomics to be inlined, or if they're exposed > > > > > > with > > > > > > an _smp suffix. > > > > Could you please, clarify why does UP kernel needs it ? > > > > Shouldn't the hypervisor context switching provide neccessary > > > > serialization > > > > anyway ? > > > > > > I thought this, too, but in the case of virtio you are presumably > > > sychronizing with other threads in the hypervisor itself which > > > might > > > be running concurrently on another physical CPU. > > > > > > > Yes, that is the case to be concerned about. Although, thinking > > about this a bit more, in VirtIO (at least the current spec), all > > the shared fields are updated by either the host or guest, not > > both, so a UP kernel can get by without the LOCK, correct? > > > I did not see the spec, so I cannot argue. On the other hands, the > barriers have nothing to do with shared access to the same memory > location. Barriers prevent seeing paradoxical results from memory > accesses, in particular, they ensure that compiler and hardware do > not reorder the memory ac
Re: [PATCH] Don't imply TCP and UDP socket options are bitmasks
On Tuesday, January 15, 2013 3:49:33 am Lawrence Stewart wrote: > On 01/15/13 07:50, John Baldwin wrote: > > The constants used for TCP and UDP socket options (TCP_NODELAY, etc.) are > > currently defined as hex values that are individual bits. However, socket > > options are never masked together, they are used as a simple enumeration of > > discrete values. Using a bitmask forces us to run out of bits and makes it > > harder for vendors to try to use a high range of values for local custom > > options (hoping that they never conflict with a new option value added in > > stock FreeBSD). > > Yup. Should we be explicitly #defining the boundary between "bits > reserved for FreeBSD" and "bits for private vendor use"? Oh, we could if you wanted. I'm using 0x1000 locally for both TCP and UDP, but those are completely arbitrary values. Saner ones might be 0x800 if we want to do that explicitly. We could perhaps just say that is true for all socket option levels (that is, just define one SO_VENDOR constant or some such but say it applies to all levels)? -- John Baldwin ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: Some questions about the new TCP congestion control code
On Tuesday, January 15, 2013 3:29:51 am Lawrence Stewart wrote: > Hi John, > > On 01/15/13 08:04, John Baldwin wrote: > > I was looking at TCP congestion control at work recently and noticed a few > > Poor you ;) > > > "odd" things in the current code. First, there is this chunk of code in > > cc_ack_received() in tcp_input.c: > > > > static void inline > > cc_ack_received(struct tcpcb *tp, struct tcphdr *th, uint16_t type) > > { > > INP_WLOCK_ASSERT(tp->t_inpcb); > > > > tp->ccv->bytes_this_ack = BYTES_THIS_ACK(tp, th); > > if (tp->snd_cwnd == min(tp->snd_cwnd, tp->snd_wnd)) > > tp->ccv->flags |= CCF_CWND_LIMITED; > > else > > tp->ccv->flags &= ~CCF_CWND_LIMITED; > > > > > > Due to hysterical raisins, snd_cwnd and snd_wnd are u_long values, not > > integers, so the call to min() results in truncation on 64-bit hosts. > > Good catch, but I don't think it matters in practice as neither snd_cwnd > or snd_wnd will grow past the 32-bit boundary. I have a psyhcotic case using cc_cubic where it seems to grow without bound, though that is a bug in and of itself (and this change did not fix that issue). I ended up not using cc_cubic (more below) and haven't been able to track down the root cause of the delay. I can probably provide a test case to reproduce this if you are interested. > > It should probably be ulmin() instead. However, this line seems to be a > > really > > obfuscated way to just write: > > > > if (tp->snd_cwnd <= tp->snd_wnd) > > You are correct, though I'd argue the meaning of the existing code as > written is clearer compared to your suggested change. > > > If that is correct, I would vote for changing this to use the much simpler > > logic. > > Agreed. While I find the existing code slightly clearer in meaning, it's > not significant enough to warrant keeping it as is when your suggested > change is simpler, fixes a bug and achieves the same thing. Happy for > you to change it or I can do it if you prefer. I'll leave that to you, thanks. > > Secondly, in the particular case I was investigating at work (restart of an > > idle connnection), the newreno congestion control code in 8.x and later > > uses a > > different algorithm than in 7. Specifically, in 7 TCP would reuse the same > > logic used for an initial cwnd (honoring ss_fltsz). In 8 this no longer > > happens (instead, 2 is hardcoded). A guess at a possible fix might look > > something like this: > > > > Index: cc_newreno.c > > === > > --- cc_newreno.c(revision 243660) > > +++ cc_newreno.c(working copy) > > @@ -169,8 +169,21 @@ newreno_after_idle(struct cc_var *ccv) > > if (V_tcp_do_rfc3390) > > rw = min(4 * CCV(ccv, t_maxseg), > > max(2 * CCV(ccv, t_maxseg), 4380)); > > +#if 1 > > else > > rw = CCV(ccv, t_maxseg) * 2; > > +#else > > + /* XXX: This is missing a lot of stuff that used to be in 7. */ > > +#ifdef INET6 > > + else if ((isipv6 ? in6_localaddr(&CCV(ccv, t_inpcb->in6p_faddr)) : > > + in_localaddr(CCV(ccv, t_inpcb->inp_faddr > > +#else > > + else if (in_localaddr(CCV(ccv, t_inpcb->inp_faddr))) > > +#endif > > + rw = V_ss_fltsz_local * CCV(ccv, t_maxseg); > > + else > > + rw = V_ss_fltsz * CCV(ccv, t_maxseg); > > +#endif > > > > CCV(ccv, snd_cwnd) = min(rw, CCV(ccv, snd_cwnd)); > > } > > > > (But using the #else clause instead of the current #if 1 code). Was this > > change in 8 intentional? > > It was. Unlike connection initialisation which still honours ss_fltsz in > cc_conn_init(), restarting an idle connection based on ss_fltsz seemed > particularly dubious and as such was omitted from the refactored code. > > The ultimate goal was to remove the ss_fltsz hack completely and > implement a smarter mechanism, but that hasn't quite happened yet. The > removal of ss_fltsz from 10.x without providing a replacement mechanism > is not ideal and should probably be addressed. > > I'm guessing you're not using rfc3390 because you want to override the > initial window based on specific local knowledge of the path between > sender and receiver? Correct, in 7.x we had cranked ss_fltsz up to a really high number to prevent the congestion window from collapsing when the connection was idle. We have a bit of a unique workload in that we are using TCP to reliably forward a latency-sensitive datagram stream across a WAN connection with high bandwidth and high RTT. Most of congestion control seems tuned to bulk transfers rather than this sort of use case. The solution we have settled on here is to add a new TCP socket option to disable idle handling so that when an idle connection restarts it keeps its prior congestion window. One other thing I noticed which is may or may not be odd during this, is that if you have a connection with TCP_NODELAY enabled and you fill your cwnd and then you get
Re: [PATCH] Don't imply TCP and UDP socket options are bitmasks
On Monday, January 14, 2013 5:17:12 pm Alfred Perlstein wrote: > On 1/14/13 4:56 PM, John Baldwin wrote: > > On Monday, January 14, 2013 4:42:16 pm Alfred Perlstein wrote: > >> Wouldn't a comment over the code suffice? > >> > >> Something like your email as a header would actually work very nicely! > >> > >> I think just using decimal would be more confusing than explicitly > >> calling it out like: > >> > >> /* begin enumerated (not bitmask) socket option specifiers */ > >> #defineTCP_MAXSEG 0x02/* set maximum segment size */ > >> #define TCP_NOPUSH 0x04/* don't push last block of write */ > >> #define TCP_NOOPT 0x08/* don't use TCP options */ > >> #define TCP_MD5SIG 0x10/* use MD5 digests (RFC2385) */ > >> /* end enumerated socket option specifiers */ > > I have a patch I'll post next which will add a new option as '3'. I think > > that > > will make it more obvious and avoid having new options follow the old > > pattern. > > > Any objection to adding the contents of that email as a comment > section? It really would help. We don't generally do this for other enumerations like ioctl values as it is generally obvious to the reader. I think for UDP having one constant called '1' should be obvious enough. TCP might indeed warrany a comment since it has several existing values that are powers of 2. How about this: /* * User-settable options (used with setsockopt). These are discrete * values and should not be masked together. Many values appear to be * bitmasks for legacy reasons. */ -- John Baldwin ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: [PATCH] Don't imply TCP and UDP socket options are bitmasks
On 14 January 2013 15:50, John Baldwin wrote: > Using a bitmask forces us to run out of bits and makes it > harder for vendors to try to use a high range of values for local custom > options (hoping that they never conflict with a new option value added in > stock FreeBSD). We should explicitly decide and #define the boundary value for custom options. -- Eitan Adler ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
two problems in dev/e1000/if_lem.c::lem_handle_rxtx()
Hi, i found a couple of problems in dev/e1000/if_lem.c::lem_handle_rxtx() , (compare with dev/e1000/if_em.c::em_handle_que() for better understanding): 1. in if_em.c::em_handle_que(), when em_rxeof() exceeds the rx_process_limit, the task is rescheduled so it can complete the work. Conversely, in if_lem.c::lem_handle_rxtx() the lem_rxeof() is only run once, and if there are more pending packets the only chance to drain them is to receive (many) more interrupts. This is a relatively serious problem, because the receiver has a hard time recovering. I'd like to commit a fix to this same as it is done in e1000. 2. in if_em.c::em_handle_que(), interrupts are reenabled unconditionally, whereas lem_handle_rxtx() only enables them if IFF_DRV_RUNNING is set. I cannot really tell what is the correct way here, so I'd like to put a comment there unless there is a good suggestion on what to do. Accesses to the intr register are race-prone anyways (disabled in fastintr, enabled in the rxtx task without holding any lock, and generally accessed under EM_CORE_LOCK in other places), and presumably enabling/disabling the interrupts around activations of the taks is just an optimization (and on a VM, it is actually a pessimization due to the huge cost of VM exits). cheers luigi ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: [SOLVED] if_vr(4) and DFE520-TX [working with patched if_rl]
On Tue, Jan 15, 2013 at 01:04:49PM +0400, Ruslan Makhmatkhanov wrote: > YongHyeon PYUN wrote on 15.01.2013 10:51: > > > >Hmm, I don't get it. > >Diff inlined again. > > > >Index: sys/pci/if_rlreg.h > >=== > >--- sys/pci/if_rlreg.h (revision 245199) > >+++ sys/pci/if_rlreg.h (working copy) > >@@ -1048,6 +1048,11 @@ struct rl_softc { > > #defineDLINK_DEVICEID_530TXPLUS0x1300 > > > > /* > >+ * D-Link DFE-520TX rev. C1 device ID > >+ */ > >+#define DLINK_DEVICEID_520TX_REVC1 0x4200 > >+ > >+/* > > * D-Link DFE-5280T device ID > > */ > > #defineDLINK_DEVICEID_528T 0x4300 > >Index: sys/pci/if_rl.c > >=== > >--- sys/pci/if_rl.c (revision 245199) > >+++ sys/pci/if_rl.c (working copy) > >@@ -148,6 +148,8 @@ static const struct rl_type rl_devs[] = { > > "Delta Electronics 8139 10/100BaseTX" }, > > { ADDTRON_VENDORID, ADDTRON_DEVICEID_8139, RL_8139, > > "Addtron Technology 8139 10/100BaseTX" }, > >+{ DLINK_VENDORID, DLINK_DEVICEID_520TX_REVC1, RL_8139, > >+"D-Link DFE-520TX (rev. C1) 10/100BaseTX" }, > > { DLINK_VENDORID, DLINK_DEVICEID_530TXPLUS, RL_8139, > > "D-Link DFE-530TX+ 10/100BaseTX" }, > > { DLINK_VENDORID, DLINK_DEVICEID_690TXD, RL_8139, > >> > > Hooray! It is working with if_rl with your patch (loader tunable isn't > used). Thanks a lot for this! > > Can this be committed and merged to 8/9? > Yes, committed in r245485. I will MFC to stable 9/8 after a week. > rl1@pci0:4:1:0: class=0x02 card=0x11031186 chip=0x42001186 > rev=0x10 hdr=0x00 > vendor = 'D-Link System Inc' > class = network > subclass = ethernet > > > rl1: flags=8843 metric 0 mtu 1500 > options=2008 > ether 90:94:e4:82:d5:e6 > inet 192.168.0.208 netmask 0xff00 broadcast 192.168.0.255 > nd6 options=29 > media: Ethernet autoselect (100baseTX ) > status: active > > Ping and all other working fine. > Thanks for testing! > -- > Regards, > Ruslan > > Tinderboxing kills... the drives. ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: two problems in dev/e1000/if_lem.c::lem_handle_rxtx()
OK, will look at this as soon as I can. Jack On Tue, Jan 15, 2013 at 5:23 PM, Luigi Rizzo wrote: > Hi, > i found a couple of problems in > dev/e1000/if_lem.c::lem_handle_rxtx() , > (compare with dev/e1000/if_em.c::em_handle_que() for better understanding): > > 1. in if_em.c::em_handle_que(), when em_rxeof() exceeds the > rx_process_limit, the task is rescheduled so it can complete the work. > Conversely, in if_lem.c::lem_handle_rxtx() the lem_rxeof() is > only run once, and if there are more pending packets the only > chance to drain them is to receive (many) more interrupts. > > This is a relatively serious problem, because the receiver has > a hard time recovering. > > I'd like to commit a fix to this same as it is done in e1000. > > 2. in if_em.c::em_handle_que(), interrupts are reenabled unconditionally, >whereas lem_handle_rxtx() only enables them if IFF_DRV_RUNNING is set. > >I cannot really tell what is the correct way here, so I'd like >to put a comment there unless there is a good suggestion on >what to do. > >Accesses to the intr register are race-prone anyways >(disabled in fastintr, enabled in the rxtx task without >holding any lock, and generally accessed under EM_CORE_LOCK >in other places), and presumably enabling/disabling the >interrupts around activations of the taks is just an >optimization (and on a VM, it is actually a pessimization >due to the huge cost of VM exits). > > cheers > luigi > ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: static kernel with mod_cc?
On 01/16/13 02:12, Eggert, Lars wrote: > Hi, > > On Jan 15, 2013, at 14:09, Lawrence Stewart wrote: >> You're not dense - the build glue to allow an algorithm to be specified >> in a kernel config file doesn't exist. > > ah, that explains it. I guess it doesn't exist for siftr either? Correct, though I believe it should be trivial to fix for both siftr and the mod_cc modules. I might have a go and post a patch for testing. Cheers, Lawrence ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"