Re: TSO and FreeBSD vs Linux

2013-08-14 Thread Lawrence Stewart
On 08/14/13 16:33, Julian Elischer wrote:
> On 8/14/13 11:39 AM, Lawrence Stewart wrote:
>> On 08/14/13 03:29, Julian Elischer wrote:
>>> I have been tracking down a performance embarrassment on AMAZON EC2 and
>>> have found it I think.
>> Let us please avoid conflating performance with throughput. The
>> behaviour you go on to describe as a performance embarrassment is
>> actually a throughput difference, and the FreeBSD behaviour you're
>> describing is essentially sacrificing throughput and CPU cycles for
>> lower latency. That may not be a trade-off you like, but it is an
>> important factor in this discussion.
> it was an embarrassment in that in one class of test we performed very
> poorly.
> It was not a disaster or a show-stopper, but for our product it is a
> critical number.

Sure, there's nothing wrong with holding throughput up as a key
performance metric for your use case.

I'm just trying to pre-empt a discussion that focuses on one metric and
fails to consider the bigger picture.

> It is a throughput difference, as you say but that is a very important
> part of performance...
> The latency of linux didn't seem to be any worse
> than FreeBSD, just the throughput was a lot higher in the same scenario.

The latency must increase when you delay packets in order to coalesce
them. Whether you were able to perceive that or not with the bulk send
type testing and measurement that you're doing is a separate issue.

>> Don't fall into the trap of labelling Linux's propensity for maximising
>> throughput as superior to an alternative approach which strikes a
>> different balance. It all depends on the use case.
> well the linux balance seems to be "be better all around" at this moment
> so that is
> embarrassing. :-) 

Better all round for you? Seems to be the case. Better all round for
everyone? No. Linux's choice of CUBIC as the default congestion control
algorithm is also a choice that maximises throughput but has side
effects which IMO are quite unfortunate.

> I could see no latency reversion.

You wouldn't because it would be practically invisible in the sorts of
tests/measurements you're doing. Our good friends over at HRT on the
other hand would be far more likely to care about latency on the order
of microseconds. Again, the use case matters a lot.

[snip]
>>> Notice that this behaviour in Linux seems to be modal.. it seems to
>>> 'switch on' a little bit
>>> into the 'starting' trace.
>>>
>>> In addition, you can see also that Linux gets going faster even in the
>>> beginning where
>>> TSO isn't in play, by sending a lot more packets up-front. (of course
>>> the wisdom of this
>>> can be argued).
>> They switched to using an initial window of 10 segments some time ago.
>> FreeBSD starts with 3 or more recently, 10 if you're running recent
>> 9-STABLE or 10-CURRENT.
> I tried setting initial values as shown:
>   net.inet.tcp.local_slowstart_flightsize: 10
>   net.inet.tcp.slowstart_flightsize: 10
> it didn't seem to make too much difference but I will redo the test.

Assuming this is still FreeBSD 8.0 as you mentioned out-of-band,
changing those variables without disabling rfc3390 will have no effect.
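
For reference, here is a minimal sketch of the RFC 3390 initial-window formula, min(4*MSS, max(2*MSS, 4380 bytes)). It is plain arithmetic rather than FreeBSD code, and the function name is made up for illustration. It only shows why a 1460-byte MSS gives roughly the 3-segment start mentioned above, regardless of the flightsize sysctls.

```python
def rfc3390_initial_window(mss: int) -> int:
    """Upper bound on the initial window, in bytes, per RFC 3390."""
    return min(4 * mss, max(2 * mss, 4380))


# Illustrative MSS values (assumptions, not taken from the traces).
for mss in (536, 1460, 8960):
    iw = rfc3390_initial_window(mss)
    print(f"MSS {mss}: initial window {iw} bytes (~{iw // mss} segments)")
```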

>>> Has anyone done any work on aggregating ACKs, or delaying responding to
>>> them?
>> As noted by Navdeep, we already have the code to aggregate ACKs in our
>> software LRO implementation. The bigger problem is that appropriate byte
>> counting places a default 2*MSS limit on the amount of ACKed data the
>> window can grow by i.e. if an ACK for 64k of data comes up the stack,
>> we'll grow the window by 2 segments worth of data in response. That
>> needs to be addressed - we could send the ACK count up with the
>> aggregated single ACK or just ignore abc_l_var when LRO is in use for a
>> connection.

BTW, as a work around for the appropriate byte counting issue, you can
crank net.inet.tcp.abc_l_var up to the number of MSS segments you wish
to allow it to increase the window by (per ack). If LRO is aggregating
64 on-wire acks into 1 mega ack, you should set abc_l_var=64.
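
To make the effect of the limit concrete, here is a rough sketch of the slow-start growth rule from RFC 3465 (appropriate byte counting), where the increase per ACK is capped at L*MSS. The function and parameter names are illustrative and not the kernel's; l_var stands in for net.inet.tcp.abc_l_var, and the 1460-byte MSS is an assumption.

```python
def abc_cwnd_increase(bytes_acked: int, mss: int, l_var: int = 2) -> int:
    """Slow-start window growth (bytes) for one ACK, capped at l_var * mss."""
    return min(bytes_acked, l_var * mss)


MSS = 1460                 # assumed segment size
mega_ack = 64 * MSS        # one aggregated ACK covering 64 on-wire segments

print(abc_cwnd_increase(mega_ack, MSS, l_var=2))    # growth with the default limit
print(abc_cwnd_increase(mega_ack, MSS, l_var=64))   # growth with abc_l_var=64
```

With the default limit the window grows by 2 segments for the whole aggregated ACK; with the limit raised to 64 it grows by the full amount of data acknowledged.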

> so, does "Software LRO" mean that LRO on the NIC should be ON or OFF to
> see this?

I think (check the driver code in question as I'm not sure) that if you
"ifconfig <interface> lro" and the driver has hardware support or has been
made aware of our software implementation, it should DTRT.

Cheers,
Lawrence
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Fwd: RFC 6980 on Security Implications of IPv6 Fragmentation with IPv6 Neighbor Discovery

2013-08-14 Thread Fernando Gont
Folks,

FYI. -- this is an important piece when it comes to First Hop (i.e.,
"local link") Security.

Cheers,
Fernando




-------- Original Message --------
Subject: RFC 6980 on Security Implications of IPv6 Fragmentation with
IPv6 Neighbor Discovery
Date: Tue, 13 Aug 2013 15:13:21 -0700 (PDT)
From: rfc-edi...@rfc-editor.org
To: ietf-annou...@ietf.org, rfc-d...@rfc-editor.org
CC: drafts-update-...@iana.org, i...@ietf.org, rfc-edi...@rfc-editor.org

A new Request for Comments is now available in online RFC libraries.


RFC 6980

Title:      Security Implications of IPv6 Fragmentation
            with IPv6 Neighbor Discovery
Author:     F. Gont
Status:     Standards Track
Stream:     IETF
Date:       August 2013
Mailbox:    fg...@si6networks.com
Pages:      10
Characters: 20850
Updates:    RFC 3971, RFC 4861

I-D Tag:    draft-ietf-6man-nd-extension-headers-05.txt

URL:        http://www.rfc-editor.org/rfc/rfc6980.txt

This document analyzes the security implications of employing IPv6
fragmentation with Neighbor Discovery (ND) messages.  It updates RFC
4861 such that use of the IPv6 Fragmentation Header is forbidden in
all Neighbor Discovery messages, thus allowing for simple and
effective countermeasures for Neighbor Discovery attacks.  Finally,
it discusses the security implications of using IPv6 fragmentation
with SEcure Neighbor Discovery (SEND) and formally updates RFC 3971
to provide advice regarding how the aforementioned security
implications can be mitigated.

This document is a product of the IPv6 Maintenance Working Group of the
IETF.

This is now a Proposed Standard.

STANDARDS TRACK: This document specifies an Internet standards track
protocol for the Internet community, and requests discussion and suggestions
for improvements.  Please refer to the current edition of the Internet
Official Protocol Standards (STD 1) for the standardization state and
status of this protocol.  Distribution of this memo is unlimited.

This announcement is sent to the IETF-Announce and rfc-dist lists.
To subscribe or unsubscribe, see
  http://www.ietf.org/mailman/listinfo/ietf-announce
  http://mailman.rfc-editor.org/mailman/listinfo/rfc-dist

For searching the RFC series, see
http://www.rfc-editor.org/search/rfc_search.php
For downloading RFCs, see http://www.rfc-editor.org/rfc.html

Requests for special distribution should be addressed to either the
author of the RFC in question, or to rfc-edi...@rfc-editor.org.  Unless
specifically noted otherwise on the RFC itself, all RFCs are for
unlimited distribution.


The RFC Editor Team
Association Management Solutions, LLC

IETF IPv6 working group mailing list
i...@ietf.org
Administrative Requests: https://www.ietf.org/mailman/listinfo/ipv6



-- 
Fernando Gont
e-mail: ferna...@gont.com.ar || fg...@si6networks.com
PGP Fingerprint: 7809 84F5 322E 45C7 F1C9 3945 96EE A9EF D076 FFF1








Re: TCP Initial Window 10 MFC (was: Re: svn commit: r252789 - stable/9/sys/netinet)

2013-08-14 Thread Eggert, Lars
Hi,

On Aug 14, 2013, at 10:36, Lawrence Stewart wrote:
> I don't think this change should have been MFCed, at least not in its
> current form.

FYI, Google's own data as presented in the HTTPBIS working group of the recent 
Berlin IETF shows that 10 is too high for ~25% of their web connections: see 
slide 2 of http://www.ietf.org/proceedings/87/slides/slides-87-httpbis-5.pdf

(That slide shows a CDF of CWND values the server used at the end of a web 
transaction.)

Lars




Re: TCP Initial Window 10 MFC

2013-08-14 Thread Lawrence Stewart
Hi Lars,

On 08/14/13 18:46, Eggert, Lars wrote:
> Hi,
> 
> On Aug 14, 2013, at 10:36, Lawrence Stewart wrote:
>> I don't think this change should have been MFCed, at least not in its
>> current form.
> 
> FYI, Google's own data as presented in the HTTPBIS working group of the 
> recent Berlin IETF shows that 10 is too high for ~25% of their web 
> connections: see slide 2 of 
> http://www.ietf.org/proceedings/87/slides/slides-87-httpbis-5.pdf
> 
> (That slide shows a CDF of CWND values the server used at the end of a web 
> transaction.)

Thanks for the pointer - very interesting. Do you recall if they said
how many flows made up the CDF?

Cheers,
Lawrence


telnet authentication using RADIUS

2013-08-14 Thread takCoder
Hi all,

I need to apply RADIUS authentication to my remote connections. For ssh I
have no problems, as I use the pam.d/sshd file to add a pam_radius.so entry.

For telnet, however, I've hit a problem. As far as I have seen, for non-SRA
telnet connections, authentication is done via pam.d/login rather than
pam.d/telnetd, and this depends on the telnet client as well as on my
server.

I need the pam.d/telnetd file to always be applied for all telnet
authentications, so I can separate my remote authentication policies from
local ones.

Am I right about the facts I stated above regarding telnet?
Do you know of any tip or trick for this? Any ideas are really appreciated.
Thank you :)

Best Regards,
t.a.k


it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)

2013-08-14 Thread Luigi Rizzo
On Wed, Aug 14, 2013 at 05:23:02PM +1000, Lawrence Stewart wrote:
> On 08/14/13 16:33, Julian Elischer wrote:
> > On 8/14/13 11:39 AM, Lawrence Stewart wrote:
> >> On 08/14/13 03:29, Julian Elischer wrote:
> >>> I have been tracking down a performance embarrassment on AMAZON EC2 and
> >>> have found it I think.
> >> Let us please avoid conflating performance with throughput. The
> >> behaviour you go on to describe as a performance embarrassment is
> >> actually a throughput difference, and the FreeBSD behaviour you're
> >> describing is essentially sacrificing throughput and CPU cycles for
> >> lower latency. That may not be a trade-off you like, but it is an
> >> important factor in this discussion.
...
> Sure, there's nothing wrong with holding throughput up as a key
> performance metric for your use case.
> 
> I'm just trying to pre-empt a discussion that focuses on one metric and
> fails to consider the bigger picture.
...
> > I could see no latency reversion.
> 
> You wouldn't because it would be practically invisible in the sorts of
> tests/measurements you're doing. Our good friends over at HRT on the
> other hand would be far more likely to care about latency on the order
> of microseconds. Again, the use case matters a lot.
...
> > so, does "Software LRO" mean that LRO on the NIC should be ON or OFF to
> > see this?
> 
> I think (check the driver code in question as I'm not sure) that if you
> "ifconfig <interface> lro" and the driver has hardware support or has been
> made aware of our software implementation, it should DTRT.

The "lower throughput than linux" that julian was seeing is either
because of a slow (CPU-bound) sender or slow receiver. Given that
the FreeBSD tx path is quite expensive (redoing route and arp lookups
on every packet, etc.) I highly suspect the sender side is at fault.

Ack coalescing, LRO and GRO are limited to the set of packets that you
receive in the same batch, which in turn is upper bounded by the
interrupt moderation delay. Apart from simple benchmarks with only
a few flows, it is very unlikely that ack/lro/gro can coalesce more
than a few segments for the same flow.
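
A back-of-the-envelope sketch of that bound follows. All numbers are assumptions chosen for illustration (10 Gbit/s link, 1500-byte frames, 100 us moderation delay, traffic spread evenly over the flows); the function name is hypothetical.

```python
def segments_per_flow_per_batch(link_bps: float, frame_bytes: int,
                                moderation_s: float, flows: int) -> float:
    """Average number of segments of one flow that arrive in one interrupt batch."""
    packets_per_second = link_bps / (frame_bytes * 8)
    return packets_per_second * moderation_s / flows


# One flow: a batch holds many segments that can be coalesced.
print(segments_per_flow_per_batch(10e9, 1500, 100e-6, flows=1))
# One hundred flows: on average less than one segment per flow per batch.
print(segments_per_flow_per_batch(10e9, 1500, 100e-6, flows=100))
```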

But the real fix is in tcp_output.

In fact, it has never been the case that an ack (single or coalesced)
triggers an immediate transmission in the output path.  We had this
in the past (Silly Window Syndrome) and there is code that avoids
sending less than 1-mtu under appropriate conditions (there is more
data to push out anyways, no NODELAY, there are outstanding acks,
the window can open further).  In all these cases there is no
reasonable way to experience the difference in terms of latency.

If one really cares, e.g. the High Speed Trading example, this is
a non issue because any reasonable person would run with TCP_NODELAY
(and possibly disable interrupt moderation), and optimize for latency
even on a per flow basis.

In terms of coding effort, i suspect that by replacing the 1-mtu
limit (t_maxseg i believe is the variable that we use in the SWS
avoidance code) with 1-max-tso-segment we can probably achieve good
results with little programming effort.
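
As a toy model of that idea (not the actual tcp_output logic, which checks more conditions), the sketch below defers a transmission until at least `threshold` bytes are sendable, unless the flow uses TCP_NODELAY or has nothing in flight. The function name, parameters and the 65535-byte maximum TSO segment are assumptions for illustration.

```python
def send_now(sendable: int, threshold: int, nodelay: bool = False,
             data_in_flight: bool = True) -> bool:
    """Decide whether to transmit immediately in a simplified SWS-avoidance model."""
    if nodelay or not data_in_flight:
        return True                 # latency-sensitive or idle flow: do not wait
    return sendable >= threshold    # otherwise wait for a full-sized burst


MAXSEG = 1460        # assumed value of t_maxseg
TSO_MAX = 65535      # assumed largest TSO segment

print(send_now(20000, MAXSEG))                 # current limit: send right away
print(send_now(20000, TSO_MAX))                # proposed limit: wait for more
print(send_now(20000, TSO_MAX, nodelay=True))  # NODELAY flows are unaffected
```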

Then the problem remains that we should keep a copy of route and
arp information in the socket instead of redoing the lookups on
every single transmission, as they consume some 25% of the time of
a sendto(), and probably even more when it comes to large tcp
segments, sendfile() and the like.

cheers
luigi


Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)

2013-08-14 Thread Lev Serebryakov
Hello, Luigi.
You wrote on 14 August 2013, 14:21:09:

LR> Then the problem remains that we should keep a copy of route and
LR> arp information in the socket instead of redoing the lookups on
LR> every single transmission, as they consume some 25% of the time of
LR> a sendto(), and probably even more when it comes to large tcp
LR> segments, sendfile() and the like.
  And we should invalidate this info on ARP/route changes, or connection
 will be lost in such cases, am I right?.. So, on each such event code
 should look into all sockets and check, if routing/ARP information is still
 valid for them. Or we should store lists of sockets in routing and ARP
 tables... I don't know, what is worse.


-- 
// Black Lion AKA Lev Serebryakov 


route/arp lifetime (Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux))

2013-08-14 Thread Luigi Rizzo
On Wed, Aug 14, 2013 at 03:47:13PM +0400, Lev Serebryakov wrote:
> Hello, Luigi.
> You wrote on 14 August 2013, 14:21:09:
> 
> LR> Then the problem remains that we should keep a copy of route and
> LR> arp information in the socket instead of redoing the lookups on
> LR> every single transmission, as they consume some 25% of the time of
> LR> a sendto(), and probably even more when it comes to large tcp
> LR> segments, sendfile() and the like.

>   And we should invalidate this info on ARP/route changes, or connection
>  will be lost in such cases, am I right?.. So, on each such event code
>  should look into all sockets and check, if routing/ARP information is still
>  valid for them. Or we should store lists of sockets in routing and ARP
>  tables... I don't know, what is worse.

I think we should start by acknowledging that routing and ARP
information is inherently stale, and changes infrequently.
So it is not a disaster if we have incorrect information for some
short amount of time (milliseconds) because in the end the remote
party that decides to change it and inform us may take much longer
than that to distribute the update.


Considering that each lookup takes between 100..300ns if you are
lucky (not many misses, relatively empty table etc.), one could
reasonably do the lookup at most once per millisecond or so (just
reading 'ticks', no need for a nanotime() if you have a slow clock),
or whenever we get an error related to the socket, either in the
forward path (e.g. ifp points to an interface that is down) or in
the reverse path (e.g. a dupack because we sent a packet to the
wrong place).
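
A minimal sketch of such a per-socket cache, assuming one tick is about a millisecond (hz=1000) and using a hypothetical `lookup` callable in place of the real route/ARP lookup. None of these names exist in the kernel; this only illustrates the revalidation policy.

```python
class CachedRoute:
    """Per-socket cache that redoes the lookup at most once per max_age_ticks."""

    def __init__(self, lookup, max_age_ticks: int = 1):
        self._lookup = lookup          # hypothetical slow lookup: dst -> (ifp, mac)
        self._max_age = max_age_ticks
        self._entry = None
        self._stamp = 0

    def get(self, dst, now_ticks: int):
        # Refresh if nothing is cached or the cached entry is too old.
        if self._entry is None or now_ticks - self._stamp >= self._max_age:
            self._entry = self._lookup(dst)
            self._stamp = now_ticks
        return self._entry

    def invalidate(self):
        # Call on a socket error (interface down, dupack, ...) to force a lookup.
        self._entry = None


calls = []

def slow_lookup(dst):
    calls.append(dst)
    return ("em0", "00:11:22:33:44:55")   # made-up interface and MAC address

route = CachedRoute(slow_lookup)
for tick in (0, 0, 0, 1, 1):
    route.get("192.0.2.1", tick)
print(len(calls))   # five sends, but only two real lookups
```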

cheers
luigi


ng_ipacct locking rework

2013-08-14 Thread Vsevolod Stakhov

Hello,

I've reworked the locking model of the ng_ipacct module 
(ports/net-mgmt/ng_ipacct) for better parallel access support. I did the 
following:


- convert locking from a global mutex to hash bucket level locks;
- convert the mutex to an rmlock (from my observations, IP accounting
data is mostly read from the hash).
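
The first change can be sketched roughly as follows. This is a user-space illustration with hypothetical names, not the ng_ipacct code; Python has no equivalent of rmlock, so plain locks stand in for the per-bucket read-mostly locks.

```python
import threading


class BucketLockedCounters:
    """Hash table of byte counters with one lock per bucket instead of a global one."""

    def __init__(self, nbuckets: int = 64):
        self._buckets = [dict() for _ in range(nbuckets)]
        self._locks = [threading.Lock() for _ in range(nbuckets)]

    def _index(self, key) -> int:
        return hash(key) % len(self._buckets)

    def add(self, key, nbytes: int) -> None:
        i = self._index(key)
        with self._locks[i]:              # only this bucket is serialized
            bucket = self._buckets[i]
            bucket[key] = bucket.get(key, 0) + nbytes

    def get(self, key) -> int:
        i = self._index(key)
        with self._locks[i]:
            return self._buckets[i].get(key, 0)


table = BucketLockedCounters(nbuckets=8)
flow = ("10.0.0.1", "10.0.0.2")           # made-up flow key

def worker():
    for _ in range(1000):
        table.add(flow, 100)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(table.get(flow))
```

Updates to flows that hash to different buckets no longer contend with each other, which is the point of the change.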


I would appreciate it if somebody could review/test this patch so that I can
commit it to the port afterwards.


The patches themselves are here:
http://highsecure.ru/patch-ng_ipacct.c
http://highsecure.ru/patch-ng_ipacct_hash.h

For more comfortable viewing, they are also mirrored as a gist:
https://gist.github.com/vstakhov/6223170

--
Vsevolod Stakhov


Re: route/arp lifetime (Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux))

2013-08-14 Thread Alexander V. Chernikov

On 14.08.2013 16:05, Luigi Rizzo wrote:
> On Wed, Aug 14, 2013 at 03:47:13PM +0400, Lev Serebryakov wrote:
>> Hello, Luigi.
>> You wrote on 14 August 2013, 14:21:09:
>>
>> LR> Then the problem remains that we should keep a copy of route and
>> LR> arp information in the socket instead of redoing the lookups on
>> LR> every single transmission, as they consume some 25% of the time of
>> LR> a sendto(), and probably even more when it comes to large tcp
>> LR> segments, sendfile() and the like.
>>   And we should invalidate this info on ARP/route changes, or connection
>>  will be lost in such cases, am I right?.. So, on each such event code
>>  should look into all sockets and check, if routing/ARP information is still
>>  valid for them. Or we should store lists of sockets in routing and ARP
>>  tables... I don't know, what is worse.
>
> I think we should start by acknowledging that routing and ARP
> information is inherently stale, and changes infrequently.
> So it is not a disaster if we have incorrect information for some
> short amount of time (milliseconds) because in the end the remote
> party that decides to change it and inform us may take much longer
> than that to distribute the update.

You can save rte&arp; however, doing this
gives you a perfect chance to crash your kernel if the egress interface is
destroyed (like vlan or ng or tun).

> Considering that each lookup takes between 100..300ns if you are
> lucky (not many misses, relatively empty table etc.), one could
> reasonably do the lookup at most once per millisecond or so (just
> reading 'ticks', no need for a nanotime() if you have a slow clock),
> or whenever we get an error related to the socket, either in the
> forward path (e.g. ifp points to an interface that is down) or in
> the reverse path (e.g. a dupack because we sent a packet to the
> wrong place).

This sounds like "Hey, the kernel lookup is slow (which is true), let's
make a hack and not bother with lookups".
This approach gives us mtx-locked rte refcounts which are used (misused)
in many places, making things worse and decreasing the ability to fix
things up.

> cheers
> luigi


Re: route/arp lifetime (Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux))

2013-08-14 Thread Luigi Rizzo
On Wed, Aug 14, 2013 at 04:15:25PM +0400, Alexander V. Chernikov wrote:
> On 14.08.2013 16:05, Luigi Rizzo wrote:
> > On Wed, Aug 14, 2013 at 03:47:13PM +0400, Lev Serebryakov wrote:
> >> Hello, Luigi.
> >> You wrote on 14 August 2013, 14:21:09:
> >>
> >> LR> Then the problem remains that we should keep a copy of route and
> >> LR> arp information in the socket instead of redoing the lookups on
> >> LR> every single transmission, as they consume some 25% of the time of
> >> LR> a sendto(), and probably even more when it comes to large tcp
> >> LR> segments, sendfile() and the like.
> >>   And we should invalidate this info on ARP/route changes, or connection
> >>   will be lost in such cases, am I right?.. So, on each such event code
> >>   should look into all sockets and check, if routing/ARP information is
> >>   still valid for them. Or we should store lists of sockets in routing
> >>   and ARP tables... I don't know, what is worse.
> > I think we should start by acknowledging that routing and ARP
> > information is inherently stale, and changes infrequently.
> > So it is not a disaster if we have incorrect information for some
> > short amount of time (milliseconds) because in the end the remote
> > party that decides to change it and inform us may take much longer
> > than that to distribute the update.
> You can save rte&arp, however doing this
> gives you perfect chance to crash your kernel if egress interface is 
> destroyed (like vlan or ng or tun).

I hope I learned not to follow a stale ifp pointer :)
anyways ARP is really just the mac address so there is no
dangling pointer issue.

For the ifp associated to the route,
i do not see a huge problem in marking the route/ifp as
zombie and destroying it when the last reference goes away.

Not that the current way is any better -- you need to lock/unlock
the rte while you do the lookup, and hold a refcount to the ifp
until the packet is queued. So how does my suggestion make
things worse ?

cheers
luigi


> >
> >
> > Considering that each lookup takes between 100..300ns if you are
> > lucky (not many misses, relatively empty table etc.), one could
> > reasonably do the lookup at most once per millisecond or so (just
> > reading 'ticks', no need for a nanotime() if you have a slow clock),
> > or whenever we get an error related to the socket, either in the
> > forward path (e.g. ifp points to an interface that is down) or in
> > the reverse path (e.g. a dupack because we sent a packet to the
> > wrong place).
> This sounds like "Hey, the kernel lookup is slow (which is true), let's 
> make a hack and don't bother lookups".
> This approach gives us mtx-locked rte refcounts which are used (misused) 
> in many places making things worse and decreasing the ability to fix the 
> things up..
> >
> > cheers
> > luigi


Re: route/arp lifetime (Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux))

2013-08-14 Thread Alexander V. Chernikov

On 14.08.2013 16:40, Luigi Rizzo wrote:
> On Wed, Aug 14, 2013 at 04:15:25PM +0400, Alexander V. Chernikov wrote:
>> On 14.08.2013 16:05, Luigi Rizzo wrote:
>>> On Wed, Aug 14, 2013 at 03:47:13PM +0400, Lev Serebryakov wrote:
>>>> Hello, Luigi.
>>>> You wrote on 14 August 2013, 14:21:09:
>>>>
>>>> LR> Then the problem remains that we should keep a copy of route and
>>>> LR> arp information in the socket instead of redoing the lookups on
>>>> LR> every single transmission, as they consume some 25% of the time of
>>>> LR> a sendto(), and probably even more when it comes to large tcp
>>>> LR> segments, sendfile() and the like.
>>>> And we should invalidate this info on ARP/route changes, or connection
>>>>    will be lost in such cases, am I right?.. So, on each such event code
>>>>    should look into all sockets and check, if routing/ARP information is still
>>>>    valid for them. Or we should store lists of sockets in routing and ARP
>>>>    tables... I don't know, what is worse.
>>>
>>> I think we should start by acknowledging that routing and ARP
>>> information is inherently stale, and changes infrequently.
>>> So it is not a disaster if we have incorrect information for some
>>> short amount of time (milliseconds) because in the end the remote
>>> party that decides to change it and inform us may take much longer
>>> than that to distribute the update.
>>
>> You can save rte&arp, however doing this
>> gives you perfect chance to crash your kernel if egress interface is
>> destroyed (like vlan or ng or tun).
>
> I hope I learned not to follow a stale ifp pointer :)

Well, currently we have no locks (or other means) to ensure that all other
cores have a "current" pointer to the ifp or its fields (or am I wrong?)

> anyways ARP is really just the mac address so there is no
> dangling pointer issue.
>
> For the ifp associated to the route,
> i do not see a huge problem in marking the route/ifp as
> zombie and destroy it when the last reference goes away.

Yes, but references require some synchronization primitives. One
possible solution is using pcpu counters, but it does not play well on
!amd64.

> Not that the current way is any better -- you need to lock/unlock
> the rte while you do the lookup, and hold a refcount to the ifp
> until the packet is queued. So how does my suggestion make
> things worse ?
>
> cheers
> luigi
>
>>> Considering that each lookup takes between 100..300ns if you are
>>> lucky (not many misses, relatively empty table etc.), one could
>>> reasonably do the lookup at most once per millisecond or so (just
>>> reading 'ticks', no need for a nanotime() if you have a slow clock),
>>> or whenever we get an error related to the socket, either in the
>>> forward path (e.g. ifp points to an interface that is down) or in
>>> the reverse path (e.g. a dupack because we sent a packet to the
>>> wrong place).
>>
>> This sounds like "Hey, the kernel lookup is slow (which is true), let's
>> make a hack and don't bother lookups".
>> This approach gives us mtx-locked rte refcounts which are used (misused)
>> in many places making things worse and decreasing the ability to fix the
>> things up..
>>>
>>> cheers
>>> luigi


Updating route MTU when interface changes

2013-08-14 Thread Joe Holden

Hi guys,

I noticed this some years ago, but I just checked and it's still there:
when the MTU of an interface is changed, any routes (e.g. the connected
route) using that interface aren't updated until the interface is
shut/unshut. Is this by design or is it an oversight? Having to down/up
remote machine interfaces that are potentially forwarding traffic just to
change the MTU seems silly.


Ta,
Joe


Re: route/arp lifetime (Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux))

2013-08-14 Thread Luigi Rizzo
On Wed, Aug 14, 2013 at 05:01:05PM +0400, Alexander V. Chernikov wrote:
> On 14.08.2013 16:40, Luigi Rizzo wrote:
...
> >> You can save rte&arp, however doing this
> >> gives you perfect chance to crash your kernel if egress interface is
> >> destroyed (like vlan or ng or tun).
> > I hope I learned not to follow a stale ifp pointer :)
> Well, currently we have no locks (or other means)  to ensure all other 
> cores has "current" pointer to ifp or its fields (or am I wrong?)

This i don't know -- but in any case, we should fix the race anyways
(another timescale, but still dangerous).

> > anyways ARP is really just the mac address so there is no
> > dangling pointer issue.
> >
> > For the ifp associated to the route,
> > i do not see a huge problem in marking the route/ifp as
> > zombie and destroy it when the last reference goes away.
> Yes, but references requires some synchronization primitives. One 

Again, we should protect against ifp destruction anyways.  Surely
we should try and make the protection mechanism cheap (in my proposal,
going through the refcount once per millisecond instead of every
single packet; there might be better ways, and i am all ears on
that); surely, we cannot dismiss something because "we run without
seatbelts now so anything else is more expensive".

We had a related discussion regarding races in interfaces between
the datapath (if_transmit() and *_rxeof() ) and the control path
(ioctls, watchdog etc.).

The reason I am raising this issue is because i want to fix the
races that emerged when we moved to SMP, not because I want to "make
hacks" and cut corners in unsafe ways.

cheers
luigi

> possible solution is using pcpu counters, but it does not play well on 
> !amd64.
> >
> > Not that the current way is any better -- you need to lock/unlock
> > the rte while you do the lookup, and hold a refcount to the ifp
> > until the packet is queued. So how does my suggestion make
> > things worse ?
> >
> > cheers
> > luigi
> >
> >
> >>>
> >>> Considering that each lookup takes between 100..300ns if you are
> >>> lucky (not many misses, relatively empty table etc.), one could
> >>> reasonably do the lookup at most once per millisecond or so (just
> >>> reading 'ticks', no need for a nanotime() if you have a slow clock),
> >>> or whenever we get an error related to the socket, either in the
> >>> forward path (e.g. ifp points to an interface that is down) or in
> >>> the reverse path (e.g. a dupack because we sent a packet to the
> >>> wrong place).
> >> This sounds like "Hey, the kernel lookup is slow (which is true), let's
> >> make a hack and don't bother lookups".
> >> This approach gives us mtx-locked rte refcounts which are used (misused)
> >> in many places making things worse and decreasing the ability to fix the
> >> things up..
> >>> cheers
> >>> luigi


Re: route/arp lifetime (Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux))

2013-08-14 Thread Marko Zec
On Wednesday 14 August 2013 14:40:24 Luigi Rizzo wrote:
> On Wed, Aug 14, 2013 at 04:15:25PM +0400, Alexander V. Chernikov wrote:
> > On 14.08.2013 16:05, Luigi Rizzo wrote:
> > > On Wed, Aug 14, 2013 at 03:47:13PM +0400, Lev Serebryakov wrote:
> > >> Hello, Luigi.
> > >> You wrote on 14 August 2013, 14:21:09:
> > >>
> > >> LR> Then the problem remains that we should keep a copy of route and
> > >> LR> arp information in the socket instead of redoing the lookups on
> > >> LR> every single transmission, as they consume some 25% of the time of
> > >> LR> a sendto(), and probably even more when it comes to large tcp
> > >> LR> segments, sendfile() and the like.
> > >> And we should invalidate this info on ARP/route changes, or
> > >> connection will be lost in such cases, am I right?.. So, on each
> > >> such event code should look into all sockets and check, if
> > >> routing/ARP information is still valid for them. Or we should store
> > >> lists of sockets in routing and ARP tables... I don't know, what is
> > >> worse.
> > >
> > > I think we should start by acknowledging that routing and ARP
> > > > information is inherently stale, and changes infrequently.
> > > So it is not a disaster if we have incorrect information for some
> > > short amount of time (milliseconds) because in the end the remote
> > > party that decides to change it and inform us may take much longer
> > > than that to distribute the update.
> >
> > You can save rte&arp, however doing this
> > gives you perfect chance to crash your kernel if egress interface is
> > destroyed (like vlan or ng or tun).
>
> I hope I learned not to follow a stale ifp pointer :)
> anyways ARP is really just the mac address so there is no
> dangling pointer issue.
>
> For the ifp associated to the route,
> i do not see a huge problem in marking the route/ifp as
> zombie and destroy it when the last reference goes away.

FWIW, apparently we already have that infrastructure in place - if_rele() 
calls if_free_internal() only when the last reference to the ifnet is 
dropped, so with little care this should be usable for caching ifp pointers 
w/o fears for kernel crashes mentioned above.
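
The "free only on the last reference" idea can be modelled in userland as below. This is a minimal sketch, not the kernel's actual if_ref()/if_rele() code: struct fake_ifp, its fields and all fake_ifp_* functions are hypothetical names invented for illustration, and the "dying" flag stands in for the zombie marking Luigi describes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an interface object; not the real struct ifnet. */
struct fake_ifp {
    atomic_int refs;    /* outstanding references, including the owner's */
    atomic_bool dying;  /* set once the interface has been "destroyed" */
};

int fake_ifp_frees;     /* counts frees, for demonstration only */

struct fake_ifp *fake_ifp_alloc(void)
{
    struct fake_ifp *ifp = malloc(sizeof(*ifp));
    if (ifp == NULL)
        return NULL;
    atomic_init(&ifp->refs, 1);       /* the creator's reference */
    atomic_init(&ifp->dying, false);
    return ifp;
}

/* Take a reference, e.g. when a socket caches the pointer. */
void fake_ifp_hold(struct fake_ifp *ifp)
{
    atomic_fetch_add(&ifp->refs, 1);
}

/* Drop a reference; the memory is released only by the last holder. */
void fake_ifp_rele(struct fake_ifp *ifp)
{
    int old = atomic_fetch_sub(&ifp->refs, 1);
    assert(old > 0);
    if (old == 1) {
        fake_ifp_frees++;
        free(ifp);
    }
}

/* "Destroy" the interface: mark it a zombie, then drop the owner's reference. */
void fake_ifp_destroy(struct fake_ifp *ifp)
{
    atomic_store(&ifp->dying, true);
    fake_ifp_rele(ifp);
}

/* A cached pointer stays safe to dereference, but a zombie must not be used to transmit. */
bool fake_ifp_usable(struct fake_ifp *ifp)
{
    return !atomic_load(&ifp->dying);
}
```

A socket that cached the pointer with fake_ifp_hold() would check fake_ifp_usable() before sending and call fake_ifp_rele() once it notices the interface is gone, at which point the object is actually freed.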

Marko

> Not that the current way is any better -- you need to lock/unlock
> the rte while you do the lookup, and hold a refcount to the ifp
> until the packet is queued. So how does my suggestion make
> things worse ?
>
> cheers
> luigi
>
> > > Considering that each lookup takes between 100..300ns if you are
> > > lucky (not many misses, relatively empty table etc.), one could
> > > reasonably do the lookup at most once per millisecond or so (just
> > > reading 'ticks', no need for a nanotime() if you have a slow clock),
> > > or whenever we get an error related to the socket, either in the
> > > forward path (e.g. ifp points to an interface that is down) or in
> > > the reverse path (e.g. a dupack because we sent a packet to the
> > > wrong place).
> >
> > This sounds like "Hey, the kernel lookup is slow (which is true), let's
> > make a hack and don't bother lookups".
> > This approach gives us mtx-locked rte refcounts which are used
> > (misused) in many places making things worse and decreasing the ability
> > to fix the things up..
> >
> > > cheers
> > > luigi
> > > ___
> > > freebsd-net@freebsd.org mailing list
> > > http://lists.freebsd.org/mailman/listinfo/freebsd-net
> > > To unsubscribe, send any mail to
> > > "freebsd-net-unsubscr...@freebsd.org"
>


Re: route/arp lifetime (Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux))

2013-08-14 Thread Luigi Rizzo
On Wed, Aug 14, 2013 at 05:40:28PM +0200, Marko Zec wrote:
> On Wednesday 14 August 2013 14:40:24 Luigi Rizzo wrote:
> > On Wed, Aug 14, 2013 at 04:15:25PM +0400, Alexander V. Chernikov wrote:
...
> FWIW, apparently we already have that infrastructure in place - if_rele() 
> calls if_free_internal() only when the last reference to the ifnet is 
> dropped, so with little care this should be usable for caching ifp pointers 
> w/o fears for kernel crashes mentioned above.

maybe Alexander was referring to holding references to the rte entries
returned as a result of the lookup. The rte holds a reference to the ifp.

cheers
luigi


Re: TSO and FreeBSD vs Linux

2013-08-14 Thread Julian Elischer

On 8/14/13 2:33 PM, Julian Elischer wrote:
On 8/14/13 11:39 AM, Lawrence Stewart wrote: 



There's a thing controlled by ethtool called GRO (generic receive
offload) which appears to be enabled by default on at least Ubuntu and I
guess other Linux's too. It's responsible for aggregating ACKs and data
to batch them up the stack if the driver doesn't provide a hardware
offload implementation. Try rerunning your experiments with the ACK
batching disabled on the Linux host to get an additional comparison point.

I will try that as soon as I get back to the machines in question.


turning on and off GRO seems to make no difference, either at the
overall throughput level or at the low level packet-by-packet level
(according to tcptrace).


for two examples look at:


http://www.freebsd.org/~julian/LvsF-tcp-start.tiff
and
http://www.freebsd.org/~julian/LvsF-tcp.tiff

in each case, we can see FreeBSD on the left and Linux on the right.

The first case shows the case as the sessions start, and the second case
shows some distance later (when the sequence numbers wrap around.. no
particular reason to use that, it was just fun to see).
In both cases you can see that each Linux packet (white) (once they have
got going) is responding to multiple bumps in the send window sequence
number (green and yellow lines) (representing the arrival of several ACKs)
while FreeBSD produces a whole bunch of smaller packets, slavishly
following exactly the size of each incoming ack.. This gives us quite a
performance debt.

Again, please s/performance/what-you-really-mean/ here.

ok, In my tests this makes FreeBSD data transfers much slower, by as much
as 60%.



Notice that this behaviour in Linux seems to be modal.. it seems to
'switch on' a little bit into the 'starting' trace.

In addition, you can see also that Linux gets going faster even in the
beginning where TSO isn't in play, by sending a lot more packets up-front.
(of course the wisdom of this can be argued).

They switched to using an initial window of 10 segments some time ago.
FreeBSD starts with 3 or more recently, 10 if you're running recent
9-STABLE or 10-CURRENT.

I tried setting initial values as shown:
  net.inet.tcp.local_slowstart_flightsize: 10
  net.inet.tcp.slowstart_flightsize: 10
it didn't seem to make too much difference but I will redo the test.



Has anyone done any work on aggregating ACKs, or delaying responding to
them?

As noted by Navdeep, we already have the code to aggregate ACKs in our
software LRO implementation. The bigger problem is that appropriate byte
counting places a default 2*MSS limit on the amount of ACKed data the
window can grow by i.e. if an ACK for 64k of data comes up the stack,
we'll grow the window by 2 segments worth of data in response. That
needs to be addressed - we could send the ACK count up with the
aggregated single ACK or just ignore abc_l_var when LRO is in use for a
connection.
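
To put numbers on the 2*MSS limit: the sketch below models slow-start growth under Appropriate Byte Counting (RFC 3465), where the window grows by the bytes newly ACKed, capped at L*MSS. It is an illustration only, not the FreeBSD stack code. The function names are hypothetical, and the second function is just one possible reading of "send the ACK count up with the aggregated single ACK".

```c
#include <assert.h>
#include <stdint.h>

/*
 * Slow-start growth with Appropriate Byte Counting (RFC 3465):
 * grow by the bytes newly ACKed, capped at limit_segs * mss.
 */
uint32_t abc_slowstart_growth(uint32_t bytes_acked, uint32_t mss,
                              uint32_t limit_segs)
{
    assert(mss > 0);
    uint64_t cap = (uint64_t)limit_segs * mss;
    return bytes_acked < cap ? bytes_acked : (uint32_t)cap;
}

/*
 * Hypothetical variant: LRO reports how many ACKs it merged, and the
 * cap is scaled by that count so the merged ACK is worth as much as
 * the individual ACKs would have been.
 */
uint32_t abc_slowstart_growth_lro(uint32_t bytes_acked, uint32_t mss,
                                  uint32_t limit_segs, uint32_t acks_merged)
{
    assert(mss > 0);
    uint64_t cap = (uint64_t)acks_merged * limit_segs * mss;
    return bytes_acked < cap ? bytes_acked : (uint32_t)cap;
}
```

With an assumed MSS of 1448 bytes and L=2, a single aggregated ACK covering 65536 bytes grows the window by only 2896 bytes. If the same ACK carried a merged count of 23, the scaled cap (66608 bytes) would let the window grow by the full 65536 bytes.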
so, does "Software LRO" mean that LRO on the NIC should be ON or OFF
to see this?





Cheers,
Lawrence






Re: TCP Initial Window 10 MFC

2013-08-14 Thread Andre Oppermann

On 14.08.2013 04:36, Lawrence Stewart wrote:

Hi Andre,

[RE team is BCCed so they're aware of this discussion]

On 07/06/13 00:58, Andre Oppermann wrote:

Author: andre
Date: Fri Jul  5 14:58:24 2013
New Revision: 252789
URL: http://svnweb.freebsd.org/changeset/base/252789

Log:
   MFC r242266:

Increase the initial CWND to 10 segments as defined in IETF TCPM
draft-ietf-tcpm-initcwnd-05. It explains why the increased initial
window improves the overall performance of many web services without
risking congestion collapse.

As long as it remains a draft it is placed under a sysctl marking it
as experimental:
 net.inet.tcp.experimental.initcwnd10 = 1
When it becomes an official RFC soon the sysctl will be changed to
the RFC number and moved to net.inet.tcp.

This implementation differs from the RFC draft in that it is a bit
more conservative in the case of packet loss on SYN or SYN|ACK because
we haven't reduced the default RTO to 1 second yet.  Also the restart
window isn't yet increased as allowed.  Both will be adjusted with
upcoming changes.

It is enabled by default.  In Linux it is enabled since kernel 3.0.


I haven't been fully alert to FreeBSD happenings this year so apologies
for bringing this up so long after the MFC.

I don't think this change should have been MFCed, at least not in its
current form. Enabling the switch to IW=10 on a stable branch is
inappropriate IMO. I also think the "net.inet.tcp.experimental" sysctl
branch is poorly named as per the important discussion we had back in
February [1]. I would really prefer we didn't get stuck having to keep
it around by making a stable release with it being present.

I think this commit should be backed out of stable/9 and more
importantly, 9.2-RELEASE.


Backing out the patch isn't really necessary, just flip the switch to
off having it revert to the RFC5681 defaults.  Those who want it anyway
can simply enable it again.

IW10 has become RFC6928 (experimental) in April 2013.
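
For reference, the two initial-window sizes under discussion can be computed as follows. This is a sketch of the formulas as published in RFC 3390 and RFC 6928, not of the FreeBSD code; the function names are made up for illustration. RFC 5681 gives a closely related bound expressed in whole segments.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }
static uint32_t max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }

/* RFC 3390 initial window in bytes: min(4*MSS, max(2*MSS, 4380)). */
uint32_t iw_rfc3390(uint32_t mss)
{
    assert(mss > 0);
    return min_u32(4 * mss, max_u32(2 * mss, 4380));
}

/* RFC 6928 initial window in bytes: min(10*MSS, max(2*MSS, 14600)). */
uint32_t iw_rfc6928(uint32_t mss)
{
    assert(mss > 0);
    return min_u32(10 * mss, max_u32(2 * mss, 14600));
}
```

For an MSS of 1460 bytes this gives 4380 bytes (3 segments) under RFC 3390 and 14600 bytes (10 segments) under RFC 6928.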


As an aside, I am intending to follow up to the Feb discussion with a
patch that implements the basic infrastructure I proposed so that we can
continue that discussion.


Again I'm deeply concerned and opposed to giving end users direct control
over the IW value.  I've had and seen too many cases of totally bogus "tuning"
by cranking up random sysctls to insane values and then complaining about
FreeBSD being slow compared to Linux (and then ditching FreeBSD).

--
Andre


Cheers,
Lawrence

[1] http://lists.freebsd.org/pipermail/freebsd-net/2013-February/034698.html




Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)

2013-08-14 Thread Adrian Chadd
On 14 August 2013 04:47, Lev Serebryakov  wrote:


>   And we should invalidate this info on ARP/route changes, or connection
>  will be lost in such cases, am I right?.. So, on each such event code
>  should look into all sockets and check, if routing/ARP information is
> still
>  valid for them. Or we should store lists of sockets in routing and ARP
>  tables... I don't know, what is worse.
>

.. or per-CPU copies of the ARP table.. ?



-adrian


Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)

2013-08-14 Thread Peter Wemm
On Wed, Aug 14, 2013 at 11:11 AM, Adrian Chadd  wrote:
> On 14 August 2013 04:47, Lev Serebryakov  wrote:
>
>
>>   And we should invalidate this info on ARP/route changes, or connection
>>  will be lost in such cases, am I right?.. So, on each such event code
>>  should look into all sockets and check, if routing/ARP information is
>> still
>>  valid for them. Or we should store lists of sockets in routing and ARP
>>  tables... I don't know, what is worse.
>>
>
> .. or per-CPU copies of the ARP table.. ?

Local cache at each consumer and check a generation number to see if
it needs to be re-validated before using.  The obvious problem with
this though is that big networks tend to kill your caches.
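
A minimal single-threaded sketch of that idea, assuming a one-entry table and invented names throughout (struct arp_table, struct arp_cache and the arp_* functions are hypothetical, not kernel APIs). Real code would also need to order the data update against the generation bump.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical shared ARP-like table: one entry, one global generation. */
struct arp_table {
    atomic_uint gen;    /* bumped on every change to the table */
    uint8_t mac[6];     /* current link-layer address */
};

/* Per-consumer (e.g. per-socket) copy plus the generation it was read at. */
struct arp_cache {
    unsigned gen;
    bool valid;
    uint8_t mac[6];
};

int slow_lookups;       /* counts full lookups, for demonstration only */

void arp_table_init(struct arp_table *t, const uint8_t mac[6])
{
    atomic_init(&t->gen, 0);
    memcpy(t->mac, mac, 6);
}

/* Any change bumps the generation, which invalidates every consumer's cache. */
void arp_table_update(struct arp_table *t, const uint8_t mac[6])
{
    memcpy(t->mac, mac, 6);
    atomic_fetch_add(&t->gen, 1);
}

/* Use the cached copy unless the table generation has moved on. */
const uint8_t *arp_cached_lookup(struct arp_table *t, struct arp_cache *c)
{
    unsigned now = atomic_load(&t->gen);
    if (!c->valid || c->gen != now) {
        slow_lookups++;             /* stands in for the expensive lookup */
        memcpy(c->mac, t->mac, 6);
        c->gen = now;
        c->valid = true;
    }
    return c->mac;
}
```

The cache-killing problem is visible here: with a single generation number, one changed entry forces every consumer to redo its lookup, whether or not its own entry changed.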

-- 
Peter Wemm - pe...@wemm.org; pe...@freebsd.org; pe...@yahoo-inc.com; KI6FJV
UTF-8: for when a ' just won\342\200\231t do.
 ZFS must be the bacon of file systems. "everything's better with ZFS"


Re: TSO and FreeBSD vs Linux

2013-08-14 Thread Julian Elischer

On 8/14/13 3:23 PM, Lawrence Stewart wrote:

On 08/14/13 16:33, Julian Elischer wrote:


They switched to using an initial window of 10 segments some time ago.
FreeBSD starts with 3 or more recently, 10 if you're running recent
9-STABLE or 10-CURRENT.

I tried setting initial values as shown:
   net.inet.tcp.local_slowstart_flightsize: 10
   net.inet.tcp.slowstart_flightsize: 10
it didn't seem to make too much difference but I will redo the test.

Assuming this is still FreeBSD 8.0 as you mentioned out-of-band,
changing those variables without disabling rfc3390 will have no effect.

I think (check the driver code in question as I'm not sure) that if you
"ifconfig  lro" and the driver has hardware support or has been made
aware of our software implementation, it should DTRT.


so I ran on 9.2-beta ( a week or two old) and it had similar problems..
only worse.. 9.2 actually sends multiple packets when it doesn't need to..
http://people.freebsd.org/~julian/fbsd9.png



Cheers,
Lawrence






Re: TSO and FreeBSD vs Linux

2013-08-14 Thread Kevin Oberman
On Wed, Aug 14, 2013 at 12:46 PM, Julian Elischer wrote:

> On 8/14/13 3:23 PM, Lawrence Stewart wrote:
>
>> On 08/14/13 16:33, Julian Elischer wrote:
>>
>>>> They switched to using an initial window of 10 segments some time ago.
>>>> FreeBSD starts with 3 or more recently, 10 if you're running recent
>>>> 9-STABLE or 10-CURRENT.
>>>>
>>> I tried setting initial values as shown:
>>>    net.inet.tcp.local_slowstart_flightsize: 10
>>>    net.inet.tcp.slowstart_flightsize: 10
>>> it didn't seem to make too much difference but I will redo the test.
>>>
>> Assuming this is still FreeBSD 8.0 as you mentioned out-of-band,
>> changing those variables without disabling rfc3390 will have no effect.
>>
>> I think (check the driver code in question as I'm not sure) that if you
>> "ifconfig  lro" and the driver has hardware support or has been made
>> aware of our software implementation, it should DTRT.
>>
>
> so I ran on 9.2-beta ( a week or two old) and it had similar problems..
> only worse.. 9.2 actually sends multiple packets when it doesn't need to..
> http://people.freebsd.org/~julian/fbsd9.png
>

Ack! (Sorry) I could have sworn that this had been fixed. Has it been
re-broken?
-- 
R. Kevin Oberman, Network Engineer
E-mail: rkober...@gmail.com


Re: TCP Initial Window 10 MFC

2013-08-14 Thread Eggert, Lars
Hi,

On Aug 14, 2013, at 17:27, Lawrence Stewart wrote:
> Do you recall if they said
> how many flows made up the CDF?

I think "very many" - check out the audio archive or the minutes of the 
meeting, it should have the details.

Lars




Re: TCP Initial Window 10 MFC (was: Re: svn commit: r252789 - stable/9/sys/netinet)

2013-08-14 Thread Eggert, Lars
Oh: The other interesting bit is that Chrome defaulted to telling the server to 
use IW32 if it had no cached value...

I think Google are still heavily tweaking the mechanisms.

Lars

On Aug 14, 2013, at 16:46, "Eggert, Lars"  wrote:

> Hi,
> 
> On Aug 14, 2013, at 10:36, Lawrence Stewart  wrote:
>> I don't think this change should have been MFCed, at least not in its
>> current form.
> 
> FYI, Google's own data as presented in the HTTPBIS working group of the 
> recent Berlin IETF shows that 10 is too high for ~25% of their web 
> connections: see slide 2 of 
> http://www.ietf.org/proceedings/87/slides/slides-87-httpbis-5.pdf
> 
> (That slide shows a CDF of CWND values the server used at the end of a web 
> transaction.)
> 
> Lars





Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)

2013-08-14 Thread Luigi Rizzo
On Wed, Aug 14, 2013 at 12:40:19PM -0700, Peter Wemm wrote:
> On Wed, Aug 14, 2013 at 11:11 AM, Adrian Chadd  wrote:
> > On 14 August 2013 04:47, Lev Serebryakov  wrote:
> >
> >
> >>   And we should invalidate this info on ARP/route changes, or connection
> >>  will be lost in such cases, am I right?.. So, on each such event code
> >>  should look into all sockets and check, if routing/ARP information is
> >> still
> >>  valid for them. Or we should store lists of sockets in routing and ARP
> >>  tables... I don't know, what is worse.
> >>
> >
> > .. or per-CPU copies of the ARP table.. ?
> 
> Local cache at each consumer and check a generation number to see if
> it needs to be re-validated before using.  The obvious problem with
> this though is that big networks tend to kill your caches.

if you expect this to be problematic you can partition the entries
and use a different generation number per cluster.
Anyways if you really want to be guaranteed you need atomic
reads on the generation numbers (or ticks), which I have heard
are expensive on !i386/amd64 machines.

This is why I would probably try to live with races (which for
arp are a non problem).
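
The partitioning could look roughly like this. It is a sketch under assumed names: NCLUSTERS, the toy hash and the cluster_* functions are all hypothetical. It uses atomic loads, which is exactly the cost being questioned above; a version that tolerates races would read the counter without them.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NCLUSTERS 16u   /* hypothetical partition count */

static_assert((NCLUSTERS & (NCLUSTERS - 1)) == 0,
              "NCLUSTERS must be a power of two");

atomic_uint cluster_gen[NCLUSTERS];   /* static storage: starts at zero */

/* Map an IPv4 address (host order) to a cluster; a toy hash for illustration. */
unsigned cluster_of(uint32_t addr)
{
    return (addr ^ (addr >> 16)) & (NCLUSTERS - 1);
}

/* Called when the entry for addr changes: only its cluster is invalidated. */
void cluster_invalidate(uint32_t addr)
{
    atomic_fetch_add(&cluster_gen[cluster_of(addr)], 1);
}

/* Generation a consumer remembers alongside its cached copy. */
unsigned cluster_generation(uint32_t addr)
{
    return atomic_load(&cluster_gen[cluster_of(addr)]);
}
```

A consumer caching an entry for one address only has to revalidate when that address's cluster changes, so a busy network invalidates roughly 1/NCLUSTERS of the caches per update instead of all of them.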

cheers
luigi


Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)

2013-08-14 Thread Vijay Singh
Is that what FLOWTABLE does? Also we need a mechanism to record time spent at 
various layers in the stack. Luigi has used his own methods but we're lacking 
something more generic. At work we have some crude tools that use mcount 
information to indirectly measure costs but they are not reliable and only 
provide partial information.

Sent from my iPhone

On Aug 14, 2013, at 11:11 AM, Adrian Chadd  wrote:

> On 14 August 2013 04:47, Lev Serebryakov  wrote:
> 
> 
>>  And we should invalidate this info on ARP/route changes, or connection
>> will be lost in such cases, am I right?.. So, on each such event code
>> should look into all sockets and check, if routing/ARP information is
>> still
>> valid for them. Or we should store lists of sockets in routing and ARP
>> tables... I don't know, what is worse.
> 
> .. or per-CPU copies of the ARP table.. ?
> 
> 
> 
> -adrian


Re: TCP Initial Window 10 MFC

2013-08-14 Thread Lawrence Stewart
On 08/15/13 02:44, Andre Oppermann wrote:
> On 14.08.2013 04:36, Lawrence Stewart wrote:
>> Hi Andre,
>>
>> [RE team is BCCed so they're aware of this discussion]
>>
>> On 07/06/13 00:58, Andre Oppermann wrote:
>>> Author: andre
>>> Date: Fri Jul  5 14:58:24 2013
>>> New Revision: 252789
>>> URL: http://svnweb.freebsd.org/changeset/base/252789
>>>
>>> Log:
>>>MFC r242266:
>>>
>>> Increase the initial CWND to 10 segments as defined in IETF TCPM
>>> draft-ietf-tcpm-initcwnd-05. It explains why the increased initial
>>> window improves the overall performance of many web services without
>>> risking congestion collapse.
>>>
>>> As long as it remains a draft it is placed under a sysctl marking it
>>> as experimental:
>>>  net.inet.tcp.experimental.initcwnd10 = 1
>>> When it becomes an official RFC soon the sysctl will be changed to
>>> the RFC number and moved to net.inet.tcp.
>>>
>>> This implementation differs from the RFC draft in that it is a bit
>>> more conservative in the case of packet loss on SYN or SYN|ACK
>>> because
>>> we haven't reduced the default RTO to 1 second yet.  Also the
>>> restart
>>> window isn't yet increased as allowed.  Both will be adjusted with
>>> upcoming changes.
>>>
>>> It is enabled by default.  In Linux it is enabled since kernel 3.0.
>>
>> I haven't been fully alert to FreeBSD happenings this year so apologies
>> for bringing this up so long after the MFC.
>>
>> I don't think this change should have been MFCed, at least not in its
>> current form. Enabling the switch to IW=10 on a stable branch is
>> inappropriate IMO. I also think the "net.inet.tcp.experimental" sysctl
>> branch is poorly named as per the important discussion we had back in
>> February [1]. I would really prefer we didn't get stuck having to keep
>> it around by making a stable release with it being present.
>>
>> I think this commit should be backed out of stable/9 and more
>> importantly, 9.2-RELEASE.
> 
> Backing out the patch isn't really necessary, just flip the switch to
> off having it revert to the RFC5681 defaults.  Those who want it anyway
> can simply enable it again.

That doesn't address the sysctl tree naming concern or mechanism issue -
please refer back to the Feb discussion; specifically the proposal to
rename the experimental branch to "net.inet.tcp.nonstandard" and add an
"allowed" leaf which takes a list of non-standard behaviours to allow
tweaking in the stack.

Leaving the sysctl branch named "experimental" conveys that the things
which live under the branch are being evaluated in some way for becoming
a default, which is very different to "nonstandard" which conveys that
the user is twiddling things in a way which normally shouldn't be. IW=10
may become a FreeBSD default at some point, but the mechanism for
enabling it should be to specify the initial window as a value in
segments. Because that allows any non-standard value (IW=7, IW=50),
I strongly argue in favour of changing the branch name from
"experimental" to "nonstandard".

In order to continue this discussion in the context of what we started
in Feb, I still request that this change be backed out of releng/9.2 so
that 9.2-RELEASE doesn't ship with it. We can continue discussion for
its future in stable/9 and head after the backout so that 9.2 isn't
held up.

> IW10 has become RFC6928 (experimental) in April 2013.

Great for the draft authors, but irrelevant for this discussion.

>> As an aside, I am intending to follow up to the Feb discussion with a
>> patch that implements the basic infrastructure I proposed so that we can
>> continue that discussion.
> 
> Again I'm deeply concerned and opposed to giving end users direct control
> over the IW value.  I've had and seen too many cases of totally bogus
> "tuning"
> by cranking up random sysctls to insane values and then complaining about
> FreeBSD being slow compared to Linux (and then ditching FreeBSD).

Sorry, but referring to unspecified cases of stupidity resulting in loss
of unquantified numbers of users as a reason against providing a
controlled mechanism to change a default system parameter in a
potentially harmful way is not a rational argument.

Cheers,
Lawrence