On 01.02.2013 22:21, Kevin Day wrote:
We've got a large cluster of HTTP servers, each server handling >10,000
req/sec. Occasionally, and during periods of heavy load, we'd get
complaints from some users that downloads were working but going
EXTREMELY slowly. After a whole lot of debugging, we narrowed it down to
being only Windows 8 clients experiencing this problem. It turns out that
FreeBSD's implementation of syncookies is likely violating RFC1323.

When syncookies kicks in, either because the syncache limit is reached or
net.inet.tcp.syncookies_only is set, some shortcuts are taken with regard
to TCP connections.
Unlike some other syncookies implementations which (ab)use timestamps to
store options, the FreeBSD implementation of syncookies discards TCP
options such as window scaling. In itself this isn't a bad thing, but it
becomes a bad thing because we then lie and pretend that we are
supporting window scaling.

This is not true.  FreeBSD uses bits in the timestamp to encode all
recognized TCP options including window scaling.

According to RFC1323, if you want to use TCP window scaling, the client
says so on the initial SYN. If the server is also willing to use scaling,
it says so on the SYN/ACK. If both parties included a scaling option on
their respective SYN, you assume window scaling is working and proceed to
use it. If one or both parties don't have a scaling option, you don't
scale at all.
The problem here is that with syncookies, we don't save the wscale
parameter from the client's SYN, but offer to use window scaling anyway
on our SYN/ACK, so the client thinks we successfully negotiated window
scaling even though we haven't.
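
The negotiation rule described above can be written down as a minimal
Python sketch. The function name and the use of None for "option absent"
are illustrative assumptions, not anything from the FreeBSD source:

```python
from typing import Optional, Tuple

MAX_SHIFT = 14  # RFC 1323 limits the shift count to 14


def negotiate_wscale(syn_wscale: Optional[int],
                     synack_wscale: Optional[int]) -> Tuple[int, int]:
    """Return (client_shift, server_shift) in effect after the handshake.

    None means the wscale option was absent from that segment.
    Scaling is used only if BOTH the SYN and the SYN/ACK carried it.
    """
    if syn_wscale is None or synack_wscale is None:
        return (0, 0)
    return (min(syn_wscale, MAX_SHIFT), min(synack_wscale, MAX_SHIFT))


print(negotiate_wscale(4, 5))     # both offered: scaling is on
print(negotiate_wscale(4, None))  # server stayed silent: no scaling
```

Both endpoints have to reach the same answer from this rule. The problem
reported here is that the client evaluates it with the wscale from its own
SYN, while the server no longer has that value.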

The syncookie window scale is stored in the timestamp.  Of course
this becomes problematic as you describe when timestamps are not
active on a connection.

This is how a normal window scaled connection happens:

client > server: Flags [S], win 65535, options [mss 1460,nop,wscale 4,nop,nop,sackOK], length 0
(client is connecting, offering a window of 64K, but if scaling is
negotiated wants to scale future window sizes by 4 bits)

server > client: Flags [S.], win 65535, options [mss 1460,nop,wscale 5,sackOK,eol], length 0
(server is ACKing the client's SYN, also offering an unscaled window of
64K, but wanting to shift by 5 going forward)

The server and client both offered window scaling, so they're now using
it from this point on.
All window sizes sent/received are shifted by the appropriate number of
bits.

No timestamps.

When syncookies kicks in on the server, and the client is anything BUT
Windows 8, this happens:

client > server: Flags [S], win 65535, options [mss 1460,nop,wscale 4,nop,nop,sackOK], length 0
However, syncookies cause the options to get lost. The client sent the
"wscale 4" parameter, but we immediately forgot it.

server > client: Flags [S.], win 65535, options [mss 1460,nop,wscale 5,sackOK,eol], length 0
(server is ACKing the client's SYN, also offering an unscaled window of
64K, but wanting to shift by 5 going forward)

The server sent a wscale back on its SYN/ACK, so the client thinks window
scaling is now in effect. But it's not, the server didn't remember the
client's wscale option, so it's not scaling any of the received window
sizes that are coming in from the client. This doesn't actually hurt
much. The client thinks it's telling us it has a 1MB window open, but
we're only hearing that it's sent a 64K window, so that's all we ever
use. It's "failing safe" here, and nothing actually breaks.


Now throw Windows 8 into the mix. Windows 8's TCP auto tuning is much
more aggressive than previous versions of Windows. I honestly can't tell
if this is a bug or intentional design, but Windows will sometimes,
intermittently, advertise a much, much larger wscale option than it
actually needs. This is a mild example of what happens:

client > server: Flags [S], win 8192, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
(client is connecting, offering an unscaled window of 8192 bytes, but
wants to negotiate window scaling of 8 bits if the server will accept it)

server > client: Flags [S.], win 65535, options [mss 1460,nop,wscale 5,sackOK,eol], length 0
(server is ACKing the client's SYN, also offering an unscaled window of
64K, but wanting to shift by 5 going forward)

We're at the same point here as in the above example: the client now
believes we've successfully negotiated window scaling, but on the server
side we're treating all window sizes coming from the client as being
shifted by 0. So the client sends its first ACK:

client > server: Flags [.], seq 1, ack 1, win 256, length 0

The client believes we're still scaling everything it says by 8 bits, but
it only wants to give us a 64K window, so it's saying 256 here. (256<<8 =
65536). We don't remember that we agreed to shift everything by 8, so we
treat that as just 256. The connection now proceeds, but we think we can
only send 256 bytes at a time. It is extremely slow.

Yeah, that's bad.

I have seen Windows 8 attempt to use wscale parameters of 8 all the way
up to 10. While I've only caught a few cases of this happening in the
wild, when it's using 10 we end up thinking we only have a 64 byte window
and things get really silly really fast.

Indeed.
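
For reference, the window arithmetic for the shift counts mentioned above
can be reproduced with a few lines of Python. This is only a sketch of the
calculation; window_bytes is a made-up helper name and the 64K target
window is taken from the example in this thread:

```python
def window_bytes(win_field: int, shift: int) -> int:
    """Window in bytes for a 16-bit window field and a given shift count."""
    return win_field << shift


# The client wants to advertise 64K and believes its shift was accepted.
for client_shift in (8, 9, 10):
    win_field = 65536 >> client_shift               # value put on the wire
    intended = window_bytes(win_field, client_shift)
    seen = window_bytes(win_field, 0)               # server forgot the shift
    print(f"wscale {client_shift}: field={win_field} "
          f"client means {intended} bytes, server assumes {seen} bytes")
```

With a shift of 8 the server believes it has 256 bytes to work with, and
with a shift of 10 only 64 bytes, matching the numbers reported above.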

I've been talking with someone on Microsoft's side of things about why
Windows is choosing to do this. But my own view of this is that if
syncookies are being used in their current state (we lose the client's
wscale option), we can't advertise wscale on the SYN/ACK. My reading of
RFC1323 says that if we put a wscale option in our SYN/ACK that means we
agreed to use the client's wscale in their SYN. I don't think our current
behavior is correct. If syncookies are being used, we should advertise
MIN(sb_max, TCP_MAXWIN) with no scaling and stay within the RFC.

This doesn't affect Linux because it uses timestamp options to stuff the
client's wscale, so it gets re-learned on the ACK. OpenBSD and OS X don't
have syncookies. NetBSD seems to have the same problem if its new
syncookie implementation gets turned on.

This can't be because of the lack of timestamps.  Linux must be
encoding the scale in the ISS taking away bits from the cookie.

I haven't looked into how Linux actually does it recently.

Any thoughts? Was there a reason why we're forcing the use of wscale on
syncookie connections?

We can change the behavior of syncookies in a couple of ways to
deal with this problem:

 1/ send syncookies only when the syncache overflows and set wscale
    to 0 in the SYN-ACK when timestamps are not active.

 2/ move the wscale bits from timestamp encoding to the ISS taking
    bits away from the cookie.
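
To illustrate what 2/ means in terms of bits, here is a toy Python sketch
that carries the peer's wscale in the low bits of a 32-bit ISS and uses
the remaining bits for a keyed hash. This is NOT the FreeBSD cookie
format: the 4/28 bit split, the HMAC-SHA256 construction and the names
make_iss and check_iss are all assumptions for illustration. A real cookie
typically also has to carry an MSS index, which leaves even fewer bits for
the hash.

```python
import hashlib
import hmac
from typing import Optional

WSCALE_BITS = 4               # shift counts 0..14 fit in 4 bits
MAC_BITS = 32 - WSCALE_BITS   # what is left of the ISS for the hash


def _mac(secret: bytes, conn_id: bytes, wscale: int) -> int:
    """Keyed hash over the connection and wscale, cut down to MAC_BITS."""
    digest = hmac.new(secret, conn_id + bytes([wscale]), hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - MAC_BITS)


def make_iss(secret: bytes, conn_id: bytes, wscale: int) -> int:
    """Build a 32-bit ISS carrying the peer's wscale (hypothetical layout)."""
    if not 0 <= wscale <= 14:
        raise ValueError("wscale out of range")
    return (_mac(secret, conn_id, wscale) << WSCALE_BITS) | wscale


def check_iss(secret: bytes, conn_id: bytes, iss: int) -> Optional[int]:
    """Return the recovered wscale if the cookie verifies, else None."""
    wscale = iss & ((1 << WSCALE_BITS) - 1)
    if wscale > 14:
        return None
    if (iss >> WSCALE_BITS) != _mac(secret, conn_id, wscale):
        return None
    return wscale
```

Every bit given to the option is a bit taken from the hash, which is the
tradeoff between cookie bits and cookie strength.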

At the moment we send syncookies on every SYN-ACK and bump the oldest
entry from the syncache when it is full.  That results in potentially
every segment degrading to syncookie only.  The default values are
insufficient for such high loads.

In general at 10,000 connections per second you should significantly
increase the size of your syncache to 3 * conn/sec at least.

In the loader you can set these tunables:

net.inet.tcp.syncache.hashsize    = 2048
net.inet.tcp.syncache.bucketlimit = 32
net.inet.tcp.syncache.cachelimit  = 65536

These settings are a bit more complicated than they should be.
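
As a rough sanity check of such values against the 3 * conn/sec rule of
thumb, a small Python sketch follows. It assumes the usable number of
entries is bounded by both cachelimit and hashsize * bucketlimit; the
function names are made up for this example:

```python
def syncache_capacity(hashsize: int, bucketlimit: int, cachelimit: int) -> int:
    """Usable entries, assuming both limits apply (an assumption here)."""
    return min(cachelimit, hashsize * bucketlimit)


def is_large_enough(conn_per_sec: int, hashsize: int, bucketlimit: int,
                    cachelimit: int, factor: int = 3) -> bool:
    """True if the syncache holds at least factor * conn/sec entries."""
    return syncache_capacity(hashsize, bucketlimit, cachelimit) >= factor * conn_per_sec


# The values suggested above, checked against 10,000 connections/sec.
print(syncache_capacity(2048, 32, 65536))
print(is_large_enough(10000, 2048, 32, 65536))
```

With the values above, hashsize * bucketlimit equals cachelimit (2048 * 32
= 65536), which is comfortably above 3 * 10,000.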

Going forward I have to take a closer look at the modifications
in 1/ and 2/ and possibly combinations of both.  The main issue
is the tradeoff in cookie bits vs. cookie lifetime and how fast
the hash can be cracked these days.  OTOH too complicated a hash
would cost us significant CPU power too.

--
Andre