On 01.02.2013 22:21, Kevin Day wrote:
We've got a large cluster of HTTP servers, each server handling >10,000 req/sec.
Occasionally, and
during periods of heavy load, we'd get complaints from some users that
downloads were working but
going EXTREMELY slowly. After a whole lot of debugging, we narrowed it down to
being only Windows
8 clients experiencing this problem. It turns out that FreeBSD's implementation
of syncookies is
likely violating RFC1323.
When syncookies kicks in, either because the syncache limit is reached or
net.inet.tcp.syncookies_only is set, some shortcuts are taken with regard to
TCP connections.
Unlike some other syncookies implementations which (ab)use timestamps to store
options, the
FreeBSD implementation of syncookies discards TCP options such as window
scaling. In itself this
isn't a bad thing, but it becomes a bad thing because we then lie and pretend
that we are
supporting window scaling.
This is not true. FreeBSD uses bits in the timestamp to encode all
recognized TCP options including window scaling.
According to RFC1323, if you want to use TCP window scaling, the client says so
on the initial
SYN. If the server is also willing to use scaling, it says so on the SYN/ACK.
If both parties
included a scaling option on their respective SYN, you assume window scaling is
working and
proceed to use it. If one or both parties don't have a scaling option, you
don't scale at all.
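
To make the negotiation rule above concrete, here is a minimal sketch in Python. The function name and the use of None for "option absent" are illustrative assumptions, not code from any TCP stack.

```python
from typing import Optional, Tuple


def negotiate_wscale(client_syn_shift: Optional[int],
                     server_synack_shift: Optional[int]) -> Tuple[int, int]:
    """Return (shift for windows sent by the client, shift for windows sent by the server).

    None means the wscale option was absent from that segment (hypothetical
    convention for this sketch). Scaling is only in effect when both the SYN
    and the SYN/ACK carried the option; otherwise both shifts are 0.
    """
    if client_syn_shift is None or server_synack_shift is None:
        return (0, 0)
    return (client_syn_shift, server_synack_shift)


# The normal handshake from the trace: client offers 4, server offers 5.
print(negotiate_wscale(4, 5))
# Server leaves the option out of its SYN/ACK: nobody scales.
print(negotiate_wscale(4, None))
```

The point is that both ends must reach the same answer. A server that sends the option but has forgotten the client's shift breaks that symmetry.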
The problem here is that with syncookies, we don't save the wscale parameter
from the client's
SYN, but offer to use window scaling anyway on our SYN/ACK, so the client
thinks we successfully
negotiated window scaling even though we haven't.
The syncookie window scale is stored in the timestamp. Of course
this becomes problematic as you describe when timestamps are not
active on a connection.
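
As a rough illustration of the idea (not the actual FreeBSD bit layout, which is not reproduced here), the client's shift can ride in the low bits of the timestamp value we send, and comes back in the timestamp echo on the ACK. The 4-bit field and the function names are assumptions made for this sketch.

```python
from typing import Optional

WSCALE_BITS = 4                        # assumed field width; holds shifts 0..14
WSCALE_MASK = (1 << WSCALE_BITS) - 1


def encode_tsval(clock: int, client_wscale: int) -> int:
    """Overwrite the low bits of a 32-bit timestamp value with the client's shift."""
    if not 0 <= client_wscale <= 14:   # 14 is the largest shift RFC1323 allows
        raise ValueError("wscale out of range")
    return ((clock & ~WSCALE_MASK) | client_wscale) & 0xFFFFFFFF


def decode_wscale(tsecr: Optional[int]) -> Optional[int]:
    """Recover the shift from the echoed timestamp on the ACK.

    Returns None when the ACK has no timestamp option, which is exactly
    the case where the stored shift is lost.
    """
    if tsecr is None:
        return None
    return tsecr & WSCALE_MASK


tsval = encode_tsval(0xDEADBEEF, 8)
print(hex(tsval), decode_wscale(tsval), decode_wscale(None))
```

With timestamps the shift survives the round trip. Without them there is nothing to decode, so the server has no record of what the client asked for.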
This is how a normal window scaled connection happens:
client > server: Flags [S], win 65535, options [mss 1460,nop,wscale
4,nop,nop,sackOK], length 0
(client is connecting, offering a window of 64K, but if scaling is negotiated
wants to scale
future window sizes by 4 bits)
server > client: Flags [S.], win 65535, options [mss 1460,nop,wscale
5,sackOK,eol], length 0
(server is ACKing the client's SYN, also offering an unscaled window of 64K,
but wanting to shift
by 5 going forward)
The server and client both offered window scaling, so they're now using it from
this point on.
All window sizes sent/received are shifted by the appropriate number of bits.
No timestamps.
When syncookies kicks in on the server, and the client is anything BUT Windows
8, this happens:
client > server: Flags [S], win 65535, options [mss 1460,nop,wscale
4,nop,nop,sackOK], length 0
However, syncookies cause the options to get lost. The client sent the "wscale
4" parameter, but
we immediately forgot it.
server > client: Flags [S.], win 65535, options [mss 1460,nop,wscale
5,sackOK,eol], length 0
(server is ACKing the client's SYN, also offering an unscaled window of 64K,
but wanting to shift
by 5 going forward)
The server sent a wscale back on its SYN/ACK, so the client thinks window
scaling is now in
effect. But it's not, the server didn't remember the client's wscale option, so
it's not scaling
any of the received window sizes that are coming in from the client. This
doesn't actually hurt
much. The client thinks it's telling us it has a 1MB window open, but we're
only hearing that
it's sent a 64K window, so that's all we ever use. It's "failing safe" here,
and nothing actually
breaks.
Now throw Windows 8 into the mix. Windows 8's TCP auto tuning is much more
aggressive than
previous versions of Windows. I honestly can't tell if this is a bug or
intentional design, but
Windows will sometimes, intermittently, advertise a much much larger wscale
option than it
actually needs. This is a mild example of what happens:
client > server: Flags [S], win 8192, options [mss 1460,nop,wscale
8,nop,nop,sackOK], length 0
(client is connecting, offering an unscaled window of 8192 bytes, but wants to
negotiate window
scaling of 8 bits if the server will accept it)
server > client: Flags [S.], win 65535, options [mss 1460,nop,wscale
5,sackOK,eol], length 0
(server is ACKing the client's SYN, also offering an unscaled window of 64K,
but wanting to shift
by 5 going forward)
We're at the same point here as in the above example: the client now believes
we've successfully
negotiated window scaling, but on the server side we're treating all window
sizes coming from the
client as being shifted by 0. So the client sends its first ACK:
client > server: Flags [.], seq 1, ack 1, win 256, length 0
The client believes we're still scaling everything it says by 8 bits, but it
only wants to give
us a 64K window, so it's saying 256 here. (256<<8 = 65536). We don't remember
that we agreed to
shift everything by 8, so we treat that as just 256. The connection now
proceeds, but we think we
can only send 256 bytes at a time. It is extremely slow.
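
The arithmetic of that mismatch, as a small Python sketch. The helper name is made up for illustration; the numbers are the ones from the trace above.

```python
def effective_window(raw_window: int, shift: int) -> int:
    """Window size in bytes implied by the 16-bit window field and a scale shift."""
    return raw_window << shift


# What the client means by "win 256" after offering wscale 8.
client_intended = effective_window(256, 8)
# What a server that forgot the client's wscale takes it to mean.
server_assumed = effective_window(256, 0)

print(client_intended, server_assumed)
```

Both sides look at the same 16-bit field and disagree by a factor of 2**8 about how much data may be in flight.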
Yeah, that's bad.
I have seen Windows 8 attempt to use wscale parameters of 8 all the way up to 10.
While I've only
caught a few cases of this happening in the wild, when it's using 10 we end up
thinking we only
have a 64 byte window and things get really silly really fast.
Indeed.
I've been talking with someone on Microsoft's side of things about why Windows
is choosing to do
this. But my own view of this is that if syncookies are being used in their
current state (we
lose the client's wscale option), we can't advertise wscale on the SYN/ACK. My
reading of RFC1323
says that if we put a wscale option in our SYN/ACK that means we agreed to use
the client's
wscale in their SYN. I don't think that's correct. If syncookies are being
used, we should
advertise MIN(sb_max, TCP_MAXWIN) with no scaling and stay within the RFC.
This doesn't affect Linux because it uses timestamp options to stuff the
client's wscale, so it
gets re-learned on the ACK. OpenBSD and OS X don't have syncookies. NetBSD
seems to have the same
problem if its new syncookie implementation gets turned on.
This can't be because of the lack of timestamps. Linux must be
encoding the scale in the ISS taking away bits from the cookie.
I haven't looked into how Linux actually does it recently.
Any thoughts? Was there a reason why we're forcing the use of wscale on
syncookie connections?
We can change the behavior of syncookies in a couple of ways to
deal with this problem:
1/ send syncookies only when the syncache overflows and set wscale
to 0 in the SYN-ACK when timestamps are not active.
2/ move the wscale bits from timestamp encoding to the ISS taking
bits away from the cookie.
At the moment we send syncookies on every SYN-ACK and bump the oldest
entry from the syncache when it is full. That results in potentially
every segment degrading to syncookie only. The default values are
insufficient for such high loads.
In general at 10,000 connections per second you should significantly
increase the size of your syncache to 3 * conn/sec at least.
In the loader you can set these tunables:
net.inet.tcp.syncache.hashsize = 2048
net.inet.tcp.syncache.bucketlimit = 32
net.inet.tcp.syncache.cachelimit = 65536
These settings are a bit more complicated than they should be.
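
As a rough sizing aid, here is a sketch in Python that derives the three values from a connection rate using the 3 * conn/sec rule of thumb above. It assumes the hash size should be a power of two and that cachelimit is simply hashsize * bucketlimit; the helper itself is hypothetical and not part of any FreeBSD tool.

```python
def syncache_tunables(conn_per_sec: int, bucketlimit: int = 32, factor: int = 3) -> dict:
    """Pick the smallest power-of-two hashsize whose total capacity covers factor * conn/sec."""
    target = conn_per_sec * factor
    hashsize = 1
    while hashsize * bucketlimit < target:
        hashsize *= 2
    return {
        "net.inet.tcp.syncache.hashsize": hashsize,
        "net.inet.tcp.syncache.bucketlimit": bucketlimit,
        "net.inet.tcp.syncache.cachelimit": hashsize * bucketlimit,
    }


for name, value in syncache_tunables(10000).items():
    print(f"{name} = {value}")
```

For 10,000 conn/sec this gives the minimum that satisfies the rule. The values listed above (2048 / 32 / 65536) are one power of two larger, which leaves headroom for bursts.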
Going forward I have to take a closer look at the modifications
in 1/ and 2/ and possibly combinations of both. The main issue
is the tradeoff in cookie bits vs. cookie lifetime and how fast
the hash can be cracked these days. OTOH an overly complicated hash
would cost us significant CPU power too.
--
Andre