On 8/14/13 6:21 PM, Luigi Rizzo wrote:
On Wed, Aug 14, 2013 at 05:23:02PM +1000, Lawrence Stewart wrote:
On 08/14/13 16:33, Julian Elischer wrote:
On 8/14/13 11:39 AM, Lawrence Stewart wrote:
On 08/14/13 03:29, Julian Elischer wrote:
I have been tracking down a performance embarrassment on Amazon EC2, and
I think I have found it.
Let us please avoid conflating performance with throughput. The
behaviour you go on to describe as a performance embarrassment is
actually a throughput difference, and the FreeBSD behaviour you're
describing is essentially sacrificing throughput and CPU cycles for
lower latency. That may not be a trade-off you like, but it is an
important factor in this discussion.
...
Sure, there's nothing wrong with holding throughput up as a key
performance metric for your use case.
I'm just trying to pre-empt a discussion that focuses on one metric and
fails to consider the bigger picture.
...
I could see no latency reversion.
You wouldn't, because it would be practically invisible in the sorts of
tests/measurements you're doing. Our good friends over at HRT, on the
other hand, would be far more likely to care about latency on the order
of microseconds. Again, the use case matters a lot.
...
so, does "Software LRO" mean that LRO on the NIC should be ON or OFF to
see this?
I think (check the driver code in question as I'm not sure) that if you
"ifconfig <if> lro" and the driver has hardware support or has been made
aware of our software implementation, it should DTRT.
The "lower throughput than linux" that Julian was seeing is either
because of a slow (CPU-bound) sender or a slow receiver. Given that
the FreeBSD tx path is quite expensive (redoing route and ARP lookups
on every packet, etc.), I highly suspect the sender side is at fault.
if we send bigger packets then we do fewer lookups, do we not?
Ack coalescing, LRO and GRO are limited to the set of packets that you
receive in the same batch, which in turn is upper-bounded by the
interrupt moderation delay. Outside of simple benchmarks with only
a few flows, it is very unlikely that ack coalescing/LRO/GRO can merge
more than a few segments for the same flow.
But the real fix is in tcp_output.
In fact, it has never been the case that an ack (single or coalesced)
triggers an immediate transmission in the output path. We dealt with
this in the past (Silly Window Syndrome), and there is code that avoids
sending less than 1 MTU under appropriate conditions (there is more
data to push out anyway, NODELAY is not set, there are outstanding acks,
the window can open further). In all these cases there is no
reasonable way to perceive the difference in terms of latency.
If one really cares, e.g. the high-speed-trading example, this is
a non-issue, because any reasonable person would run with TCP_NODELAY
(and possibly disable interrupt moderation) and optimize for latency,
even on a per-flow basis.
In terms of coding effort, I suspect that by replacing the 1-MTU
limit (t_maxseg, I believe, is the variable we use in the SWS
avoidance code) with one maximum-sized TSO segment, we can probably
achieve good results with little programming effort.
Then the problem remains that we should keep a copy of the route and
ARP information in the socket instead of redoing the lookups on
every single transmission, as they consume some 25% of the time of
a sendto(), and probably even more when it comes to large TCP
segments, sendfile() and the like.
cheers
luigi
_______________________________________________
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"