From: Alexey Kuznetsov <[EMAIL PROTECTED]>
Date: Fri, 21 Jul 2006 02:59:08 +0400

> > Moving protocol (no matter if it is TCP or not) closer to user allows
> > naturally control the dataflow - when user can read that data(and _this_
> > is the main goal), user acks, when it can not - it does not generate
> > ack. In theory
> 
> To all that I remember, in theory absence of feedback leads
> to loss of control yet. The same is in practice, unfortunately.
> You must say that window is closed, otherwise sender is totally
> confused.

Correct, and too large delay even results in retransmits.  You can say
that RTT will be adjusted by delay of ACK, but if user context
switches cleanly at the beginning, resulting in near immediate ACKs,
and then blocks later, you will get spurious retransmits.  Alexey's
case of blocking on a disk write is a good example.  I really don't
like it when pure NULL data sinks are used for "benchmarking" these kinds
of things because real applications 1) touch the data, 2) do something
with that data, and 3) have some life outside of TCP!
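
To make the spurious retransmit case concrete, here is a rough sketch
of the usual SRTT/RTTVAR retransmit timer estimator.  It is not the
kernel's code: the struct, the function name and the floating point
arithmetic are made up for illustration, and the 200ms floor in the
usage note is an assumption passed in by the caller.

```c
/* Hypothetical RTT estimator state; times are in seconds. */
struct rtt_est {
    double srtt;      /* smoothed RTT */
    double rttvar;    /* smoothed mean deviation */
    int have_sample;  /* 0 until the first measurement arrives */
};

/* Feed one RTT measurement, return the resulting RTO.
 * Uses the standard gains: 1/8 for SRTT, 1/4 for RTTVAR,
 * and RTO = SRTT + 4 * RTTVAR, clamped from below by min_rto. */
static double rto_update(struct rtt_est *e, double rtt, double min_rto)
{
    double rto;

    if (!e->have_sample) {
        e->srtt = rtt;
        e->rttvar = rtt / 2;
        e->have_sample = 1;
    } else {
        double err = e->srtt - rtt;

        if (err < 0)
            err = -err;
        e->rttvar = 0.75 * e->rttvar + 0.25 * err;
        e->srtt = 0.875 * e->srtt + 0.125 * rtt;
    }
    rto = e->srtt + 4 * e->rttvar;
    return rto < min_rto ? min_rto : rto;
}
```

Feed it twenty 1ms samples with a 200ms floor and the RTO sits at the
floor.  If the receiver then blocks for 500ms before ACKing, the timer
has already fired; the estimator only catches up after that late
sample has been absorbed.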

If you optimize an application that does nothing with the data it
receives, you have likewise optimized nothing :-)

All this talk reminds me of one thing: how expensive tcp_ack() is.
And this expense has nothing to do with TCP really.  The main cost is
purging and freeing up the skbs which have been ACK'd in the
retransmit queue.

So tcp_ack() sort of inherits the cost of freeing a bunch of SKBs
which haven't been touched by the cpu in some time and are thus nearly
guaranteed to be cold in the cache.

This is the kind of work we could think about batching up and
deferring to the user context sleeping in some socket call.
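
A minimal userspace sketch of that batching, assuming a simple singly
linked retransmit queue.  Every name here (struct buf, rtx_ack,
rtx_reap) is hypothetical and none of it is real kernel API.  The ACK
path only unlinks the ACK'd buffers onto a pending list; the actual
freeing happens later, from whatever context calls rtx_reap().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an skb sitting on the retransmit queue. */
struct buf {
    struct buf *next;
    unsigned int end_seq;  /* sequence number just past this buffer */
};

struct rtx_queue {
    struct buf *head, *tail;   /* un-ACK'd data, in sequence order */
    struct buf *pending_free;  /* ACK'd, but not yet freed */
};

static void rtx_enqueue(struct rtx_queue *q, unsigned int end_seq)
{
    struct buf *b = malloc(sizeof(*b));

    assert(b);
    b->next = NULL;
    b->end_seq = end_seq;
    if (q->tail)
        q->tail->next = b;
    else
        q->head = b;
    q->tail = b;
}

/* ACK path: unlink fully ACK'd buffers, do not free them here.
 * The signed difference keeps the compare safe across wraparound. */
static int rtx_ack(struct rtx_queue *q, unsigned int ack_seq)
{
    int moved = 0;

    while (q->head && (int)(ack_seq - q->head->end_seq) >= 0) {
        struct buf *b = q->head;

        q->head = b->next;
        if (!q->head)
            q->tail = NULL;
        b->next = q->pending_free;
        q->pending_free = b;
        moved++;
    }
    return moved;
}

/* User context (e.g. inside a socket call): free the whole batch. */
static int rtx_reap(struct rtx_queue *q)
{
    int freed = 0;

    while (q->pending_free) {
        struct buf *b = q->pending_free;

        q->pending_free = b->next;
        free(b);
        freed++;
    }
    return freed;
}
```

Note the sketch still walks the list in the ACK path, so it only moves
the cost of the free itself, not the cost of touching each buffer
header.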

Also notice that the retransmit queue is potentially a good use of an
array, similar to VJ's netchannel lockless queue data structure. :)
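
For reference, an array-based lockless queue of that general kind can
be sketched as a single-producer, single-consumer ring.  This is only
an illustration written with C11 atomics; it is not VJ's code, and the
names and the ring size are invented.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8  /* must be a power of two */

struct ring {
    _Atomic unsigned int head;  /* written only by the producer */
    _Atomic unsigned int tail;  /* written only by the consumer */
    void *slot[RING_SIZE];
};

/* Producer side: returns false when the ring is full. */
static bool ring_put(struct ring *r, void *item)
{
    unsigned int head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned int tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SIZE)
        return false;
    r->slot[head % RING_SIZE] = item;
    /* Publish the slot contents before the new head becomes visible. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: returns NULL when the ring is empty. */
static void *ring_get(struct ring *r)
{
    unsigned int tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned int head = atomic_load_explicit(&r->head, memory_order_acquire);
    void *item;

    if (head == tail)
        return NULL;
    item = r->slot[tail % RING_SIZE];
    /* Hand the slot back to the producer only after reading it. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return item;
}
```

Each index has exactly one writer, which is what lets the two sides
run without a lock.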

BTW, notice that TSO makes this work touch less skb state.  TSO also
decreases cpu utilization.  I think these two things are no
coincidence. :-)

I have even toyed with the idea of eventually abstracting the
retransmit queue into a pure data representation.  The skb_shinfo()
page vector is very nearly this already.  Or, a less extreme idea:
we fully retain the huge TSO skbs and do not chop them up to create
smaller TSO frames.  Instead, we add an "offset" GSO attribute which
is used in the clones.

Calls to tso_fragment() would be replaced with pure clones and
adjustment of skb->len and the new "skb->gso_offset" in the clone.
The rest of the logic would remain identical, except that non-linear
data would start "skb->gso_offset" bytes into the skb_shinfo()
described area.
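
A toy userspace model of the idea, to show what "start gso_offset
bytes into the page vector" means in practice.  The structs below are
simplified stand-ins, not the real sk_buff layout, and gso_offset does
not exist in the tree; it is the proposed field.

```c
#include <string.h>

#define MAX_FRAGS 4

struct frag {
    const char *data;
    unsigned int size;
};

/* Shared, read-only payload description (stand-in for skb_shinfo()). */
struct shared_info {
    unsigned int nr_frags;
    struct frag frags[MAX_FRAGS];
};

/* Hypothetical clone: private len and gso_offset, shared page vector. */
struct clone {
    const struct shared_info *shinfo;
    unsigned int gso_offset;
    unsigned int len;
};

/* "Fragmenting" becomes bookkeeping only: no payload is split. */
static struct clone make_clone(const struct shared_info *si,
                               unsigned int offset, unsigned int len)
{
    struct clone c = { si, offset, len };

    return c;
}

/* Copy out the bytes the clone covers; returns the number copied. */
static unsigned int clone_copy_bits(const struct clone *c, char *out)
{
    unsigned int skip = c->gso_offset, left = c->len, copied = 0;

    for (unsigned int i = 0; i < c->shinfo->nr_frags && left; i++) {
        const struct frag *f = &c->shinfo->frags[i];
        unsigned int n;

        if (skip >= f->size) {   /* frag lies entirely before the offset */
            skip -= f->size;
            continue;
        }
        n = f->size - skip;
        if (n > left)
            n = left;
        memcpy(out + copied, f->data + skip, n);
        copied += n;
        left -= n;
        skip = 0;
    }
    return copied;
}
```

With two 6-byte frags "abcdef" and "ghijkl", a clone with offset 4 and
len 5 covers "efghi" without either frag being modified.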

In this way we could also set tp->xmit_size_goal to its maximum
possible value, always.  Actually, I was looking at this the other day
and this clamping of xmit_size_goal to 1/2 max_window is extremely
dubious.  In fact it's downright wrong: only the MSS needs this
limiting for sender side SWS avoidance.
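
Roughly what that would look like, as a sketch only.  The helper
names and the max_gso_bytes parameter are invented for illustration
and this is not the code in the tree: the MSS gets bounded by half of
the largest window the peer has advertised, while the size goal is
just the largest whole multiple of that MSS that fits in one TSO
frame.

```c
/* Sender side SWS avoidance: keep one segment within half the
 * largest window seen from the peer.  A max_window of 0 or 1 is
 * treated as "no usable bound" so the result never becomes 0. */
static unsigned int bound_mss(unsigned int mss, unsigned int max_window)
{
    unsigned int half = max_window / 2;

    if (half > 0 && mss > half)
        mss = half;
    return mss;
}

/* Size goal with no window clamp: as many full segments as fit in
 * max_gso_bytes (a hypothetical device/stack limit). */
static unsigned int size_goal(unsigned int mss_now, unsigned int max_gso_bytes)
{
    unsigned int segs = max_gso_bytes / mss_now;

    if (segs == 0)
        segs = 1;
    return segs * mss_now;
}
```

For example, with an MSS of 1460 and a 2000 byte max window the
bounded MSS is 1000, and the size goal for a 65536 byte limit is
65000 regardless of the window.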
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
