On 6/23/20, Neal Cardwell <ncardw...@google.com> wrote: > On Tue, Jun 23, 2020 at 10:54 AM Denis Kirjanov <k...@linux-powerpc.org> > wrote: >> >> there is a problem with the CWR flag set in an incoming ACK segment >> and it leads to the situation when the ECE flag is latched forever >> >> the following packetdrill script shows what happens: >> >> // Stack receives incoming segments with CE set >> +0.1 <[ect0] . 11001:12001(1000) ack 1001 win 65535 >> +0.0 <[ce] . 12001:13001(1000) ack 1001 win 65535 >> +0.0 <[ect0] P. 13001:14001(1000) ack 1001 win 65535 >> >> // Stack repsonds with ECN ECHO >> +0.0 >[noecn] . 1001:1001(0) ack 12001 >> +0.0 >[noecn] E. 1001:1001(0) ack 13001 >> +0.0 >[noecn] E. 1001:1001(0) ack 14001 >> >> // Write a packet >> +0.1 write(3, ..., 1000) = 1000 >> +0.0 >[ect0] PE. 1001:2001(1000) ack 14001 >> >> // Pure ACK received >> +0.01 <[noecn] W. 14001:14001(0) ack 2001 win 65535 >> >> // Since CWR was sent, this packet should NOT have ECE set >> >> +0.1 write(3, ..., 1000) = 1000 >> +0.0 >[ect0] P. 2001:3001(1000) ack 14001 >> // but Linux will still keep ECE latched here, with packetdrill >> // flagging a missing ECE flag, expecting >> // >[ect0] PE. 2001:3001(1000) ack 14001 >> // in the script >> >> In the situation above we will continue to send ECN ECHO packets >> and trigger the peer to reduce the congestion window. >> >> Signed-off-by: Denis Kirjanov <denis.kirja...@suse.com> >> --- >> net/ipv4/tcp_input.c | 9 ++++++--- >> 1 file changed, 6 insertions(+), 3 deletions(-) >> >> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c >> index 12fda8f27b08..e00b88c8598d 100644 >> --- a/net/ipv4/tcp_input.c >> +++ b/net/ipv4/tcp_input.c >> @@ -3696,8 +3696,13 @@ static int tcp_ack(struct sock *sk, const struct >> sk_buff *skb, int flag) >> &rexmit); >> } >> >> - if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP)) >> + if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP)) { >> + /* If we miss the CWR flag then ECE will be >> + * latched forever. >> + */ >> + tcp_ecn_accept_cwr(sk, skb); >> sk_dst_confirm(sk); >> + } >> >> delivered = tcp_newly_delivered(sk, delivered, flag); >> lost = tp->lost - lost; /* freshly marked lost */ >> @@ -4800,8 +4805,6 @@ static void tcp_data_queue(struct sock *sk, struct >> sk_buff *skb) >> skb_dst_drop(skb); >> __skb_pull(skb, tcp_hdr(skb)->doff * 4); >> >> - tcp_ecn_accept_cwr(sk, skb); >> - >> tp->rx_opt.dsack = 0; >> >> /* Queue data for delivery to the user. >> -- >> 2.27.0 >> > > Hi Denis - > > Thanks for this patch!
Hi Neal, > A few thoughts: > > (1) Richard Scheffenegger (cc-ed) raised this issue in January. > IMHO The Linux TCP code is RFC-compliant here. If you have a > sender that is sending CWR on pure ACKs, then that sender is > violating RFC3168, which specifies that CWR should only sent on > data segments, so that the sender can be sure that there is a > cwnd reduction even if a packet with CWR set is lost: > > RFC3168 says: > > """ > When an ECN-Capable TCP sender reduces its congestion window for > any reason (because of a retransmit timeout, a Fast Retransmit, > or in response to an ECN Notification), the TCP sender sets the > CWR flag in the TCP header of the first new data packet sent > after the window reduction. If that data packet is dropped in > the network, then the > ... > sending TCP will have to reduce the congestion window again and > retransmit the dropped packet. > > We ensure that the "Congestion Window Reduced" information is > reliably delivered to the TCP receiver. This comes about from > the fact that if the new data packet carrying the CWR flag is > dropped, then the TCP sender will have to again reduce its > congestion window, and send another new data packet with the CWR > flag set. Thus, the CWR bit in the TCP header SHOULD NOT be set > on retransmitted packets. > > When the TCP data sender is ready to set the CWR bit after > reducing the congestion window, it SHOULD set the CWR bit only on > the first new data packet that it transmits. > """ > > (2) Even given (1), I would agree with Richard's point that TCP > receivers should accept CWR on pure ACKs, per the rationale of > Postel's robustness principle ("Be liberal in what you accept, > and conservative in what you send.") Here we should be > liberal and accept the CWR in the pure ACK, particularly > because we know that at least one class of widely-deployed TCP > stacks (*BSD stacks) do this. > > (3) Given (2), I don't think this is the proper place to accept a > CWR. That point in the code is not reached by a connection > that is only currently receiving data, because of the lines > above that say: > > if (!prior_packets) > goto no_queue; Right, here we don't loose a segment > > Accepting CWR is something a data receiver does (to respond to > a data sender that is signalling that it has slowed down). So > even though I agree we should make the code more > liberal/robust in acepting CWR on pure ACK, I don't think this > is the right spot to insert the code. > > (4) I would say this is an interoperability bug fix, so this > should be explicitly targetd to the "net" branch. Sure > > (5) If you are able and have time, it would be nice to post the > full packetdrill script, either in the commit message, or as a > pull request to the packetdrill github project, so that the > Linux community may benefit from the full test case. Initially I thought that we do have packetdrill scripts in the kernel but we don't. Sure, I'll post a patch. Actually I've sent a different patch before (regarding gcc 10 issue) and looks like the mailing list doesn't work. Do I have to do that through gihub? > > Putting all that together, I think this patch should add a > comment that this code snippet is accepting non-RFC3168-compliant > behavior for the sake of robustness to known deployed stacks, and > should put that code in a place reached by data receivers. I > would suggest perhaps something like: Thanks for review! > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index 12fda8f27b08..717475b6f5c3 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -3665,6 +3665,13 @@ static int tcp_ack(struct sock *sk, const > struct sk_buff *skb, int flag) > tcp_in_ack_event(sk, ack_ev_flags); > } > > + /* An RFC3168-compliant sender should only set CWR on data > segments. > + * ("it SHOULD set the CWR bit only on the first new data packet > that > + * it transmits." However, we accept CWR on pure ACKs to be more > robust > + * with widely-deployed TCP implementations that do this. > + */ > + tcp_ecn_accept_cwr(sk, skb); > + > /* We passed data and got it acked, remove any soft error > * log. Something worked... > */ > @@ -4800,8 +4807,6 @@ static void tcp_data_queue(struct sock *sk, > struct sk_buff *skb) > skb_dst_drop(skb); > __skb_pull(skb, tcp_hdr(skb)->doff * 4); > > - tcp_ecn_accept_cwr(sk, skb); > - > tp->rx_opt.dsack = 0; > > /* Queue data for delivery to the user. > --- > > best, > neal >