On Mon, Sep 18, 2017 at 10:18 AM, Yuchung Cheng <ych...@google.com> wrote:
> On Sun, Sep 17, 2017 at 11:43 AM, Oleksandr Natalenko
> <oleksa...@natalenko.name> wrote:
>> Hi.
>>
>> Just to note that it looks like disabling RACK and re-enabling FACK prevents
>> warning from happening:
>>
>> net.ipv4.tcp_fack = 1
>> net.ipv4.tcp_recovery = 0
>>
>> Hope I get semantics of these tunables right.
> Thanks.
>
> One difference between RACK and FACK is that RACK can detect lost
> retransmission in CA_Recovery (fast recovery) and CA_Loss (post RTO)
> mode, while the current FACK can not. A previous FACK version can also
> detect lost retransmission in CA_recovery with limited-transmit. I
> suspect it is RACK's special ability that triggers this warning.
>
> IMO, however, this warning itself is questionably valid: with undo
> (TCP Eifel), the sender can detect and revert a false CA_Recovery /
> CA_Loss to CA_Open, with spurious retransmission in-flight
> (tp->retrans_out > 0). Then another SACK after undo triggers this
> warning. Neal and I are not sure if this is causing the panics you're
> seeing, but personally I'd argue this warning is false, or at least
> should be revised to skip undo case.

Can you try this patch to verify my theory with tcp_recovery=0 and 1? thanks

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5af2f04f8859..9253d9ee7d0e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2381,6 +2381,7 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, bool unmark_loss)
 	}
 	tp->snd_cwnd_stamp = tcp_time_stamp;
 	tp->undo_marker = 0;
+	WARN_ON(tp->retrans_out);
 }

>
>
>>
>> On Friday, 15 September 2017 21:04:36 CEST Oleksandr Natalenko wrote:
>>> Hello.
>>>
>>> With net.ipv4.tcp_fack set to 0 the warning still appears:
>>>
>>> ===
>>> » sysctl net.ipv4.tcp_fack
>>> net.ipv4.tcp_fack = 0
>>>
>>> » LC_TIME=C dmesg -T | grep WARNING
>>> [Fri Sep 15 20:40:30 2017] WARNING: CPU: 1 PID: 711 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert+0x7c8/0x990
>>> [Fri Sep 15 20:40:30 2017] WARNING: CPU: 0 PID: 711 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert+0x7c8/0x990
>>> [Fri Sep 15 20:48:37 2017] WARNING: CPU: 1 PID: 711 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert+0x7c8/0x990
>>> [Fri Sep 15 20:48:55 2017] WARNING: CPU: 0 PID: 711 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert+0x7c8/0x990
>>>
>>> » ps -up 711
>>> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
>>> root       711  4.3  0.0      0     0 ?        S    18:12   7:23 [irq/123-enp3s0]
>>> ===
>>>
>>> Any suggestions?
>>>
>>> On Friday, 15 September 2017 16:03:00 CEST Neal Cardwell wrote:
>>> > Thanks for testing that. That is a very useful data point.
>>> >
>>> > I was able to cook up a packetdrill test that could put the connection
>>> > in CA_Disorder with retransmitted packets out, but not in CA_Open. So
>>> > we do not yet have a test case to reproduce this.
>>> >
>>> > We do not see this warning on our fleet at Google. One significant
>>> > difference I see between our environment and yours is that it seems
>>> > you run with FACK enabled:
>>> >
>>> >   net.ipv4.tcp_fack = 1
>>> >
>>> > Note that FACK was disabled by default (since it was replaced by RACK)
>>> > between kernel v4.10 and v4.11. And this is exactly the time when this
>>> > bug started manifesting itself for you and some others, but not our
>>> > fleet. So my new working hypothesis would be that this warning is due
>>> > to a behavior that only shows up in kernels >=4.11 when FACK is
>>> > enabled.
>>> >
>>> > Would you be able to disable FACK ("sysctl net.ipv4.tcp_fack=0" at
>>> > boot, or net.ipv4.tcp_fack=0 in /etc/sysctl.conf, or equivalent),
>>> > reboot, and test the kernel for a few days to see if the warning still
>>> > pops up?
>>> >
>>> > thanks,
>>> > neal
>>> >
>>> > [ps: apologies for the previous, mis-formatted post...]
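
For readers following the undo theory in this thread, here is a toy userspace sketch of the sequence it describes: recovery starts, a retransmission goes out, undo reverts the state to CA_Open while that retransmission is still in flight, and a check on retrans_out in CA_Open would then fire. This is not kernel code. Every toy_* name is made up for illustration, and the warning condition is an assumption based on the description in this thread ("CA_Open with tp->retrans_out > 0"), not a copy of tcp_fastretrans_alert().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the kernel's congestion-avoidance states. */
enum toy_ca_state { TOY_CA_OPEN, TOY_CA_DISORDER, TOY_CA_RECOVERY, TOY_CA_LOSS };

/* Hypothetical, heavily reduced model of the fields named in the thread. */
struct toy_tcp {
	enum toy_ca_state ca_state;
	unsigned int retrans_out;  /* retransmitted segments still in flight */
	unsigned int undo_marker;  /* non-zero while an undo is still possible */
};

/* Loss is suspected: enter fast recovery and remember that undo is possible. */
static void toy_enter_recovery(struct toy_tcp *tp)
{
	tp->ca_state = TOY_CA_RECOVERY;
	tp->undo_marker = 1;
}

/* One retransmission leaves the sender and is now in flight. */
static void toy_retransmit(struct toy_tcp *tp)
{
	tp->retrans_out++;
}

/*
 * Undo (Eifel-style): the recovery was judged spurious, so the state goes
 * back to Open. The retransmission already sent is still in the network,
 * so retrans_out is deliberately left untouched here.
 */
static void toy_undo(struct toy_tcp *tp)
{
	assert(tp->undo_marker != 0);
	tp->undo_marker = 0;
	tp->ca_state = TOY_CA_OPEN;
}

/* Assumed warning condition: in Open state with retransmissions out. */
static bool toy_warning_would_fire(const struct toy_tcp *tp)
{
	return tp->ca_state == TOY_CA_OPEN && tp->retrans_out != 0;
}
```

In this model the check stays quiet during recovery and only trips once toy_undo() has run with retrans_out still non-zero, which is the case the proposed WARN_ON in tcp_undo_cwnd_reduction() is meant to catch earlier.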