On 7/10/19 8:23 PM, Prout, Andrew - LLSC - MITLL wrote:
> On 6/17/19 8:19 PM, Christoph Paasch wrote:
>>
>> Yes, this does the trick for my packetdrill-test.
>>
>> I wonder, is there a way we could end up in a situation where we can't
>> retransmit anymore?
>> For example, sk_wmem_queued has grown so much that the new test fails.
>> Then, if we legitimately need to fragment in __tcp_retransmit_skb() we
>> won't be able to do so. So we will never retransmit. And if no ACK
>> comes back in to make some room we are stuck, no?
>
> We seem to be having exactly this problem. We’re running on the 4.14 branch.
> After recently updating our kernel, we’ve been having a problem with TCP
> connections stalling / dying off without disconnecting. They're stuck and
> never recover.
>
> I bisected the problem to 4.14.127 commit
> 9daf226ff92679d09aeca1b5c1240e3607153336 (commit
> f070ef2ac66716357066b683fb0baf55f8191a2e upstream): tcp: tcp_fragment()
> should apply sane memory limits. That led me to this thread.
>
> Our environment is a supercomputing center: lots of servers interconnected
> with a non-blocking 10 Gbit Ethernet network. We’ve zeroed in on the problem
> in two situations: remote users on VPN accessing large files via Samba and
> compute jobs using Intel MPI over TCP/IP/Ethernet. It certainly affects other
> situations too: many of our workloads have been unstable since this patch
> went into production, but those are the two we clearly identified because
> they fail reliably every time. We had to take the system down for unscheduled
> maintenance to roll back to an older kernel.
>
> The TCPWqueueTooBig count is incrementing when the problem occurs.
>
> Using ftrace/trace-cmd on an affected process, it appears the call stack is:
> run_timer_softirq
> expire_timers
> call_timer_fn
> tcp_write_timer
> tcp_write_timer_handler
> tcp_retransmit_timer
> tcp_retransmit_skb
> __tcp_retransmit_skb
> tcp_fragment
>
> Andrew Prout
> MIT Lincoln Laboratory Supercomputing Center
>
Exactly which kernel version were you running?
This problem is supposed to be fixed in v4.14.131.
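
If it helps when collecting that information, here is a minimal sketch (not
something from the patch itself) that reads the TCPWqueueTooBig counter and
compares the running release against 4.14.131. It assumes the usual
/proc/net/netstat layout, where a "TcpExt:" row of names is followed by a
"TcpExt:" row of values. The helper names (parse_tcpext, kernel_tuple,
fixed_in_4_14, check_host) are made up for illustration, and the 131 cutoff
is only the version quoted above, not something the script verifies.

```python
import platform
import re


def parse_tcpext(netstat_text):
    """Return the TcpExt counters from /proc/net/netstat content as a dict."""
    lines = netstat_text.splitlines()
    # Each group is two consecutive lines with the same prefix:
    # first the counter names, then the counter values.
    for header, values in zip(lines, lines[1:]):
        if header.startswith("TcpExt:") and values.startswith("TcpExt:"):
            names = header.split()[1:]
            numbers = values.split()[1:]
            return dict(zip(names, map(int, numbers)))
    return {}


def kernel_tuple(release):
    """Turn a release such as '4.14.127-1.el7' into (4, 14, 127)."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return tuple(int(part or 0) for part in m.groups())


def fixed_in_4_14(release, fixed_patchlevel=131):
    """True/False for 4.14.y kernels, None for any other branch.

    Assumption: the fix landed in 4.14.131, as stated in this thread.
    """
    major, minor, patch = kernel_tuple(release)
    if (major, minor) != (4, 14):
        return None
    return patch >= fixed_patchlevel


def check_host(netstat_path="/proc/net/netstat"):
    """Print the running release and the current TCPWqueueTooBig value."""
    release = platform.release()
    with open(netstat_path) as f:
        counters = parse_tcpext(f.read())
    print("kernel release:", release)
    print("has 4.14.131 fix:", fixed_in_4_14(release))
    # The counter is absent on kernels that predate the tcp_fragment() limit.
    print("TCPWqueueTooBig:", counters.get("TCPWqueueTooBig", "not present"))
```

Calling check_host() twice, before and after a stall is reproduced, should
show whether the counter moves while the connection is stuck.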