On 7/10/19 8:23 PM, Prout, Andrew - LLSC - MITLL wrote:
> On 6/17/19 8:19 PM, Christoph Paasch wrote:
>>
>> Yes, this does the trick for my packetdrill-test.
>>
>> I wonder, is there a way we could end up in a situation where we can't
>> retransmit anymore?
>> For example, sk_wmem_queued has grown so much that the new test fails.
>> Then, if we legitimately need to fragment in __tcp_retransmit_skb() we
>> won't be able to do so. So we will never retransmit. And if no ACK
>> comes back in to make some room we are stuck, no?
> 
> We seem to be having exactly this problem. We’re running on the 4.14 branch. 
> After recently updating our kernel, we’ve been having a problem with TCP 
> connections stalling / dying off without disconnecting. They're stuck and 
> never recover.
> 
> I bisected the problem to 4.14.127 commit 
> 9daf226ff92679d09aeca1b5c1240e3607153336 (commit 
> f070ef2ac66716357066b683fb0baf55f8191a2e upstream): tcp: tcp_fragment() 
> should apply sane memory limits. That led me to this thread.
> 
> Our environment is a supercomputing center: lots of servers interconnected 
> with a non-blocking 10Gbit ethernet network. We’ve zeroed in on the problem 
> in two situations: remote users on VPN accessing large files via samba and 
> compute jobs using Intel MPI over TCP/IP/ethernet. It certainly affects other 
> situations; many of our workloads have been unstable since this patch went 
> into production, but those are the two we clearly identified as they fail 
> reliably every time. We had to take the system down for unscheduled 
> maintenance to roll back to an older kernel.
> 
> The TCPWqueueTooBig count is incrementing when the problem occurs.
> 
> Using ftrace/trace-cmd on an affected process, it appears the call stack is:
> run_timer_softirq
> expire_timers
> call_timer_fn
> tcp_write_timer
> tcp_write_timer_handler
> tcp_retransmit_timer
> tcp_retransmit_skb
> __tcp_retransmit_skb
> tcp_fragment
> 
> Andrew Prout
> MIT Lincoln Laboratory Supercomputing Center
> 

What exact kernel version did you use?

This problem is supposed to be fixed in v4.14.131.
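
To check whether TCPWqueueTooBig is still incrementing on a given kernel, something like the sketch below may help. It is only an illustration: it assumes the counter is exported on the TcpExt lines of /proc/net/netstat (a header line of names followed by a line of values), and the function name read_tcpext_counter is made up for this example.

```python
import os


def read_tcpext_counter(text, name="TCPWqueueTooBig"):
    """Return the named TcpExt counter from netstat-style text, or None.

    Assumes the layout used by /proc/net/netstat: each protocol block is
    a line of counter names followed by a line of values, both starting
    with the same prefix (for example "TcpExt:").
    """
    lines = text.splitlines()
    # Walk the file two lines at a time: names line, then values line.
    for header, values in zip(lines[::2], lines[1::2]):
        if not header.startswith("TcpExt:"):
            continue
        names = header.split()[1:]
        numbers = values.split()[1:]
        table = dict(zip(names, numbers))
        if name in table:
            return int(table[name])
    # Counter not present (for example, a kernel without the patch).
    return None


if __name__ == "__main__":
    path = "/proc/net/netstat"  # assumed location of the TcpExt counters
    if os.path.exists(path):
        with open(path) as f:
            print("TCPWqueueTooBig:", read_tcpext_counter(f.read()))
    else:
        print(path, "not available on this system")
```

Running it before and after reproducing the stall, and comparing the two values, would show whether the tcp_fragment() limit is being hit during the test.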
