Note this is mostly me talking to myself, but in case anyone else hits the same issues, I might as well post my meagre progress.
Dominique Martinet wrote on Fri, Apr 06, 2018:
> (current kernel: vanilla 4.14.29)

Reproduced in a Fedora VM on that host with a 4.16.1-300.fc28.x86_64 kernel, since this one has DEBUG_INFO=y and I was lazy (but I haven't seen any patch about that kind of stuff recently, so it is probably still valid).

The other main difference is the qdisc: the VM is using fq_codel, while the host is directly on wireless, so mq with 4 pfifo_fast queues. It is harder to reproduce on the VM (even on another VM with the same kernel), so I'd put the difference down to the qdisc, but I was ultimately able to reproduce with both.

(Update: it actually was still fairly easy to reproduce until later in the day (coworkers left?), going from ~5-15s to reproduce to multiple minutes, so this likely depends a lot on net quality. I couldn't reproduce it by fiddling with netem on a local network, though...)

> [please see previous email for setup/tcpdump output]

So I have a crash dump with a socket / inet_sock that is blocked; since this is a VM I even get gdb as a bonus.

With the crash dump, I can confirm that the socket is not available for writing (sk->sk_sndbuf - sk->sk_wmem_queued < sk->sk_wmem_queued >> 1), but that doesn't help much if I can't tell why we're not taking acks in.

With gdb, I set a breakpoint on tcp_ack (net/ipv4/tcp_input.c), as I think that's the function that should handle my ack, and it does get hit by the replayed ack.

First, abusing 'next', the flow seems to be something like this (I folded the if/else branches not taken):

static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
{
        struct inet_connection_sock *icsk = inet_csk(sk);
        struct tcp_sock *tp = tcp_sk(sk);
        struct tcp_sacktag_state sack_state;
        struct rate_sample rs = { .prior_delivered = 0 };
        u32 prior_snd_una = tp->snd_una;
        bool is_sack_reneg = tp->is_sack_reneg;
        u32 ack_seq = TCP_SKB_CB(skb)->seq;
        u32 ack = TCP_SKB_CB(skb)->ack_seq;
        bool is_dupack = false;
        int prior_packets = tp->packets_out;
        u32 delivered = tp->delivered;
        u32 lost = tp->lost;
        int rexmit = REXMIT_NONE; /* Flag to (re)transmit to recover losses */
        u32 prior_fack;

        sack_state.first_sackt = 0;
        sack_state.rate = &rs;

        /* We very likely will need to access rtx queue. */
        prefetch(sk->tcp_rtx_queue.rb_node);

        /* If the ack is older than previous acks
         * then we can probably ignore it.
         */
        if (before(ack, prior_snd_una)) { }

        /* If the ack includes data we haven't sent yet, discard
         * this segment (RFC793 Section 3.9).
         */
        if (after(ack, tp->snd_nxt))

        if (after(ack, prior_snd_una)) { }

        prior_fack = tcp_is_sack(tp) ? tcp_highest_sack_seq(tp) : tp->snd_una;
        rs.prior_in_flight = tcp_packets_in_flight(tp);

        /* ts_recent update must be made after we are sure that the packet
         * is in window.
         */
        if (flag & FLAG_UPDATE_TS_RECENT)

        if (!(flag & FLAG_SLOWPATH) && after(ack, prior_snd_una)) {
        } else {
                u32 ack_ev_flags = CA_ACK_SLOWPATH;

                if (ack_seq != TCP_SKB_CB(skb)->end_seq)
                else
                        NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPPUREACKS);

                flag |= tcp_ack_update_window(sk, skb, ack, ack_seq);

                if (TCP_SKB_CB(skb)->sacked)

                if (tcp_ecn_rcv_ecn_echo(tp, tcp_hdr(skb))) { }

                if (flag & FLAG_WIN_UPDATE)

                tcp_in_ack_event(sk, ack_ev_flags);
        }

        /* We passed data and got it acked, remove any soft error
         * log. Something worked...
         */
        sk->sk_err_soft = 0;
        icsk->icsk_probes_out = 0;
        tp->rcv_tstamp = tcp_jiffies32;
        if (!prior_packets)
                goto no_queue;

no_queue:
        /* If data was DSACKed, see if we can undo a cwnd reduction. */
        if (flag & FLAG_DSACKING_ACK)
                tcp_fastretrans_alert(sk, prior_snd_una, is_dupack, &flag, &rexmit);

        /* If this ack opens up a zero window, clear backoff.  It was
         * being used to time the probes, and is probably far higher than
         * it needs to be for normal retransmission.
         */
        tcp_ack_probe(sk);

        if (tp->tlp_high_seq)
                tcp_process_tlp_ack(sk, ack, flag);

        return 1;

And here is 'info locals' towards the end of the function:

icsk = 0xffff88001740b800
tp = 0xffff88001740b800
sack_state = {reord = 84, first_sackt = 0, last_sackt = 0, rate = 0xffff88003fc83ae0, flag = 0, mss_now = 0}
rs = {prior_mstamp = 0, prior_delivered = 0, delivered = 0, interval_us = 0, rtt_us = 0, losses = 0, acked_sacked = 0, prior_in_flight = 0, is_app_limited = false, is_retrans = false, is_ack_delayed = false}
prior_snd_una = 2896339291
is_sack_reneg = false
ack_seq = 2651638615
ack = 2896339291
is_dupack = false
prior_packets = 0
delivered = 172
lost = 0
rexmit = 0
prior_fack = 2896339291
__vpp_verify = <optimized out>
pao_ID__ = <optimized out>
pao_tmp__ = <optimized out>
pao_ID__ = <optimized out>
pao_tmp__ = <optimized out>
pao_ID__ = <optimized out>
pao_tmp__ = <optimized out>
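As a quick sanity check on those numbers, here is a tiny standalone userspace sketch (my own code, not the kernel's; the seq_before/seq_after names are mine, but the comparison is the same idea as the before()/after() helpers in include/net/tcp.h), plugging in the ack and prior_snd_una values from the dump:

/* standalone sketch, not kernel code: wrap-safe sequence comparison as in
 * before()/after() from include/net/tcp.h, fed the dump's values */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool seq_before(uint32_t seq1, uint32_t seq2)
{
        /* signed difference handles 32-bit sequence number wraparound */
        return (int32_t)(seq1 - seq2) < 0;
}
#define seq_after(seq2, seq1)   seq_before(seq1, seq2)

int main(void)
{
        uint32_t ack = 2896339291u;             /* from 'info locals' above */
        uint32_t prior_snd_una = 2896339291u;   /* from 'info locals' above */

        /* both print 0: the ack is neither discarded as old nor does it
         * count as advancing snd_una */
        printf("before(ack, prior_snd_una) = %d\n", seq_before(ack, prior_snd_una));
        printf("after(ack, prior_snd_una)  = %d\n", seq_after(ack, prior_snd_una));
        return 0;
}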
So, as I was seeing, we have ack = prior_fack = prior_snd_una (which is > ack_seq). We take the !prior_packets goto to no_queue though, so we don't go through tcp_clean_rtx_queue - which explains why the replays don't help. I'm OK with that.

But all these traces still don't tell me why, the first time we saw that packet, we didn't take the ack and decrease sk_wmem_queued properly.

This isn't easy enough to reproduce that I feel confident about catching the first call to that function that goes wrong, though, so I guess I'll actually have to read the code I'm debugging instead of trying to just plow through it...

Hints are still welcome, if anyone has an idea.

--
Dominique Martinet | Asmadeus