On Wed, 2018-02-07 at 11:16 +0900, 배석진 wrote:
> Hello,
> this is bae working on samsung elec.
>
Hi Bae, thanks for this detailed report and analysis.

> we have a problem where a packet is discarded during the TCP 3-way
> handshake. it looks like Mr Dumazet already tried to fix a similar
> issue in this patch:
> https://android.googlesource.com/kernel/common/+/5e0724d027f0548511a2165a209572d48fe7a4c8

That patch was fixing a more serious bug, since it was possibly causing
corruptions and kernel crashes. What you describe is simply a packet
being dropped, with a very low probability, which is not a huge deal for
TCP since this packet will eventually be retransmitted. Also, most TCP
stacks/flows set the DF bit so that packets are not fragmented...

Anyway, see my answer at the end of this mail.

> but we are still facing another corner case.
>
> the problem needs these preconditions:
> (1) the last ACK packet of the 3-way handshake and the next packet
>     arrive at almost the same time
> (2) the next packet, the first data packet, is fragmented
> (3) rps is enabled
>
>
> [tcp dump]
> No.  A-Time           Source       Destination  Len   Seq  Info
> 1    08:35:18.115259  193.81.6.70  10.217.0.47  84    0    [SYN] Seq=0 Win=21504 Len=0 MSS=1460
> 2    08:35:18.115888  10.217.0.47  193.81.6.70  84    0    6100 → 5063 [SYN, ACK] Seq=0 Ack=1 Win=29200 Len=0 MSS=1460
> 3    08:35:18.142385  193.81.6.70  10.217.0.47  80    1    5063 → 6100 [ACK] Seq=1 Ack=1 Win=21504 Len=0
> 4    08:35:18.142425  193.81.6.70  10.217.0.47  1516       Fragmented IP protocol (proto=Encap Security Payload 50, off=0, ID=6e24) [Reassembled in #5]
> 5    08:35:18.142449  193.81.6.70  10.217.0.47  60    1    5063 → 6100 [ACK] Seq=1 Ack=1 Win=21504 Len=1460 [TCP segment of a reassembled PDU]
> 6    08:35:21.227070  193.81.6.70  10.217.0.47  1516       Fragmented IP protocol (proto=Encap Security Payload 50, off=0, ID=71e9) [Reassembled in #7]
> 7    08:35:21.227191  193.81.6.70  10.217.0.47  60    1    [TCP Retransmission] 5063 → 6100 [ACK] Seq=1 Ack=1 Win=21504 Len=1460
> 8    08:35:21.228822  10.217.0.47  193.81.6.70  80    1    6100 → 5063 [ACK] Seq=1 Ack=1461 Win=32120 Len=0
>
> - the last ACK of the handshake (No.3) and the next data packet
>   (No.4,5) arrived with just a 40us time gap.
>
>
> [kernel log]
> - stage 1
> <3>[ 1037.669229] I[0: system_server: 3778] get_rps_cpu: skb(64), check hash value:3412396090
> <3>[ 1037.669261] I[0: system_server: 3778] get_rps_cpu: skb(1500), check hash value:158575680
> <3>[ 1037.669285] I[0: system_server: 3778] get_rps_cpu: skb(44), check hash value:158575680
> - stage 2
> <3>[ 1037.669541] I[1: Binder:3778_13: 8391] tcp_v4_rcv: Enter! skb(seq:A93E087B, len:1480)
> <3>[ 1037.669552] I[2:Jit thread pool:12990] tcp_v4_rcv: Enter! skb(seq:A93E087B, len:20)
> <3>[ 1037.669564] I[2:Jit thread pool:12990] tcp_v4_rcv: check sk_state:12 skb(seq:A93E087B, len:20)
> <3>[ 1037.669585] I[2:Jit thread pool:12990] tcp_check_req, Enter!: skb(seq:A93E087B, len:20)
> <3>[ 1037.669612] I[1: Binder:3778_13: 8391] tcp_v4_rcv: check sk_state:12 skb(seq:A93E087B, len:1480)
> <3>[ 1037.669625] I[1: Binder:3778_13: 8391] tcp_check_req, Enter!: skb(seq:A93E087B, len:1480)
> <3>[ 1037.669653] I[2:Jit thread pool:12990] tcp_check_req, skb(seq:A93E087B, len:20), own_req:1
> <3>[ 1037.669668] I[1: Binder:3778_13: 8391] tcp_check_req, skb(seq:A93E087B, len:1480), own_req:0
> <3>[ 1037.669708] I[2:Jit thread pool:12990] tcp_rcv_state_process, Established: skb(seq:A93E087B, len:20)
> <3>[ 1037.669724] I[1: Binder:3778_13: 8391] tcp_v4_rcv: discard_relse skb(seq:A93E087B, len:1480)
>
> - stage 1
> because the data packet was fragmented (No.4 & 5), rps hashed it to a
> different core (cpu1) than the last ACK packet (cpu2). so the last ACK
> and the data packet were handled on different cores almost
> simultaneously, in NEW_SYN_RECV state.
>
> - stage 2, cpu2
> one of them is handled in tcp_check_req() a little earlier, gets
> own_req == true from tcp_v4_syn_recv_sock(), and returns a valid nsk,
> finally going to ESTABLISHED state.
>
> - stage 2, cpu1
> but the later one gets own_req == false, and inet_csk_complete_hashdance()
> returns NULL for nsk because own_req is false. so the earlier packet is
> handled successfully, but the later one is discarded.
>
> depending on scheduling, either the ACK or the data packet can be
> discarded at this point (we have seen both). if the ACK is discarded,
> that's ok: the tcp state goes to ESTABLISHED by piggyback on the data
> packet, and the payload is delivered to the upper layer. but if the
> data packet is discarded, the client doesn't receive the payload it
> should. this is the problem we faced.
>
> although the server retransmitted the dropped packet (No.6,7), that
> took a few seconds. since this problem occurred during IMS-Call setup,
> it appeared as a call connection delay. this situation is a serious
> problem for the call service.
>
> do you have any report about this, or a plan to fix it?

I guess the following patch should help, can you test it ?

Thanks !

 include/net/tcp.h        | 3 ++-
 net/ipv4/tcp_input.c     | 4 +++-
 net/ipv4/tcp_ipv4.c      | 9 ++++++++-
 net/ipv4/tcp_minisocks.c | 3 ++-
 net/ipv6/tcp_ipv6.c      | 9 ++++++++-
 5 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 58278669cc5572ec397f028b5888f574764e298a..fe4616978ccf2b048ca1b7ad8976558563798785 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -374,7 +374,8 @@ enum tcp_tw_status tcp_timewait_state_process(struct inet_timewait_sock *tw,
 					      struct sk_buff *skb,
 					      const struct tcphdr *th);
 struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
-			   struct request_sock *req, bool fastopen);
+			   struct request_sock *req, bool fastopen,
+			   bool *lost_race);
 int tcp_child_process(struct sock *parent, struct sock *child,
 		      struct sk_buff *skb);
 void tcp_enter_loss(struct sock *sk);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index cfa51cfd2d999342018aee3511d5760a17b1493b..20c68a4765235daec01a020913d4779d6926d203 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5870,10 +5870,12 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 	tp->rx_opt.saw_tstamp = 0;
 	req = tp->fastopen_rsk;
 	if (req) {
+		bool lost_race;
+
 		WARN_ON_ONCE(sk->sk_state != TCP_SYN_RECV &&
 		    sk->sk_state != TCP_FIN_WAIT1);
 
-		if (!tcp_check_req(sk, skb, req, true))
+		if (!tcp_check_req(sk, skb, req, true, &lost_race))
 			goto discard;
 	}
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 95738aa0d8a60697053d763ba7023be2c62d7a90..d72de3d12b4c960c3bd1ad4c45fc493ee919201c 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1672,6 +1672,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
 	if (sk->sk_state == TCP_NEW_SYN_RECV) {
 		struct request_sock *req = inet_reqsk(sk);
 		struct sock *nsk;
+		bool lost_race;
 
 		sk = req->rsk_listener;
 		if (unlikely(tcp_v4_inbound_md5_hash(sk, skb))) {
@@ -1688,15 +1689,21 @@ int tcp_v4_rcv(struct sk_buff *skb)
 		 */
 		sock_hold(sk);
 		refcounted = true;
+		lost_race = false;
 		nsk = NULL;
 		if (!tcp_filter(sk, skb)) {
 			th = (const struct tcphdr *)skb->data;
 			iph = ip_hdr(skb);
 			tcp_v4_fill_cb(skb, iph, th);
-			nsk = tcp_check_req(sk, skb, req, false);
+			nsk = tcp_check_req(sk, skb, req, false, &lost_race);
 		}
 		if (!nsk) {
 			reqsk_put(req);
+			if (lost_race) {
+				tcp_v4_restore_cb(skb);
+				sock_put(sk);
+				goto lookup;
+			}
 			goto discard_and_relse;
 		}
 		if (nsk == sk) {
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index a8384b0c11f8fa589e2ed5311899b62c80a269f8..7ebc222854f7d7215850f86456aeb6298659b6fa 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -578,7 +578,7 @@ EXPORT_SYMBOL(tcp_create_openreq_child);
 
 struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 			   struct request_sock *req,
-			   bool fastopen)
+			   bool fastopen, bool *lost_race)
 {
 	struct tcp_options_received tmp_opt;
 	struct sock *child;
@@ -785,6 +785,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 		sock_rps_save_rxhash(child, skb);
 		tcp_synack_rtt_meas(child, req);
+		*lost_race = !own_req;
 		return inet_csk_complete_hashdance(sk, child, req, own_req);
 
 listen_overflow:
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index a1ab29e2ab3bb99e120920186be46d2dd4bbb71e..211570bd524ddc3eddadec0b58c587935ad9a42c 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1451,6 +1451,7 @@ static int tcp_v6_rcv(struct sk_buff *skb)
 	if (sk->sk_state == TCP_NEW_SYN_RECV) {
 		struct request_sock *req = inet_reqsk(sk);
 		struct sock *nsk;
+		bool lost_race;
 
 		sk = req->rsk_listener;
 		if (tcp_v6_inbound_md5_hash(sk, skb)) {
@@ -1464,15 +1465,21 @@ static int tcp_v6_rcv(struct sk_buff *skb)
 		}
 		sock_hold(sk);
 		refcounted = true;
+		lost_race = false;
 		nsk = NULL;
 		if (!tcp_filter(sk, skb)) {
 			th = (const struct tcphdr *)skb->data;
 			hdr = ipv6_hdr(skb);
 			tcp_v6_fill_cb(skb, hdr, th);
-			nsk = tcp_check_req(sk, skb, req, false);
+			nsk = tcp_check_req(sk, skb, req, false, &lost_race);
 		}
 		if (!nsk) {
 			reqsk_put(req);
+			if (lost_race) {
+				tcp_v6_restore_cb(skb);
+				sock_put(sk);
+				goto lookup;
+			}
 			goto discard_and_relse;
 		}
 		if (nsk == sk) {