On Mon, Oct 17, 2016 at 8:35 PM, Lawrence Brakmo <bra...@fb.com> wrote:
> Yuchung and Eric, thank you for your comments.
>
> It looks like I need to think more about this patch. I was trying
> to reduce the likelihood of reordering (which seems even more
> important based on Eric's comment on pacing), but it seems like
> the only way to prevent reordering is to only re-hash after an RTO
> or when there are no packets in flight (which may not occur).

Sounds like that should be the same condition as when we set ooo_okay?
>
> On 10/11/16, 8:56 PM, "Yuchung Cheng" <ych...@google.com> wrote:
>
>>On Tue, Oct 11, 2016 at 6:01 PM, Yuchung Cheng <ych...@google.com> wrote:
>>> On Tue, Oct 11, 2016 at 2:08 PM, Lawrence Brakmo <bra...@fb.com> wrote:
>>>> Yuchung, thank you for your comments. Responses inline.
>>>>
>>>> On 10/11/16, 12:49 PM, "Yuchung Cheng" <ych...@google.com> wrote:
>>>>
>>>>>On Mon, Oct 10, 2016 at 5:18 PM, Lawrence Brakmo <bra...@fb.com> wrote:
>>>>>>
>>>>>> The purpose of this patch is to help balance flows across paths. A new
>>>>>> sysctl "tcp_retrans_txhash_prob" specifies the probability (0-100) that
>>>>>> the txhash (IPv6 flowlabel) will be changed after a non-RTO retransmit.
>>>>>> A probability is used in order to control how many flows are moved
>>>>>> during a congestion event and prevent the congested path from becoming
>>>>>> underutilized (which could occur if too many flows leave the current
>>>>>> path). Txhash changes may be delayed in order to decrease the
>>>>>> likelihood that they will trigger retransmits due to too much
>>>>>> reordering.
>>>>>>
>>>>>> Another sysctl "tcp_retrans_txhash_mode" determines the behavior after
>>>>>> RTOs. If the sysctl is 0, then after an RTO, only RTOs can trigger
>>>>>> txhash changes. The idea is to decrease the likelihood of going back
>>>>>> to a broken path. That is, we don't want flow balancing to trigger
>>>>>> changes to broken paths. The drawback is that flow balancing does
>>>>>> not work as well. If the sysctl is greater than zero, then we always
>>>>>> do flow balancing, even after RTOs.
>>>>>>
>>>>>> Tested with packetdrill tests (for correctness) and performance
>>>>>> experiments with 2 and 3 paths. Performance experiments looked at
>>>>>> aggregate goodput and fairness. For each run, we looked at the ratio of
>>>>>> the goodputs for the fastest and slowest flows. These were averaged for
>>>>>> all the runs.
>>>>>> A fairness of 1 means all flows had the same goodput; a
>>>>>> fairness of 2 means the fastest flow was twice as fast as the slowest
>>>>>> flow.
>>>>>>
>>>>>> The setup for the performance experiments was 4 or 5 servers in a rack,
>>>>>> 10G links. I tested various probabilities, but 20 seemed to have the
>>>>>> best tradeoff for my setup (small RTTs).
>>>>>>
>>>>>>                      --- node1 -----
>>>>>> sender --- switch    --- node2 ----- switch ---- receiver
>>>>>>                      --- node3 -----
>>>>>>
>>>>>> Scenario 1: One sender sends to one receiver through 2 routes (node1 or
>>>>>> node2). The output from node1 and node2 is 1G (1 Gbit/sec). With only 2
>>>>>> flows, without flow balancing (prob=0) the average goodput is 1.6G vs.
>>>>>> 1.9G with flow balancing, due to 2 flows ending up on one link and
>>>>>> either not moving or taking some time to move. Fairness was 1 in all
>>>>>> cases. For 7 flows, goodput was 1.9G for all, but fairness was 1.5, 1.4
>>>>>> or 1.2 for prob=0, prob=20,mode=0 and prob=20,mode=1 respectively. That
>>>>>> is, flow balancing increased fairness.
>>>>>>
>>>>>> Scenario 2: One sender to one receiver, through 3 routes (node1, node2,
>>>>>> node3). With 6 or 16 flows the goodput was the same for all, but
>>>>>> fairness was 1.8, 1.5 and 1.2 respectively. Interestingly, the worst
>>>>>> case fairness out of 10 runs was 2.2, 1.8 and 1.4 respectively. That is,
>>>>>> prob=20,mode=1 improved average and worst case fairness.
>>>>>I am wondering if we can build a better API with the routing layer to
>>>>>implement this type of feature, instead of scattering the tx_rehashing
>>>>>logic through TCP. For example, we call dst_negative_advice on TCP
>>>>>write timeouts.
>>>>
>>>> Not sure. The route is not necessarily bad; it may be temporarily
>>>> congested, or they may all be congested. If all we want to do is change
>>>> the txhash (unlike dst_negative_advice), then calling a tx_rehashing
>>>> function may be the appropriate call.
>>>>
>>>>>
>>>>>On the patch itself, it seems aggressive to (attempt to) rehash every
>>>>>post-RTO retransmission. Also you can just use ca_state (== CA_Loss) to
>>>>>identify post-RTO retransmissions directly.
>>>>
>>>> Thanks, I will add the test.
>>>>
>>>>>
>>>>>is this an implementation of the Flow Bender?
>>>>>http://dl.acm.org/citation.cfm?id=2674985
>>>>
>>>> Part of Flow Bender, although there are also some similarities to
>>>> flowlet switching.
>>>>
>>>>>
>>>>>>
>>>>>> Scenario 3: One sender to one receiver, 2 routes, one route drops 50% of
>>>>>> the packets. With 7 flows, goodput was the same 1.1G, but fairness was
>>>>>> 1.8, 2.0 and 2.1 respectively. That is, if there is a bad route, then
>>>>>> balancing, which does more re-routes, is less fair.
>>>>>>
>>>>>> Signed-off-by: Lawrence Brakmo <bra...@fb.com>
>>>>>> ---
>>>>>>  Documentation/networking/ip-sysctl.txt | 15 +++++++++++++++
>>>>>>  include/linux/tcp.h                    |  4 +++-
>>>>>>  include/net/tcp.h                      |  2 ++
>>>>>>  net/ipv4/sysctl_net_ipv4.c             | 18 ++++++++++++++++++
>>>>>>  net/ipv4/tcp_input.c                   | 10 ++++++++++
>>>>>>  net/ipv4/tcp_output.c                  | 23 ++++++++++++++++++++++-
>>>>>>  net/ipv4/tcp_timer.c                   |  4 ++++
>>>>>>  7 files changed, 74 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
>>>>>> index 3db8c67..87a984c 100644
>>>>>> --- a/Documentation/networking/ip-sysctl.txt
>>>>>> +++ b/Documentation/networking/ip-sysctl.txt
>>>>>> @@ -472,6 +472,21 @@ tcp_max_reordering - INTEGER
>>>>>>  	if paths are using per packet load balancing (like bonding rr mode)
>>>>>>  	Default: 300
>>>>>>
>>>>>> +tcp_retrans_txhash_mode - INTEGER
>>>>>> +	If zero, disable txhash recalculation due to non-RTO retransmissions
>>>>>> +	after an RTO. The idea is that broken paths will trigger an RTO and
>>>>>> +	we don't want to go back to that path due to standard retransmissions
>>>>>> +	(flow balancing). The drawback is that balancing is less robust.
>>>>>> +	If greater than zero, we can always (probabilistically) recalculate
>>>>>> +	the txhash after non-RTO retransmissions.
>>>>>> +
>>>>>> +tcp_retrans_txhash_prob - INTEGER
>>>>>> +	Probability [0 to 100] that we will recalculate the txhash when a
>>>>>> +	packet is resent not due to an RTO (for RTO the txhash is always
>>>>>> +	recalculated). The recalculation of the txhash may be delayed to
>>>>>> +	decrease the likelihood that reordering will trigger retransmissions.
>>>>>> +	The purpose is to help balance the flows among the possible paths.
>>>>>> +
>>>>>>  tcp_retrans_collapse - BOOLEAN
>>>>>>  	Bug-to-bug compatibility with some broken printers.
>>>>>>  	On retransmit try to send bigger packets to work around bugs in
>>>>>> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
>>>>>> index a17ae7b..e0e3b7d 100644
>>>>>> --- a/include/linux/tcp.h
>>>>>> +++ b/include/linux/tcp.h
>>>>>> @@ -214,7 +214,9 @@ struct tcp_sock {
>>>>>>  	} rack;
>>>>>>  	u16	advmss;			/* Advertised MSS */
>>>>>>  	u8	rate_app_limited:1,	/* rate_{delivered,interval_us} limited? */
>>>>>> -		unused:7;
>>>>>> +		txhash_rto:1,	/* If set, don't do flow balancing */
>>>>>> +		txhash_want:1,	/* We want to change txhash when safe */
>>>>>> +		unused:5;
>>>>>>  	u8	nonagle     : 4,/* Disable Nagle algorithm? */
>>>>>>  		thin_lto    : 1,/* Use linear timeouts for thin streams */
>>>>>>  		thin_dupack : 1,/* Fast retransmit on first dupack */
>>>>>> diff --git a/include/net/tcp.h b/include/net/tcp.h
>>>>>> index f83b7f2..3abd304 100644
>>>>>> --- a/include/net/tcp.h
>>>>>> +++ b/include/net/tcp.h
>>>>>> @@ -271,6 +271,8 @@ extern int sysctl_tcp_autocorking;
>>>>>>  extern int sysctl_tcp_invalid_ratelimit;
>>>>>>  extern int sysctl_tcp_pacing_ss_ratio;
>>>>>>  extern int sysctl_tcp_pacing_ca_ratio;
>>>>>> +extern int sysctl_tcp_retrans_txhash_prob;
>>>>>> +extern int sysctl_tcp_retrans_txhash_mode;
>>>>>>
>>>>>>  extern atomic_long_t tcp_memory_allocated;
>>>>>>  extern struct percpu_counter tcp_sockets_allocated;
>>>>>> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
>>>>>> index 1cb67de..00d6f26 100644
>>>>>> --- a/net/ipv4/sysctl_net_ipv4.c
>>>>>> +++ b/net/ipv4/sysctl_net_ipv4.c
>>>>>> @@ -28,6 +28,7 @@
>>>>>>  static int zero;
>>>>>>  static int one = 1;
>>>>>>  static int four = 4;
>>>>>> +static int hundred = 100;
>>>>>>  static int thousand = 1000;
>>>>>>  static int gso_max_segs = GSO_MAX_SEGS;
>>>>>>  static int tcp_retr1_max = 255;
>>>>>> @@ -624,6 +625,23 @@ static struct ctl_table ipv4_table[] = {
>>>>>>  		.proc_handler	= proc_dointvec_ms_jiffies,
>>>>>>  	},
>>>>>>  	{
>>>>>> +		.procname	= "tcp_retrans_txhash_prob",
>>>>>> +		.data		= &sysctl_tcp_retrans_txhash_prob,
>>>>>> +		.maxlen		= sizeof(int),
>>>>>> +		.mode		= 0644,
>>>>>> +		.proc_handler	= proc_dointvec_minmax,
>>>>>> +		.extra1		= &zero,
>>>>>> +		.extra2		= &hundred,
>>>>>> +	},
>>>>>> +	{
>>>>>> +		.procname	= "tcp_retrans_txhash_mode",
>>>>>> +		.data		= &sysctl_tcp_retrans_txhash_mode,
>>>>>> +		.maxlen		= sizeof(int),
>>>>>> +		.mode		= 0644,
>>>>>> +		.proc_handler	= proc_dointvec_minmax,
>>>>>> +		.extra1		= &zero,
>>>>>> +	},
>>>>>> +	{
>>>>>>  		.procname	= "icmp_msgs_per_sec",
>>>>>>  		.data		= &sysctl_icmp_msgs_per_sec,
>>>>>>  		.maxlen		= sizeof(int),
>>>>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>>>>> index a27b9c0..fed5366 100644
>>>>>> --- a/net/ipv4/tcp_input.c
>>>>>> +++ b/net/ipv4/tcp_input.c
>>>>>> @@ -101,6 +101,9 @@ int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
>>>>>>  int sysctl_tcp_early_retrans __read_mostly = 3;
>>>>>>  int sysctl_tcp_invalid_ratelimit __read_mostly = HZ/2;
>>>>>>
>>>>>> +int sysctl_tcp_retrans_txhash_prob __read_mostly;
>>>>>> +int sysctl_tcp_retrans_txhash_mode __read_mostly;
>>>>>> +
>>>>>>  #define FLAG_DATA		0x01 /* Incoming frame contained data. */
>>>>>>  #define FLAG_WIN_UPDATE	0x02 /* Incoming ACK was a window update. */
>>>>>>  #define FLAG_DATA_ACKED	0x04 /* This ACK acknowledged new data. */
>>>>>> @@ -3674,6 +3677,13 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
>>>>>>  	flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una, &acked,
>>>>>>  				    &sack_state, &now);
>>>>>>
>>>>>> +	/* Check if we should set txhash (would not cause reordering) */
>>>>>> +	if (tp->txhash_want &&
>>>>>> +	    (tp->packets_out - tp->sacked_out) < tp->reordering) {
>>>>>> +		sk_set_txhash(sk);
>>>>>> +		tp->txhash_want = 0;
>>>>>> +	}
>>>>>> +
>>>>>>  	if (tcp_ack_is_dubious(sk, flag)) {
>>>>>>  		is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP));
>>>>>>  		tcp_fastretrans_alert(sk, acked, is_dupack, &flag, &rexmit);
>>>>>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>>>>>> index 896e9df..58490ac 100644
>>>>>> --- a/net/ipv4/tcp_output.c
>>>>>> +++ b/net/ipv4/tcp_output.c
>>>>>> @@ -2738,9 +2738,30 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
>>>>>>  		tp->retrans_out += tcp_skb_pcount(skb);
>>>>>>
>>>>>>  		/* Save stamp of the first retransmit. */
>>>>>> -		if (!tp->retrans_stamp)
>>>>>> +		if (!tp->retrans_stamp) {
>>>>>>  			tp->retrans_stamp = tcp_skb_timestamp(skb);
>>>>>>
>>>>>> +			/* Determine if we should reset hash, only done once
>>>>>> +			 * per recovery
>>>>>> +			 */
>>>>>> +			if ((!tp->txhash_rto ||
>>>>>> +			     sysctl_tcp_retrans_txhash_mode > 0) &&
>>>>>> +			    sk->sk_txhash &&
>>>>>> +			    (prandom_u32_max(100) <
>>>>>> +			     sysctl_tcp_retrans_txhash_prob)) {
>>>>>> +				/* If not too much reordering, or RTT is
>>>>>> +				 * small enough that we don't care about
>>>>>> +				 * reordering, then change it now.
>>>>>> +				 * Else, wait until it is safe.
>>>>>> +				 */
>>>>>> +				if ((tp->packets_out - tp->sacked_out) <
>>>>>> +				    tp->reordering)
>>>>>I don't parse this logic ... suppose reordering is 100 (not uncommon
>>>>>today due to the last packet being delivered slightly earlier than the
>>>>>rest), and cwnd == packets_out =~ 200; we only rehash once half
>>>>>of the packets are sacked, so we are still rehashing even when
>>>>>reordering is heavy?
>>>>
>>>> In your scenario, there would be no re-hashing until sacked_out is 101
>>>> ((packets_out - sacked_out) < 100). This code would mark txhash_want.
>>>> Then when an ACK is received and the conditional is true, txhash
>>>> would be changed.
>>>>
>>>> Now, the test does not prevent retransmissions due to reordering in all
>>>> cases, but hopefully in most. I will also add the test you recommended,
>>>> checking for CA_Loss, to prevent too much re-hashing.
>>> If the whole point is to rehash at most once between recovery events,
>>> why do we need this complicated change at the per-packet level in
>>> tcp_retransmit_skb()? There are functions that start and end recovery
>>> (tcp_enter_recovery, tcp_end_cwnd_reduction). Our retransmission logic
>>> is already very complicated.
>>>
>>> I still don't understand the incentive of starting the rehashing
>>> halfway through retransmitting depending on the sacking and reordering
>>> status, for temporary congestion as you've mentioned.
>>>
>>> is this feature unique to intra-DC connections?
>>I thought more about this patch on my way home and have more
>>questions: why do we exclude RTO retransmissions specifically? Also
>>when we rehash, we'll introduce reordering either in recovery or after
>>recovery, as some TCP CC like BBR would continue sending regardless,
>>so starting in tcp_ack() with tp->txhash_want does not really prevent
>>causing more reordering.
>>
>>So if we want to rehash upon a recovery event (fast or timeout), then
>>we can make a helper function, i.e. tcp_retry_alt_path, and just call it
>>in tcp_enter_loss and tcp_recovery (which are called once per recovery).
>>Since it'll cause reordering anyway, I doubt we need to
>>check tp->reordering at all. If packets already traverse different
>>paths, rehashing can make it better or worse. If packets weren't
>>traversing different paths, the rehashing itself can introduce
>>reordering. So in the end there is not much benefit to making the
>>decision based on current inflight and reordering, IMO...
>>
>>
>>>
>>>>
>>>>>
>>>>>also where do we check RTT is small?
>>>>
>>>> The RTT comment is left over from a previous version, I will remove it.
>>>>
>>>>>
>>>>>> +					sk_set_txhash(sk);
>>>>>> +				else
>>>>>> +					tp->txhash_want = 1;
>>>>>> +			}
>>>>>> +		}
>>>>>> +
>>>>>>  	} else if (err != -EBUSY) {
>>>>>>  		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL);
>>>>>>  	}
>>>>>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>>>>>> index 3ea1cf8..e66baad 100644
>>>>>> --- a/net/ipv4/tcp_timer.c
>>>>>> +++ b/net/ipv4/tcp_timer.c
>>>>>> @@ -186,6 +186,8 @@ static int tcp_write_timeout(struct sock *sk)
>>>>>>
>>>>>>  	if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
>>>>>>  		if (icsk->icsk_retransmits) {
>>>>>> +			tp->txhash_rto = 1;
>>>>>> +			tp->txhash_want = 0;
>>>>>>  			dst_negative_advice(sk);
>>>>>>  			if (tp->syn_fastopen || tp->syn_data)
>>>>>>  				tcp_fastopen_cache_set(sk, 0, NULL, true, 0);
>>>>>> @@ -218,6 +220,8 @@ static int tcp_write_timeout(struct sock *sk)
>>>>>>  	} else {
>>>>>>  		sk_rethink_txhash(sk);
>>>>>>  	}
>>>>>> +	tp->txhash_rto = 1;
>>>>>> +	tp->txhash_want = 0;
>>>>>>
>>>>>>  	retry_until = net->ipv4.sysctl_tcp_retries2;
>>>>>>  	if (sock_flag(sk, SOCK_DEAD)) {
>>>>>> --
>>>>>> 2.9.3
>>>>>>
>>>>
>