On Thu, Mar 17, 2016 at 4:26 PM, Bendik Rønning Opstad <bro.de...@gmail.com> wrote: > > >> diff --git a/Documentation/networking/ip-sysctl.txt > >> b/Documentation/networking/ip-sysctl.txt > >> index 6a92b15..8f3f3bf 100644 > >> --- a/Documentation/networking/ip-sysctl.txt > >> +++ b/Documentation/networking/ip-sysctl.txt > >> @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER > >> calculated, which is used to classify whether a stream is thin. > >> Default: 10000 > >> > >> +tcp_rdb - BOOLEAN > >> + Enable RDB for all new TCP connections. > > Please describe RDB briefly, perhaps with a pointer to your paper. > > Ah, yes, that description may have been a bit too brief... > > What about pointing to tcp-thin.txt in the brief description, and > rewrite tcp-thin.txt with a more detailed description of RDB along > with a paper reference? +1 > > > I suggest have three level of controls: > > 0: disable RDB completely > > 1: enable indiv. thin-stream conn. to use RDB via TCP_RDB socket > > options > > 2: enable RDB on all thin-stream conn. by default > > > > currently it only provides mode 1 and 2. but there may be cases where > > the administrator wants to disallow it (e.g., broken middle-boxes). > > Good idea. Will change this. > > >> + Default: 0 > >> + > >> +tcp_rdb_max_bytes - INTEGER > >> + Enable restriction on how many bytes an RDB packet can contain. > >> + This is the total amount of payload including the new unsent data. > >> + Default: 0 > >> + > >> +tcp_rdb_max_packets - INTEGER > >> + Enable restriction on how many previous packets in the output queue > >> + RDB may include data from. A value of 1 will restrict bundling to > >> + only the data from the last packet that was sent. > >> + Default: 1 > > why two metrics on redundancy? > > We have primarily used the packet based limit in our tests. This is > also the most important knob as it directly controls how many lost > packets each RDB packet may recover. > > We believe that the byte based limit can also be useful because it > allows more fine grained control on how much impact RDB can have on > the increased bandwidth requirements of the flows. If an application > writes 700 bytes per write call, the bandwidth increase can be quite > significant (even with a 1 packet bundling limit) if we consider a > scenario with thousands of RDB streams. > > In some of our experiments with many simultaneous thin streams, where > we set up a bottleneck rate limited by a htb with pfifo queue, we > observed considerable difference in loss rates depending on how many > bytes (packets) were allowed to be bundled with each packet. This is > partly why we recommend a default bundling limit of 1 packet. > > By limiting the total payload size of RDB packets to e.g. 100 bytes, > only the smallest segments will benefit from RDB, while the segments > that would increase the bandwidth requirements the most, will not. > > While a very large number of RDB streams from one sender may be a > corner case, we still think this sysctl knob can be valuable for a > sysadmin that finds himself in such a situation. These nice comments would be useful in the sysctl descriptions.
> > > It also seems better to > > allow individual socket to select the redundancy level (e.g., > > setsockopt TCP_RDB=3 means <=3 pkts per bundle) vs a global setting. > > This requires more bits in tcp_sock but 2-3 more is suffice. > > Most certainly. We decided not to implement this for the patch to keep > it as simple as possible, however, we surely prefer to have this > functionality included if possible. > > >> +static unsigned int rdb_detect_loss(struct sock *sk) > >> +{ > ... > >> + tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) { > >> + if (before(TCP_SKB_CB(skb)->seq, > >> scb->tx.rdb_start_seq)) > >> + break; > >> + packets_lost++; > > since we only care if there is packet loss or not, we can return early here? > > Yes, I considered that, and as long as the number of packets presumed > to be lost is not needed, that will suffice. However, could this not > be useful for statistical purposes? > > This is also relevant to the comment from Eric on SNMP counters for > how many times losses could be repaired by RDB? > > >> + } > >> + break; > >> + } > >> + return packets_lost; > >> +} > >> + > >> +/** > >> + * tcp_rdb_ack_event() - initiate RDB loss detection > >> + * @sk: socket > >> + * @flags: flags > >> + */ > >> +void tcp_rdb_ack_event(struct sock *sk, u32 flags) > > flags are not used > > Ah, yes, will remove that. > > >> +int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb, > >> + unsigned int mss_now, gfp_t gfp_mask) > >> +{ > >> + struct sk_buff *rdb_skb = NULL; > >> + struct sk_buff *first_to_bundle; > >> + u32 bytes_in_rdb_skb = 0; > >> + > >> + /* How we detect that RDB was used. When equal, no RDB data was > >> sent */ > >> + TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq; > > > >> + > >> + if (!tcp_stream_is_thin_dpifl(tcp_sk(sk))) > > During loss recovery tcp inflight fluctuates and would like to trigger > > this check even for non-thin-stream connections. > > Good point. > > > Since the loss > > already occurs, RDB can only take advantage from limited-transmit, > > which it likely does not have (b/c its a thin-stream). It might be > > checking if the state is open. > > You mean to test for open state to avoid calling rdb_can_bundle_test() > unnecessarily if we (presume to) know it cannot bundle anyway? That > makes sense, however, I would like to do some tests on whether "state > != open" is a good indicator on when bundling is not possible. > > >> + goto xmit_default; > >> + > >> + /* No bundling if first in queue, or on FIN packet */ > >> + if (skb_queue_is_first(&sk->sk_write_queue, xmit_skb) || > >> + (TCP_SKB_CB(xmit_skb)->tcp_flags & TCPHDR_FIN)) > > seems there are still benefit to bundle packets up to FIN? > > I was close to removing the FIN test, but decided to not remove it > until I could verify that it will not cause any issues on some TCP > receivers. If/(Since?) you are certain it will not cause any issues, I > will remove it. > > > since RDB will cause DSACKs, and we only blindly count DSACKs to > > perform CWND undo. How does RDB handle that false positives? > > That is a very good question. The simple answer is that the > implementation does not handle any such false positives, which I > expect can result in incorrectly undoing CWND reduction in some cases. > This gets a bit complicated, so I'll have to do some more testing on > this to verify with certainty when it happens. > > When there is no loss, and each RDB packet arriving at the receiver > contains both already received and new data, the receiver will respond > with an ACK that acknowledges new data (moves snd_una), with the SACK > field populated with the already received sequence range (DSACK). > > The DSACKs in these incoming ACKs are not counted (tp->undo_retrans--) > unless tp->undo_marker has been set by tcp_init_undo(), which is > called by either tcp_enter_loss() or tcp_enter_recovery(). However, > whenever a loss is detected by rdb_detect_loss(), tcp_enter_cwr() is > called, which disables CWND undo. Therefore, I believe the incorrect thanks for the clarification. it might worth a short comment on why we use tcp_enter_cwr() (to disable undo) > counting of DSACKs from ACKs on RDB packets will only be a problem > after the regular loss detection mechanisms (Fast Retransmit/RTO) have > been triggered (i.e. we are in either TCP_CA_Recovery or TCP_CA_Loss). > > We have recorded the CWND values for both RDB and non-RDB streams in > our experiments, and have not found any obvious red flags when > analysing the results, so I presume (hope may be more precise) this is > not a major issue we have missed. Nevertheless, I will investigate > this in detail and get back to you. > > > Thank you for the detailed comments. > > Bendik >