On Mon, 25 Jun 2007 22:09:39 +0900 (JST)
OBATA Noboru <[EMAIL PROTECTED]> wrote:

> From: OBATA Noboru <[EMAIL PROTECTED]>
> 
> Make TCP_RTO_MAX a variable, and allow a user to change it via a
> new sysctl entry /proc/sys/net/ipv4/tcp_rto_max.  A user can
> then make TCP retransmission more predictable, say, guarantee
> at least one retransmission per 10 seconds, by setting it to 10.  This is
> quite helpful on failover-capable network devices, such as an
> active-backup bonding device.  On such devices, it is desirable
> that TCP retransmits a packet shortly after the failover, which
> is what I would like to do with this patch.  Please see
> Background and Problem below for rationale in detail.
> 
> Reading from /proc/sys/net/ipv4/tcp_rto_max shows the current
> TCP_RTO_MAX in seconds.  The actual value of TCP_RTO_MAX is
> stored in sysctl_tcp_rto_max in jiffies.
> 
> Writing to /proc/sys/net/ipv4/tcp_rto_max updates the
> TCP_RTO_MAX, only if the new value is not smaller than
> TCP_RTO_MIN, which is currently 0.2[sec].  Since tcp_rto_max is
> an integer, the minimum value of /proc/sys/net/ipv4/tcp_rto_max
> is 1, in substance.  Also the RtoMax entry in /proc/net/snmp is
> updated.
> 
> Please note that this is effective in IPv6 as well.
> 
> 
> Background and Problem
> ======================
> 
> When designing a TCP/IP based network system on failover-capable
> network devices, people want to set timeouts hierarchically in
> three layers, network device layer, TCP layer, and application
> layer (bottom-up order), such that:
> 
> 1. Network device layer detects a failure first and switches to a
>    backup device (say, in 20sec).
> 
> 2. TCP layer timeout & retransmission comes next, _hopefully_
>    before the application layer timeout.
> 
> 3. Application layer detects a network failure last (by, say,
>    30sec timeout) and may trigger a system-level failover.
> 
>    * Note 1.  The timeouts for #1 and #2 are handled
>      independently and there is no relationship between them.
> 
>    * Note 2.  The actual timeout settings (20sec or 30sec in
>      this example) are often determined by system requirements
>      and so setting them to certain "safe values" (if any) is
>      usually not possible.
> 
> If TCP retransmission misses the time frame between event #1
> and #3 in Background above (between 20 and 30sec since network
> failure), a failure causes the system-level failover where the
> network-device-level failover should be enough.
> 
> The problem in this hierarchical timeout scheme is that the TCP
> layer does not guarantee that the next retransmission occurs
> within a certain period of time.  In the above example, people
> expect TCP to retransmit a packet between 20 and 30sec after the
> network failure, but it may not happen.
> 
> Starting from RTO=0.5sec for example, retransmission will occur
> at time 0.5, 1.5, 3.5, 7.5, 15.5, and 31.5 as indicated by 'o'
> in the following diagram, but miss the time frame between time
> 20 and 30.
> 
>        time: 0         10        20        30sec
>              |         |         |         |
>   App. layer |---------+---------+---------X  ==> system failover
>    TCP layer oo-o---o--+----o----+---------+o <== expects retrans. b/w 20~30
> Netdev layer |---------+---------X            ==> network failover
> 
> 
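
To make the schedule above concrete, here is a rough user-space sketch
of the backoff arithmetic.  It is a simplified model, not the kernel
code: it assumes the RTO simply doubles after every timeout and is
clamped to the maximum, and it ignores RTT estimation.  The helper
names retrans_schedule() and retrans_in_window() are made up for
illustration.

```c
#include <stddef.h>

/*
 * Fill times[] with the elapsed time (in milliseconds) of each
 * retransmission up to horizon_ms.  The RTO doubles after every
 * timeout and is clamped to rto_max_ms.  Returns the number of
 * entries written.  Hypothetical helper, not part of the patch.
 */
static size_t retrans_schedule(unsigned rto_ms, unsigned rto_max_ms,
                               unsigned horizon_ms,
                               unsigned *times, size_t max)
{
    size_t n = 0;
    unsigned now = 0;

    while (n < max) {
        now += rto_ms;              /* timer fires after one RTO */
        if (now > horizon_ms)
            break;
        times[n++] = now;
        rto_ms *= 2;                /* exponential backoff */
        if (rto_ms > rto_max_ms)    /* bounded by the maximum RTO */
            rto_ms = rto_max_ms;
    }
    return n;
}

/* Count the retransmissions that fall inside [lo_ms, hi_ms]. */
static size_t retrans_in_window(const unsigned *times, size_t n,
                                unsigned lo_ms, unsigned hi_ms)
{
    size_t i, hits = 0;

    for (i = 0; i < n; i++)
        if (times[i] >= lo_ms && times[i] <= hi_ms)
            hits++;
    return hits;
}
```

Under this model, an initial RTO of 500ms with a 120sec maximum gives
no retransmission between 20 and 30sec, while a 10sec maximum puts one
at 25.5sec.
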
> Signed-off-by: OBATA Noboru <[EMAIL PROTECTED]>
> ---
> 
>  Documentation/networking/ip-sysctl.txt |    6 +
>  include/linux/sysctl.h                 |    1
>  include/net/tcp.h                      |    5 +
>  net/ipv4/sysctl_net_ipv4.c             |   77 +++++++++++++++++++++++++
>  net/ipv4/tcp_timer.c                   |    3
>  5 files changed, 91 insertions(+), 1 deletion(-)
> 
> diff -uprN -X a/Documentation/dontdiff linux-2.6.22-rc5-orig/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> --- a/Documentation/networking/ip-sysctl.txt  2007-06-22 21:34:18.000000000 +0900
> +++ b/Documentation/networking/ip-sysctl.txt  2007-06-25 16:07:21.000000000 +0900
> @@ -340,6 +340,12 @@ tcp_rmem - vector of 3 INTEGERs: min, de
>       net.core.rmem_max, "static" selection via SO_RCVBUF does not use this.
>       Default: 87380*2 bytes.
>  
> +tcp_rto_max - INTEGER
> +     Maximum time in seconds to which RTO can grow.  Exponential
> +     backoff of RTO is bounded by this value.  The value must not be
> +     smaller than 1.  Note this parameter is also effective for IPv6.
> +     Default: 120
> +
>  tcp_sack - BOOLEAN
>       Enable select acknowledgments (SACKS).
>  
> diff -uprN -X a/Documentation/dontdiff linux-2.6.22-rc5-orig/include/linux/sysctl.h b/include/linux/sysctl.h
> --- a/include/linux/sysctl.h  2007-06-22 21:34:33.000000000 +0900
> +++ b/include/linux/sysctl.h  2007-06-25 16:27:29.000000000 +0900
> @@ -441,6 +441,7 @@ enum
>       NET_TCP_ALLOWED_CONG_CONTROL=123,
>       NET_TCP_MAX_SSTHRESH=124,
>       NET_TCP_FRTO_RESPONSE=125,
> +     NET_TCP_RTO_MAX=126,
>  };
>  

Rather than assigning another numeric sysctl value, you can use
CTL_UNNUMBERED.  Numeric sysctls are being phased out; at one
point they were even going to be deprecated.
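
Roughly what I have in mind for the table entry is sketched below.
So that the fragment compiles on its own, it uses a cut-down stand-in
struct (ctl_table_sketch, hypothetical) and a placeholder value for
CTL_UNNUMBERED; in the kernel both the real ctl_table and the real
constant come from <linux/sysctl.h>.  I have also dropped .strategy
on the assumption that an unnumbered entry is not reachable through
the binary sysctl interface.

```c
#include <stddef.h>

/*
 * Stand-ins for illustration only.  The real definitions live in
 * <linux/sysctl.h>; the value below is a placeholder.
 */
#define CTL_UNNUMBERED (-2)

struct ctl_table_sketch {
    int            ctl_name;   /* binary sysctl number, or CTL_UNNUMBERED */
    const char    *procname;   /* name under /proc/sys/net/ipv4/ */
    void          *data;
    int            maxlen;
    unsigned short mode;
};

static int sysctl_tcp_rto_max;

/* Entry without a new NET_TCP_RTO_MAX number in the sysctl.h enum. */
static struct ctl_table_sketch tcp_rto_max_entry = {
    .ctl_name = CTL_UNNUMBERED,
    .procname = "tcp_rto_max",
    .data     = &sysctl_tcp_rto_max,
    .maxlen   = sizeof(int),
    .mode     = 0644,
};
```
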


>  enum {
> diff -uprN -X a/Documentation/dontdiff linux-2.6.22-rc5-orig/include/net/tcp.h b/include/net/tcp.h
> --- a/include/net/tcp.h       2007-06-22 21:34:33.000000000 +0900
> +++ b/include/net/tcp.h       2007-06-22 21:40:05.000000000 +0900
> @@ -121,7 +121,9 @@ extern void tcp_time_wait(struct sock *s
>  #define TCP_DELACK_MIN       4U
>  #define TCP_ATO_MIN  4U
>  #endif
> -#define TCP_RTO_MAX  ((unsigned)(120*HZ))
> +extern int sysctl_tcp_rto_max;
> +#define TCP_RTO_MAX  ((unsigned)(sysctl_tcp_rto_max))
> +#define TCP_RTO_MAX_DEFAULT  ((unsigned)(120*HZ))
>  #define TCP_RTO_MIN  ((unsigned)(HZ/5))
>  #define TCP_TIMEOUT_INIT ((unsigned)(3*HZ))  /* RFC 1122 initial RTO value */

Rather than making the TCP_RTO_MAX macro hide a reference to
sysctl_tcp_rto_max, why not reference sysctl_tcp_rto_max directly?

> @@ -203,6 +205,7 @@ extern int sysctl_tcp_synack_retries;
>  extern int sysctl_tcp_retries1;
>  extern int sysctl_tcp_retries2;
>  extern int sysctl_tcp_orphan_retries;
> +extern int sysctl_tcp_rto_max;
>  extern int sysctl_tcp_syncookies;
>  extern int sysctl_tcp_retrans_collapse;
>  extern int sysctl_tcp_stdurg;
> diff -uprN -X a/Documentation/dontdiff linux-2.6.22-rc5-orig/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> --- a/net/ipv4/sysctl_net_ipv4.c      2007-06-22 21:34:33.000000000 +0900
> +++ b/net/ipv4/sysctl_net_ipv4.c      2007-06-25 16:27:53.000000000 +0900
> @@ -186,6 +186,74 @@ static int strategy_allowed_congestion_c
>  
>  }
>  
> +static int proc_tcp_rto_max(ctl_table *ctl, int write, struct file *filp,
> +                         void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +     int val = *(int *)ctl->data;
> +     int ret;
> +
> +     ret = proc_dointvec_jiffies(ctl, write, filp, buffer, lenp, ppos);
> +     if (ret)
> +             return ret;
> +
> +     if (write && *(int *)ctl->data != val) {
> +             if (*(int *)ctl->data < TCP_RTO_MIN) {
> +                     *(int *)ctl->data = val;
> +                     return -EINVAL;
> +             }
> +             TCP_ADD_STATS_USER(TCP_MIB_RTOMAX,
> +                                (*(int *)ctl->data - val) * 1000 / HZ);
> +     }
> +
> +     return 0;
> +}
> +
> +static int strategy_tcp_rto_max(ctl_table *table, int __user *name,
> +                             int nlen, void __user *oldval,
> +                             size_t __user *oldlenp,
> +                             void __user *newval, size_t newlen)
> +{
> +     int *valp = table->data;
> +     int new;
> +
> +     if (!newval || !newlen)
> +             return 0;
> +
> +     if (newlen != sizeof(int))
> +             return -EINVAL;
> +
> +     if (get_user(new, (int __user *)newval))
> +             return -EFAULT;
> +
> +     if (new * HZ == *valp)
> +             return 0;
> +
> +     if (new * HZ < TCP_RTO_MIN)
> +             return -EINVAL;
> +
> +     if (oldval && oldlenp) {
> +             size_t len;
> +
> +             if (get_user(len, oldlenp))
> +                     return -EFAULT;
> +
> +             if (len) {
> +                     if (len > table->maxlen)
> +                             len = table->maxlen;
> +                     if (put_user(*valp / HZ, (int __user *)oldval))
> +                             return -EFAULT;
> +                     if (put_user(len, oldlenp))
> +                             return -EFAULT;
> +             }
> +     }
> +
> +     TCP_ADD_STATS_USER(TCP_MIB_RTOMAX, (new * HZ - *valp) * 1000 / HZ);
> +
> +     *valp = new * HZ;
> +
> +     return 1;
> +}

Could sysctl_tcp_rto_max be unsigned instead of int, to avoid possible
sign-wrap issues and having to cast it on each use?
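
To illustrate the sign-wrap concern, here is a small user-space sketch.
HZ is set to 250 only as an assumed tick rate, and the function names
patch_check_rejects() and set_rto_max_secs() are made up.  The first
function mirrors the "< TCP_RTO_MIN" test from the patch: because the
macro is unsigned, a negative int is converted to a huge unsigned value
and is not rejected.  The second shows one possible alternative that
range-checks in the signed domain before storing into an unsigned
variable; the upper bound is my own assumption.

```c
#include <limits.h>

#define HZ          250                     /* assumed tick rate */
#define TCP_RTO_MIN ((unsigned)(HZ / 5))

/*
 * Mirrors the patch's comparison.  The explicit cast spells out what
 * the usual arithmetic conversions do implicitly when an int is
 * compared with an unsigned value.  Returns 1 if rejected.
 */
static int patch_check_rejects(int jiffies)
{
    return (unsigned)jiffies < TCP_RTO_MIN;
}

/*
 * Hypothetical alternative: validate seconds as a signed quantity,
 * then store jiffies in an unsigned variable.  Returns 0 on success
 * and -1 if the value is out of range.
 */
static int set_rto_max_secs(unsigned *rto_max, long secs)
{
    if (secs < 1 || secs > INT_MAX / HZ)
        return -1;
    *rto_max = (unsigned)secs * HZ;
    return 0;
}
```

With these definitions patch_check_rejects(-1) returns 0, so a negative
value would slip past the check, whereas set_rto_max_secs() refuses it
and leaves the stored value untouched.
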

>  ctl_table ipv4_table[] = {
>       {
>               .ctl_name       = NET_IPV4_TCP_TIMESTAMPS,
> @@ -363,6 +431,15 @@ ctl_table ipv4_table[] = {
>               .proc_handler   = &proc_dointvec
>       },
>       {
> +             .ctl_name       = NET_TCP_RTO_MAX,
> +             .procname       = "tcp_rto_max",
> +             .data           = &sysctl_tcp_rto_max,
> +             .maxlen         = sizeof(int),
> +             .mode           = 0644,
> +             .proc_handler   = &proc_tcp_rto_max,
> +             .strategy       = &strategy_tcp_rto_max
> +     },
> +     {
>               .ctl_name       = NET_IPV4_TCP_FIN_TIMEOUT,
>               .procname       = "tcp_fin_timeout",
>               .data           = &sysctl_tcp_fin_timeout,
> diff -uprN -X a/Documentation/dontdiff linux-2.6.22-rc5-orig/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> --- a/net/ipv4/tcp_timer.c    2007-06-22 21:34:33.000000000 +0900
> +++ b/net/ipv4/tcp_timer.c    2007-06-22 21:39:35.000000000 +0900
> @@ -31,6 +31,9 @@ int sysctl_tcp_keepalive_intvl __read_mo
>  int sysctl_tcp_retries1 __read_mostly = TCP_RETR1;
>  int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
>  int sysctl_tcp_orphan_retries __read_mostly;
> +int sysctl_tcp_rto_max __read_mostly = TCP_RTO_MAX_DEFAULT;
> +
> +EXPORT_SYMBOL(sysctl_tcp_rto_max);
>  
>  static void tcp_write_timer(unsigned long);
>  static void tcp_delack_timer(unsigned long);


-- 
Stephen Hemminger <[EMAIL PROTECTED]>