Author: lstewart
Date: Fri Nov 12 06:41:55 2010
New Revision: 215166
URL: http://svn.freebsd.org/changeset/base/215166

Log:
  This commit marks the first formal contribution of the "Five New TCP Congestion
  Control Algorithms for FreeBSD" FreeBSD Foundation funded project. More details
  about the project are available at: http://caia.swin.edu.au/freebsd/5cc/
  
  - Add a KPI and supporting infrastructure to allow modular congestion control
    algorithms to be used in the net stack. Algorithms can maintain per-connection
    state if required, and connections maintain their own algorithm pointer, which
    allows different connections to concurrently use different algorithms. The
    TCP_CONGESTION socket option can be used with getsockopt()/setsockopt() to
    programmatically query or change the congestion control algorithm respectively
    from within an application at runtime.
  
  - Integrate the framework with the TCP stack in as least intrusive a manner as
    possible. Care was also taken to develop the framework in a way that should
    allow integration with other congestion aware transport protocols (e.g. SCTP)
    in the future. The hope is that we will one day be able to share a single set
    of congestion control algorithm modules between all congestion aware transport
    protocols.
  
  - Introduce a new congestion recovery (TF_CONGRECOVERY) state into the TCP stack
    and use it to decouple the meaning of recovery from a congestion event and
    recovery from packet loss (TF_FASTRECOVERY) a la RFC2581. ECN and delay based
    congestion control protocols don't generally need to recover from packet loss
    and need a different way to note a congestion recovery episode within the
    stack.
  
  - Remove the net.inet.tcp.newreno sysctl, which simplifies some portions of code
    and ensures the stack always uses the appropriate mechanisms for recovering
    from packet loss during a congestion recovery episode.
  
  - Extract the NewReno congestion control algorithm from the TCP stack and
    massage it into module form. NewReno is always built into the kernel and will
    remain the default algorithm for the foreseeable future. Implementations of
    additional algorithms will become available in the near future.
  
  - Bump __FreeBSD_version to 900025 and note in UPDATING that rebuilding code
    that relies on the size of "struct tcpcb" is required.
  
  Many thanks go to the Cisco University Research Program Fund at Community
  Foundation Silicon Valley and the FreeBSD Foundation. Their support of our work
  at the Centre for Advanced Internet Architectures, Swinburne University of
  Technology is greatly appreciated.
  
  In collaboration with:        David Hayes <dahayes at swin edu au> and
                        Grenville Armitage <garmitage at swin edu au>
  Sponsored by: Cisco URP, FreeBSD Foundation
  Reviewed by:  rpaulo
  Tested by:    David Hayes (and many others over the years)
  MFC after:    3 months

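As an illustration of the TCP_CONGESTION socket option described in the log
above, here is a minimal userland sketch. The helper names cc_get_name() and
cc_set_name() are hypothetical and not part of this commit, and the fallback
value of 16 for TCP_CA_NAME_MAX is an assumption for systems whose tcp.h does
not export that define.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Assumption: fallback size where tcp.h does not export TCP_CA_NAME_MAX. */
#ifndef TCP_CA_NAME_MAX
#define TCP_CA_NAME_MAX 16
#endif

/*
 * Hypothetical helper: copy the name of the congestion control algorithm
 * used by socket s into buf. Returns 0 on success, -1 on failure with
 * errno set by getsockopt().
 */
int
cc_get_name(int s, char *buf, size_t buflen)
{
	socklen_t len;

	assert(buf != NULL && buflen > 1);
	memset(buf, 0, buflen);
	len = (socklen_t)(buflen - 1);	/* Keep the final byte as a NUL. */
	return (getsockopt(s, IPPROTO_TCP, TCP_CONGESTION, buf, &len));
}

/*
 * Hypothetical helper: ask the stack to switch socket s to the named
 * algorithm. Returns 0 on success, -1 on failure with errno set by
 * setsockopt().
 */
int
cc_set_name(int s, const char *name)
{
	assert(name != NULL);
	return (setsockopt(s, IPPROTO_TCP, TCP_CONGESTION, name,
	    (socklen_t)strlen(name)));
}
```

An application would typically call cc_set_name(s, "newreno") on a TCP socket
and then cc_get_name(s, buf, sizeof(buf)) with a buffer of TCP_CA_NAME_MAX
bytes to confirm which algorithm is in effect.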
Added:
  head/sys/netinet/cc/
  head/sys/netinet/cc.h   (contents, props changed)
  head/sys/netinet/cc/cc.c   (contents, props changed)
  head/sys/netinet/cc/cc_module.h   (contents, props changed)
  head/sys/netinet/cc/cc_newreno.c   (contents, props changed)
Modified:
  head/UPDATING
  head/sys/conf/files
  head/sys/netinet/tcp_input.c
  head/sys/netinet/tcp_output.c
  head/sys/netinet/tcp_sack.c
  head/sys/netinet/tcp_subr.c
  head/sys/netinet/tcp_timer.c
  head/sys/netinet/tcp_usrreq.c
  head/sys/netinet/tcp_var.h
  head/sys/sys/param.h

Modified: head/UPDATING
==============================================================================
--- head/UPDATING       Fri Nov 12 05:22:27 2010        (r215165)
+++ head/UPDATING       Fri Nov 12 06:41:55 2010        (r215166)
@@ -22,6 +22,13 @@ NOTE TO PEOPLE WHO THINK THAT FreeBSD 9.
        machines to maximize performance.  (To disable malloc debugging, run
        ln -s aj /etc/malloc.conf.)
 
+20101111:
+       The TCP stack has received a significant update to add support for
+       modularised congestion control and generally improve the clarity of
+       congestion control decisions. Bump __FreeBSD_version to 900025. User
+       space tools that rely on the size of struct tcpcb in tcp_var.h (e.g.
+       sockstat) need to be recompiled.
+
 20101002:
        The man(1) utility has been replaced by a new version that no longer
        uses /etc/manpath.config. Please consult man.conf(5) for how to

Modified: head/sys/conf/files
==============================================================================
--- head/sys/conf/files Fri Nov 12 05:22:27 2010        (r215165)
+++ head/sys/conf/files Fri Nov 12 06:41:55 2010        (r215166)
@@ -2598,6 +2598,8 @@ netinet/ip_mroute.c               optional mrouting i
 netinet/ip_options.c           optional inet
 netinet/ip_output.c            optional inet
 netinet/raw_ip.c               optional inet
+netinet/cc/cc.c                        optional inet
+netinet/cc/cc_newreno.c                optional inet
 netinet/sctp_asconf.c          optional inet sctp
 netinet/sctp_auth.c            optional inet sctp
 netinet/sctp_bsd_addr.c                optional inet sctp

Added: head/sys/netinet/cc.h
==============================================================================
--- /dev/null   00:00:00 1970   (empty, because file is newly added)
+++ head/sys/netinet/cc.h       Fri Nov 12 06:41:55 2010        (r215166)
@@ -0,0 +1,161 @@
+/*-
+ * Copyright (c) 2007-2008
+ *     Swinburne University of Technology, Melbourne, Australia.
+ * Copyright (c) 2009-2010 Lawrence Stewart <lstew...@freebsd.org>
+ * Copyright (c) 2010 The FreeBSD Foundation
+ * All rights reserved.
+ *
+ * This software was developed at the Centre for Advanced Internet
+ * Architectures, Swinburne University, by Lawrence Stewart and James Healy,
+ * made possible in part by a grant from the Cisco University Research Program
+ * Fund at Community Foundation Silicon Valley.
+ *
+ * Portions of this software were developed at the Centre for Advanced
+ * Internet Architectures, Swinburne University of Technology, Melbourne,
+ * Australia by David Hayes under sponsorship from the FreeBSD Foundation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ *
+ * $FreeBSD$
+ */
+
+/*
+ * This software was first released in 2007 by James Healy and Lawrence Stewart
+ * whilst working on the NewTCP research project at Swinburne University's
+ * Centre for Advanced Internet Architectures, Melbourne, Australia, which was
+ * made possible in part by a grant from the Cisco University Research Program
+ * Fund at Community Foundation Silicon Valley. More details are available at:
+ *   http://caia.swin.edu.au/urp/newtcp/
+ */
+
+#ifndef _NETINET_CC_H_
+#define _NETINET_CC_H_
+
+/* XXX: TCP_CA_NAME_MAX define lives in tcp.h for compat reasons. */
+#include <netinet/tcp.h>
+
+/* Global CC vars. */
+extern STAILQ_HEAD(cc_head, cc_algo) cc_list;
+extern const int tcprexmtthresh;
+extern struct cc_algo newreno_cc_algo;
+
+/* Define the new net.inet.tcp.cc sysctl tree. */
+SYSCTL_DECL(_net_inet_tcp_cc);
+
+/* CC housekeeping functions. */
+void   cc_init(void);
+int    cc_register_algo(struct cc_algo *add_cc);
+int    cc_deregister_algo(struct cc_algo *remove_cc);
+
+/*
+ * Wrapper around transport structs that contain same-named congestion
+ * control variables. Allows algos to be shared amongst multiple CC aware
+ * transports.
+ */
+struct cc_var {
+       void            *cc_data; /* Per-connection private CC algorithm data. */
+       int             bytes_this_ack; /* # bytes acked by the current ACK. */
+       tcp_seq         curack; /* Most recent ACK. */
+       uint32_t        flags; /* Flags for cc_var (see below) */
+       int             type; /* Indicates which ptr is valid in ccvc. */
+       union ccv_container {
+               struct tcpcb            *tcp;
+               struct sctp_nets        *sctp;
+       } ccvc;
+};
+
+/* cc_var flags. */
+#define        CCF_ABC_SENTAWND        0x0001  /* ABC counted cwnd worth of bytes? */
+#define        CCF_CWND_LIMITED        0x0002  /* Are we currently cwnd limited? */
+
+/* ACK types passed to the ack_received() hook. */
+#define        CC_ACK          0x0001  /* Regular in sequence ACK. */
+#define        CC_DUPACK       0x0002  /* Duplicate ACK. */
+#define        CC_PARTIALACK   0x0004  /* Not yet. */
+#define        CC_SACK         0x0008  /* Not yet. */
+
+/*
+ * Congestion signal types passed to the cong_signal() hook. The highest order 8
+ * bits (0x01000000 - 0x80000000) are reserved for CC algos to declare their own
+ * congestion signal types.
+ */
+#define        CC_ECN          0x000001/* ECN marked packet received. */
+#define        CC_RTO          0x000002/* RTO fired. */
+#define        CC_RTO_ERR      0x000004/* RTO fired in error. */
+#define        CC_NDUPACK      0x000008/* Threshold of dupack's reached. */
+
+/*
+ * Structure to hold data and function pointers that together represent a
+ * congestion control algorithm.
+ */
+struct cc_algo {
+       char    name[TCP_CA_NAME_MAX];
+
+       /* Init global module state on kldload. */
+       int     (*mod_init)(void);
+
+       /* Cleanup global module state on kldunload. */
+       int     (*mod_destroy)(void);
+
+       /* Init CC state for a new control block. */
+       int     (*cb_init)(struct cc_var *ccv);
+
+       /* Cleanup CC state for a terminating control block. */
+       void    (*cb_destroy)(struct cc_var *ccv);
+
+       /* Init variables for a newly established connection. */
+       void    (*conn_init)(struct cc_var *ccv);
+
+       /* Called on receipt of an ack. */
+       void    (*ack_received)(struct cc_var *ccv, uint16_t type);
+
+       /* Called on detection of a congestion signal. */
+       void    (*cong_signal)(struct cc_var *ccv, uint32_t type);
+
+       /* Called after exiting congestion recovery. */
+       void    (*post_recovery)(struct cc_var *ccv);
+
+       /* Called when data transfer resumes after an idle period. */
+       void    (*after_idle)(struct cc_var *ccv);
+
+       STAILQ_ENTRY (cc_algo) entries;
+};
+
+/* Macro to obtain the CC algo's struct ptr. */
+#define        CC_ALGO(tp)     ((tp)->cc_algo)
+
+/* Macro to obtain the CC algo's data ptr. */
+#define        CC_DATA(tp)     ((tp)->ccv->cc_data)
+
+/* Macro to obtain the system default CC algo's struct ptr. */
+#define        CC_DEFAULT()    STAILQ_FIRST(&cc_list)
+
+extern struct rwlock cc_list_lock;
+#define        CC_LIST_LOCK_INIT()     rw_init(&cc_list_lock, "cc_list")
+#define        CC_LIST_LOCK_DESTROY()  rw_destroy(&cc_list_lock)
+#define        CC_LIST_RLOCK()         rw_rlock(&cc_list_lock)
+#define        CC_LIST_RUNLOCK()       rw_runlock(&cc_list_lock)
+#define        CC_LIST_WLOCK()         rw_wlock(&cc_list_lock)
+#define        CC_LIST_WUNLOCK()       rw_wunlock(&cc_list_lock)
+#define        CC_LIST_WLOCK_ASSERT()  rw_assert(&cc_list_lock, RA_WLOCKED)
+
+#endif /* _NETINET_CC_H_ */

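The hook table in struct cc_algo can be pictured with a small userland
sketch. Everything prefixed toy_ or TOY_ below is hypothetical and only
mirrors the shape of the real structures. The cc.c code in this commit
NULL-checks mod_init, mod_destroy and cb_destroy before calling them;
treating every hook as optional in the same way is an assumption made here
for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CC_ACK	0x0001		/* Mirrors CC_ACK. */
#define TOY_CC_NDUPACK	0x000008	/* Mirrors CC_NDUPACK. */

/* Hypothetical stand-in for the per-connection state behind struct cc_var. */
struct toy_var {
	uint32_t cwnd;		/* Congestion window in bytes. */
	uint32_t maxseg;	/* Segment size in bytes. */
};

/* Hypothetical stand-in for struct cc_algo with two of its hooks. */
struct toy_algo {
	const char *name;
	void (*ack_received)(struct toy_var *v, uint16_t type);
	void (*cong_signal)(struct toy_var *v, uint32_t type);
};

/* Grow the window by one segment on a regular in-sequence ACK. */
static void
toy_ack_received(struct toy_var *v, uint16_t type)
{
	if (type == TOY_CC_ACK)
		v->cwnd += v->maxseg;
}

/* This toy algorithm leaves cong_signal unset (NULL). */
static struct toy_algo toy = {
	.name = "toy",
	.ack_received = toy_ack_received
};

/* Call the ACK hook if the algorithm provides one. Returns 1 if called. */
static int
toy_ack_hook(struct toy_algo *a, struct toy_var *v, uint16_t type)
{
	assert(a != NULL && v != NULL);
	if (a->ack_received == NULL)
		return (0);
	a->ack_received(v, type);
	return (1);
}

/* Call the congestion signal hook if present. Returns 1 if called. */
static int
toy_signal_hook(struct toy_algo *a, struct toy_var *v, uint32_t type)
{
	assert(a != NULL && v != NULL);
	if (a->cong_signal == NULL)
		return (0);
	a->cong_signal(v, type);
	return (1);
}
```

The point of the design is that an algorithm only fills in the hooks it
cares about, as newreno_cc_algo does later in this diff.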
Added: head/sys/netinet/cc/cc.c
==============================================================================
--- /dev/null   00:00:00 1970   (empty, because file is newly added)
+++ head/sys/netinet/cc/cc.c    Fri Nov 12 06:41:55 2010        (r215166)
@@ -0,0 +1,340 @@
+/*-
+ * Copyright (c) 2007-2008
+ *     Swinburne University of Technology, Melbourne, Australia.
+ * Copyright (c) 2009-2010 Lawrence Stewart <lstew...@freebsd.org>
+ * Copyright (c) 2010 The FreeBSD Foundation
+ * All rights reserved.
+ *
+ * This software was developed at the Centre for Advanced Internet
+ * Architectures, Swinburne University, by Lawrence Stewart and James Healy,
+ * made possible in part by a grant from the Cisco University Research Program
+ * Fund at Community Foundation Silicon Valley.
+ *
+ * Portions of this software were developed at the Centre for Advanced
+ * Internet Architectures, Swinburne University of Technology, Melbourne,
+ * Australia by David Hayes under sponsorship from the FreeBSD Foundation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/*
+ * This software was first released in 2007 by James Healy and Lawrence Stewart
+ * whilst working on the NewTCP research project at Swinburne University's
+ * Centre for Advanced Internet Architectures, Melbourne, Australia, which was
+ * made possible in part by a grant from the Cisco University Research Program
+ * Fund at Community Foundation Silicon Valley. More details are available at:
+ *   http://caia.swin.edu.au/urp/newtcp/
+ */
+
+#include <sys/cdefs.h>
+__FBSDID("$FreeBSD$");
+
+#include <sys/param.h>
+#include <sys/kernel.h>
+#include <sys/libkern.h>
+#include <sys/lock.h>
+#include <sys/malloc.h>
+#include <sys/module.h>
+#include <sys/mutex.h>
+#include <sys/queue.h>
+#include <sys/rwlock.h>
+#include <sys/sbuf.h>
+#include <sys/socket.h>
+#include <sys/socketvar.h>
+#include <sys/sysctl.h>
+
+#include <net/if.h>
+#include <net/if_var.h>
+
+#include <netinet/cc.h>
+#include <netinet/in.h>
+#include <netinet/in_pcb.h>
+#include <netinet/tcp_var.h>
+
+#include <netinet/cc/cc_module.h>
+
+/*
+ * List of available cc algorithms on the current system. First element
+ * is used as the system default CC algorithm.
+ */
+struct cc_head cc_list = STAILQ_HEAD_INITIALIZER(cc_list);
+
+/* Protects the cc_list TAILQ. */
+struct rwlock cc_list_lock;
+
+/*
+ * Set the default CC algorithm to new_default. The default is identified
+ * by being the first element in the cc_list TAILQ.
+ */
+static void
+cc_set_default(struct cc_algo *new_default)
+{
+       CC_LIST_WLOCK_ASSERT();
+
+       /*
+        * Make the requested system default CC algorithm the first element in
+        * the list if it isn't already.
+        */
+       if (new_default != CC_DEFAULT()) {
+               STAILQ_REMOVE(&cc_list, new_default, cc_algo, entries);
+               STAILQ_INSERT_HEAD(&cc_list, new_default, entries);
+       }
+}
+
+/*
+ * Sysctl handler to show and change the default CC algorithm.
+ */
+static int
+cc_default_algo(SYSCTL_HANDLER_ARGS)
+{
+       struct cc_algo *funcs;
+       int err, found;
+
+       err = found = 0;
+
+       if (req->newptr == NULL) {
+               char default_cc[TCP_CA_NAME_MAX];
+
+               /* Just print the current default. */
+               CC_LIST_RLOCK();
+               strlcpy(default_cc, CC_DEFAULT()->name, TCP_CA_NAME_MAX);
+               CC_LIST_RUNLOCK();
+               err = sysctl_handle_string(oidp, default_cc, 1, req);
+       } else {
+               /* Find algo with specified name and set it to default. */
+               CC_LIST_WLOCK();
+               STAILQ_FOREACH(funcs, &cc_list, entries) {
+                       if (strncmp((char *)req->newptr, funcs->name,
+                           TCP_CA_NAME_MAX) == 0) {
+                               found = 1;
+                               cc_set_default(funcs);
+                       }
+               }
+               CC_LIST_WUNLOCK();
+
+               if (!found)
+                       err = ESRCH;
+       }
+
+       return (err);
+}
+
+/*
+ * Sysctl handler to display the list of available CC algorithms.
+ */
+static int
+cc_list_available(SYSCTL_HANDLER_ARGS)
+{
+       struct cc_algo *algo;
+       struct sbuf *s;
+       int err, first;
+
+       err = 0;
+       first = 1;
+       s = sbuf_new(NULL, NULL, TCP_CA_NAME_MAX, SBUF_AUTOEXTEND);
+
+       if (s == NULL)
+               return (ENOMEM);
+
+       CC_LIST_RLOCK();
+       STAILQ_FOREACH(algo, &cc_list, entries) {
+               err = sbuf_printf(s, first ? "%s" : ", %s", algo->name);
+               if (err)
+                       break;
+               first = 0;
+       }
+       CC_LIST_RUNLOCK();
+
+       if (!err) {
+               sbuf_finish(s);
+               err = sysctl_handle_string(oidp, sbuf_data(s), 1, req);
+       }
+
+       sbuf_delete(s);
+       return (err);
+}
+
+/*
+ * Initialise CC subsystem on system boot.
+ */
+void
+cc_init()
+{
+       CC_LIST_LOCK_INIT();
+       STAILQ_INIT(&cc_list);
+}
+
+/*
+ * Returns 0 on success, non-zero on failure.
+ */
+int
+cc_deregister_algo(struct cc_algo *remove_cc)
+{
+       struct cc_algo *funcs, *tmpfuncs;
+       struct tcpcb *tp;
+       struct inpcb *inp;
+       int err;
+
+       err = ENOENT;
+
+       /* Never allow newreno to be deregistered. */
+       if (&newreno_cc_algo == remove_cc)
+               return (EPERM);
+
+       /* Remove algo from cc_list so that new connections can't use it. */
+       CC_LIST_WLOCK();
+       STAILQ_FOREACH_SAFE(funcs, &cc_list, entries, tmpfuncs) {
+               if (funcs == remove_cc) {
+                       /*
+                        * If we're removing the current system default,
+                        * reset the default to newreno.
+                        */
+                       if (strncmp(CC_DEFAULT()->name, remove_cc->name,
+                           TCP_CA_NAME_MAX) == 0)
+                               cc_set_default(&newreno_cc_algo);
+
+                       STAILQ_REMOVE(&cc_list, funcs, cc_algo, entries);
+                       err = 0;
+                       break;
+               }
+       }
+       CC_LIST_WUNLOCK();
+       
+       if (!err) {
+               /*
+                * Check all active control blocks and change any that are
+                * using this algorithm back to newreno. If the algorithm that
+                * was in use requires cleanup code to be run, call it.
+                *
+                * New connections already part way through being initialised
+                * with the CC algo we're removing will not race with this code
+                * because the INP_INFO_WLOCK is held during initialisation.
+                * We therefore don't enter the loop below until the connection
+                * list has stabilised.
+                */
+               INP_INFO_RLOCK(&V_tcbinfo);
+               LIST_FOREACH(inp, &V_tcb, inp_list) {
+                       INP_WLOCK(inp);
+                       /* Important to skip tcptw structs. */
+                       if (!(inp->inp_flags & INP_TIMEWAIT) &&
+                           (tp = intotcpcb(inp)) != NULL) {
+                               /*
+                                * By holding INP_WLOCK here, we are
+                                * assured that the connection is not
+                                * currently executing inside the CC
+                                * module's functions i.e. it is safe to
+                                * make the switch back to newreno.
+                                */
+                               if (CC_ALGO(tp) == remove_cc) {
+                                       tmpfuncs = CC_ALGO(tp);
+                                       /* Newreno does not require any init. */
+                                       CC_ALGO(tp) = &newreno_cc_algo;
+                                       if (tmpfuncs->cb_destroy != NULL)
+                                               tmpfuncs->cb_destroy(tp->ccv);
+                               }
+                       }
+                       INP_WUNLOCK(inp);
+               }
+               INP_INFO_RUNLOCK(&V_tcbinfo);
+       }
+
+       return (err);
+}
+
+/*
+ * Returns 0 on success, non-zero on failure.
+ */
+int
+cc_register_algo(struct cc_algo *add_cc)
+{
+       struct cc_algo *funcs;
+       int err;
+
+       err = 0;
+
+       /*
+        * Iterate over list of registered CC algorithms and make sure
+        * we're not trying to add a duplicate.
+        */
+       CC_LIST_WLOCK();
+       STAILQ_FOREACH(funcs, &cc_list, entries) {
+               if (funcs == add_cc || strncmp(funcs->name, add_cc->name,
+                   TCP_CA_NAME_MAX) == 0)
+                       err = EEXIST;
+       }
+
+       if (!err)
+               STAILQ_INSERT_TAIL(&cc_list, add_cc, entries);
+
+       CC_LIST_WUNLOCK();
+
+       return (err);
+}
+
+/*
+ * Handles kld related events. Returns 0 on success, non-zero on failure.
+ */
+int
+cc_modevent(module_t mod, int event_type, void *data)
+{
+       struct cc_algo *algo;
+       int err;
+
+       err = 0;
+       algo = (struct cc_algo *)data;
+
+       switch(event_type) {
+       case MOD_LOAD:
+               if (algo->mod_init != NULL)
+                       err = algo->mod_init();
+               if (!err)
+                       err = cc_register_algo(algo);
+               break;
+
+       case MOD_QUIESCE:
+       case MOD_SHUTDOWN:
+       case MOD_UNLOAD:
+               err = cc_deregister_algo(algo);
+               if (!err && algo->mod_destroy != NULL)
+                       algo->mod_destroy();
+               if (err == ENOENT)
+                       err = 0;
+               break;
+
+       default:
+               err = EINVAL;
+               break;
+       }
+
+       return (err);
+}
+
+/* Declare sysctl tree and populate it. */
+SYSCTL_NODE(_net_inet_tcp, OID_AUTO, cc, CTLFLAG_RW, NULL,
+    "congestion control related settings");
+
+SYSCTL_PROC(_net_inet_tcp_cc, OID_AUTO, algorithm, CTLTYPE_STRING|CTLFLAG_RW,
+    NULL, 0, cc_default_algo, "A", "default congestion control algorithm");
+
+SYSCTL_PROC(_net_inet_tcp_cc, OID_AUTO, available, CTLTYPE_STRING|CTLFLAG_RD,
+    NULL, 0, cc_list_available, "A",
+    "list available congestion control algorithms");

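The list handling in cc.c above (register at the tail, reject duplicates,
treat the head as the system default) can be modelled in userland roughly as
follows. This is a sketch only: the algo_* names and ALGO_NAME_MAX are
hypothetical, and the rwlock that protects the real cc_list is deliberately
left out.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/queue.h>

#define ALGO_NAME_MAX 16	/* Assumption: stand-in for TCP_CA_NAME_MAX. */

struct algo {
	char name[ALGO_NAME_MAX];
	STAILQ_ENTRY(algo) entries;
};

STAILQ_HEAD(algo_head, algo);
static struct algo_head algo_list = STAILQ_HEAD_INITIALIZER(algo_list);

/* Append an algorithm unless the pointer or name is already registered. */
static int
algo_register(struct algo *add)
{
	struct algo *a;

	assert(add != NULL);
	STAILQ_FOREACH(a, &algo_list, entries) {
		if (a == add ||
		    strncmp(a->name, add->name, ALGO_NAME_MAX) == 0)
			return (EEXIST);
	}
	STAILQ_INSERT_TAIL(&algo_list, add, entries);
	return (0);
}

/* Make the named algorithm the default by moving it to the list head. */
static int
algo_set_default(const char *name)
{
	struct algo *a;

	assert(name != NULL);
	STAILQ_FOREACH(a, &algo_list, entries) {
		if (strncmp(name, a->name, ALGO_NAME_MAX) == 0) {
			if (a != STAILQ_FIRST(&algo_list)) {
				STAILQ_REMOVE(&algo_list, a, algo, entries);
				STAILQ_INSERT_HEAD(&algo_list, a, entries);
			}
			return (0);
		}
	}
	return (ESRCH);
}

/* The default is whatever sits at the head, or NULL for an empty list. */
static const char *
algo_default_name(void)
{
	struct algo *a = STAILQ_FIRST(&algo_list);

	return (a != NULL ? a->name : NULL);
}
```

Unlike the sysctl handler above, this sketch returns as soon as a match is
found rather than continuing to walk the list after reordering it.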
Added: head/sys/netinet/cc/cc_module.h
==============================================================================
--- /dev/null   00:00:00 1970   (empty, because file is newly added)
+++ head/sys/netinet/cc/cc_module.h     Fri Nov 12 06:41:55 2010        (r215166)
@@ -0,0 +1,70 @@
+/*-
+ * Copyright (c) 2009-2010 Lawrence Stewart <lstew...@freebsd.org>
+ * All rights reserved.
+ *
+ * This software was developed by Lawrence Stewart while studying at the Centre
+ * for Advanced Internet Architectures, Swinburne University, made possible in
+ * part by a grant from the Cisco University Research Program Fund at Community
+ * Foundation Silicon Valley.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ *
+ * $FreeBSD$
+ */
+
+/*
+ * This software was first released in 2009 by Lawrence Stewart as part of the
+ * NewTCP research project at Swinburne University's Centre for Advanced
+ * Internet Architectures, Melbourne, Australia, which was made possible in part
+ * by a grant from the Cisco University Research Program Fund at Community
+ * Foundation Silicon Valley. More details are available at:
+ *   http://caia.swin.edu.au/urp/newtcp/
+ */
+
+#ifndef _NETINET_CC_MODULE_H_
+#define _NETINET_CC_MODULE_H_
+
+/*
+ * Allows a CC algorithm to manipulate a commonly named CC variable regardless
+ * of the transport protocol and associated C struct.
+ * XXXLAS: Out of action until the work to support SCTP is done.
+ *
+#define        CCV(ccv, what)                                          \
+(*(                                                                    \
+       (ccv)->type == IPPROTO_TCP ?    &(ccv)->ccvc.tcp->what :        \
+                                       &(ccv)->ccvc.sctp->what         \
+))
+ */
+#define        CCV(ccv, what) (ccv)->ccvc.tcp->what
+
+#define        DECLARE_CC_MODULE(ccname, ccalgo)                       \
+       static moduledata_t cc_##ccname = {                             \
+               .name = #ccname,                                        \
+               .evhand = cc_modevent,                                  \
+               .priv = ccalgo                                          \
+       };                                                              \
+       DECLARE_MODULE(ccname, cc_##ccname,                             \
+           SI_SUB_PROTO_IFATTACHDOMAIN, SI_ORDER_ANY)
+
+int    cc_modevent(module_t mod, int type, void *data);
+
+#endif /* _NETINET_CC_MODULE_H_ */

Added: head/sys/netinet/cc/cc_newreno.c
==============================================================================
--- /dev/null   00:00:00 1970   (empty, because file is newly added)
+++ head/sys/netinet/cc/cc_newreno.c    Fri Nov 12 06:41:55 2010        (r215166)
@@ -0,0 +1,231 @@
+/*-
+ * Copyright (c) 1982, 1986, 1988, 1990, 1993, 1994, 1995
+ *     The Regents of the University of California.
+ * Copyright (c) 2007-2008,2010
+ *     Swinburne University of Technology, Melbourne, Australia.
+ * Copyright (c) 2009-2010 Lawrence Stewart <lstew...@freebsd.org>
+ * Copyright (c) 2010 The FreeBSD Foundation
+ * All rights reserved.
+ *
+ * This software was developed at the Centre for Advanced Internet
+ * Architectures, Swinburne University, by Lawrence Stewart, James Healy and
+ * David Hayes, made possible in part by a grant from the Cisco University
+ * Research Program Fund at Community Foundation Silicon Valley.
+ *
+ * Portions of this software were developed at the Centre for Advanced
+ * Internet Architectures, Swinburne University of Technology, Melbourne,
+ * Australia by David Hayes under sponsorship from the FreeBSD Foundation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/*
+ * This software was first released in 2007 by James Healy and Lawrence Stewart
+ * whilst working on the NewTCP research project at Swinburne University's
+ * Centre for Advanced Internet Architectures, Melbourne, Australia, which was
+ * made possible in part by a grant from the Cisco University Research Program
+ * Fund at Community Foundation Silicon Valley. More details are available at:
+ *   http://caia.swin.edu.au/urp/newtcp/
+ */
+
+#include <sys/cdefs.h>
+__FBSDID("$FreeBSD$");
+
+#include <sys/param.h>
+#include <sys/kernel.h>
+#include <sys/module.h>
+#include <sys/socket.h>
+#include <sys/socketvar.h>
+#include <sys/sysctl.h>
+
+#include <net/if.h>
+#include <net/if_var.h>
+
+#include <netinet/cc.h>
+#include <netinet/in.h>
+#include <netinet/in_pcb.h>
+#include <netinet/tcp_seq.h>
+#include <netinet/tcp_var.h>
+
+#include <netinet/cc/cc_module.h>
+
+void   newreno_ack_received(struct cc_var *ccv, uint16_t type);
+void   newreno_cong_signal(struct cc_var *ccv, uint32_t type);
+void   newreno_post_recovery(struct cc_var *ccv);
+void   newreno_after_idle(struct cc_var *ccv);
+
+struct cc_algo newreno_cc_algo = {
+       .name = "newreno",
+       .ack_received = newreno_ack_received,
+       .cong_signal = newreno_cong_signal,
+       .post_recovery = newreno_post_recovery,
+       .after_idle = newreno_after_idle
+};
+
+/*
+ * Increase cwnd on receipt of a successful ACK:
+ * if cwnd <= ssthresh, increase by 1 MSS per ACK
+ * if cwnd > ssthresh, increase by ~1 MSS per RTT
+ */
+void
+newreno_ack_received(struct cc_var *ccv, uint16_t type)
+{
+       if (type == CC_ACK && !IN_RECOVERY(CCV(ccv, t_flags)) &&
+           (ccv->flags & CCF_CWND_LIMITED)) {
+               u_int cw = CCV(ccv, snd_cwnd);
+               u_int incr = CCV(ccv, t_maxseg);
+
+               /*
+                * Regular in-order ACK, open the congestion window.
+                * Method depends on which congestion control state we're
+                * in (slow start or cong avoid) and if ABC (RFC 3465) is
+                * enabled.
+                *
+                * slow start: cwnd <= ssthresh
+                * cong avoid: cwnd > ssthresh
+                *
+                * slow start and ABC (RFC 3465):
+                *   Grow cwnd exponentially by the amount of data
+                *   ACKed capping the max increment per ACK to
+                *   (abc_l_var * maxseg) bytes.
+                *
+                * slow start without ABC (RFC 5681):
+                *   Grow cwnd exponentially by maxseg per ACK.
+                *
+                * cong avoid and ABC (RFC 3465):
+                *   Grow cwnd linearly by maxseg per RTT for each
+                *   cwnd worth of ACKed data.
+                *
+                * cong avoid without ABC (RFC 5681):
+                *   Grow cwnd linearly by approximately maxseg per RTT using
+                *   maxseg^2 / cwnd per ACK as the increment.
+                *   If cwnd > maxseg^2, fix the cwnd increment at 1 byte to
+                *   avoid capping cwnd.
+                */
+               if (cw > CCV(ccv, snd_ssthresh)) {
+                       if (V_tcp_do_rfc3465) {
+                               if (ccv->flags & CCF_ABC_SENTAWND)
+                                       ccv->flags &= ~CCF_ABC_SENTAWND;
+                               else
+                                       incr = 0;
+                       } else
+                               incr = max((incr * incr / cw), 1);
+               } else if (V_tcp_do_rfc3465) {
+                       /*
+                        * In slow-start with ABC enabled and no RTO in sight?
+                        * (Must not use abc_l_var > 1 if slow starting after
+                        * an RTO. On RTO, snd_nxt = snd_una, so the
+                        * snd_nxt == snd_max check is sufficient to
+                        * handle this).
+                        *
+                        * XXXLAS: Find a way to signal SS after RTO that
+                        * doesn't rely on tcpcb vars.
+                        */
+                       if (CCV(ccv, snd_nxt) == CCV(ccv, snd_max))
+                               incr = min(ccv->bytes_this_ack,
+                                   V_tcp_abc_l_var * CCV(ccv, t_maxseg));
+                       else
+                               incr = min(ccv->bytes_this_ack, CCV(ccv, t_maxseg));
+               }
+               /* ABC is on by default, so incr equals 0 frequently. */
+               if (incr > 0)
+                       CCV(ccv, snd_cwnd) = min(cw + incr,
+                           TCP_MAXWIN << CCV(ccv, snd_scale));
+       }
+}
+
+/*
+ * Respond to congestion signals: duplicate ACKs (CC_NDUPACK) or ECN (CC_ECN).
+ */
+void
+newreno_cong_signal(struct cc_var *ccv, uint32_t type)
+{
+       u_int win;
+
+       win = max(CCV(ccv, snd_cwnd) / 2 / CCV(ccv, t_maxseg), 2) *
+           CCV(ccv, t_maxseg);
+
+       switch (type) {
+       case CC_NDUPACK:
+               if (!IN_FASTRECOVERY(CCV(ccv, t_flags))) {
+                       if (!IN_CONGRECOVERY(CCV(ccv, t_flags)))
+                               CCV(ccv, snd_ssthresh) = win;
+                       ENTER_RECOVERY(CCV(ccv, t_flags));
+               }
+               break;
+       case CC_ECN:
+               if (!IN_CONGRECOVERY(CCV(ccv, t_flags))) {
+                       CCV(ccv, snd_ssthresh) = win;
+                       CCV(ccv, snd_cwnd) = win;
+                       ENTER_CONGRECOVERY(CCV(ccv, t_flags));
+               }
+               break;
+       }
+}
+
+/*
+ * Reset cwnd on exit from fast recovery: use snd_ssthresh, or the amount of
+ * outstanding data plus one segment if less than snd_ssthresh is in flight,
+ * so that any burst is paced by slow start.
+ */
+void
+newreno_post_recovery(struct cc_var *ccv)
+{
+       if (IN_FASTRECOVERY(CCV(ccv, t_flags))) {
+               /*
+                * Fast recovery will conclude after returning from this
+                * function. Window inflation should have left us with
+                * approximately snd_ssthresh outstanding data. But in case we
+                * would be inclined to send a burst, better to do it via the
+                * slow start mechanism.
+                *
+                * XXXLAS: Find a way to do this without needing curack
+                */
+               if (SEQ_GT(ccv->curack + CCV(ccv, snd_ssthresh),
+                   CCV(ccv, snd_max)))
+                       CCV(ccv, snd_cwnd) = CCV(ccv, snd_max) -
+                           ccv->curack + CCV(ccv, t_maxseg);
+               else
+                       CCV(ccv, snd_cwnd) = CCV(ccv, snd_ssthresh);
+       }
+}
+
+/*
+ * If a connection has been idle for a while and more data is ready to be sent,
+ * reset cwnd to the initial window.
+ */
+void
+newreno_after_idle(struct cc_var *ccv)
+{
+       /*
+        * We have been idle for "a while" and no acks are expected to clock out
+        * any data we send -- slow start to get ack "clock" running again.
+        */
+       if (V_tcp_do_rfc3390)
+               CCV(ccv, snd_cwnd) = min(4 * CCV(ccv, t_maxseg),
+                   max(2 * CCV(ccv, t_maxseg), 4380));
+       else
+               CCV(ccv, snd_cwnd) = CCV(ccv, t_maxseg) * 2;
+}
+
+
+DECLARE_CC_MODULE(newreno, &newreno_cc_algo);
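
The window arithmetic in newreno_ack_received() and newreno_cong_signal()
above can be tried outside the kernel. The following is a minimal userland
sketch, not the module's code: struct sketch_conn and the sketch_* functions
are hypothetical stand-ins for the tcpcb fields read through CCV(). It covers
only the non-ABC path and leaves out the CCF_CWND_LIMITED check and the
TCP_MAXWIN << snd_scale cap.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the tcpcb fields the module reads via CCV(). */
struct sketch_conn {
	uint32_t cwnd;		/* congestion window, in bytes */
	uint32_t ssthresh;	/* slow start threshold, in bytes */
	uint32_t maxseg;	/* maximum segment size, in bytes */
};

static uint32_t
sketch_max(uint32_t a, uint32_t b)
{
	return (a > b ? a : b);
}

/*
 * Grow cwnd for one in-order ACK without ABC: one maxseg per ACK in slow
 * start, roughly one maxseg per RTT (maxseg^2 / cwnd per ACK, at least one
 * byte) in congestion avoidance.
 */
static void
sketch_ack_received(struct sketch_conn *c)
{
	uint32_t incr = c->maxseg;

	if (c->cwnd > c->ssthresh)
		incr = sketch_max(incr * incr / c->cwnd, 1);
	c->cwnd += incr;
}

/*
 * Window used on a congestion signal: half of cwnd rounded down to whole
 * segments, but never fewer than two segments.
 */
static uint32_t
sketch_halved_window(const struct sketch_conn *c)
{
	assert(c->maxseg > 0);	/* guard the division below */
	return (sketch_max(c->cwnd / 2 / c->maxseg, 2) * c->maxseg);
}
```

With a 1000 byte segment, a cwnd of 10000 bytes above ssthresh grows by 100
bytes per ACK, and a congestion signal at that point would give a halved
window of 5000 bytes.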

Modified: head/sys/netinet/tcp_input.c
==============================================================================
--- head/sys/netinet/tcp_input.c        Fri Nov 12 05:22:27 2010        (r215165)
+++ head/sys/netinet/tcp_input.c        Fri Nov 12 06:41:55 2010        (r215166)
@@ -1,6 +1,20 @@
 /*-
  * Copyright (c) 1982, 1986, 1988, 1990, 1993, 1994, 1995
  *     The Regents of the University of California.  All rights reserved.
+ * Copyright (c) 2007-2008,2010
+ *     Swinburne University of Technology, Melbourne, Australia.
+ * Copyright (c) 2009-2010 Lawrence Stewart <lstew...@freebsd.org>
+ * Copyright (c) 2010 The FreeBSD Foundation
+ * All rights reserved.
+ *
+ * Portions of this software were developed at the Centre for Advanced Internet
+ * Architectures, Swinburne University, by Lawrence Stewart, James Healy and
+ * David Hayes, made possible in part by a grant from the Cisco University
+ * Research Program Fund at Community Foundation Silicon Valley.
+ *
+ * Portions of this software were developed at the Centre for Advanced
+ * Internet Architectures, Swinburne University of Technology, Melbourne,
+ * Australia by David Hayes under sponsorship from the FreeBSD Foundation.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -61,6 +75,7 @@ __FBSDID("$FreeBSD$");
 
 #define TCPSTATES              /* for logging */
 
+#include <netinet/cc.h>
 #include <netinet/in.h>
 #include <netinet/in_pcb.h>
 #include <netinet/in_systm.h>
@@ -75,7 +90,6 @@ __FBSDID("$FreeBSD$");
 #include <netinet6/in6_pcb.h>
 #include <netinet6/ip6_var.h>
 #include <netinet6/nd6.h>
-#include <netinet/tcp.h>
 #include <netinet/tcp_fsm.h>
 #include <netinet/tcp_seq.h>
 #include <netinet/tcp_timer.h>
@@ -96,7 +110,7 @@ __FBSDID("$FreeBSD$");
 
 #include <security/mac/mac_framework.h>
 
-static const int tcprexmtthresh = 3;
+const int tcprexmtthresh = 3;
 
 VNET_DEFINE(struct tcpstat, tcpstat);
 SYSCTL_VNET_STRUCT(_net_inet_tcp, TCPCTL_STATS, stats, CTLFLAG_RW,
@@ -132,19 +146,16 @@ SYSCTL_VNET_INT(_net_inet_tcp, OID_AUTO,
     "Enable RFC 3042 (Limited Transmit)");
 
 VNET_DEFINE(int, tcp_do_rfc3390) = 1;
-#define        V_tcp_do_rfc3390        VNET(tcp_do_rfc3390)
 SYSCTL_VNET_INT(_net_inet_tcp, OID_AUTO, rfc3390, CTLFLAG_RW,
     &VNET_NAME(tcp_do_rfc3390), 0,
     "Enable RFC 3390 (Increasing TCP's Initial Congestion Window)");
 
 VNET_DEFINE(int, tcp_do_rfc3465) = 1;
-#define        V_tcp_do_rfc3465        VNET(tcp_do_rfc3465)
 SYSCTL_VNET_INT(_net_inet_tcp, OID_AUTO, rfc3465, CTLFLAG_RW,
     &VNET_NAME(tcp_do_rfc3465), 0,
     "Enable RFC 3465 (Appropriate Byte Counting)");
 
 VNET_DEFINE(int, tcp_abc_l_var) = 2;
-#define        V_tcp_abc_l_var         VNET(tcp_abc_l_var)
 SYSCTL_VNET_INT(_net_inet_tcp, OID_AUTO, abc_l_var, CTLFLAG_RW,
     &VNET_NAME(tcp_abc_l_var), 2,
     "Cap the max cwnd increment during slow-start to this number of segments");
@@ -203,8 +214,10 @@ static void         tcp_pulloutofband(struct so
                     struct tcphdr *, struct mbuf *, int);
 static void     tcp_xmit_timer(struct tcpcb *, int);
 static void     tcp_newreno_partial_ack(struct tcpcb *, struct tcphdr *);
-static void inline
-                tcp_congestion_exp(struct tcpcb *);
+static void inline     cc_ack_received(struct tcpcb *tp, struct tcphdr *th,
+                           uint16_t type);
+static void inline     cc_conn_init(struct tcpcb *tp);
+static void inline     cc_post_recovery(struct tcpcb *tp, struct tcphdr *th);
 
 /*
  * Kernel module interface for updating tcpstat.  The argument is an index
@@ -220,20 +233,188 @@ kmod_tcpstat_inc(int statnum)
        (*((u_long *)&V_tcpstat + statnum))++;
 }
 
+/*
+ * CC wrapper hook functions
+ */
 static void inline
-tcp_congestion_exp(struct tcpcb *tp)
+cc_ack_received(struct tcpcb *tp, struct tcphdr *th, uint16_t type)
 {
-       u_int win;
-       
-       win = min(tp->snd_wnd, tp->snd_cwnd) /
-           2 / tp->t_maxseg;
-       if (win < 2)
-               win = 2;
-       tp->snd_ssthresh = win * tp->t_maxseg;
-       ENTER_FASTRECOVERY(tp);
-       tp->snd_recover = tp->snd_max;
-       if (tp->t_flags & TF_ECN_PERMIT)
-               tp->t_flags |= TF_ECN_SND_CWR;
+       INP_WLOCK_ASSERT(tp->t_inpcb);
+
+       tp->ccv->bytes_this_ack = BYTES_THIS_ACK(tp, th);
+       if (tp->snd_cwnd == min(tp->snd_cwnd, tp->snd_wnd))
+               tp->ccv->flags |= CCF_CWND_LIMITED;
+       else
+               tp->ccv->flags &= ~CCF_CWND_LIMITED;
+
+       if (type == CC_ACK) {
+               if (tp->snd_cwnd > tp->snd_ssthresh) {
+                       tp->t_bytes_acked += min(tp->ccv->bytes_this_ack,
+                            V_tcp_abc_l_var * tp->t_maxseg);
+                       if (tp->t_bytes_acked >= tp->snd_cwnd) {
+                               tp->t_bytes_acked -= tp->snd_cwnd;
+                               tp->ccv->flags |= CCF_ABC_SENTAWND;
+                       }
+               } else {
+                       tp->ccv->flags &= ~CCF_ABC_SENTAWND;
+                       tp->t_bytes_acked = 0;
+               }
+       }
+
+       if (CC_ALGO(tp)->ack_received != NULL) {
+               /* XXXLAS: Find a way to live without this */
+               tp->ccv->curack = th->th_ack;
+               CC_ALGO(tp)->ack_received(tp->ccv, type);
+       }
+}
+
+static void inline
+cc_conn_init(struct tcpcb *tp)
+{
+       struct hc_metrics_lite metrics;
+       struct inpcb *inp = tp->t_inpcb;
+       int rtt;
+#ifdef INET6
+       int isipv6 = ((inp->inp_vflag & INP_IPV6) != 0) ? 1 : 0;
+#endif
+
+       INP_WLOCK_ASSERT(tp->t_inpcb);
+
+       tcp_hc_get(&inp->inp_inc, &metrics);
+
+       if (tp->t_srtt == 0 && (rtt = metrics.rmx_rtt)) {
+               tp->t_srtt = rtt;
+               tp->t_rttbest = tp->t_srtt + TCP_RTT_SCALE;
+               TCPSTAT_INC(tcps_usedrtt);
+               if (metrics.rmx_rttvar) {
+                       tp->t_rttvar = metrics.rmx_rttvar;
+                       TCPSTAT_INC(tcps_usedrttvar);
+               } else {
+                       /* default variation is +- 1 rtt */
+                       tp->t_rttvar =
+                           tp->t_srtt * TCP_RTTVAR_SCALE / TCP_RTT_SCALE;
+               }
+               TCPT_RANGESET(tp->t_rxtcur,

*** DIFF OUTPUT TRUNCATED AT 1000 LINES ***
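
The ABC byte accounting that cc_ack_received() performs in congestion
avoidance can be illustrated in isolation. This is a rough userland sketch
under stated assumptions, not the committed code: struct abc_state and
abc_account() are hypothetical names, and the boolean return value stands in
for setting or clearing CCF_ABC_SENTAWND in ccv->flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the tcpcb fields used by the accounting. */
struct abc_state {
	uint32_t cwnd;		/* congestion window, in bytes */
	uint32_t ssthresh;	/* slow start threshold, in bytes */
	uint32_t maxseg;	/* maximum segment size, in bytes */
	uint32_t bytes_acked;	/* analogue of t_bytes_acked */
};

/*
 * Account for one ACK. Returns true once a full cwnd worth of data has been
 * ACKed in congestion avoidance, which is the point at which the algorithm
 * may grow cwnd by one segment. abc_l is the per-ACK cap in segments
 * (the analogue of the abc_l_var sysctl).
 */
static bool
abc_account(struct abc_state *s, uint32_t bytes_this_ack, uint32_t abc_l)
{
	uint32_t cap;

	assert(s->maxseg > 0);
	cap = abc_l * s->maxseg;

	if (s->cwnd <= s->ssthresh) {
		/* Slow start: no per-window counting is carried over. */
		s->bytes_acked = 0;
		return (false);
	}

	s->bytes_acked += (bytes_this_ack < cap ? bytes_this_ack : cap);
	if (s->bytes_acked >= s->cwnd) {
		s->bytes_acked -= s->cwnd;
		return (true);
	}
	return (false);
}
```

With cwnd at 4000 bytes and 1000 byte segments, the fourth full-segment ACK
is the first to return true, which matches the "maxseg per RTT for each cwnd
worth of ACKed data" behaviour described in the NewReno module's comments.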