Hi David, this patch adds a dormant flag to network devices, RFC2863 operstate derived from these flags and possibility for userspace interaction. It allows drivers to signal that a device is unusable for user traffic without disabling queueing (and therefore the possibility for protocol establishment traffic to flow) and a userspace supplicant (WPA, 802.1X) to mark a device unusable without changes to the driver.
It is the result of our long discussion. However I must admit that it represents what Jamal and I agreed on with compromises towards Krzysztof, but Thomas and Krzysztof still disagree with some parts. Anyway I think it should be applied. Patch is against 2.6.14, applies also against 2.6.15-rc5 with some lines offset. Jamal: > My only other comment was on array operstates; is there any harm > in really adding the strings to name those states nulled right now? Done. Also removed one ASSERT_RTNL() and report IF_OPER_DOWN via netlink when device is admin down. Signed-Off-By: Stefan Rompf <[EMAIL PROTECTED]>
diff -X dontdiff -uNrp linux-2.6.14/Documentation/networking/operstates.txt linux-2.6.14-rfc2863/Documentation/networking/operstates.txt --- linux-2.6.14/Documentation/networking/operstates.txt 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.14-rfc2863/Documentation/networking/operstates.txt 2005-12-05 18:14:49.000000000 +0100 @@ -0,0 +1,161 @@ + +1. Introduction + +Linux distinguishes between administrative and operational state of an +interface. Admininstrative state is the result of "ip link set dev +<dev> up or down" and reflects whether the administrator wants to use +the device for traffic. + +However, an interface is not usable just because the admin enabled it +- ethernet requires to be plugged into the switch and, depending on +a site's networking policy and configuration, an 802.1X authentication +to be performed before user data can be transferred. Operational state +shows the ability of an interface to transmit this user data. + +Thanks to 802.1X, userspace must be granted the possibility to +influence operational state. To accommodate this, operational state is +split into two parts: Two flags that can be set by the driver only, and +a RFC2863 compatible state that is derived from these flags, a policy, +and changeable from userspace under certain rules. + + +2. Querying from userspace + +Both admin and operational state can be queried via the netlink +operation RTM_GETLINK. It is also possible to subscribe to RTMGRP_LINK +to be notified of updates. This is important for setting from userspace. + +These values contain interface state: + +ifinfomsg::if_flags & IFF_UP: + Interface is admin up +ifinfomsg::if_flags & IFF_RUNNING: + Interface is in RFC2863 operational state UP or UNKNOWN. This is for + backward compatibility, routing daemons, dhcp clients can use this + flag to determine whether they should use the interface. +ifinfomsg::if_flags & IFF_LOWER_UP: + Driver has signaled netif_carrier_on() +ifinfomsg::if_flags & IFF_DORMANT: + Driver has signaled netif_dormant_on() + +These interface flags can also be queried without netlink using the +SIOCGIFFLAGS ioctl. + +TLV IFLA_OPERSTATE + +contains RFC2863 state of the interface in numeric representation: + +IF_OPER_UNKNOWN (0): + Interface is in unknown state, neither driver nor userspace has set + operational state. Interface must be considered for user data as + setting operational state has not been implemented in every driver. +IF_OPER_NOTPRESENT (1): + Unused in current kernel (notpresent interfaces normally disappear), + just a numerical placeholder. +IF_OPER_DOWN (2): + Interface is unable to transfer data on L1, f.e. ethernet is not + plugged or interface is ADMIN down. +IF_OPER_LOWERLAYERDOWN (3): + Interfaces stacked on an interface that is IF_OPER_DOWN show this + state (f.e. VLAN). +IF_OPER_TESTING (4): + Unused in current kernel. +IF_OPER_DORMANT (5): + Interface is L1 up, but waiting for an external event, f.e. for a + protocol to establish. (802.1X) +IF_OPER_UP (6): + Interface is operational up and can be used. + +This TLV can also be queried via sysfs. + +TLV IFLA_LINKMODE + +contains link policy. This is needed for userspace interaction +described below. + +This TLV can also be queried via sysfs. + + +3. Kernel driver API + +Kernel drivers have access to two flags that map to IFF_LOWER_UP and +IFF_DORMANT. These flags can be set from everywhere, even from +interrupts. It is guaranteed that only the driver has write access, +however, if different layers of the driver manipulate the same flag, +the driver has to provide the synchronisation needed. + +__LINK_STATE_NOCARRIER, maps to !IFF_LOWER_UP: + +The driver uses netif_carrier_on() to clear and netif_carrier_off() to +set this flag. On netif_carrier_off(), the scheduler stops sending +packets. The name 'carrier' and the inversion are historical, think of +it as lower layer. + +netif_carrier_ok() can be used to query that bit. + +__LINK_STATE_DORMANT, maps to IFF_DORMANT: + +Set by the driver to express that the device cannot yet be used +because some driver controlled protocol establishment has to +complete. Corresponding functions are netif_dormant_on() to set the +flag, netif_dormant_off() to clear it and netif_dormant() to query. + +On device allocation, networking core sets the flags equivalent to +netif_carrier_ok() and !netif_dormant(). + + +Whenever the driver CHANGES one of these flags, a workqueue event is +scheduled to translate the flag combination to IFLA_OPERSTATE as +follows: + +!netif_carrier_ok(): + IF_OPER_LOWERLAYERDOWN if the interface is stacked, IF_OPER_DOWN + otherwise. Kernel can recognise stacked interfaces because their + ifindex != iflink. + +netif_carrier_ok() && netif_dormant(): + IF_OPER_DORMANT + +netif_carrier_ok() && !netif_dormant(): + IF_OPER_UP if userspace interaction is disabled. Otherwise + IF_OPER_DORMANT with the possibility for userspace to initiate the + IF_OPER_UP transition afterwards. + + +4. Setting from userspace + +Applications have to use the netlink interface to influence the +RFC2863 operational state of an interface. Setting IFLA_LINKMODE to 1 +via RTM_SETLINK instructs the kernel that an interface should go to +IF_OPER_DORMANT instead of IF_OPER_UP when the combination +netif_carrier_ok() && !netif_dormant() is set by the +driver. Afterwards, the userspace application can set IFLA_OPERSTATE +to IF_OPER_DORMANT or IF_OPER_UP as long as the driver does not set +netif_carrier_off() or netif_dormant_on(). Changes made by userspace +are multicasted on the netlink group RTMGRP_LINK. + +So basically a 802.1X supplicant interacts with the kernel like this: + +-subscribe to RTMGRP_LINK +-set IFLA_LINKMODE to 1 via RTM_SETLINK +-query RTM_GETLINK once to get initial state +-if initial flags are not (IFF_LOWER_UP && !IFF_DORMANT), wait until + netlink multicast signals this state +-do 802.1X, eventually abort if flags go down again +-send RTM_SETLINK to set operstate to IF_OPER_UP if authentication + succeeds, IF_OPER_DORMANT otherwise +-see how operstate and IFF_RUNNING is echoed via netlink multicast +-set interface back to IF_OPER_DORMANT if 802.1X reauthentication + fails +-restart if kernel changes IFF_LOWER_UP or IFF_DORMANT flag + +if supplicant goes down, bring back IFLA_LINKMODE to 0 and +IFLA_OPERSTATE to a sane value. + +A routing daemon or dhcp client just needs to care for IFF_RUNNING or +waiting for operstate to go IF_OPER_UP/IF_OPER_UNKNOWN before +considering the interface / querying a DHCP address. + + +For technical questions and/or comments please e-mail to Stefan Rompf +(stefan at loplof.de). diff -X dontdiff -uNrp linux-2.6.14/include/linux/if.h linux-2.6.14-rfc2863/include/linux/if.h --- linux-2.6.14/include/linux/if.h 2005-11-02 11:07:32.000000000 +0100 +++ linux-2.6.14-rfc2863/include/linux/if.h 2005-11-30 21:15:24.000000000 +0100 @@ -33,7 +33,7 @@ #define IFF_LOOPBACK 0x8 /* is a loopback net */ #define IFF_POINTOPOINT 0x10 /* interface is has p-p link */ #define IFF_NOTRAILERS 0x20 /* avoid use of trailers */ -#define IFF_RUNNING 0x40 /* interface running and carrier ok */ +#define IFF_RUNNING 0x40 /* interface RFC2863 OPER_UP */ #define IFF_NOARP 0x80 /* no ARP protocol */ #define IFF_PROMISC 0x100 /* receive all packets */ #define IFF_ALLMULTI 0x200 /* receive all multicast packets*/ @@ -43,12 +43,16 @@ #define IFF_MULTICAST 0x1000 /* Supports multicast */ -#define IFF_VOLATILE (IFF_LOOPBACK|IFF_POINTOPOINT|IFF_BROADCAST|IFF_MASTER|IFF_SLAVE|IFF_RUNNING) - #define IFF_PORTSEL 0x2000 /* can set media type */ #define IFF_AUTOMEDIA 0x4000 /* auto media select active */ #define IFF_DYNAMIC 0x8000 /* dialup device with changing addresses*/ +#define IFF_LOWER_UP 0x10000 /* driver signals L1 up */ +#define IFF_DORMANT 0x20000 /* driver signals dormant */ + +#define IFF_VOLATILE (IFF_LOOPBACK|IFF_POINTOPOINT|IFF_BROADCAST|\ + IFF_MASTER|IFF_SLAVE|IFF_RUNNING|IFF_LOWER_UP|IFF_DORMANT) + /* Private (from user) interface flags (netdevice->priv_flags). */ #define IFF_802_1Q_VLAN 0x1 /* 802.1Q VLAN device. */ #define IFF_EBRIDGE 0x2 /* Ethernet bridging device. */ @@ -80,6 +84,22 @@ #define IF_PROTO_FR_ETH_PVC 0x200B #define IF_PROTO_RAW 0x200C /* RAW Socket */ +/* RFC 2863 operational status */ +enum { + IF_OPER_UNKNOWN, + IF_OPER_NOTPRESENT, + IF_OPER_DOWN, + IF_OPER_LOWERLAYERDOWN, + IF_OPER_TESTING, + IF_OPER_DORMANT, + IF_OPER_UP, +}; + +/* link modes */ +enum { + IF_LINK_MODE_DEFAULT, + IF_LINK_MODE_DORMANT, /* limit upward transition to dormant */ +}; /* * Device mapping structure. I'd just gone off and designed a diff -X dontdiff -uNrp linux-2.6.14/include/linux/netdevice.h linux-2.6.14-rfc2863/include/linux/netdevice.h --- linux-2.6.14/include/linux/netdevice.h 2005-11-02 11:08:10.000000000 +0100 +++ linux-2.6.14-rfc2863/include/linux/netdevice.h 2005-11-30 21:01:43.000000000 +0100 @@ -230,7 +230,8 @@ enum netdev_state_t __LINK_STATE_SCHED, __LINK_STATE_NOCARRIER, __LINK_STATE_RX_SCHED, - __LINK_STATE_LINKWATCH_PENDING + __LINK_STATE_LINKWATCH_PENDING, + __LINK_STATE_DORMANT, }; @@ -334,11 +335,14 @@ struct net_device */ - unsigned short flags; /* interface flags (a la BSD) */ + unsigned int flags; /* interface flags (a la BSD) */ unsigned short gflags; unsigned short priv_flags; /* Like 'flags' but invisible to userspace. */ unsigned short padded; /* How much padding added by alloc_netdev() */ + unsigned char operstate; /* RFC2863 operstate */ + unsigned char link_mode; /* mapping policy to operstate */ + unsigned mtu; /* interface MTU value */ unsigned short type; /* interface hardware type */ unsigned short hard_header_len; /* hardware hdr length */ @@ -712,6 +716,10 @@ static inline void dev_put(struct net_de /* Carrier loss detection, dial on demand. The functions netif_carrier_on * and _off may be called from IRQ context, but it is caller * who is responsible for serialization of these calls. + * + * The name carrier is inappropriate, these functions should really be + * called netif_lowerlayer_*() because they represent the state of any + * kind of lower layer not just hardware media. */ extern void linkwatch_fire_event(struct net_device *dev); @@ -727,6 +735,29 @@ extern void netif_carrier_on(struct net_ extern void netif_carrier_off(struct net_device *dev); +static inline void netif_dormant_on(struct net_device *dev) +{ + if (!test_and_set_bit(__LINK_STATE_DORMANT, &dev->state)) + linkwatch_fire_event(dev); +} + +static inline void netif_dormant_off(struct net_device *dev) +{ + if (test_and_clear_bit(__LINK_STATE_DORMANT, &dev->state)) + linkwatch_fire_event(dev); +} + +static inline int netif_dormant(const struct net_device *dev) +{ + return test_bit(__LINK_STATE_DORMANT, &dev->state); +} + + +static inline int netif_oper_up(const struct net_device *dev) { + return (dev->operstate == IF_OPER_UP || + dev->operstate == IF_OPER_UNKNOWN /* backward compat */); +} + /* Hot-plugging. */ static inline int netif_device_present(struct net_device *dev) { diff -X dontdiff -uNrp linux-2.6.14/include/linux/rtnetlink.h linux-2.6.14-rfc2863/include/linux/rtnetlink.h --- linux-2.6.14/include/linux/rtnetlink.h 2005-11-02 11:08:11.000000000 +0100 +++ linux-2.6.14-rfc2863/include/linux/rtnetlink.h 2005-11-18 20:14:05.000000000 +0100 @@ -733,6 +733,8 @@ enum #define IFLA_MAP IFLA_MAP IFLA_WEIGHT, #define IFLA_WEIGHT IFLA_WEIGHT + IFLA_OPERSTATE, + IFLA_LINKMODE, __IFLA_MAX }; diff -X dontdiff -uNrp linux-2.6.14/net/core/dev.c linux-2.6.14-rfc2863/net/core/dev.c --- linux-2.6.14/net/core/dev.c 2005-11-06 17:35:22.000000000 +0100 +++ linux-2.6.14-rfc2863/net/core/dev.c 2005-11-30 21:15:50.000000000 +0100 @@ -2141,12 +2141,20 @@ unsigned dev_get_flags(const struct net_ flags = (dev->flags & ~(IFF_PROMISC | IFF_ALLMULTI | - IFF_RUNNING)) | + IFF_RUNNING | + IFF_LOWER_UP | + IFF_DORMANT)) | (dev->gflags & (IFF_PROMISC | IFF_ALLMULTI)); - if (netif_running(dev) && netif_carrier_ok(dev)) - flags |= IFF_RUNNING; + if (netif_running(dev)) { + if (netif_oper_up(dev)) + flags |= IFF_RUNNING; + if (netif_carrier_ok(dev)) + flags |= IFF_LOWER_UP; + if (netif_dormant(dev)) + flags |= IFF_DORMANT; + } return flags; } diff -X dontdiff -uNrp linux-2.6.14/net/core/link_watch.c linux-2.6.14-rfc2863/net/core/link_watch.c --- linux-2.6.14/net/core/link_watch.c 2005-06-17 21:48:29.000000000 +0200 +++ linux-2.6.14-rfc2863/net/core/link_watch.c 2005-11-30 21:13:53.000000000 +0100 @@ -49,6 +49,34 @@ struct lw_event { /* Avoid kmalloc() for most systems */ static struct lw_event singleevent; +static inline unsigned char default_operstate(const struct net_device *dev) { + if (!netif_carrier_ok(dev)) + return dev->ifindex!=dev->iflink?IF_OPER_LOWERLAYERDOWN:IF_OPER_DOWN; + if (netif_dormant(dev)) return IF_OPER_DORMANT; + return IF_OPER_UP; +} + + +static void rfc2863_policy(struct net_device *dev) { + unsigned char operstate = default_operstate(dev); + + if (operstate == dev->operstate) return; + + switch(dev->link_mode) { + case IF_LINK_MODE_DORMANT: + if (operstate == IF_OPER_UP) operstate = IF_OPER_DORMANT; + break; + case IF_LINK_MODE_DEFAULT: + default: + break; + } + + write_lock_bh(&dev_base_lock); + dev->operstate = operstate; + write_unlock_bh(&dev_base_lock); +} + + /* Must be called with the rtnl semaphore held */ void linkwatch_run_queue(void) { @@ -74,6 +102,7 @@ void linkwatch_run_queue(void) */ clear_bit(__LINK_STATE_LINKWATCH_PENDING, &dev->state); + rfc2863_policy(dev); if (dev->flags & IFF_UP) { if (netif_carrier_ok(dev)) { WARN_ON(dev->qdisc_sleeping == &noop_qdisc); diff -X dontdiff -uNrp linux-2.6.14/net/core/net-sysfs.c linux-2.6.14-rfc2863/net/core/net-sysfs.c --- linux-2.6.14/net/core/net-sysfs.c 2005-06-17 21:48:29.000000000 +0200 +++ linux-2.6.14-rfc2863/net/core/net-sysfs.c 2005-12-07 23:29:46.000000000 +0100 @@ -94,6 +94,7 @@ NETDEVICE_ATTR(iflink, fmt_dec); NETDEVICE_ATTR(ifindex, fmt_dec); NETDEVICE_ATTR(features, fmt_long_hex); NETDEVICE_ATTR(type, fmt_dec); +NETDEVICE_ATTR(link_mode, fmt_dec); /* use same locking rules as GIFHWADDR ioctl's */ static ssize_t format_addr(char *buf, const unsigned char *addr, int len) @@ -136,9 +137,44 @@ static ssize_t show_carrier(struct class return -EINVAL; } +static ssize_t show_dormant(struct class_device *dev, char *buf) +{ + struct net_device *netdev = to_net_dev(dev); + if (netif_running(netdev)) { + return sprintf(buf, fmt_dec, !!netif_dormant(netdev)); + } + return -EINVAL; +} + +static const char *operstates[] = { + "unknown", + "notpresent", /* currently unused */ + "down", + "lowerlayerdown", + "testing", /* currently unused */ + "dormant", + "up" +}; + +static ssize_t show_operstate(struct class_device *dev, char *buf) +{ + const struct net_device *netdev = to_net_dev(dev); + unsigned char operstate; + + read_lock(&dev_base_lock); + operstate = netdev->operstate; + if (!netif_running(netdev)) operstate = IF_OPER_DOWN; + read_unlock(&dev_base_lock); + + if (operstate >= sizeof(operstates)) return -EINVAL; /* should not happen */ + return sprintf(buf, "%s\n", operstates[operstate]); +} + static CLASS_DEVICE_ATTR(address, S_IRUGO, show_address, NULL); static CLASS_DEVICE_ATTR(broadcast, S_IRUGO, show_broadcast, NULL); static CLASS_DEVICE_ATTR(carrier, S_IRUGO, show_carrier, NULL); +static CLASS_DEVICE_ATTR(dormant, S_IRUGO, show_dormant, NULL); +static CLASS_DEVICE_ATTR(operstate, S_IRUGO, show_operstate, NULL); /* read-write attributes */ NETDEVICE_SHOW(mtu, fmt_dec); @@ -212,9 +248,12 @@ static struct class_device_attribute *ne &class_device_attr_flags, &class_device_attr_weight, &class_device_attr_type, + &class_device_attr_link_mode, &class_device_attr_address, &class_device_attr_broadcast, &class_device_attr_carrier, + &class_device_attr_dormant, + &class_device_attr_operstate, NULL }; diff -X dontdiff -uNrp linux-2.6.14/net/core/rtnetlink.c linux-2.6.14-rfc2863/net/core/rtnetlink.c --- linux-2.6.14/net/core/rtnetlink.c 2005-11-02 11:08:12.000000000 +0100 +++ linux-2.6.14-rfc2863/net/core/rtnetlink.c 2005-12-07 23:32:36.000000000 +0100 @@ -178,6 +178,31 @@ rtattr_failure: } +static void set_operstate(struct net_device *dev, unsigned char transition) { + unsigned char operstate = dev->operstate; + + switch(transition) { + case IF_OPER_UP: + if ((operstate == IF_OPER_DORMANT || + operstate == IF_OPER_UNKNOWN) && + !netif_dormant(dev)) + operstate = IF_OPER_UP; + break; + case IF_OPER_DORMANT: + if (operstate == IF_OPER_UP || + operstate == IF_OPER_UNKNOWN) + operstate = IF_OPER_DORMANT; + break; + } + + if (dev->operstate != operstate) { + write_lock_bh(&dev_base_lock); + dev->operstate = operstate; + write_unlock_bh(&dev_base_lock); + netdev_state_change(dev); + } +} + static int rtnetlink_fill_ifinfo(struct sk_buff *skb, struct net_device *dev, int type, u32 pid, u32 seq, u32 change, unsigned int flags) @@ -208,6 +233,13 @@ static int rtnetlink_fill_ifinfo(struct } if (1) { + u8 operstate = netif_running(dev)?dev->operstate:IF_OPER_DOWN; + u8 link_mode = dev->link_mode; + RTA_PUT(skb, IFLA_OPERSTATE, sizeof(operstate), &operstate); + RTA_PUT(skb, IFLA_LINKMODE, sizeof(link_mode), &link_mode); + } + + if (1) { struct rtnl_link_ifmap map = { .mem_start = dev->mem_start, .mem_end = dev->mem_end, @@ -398,6 +430,22 @@ static int do_setlink(struct sk_buff *sk dev->weight = *((u32 *) RTA_DATA(ida[IFLA_WEIGHT - 1])); } + if (ida[IFLA_OPERSTATE - 1]) { + if (ida[IFLA_OPERSTATE - 1]->rta_len != RTA_LENGTH(sizeof(u8))) + goto out; + + set_operstate(dev, *((u8 *) RTA_DATA(ida[IFLA_OPERSTATE - 1]))); + } + + if (ida[IFLA_LINKMODE - 1]) { + if (ida[IFLA_LINKMODE - 1]->rta_len != RTA_LENGTH(sizeof(u8))) + goto out; + + write_lock_bh(&dev_base_lock); + dev->link_mode = *((u8 *) RTA_DATA(ida[IFLA_LINKMODE - 1])); + write_unlock_bh(&dev_base_lock); + } + if (ifm->ifi_index >= 0 && ida[IFLA_IFNAME - 1]) { char ifname[IFNAMSIZ];