From: David Ahern <dsah...@gmail.com>

As mentioned at netconf in Seoul, we would like to introduce nexthops as
independent objects from the routes to better align with both routing
daemons and hardware and to improve route insertion times into the kernel.

This series adds nexthop objects with their own lifecycle. The model
retains a lot of the established semantics from routes and re-uses some
of the data structures like fib_nh and fib6_nh to more easily align with
the existing code.

One difference with nexthop objects is that the behavior better aligns
with the target user - routing daemons and switch ASICs. Specifically,
with the exception of the blackhole nexthop, all nexthops must reference
a netdevice (or have a gateway that resolves to a device) and the device
must be admin up with carrier.

Prefixes are then installed pointing to the nexthop by id:

    { prefix } --> { nexthop } --> { gateway, device }

The nexthop object contains the gateway and device reference.

Benchmarks

The following data shows the route insert time for 720,022 routes (a
full IPv4 internet feed from August 28th). "current" means the current
code where a route insert specifies the device and gateway inline with
the prefix; the "nexthop" columns mean use of the nexthop objects.

          1-hop       1-hop     |  2-hops      2-hops
          current     nexthop   |  current     nexthop
    ----------------------------|-------------------------
    real  0m21.872s   0m12.982s |  0m28.723s   0m12.406s
    user  0m2.929s    0m1.816s  |  0m3.966s    0m1.935s
    sys   0m13.469s   0m6.010s  |  0m18.992s   0m5.913s

With nexthop objects the time to insert the routes is reduced by more
than 30% with the kernel time cut in half. The current model has a route
insertion rate of about 32,000 prefixes/second and with nexthop objects
that increases to a little over 55,000 prefixes/second. For routes with
multiple nexthops the install time is cut by more than half with system
time reduced by a factor of 3.
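
For anyone wanting to re-derive the rates and ratios quoted above from
the table, a minimal Python sketch follows. The route count and the
times are copied from the benchmark data; the names ROUTES, TIMES,
insert_rate and reduction are made up for this illustration and are not
part of the series or of any tool.

```python
# Re-derive the quoted insert rates and savings from the benchmark table.
# All helper names here are hypothetical; only the numbers come from the text.

ROUTES = 720_022  # prefixes in the IPv4 feed used for the benchmark

# "real" (wall-clock) and "sys" (kernel) times in seconds, from the table.
TIMES = {
    ("1-hop", "current"): {"real": 21.872, "sys": 13.469},
    ("1-hop", "nexthop"): {"real": 12.982, "sys": 6.010},
    ("2-hops", "current"): {"real": 28.723, "sys": 18.992},
    ("2-hops", "nexthop"): {"real": 12.406, "sys": 5.913},
}


def insert_rate(paths, model):
    """Prefixes inserted per second of wall-clock time."""
    return ROUTES / TIMES[(paths, model)]["real"]


def reduction(paths, kind):
    """Fraction of 'real' or 'sys' time saved by using nexthop objects."""
    current = TIMES[(paths, "current")][kind]
    nexthop = TIMES[(paths, "nexthop")][kind]
    return 1.0 - nexthop / current


if __name__ == "__main__":
    for paths in ("1-hop", "2-hops"):
        for model in ("current", "nexthop"):
            rate = insert_rate(paths, model)
            print(f"{paths:6} {model:8} {rate:9.0f} prefixes/sec")
        print(f"{paths:6} real saved: {reduction(paths, 'real'):.1%}, "
              f"sys saved: {reduction(paths, 'sys'):.1%}")
```

Run as a script, this prints one rate per column of the table plus the
fraction of real and sys time saved for each path count. The 1-hop rates
come out at roughly 32,900 and 55,500 prefixes/second, in line with the
figures given in the text.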

Further, with nexthop objects insert times for multipath routes drop to
the same as single path routes since the multipath spec is given once
(i.e., with the current model, the time to insert routes increases with
the number of paths in the route, whereas with nexthop objects the
number of paths is handled once and the prefixes referencing it are
installed in constant time).

The difference between real and system times shows there is room for
improvement with the trie implementation. As an example, increasing
sync_pages from 128 to 1024 delays the call to synchronize_rcu,
increasing the insert rate to more than 78,000 prefixes/sec!

Some key features:

1. Allows atomic replace of any nexthop object - a nexthop or a group.
   This allows existing route entries to have their nexthop updated
   without the overhead of removing and re-inserting (or replacing)
   them. Instead, one update of the nexthop object implicitly updates
   all routes referencing it.

   One limitation with the atomic replace is that a nexthop group can
   only be replaced with a new group spec and similarly a nexthop can
   only be replaced by a nexthop spec. Specifically, a nexthop id cannot
   move between a single nexthop and a group nexthop.

2. Blackhole nexthop: a nexthop object can be designated a blackhole,
   which means that packets for any lookup that resolves to it are
   dropped as if the lookup failed with the result RTN_BLACKHOLE.
   Blackhole nexthops cannot be used with nexthop groups.

   Combined with atomic replace this allows routes to be installed
   pointing to a blackhole nexthop and then switched to an actual
   gateway with a single nexthop replace command (or vice versa, a
   gateway nexthop is flipped to a blackhole).

3. Nexthop groups for multipath routes. A nexthop group is a nexthop
   that references other nexthops. A multipath group cannot be used as
   a nexthop in another nexthop group (i.e., groups cannot be nested).

4. Multipath routes for IPv6 with device only nexthops. There is a
   demonstrated need for this feature and the existing route semantics
   do not allow it. This series provides a means for that end - create
   a nexthop that has a device only specification.

5. Admin and carrier up are required. If the device goes down (admin or
   carrier) the nexthop is removed, in which case routes referencing
   the nexthop are evicted and any nexthop groups referencing it are
   adjusted.

6. Follow on patches will allow IPv6 nexthops with IPv4 routes for users
   wanting support of RFC 5549.

7. Future extensions: active / backup nexthop. The nexthop groups are
   structured to allow a new group type to be added. One example is a
   group where a nexthop has a preferred device and gateway, but should
   the device go down or the gateway not resolve, the backup nexthop is
   used.

Additional Benefits

- smaller route notifications - messages contain a single nexthop id
  versus the detailed nexthop specification. This is especially
  noticeable as the number of paths increases. Smaller messages have a
  reduced load on userspace as well.

- smaller memory footprint for IPv6 routes.

Examples

1. Single path
   $ ip nexthop add id 1 via 10.99.1.2 dev veth1
   $ ip route add 10.1.1.0/24 nhid 1

   $ ip next ls
   id 1 via 10.99.1.2 src 10.99.1.1 dev veth1 scope link

   $ ip ro ls
   10.1.1.0/24 nhid 1 scope link
   ...

2. ECMP
   $ ip nexthop add id 2 via 10.99.3.2 dev veth3
   $ ip nexthop add id 1001 group 1/2
     --> creates a nexthop group with 2 component nexthops:
         id 1 and id 2, both with the same weight

   $ ip route add 10.1.2.0/24 nhid 1001

   $ ip next ls
   id 1 via 10.99.1.2 src 10.99.1.1 dev veth1 scope link
   id 2 via 10.99.3.2 src 10.99.3.1 dev veth3 scope link
   id 1001 group 1/2

   $ ip ro ls
   10.1.1.0/24 nhid 1 scope link
   10.1.2.0/24 nhid 1001 scope link
   ...

3. Weighted multipath
   $ ip nexthop add id 1002 group 1,10/2,20
     --> creates a nexthop group with 2 component nexthops:
         id 1 with a weight of 10 and id 2 with a weight of 20

   $ ip route add 10.1.3.0/24 nhid 1002

   $ ip next ls
   id 1 via 10.99.1.2 src 10.99.1.1 dev veth1 scope link
   id 2 via 10.99.3.2 src 10.99.3.1 dev veth3 scope link
   id 1001 group 1/2
   id 1002 group 1,10/2,20

   $ ip ro ls
   10.1.1.0/24 nhid 1 scope link
   10.1.2.0/24 nhid 1001 scope link
   10.1.3.0/24 nhid 1002 scope link
   ...

Open Items

There is a long to-do list before this is ready (e.g., IPv6 multipath,
lwt encap, and updating mlxsw). The point of this RFC is to get comments
on the API and overall idea. Specifically, any interested parties should
think about the API, the objects, the workflow, how it fits and the
possibility for future extensions.

David Ahern (18):
  net: Rename net/nexthop.h net/rtnh.h
  net: ipv4: export fib_good_nh and fib_flush
  net/ipv4: export fib_info_update_nh_saddr
  net/ipv4: export fib_check_nh
  net/ipv4: Define fib_get_nhs when CONFIG_IP_ROUTE_MULTIPATH is disabled
  net/ipv4: Create init and release helpers for fib_nh
  net: ipv4: Add fib_nh to fib_result
  net/ipv4: Move device validation to helper
  net/ipv6: Create init and release helpers for fib6_nh
  net/ipv6: Make fib6_nh optional at the end of fib6_info
  net: Initial nexthop code
  net/ipv4: Add nexthop helpers for ipv4 integration
  net/ipv4: Convert existing use of fib_info to new helpers
  net/ipv4: Allow routes to use nexthop objects
  net/ipv6: Use helpers to access fib6_nh data
  net/ipv6: Allow routes to use nexthop objects
  net: Add support for nexthop groups
  net/ipv4: Optimization for fib_info lookup

 .../net/ethernet/mellanox/mlxsw/spectrum_router.c |    4 +-
 drivers/net/ethernet/rocker/rocker_ofdpa.c        |   20 +-
 include/net/addrconf.h                            |    5 +
 include/net/ip6_fib.h                             |   22 +-
 include/net/ip6_route.h                           |   12 +-
 include/net/ip_fib.h                              |   39 +-
 include/net/net_namespace.h                       |    2 +
 include/net/netns/nexthop.h                       |   18 +
 include/net/nexthop.h                             |  253 +++-
 include/net/rtnh.h                                |   34 +
 include/trace/events/fib6.h                       |   15 +-
 include/uapi/linux/nexthop.h                      |   56 +
 include/uapi/linux/rtnetlink.h                    |    8 +
 net/core/filter.c                                 |   13 +-
 net/core/lwtunnel.c                               |    2 +-
 net/decnet/dn_fib.c                               |    2 +-
 net/ipv4/Makefile                                 |    2 +-
 net/ipv4/fib_frontend.c                           |   60 +-
 net/ipv4/fib_rules.c                              |    3 +-
 net/ipv4/fib_semantics.c                          |  433 ++++--
 net/ipv4/fib_trie.c                               |   54 +-
 net/ipv4/ipmr.c                                   |    2 +-
 net/ipv4/nexthop.c                                | 1541 ++++++++++++++++++++
 net/ipv4/route.c                                  |   34 +-
 net/ipv6/addrconf.c                               |    5 +-
 net/ipv6/addrconf_core.c                          |    9 +
 net/ipv6/af_inet6.c                               |    1 +
 net/ipv6/ip6_fib.c                                |   27 +-
 net/ipv6/ndisc.c                                  |   15 +-
 net/ipv6/route.c                                  |  474 +++---
 net/mpls/af_mpls.c                                |    2 +-
 security/selinux/nlmsgtab.c                       |    5 +-
 32 files changed, 2690 insertions(+), 482 deletions(-)
 create mode 100644 include/net/netns/nexthop.h
 create mode 100644 include/net/rtnh.h
 create mode 100644 include/uapi/linux/nexthop.h
 create mode 100644 net/ipv4/nexthop.c

--
2.11.0