When the routing cache was removed in 3.6, the IPv4 multipath algorithm changed from being more or less destination-based to quasi-random per-packet scheduling. This increases the risk of out-of-order packets and makes it impossible to use multipath together with anycast services.

This patch series replaces the old implementation with flow-based load balancing based on a hash over the source and destination addresses.

Distribution of the hash is done with thresholds as described in RFC 2992. This reduces the disruption when a path is added or removed and there are more than two paths.

To further improve the chance of successful usage in conjunction with anycast, ICMP error packets are hashed over the inner IP addresses. This ensures that PMTU discovery will work together with anycast or load balancers such as IPVS.

Port numbers are not considered, since fragments could cause problems with anycast and IPVS. Relying on the DF flag for TCP packets is also insufficient, since ICMP inspection effectively extracts information from the opposite flow, which might have a different state of the DF flag. This is also why the RSS hash is not used: RSS hashes are typically based on the NDIS RSS spec, which mandates TCP support.

Benchmarking on a Xeon X3550 (4 cores, 2.66GHz) showed that it was desirable to move the ICMP handling to a separate function. The reason is that the standard hash function can work without using the stack, whereas the ICMP function cannot (due to skb_header_pointer), causing 4 additional hits on the cache. By separating the two, the fast path (non-ICMP) only requires three reads from cache.

Two-path benchmarks (ip_mkroute_input excl.
__mkroute_input):
Original per-packet:   ~394 cycles/packet
L3 hash w/o noinline:  ~128 cycles/packet
L3 hash w/ noinline:    ~97 cycles/packet

Changes in v3:
- Multipath algorithm is no longer configurable (always L3)
- Added random seed to hash
- Moved ICMP inspection to isolated function
- Ignore source quench packets (deprecated as per RFC 6633)

Changes in v2:
- Replaced 8-bit xor hash with 31-bit jenkins hash
- Don't scale weights (since 31-bit)
- Avoided unnecessary renaming of variables
- Rely on DF-bit instead of fragment offset when checking for fragmentation
- upper_bound is now inclusive to avoid overflow
- Use a callback to postpone extracting flow information until necessary
- Skipped ICMP inspection entirely with L4 hashing
- Handle newly added sysctl ignore_routes_with_linkdown

Best Regards
Peter Nørlund

Peter Nørlund (2):
  ipv4: L3 hash-based multipath
  ipv4: ICMP packet inspection for multipath

 include/net/ip_fib.h     |  11 +++-
 include/net/route.h      |  12 +++-
 net/ipv4/fib_semantics.c | 137 +++++++++++++++++++++++--------------------
 net/ipv4/icmp.c          |  16 +++++
 net/ipv4/route.c         |  73 +++++++++++++++++++++--
 5 files changed, 177 insertions(+), 72 deletions(-)