On Thu, Oct 10, 2019 at 11:31:04AM +0300, Ido Schimmel wrote:
> On Wed, Oct 09, 2019 at 11:00:07AM -0500, Jesse Hathaway wrote:
> > We have been experiencing a route lookup race condition on our
> > internet-facing Linux routers. I have been able to reproduce the
> > issue, but would love more help in isolating the cause.
> >
> > Looking up a route found in the main table returns `*` rather than
> > the directly connected interface about once for every 10-20 million
> > requests. From my reading of the iproute2 source code, an asterisk
> > indicates that the kernel returned an interface index of 0 rather
> > than the correct directly connected interface.
> >
> > This is reproducible with the following bash snippet on 5.4-rc2:
> >
> > $ cat route-race
> > #!/bin/bash
> >
> > # Generate 50 million individual route gets to feed as batch input to `ip`
> > function ip-cmds() {
> >         route_get='route get 192.168.11.142 from 192.168.180.10 iif vlan180'
> >         for ((i = 0; i < 50000000; i++)); do
> >                 printf '%s\n' "${route_get}"
> >         done
> > }
> >
> > ip-cmds | ip -d -o -batch - | grep -E 'dev \*' | uniq -c
> >
> > Example output:
> >
> > $ ./route-race
> >       6 unicast 192.168.11.142 from 192.168.180.10 dev * table main \
> >           cache iif vlan180
> >
> > These routers have multiple routing tables and are ingesting full
> > BGP routing tables from multiple ISPs:
> >
> > $ ip route show table all | wc -l
> > 3105543
> >
> > $ ip route show table main | wc -l
> > 54
> >
> > Please let me know what other information I can provide, thanks in advance,
>
> I think it's working as expected. Here is my theory:
>
> If CPU0 is executing both the route get request and forwarding packets
> through the directly connected interface, then the following can happen:
>
> <CPU0, t0> - In process context, per-CPU dst entry cached in the nexthop
Sorry, only the output path is per-CPU. See commit d26b3a7c4b3b ("ipv4:
percpu nh_rth_output cache"). I indeed see the issue regardless of the
CPU on which I run the route get request.

> is found. Not yet dumped to user space.
>
> <Any CPU, t1> - Routes are added / removed, thereby invalidating the
> cache by bumping 'net->ipv4.rt_genid'.
>
> <CPU0, t2> - In softirq, a packet is forwarded through the nexthop. The
> cached dst entry is found to be invalid, so it is replaced by a newer
> dst entry. dst_dev_put() is called on the old entry, which assigns the
> blackhole netdev to 'dst->dev'. This netdev has an ifindex of 0 because
> it is not registered.
>
> <CPU0, t3> - After the softirq finishes executing, your route get
> request from t0 is resumed and the old dst entry is dumped to user
> space with an ifindex of 0.
>
> I tested this on my system using your script to generate the route get
> requests, pinned to the same CPU that forwards packets through the
> nexthop. To constantly invalidate the cache, I created another script
> that simply adds and removes IP addresses on an interface.
>
> If I stop the packet forwarding or the script that invalidates the
> cache, then I don't see any '*' answers to my route get requests.
>
> BTW, the blackhole netdev was added in 5.3. I assume (didn't test) that
> with older kernel versions you'll see 'lo' instead of '*'.
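
For anyone who wants to reproduce the race, a minimal sketch of the test
setup described above might look like this. The interface name, test
address, and CPU number are placeholders (the actual invalidation script
was not posted), and ip-cmds is reused from the original report:

$ cat route-race-repro
#!/bin/bash

# Route get generator from the original report.
function ip-cmds() {
        route_get='route get 192.168.11.142 from 192.168.180.10 iif vlan180'
        for ((i = 0; i < 50000000; i++)); do
                printf '%s\n' "${route_get}"
        done
}

# Constantly invalidate cached dst entries: each address add/del bumps
# 'net->ipv4.rt_genid', marking every cached entry obsolete.
while true; do
        ip address add 192.0.2.1/24 dev vlan180
        ip address del 192.0.2.1/24 dev vlan180
done &
invalidator=$!

# Run the route gets pinned to the CPU that forwards packets through
# the nexthop. Per the correction above, the input path cache is not
# per-CPU, so the pinning is optional; the race reproduces on any CPU.
ip-cmds | taskset -c 0 ip -d -o -batch - | grep -E 'dev \*' | uniq -c

kill "${invalidator}"

Each add/del pass bumps the IPv4 route generation counter, so any dst
entry cached before the bump fails validation on its next use. That is
what opens the window between t0 and t3 in which the route get can dump
an entry whose device has already been replaced by the blackhole netdev.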