Hello,

I’m running into a rather complex issue where BIRD tries to reinsert routes into the kernel routing table that are already in there. I’m at the point of suspecting a netlink bug in the 5.x Linux kernel. However, I may be entirely off, and would like to hear what people with more BIRD and/or netlink expertise think about it.

My setup is bird 2.0.7 on Ubuntu 20.04. I’ve been able to reproduce this with the default 5.4.0-42-generic kernel as well as the ubuntu mainline 5.7.0-050700-generic kernel. This bird instance has an IBGP peering with another bird 2.0.7 instance, which sends an IPv4 full table. The IBGP peering is set up over a GRE tunnel. There is no other routing software running. I am seeing this issue on another server too, ubuntu 20.04 with 5.4.0-42-generic. A third one with ubuntu 18.04 and a 4.15.0-20-generic kernel, is fine.

The problem presents as follows: a few thousand routes fail insertion into the kernel routing table, marked by bird as ‘!’:
-------------------------
# birdc show route table master4|grep \!|wc -l
5838
-------------------------
The number varies, but is in the order of a few thousand. However, all routes seem to be in the kernel table:
-------------------------
# birdc show route count
BIRD 2.0.7 ready.
788748 of 788748 routes for 788747 networks in table master4
0 of 0 routes for 0 networks in table master6
Total: 788748 of 788748 routes for 788747 networks in 2 tables
# ip route|grep bird|wc -l
788748
-------------------------

When I wait a few minutes and check again, the set of failed prefixes is almost entirely different. Apparently, the previously failed prefixes have been “fixed”, but now the same problem appears with different ones.

When I check a particular sample, the route really is in the kernel routing table already:
-------------------------
# birdc show route 81.200.176.0/20
BIRD 2.0.7 ready.
Table master4:
81.200.176.0/20      unicast [fra1 14:47:20.892] ! (100) [AS50664i]
	via 10.195.6.1 on tun-fra1
# ip route|grep 81.200.176.0/20
81.200.176.0/20 via 10.195.6.1 dev tun-fra1 proto bird metric 32
# birdc show route 81.200.176.0/20
BIRD 2.0.7 ready.
Table master4:
81.200.176.0/20      unicast [fra1 14:47:20.892] ! (100) [AS50664i]
	via 10.195.6.1 on tun-fra1
-------------------------

The logs are repeats of:
-------------------------
Aug 01 09:50:23 fra2 bird[83586]: Netlink: File exists
Aug 01 09:50:23 fra2 bird[83586]: Netlink: File exists
Aug 01 09:50:23 fra2 bird[83586]: Netlink: File exists
Aug 01 09:50:23 fra2 bird[83586]: Netlink: File exists
Aug 01 09:50:23 fra2 bird[83586]: Netlink: File exists
Aug 01 09:50:23 fra2 bird[83586]: ...
Aug 01 09:50:23 fra2 bird[83586]: I/O loop cycle took 5110 ms for 1 events
-------------------------

Seemingly, bird thinks routes are not yet in the kernel routing table, tries to insert them, which fails because they are already there. Later scans move the problem to different prefixes.

If I remove the tunneled IBGP peering, and set up an upstream peering on a directly connected non-tunneled peer, everything works fine. If I enable both the tunneled IBGP peering and the non-tunneled upstream, the issue does appear. The routes that are affected by the failed insertion attempts are then from both peers.

The prefixes that are affected are almost entirely cases where multiple prefixes exist in the DFZ with the same network address, but different lengths. Above I mentioned seeing 81.200.176.0/20 at some point, and 81.200.176.0/24 is also in the DFZ. From two samples of failed prefixes at different points in time, about 98% are prefixes of this kind, where the DFZ contains the same network address with a different length too. This distribution is not reflected in the whole v4 full table, suggesting that routes that meet this case, are most likely to fail insertion.

I have done some deeper digging with strace. https://p.6core.net/p/RtQ7xSSYd560i7FioG78ebsi is an strace of BIRD starting and loading the v4 full table from the IBGP peer. I have filtered this for “45.189.104.0” “GETROUTE” to trim down. I have the full strace, but it’s 8GB:

1. Initially, bird does GETROUTE a few times, before the BGP session is established.
2. Near the start, at 09:06:33.208648 and 09:06:33.209003, the BGP session seems established, and 45.189.104.0/24 and 45.189.104.0/22 are inserted correctly. (Acknowledgement not in paste as it does not contain the prefix.)
3. At 09:06:39.774454 bird sends to birdc that 45.189.104.0/24 is best route and inserted correctly.
4. At 09:06:49.915838, bird sends a GETROUTE. Only the reply lines that contain “45.189.104.0” are in the paste. Apparently, only 45.189.104.0/24 is included in the reply. 45.189.104.0/22 does not seem to appear in the reply to GETROUTE.
5. At 09:07:08.455767, bird tries to insert 45.189.104.0/22 into the kernel routing table. This fails, because it already exists.
6. At 09:07:43.144207 bird replies to birdc that 45.189.104.0/22 failed to insert.
7. At 09:07:49.874282, bird does another GETROUTE. Both the /22 and the /24 are included in the response.
8. At 09:08:25.623269, bird tries to update 45.189.104.0/22 which succeeds. Note the different flags from 09:07:08.455767, specifically NLM_F_EXCL vs NLM_F_REPLACE, presumably because bird is aware there is an existing route.
9. Future GETROUTEs return both the /24 and /22, bird does nothing.

Working theory: netlink GETROUTE in at least some 5.x kernels may not return all routes, when at least some routes in the table have a next hop that is a tunnel interface, and this is almost entirely contained to cases where multiple prefixes exist with the same network address.

Thoughts? And if that theory is correct, can we work around it in BIRD?

A few things I’ve changed that had no effect:
- Changing the tunnel type.
- Changing the peering from IBGP to EBGP.
- Using IPv6 addresses for the tunnel endpoints, and removing the static route in the config below.

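For anyone who wants to repeat the "same network address, different length" check mentioned above on their own samples, here is a minimal Python sketch of one way to compute that share. The function name sibling_share and the idea of feeding it two plain lists of prefixes (the failed ones and the full table) are assumptions made for illustration; this is not necessarily how the 98% figure above was produced.

```python
import ipaddress
from collections import defaultdict


def sibling_share(failed, full_table):
    """Fraction of `failed` prefixes whose network address also occurs in
    `full_table` with a different prefix length (hypothetical helper)."""
    # Map each network address to the set of prefix lengths seen for it.
    lengths_by_addr = defaultdict(set)
    for prefix in full_table:
        net = ipaddress.ip_network(prefix, strict=False)
        lengths_by_addr[net.network_address].add(net.prefixlen)

    failed = list(failed)
    if not failed:
        return 0.0

    hits = 0
    for prefix in failed:
        net = ipaddress.ip_network(prefix, strict=False)
        # Other lengths announced for the same network address, if any.
        others = lengths_by_addr.get(net.network_address, set()) - {net.prefixlen}
        if others:
            hits += 1
    return hits / len(failed)


if __name__ == "__main__":
    # Toy data only; 192.0.2.0/24 is a documentation prefix with no sibling.
    full = ["81.200.176.0/20", "81.200.176.0/24",
            "45.189.104.0/22", "45.189.104.0/24", "192.0.2.0/24"]
    failed = ["81.200.176.0/20", "45.189.104.0/22", "192.0.2.0/24"]
    print(f"{sibling_share(failed, full):.2%} of failed prefixes have a sibling")
```

Running the same function with the full table as both arguments gives the baseline share for the whole table, which is the number to compare the failed samples against. Extracting the prefix lists from birdc or ip route output is left out here.
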
This is the full config I’m currently using:
-------------------------
log syslog all;
router id 141.98.136.36;

protocol kernel {
        scan time 20;
        ipv4 {
                export all;
                import none;
        };
}

# never route the tunnel through the tunnel
protocol static tunnel_endpoints {
        ipv4;
        route 45.12.69.14/32 via 141.98.136.33;
}

protocol device {
}

# ibgp peering over GRE tunnel, ipv4 full table
protocol bgp fra1 {
        local as 213279;
        source address 10.195.6.2;
        neighbor 10.195.6.1 as 213279;
        direct;
        ipv4 {
                import all;
                export none;
        };
}
-------------------------

Sasha