I have a handful of FreeBSD 14.2 machines acting as BGP routers using bird2.
Three of them would drop BGP connections at least once a week; the others would
either stay stable or drop sessions sporadically. Through a painful root cause
exercise, I've narrowed the cause to the bird2 port switching from a routing
socket to netlink. On the routers that failed most often, I've switched from
bird2 to bird2-rtsock and the problem has disappeared.

I don't know the rtsock/netlink source well enough to debug deeper but I'm more 
than happy to go back to the netlink version in order to help debug if anyone 
is interested in tackling the problem.

I'll give some details on the router that failed most often to whet your
appetite.

The problem will start with this appearing in the logs:

Jan 29 03:42:08 pesto kernel: [zone: mbuf] kern.ipc.nmbufs limit reached

After that, messages like these appear every 30-40 seconds until I intervene:

Jan 29 03:51:35 pesto kernel: sonewconn: pcb 0xfffff8002c198540 ([::]:179 
(proto 6)): Listen queue overflow: 13 already in queue awaiting acceptance (1 
occurrences), euid 0, rgid 502, jail 0
Jan 29 03:52:38 pesto kernel: sonewconn: pcb 0xfffff8002c198540 ([::]:179 
(proto 6)): Listen queue overflow: 13 already in queue awaiting acceptance (17 
occurrences), euid 0, rgid 502, jail 0

After a while, the same overflow starts being logged for the IPv4 listener too:

Jan 29 04:49:28 pesto kernel: sonewconn: pcb 0xfffff8002c198000 (0.0.0.0:179 
(proto 6)): Listen queue overflow: 13 already in queue awaiting acceptance (1 
occurrences), euid 0, rgid 502, jail 0
Jan 29 04:49:48 pesto kernel: sonewconn: pcb 0xfffff8002c198540 ([::]:179 
(proto 6)): Listen queue overflow: 13 already in queue awaiting acceptance (36 
occurrences), euid 0, rgid 502, jail 0
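
To make the timeline above easier to compare across routers, here is a minimal Python sketch that summarizes the sonewconn overflow messages per listener. The function name and the regular expression are my own, written against the message layout shown in the excerpts above, and they expect each log entry on a single line (as in /var/log/messages, not as wrapped in this mail). Summing the counts assumes each message reports the overflows since the previous message, which I have not verified against the kernel source.

```python
import re

# Pattern written against the "Listen queue overflow" lines quoted above.
OVERFLOW_RE = re.compile(
    r"(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+sonewconn:\s+"
    r"pcb\s+(?P<pcb>0x[0-9a-f]+)\s+\((?P<listener>\S+)\s+\(proto\s+\d+\)\):\s+"
    r"Listen queue overflow:\s+(?P<queued>\d+)\s+already in queue awaiting "
    r"acceptance\s+\((?P<occurrences>\d+)\s+occurrences\)"
)


def summarize_overflows(lines):
    """Return {listener: {"first": stamp, "last": stamp, "occurrences": n}}.

    Lines that are not sonewconn overflow messages are ignored.
    """
    summary = {}
    for line in lines:
        match = OVERFLOW_RE.search(line)
        if match is None:
            continue
        entry = summary.setdefault(
            match["listener"],
            {"first": match["stamp"], "last": match["stamp"], "occurrences": 0},
        )
        entry["last"] = match["stamp"]
        # Assumption: each message counts overflows since the previous message.
        entry["occurrences"] += int(match["occurrences"])
    return summary
```

Calling summarize_overflows(open("/var/log/messages")) returns one entry per listening address, with the first and last timestamp seen and the summed occurrence count.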

When I intervene, connections to bird2 (via birdc) hang, as do attempts to
shut it down (/usr/local/etc/rc.d/bird stop).  If I try to look at the
routing table (netstat -rnf inet6), it will start to list and then the system
will reboot itself (likely a kernel crash, but nothing is logged between the
sonewconn logs and the boot).  After the reboot, things are good for a few days.

My naive guess is that the netlink code leaks mbufs and that there is a kernel
crash in this state triggered by listing the routing table with netstat.
Besides switching back to bird2 via netlink, if I have someone interested in
debugging, I can run a debug kernel to catch the stack trace.
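
One way to test the leak guess before the limit is hit would be to record mbuf usage over time and see whether it climbs steadily under the netlink build. The following is a rough Python sketch, not something I have run on the affected routers. It assumes the first line of `netstat -m` has the form "N/N/N mbufs in use (current/cache/total)"; the function names and the 60-second interval are arbitrary choices of mine.

```python
import re
import subprocess
import time

# Assumed layout of the first line of `netstat -m` on FreeBSD.
MBUF_RE = re.compile(
    r"^(\d+)/(\d+)/(\d+) mbufs in use \(current/cache/total\)", re.MULTILINE
)


def parse_mbufs_in_use(netstat_m_output):
    """Return (current, cache, total) parsed from `netstat -m` output."""
    match = MBUF_RE.search(netstat_m_output)
    if match is None:
        raise ValueError("no 'mbufs in use' line found")
    current, cache, total = (int(group) for group in match.groups())
    return current, cache, total


def sample_mbufs():
    """Run `netstat -m` once and return the parsed mbuf counters."""
    output = subprocess.run(
        ["netstat", "-m"], capture_output=True, text=True, check=True
    ).stdout
    return parse_mbufs_in_use(output)


def watch(interval_seconds=60):
    """Print a timestamped sample forever; redirect stdout to a file to keep it."""
    while True:
        current, cache, total = sample_mbufs()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} mbufs current={current} cache={cache} total={total}",
              flush=True)
        time.sleep(interval_seconds)
```

If "current" grows without bound toward kern.ipc.nmbufs while bird2 (netlink) is running and stays flat with bird2-rtsock, that would support the leak theory.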


System details:

FreeBSD pesto.gshapiro.net 14.2-RELEASE FreeBSD 14.2-RELEASE 
releng/14.2-n269506-c8918d6c7412 GENERIC amd64

bird2-rtsock-2.16.1 (stable; bird2-2.16.1, which uses netlink instead of
rtsock, causes the issue)

Since this is a router, it doesn't do anything except routing (bird2),
networking (tailscale, vxlan, wireguard) and OS daemons (syslog, ntpd for log
accuracy, cron, devd, getty for console).

Only sysctl tuning done:

net.fibs=2
net.add_addr_allfibs=1

As for system configuration, here is my rc.conf (happy to share redacted
values with those working on this issue):

accounting_enable="YES"
bird_enable="YES"
cloned_interfaces="lo1 vxlan0 vxlan1"
create_args_vxlan0="vxlanid _redacted_ vxlanlocal _redacted_ vxlanremote _redacted_ vxlanttl 255 tunnelfib 1"
create_args_vxlan1="vxlanid _redacted_ vxlanlocal _redacted_ vxlanremote _redacted_ vxlanttl 255 tunnelfib 1"
defaultrouter="_redacted_"
defaultrouter_fib1="${defaultrouter}"
devd_flags="-q"
firewall_enable="YES"
firewall_logging="YES"
firewall_quiet="YES"
firewall_type="/etc/ipfw.rules"
fsck_y_enable="YES"
gateway_enable="YES"
hostname="pesto.gshapiro.net"
ifconfig_lo1_descr="Announced"
ifconfig_lo1_ipv6="inet6 _redacted_ prefer_source"
ifconfig_lo1_alias0="inet6 _redacted_"
ifconfig_vtnet0="inet _redacted_"
ifconfig_vtnet0_descr="HYEHOST Uplink"
ifconfig_vtnet0_ipv6="inet6 _redacted_ no_prefer_iface"
ifconfig_vtnet1_descr="F4IX"
ifconfig_vtnet1_ipv6="inet6 _redacted_ no_prefer_iface"
ifconfig_vxlan0="inet _redacted_"
ifconfig_vxlan0_descr="BGPExchange"
ifconfig_vxlan0_ipv6="inet6 _redacted_ no_prefer_iface"
ifconfig_vxlan1_descr="HNIX BGP Tunnel"
ifconfig_vxlan1_ipv6="inet6 _redacted_ no_prefer_iface"
ipv6_defaultrouter="_redacted_"
ipv6_defaultrouter_fib1="${ipv6_defaultrouter}"
ipv6_gateway_enable="YES"
ipv6_route_bgp6="_redacted_ -iface vtnet0"
ipv6_route_gw6="_redacted_ -iface vtnet0 -fib 0,1"
ipv6_static_routes="gw6 bgp6"
ntpd_enable="YES"
ntpd_sync_on_start="YES"
qemu_guest_agent_enable="YES"
qemu_guest_agent_flags="-d -v -l /var/log/qemu-ga.log"
sshd_enable="YES"
syslogd_flags="-ss"
tailscaled_enable="YES"
tailscaled_exitnode_enable="YES"
tailscaled_up_args="--accept-dns=false --timeout=30s"
wireguard_enable="YES"
wireguard_interfaces="wg0"

And wg0.conf:

# Route64
[Interface]
PrivateKey = _redacted_
Address = _redacted_, _redacted_
Table = off
PostUp = /sbin/ifconfig %i description "Route64 Upstream"
PostUp = /sbin/ifconfig %i inet6 auto_linklocal
PostUp = /sbin/ifconfig %i inet6 no_prefer_iface
PostUp = /sbin/ifconfig %i tunnelfib 1

[Peer]
PublicKey = _redacted_
AllowedIPs = ::/0, 0.0.0.0/0
Endpoint = _redacted_
PersistentKeepAlive = 30
