Hi,

We are an IXP running 2x route servers with BIRD, each running separate daemons 
for IPv4 and IPv6.

We are running BIRD 2.0.8-1 on Debian 10 and have around 250 peers, ~150k 
routes on v4 and ~50k routes on v6.

After we upgraded to BIRD 2 nearly 3 years ago, it was really stable until May
this year. Since then we've had 3 crashes of the v4 daemon on one of the
servers. The v6 daemon on that server has been fine, as has the second route
server, which runs the same setup with the same peers and therefore, in theory,
the same routes.

The first two of these crashes happened a week apart, after which I rebooted
the VM to ensure everything was clean. It was then fine for 90 days, but it
crashed in the same way yesterday.

Our BIRD configuration is generated by IXP Manager and updated hourly.

We then run a "bird re-validate" cron job every hour (at twenty past the hour):
/usr/sbin/birdc -s /run/bird-ipv6.ctl reload in all > /dev/null ;
/usr/sbin/birdc -s /run/bird-ipv4.ctl reload in all > /dev/null

Interestingly, all 3 crashes have happened just after twenty past the hour,
i.e. soon after this cron job has run.

It looks like the following in the logs:

Aug 17 17:20:01 rs1 CRON[29229]: (root) CMD (/usr/sbin/birdc -s 
/run/bird-ipv6.ctl reload in all > /dev/null ; /usr/sbin/birdc -s 
/run/bird-ipv4.ctl reload in all > /dev/null)
Aug 17 17:20:01 rs1 bird: Reloading protocol device1
Aug 17 17:20:01 rs1 bird: Reloading protocol pp_0121_asxx
..etc..
Aug 17 17:20:01 rs1 bird: Reloading protocol pp_1082_asxxxxxx
Aug 17 17:20:01 rs1 bird: Reloading protocol pb_1082_asxxxxxx
Aug 17 17:20:01 rs1 bird: Tagging invalid ROA 2001:xxxx:xxxx::/48 for ASN xxxxx
..etc..
Aug 17 17:21:17 rs1 bird: Tagging invalid ROA x.x.x.x/23 for ASN xxxx
Aug 17 17:21:19 rs1 kernel: [7811815.959943] bird[586]: segfault at f30021 ip 
000055a1bf450fc3 sp 00007ffe64f3da98 error 4 in bird[55a1bf42a000+d8000]
Aug 17 17:21:19 rs1 kernel: [7811815.966760] Code: 95 78 01 00 00 5b 5d 41 5c 
c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 48 85 ff b8 01 00 00 00 74 15 48 85 
f6 0f 84 a6 00 00 00 <0f> b6 46 21 0f b6 57 21 29 d0 74 11 f3 c3 0f 1f 44 00 00 
66 2e 0f
Aug 17 17:21:19 rs1 systemd[1]: bird-ipv4.service: Main process exited, 
code=killed, status=11/SEGV
Aug 17 17:21:19 rs1 systemd[1]: bird-ipv4.service: Failed with result 'signal'.
Aug 17 17:21:19 rs1 systemd[1]: bird-ipv4.service: Service RestartSec=100ms 
expired, scheduling restart.
Aug 17 17:21:19 rs1 systemd[1]: bird-ipv4.service: Scheduled restart job, 
restart counter is at 1.
Aug 17 17:21:19 rs1 systemd[1]: Stopped BIRD - ipv4.
Aug 17 17:21:19 rs1 systemd[1]: Starting BIRD - ipv4...
Aug 17 17:21:22 rs1 systemd[1]: Started BIRD - ipv4.
Aug 17 17:21:22 rs1 bird: Started

When the second crash happened, we were at RIPE84, so we chatted to Maria in
person. She said that it was possible to debug it, but that she would need a
core dump.

After looking into this, I did:

ulimit -S -c unlimited
and installed the systemd-coredump package.

...which was supposed to dump a core file if a process crashed. I tested this 
by killing a sleep command from the shell with kill -s 6 and it worked.
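
The test described above can be sketched roughly as follows. This is a minimal
sketch assuming a POSIX shell; the 60 second sleep and the coredumpctl check
are illustrative assumptions rather than the exact commands used:

```shell
# Sketch of the test; the sleep duration is arbitrary.
ulimit -S -c unlimited      # raise the soft core size limit for this shell only

sleep 60 &                  # throwaway process to kill
pid=$!
kill -s 6 "$pid"            # signal 6 is SIGABRT, whose default action dumps core
wait "$pid"
status=$?                   # 128 + 6 = 134 when the process died from SIGABRT
echo "sleep exited with status $status"

# With systemd-coredump installed, the dump should then be listed by:
#   coredumpctl list
```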

When the crash happened again yesterday, I hoped to have a core file to send, 
but there is no sign of it having generated one :(

On a test server, killing sleep generates a core file, but killing bird does
not.
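
One thing that could be compared between the two cases is the set of limits
the running daemon actually has, since a ulimit set in a shell applies only to
that shell and the processes started from it. A rough sketch, assuming a Linux
/proc filesystem; show_core_settings is a hypothetical helper name, and the
bird-ipv4.service unit name is taken from the logs above:

```shell
# Hypothetical helper: print the core-dump related settings for one PID.
show_core_settings() {
    pid="$1"
    # Effective soft/hard core size limits of that process
    grep 'Max core file size' "/proc/$pid/limits"
    # Where the kernel sends core dumps (a pipe to systemd-coredump, if installed)
    echo "core_pattern: $(cat /proc/sys/kernel/core_pattern)"
    # Whether processes that have changed user ID may dump core (0 = no)
    echo "suid_dumpable: $(cat /proc/sys/fs/suid_dumpable)"
}

# Example for this shell; for the daemon, pass its PID instead, e.g.
#   show_core_settings "$(systemctl show -p MainPID --value bird-ipv4.service)"
show_core_settings "$$"
```

Running this against both the sleep process and the bird process might show
where the two differ.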

So two things - has anyone experienced similar crashes or have any ideas why we 
might be seeing this?

Can anyone advise how to reliably get a core dump if bird crashes?

Thanks!

Ian
