Hi Ian, all,

Ian Chilton wrote on 18/08/2022 16:57:
> We then run a "bird re-validate" cron job every hour (at twenty past the hour):
> /usr/sbin/birdc -s /run/bird-ipv6.ctl reload in all > /dev/null ; /usr/sbin/birdc -s /run/bird-ipv4.ctl reload in all
>
> Interestingly all 3 crashes have happened at just after twenty past the hour, i.e. soon after this cron job has run.

As you're running Bird 2.0.8, this should no longer be necessary. Per the 2.0.8 release notes:

> Version 2.0.8 (2021-03-18)
>  o Automatic channel reloads based on RPKI changes

So given all three crashes appear linked to this, stopping those manual reloads should, hopefully, return you to stability.
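If you want to double-check that correlation across all of your logs, a rough sketch along these lines might help. It prints the minute-of-hour for each kernel segfault line that mentions bird, so crashes that cluster just after :20 stand out. The function name and the log path in the usage comment are my assumptions, not something from your setup.

```shell
# Hypothetical helper: read syslog-format lines on stdin and print the
# minute-of-hour of every kernel segfault line that mentions bird.
crash_minutes() {
    # In traditional syslog lines, field 3 is the HH:MM:SS timestamp.
    awk '/kernel:/ && /bird\[[0-9]+\]: segfault/ { split($3, t, ":"); print t[2] }'
}

# Example usage (the log path is an assumption; adjust for your distribution):
# crash_minutes < /var/log/syslog
```

Fed the segfault line from your log excerpt, this prints 21, one minute after the cron job starts at :20.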

You're also two bugfix releases behind. At INEX we've been running 2.0.9 for ~5/6 months now without issue.

There appear to be a lot of bug fixes between 2.0.8 and 2.0.10, so it might be worthwhile updating, or checking the git commit logs to see if there's anything relevant to RPKI in there.
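For the commit log check, something like the sketch below could narrow things down. It lists commits between two refs whose message mentions RPKI, case-insensitively. The helper name is made up, and the tag names in the usage comment are an assumption about how the upstream repository labels releases, so verify them with "git tag" first.

```shell
# Hypothetical helper: list commits between two refs whose commit message
# mentions RPKI (case-insensitive).
# $1 = path to a BIRD git checkout, $2 = older ref, $3 = newer ref.
rpki_commits() {
    git -C "$1" log --oneline -i --grep='rpki' "$2..$3"
}

# Example usage (checkout path and tag names are assumptions):
# rpki_commits ./bird v2.0.8 v2.0.10
```

This only matches on commit messages, so a relevant fix described without the word "RPKI" would be missed.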

hth,
 - Barry


It looks like the following in the logs:

Aug 17 17:20:01 rs1 CRON[29229]: (root) CMD (/usr/sbin/birdc -s /run/bird-ipv6.ctl reload in all > /dev/null ; /usr/sbin/birdc -s /run/bird-ipv4.ctl reload in all > /dev/null)
Aug 17 17:20:01 rs1 bird: Reloading protocol device1
Aug 17 17:20:01 rs1 bird: Reloading protocol pp_0121_asxx
..etc..
Aug 17 17:20:01 rs1 bird: Reloading protocol pp_1082_asxxxxxx
Aug 17 17:20:01 rs1 bird: Reloading protocol pb_1082_asxxxxxx
Aug 17 17:20:01 rs1 bird: Tagging invalid ROA 2001:xxxx:xxxx::/48 for ASN xxxxx
..etc..
Aug 17 17:21:17 rs1 bird: Tagging invalid ROA x.x.x.x/23 for ASN xxxx
Aug 17 17:21:19 rs1 kernel: [7811815.959943] bird[586]: segfault at f30021 ip 000055a1bf450fc3 sp 00007ffe64f3da98 error 4 in bird[55a1bf42a000+d8000]
Aug 17 17:21:19 rs1 kernel: [7811815.966760] Code: 95 78 01 00 00 5b 5d 41 5c c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 48 85 ff b8 01 00 00 00 74 15 48 85 f6 0f 84 a6 00 00 00 <0f> b6 46 21 0f b6 57 21 29 d0 74 11 f3 c3 0f 1f 44 00 00 66 2e 0f
Aug 17 17:21:19 rs1 systemd[1]: bird-ipv4.service: Main process exited, code=killed, status=11/SEGV
Aug 17 17:21:19 rs1 systemd[1]: bird-ipv4.service: Failed with result 'signal'.
Aug 17 17:21:19 rs1 systemd[1]: bird-ipv4.service: Service RestartSec=100ms expired, scheduling restart.
Aug 17 17:21:19 rs1 systemd[1]: bird-ipv4.service: Scheduled restart job, restart counter is at 1.
Aug 17 17:21:19 rs1 systemd[1]: Stopped BIRD - ipv4.
Aug 17 17:21:19 rs1 systemd[1]: Starting BIRD - ipv4...
Aug 17 17:21:22 rs1 systemd[1]: Started BIRD - ipv4.
Aug 17 17:21:22 rs1 bird: Started

When the second crash happened, we happened to be at RIPE84 so we chatted to Maria in person. She said that it was possible to debug it, but would need a core dump.

After looking into this, I did:

ulimit -S -c unlimited
and installed the systemd-coredump package.

...which was supposed to dump a core file if a process crashed. I tested this by killing a sleep command from the shell with kill -s 6 and it worked.

When the crash happened again yesterday, I hoped to have a core file to send, but there is no sign of it having generated one :(

Testing on a test server, killing sleep generates a core file, but not killing bird.

So two things: has anyone experienced similar crashes, or does anyone have any ideas why we might be seeing this?

Can anyone advise how to reliably get a core dump if bird crashes?

Thanks!

Ian



--

Kind regards,
Barry O'Donovan
Consultant

For and on behalf of INEX

https://www.inex.ie/support/
+353 1 531 3339

