On 24/02/2017 04:12 PM, Ondrej Zajicek wrote:
> On Fri, Feb 24, 2017 at 01:13:55PM +0100, Pavlos Parissis wrote:
>> Hi,
>>
>> We have observed some instability in the BFD protocol, where the upstream
>> router and/or the server (Linux RedHat 7.3) declares the BFD session dead
>> and, as a consequence, the upstream router stops forwarding traffic to the
>> server (we utilize ECMP).
>>
>> Our current hypothesis is that Bird logs messages (only BGP KEEPALIVE
>> messages when there isn't any route change) via the glibc syslog function,
>> which connects to a UNIX socket (/dev/log), and the sender (the Bird daemon)
>> may block when the receiver (rsyslogd) doesn't respond fast enough or the
>> buffer is full.
>>
>> On RedHat 7 servers there is a chain of daemons that receive log messages
>> via UNIX sockets.
>>
>> systemd-journald.service listens on the /dev/log UNIX socket and forwards
>> messages to the /run/systemd/journal/syslog UNIX socket, on which rsyslogd
>> listens.
>>
>> As far as I can see in the code and in the output of ps -eLl, the Bird
>> daemon is a single-threaded process (please correct me if I am wrong), so
>> it could be that a call to syslog blocks for X seconds, where X is higher
>> than the failure detection time.
>
> Hi
>
> BIRD is single-threaded with the exception of BFD, which runs in a
> separate thread. Generally, interaction of the BFD thread with the rest of
> BIRD is designed in a way that the BFD thread should not wait on the main
> thread. So generally, the main thread blocked on syslog() should not
> cause problems for the BFD thread. There are some exceptions, like when
> the BFD thread wants to log itself (there is a shared mutex around the
> logging subsystem), but that is usually not a problem, as BFD does not log
> anything during regular operation (unless packet logging is enabled).
>
> I would suggest decreasing the min rx/tx interval to 100 ms (to see if that
> helps).
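For reference, a rough sketch of the arithmetic behind the interval suggestion. It assumes the simplified rule that the BFD detection time is the negotiated interval multiplied by the detect multiplier, and it only matters if BFD packets are actually delayed by a stall (which the separate BFD thread should normally prevent). The function names and the multiplier values are hypothetical, chosen only for illustration; the 1.2 s stall is the figure considered in this thread.

```python
def detection_time_ms(interval_ms: float, multiplier: int) -> float:
    """Simplified BFD detection time: negotiated interval times detect multiplier."""
    return interval_ms * multiplier


def survives_stall(stall_ms: float, interval_ms: float, multiplier: int) -> bool:
    """True if a sender stall of stall_ms is shorter than the detection time,
    i.e. the peer would not yet declare the session down."""
    return stall_ms < detection_time_ms(interval_ms, multiplier)


# Hypothetical settings: a 1.2 s stall against two configurations.
STALL_MS = 1200
fast = survives_stall(STALL_MS, interval_ms=100, multiplier=3)    # 300 ms detection time
slow = survives_stall(STALL_MS, interval_ms=1000, multiplier=5)   # 5000 ms detection time
```

Under these assumed values, a shorter interval shrinks the detection time, so a stall of that length would still bring the session down, while a 5 s detection time would ride it out.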
If the hypothesis holds true, that is, Bird blocks for 1.2 seconds, then
sending BFD messages at a higher rate won't help. Do you agree? I could try
the opposite: configure the upstream router to declare the BFD session down
only after it hasn't seen a BFD message for a period of 5 seconds.

> And you could try 'watchdog warning' / 'debug latency' options
> (with appropriate values, like 500 ms) to track latency in the main
> thread to see if BFD problems are related to eventual latency problems in
> the main thread.
>

Unfortunately, I still run version 1.4.5, which doesn't have those options,
so I can't experiment with them. I guess this is yet another reason for
upgrading to 1.6.3.

Thanks a lot for your reply, it is very much appreciated.

Cheers,
Pavlos
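One way to test the syslog-blocking hypothesis without the 'debug latency' option would be an external probe that times individual logging calls. The following is only a minimal sketch: it does not instrument Bird itself, the helper name `time_calls` is hypothetical, and the 0.5 s default threshold simply mirrors the 500 ms value suggested above.

```python
import time
from typing import Callable, Iterable, List, Tuple


def time_calls(
    log_call: Callable[[str], None],
    messages: Iterable[str],
    threshold_s: float = 0.5,
) -> List[Tuple[str, float]]:
    """Call log_call on each message and return (message, elapsed seconds)
    for every call that took at least threshold_s."""
    slow = []
    for msg in messages:
        start = time.monotonic()          # monotonic clock: immune to wall-clock jumps
        log_call(msg)
        elapsed = time.monotonic() - start
        if elapsed >= threshold_s:
            slow.append((msg, elapsed))
    return slow
```

On the affected server this could be run as `time_calls(syslog.syslog, messages)` using Python's standard `syslog` module, which goes through the same /dev/log path; any entries returned would indicate that a logging call stalled for longer than the threshold.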