On 24/02/2017 04:12 PM, Ondrej Zajicek wrote:
> On Fri, Feb 24, 2017 at 01:13:55PM +0100, Pavlos Parissis wrote:
>> Hi,
>>
>> We have observed some instability in the BFD protocol, where the upstream
>> router and/or the server (Linux RedHat 7.3) declares the BFD session dead
>> and, as a consequence, the upstream router stops forwarding traffic to the
>> server (we utilize ECMP).
>>
>> Our current hypothesis is that Bird logs messages (only BGP KEEPALIVE
>> messages when there isn't any route change) via the glibc syslog function,
>> which connects to a UNIX socket (/dev/log), and that the sender (Bird
>> daemon) may block when the receiver (rsyslogd) doesn't respond fast enough
>> or the buffer is full.
>>
>> On RedHat 7 servers there is a chain of daemons which receive log messages
>> via UNIX sockets.
>>
>> systemd-journald.service listens on the /dev/log UNIX socket and forwards
>> messages to the /run/systemd/journal/syslog UNIX socket, where rsyslogd
>> listens.
>>
>> As far as I can see in the code and in the output of ps -eLl, the Bird
>> daemon is a single-threaded process (please correct me if I am wrong), so
>> it could be that a call to syslog blocks for X seconds, where X is higher
>> than the failure detection time.
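
A rough way to see the blocking behaviour described above, without involving
Bird or rsyslogd at all: the sketch below fills a UNIX datagram socket whose
reader never reads. It assumes Linux and that /dev/log behaves like a datagram
socket; socketpair() and the function name messages_until_stall are stand-ins
chosen for the illustration, not anything taken from Bird or glibc.

```python
import socket


def messages_until_stall(msg_size=1024, limit=100_000, timeout=0.2):
    """Send datagrams to a reader that never reads; return how many were accepted."""
    # socketpair() stands in for the writer -> /dev/log connection (assumption).
    writer, reader = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    # A timeout turns "block until the reader catches up" into an exception
    # that we can observe instead of hanging the demo.
    writer.settimeout(timeout)
    sent = 0
    try:
        while sent < limit:
            writer.send(b"x" * msg_size)
            sent += 1
    except OSError:
        # Includes socket.timeout: the queue is full and send() could not proceed.
        pass
    finally:
        writer.close()
        reader.close()
    return sent


if __name__ == "__main__":
    print("datagrams accepted before the writer stalled:", messages_until_stall())
```

With a plain blocking socket (no timeout), the last send() would not return
until the reader drains its queue, which is the kind of stall suspected here.
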
> 
> Hi
> 
> BIRD is single-threaded with the exception of BFD, which runs in a
> separate thread. Generally, interaction of the BFD thread with the rest of
> BIRD is designed in a way that the BFD thread should not wait on the main
> thread. So generally, the main thread blocked on syslog() should not
> cause problems to the BFD thread. There are some exceptions, like when
> the BFD thread wants to log itself (there is a shared mutex around the
> logging subsystem), but that is usually not a problem, as BFD does not log
> anything during regular operation (unless packet logging is enabled).
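
The exception mentioned above (a shared mutex around logging) can be modelled
with two threads. This is only a toy model in Python, not BIRD's code:
log_lock, slow_log and bfd_thread_delay are hypothetical names, and
time.sleep() stands in for a syslog() call that is stuck.

```python
import threading
import time

# Stand-in for the mutex around the logging subsystem (assumption).
log_lock = threading.Lock()


def slow_log(duration):
    """Model the main thread holding the log mutex while its write is stuck."""
    with log_lock:
        time.sleep(duration)


def bfd_thread_delay(log_from_bfd, block_for=0.5):
    """Return how long the modelled BFD thread is held up, in seconds."""
    result = {}

    def bfd_work():
        start = time.monotonic()
        if log_from_bfd:
            # The BFD thread wants to log too, so it must wait for the mutex.
            with log_lock:
                pass
        result["delay"] = time.monotonic() - start

    main = threading.Thread(target=slow_log, args=(block_for,))
    main.start()
    time.sleep(0.05)  # give the main thread time to take the lock
    bfd = threading.Thread(target=bfd_work)
    bfd.start()
    bfd.join()
    main.join()
    return result["delay"]


if __name__ == "__main__":
    print("BFD thread not logging:", round(bfd_thread_delay(False), 3), "s")
    print("BFD thread logging:    ", round(bfd_thread_delay(True), 3), "s")
```

In this model the BFD thread is only delayed when it tries to log while the
main thread holds the mutex; otherwise it proceeds immediately.
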
> 
> I would suggest decreasing the min rx/tx interval to 100 ms (to see if that
> helps).

If the hypothesis holds true, that is, Bird blocks for 1.2 seconds, then
sending BFD messages at a higher rate won't help. Do you agree?

I could try the opposite: configure the upstream router to declare the BFD
session down only after it hasn't seen a BFD message for 5 seconds.
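
To put numbers on this: as I read RFC 5880, in asynchronous mode the detection
time is the remote detect multiplier times the larger of the local required
min RX interval and the remote desired min TX interval. A small sketch of that
arithmetic follows. The multipliers (3 and 5) and the 1000 ms interval are
assumed values for illustration; only the 1.2 s stall and the 100 ms interval
come from this thread.

```python
def detection_time_ms(remote_detect_mult, local_min_rx_ms, remote_min_tx_ms):
    """Asynchronous-mode detection time, following the RFC 5880 formula."""
    return remote_detect_mult * max(local_min_rx_ms, remote_min_tx_ms)


def survives_stall(stall_ms, remote_detect_mult, local_min_rx_ms, remote_min_tx_ms):
    """True if the peer would still consider the session up after the stall."""
    return stall_ms < detection_time_ms(
        remote_detect_mult, local_min_rx_ms, remote_min_tx_ms
    )


if __name__ == "__main__":
    stall_ms = 1200  # the suspected blocking time

    # 100 ms intervals with an assumed multiplier of 3: 300 ms detection time.
    print(detection_time_ms(3, 100, 100), survives_stall(stall_ms, 3, 100, 100))

    # Assumed 1000 ms intervals and multiplier 5: 5000 ms detection time.
    print(detection_time_ms(5, 1000, 1000), survives_stall(stall_ms, 5, 1000, 1000))
```

This only applies if BFD packets really are held back for the whole stall; if
the BFD thread keeps sending while the main thread is blocked, the stall never
reaches the peer.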

> And you could try 'watchdog warning' / 'debug latency' options
> (with appropriate values, like 500 ms) to track latency in the main
> thread to see if BFD problems are related to possible latency problems in
> the main thread.
> 

Unfortunately, I still run version 1.4.5, which doesn't have those options,
so I can't experiment with them. I guess this is yet another reason for
upgrading to 1.6.3.

Thanks a lot for your reply, it is very much appreciated.

Cheers,
Pavlos
