(mistakenly sent off-list)
-------- Original Message -------- From: Maria Matejka <maria.mate...@nic.cz> Sent: 13 April 2024 18:18:05 CEST To: Erin Shepherd <bird-us...@erinshepherd.net> Subject: Re: babel RTT metric false samples Just quick thought – I think both approaches (timestamping in kernel and in userspace) are actually useful for different purposes. Thus, we shall support both. Transferred to our internal issue: https://gitlab.nic.cz/labs/bird/-/issues/61 Maria On 13 April 2024 16:38:47 CEST, Erin Shepherd <bird-us...@erinshepherd.net> wrote: >I guess it might not fit with bird's abstractions (or perhaps the Babel >protocol), but has thought been given to using SO_TIMESTAMPING to have the >kernel compute TX/RX timestamps? > >- Erin > > >On Sat, 13 Apr 2024, at 16:14, Maria Matejka via Bird-users wrote: >> Hello Stephanie, Toke and list, >> >> On Fri, Apr 12, 2024 at 04:22:50PM +0200, Toke Høiland-Jørgensen via >> Bird-users wrote: >> >>> Stephanie Wilde-Hobbs via Bird-users bird-users@network.cz writes: >>> >>>> The babel RTT metric measurements provided by bird appears suspect for my >>>> setup. The metric through a tunnel with a latency of about 5ms is shown in >>>> babel as 150+ms. >>>> >> […] >> >>>> Debug logs show many RTT samples with approximately correct timestamps >>>> (4-6ms) then the occasional IHU with 800-1200ms calculated instead. >>>> Calculating the RTT metric by hand using babel packet logs shows that the >>>> calculations are correct. By correlating two packet dumps (the machines >>>> have <1ms NTP timekeeping) I can also see that the packets for which high >>>> RTT is calculated have similar transit times through the tunnel as other >>>> packets. Hence, I suspect the accuracy of the packet timestamps recorded >>>> by bird. Is the current packet timestamping system giving correct >>>> timestamps if the packet arrives while babel is processing another event? >>>> >>> Hmm, so Babel implementation in Bird tries to get a timestamp as early as >>> possible after receiving the packet, and set it as late as possible before >>> sending out the packet. However, the former in practice means after >>> returning from poll(), so if the packet has been sitting around in the OS >>> buffer for a while before Bird gets around to process it, the timestamp is >>> not set until Bird is done processing it. Likewise, if the packet sits >>> around in a socket buffer (or in a lower-level buffer on the sending side) >>> after Bird has sent it out, that time will also be counted as part of the >>> RTT. >>> >> I would suspect that the kernel table prune routine may be the case. It just >> runs from begin to end synchronously. >> >> I have just fast-tracked Babel in its own thread for BIRD 3, it may be worth >> checking. (There should be also artifacts from the build process for >> download available.) This should get you rid of most of the cases of >> suspiciously high RTT. >> >> `https://gitlab.nic.cz/labs/bird/-/tree/babel-in-threads` >> Just to be noted, updating a route in BIRD 3 is still a locking process so >> it may still tamper the RTT measurements. At least it should happen only in >> cases where Babel is doing the update. Anyway, with BIRD 3 internals, it >> should be possible to easily *detect* such situations and disregard these >> single measurements as unreliable. (Not implemented, though.) >> >> There are even some thoughts on implementing lockless import queues for >> routing tables, yet now we have to prioritize BIRD 3 stabilization to >> actually release it as a stable version. Import queues must wait. >> >> Also with this testing, feel free to report any weird behavior, notably >> crashes of BIRD 3, as bugs. That would be very helpful with stabilizing BIRD >> 3. Thanks a lot! >> >> Maria >> >> – Maria Matejka (she/her) | BIRD Team Leader | CZ.NIC, z.s.p.o. >> -- Maria Matejka (she/her) | BIRD Team Leader | CZ.NIC, z.s.p.o. -- Maria Matejka (she/her) | BIRD Team Leader | CZ.NIC, z.s.p.o.