Hi all,

Recently I was doing some testing on XR6 and noticed some interesting behavior.
I enabled OSPF adjacency traps to see how the router handles them, and I was getting only a handful of traps on a big router with only a few OSPF sessions. That behavior aligns with the RFC 1850 definition from 1995:

*4.4. Throttling Traps*
*The mechanism for throttling the traps is similar to the mechanism explained in RFC 1224 [11], section 5. The basic idea is that there is a sliding window in seconds and an upper bound on the number of traps that may be generated within this window. Unlike RFC 1224, traps are not sent to inform the network manager that the throttling mechanism has kicked in. A single window should be used to throttle all OSPF trap types except for the ospfLsdbOverflow and the ospfLsdbApproachingOverflow trap which should not be throttled. For example, if the window time is 3, the upper bound is 3 and the events that would cause trap types 1, 3, 5 and 7 occur within a 3 second period, the type 7 trap should not be generated. Appropriate values are 7 traps with a window time of 10 seconds.*

TAC said this is expected behavior aimed at protecting the CPU, and that it can be changed by tweaking how snmpd reacts to the OL signal. While that kind of makes sense overall, it seems very strange that a lab XR router running on a multicore Xeon CPU is unable to send out a handful of traps (I was expecting about 50 traps over a one-minute period). At the same time, TAC didn't mention anything about OSPF's built-in trap throttling mechanism. Further investigation showed that snmpd was silently dropping internal messages because of the OL condition. This is not how I would expect a Linux-based system to behave.

Trying to diagnose XR, I collected some syslogs and wrote a quick script to compare what XR sends out over its SNMP interface versus its syslog interface. To my surprise, while the OSPF-related SNMP trap feed was really bad, the syslog feed was clean and accurate.
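The comparison itself boils down to normalizing both feeds into comparable event records and taking a set difference. A minimal sketch with toy data (the event names and values here are made up for illustration; the actual parsing depends on your collector's log formats):

```python
def missing_from_traps(syslog_events, trap_events):
    """Return events present in the syslog feed but absent from the SNMP
    trap feed, assuming both feeds have been normalized to comparable
    (event_type, neighbor) records."""
    return sorted(set(syslog_events) - set(trap_events))

# Toy data: three adjacency changes logged via syslog, only one made it
# out as a trap.
syslog = [("adj-change", "10.0.0.1"), ("adj-change", "10.0.0.2"),
          ("adj-change", "10.0.0.3")]
traps = [("adj-change", "10.0.0.2")]

print(missing_from_traps(syslog, traps))
# [('adj-change', '10.0.0.1'), ('adj-change', '10.0.0.3')]
```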
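For reference, the sliding-window throttle from RFC 1850 quoted above is easy to model. A minimal Python sketch (my own illustration of the RFC's mechanism, not XR's actual implementation):

```python
from collections import deque

class TrapThrottle:
    """Sliding-window trap throttle per RFC 1850, section 4.4.

    At most `upper_bound` traps may be emitted within any `window`-second
    period; excess traps are silently dropped (no notification is sent
    that throttling kicked in)."""

    def __init__(self, window=10, upper_bound=7):
        self.window = window
        self.upper_bound = upper_bound
        self.sent = deque()  # timestamps of traps emitted within the window

    def allow(self, now):
        # Expire timestamps that have slid out of the window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) < self.upper_bound:
            self.sent.append(now)
            return True
        return False  # trap is silently suppressed

# The RFC's own example: window of 3 s, upper bound of 3, four trap-worthy
# events within one 3-second period; the fourth trap is not generated.
t = TrapThrottle(window=3, upper_bound=3)
print([t.allow(s) for s in (0.0, 0.5, 1.0, 1.5)])
# [True, True, True, False]
```

With the RFC's suggested values (7 traps per 10-second window), anything past the seventh event in a window is silently dropped, which is consistent with seeing only a handful of traps during a burst of adjacency changes.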
It is as if syslogd doesn't react to the OL condition and relies on a different OS scheduling mechanism. I then put extra stress on the router: I created a few hundred subinterfaces with only a few OSPF neighbors and flapped an interface. Once again, the SNMP traps sent to the collector were crippled, while syslog was clean and reflected everything.

Going deeper, I tested BGP reconvergence and observed what happens there. Once again, the SNMP trap feed was very bad at reflecting the BGP FSM transitions (I looked at cbgpFsmStateChange). However, this time even the outgoing syslog messages were affected, which is a bit surprising as well, as if BGP on XR uses the CPUs differently.

I still don't get how an idling lab router can be unable to send traps and syslog messages indicating that it just experienced a big outage. On the one hand, some of that behavior derives from the router trying to bring connectivity back up and operational as soon as possible. On the other hand, these CPU throttling rules were drafted back when routers ran on 800 MHz CPUs, while now we're running on server-grade multicore x86s that should not only complete the whole SPF computation in milliseconds, but also swamp any alarm collector at the same time. It feels like either the SNMP/syslog reporting functions in XR never caught up with the hardware improvements that happened over time, or XR's OS process scheduler has major deficiencies, and I suspect it is the latter.

I knew SNMP and syslog are inherently unreliable, being UDP-based, but I never expected that the router itself wouldn't even try to inform alarm collectors about a potential large-scale outage. If it is actually the OS scheduler, a lot of existing processes, and any upcoming features like BGP-LS or telemetry, may be affected as well. Then I remembered that XR6 is now based on Wind River Linux, not QNX as it was previously.
While QNX is an RTOS, Linux is not, and its kernel relies on a totally different scheduling principle. At the same time, it feels as if XR's internals were bolted onto a new OS without much thought about its architecture. Potentially that may lead to a large number of gray outages that are not properly detected by XR6 routers, and not even reported to NOCs globally.

Am I the only one seeing this behavior in XR? Has anyone else tested how XR routers running on multicore CPUs handle concurrency? Maybe someone has compared how XR handles process concurrency for network events? Unfortunately, I am unable to share any data dumps, but I would be happy to share the scripts and methods I used for the data analysis.

Hope I am just overreacting.

Rgds,
Nival
_______________________________________________
cisco-nsp mailing list [email protected]
https://puck.nether.net/mailman/listinfo/cisco-nsp
archive at http://puck.nether.net/pipermail/cisco-nsp/
