Hi James,

We see the same incredible noise from a few peers we have at RouteViews. So much so that it puts quite a strain on our backend infrastructure (not only the collector itself, but also the syncing of these updates back to our archive, and the storage). And, longer term, do researchers who use RouteViews data want us to keep this noise for ever and ever in our archive, as it consumes Tbytes of compressed disk space...

And yes, like you, we reach out to the providers of the incredible noise, and they are usually unresponsive (what exactly are PeeringDB contact entries for, I wonder?). I feel your pain!! So we have to take the (for us) drastic step of shutting down the peer - we have to keep RouteViews useful and usable for the community, as we have been trying to do for (almost) the last 30 years. Of course, some folks want to study the incredible noise too, but ultimately none of it really helps keep the Internet infrastructure stable.

I guess all we can do is keep highlighting the problem (I highlight Geoff's BGP Update report almost every BGP Best Practice training I run here in AsiaPac, for example) - but how to make noisy peers go away, longer term...? :-(

philip
--

James Bensley wrote on 9/2/2025 23:43:
Hi Romain,

I have been looking at prefixes with large numbers of updates for a few years 
now. As Geoff pointed out, this is a long-running problem and it’s not one that 
is going to (ever?) go away.

I have auto-generated daily reports of general pollution seen in the DFZ from 
the previous day, which can be found here: 
https://github.com/DFZ-Name-and-Shame/dnas_stats [1].

Geoff pointed out “when a prefix is being updated 33,000 times in 145 days it’s 
basically being updated as fast as many BGP implementations will let you”, 
however, there are peers generating millions of updates *per day* for the same 
prefix! There are multiple issues here which need to be unpacked to fully 
understand what’s going on…

* Sometimes a router updates a prefix as fast as BGP allows (this is what Geoff 
has pointed out). This could be for a range of reasons, like a flapping link or 
a redistribution issue.

* Sometimes there is a software bug in a BGP implementation which re-transmits 
the update as fast as TCP allows. Here is an example prefix in a daily report, 
which was present in 4M updates, from a single peer of a single route collector: 
https://github.com/DFZ-Name-and-Shame/dnas_stats/blob/main/2023/03/15/20230315.txt#L607

The ASN a few lines below has almost exactly the same number of BGP 
advertisements for that day: AS394119 (also, that name, 
EXPERIMENTAL-COMPUTING-FACILITY, feels like a bit of a smoking gun!). Using a 
looking glass we can confirm that AS394119 is the origin of that prefix: 
https://stat.ripe.net/widget/looking-glass#w.resource=2602:fe10:ff4::/48
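
If you want to script that check rather than use the widget, something like the 
sketch below against the RIPEstat data API should do it (a rough sketch only; 
I’m recalling the looking-glass endpoint’s JSON field names from memory, so 
treat "rrcs", "peers" and "as_path" as assumptions to verify against a live 
response):

    # Sketch: confirm the origin ASN of a prefix via the RIPEstat looking-glass
    # data endpoint. Field names ("rrcs", "peers", "as_path") are assumptions
    # recalled from memory; check them against an actual response.
    import json
    import urllib.request
    from collections import Counter

    PREFIX = "2602:fe10:ff4::/48"
    URL = f"https://stat.ripe.net/data/looking-glass/data.json?resource={PREFIX}"

    with urllib.request.urlopen(URL, timeout=30) as resp:
        data = json.load(resp)

    origins = Counter()
    for rrc in data.get("data", {}).get("rrcs", []):
        for route in rrc.get("peers", []):
            as_path = str(route.get("as_path", "")).split()
            if as_path:
                origins[as_path[-1]] += 1  # origin ASN is the last hop in the path

    print(f"Origin ASNs seen for {PREFIX}: {dict(origins)}")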

Here is a screenshot from RIPE Stat at the same time, where they are recording 
nearly 30M updates for the same prefix per day: 
https://null.53bits.co.uk/uploads/images/networking/internetworking/ripe-prefix-count.png

The difference is that the 30M number is across all RIPE RIS collectors. I have 
tried to de-dupe in my daily report and just choose the highest value from a 
single collector. ~4M updates per day is ~46 updates per second. So this is a 
BGP speaker stuck in an infinite loop, sending an update as fast as it can and, 
for some reason, not registering in its internal state machine that the 
update has been sent.
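
For clarity, the de-dupe and the rate arithmetic look roughly like this (a 
simplified sketch with made-up counts, not the actual report code):

    # Simplified sketch: de-dupe across collectors by taking the max
    # per-collector count for each (peer ASN, prefix), then convert the daily
    # count into an approximate updates-per-second rate.
    daily_counts = {
        # (collector, peer_asn, prefix) -> updates seen that day (made-up numbers)
        ("rrc01", 64496, "2602:fe10:ff4::/48"): 3_950_000,
        ("rrc12", 64496, "2602:fe10:ff4::/48"): 4_100_000,
        ("route-views2", 64496, "2602:fe10:ff4::/48"): 3_700_000,
    }

    per_peer_prefix = {}
    for (collector, peer, prefix), count in daily_counts.items():
        key = (peer, prefix)
        per_peer_prefix[key] = max(per_peer_prefix.get(key, 0), count)

    for (peer, prefix), count in per_peer_prefix.items():
        rate = count / 86_400  # seconds in a day
        print(f"AS{peer} {prefix}: {count} updates/day ~= {rate:.0f} updates/sec")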

I’ve reached out to a few ASNs who’ve shown up in my daily reports for excessive 
announcements, and what I have seen is that sometimes it’s the ASN peering with 
the RIS collector, and sometimes it’s an ASN the route collector peer is 
peering with. In multiple cases, they have simply bounced their BGP session 
with their peer or with the collector, and the issue has gone away.

* Sometimes a software bug causes the state of a route to falsely change. As an 
example of this, there was a bug in Cisco’s IOS-XR. If you were running soft 
reconfiguration inbound and RPKI, I think any RPKI state change (to any 
prefix) was causing a route refresh. Or something like that. It’s been a couple 
of years, but you needed this specific combination of features, and IOS-XR was 
just churning out updates like its life depended on it. I reached out to an 
ASN who showed up in my daily report, they confirmed they had this bug, and 
eventually they fixed it.

* These problems aren’t DFZ wide. Peer A might be sending a bajillion updates 
to peer B, but peer B sees there is no change in the route and correctly 
doesn’t forward the update onwards to its peers / public collectors. So this 
is probably happening a lot more than we see via RIS or RouteViews. Only some 
parts of the DFZ will be receiving the gratuitous updates/withdraws.

I recall there was a conversation either here on NANOG or maybe it was at the 
IETF, within the last few years, about different NOSes that were / were not 
correctly identifying route updates received with no changes to the existing 
RIB entry, and [not] forwarding the update onwards. I’m not sure what came of 
this.
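
For anyone who wants to check their own feed for this, the test is conceptually 
simple: compare each incoming announcement against what is already held for 
that prefix and count the ones that change nothing. A toy sketch (made-up data 
structures, not any particular NOS’s internals):

    # Toy sketch of spotting "gratuitous" updates: announcements whose
    # attributes are identical to what we already hold for that prefix.
    rib = {}        # prefix -> attribute tuple we currently hold
    gratuitous = 0  # updates that changed nothing

    def receive_update(prefix, as_path, next_hop, communities):
        global gratuitous
        attrs = (tuple(as_path), next_hop, tuple(sorted(communities)))
        if rib.get(prefix) == attrs:
            gratuitous += 1  # nothing changed; a well-behaved speaker would
            return False     # not have sent (or propagated) this update
        rib[prefix] = attrs
        return True          # genuine change, worth propagating

    # Example: the second, identical announcement is counted as noise.
    receive_update("192.0.2.0/24", [64496, 64511], "198.51.100.1", ["64496:100"])
    receive_update("192.0.2.0/24", [64496, 64511], "198.51.100.1", ["64496:100"])
    print(f"gratuitous updates seen: {gratuitous}")  # -> 1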

* Some people aren’t monitoring and alerting based on CPU usage. There are plenty 
of old rust buckets connected to the DFZ whose CPUs will be on fire as a result of 
a buggy peer, and the operator is simply unaware. Some people are running cutting 
edge devices with 12 cores at 3 GHz and 64 GB of RAM, for which all this churn is 
no problem, so even if their CPU usage is monitored, it will be one core at 100% 
but the rest nearly idle, and crappy monitoring software will aggregate this and 
report the device as having <10% usage, and operators think everything is fine.
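
The fix on the monitoring side is simply to alert on the hottest core rather 
than the average. Something like this (a sketch, assuming psutil is available 
on the monitoring host):

    # Sketch: alert on the busiest core, not the average across all cores.
    import psutil

    per_core = psutil.cpu_percent(interval=1, percpu=True)
    average = sum(per_core) / len(per_core)
    hottest = max(per_core)

    print(f"average: {average:.1f}%  hottest core: {hottest:.1f}%")
    if hottest > 90:
        print("ALERT: one core is pegged, even though the average looks fine")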

* There are no knobs in existing BGP implementations to detect and limit this 
behaviour in any way.
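
Such a knob wouldn’t need to be clever, either; even a per-peer, per-prefix 
sliding window counter would catch this kind of churn. A purely hypothetical 
sketch of what I mean (the threshold is arbitrary):

    # Hypothetical sketch of the missing knob: a per-(peer, prefix) sliding
    # window that flags prefixes being updated faster than some threshold.
    import time
    from collections import defaultdict, deque

    WINDOW_SECS = 60
    MAX_UPDATES_PER_WINDOW = 100  # arbitrary threshold, for illustration only

    history = defaultdict(deque)  # (peer, prefix) -> timestamps of recent updates

    def update_received(peer, prefix, now=None):
        now = time.monotonic() if now is None else now
        q = history[(peer, prefix)]
        q.append(now)
        while q and q[0] < now - WINDOW_SECS:
            q.popleft()
        if len(q) > MAX_UPDATES_PER_WINDOW:
            # A real implementation might log, dampen, or notify the operator.
            print(f"peer {peer} is churning {prefix}: {len(q)} updates in {WINDOW_SECS}s")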

If you end up contacting the operators of the ASNs in your original email, and 
getting this problem fixed, I’d be interested to know what the cause was in 
those cases. I’ve all but given up contacting operators that show up in my 
daily reports. It’s an endless endeavour, and some operators simply don’t 
respond to my multiple emails (also, I have a day job, and a personal life, and 
I also like sleeping and eating from time to time).

With kind regards,
James.

[1] It stopped working recently, so it’s now "catching-up", which is why the 
data is a few days behind.
