Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Saku Ytti
On Fri, 9 Jul 2021 at 00:01, William Herrin wrote: > I would suggest that your customer does care, but as there is no Most don't. Somewhat recently we were dropping a non-trivial amount of packets from a well-known book store due to DMAC failure. This was unexpected, considering it was an L3 to

ATTN: Mentees, Mentors, and Future Students - We need your voice!

2021-07-08 Thread Nanog News
*So, What is it Actually like to Participate in a Hackathon?* *You may have heard about this thing called a "Hackathon."* The weekend-long event that provides for epic networking, as fellow engineers from all over the world get together and share

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread William Herrin
On Thu, Jul 8, 2021 at 5:31 AM Saku Ytti wrote: > Network experiences gray failures all the time, and I almost never > care, unless a customer does. Greetings, I would suggest that your customer does care, but as there is no simple test to demonstrate gray failures, your customer rarely makes it

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Baldur Norddahl
We had a line card that would drop any IPv6 packet with bit #65 in the destination address set to 1. Turns out that only a few hosts have this bit set to 1 in the address, so nobody noticed until some Debian mirrors started to become unreachable. Also web browsers are very good at switching to IPv4 i
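
To make the failure condition above concrete, here is a minimal sketch of how one might test whether a given destination address would have hit such a bug. It assumes bits are numbered 1 to 128 from the most significant bit, so that bit #65 is the first bit of the lower 64 bits; the message does not state the numbering, and the function name and sample addresses are hypothetical.

```python
import ipaddress


def ipv6_bit_is_set(address: str, bit_number: int) -> bool:
    """Return True if the given bit of an IPv6 address is 1.

    Assumption (not stated in the thread): bits are numbered 1..128,
    starting at the most significant bit of the address.
    """
    if not 1 <= bit_number <= 128:
        raise ValueError("bit_number must be in 1..128")
    value = int(ipaddress.IPv6Address(address))
    shift = 128 - bit_number  # distance of the wanted bit from the LSB
    return bool((value >> shift) & 1)


if __name__ == "__main__":
    # Documentation-prefix addresses (2001:db8::/32), chosen for illustration.
    for addr in ("2001:db8::1", "2001:db8:0:0:8000:0:0:1"):
        print(addr, ipv6_bit_is_set(addr, 65))
```

Under this numbering, addresses whose lower 64 bits are small hand-assigned values (such as `::1`) have bit #65 clear and would not have been affected.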

GAO Report: 25/3 Mbps is likely not fast enough

2021-07-08 Thread Sean Donelan
New report published by the Government Accountability Office. FCC Should Analyze Small Business Speed Needs https://www.gao.gov/assets/gao-21-494.pdf "FCC’s minimum speed benchmark of 25/3 Mbps is likely not fast enough to meet the needs of many small businesses, particularly with regard to

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Warren Kumari
On Thu, Jul 8, 2021 at 8:32 AM Saku Ytti wrote: > > On Thu, 8 Jul 2021 at 15:00, Vanbever Laurent wrote: > > > Detecting whole-link and node failures is relatively easy nowadays (e.g., > > using BFD). But what about detecting gray failures that only affect a > > *subset* of the traffic, e.g. a

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Saku Ytti
On Thu, 8 Jul 2021 at 19:25, Lukas Tribus wrote: > More generally speaking, single link overloads causing PL or even full > blackholing affecting single links (and therefore in a load-balanced > environment: specific tuples) is something that is very frustrating to > troubleshoot and it happen

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Lukas Tribus
Hello, there is a large eyeball ASN in Southern Europe, single homed to a Tier1 running under the same corporate umbrella, which for about a decade suffered from periodic blackholing of specific src/dst tuples. The issue occurred every 6 - 18 months, completely breaking specific production traffic

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Saku Ytti
On Thu, 8 Jul 2021 at 17:59, Vanbever Laurent wrote: > Thanks for sharing! I guess this process working means the counters are > "standard" / close enough across vendors to allow for comparisons? Not at all I'm afraid, and not intended for user consumption so generally not available via SNMP or

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Vanbever Laurent
> One method is collecting lookup exceptions. We scrape these: > > npu_triton_trapstats.py:command = "start shell sh command \"for > fpc in $(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}'); > do echo FPC$fpc; vty -c 'show cda trapstats' fpc$fpc; done\"" > ptx1k_trapstats.py:c

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Vanbever Laurent
Hi Jörg, Thanks for sharing your gray failure! With a few years of lifespan, it might well be the oldest gray failure ever monitored continuously :-) I'm pretty sure you guys exhausted all options already but... did you check for micro-bursts that may cause sudden buffer overflow? Or perhaps is

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Tom Beecher
> > If there is a network which does not > experience these, then it's likely due to lack of visibility rather > than issues not existing. > This. Full stop. I believe there are very few, if any, production networks in existence which have a 0% rate of drops or 'weird shit'. Monitoring for s

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Jörg Kost
We have a similar gray issue, where switches in a virtual chassis configuration with layer3-configuration seem to lose transit ICMP messages like echo or echo-reply randomly. Once we estimated it around 0.00012% (let alone variances, or errors in measuring). We noticed this when we replaced

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Saku Ytti
On Thu, 8 Jul 2021 at 16:13, Vanbever Laurent wrote: > Thanks for chiming in. That's also my feeling: a *lot* of gray failures > routinely happen, a small percentage of which end up being really damaging > (the ones hitting customer traffic, as you pointed out). For this small > percentage tho

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Mark Tinka
On 7/8/21 15:22, Vanbever Laurent wrote: Did you folks manage to understand what was causing the gray issue in the first place? Nope, still chasing it. We suspect a FIB issue on a transit device, but currently building a test to confirm. Mark.

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread colin johnston
UUCP over TCP does work to overcome packet size problems; it has limited usage, but it did work in the past. Col

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Vanbever Laurent
> On 8 Jul 2021, at 14:59, Mark Tinka wrote: > > On 7/8/21 14:29, Saku Ytti wrote: > >> Network experiences gray failures all the time, and I almost never >> care, unless a customer does. If there is a network which does not >> experience these, then it's likely due to lack of visibility rathe

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Vanbever Laurent
> On 8 Jul 2021, at 14:29, Saku Ytti wrote: > > On Thu, 8 Jul 2021 at 15:00, Vanbever Laurent wrote: > >> Detecting whole-link and node failures is relatively easy nowadays (e.g., >> using BFD). But what about detecting gray failures that only affect a >> *subset* of the traffic, e.g. a rout

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Mark Tinka
On 7/8/21 14:29, Saku Ytti wrote: Network experiences gray failures all the time, and I almost never care, unless a customer does. If there is a network which does not experience these, then it's likely due to lack of visibility rather than issues not existing. Fixing these can take months o

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Mark Tinka
On 7/8/21 14:29, Saku Ytti wrote: Network experiences gray failures all the time, and I almost never care, unless a customer does. If there is a network which does not experience these, then it's likely due to lack of visibility rather than issues not existing. Fixing these can take months o

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Saku Ytti
On Thu, 8 Jul 2021 at 15:00, Vanbever Laurent wrote: > Detecting whole-link and node failures is relatively easy nowadays (e.g., > using BFD). But what about detecting gray failures that only affect a > *subset* of the traffic, e.g. a router randomly dropping 0.1% of the packets? > Does your n

Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Vanbever Laurent
Dear NANOG, Detecting whole-link and node failures is relatively easy nowadays (e.g., using BFD). But what about detecting gray failures that only affect a *subset* of the traffic, e.g. a router randomly dropping 0.1% of the packets? Does your network often experience these gray failures? Are t
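
As a rough illustration of why a 0.1% random drop rate is hard to notice, the sketch below estimates how many probe packets are needed before at least one loss is likely to be observed. It assumes every packet is dropped independently with the same probability, which is a simplification: several failures described in this thread hit only specific flows or addresses, and for those this model does not apply. The function name is hypothetical.

```python
import math


def probes_needed(loss_rate: float, confidence: float) -> int:
    """Smallest n such that P(at least one loss in n probes) >= confidence.

    Assumes independent, uniformly random drops (an illustrative model only).
    Solves (1 - loss_rate) ** n <= 1 - confidence for n.
    """
    if not 0.0 < loss_rate < 1.0:
        raise ValueError("loss_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - loss_rate))


if __name__ == "__main__":
    # 0.1% loss, 95% chance of seeing at least one dropped probe.
    print(probes_needed(0.001, 0.95))
```

Under these assumptions, roughly three thousand probes are needed to see even a single drop with 95% probability, and considerably more to estimate the rate itself.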