> On Aug 1, 2023, at 2:18 PM, Mike Hammett <na...@ics-il.net> wrote:
>
> I have a wave transport vendor that suffered issues twice about ten days
> apart, causing my link to flap a bunch. I put in a ticket on the second set
> of occurrences. I was told that there was a card issue identified and would
> be notified when the replacement happened. Ticket closed.
>
> Three weeks later, I opened a new ticket asking for the status. The new card
> arrived the next day, but since no more flaps were happening, the card would
> not be replaced. Ticket closed.
>
>
> A) It doesn't seem like they actually did anything to fix the circuit.
> B) They admitted a problem and sent a new card.
> C) They later decided to not do anything.
>
>
> Is that normal?
> Is that acceptable?
>
>
> To avoid issues flapping causes, I disabled that circuit until repaired, but
> it seems like they're not going to do anything and I only know that because I
> asked.
It could have been something with passive components, amplifiers and such, or
they might have had someone do work that they don’t want to fess up to (which
is kinda silly). I get that.
I have our Junipers configured with a 5 second up timer, e.g. "hold-time up 5000".
This way a flapping circuit must be stable for at least a few seconds before it
is placed back into service. Otherwise, a prefix that comes from
connected/direct/static/qualified-next-hop can be redistributed into another
protocol on every flap and possibly cause a globally visible BGP event.
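As a rough sketch of what that looks like in a Junos config (the interface
name xe-0/0/0 is only a placeholder, and the "down 0" value is my assumption
for a setup that should still react to a down event immediately; adjust both
to your environment):

```
interfaces {
    /* placeholder interface name, not from any real config */
    xe-0/0/0 {
        /* values are in milliseconds: wait 5 s of stable signal before
           declaring the link up; "down 0" (an assumption) adds no delay
           on the way down */
        hold-time up 5000 down 0;
    }
}
```

The asymmetry is deliberate in this sketch: a failure still takes the link out
right away, while a recovering link has to prove itself before routing sees it.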
Some providers have a much more disruptive layer-1 infrastructure and will ask
you to configure a 1s+ up timer. I think there’s an interesting question that
could go either way: do you want transport-side faults to be exposed to you, or
should the client interface in a system be held up so that the fault condition
is not forwarded to it (sometimes called FDI)?
They may have had the system misconfigured so you saw a fault on a protected
path when there was a switch. It can take some time for the transponder to
re-tune if the timing is different between the paths, e.g. if your A path is
25km, your B side is 5km, and you have an optical switch. With the higher PHY
rates it will take some extra time.
I know that Cisco also has these interface timers, but some of the others may
not (e.g. I don’t know if Mikrotik has them, but cue the wiki in a reply).
If it’s stable for 48 hours, I would place it back into service, but you should
escalate at the same time and determine if they were truly hands off. It may
be that a fiber was bent and is now fixed, and that actually was the root cause.
Hope this helps you and a few others.
- jared