LAG: Micro BFD (RFC 7130) provides per-member-link liveness detection. MLAG is much more complicated (there's a proposal in the IETF, but it is not progressing), so LACP is pretty much the only option.

ECMP: you could use good old single-hop BFD per pair. Practically, if you introduce enough flows with one of the hash keys changing monotonically, you'd eventually exercise every available path. On its own that does not help with end-to-end testing, so it is usually combined with some form of sFlow/NetFlow to provide "proof of transit" (POT). In-band telemetry (choose your poison) does give you the IDs of the devices a packet has traversed and, in some cases, POT as well.

Finally, there are public Microsoft presentations on how we use IP-in-IP encapsulation to force traffic over a particular path in wide-radix ECMP fabrics.
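To illustrate the "keep changing one hash key" point, here is a rough Python sketch. It is only a simulation: `pick_path` is a hypothetical stand-in for a router's ECMP hash (SHA-256 of the 5-tuple, modulo the path count), not any vendor's real algorithm, and the addresses and ports are made-up documentation values. It steps the UDP source port and counts how many flows it takes before every path has been picked at least once.

```python
import hashlib


def pick_path(src_ip, dst_ip, proto, src_port, dst_port, n_paths):
    """Toy ECMP hash (assumption): SHA-256 of the 5-tuple, modulo path count."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths


def flows_to_cover(n_paths, first_port=49152, max_flows=10000):
    """Step the source port until every path index has been seen.

    Returns (number of flows needed, set of path indexes seen).
    The count is None if max_flows was not enough.
    """
    seen = set()
    for i in range(max_flows):
        # Only the source port changes; the other hash keys stay fixed.
        path = pick_path("192.0.2.1", "198.51.100.1", 17,
                         first_port + i, 33434, n_paths)
        seen.add(path)
        if len(seen) == n_paths:
            return i + 1, seen
    return None, seen


if __name__ == "__main__":
    for n in (2, 16, 64):
        count, _ = flows_to_cover(n)
        print(f"{n} paths: all covered after {count} flows")
```

On a real network you would send actual probes instead of calling `pick_path`, and confirm which path each probe took from sFlow/NetFlow or in-band telemetry. The simulation only gives a feel for how many distinct flows are needed.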
Cheers,
Jeff

> On Nov 12, 2021, at 07:55, Adam Thompson <athomp...@merlin.mb.ca> wrote:
>
> Hello all.
> Over time, we've run into occurrences of both bugs and human error, both in
> our own gear and in our partner networks' gear, specifically affecting
> multi-path forwarding, at pretty much all layers: Multi-chassis LAG, ECMP,
> and BGP MP. (Yes, I am a corner-case magnet. Lucky me.)
>
> Some of these issues were fairly obvious when they happened, but some were
> really hard to pin down.
>
> We've found that typical network monitoring tools (Observium & Smokeping, not
> to mention plain old ping and traceroute) can't really detect a
> hashing-related or multi-path-related problem: either the packets get through
> or they don't.
>
> Can anyone recommend either tools or techniques to validate that multi-path
> forwarding either is, or isn't, working correctly in a production network?
> I'm looking to add something to our test suite for when we make changes to
> critical network gear. Almost all the scenarios I want to test only involve
> two paths, if that helps.
>
> The best I've come up with so far is to have two test systems (typically VMs)
> that use adjacent IP addresses and adjacent MAC addresses, and test both
> inbound and outbound to/from those, blindly trusting/hoping that hashing
> algorithms will probably exercise both paths.
>
> Some of the problems we've seen show that merely looking at interface
> counters is insufficient, so I'm trying to find an explicit proof, not
> implicit.
>
> Any suggestions? Surely other vendors and/or admins have screwed this up in
> subtle ways enough times that this knowledge exists? (My Google-fu is
> usually pretty good, but I'm striking out - maybe I'm using the wrong terms.)
>
> -Adam
>
> Adam Thompson
> Consultant, Infrastructure Services
>
> 100 - 135 Innovation Drive
> Winnipeg, MB, R3T 6A8
> (204) 977-6824 or 1-800-430-6404 (MB only)
> athomp...@merlin.mb.ca
> www.merlin.mb.ca